Analysing the Impact of Removing Infrequent Terms on Topic Quality in Latent Dirichlet Allocation Models

Authors

V. Bystrov, V. Naboka-Krell, A. Staszewska-Bystrova, P. Winker

DOI:

https://doi.org/10.24425/cejeme.2025.155564

Keywords:

topic models, text analysis, latent Dirichlet allocation, Monte Carlo simulation, text generation, text preprocessing

Abstract

Text preprocessing is an initial procedure in text-as-data applications. One typical step, which can substantially facilitate computations, consists in removing infrequent terms believed to provide limited information about the corpus. Despite the popularity of such vocabulary pruning, the literature offers few guidelines on how to implement it. The aim of the paper is to fill this gap by examining the effects of removing infrequent terms on the quality of topics estimated using latent Dirichlet allocation (LDA). The analysis is based on Monte Carlo experiments that take into account different criteria for infrequent term removal and various evaluation metrics. The results indicate that pruning is often beneficial and that a considerable share of the vocabulary can be eliminated.
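As a rough illustration of what vocabulary pruning involves, the following minimal sketch drops terms that appear in fewer than a chosen number of documents. The document-frequency criterion, the function name `prune_infrequent_terms`, the parameter `min_doc_freq` and the toy corpus are all assumptions made for this example; the paper compares several removal criteria, and this is not presented as the authors' procedure.

```python
from collections import Counter


def prune_infrequent_terms(docs, min_doc_freq=2):
    """Drop terms that occur in fewer than `min_doc_freq` documents.

    `docs` is a list of tokenised documents (lists of strings).
    Both the criterion and the threshold are illustrative assumptions.
    """
    # Count, for each term, the number of documents that contain it.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    # Keep only terms that meet the document-frequency threshold.
    kept = {term for term, n in doc_freq.items() if n >= min_doc_freq}
    # Filter every document down to the retained vocabulary.
    pruned = [[term for term in doc if term in kept] for doc in docs]
    return pruned, kept


# Hypothetical toy corpus, used only to show the call.
docs = [
    ["topic", "model", "corpus"],
    ["topic", "model", "pruning"],
    ["topic", "rare"],
]
pruned_docs, vocabulary = prune_infrequent_terms(docs, min_doc_freq=2)
```

The pruned documents could then be passed to any LDA implementation; raising `min_doc_freq` removes a larger share of the vocabulary.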

Published

2025-05-27

How to Cite

Bystrov, V., Naboka-Krell, V., Staszewska-Bystrova, A., & Winker, P. (2025). Analysing the Impact of Removing Infrequent Terms on Topic Quality in Latent Dirichlet Allocation Models. Central European Journal of Economic Modelling and Econometrics, 17(1), 61–85. https://doi.org/10.24425/cejeme.2025.155564
