Analysing the Impact of Removing Infrequent Terms on Topic Quality in Latent Dirichlet Allocation Models
DOI:
https://doi.org/10.24425/cejeme.2025.155564

Keywords:
topic models, text analysis, latent Dirichlet allocation, Monte Carlo simulation, text generation, text preprocessing

Abstract
An initial procedure in text-as-data applications is text preprocessing. One of the typical steps, which can substantially reduce the computational burden, consists of removing infrequent terms believed to provide limited information about the corpus. Despite the popularity of vocabulary pruning, the literature offers few guidelines on how to implement it. The aim of the paper is to fill this gap by examining the effects of removing infrequent terms on the quality of topics estimated using latent Dirichlet allocation (LDA). The analysis is based on Monte Carlo experiments taking into account different criteria for infrequent-term removal and various evaluation metrics. The results indicate that pruning is often beneficial and that the share of vocabulary that can be eliminated may be quite considerable.
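To make the preprocessing step concrete, the following is a minimal sketch of one common pruning criterion, a document-frequency threshold: terms that appear in fewer than a chosen number of documents are dropped from the vocabulary. The function name and the `min_doc_freq` parameter are illustrative, not taken from the paper, which compares several removal criteria.

```python
from collections import Counter

def prune_vocabulary(corpus, min_doc_freq=2):
    """Drop terms appearing in fewer than min_doc_freq documents.

    This is one illustrative pruning criterion (document frequency);
    other criteria, e.g. total term frequency, work analogously.
    """
    # Count in how many documents each term occurs (set() removes
    # within-document repetitions).
    doc_freq = Counter()
    for doc in corpus:
        doc_freq.update(set(doc))
    keep = {term for term, df in doc_freq.items() if df >= min_doc_freq}
    # Filter each tokenised document down to the retained vocabulary.
    return [[term for term in doc if term in keep] for doc in corpus]

# Toy corpus of three tokenised documents.
corpus = [["topic", "model", "lda"],
          ["topic", "text", "data"],
          ["rare", "model", "text"]]
pruned = prune_vocabulary(corpus, min_doc_freq=2)
# "lda", "data" and "rare" each occur in only one document and are removed.
```

In practice, libraries such as gensim (`Dictionary.filter_extremes`) or scikit-learn (`CountVectorizer` with `min_df`) provide equivalent functionality before the document-term representation is passed to an LDA estimator.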
License
Copyright (c) 2025 Victor Bystrov, Viktoriia Naboka-Krell, Anna Staszewska-Bystrova, Peter Winker

This work is licensed under a Creative Commons Attribution 4.0 International License.