Analysing the Impact of Removing Infrequent Terms on Topic Quality in Latent Dirichlet Allocation Models
DOI:
https://doi.org/10.24425/cejeme.2025.155564

Keywords:
topic models, text analysis, latent Dirichlet allocation, Monte Carlo simulation, text generation, text preprocessing

Abstract
An initial procedure in text-as-data applications is text preprocessing. One of the typical steps, which can substantially reduce the computational burden, consists of removing infrequent terms believed to provide limited information about the corpus. Despite the popularity of vocabulary pruning, the literature offers few guidelines on how to implement it. The aim of the paper is to fill this gap by examining the effects of removing infrequent terms on the quality of topics estimated using latent Dirichlet allocation (LDA). The analysis is based on Monte Carlo experiments taking into account different criteria for infrequent-term removal and various evaluation metrics. The results indicate that pruning is often beneficial and that the share of vocabulary that can be eliminated may be quite considerable.
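To make the preprocessing step concrete, the following is a minimal sketch of one common pruning criterion, a document-frequency threshold: terms that appear in fewer than a chosen number of documents are dropped from the vocabulary. The function name and the `min_doc_freq` parameter are illustrative, not taken from the paper, which compares several removal criteria.

```python
from collections import Counter

def prune_vocabulary(corpus, min_doc_freq=2):
    """Drop terms appearing in fewer than min_doc_freq documents.

    This is one illustrative pruning criterion (document frequency);
    other criteria, e.g. total term frequency, work analogously.
    """
    # Count in how many documents each term occurs (set() removes
    # within-document repetitions).
    doc_freq = Counter()
    for doc in corpus:
        doc_freq.update(set(doc))
    keep = {term for term, df in doc_freq.items() if df >= min_doc_freq}
    # Filter each tokenised document down to the retained vocabulary.
    return [[term for term in doc if term in keep] for doc in corpus]

# Toy corpus of three tokenised documents.
corpus = [["topic", "model", "lda"],
          ["topic", "text", "data"],
          ["rare", "model", "text"]]
pruned = prune_vocabulary(corpus, min_doc_freq=2)
# "lda", "data" and "rare" each occur in only one document and are removed.
```

In practice, libraries such as gensim (`Dictionary.filter_extremes`) or scikit-learn (`CountVectorizer` with `min_df`) provide equivalent functionality before the document-term representation is passed to an LDA estimator.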
License
Copyright (c) 2025 Victor Bystrov, Viktoriia Naboka-Krell, Anna Staszewska-Bystrova, Peter Winker

This work is licensed under a Creative Commons Attribution 4.0 International License.