Mapping the Unseen in Practice: Comparing Latent Dirichlet Allocation and BERTopic for Navigating Topic Spaces

Abstract

This article compares two widely used topic modeling techniques: latent Dirichlet allocation (LDA) and BERTopic. The former is a Bayesian probabilistic model, while the latter is rooted in deep learning. It remains unclear what these differences imply in practice and how they contribute to our sociological understanding of the inner workings of science. We compare the results obtained by applying LDA and BERTopic to the same dataset, composed of all scientific articles (n=34,797) authored by all biology professors in Switzerland between 2008 and 2020. We provide a step-by-step demonstration, from data pre-processing to the results. In doing so, we emphasize that understanding how each technique works is essential for interpreting its outcomes effectively and for balancing the strengths and weaknesses of the two techniques. Although they differ in their operationalization, LDA and BERTopic produce topic spaces with a similar global configuration. However, major differences appear when focusing on specific multidimensional concepts, such as "gene". Drawing on this empirical demonstration, we stress that topic modeling offers highly valuable ground for understanding the semantic structure of scientific fields when combined with in-depth knowledge of the object under scrutiny.
