Evaluating Embedding Models for Clustering Italian Political News: A Comparative Study of Text-Embedding-3-Large and UmBERTo
Abstract
In an era where social media platforms have become the battleground for shaping political narratives, understanding the nuances of disseminated political news content is crucial. Reliable unsupervised clustering of datasets containing excerpts from news stories circulated on social media is a central piece of the puzzle. Despite advances in Natural Language Processing (NLP) techniques, studies led by social scientists that apply fully unsupervised techniques to Italian-language content remain rare. While large language models promise to be game-changers, a proper comparison with previously available unsupervised NLP techniques is lacking. This study helps fill this gap by comparing the performance of OpenAI's text-embedding-3-large model against the BERT-based UmBERTo model. The comparison uses two distinct datasets of political news stories circulated on Facebook before the 2018 and 2022 Italian elections. Using K-means and HDBSCAN, we find that text-embedding-3-large consistently outperforms UmBERTo in producing semantically coherent clusters.
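As a rough illustration of the kind of comparison the abstract describes, the sketch below clusters two sets of vectors with K-means and scores each partition. Everything in it is an assumption made for illustration, not the study's pipeline: the synthetic vectors stand in for real embeddings from the two models, the silhouette coefficient stands in for whatever coherence measure the study uses (the abstract does not name one), the farthest-point initialization is chosen only to keep the run deterministic, and HDBSCAN is left out for brevity.

```python
import math
import random


def farthest_point_init(points, k):
    """Pick k starting centroids: the first point, then repeatedly the point
    farthest from its nearest already-chosen centroid (deterministic)."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    return centroids


def kmeans(points, k, max_iter=100):
    """Plain Lloyd's K-means; returns one cluster label per point."""
    centroids = farthest_point_init(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: nearest centroid by Euclidean distance.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:  # no point changed cluster: converged
            break
        labels = new_labels
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def silhouette(points, labels):
    """Mean silhouette coefficient; higher means tighter, better-separated clusters."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        return 0.0  # undefined for a single cluster; treated as 0 here
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster.
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


def synthetic_embeddings(noise, n_per_topic=30, dim=8, seed=0):
    """Hypothetical stand-in for model embeddings: three 'topics', each a
    Gaussian blob around its own center. Not real model output."""
    rng = random.Random(seed)
    points = []
    for topic in range(3):
        center = [5.0 if d == topic else 0.0 for d in range(dim)]
        for _ in range(n_per_topic):
            points.append([x + rng.gauss(0.0, noise) for x in center])
    return points


# Two hypothetical "models": one separates topics cleanly, one does not.
tight = synthetic_embeddings(noise=0.3)
noisy = synthetic_embeddings(noise=2.0)

labels_tight = kmeans(tight, k=3)
labels_noisy = kmeans(noisy, k=3)

score_tight = silhouette(tight, labels_tight)
score_noisy = silhouette(noisy, labels_noisy)

print(f"silhouette, well-separated embeddings: {score_tight:.3f}")
print(f"silhouette, noisy embeddings:          {score_noisy:.3f}")
```

With real data, each row would be the vector a model returns for one news excerpt, and the same scoring would be applied to both models' vectors for the same excerpts. In practice one would likely reach for library implementations such as scikit-learn's `KMeans`, `HDBSCAN`, and `silhouette_score` instead of the hand-written versions above.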