Comparing Large Language Models for Text Classification: Model Selection Across Tasks, Texts, and Languages



Abstract

Large-scale text analysis has grown rapidly as an analytic method in the social sciences and beyond, and recent advances in large language models (LLMs) have made automated text annotation increasingly viable. This paper focuses on the comparative viability of closed-source and open-source LLMs for text annotation, testing the performance of 28 different LLMs in text classification across a range of tasks, text types, and languages. Using data in seven languages across 10 country contexts, the results show considerable variation in model performance, highlighting that researchers should carefully consider model selection as part of their LLM-centered classification strategy. In general, the closed-source GPT-4 exhibits relatively strong performance across all classification tasks, while open-source alternatives such as Llama 3 and Qwen2.5 show similar or even superior performance on select tasks. Many smaller open-source models, however, provide relatively unsatisfactory performance on more complex and non-English coding tasks. The tradeoffs inherent in the use of each model are therefore highlighted to allow researchers to make informed decisions about model selection on a task-by-task basis.
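For readers unfamiliar with the workflow the abstract describes, the sketch below illustrates zero-shot text classification with a hosted LLM. It is a minimal illustration, not the paper's actual pipeline: the label set, prompt wording, and model name are assumptions, and the `openai` Python client stands in for any chat-completion-style model, closed or open source.

```python
# Minimal sketch of zero-shot LLM text annotation (not the paper's pipeline).
# Assumptions: the `openai` Python client, a hypothetical three-label coding
# scheme, and an illustrative prompt; any comparable model could be swapped in.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

LABELS = ["positive", "negative", "neutral"]  # hypothetical coding scheme

def classify(text: str, model: str = "gpt-4") -> str:
    """Ask the model to assign exactly one label to a document."""
    prompt = (
        "Classify the following text into exactly one category: "
        f"{', '.join(LABELS)}.\n\nText: {text}\n\nAnswer with the label only."
    )
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # deterministic output helps annotation reliability
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip().lower()

# Example: annotate a small multilingual corpus and inspect the labels.
docs = ["The new policy was widely praised.", "Die Reform stößt auf Kritik."]
print([classify(d) for d in docs])
```

A model comparison of the kind the paper reports would then score labels like these against human-coded gold-standard data, repeating the loop for each candidate model, task, and language.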
