Don’t Look Up: Evaluating the Tradeoff between Performance and Sustainability of LLMs for Text Analysis.
Abstract
Large language models (LLMs) are widely used as research tools, but their high resource demands raise significant environmental concerns. While LLMs offer advantages in certain applications, their energy footprint prompts a necessary question for social scientists: is it worth using an LLM for every text analysis task? This study systematically evaluates the trade-off between performance and energy usage across computational text analysis methods, including dictionaries, trained classifiers, and open “local” LLMs. Applying sentiment analysis, multi-class classification, and named entity recognition to political documents, we measure energy consumption, CO2 emissions, correlation with human raters, F1-score, and processing time. We find that LLMs perform well on sentiment analysis, closely matching human judgment, but at relatively high environmental cost. For classification and named entity recognition, task-specific models achieve superior accuracy at low environmental impact. Contrary to multi-purpose LLM benchmarks, larger parameter counts do not guarantee better performance on text classification tasks. Introducing a CO2-adjusted F1-score, we observe that smaller, more efficient models, such as Mistral-Nemo (12B), outperform larger quantized models like DeepSeek-R1 (32B). Our findings highlight the need for thoughtful model selection rather than defaulting to LLMs: a “right-fit” approach that employs lighter, task-specific methods offers both performance and sustainability benefits.
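The abstract names a CO2-adjusted F1-score but does not state its formula. As a minimal sketch of how such a metric could fold emissions into an accuracy score, the snippet below discounts F1 by log-scaled CO2 output; the function name, the `log1p` penalty, the `alpha` weight, and all input numbers are illustrative assumptions, not the authors' definition or results.

```python
import math

def co2_adjusted_f1(f1: float, co2_grams: float, alpha: float = 1.0) -> float:
    """Hypothetical sketch: discount a model's F1-score by its CO2 footprint.

    f1         -- raw F1-score in [0, 1]
    co2_grams  -- grams of CO2 emitted while processing the evaluation set
    alpha      -- strength of the emissions penalty (assumed knob)
    """
    # log1p keeps the penalty gentle for small footprints and avoids
    # division issues when emissions are near zero.
    return f1 / (1.0 + alpha * math.log1p(co2_grams))

# Illustrative comparison only (made-up numbers, not the paper's data):
# a smaller efficient model vs. a larger, more emission-heavy one.
small = co2_adjusted_f1(f1=0.81, co2_grams=12.0)
large = co2_adjusted_f1(f1=0.83, co2_grams=95.0)
print(f"small: {small:.3f}, large: {large:.3f}")  # the small model wins after adjustment
```

Under any construction of this kind, a slightly more accurate but far more emission-intensive model can rank below a leaner one, which is consistent with the paper's observation that a 12B model can outperform a quantized 32B model once emissions are accounted for.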