Testing Content Analysis through Different Large Language Models: Towards a Gold Standard Protocol
Abstract
This paper proposes a comprehensive assessment of content analysis performed by different Large Language Models (LLMs), with the aim of developing a gold standard protocol for using LLMs in content analysis research. The study relies on 1,500 cases sampled from a dataset of approximately 4,000 Facebook posts from political leaders in six countries (Italy, Spain, France, Greece, Germany, and the Netherlands), which were human-coded for 17 variables related to populist content and communication style. The same corpus of Facebook posts and the identical codebook used by human coders—containing variable descriptions and instructions—were applied to four distinct LLMs in six versions: ChatGPT (3.5 and 4o), Gemini 1.5 (Flash and Pro), Llama 3 (70B), and Mistral Large. The coding was compared among the LLMs, as well as between the LLMs and human coders, and the consistency between LLMs and human coders was quantified using inter-coder reliability measures. In cases of discrepancy between human and LLM coding, the research team implemented a supervised assessment procedure to identify the correct response, simultaneously verifying the reliability of both human and LLM content analysis. The significance of this research lies in its potential to evaluate the capabilities of various LLMs, particularly in non-English languages, and to establish a standardized protocol for utilizing LLMs in content analysis within political and communication studies. This study aims to advance our understanding of LLMs’ efficacy in replicating human coding processes and their application in multilingual contexts, ultimately contributing to methodological rigor in the field of computational content analysis.
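To illustrate the kind of human–LLM consistency check the abstract describes, the sketch below computes Cohen's kappa, a common inter-coder reliability measure, for two coders over the same items. The abstract does not specify which reliability measures the study used, so this is a minimal illustrative example; the variable names and toy data are hypothetical, not drawn from the study's corpus.

```python
# Illustrative sketch: quantifying agreement between a human coder and an
# LLM coder on the same items with Cohen's kappa. The specific measure and
# data here are assumptions for illustration, not the study's actual setup.
from collections import Counter

def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa: chance-corrected agreement between two coders."""
    assert len(coder_a) == len(coder_b) and coder_a
    n = len(coder_a)
    # Observed agreement: fraction of items both coders labeled identically.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Expected chance agreement from each coder's marginal label frequencies.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Toy example: human vs. LLM codes for a hypothetical binary populism variable.
human = [1, 0, 1, 1, 0, 0, 1, 0]
llm   = [1, 0, 1, 0, 0, 0, 1, 1]
print(cohens_kappa(human, llm))  # → 0.5
```

In practice, multi-coder studies with more than two raters or missing data typically use Krippendorff's alpha instead, but the two-coder kappa above conveys the core idea of chance-corrected agreement.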