Testing Content Analysis through Different Large Language Models: Towards a Gold Standard Protocol
Abstract
This paper proposes a comprehensive assessment of content analysis performed by different Large Language Models (LLMs), with the aim of developing a gold standard protocol for using LLMs in content analysis research. The study relies on 1,500 cases sampled from a dataset of approximately 4,000 Facebook posts from political leaders in six countries (Italy, Spain, France, Greece, Germany, and the Netherlands), which were human-coded for 17 variables related to populist content and communication style. The same corpus of Facebook posts and the identical codebook used by human coders—containing variable descriptions and instructions—were applied to four distinct LLMs in six versions: ChatGPT (3.5 and 4o), Gemini 1.5 (Flash and Pro), Llama 3 (70B), and Mistral Large. The coding was compared among the LLMs, as well as between the LLMs and human coders, and the consistency between LLMs and human coders was quantified using inter-coder reliability measures. In cases of discrepancy between human and LLM coding, the research team implemented a supervised assessment procedure to identify the correct response, simultaneously verifying the reliability of both human and LLM content analysis. The significance of this research lies in its potential to evaluate the capabilities of various LLMs, particularly in non-English languages, and to establish a standardized protocol for utilizing LLMs in content analysis within political and communication studies. This study aims to advance our understanding of LLMs’ efficacy in replicating human coding processes and their application in multilingual contexts, ultimately contributing to methodological rigor in the field of computational content analysis.
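To illustrate the kind of human–LLM consistency check the abstract describes, the sketch below computes Cohen's kappa, a common inter-coder reliability measure, for two coders over the same items. The abstract does not specify which reliability measures the study used, so this is a minimal illustrative example; the variable names and toy data are hypothetical, not drawn from the study's corpus.

```python
# Illustrative sketch: quantifying agreement between a human coder and an
# LLM coder on the same items with Cohen's kappa. The specific measure and
# data here are assumptions for illustration, not the study's actual setup.
from collections import Counter

def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa: chance-corrected agreement between two coders."""
    assert len(coder_a) == len(coder_b) and coder_a
    n = len(coder_a)
    # Observed agreement: fraction of items both coders labeled identically.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Expected chance agreement from each coder's marginal label frequencies.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Toy example: human vs. LLM codes for a hypothetical binary populism variable.
human = [1, 0, 1, 1, 0, 0, 1, 0]
llm   = [1, 0, 1, 0, 0, 0, 1, 1]
print(cohens_kappa(human, llm))  # → 0.5
```

In practice, multi-coder studies with more than two raters or missing data typically use Krippendorff's alpha instead, but the two-coder kappa above conveys the core idea of chance-corrected agreement.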