A Methodological Guide on Using Large Language Models for Text Annotation in the Social Sciences and Humanities with Python and R

Abstract

Large language models (LLMs) have become an essential tool for social science and humanities (SSH) researchers who work with text. One particularly valuable application is automating text annotation, a traditionally time-consuming step in preparing data for empirical analysis. Yet many SSH researchers face two challenges: getting started with LLMs and understanding how to address their limitations. On the practical side, the rapid pace of model development can make LLMs seem inaccessible or intimidating, while even experienced users may overlook how annotation errors can bias downstream statistical analyses (e.g., regression estimates and $p$-values), even when annotation accuracy appears high. This paper provides a comprehensive, step-by-step methodological guide to using LLMs for text annotation in SSH research, with clear Python and R code snippets. We cover (1) how LLMs work and what they can and cannot do; (2) how to identify an LLM-suitable research project and establish minimum data and computational requirements; (3) how to design prompts and run annotation tasks; (4) how to evaluate annotation quality and iteratively refine prompts without overfitting; (5) how to integrate LLM annotations into downstream statistical analyses while accounting for annotation error; and (6) how to manage cost, efficiency, and reproducibility when scaling up annotation. Throughout, we provide intuitive methodological reasoning, concrete examples, code snippets, and best-practice guidance to help researchers confidently and transparently incorporate LLM-based annotation into their scientific workflows.
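
To make the annotation step concrete, below is a minimal Python sketch of the kind of workflow the guide covers: sending one text to an LLM and receiving a single categorical label back. It uses the OpenAI Python SDK; the model name, label set, and prompt wording are illustrative assumptions rather than the paper's recommended configuration.

```python
# Hypothetical sketch of LLM-based text annotation with the OpenAI Python SDK.
# The model name, label set, and prompt wording are placeholders for illustration.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def annotate(text: str) -> str:
    """Ask the model to assign one of three sentiment labels to a text."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        temperature=0,        # deterministic output aids reproducibility
        messages=[
            {
                "role": "system",
                "content": (
                    "You are an annotator. Reply with exactly one word: "
                    "positive, negative, or neutral."
                ),
            },
            {"role": "user", "content": text},
        ],
    )
    return response.choices[0].message.content.strip().lower()


print(annotate("The new policy was a welcome relief for small businesses."))
```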
