The Use of Large Language Models for Qualitative Research: DECOTA

Lois Player
Ryan Hughes
Kaloyan Mitev
Lorraine Whitmarsh
Christina Demski
Nicholas Nash
Trisevgeni Papakonstantinou
Mark Wilson

Abstract

Machine-assisted approaches for free-text analysis are rising in popularity, owing to a growing need to rapidly analyse large volumes of qualitative data. In both research and policy settings, these approaches have promise in providing timely insights into public perceptions and enabling policymakers to understand their community’s needs. However, current approaches still require expert human interpretation – posing a financial and practical barrier for those outside of academia. For the first time, we propose and validate the Deep Computational Text Analyser (DECOTA) - a novel Machine Learning methodology that automatically analyses large free-text datasets and outputs concise themes. Building on Structural Topic Modelling (STM) approaches, we used two fine-tuned Large Language Models (LLMs) and sentence transformers to automatically derive ‘codes’ and their corresponding ‘themes’, as in Inductive Thematic Analysis. To automate the process, we designed and validated a novel algorithm to choose the optimal number of ‘topics’ following STM. This approach automatically derives key codes and themes from free-text data, the prevalence of each code, and how prevalence varies with covariates such as age and gender. Each code is accompanied by three representative quotes. Four datasets previously analysed using Thematic Analysis were triangulated with DECOTA’s codes and themes. We found that DECOTA is approximately 378 times faster and 1920 times cheaper than human coding, and consistently yields codes in agreement with or complementary to human coding (averaging 91.6% for codes, and 90% for themes). The implications for evidence-based policy development, public engagement with policymaking, and the development of psychometric measures are discussed.
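The abstract states that each code is reported with its prevalence and three representative quotes. The sketch below illustrates that reporting step only, and it is not DECOTA's implementation. It assumes the codes already exist, and it swaps in a toy bag-of-words vector with cosine similarity where the abstract describes sentence transformers. The function names (`embed`, `cosine`, `summarise`), the example responses, and the example codes are all hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy bag-of-words vector; an assumed stand-in for a sentence-transformer embedding."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def summarise(responses, codes, n_quotes=3):
    """Assign each response to its most similar code, then report
    the share of responses per code and the closest-matching quotes."""
    code_vecs = {c: embed(c) for c in codes}
    assigned = {c: [] for c in codes}
    for response in responses:
        vec = embed(response)
        scores = {c: cosine(vec, cv) for c, cv in code_vecs.items()}
        best = max(scores, key=scores.get)
        if scores[best] > 0:  # responses sharing no words with any code stay unassigned
            assigned[best].append((scores[best], response))
    total = len(responses)
    summary = {}
    for code, hits in assigned.items():
        hits.sort(key=lambda h: h[0], reverse=True)
        summary[code] = {
            "prevalence": len(hits) / total if total else 0.0,
            "quotes": [text for _, text in hits[:n_quotes]],
        }
    return summary


# Hypothetical free-text responses and codes, for illustration only.
responses = [
    "Public transport is too expensive",
    "The cost of bus tickets keeps rising",
    "We need more cycle lanes",
    "Cycle lanes feel unsafe",
    "No comment",
]
codes = ["cost of public transport", "cycle lanes and safety"]

summary = summarise(responses, codes)
for code, info in summary.items():
    print(code, round(info["prevalence"], 2), info["quotes"])
```

With real data, the word-count vectors would be replaced by sentence embeddings, and prevalence could then be broken down by covariates such as age and gender, as the abstract describes.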

Version published to 10.31234/osf.io/t5gbv on OSF Preprints
Jul 24, 2024

Related articles

RCualiText: An Open-Source R/ShinyWeb Application for Qualitative Analysis

This article has 2 authors:
1. Jose Ventura-Leon
2. Miguel Barboza Palomino
This article has no evaluations
Latest version Feb 8, 2026

A Methodological Guide on Using Large Language Models for Text Annotation in the Social Sciences and Humanities with Python and R

This article has 3 authors:
1. Qixiang Fang
2. Javier Garcia-Bernardo
3. Erik-Jan van Kesteren
This article has no evaluations
Latest version Mar 23, 2026

Natural Language Processing in the Era of Large Language Models: Foundations, Integration, and Low-Resource Frontiers

This article has 1 author:
1. Monisha Gottam
This article has no evaluations
Latest version Mar 6, 2026
