Testing the Potential: Are LLMs Valid and Reliable Tools for Analysing Academic Documents?
Abstract
This paper presents a methodological approach to employing Large Language Models (LLMs) for academic literature classification, using complexity theory literature as its empirical domain. We evaluate two state-of-the-art open-source LLMs, Qwen-QwQ-32B and Llama-4-maverick, on their capacity to accurately categorise scholarly papers according to predefined theoretical frameworks. Our methodology employs two distinct enhancement techniques: an iterative processing approach designed to improve reliability by generating multiple independent classifications for each document, and a calibrated weighting technique aimed at enhancing validity by adjusting probability distributions toward reference distributions. To assess validity, we use proximity accuracy measurements that quantify the degree of alignment between LLM-generated classifications and human coding. The experiment comprised 25 iterations, with 3,100 records coded by LLMs in total. Our findings demonstrate that both LLMs achieve strong base performance (>70% validity), which can be further improved through calibration (reaching up to 90.32% validity), with diminishing returns after approximately 15 iterations. The impact of weighting varies significantly by category and model, suggesting that selective application of weighting approaches may be optimal. We conclude with recommendations for researchers employing LLMs in qualitative coding workflows, particularly for applications in Qualitative Comparative Analysis (QCA) and similar methodologies requiring rigorous category assignment. This research contributes to developing transparent, replicable procedures for LLM-assisted qualitative coding that maintain methodological integrity whilst leveraging computational efficiency.
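To make the two enhancement techniques concrete, the following is a minimal sketch of how iterative classification and calibrated weighting could be combined in a coding pipeline. It is not the authors' implementation: the `classify_with_llm` stub, the category labels, the reference distribution, the mixing parameter `alpha`, and the simple agreement-based validity score are all hypothetical stand-ins for details not specified in the abstract.

```python
"""Illustrative sketch of iterative LLM classification with calibrated weighting."""

from collections import Counter
import random

CATEGORIES = ["A", "B", "C"]  # placeholder theoretical-framework categories


def classify_with_llm(document: str) -> str:
    """Stub for a single LLM classification call (e.g. to Qwen-QwQ-32B or
    Llama-4-maverick); replaced here by a random draw for demonstration."""
    return random.choice(CATEGORIES)


def iterative_distribution(document: str, iterations: int = 25) -> dict[str, float]:
    """Reliability step: repeat the classification and turn the votes into an
    empirical probability distribution over categories."""
    votes = Counter(classify_with_llm(document) for _ in range(iterations))
    return {c: votes[c] / iterations for c in CATEGORIES}


def calibrate(dist: dict[str, float],
              reference: dict[str, float],
              alpha: float = 0.3) -> dict[str, float]:
    """Validity step: shift the LLM distribution toward a reference distribution
    (e.g. category proportions from human coding) by a weight alpha."""
    blended = {c: (1 - alpha) * dist[c] + alpha * reference[c] for c in CATEGORIES}
    total = sum(blended.values())
    return {c: v / total for c in CATEGORIES}


def proximity_accuracy(predicted: list[str], human: list[str]) -> float:
    """Simplified validity measure: share of documents where the calibrated
    LLM category matches the human code."""
    return sum(p == h for p, h in zip(predicted, human)) / len(human)


if __name__ == "__main__":
    reference = {"A": 0.5, "B": 0.3, "C": 0.2}  # assumed reference distribution
    docs = [f"paper_{i}" for i in range(10)]
    human_codes = [random.choice(CATEGORIES) for _ in docs]

    predictions = []
    for doc in docs:
        dist = iterative_distribution(doc, iterations=25)
        calibrated = calibrate(dist, reference)
        predictions.append(max(calibrated, key=calibrated.get))

    print(f"Proximity accuracy: {proximity_accuracy(predictions, human_codes):.2%}")
```

In this sketch, increasing `iterations` stabilises the empirical distribution (the reliability gain described in the abstract), while `alpha` controls how strongly the calibration pulls the LLM output toward the reference distribution; the abstract's finding that weighting helps some categories more than others would correspond here to applying `calibrate` selectively rather than uniformly.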