From Job Titles to ISCO Codes: Enhancing Occupational Classification With RAG-based LLMs

Abstract

Accurate occupational classification from open-ended survey responses is vital for research in sociology, economics, and political science, yet manual coding remains resource-intensive and difficult to scale. We propose a novel pipeline that uses large language models (LLMs) with retrieval-augmented generation (RAG) to automate the assignment of International Standard Classification of Occupations (ISCO) codes. Drawing on survey data from a sample of recently arrived Afghan and Syrian refugees in Germany, we preprocess noisy occupational descriptions using LLMs and apply vector-based similarity search to retrieve candidate ISCO codes. The final classification is selected by LLMs, constrained to the retrieved candidates and accompanied by interpretable justifications. We evaluate the system’s performance against expert-coded labels, demonstrating high agreement and robustness across languages. Our findings suggest that RAG-powered LLMs can substantially improve the accuracy, scalability, and accessibility of occupational classification, with particular benefits for multilingual and resource-constrained research settings. In addition, we describe a prototypical pipeline that other researchers can readily adapt for applying LLMs to similar classification tasks, facilitating transparency, reproducibility, and broader adoption.
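
The retrieve-then-select structure described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. Everything here is an assumption made for illustration: the five-entry catalogue (its codes and labels should be checked against the official ISCO-08 listing), the function names, the bag-of-words vectors standing in for a real embedding model, and the `choose` callable standing in for the LLM call. The LLM-based preprocessing of noisy descriptions is reduced to lowercasing and tokenizing.

```python
import math
import re
from collections import Counter

# Hypothetical mini-catalogue for illustration only; a real index would
# cover the full classification and use official labels and definitions.
ISCO_CATALOGUE = {
    "2221": "Nursing professionals",
    "2341": "Primary school teachers",
    "7112": "Bricklayers and related workers",
    "7531": "Tailors, dressmakers, furriers and hatters",
    "8322": "Car, taxi and van drivers",
}


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_candidates(description, catalogue, k=3):
    """Return the k catalogue codes whose labels are most similar to the description."""
    query = embed(description)
    ranked = sorted(
        catalogue.items(),
        key=lambda item: cosine(query, embed(item[1])),
        reverse=True,
    )
    return [code for code, _label in ranked[:k]]


def select_code(description, candidates, choose):
    """Let `choose` (a placeholder for the LLM) pick a code, restricted to the candidates."""
    choice = choose(description, candidates)
    if choice not in candidates:
        raise ValueError(f"{choice!r} is not among the retrieved candidates")
    return choice


if __name__ == "__main__":
    text = "I worked as a taxi driver"
    candidates = retrieve_candidates(text, ISCO_CATALOGUE, k=3)
    # A trivial chooser that takes the top-ranked candidate; an LLM would go here.
    print(select_code(text, candidates, lambda _desc, cands: cands[0]))
```

The check inside `select_code` is what keeps the final answer within the retrieved set: a chooser that returns a code outside the candidates is rejected rather than accepted silently. Swapping `embed` for a multilingual embedding model and `choose` for an LLM prompt that also returns a justification would bring the sketch closer to the pipeline the abstract describes.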
