LLMTool: A Hybrid Pipeline for Automated High-Throughput Text Annotation Using Local Language Models and BERT Classifiers
Abstract
Large language models now routinely annotate text in computational social science, but they do not hold up at corpus sizes of several million sentences. Proprietary LLMs are financially prohibitive, local open-weight LLMs take days or weeks of computation, and both remain opaque and hard to reproduce. Using LLMs to train dedicated classifiers solves these problems, yet the approach itself remains sparsely tested in the social sciences and largely inaccessible to researchers without engineering support. We present LLMTool, an open-source Python package that runs the full hybrid workflow from a command line. On a bilingual corpus of 38,451 Canadian parliamentary debates and news media texts coded across four dimensions, classifiers trained on the best LLM labels reach a mean Micro F1 of 68.9%. Open-weight models such as GPT-OSS match one of the best proprietary models available, GPT-5, and deliver a 109–395× inference speedup over direct LLM annotation on standard workstations.