LLMTool: A Hybrid Pipeline for Automated High-Throughput Text Annotation Using Local Language Models and BERT Classifiers
Abstract
Large language models now routinely annotate text in computational social science, but they do not hold up at corpus sizes of several million sentences. Proprietary LLMs are financially prohibitive, local open-weight LLMs take days or weeks of computation, and both remain opaque and hard to reproduce. Using LLMs to train dedicated classifiers solves these problems, yet the approach itself remains sparsely tested in the social sciences and largely inaccessible to researchers without engineering support. We present LLMTool, an open-source Python package that runs the full hybrid workflow from a command line. On a bilingual corpus of 38,451 Canadian parliamentary debates and news media texts coded across four dimensions, classifiers trained on the best LLM labels reach a mean Micro F1 of 68.9%. Open-weight models such as GPT-OSS match one of the best proprietary models available, GPT-5, and deliver a 109–395× inference speedup over direct LLM annotation on standard workstations.