Efficient and Verified Extraction of Research Data Using LLMs

Abstract

Large language models (LLMs) hold considerable promise for automated extraction of structured biological information from scientific literature, yet their reliability in domain-specific tasks such as DNA probe parsing remains underexplored. We developed a verification-focused, schema-guided extraction pipeline that transforms unstructured text from scientific articles into a normalized database of oligonucleotide probes, primers, and associated metadata. The system combines multi-turn JSON generation, strict schema validation, sequence-specific rule checks, and a post-processing recovery module that rescues systematically corrupted nucleotide outputs. Benchmarking across nine contemporary LLMs revealed distinct accuracy–hallucination trade-offs, with a context-optimized Qwen3 model achieving the highest overall extraction efficiency while maintaining low hallucination rates. Iterative prompting substantially improved fidelity but introduced notable latency and variance. Across all models, stable error profiles and the success of the recovery module indicate that most extraction failures stem from systematic and correctable formatting issues rather than semantic misunderstandings. These findings highlight both the potential and the current limitations of LLMs for structured scientific data extraction, and they provide a reproducible benchmark and extensible framework for future large-scale curation of molecular biology datasets.
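To make the validation and recovery stages concrete, the sketch below shows how a pipeline of this kind might check an LLM-emitted probe record: JSON schema validation, an IUPAC sequence rule check, and one recovery pass for systematically corrupted sequences. This is a minimal illustration under stated assumptions, not the authors' implementation; the PROBE_SCHEMA fields, the IUPAC rule, and recover_sequence() are hypothetical names chosen for the example.

```python
# Minimal sketch of schema validation + sequence rule check + recovery,
# as described in the abstract. Field names and rules are assumptions.
import re
from jsonschema import validate, ValidationError

PROBE_SCHEMA = {
    "type": "object",
    "required": ["name", "sequence"],
    "properties": {
        "name": {"type": "string"},
        "sequence": {"type": "string"},
        "target_gene": {"type": "string"},
    },
}

# IUPAC nucleotide codes: A, C, G, T/U plus ambiguity codes.
IUPAC = re.compile(r"[ACGTURYSWKMBDHVN]+")

def recover_sequence(seq: str) -> str:
    """Undo common systematic formatting corruption in model output:
    lowercasing, 5'/3' decorations, and stray spaces or hyphens."""
    seq = seq.upper()
    seq = re.sub(r"^5'?-?|-?3'?$", "", seq)  # strip 5'-...-3' notation
    return re.sub(r"[\s\-]", "", seq)        # drop spaces and hyphens

def check_record(record: dict) -> bool:
    """Schema validation, then a sequence-specific rule check,
    with one recovery attempt before rejecting the record."""
    try:
        validate(instance=record, schema=PROBE_SCHEMA)
    except ValidationError:
        return False
    seq = record["sequence"]
    if not IUPAC.fullmatch(seq):
        seq = recover_sequence(seq)
        if not IUPAC.fullmatch(seq):
            return False              # unrecoverable: flag for review
        record["sequence"] = seq      # keep the rescued sequence
    return True

# Example: a corrupted but recoverable sequence is normalized in place.
rec = {"name": "probe_1", "sequence": "5'-acg tgc a-3'"}
assert check_record(rec) and rec["sequence"] == "ACGTGCA"
```

Separating the hard reject (schema or semantic failure) from the recoverable formatting failure mirrors the abstract's finding that most extraction errors are systematic formatting issues rather than semantic misunderstandings.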
