LLM-based data extraction for a large cancer registry, the Ontario Hereditary Cancer Research Network

Andres Felipe Melani De La Hoz
Jochen Weile
Pratham Hemlani
Elif Tuzlali
Sarah Ridd
Brandon Chan
Lauren K. Hughes
Kathy Chun
Harriet Feilotter
Daria Grafodatskaya
Jordan Lerner-Ellis
Laila Schenkel
Amanda Smith
Andrea Vaags
Hong Wang
Raymond H. Kim
Lincoln Stein
Benjamin Haibe-Kains
Melanie Courtot

Abstract

Importance

Manually extracting data from genomic lab reports for online registries and databases is time-consuming for staff such as clinical research coordinators. Automated tools, especially large language models (LLMs), can reduce this burden. Efficient and accurate data processing is crucial for building a reliable database.

Objective

To streamline the data extraction and curation process for genetic testing lab reports using an LLM-based approach.

Design

Nine sample molecular lab reports were selected for manual data extraction by two expert curators. The process was timed, and the results served as the gold standard for validating automated extraction. Eighteen fields from OHCRN’s data model were selected as extraction targets.

Setting

The study was conducted within OHCRN, which unifies research, genomic, and clinical patient data from clinics and laboratories across Ontario, Canada.

Participants

Nine laboratories agreed to share sample molecular lab reports, and two clinical research coordinators affiliated with OHCRN participated as data curators.

Exposure

LLM-based Extraction of Information (LEI), an automated data extraction pipeline, was developed using regular expressions, Trie search, and LLMs to extract data from molecular lab reports and structure it for inclusion into OHCRN’s database.
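The rule-based part of such a pipeline can be illustrated with a minimal sketch. It is not LEI's implementation: the gene vocabulary, the `CDNA_PATTERN` expression, and the function names are all hypothetical, and the LLM step is left out entirely. It only shows how a Trie lookup over a controlled vocabulary and a regular expression could each pull candidate values out of report text.

```python
import re

# Hypothetical pattern for a cDNA-level variant such as "c.5946del".
# Real lab reports use many more notations than this covers.
CDNA_PATTERN = re.compile(r"c\.\d+(?:_\d+)?(?:[ACGT]>[ACGT]|del|dup|ins[ACGT]+)")

_END = object()  # sentinel key marking the end of a vocabulary term


def build_trie(terms):
    """Build a character-level trie (nested dicts) from vocabulary terms."""
    root = {}
    for term in terms:
        node = root
        for ch in term:
            node = node.setdefault(ch, {})
        node[_END] = term
    return root


def trie_search(text, trie):
    """Return (start, term) for the longest whole-word term at each position."""
    hits = []
    i, n = 0, len(text)
    while i < n:
        # Only start matching at a word boundary.
        if i > 0 and text[i - 1].isalnum():
            i += 1
            continue
        node, j, longest = trie, i, None
        while j < n and text[j] in node:
            node = node[text[j]]
            j += 1
            # Accept a term only if it also ends at a word boundary.
            if _END in node and (j == n or not text[j].isalnum()):
                longest = (i, node[_END], j)
        if longest:
            hits.append((longest[0], longest[1]))
            i = longest[2]
        else:
            i += 1
    return hits


def extract_candidates(text, gene_trie):
    """Collect candidate field values; an LLM step would follow in a full pipeline."""
    return {
        "genes": [term for _, term in trie_search(text, gene_trie)],
        "cdna_variants": CDNA_PATTERN.findall(text),
    }


# Illustrative vocabulary and report text (both made up for this sketch).
gene_trie = build_trie(["BRCA1", "BRCA2", "MLH1"])
report = "Pathogenic variant detected in BRCA2: c.5946del. No variant in BRCA1."
candidates = extract_candidates(report, gene_trie)
```

In a design like this, the deterministic steps handle fields with a closed vocabulary or a regular syntax, leaving free-text fields to the language model.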

Main Outcomes and Measures

LEI was evaluated by measuring the F1-score on the extraction task for 18 entity types. These scores were compared against those of 15 extraction tools in the biomedical domain. Extraction time was also measured and compared against manual extraction times.
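As a reminder of how the metric works, the F1-score is the harmonic mean of precision and recall. The sketch below scores each field against a gold standard and averages across fields. The exact-match criterion, the unweighted (macro) average, and the example values are assumptions made for illustration; the abstract does not state how matches were determined or how the average was computed.

```python
def f1_score(gold, predicted):
    """F1 for one field, treating values as sets and requiring exact matches."""
    gold_set, pred_set = set(gold), set(predicted)
    true_positives = len(gold_set & pred_set)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred_set)
    recall = true_positives / len(gold_set)
    return 2 * precision * recall / (precision + recall)


def macro_f1(gold_by_field, pred_by_field):
    """Unweighted mean of per-field F1 over all gold-standard fields."""
    scores = [
        f1_score(values, pred_by_field.get(field, []))
        for field, values in gold_by_field.items()
    ]
    return sum(scores) / len(scores)


# Made-up example: two fields, one of them only partly correct.
gold = {"gene": ["BRCA2"], "variant": ["c.5946del", "c.68_69del"]}
pred = {"gene": ["BRCA2"], "variant": ["c.5946del", "c.1A>G"]}
average = macro_f1(gold, pred)
```

Here the gene field is fully correct and the variant field has precision and recall of 0.5 each, so the average falls between the two per-field scores.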

Results

LEI’s extraction quality matched or surpassed that of existing LLM-based extraction methods. Reference tools showed F1-scores around 70%, while LEI achieved an average score of 87.4%. LEI reduced extraction time approximately 2-fold, averaging 7.59 minutes per report including results review by curators, compared with 14.88 minutes per report for manual extraction.

Conclusions and Relevance

LEI facilitates standardized, accurate, and efficient healthcare data extraction from unstructured texts, significantly improving the current OHCRN workflow. By automating the extraction process, LEI allows expert curators to focus on validating results rather than performing manual data entry. LEI’s simple interface enables researchers to easily guide extraction tasks and supports adaptability across diverse biomedical scenarios. Future improvements in accuracy may be achieved through fine-tuning techniques and ongoing advancements in LLM technologies.

Version published to 10.1101/2025.08.20.25334127 on medRxiv
Aug 26, 2025