LLM-based data extraction for a large cancer registry, the Ontario Hereditary Cancer Research Network

Abstract

Importance

Manual data extraction from genomic lab reports for online registries and databases is time-consuming for staff such as clinical research coordinators. Automated tools, especially large language models (LLMs), can reduce this burden. Efficient and accurate data processing is crucial for building a reliable database.

Objective

To streamline the data extraction and curation process for genetic testing lab reports using an LLM-based approach.

Design

Nine sample molecular lab reports were selected for manual data extraction by two expert curators. The process was timed, and the results served as the gold standard for validating automated extraction. Eighteen fields from the data model of the Ontario Hereditary Cancer Research Network (OHCRN) were selected as extraction targets.

Setting

The study was conducted within OHCRN, which unifies research, genomic, and clinical patient data from clinics and laboratories across Ontario, Canada.

Participants

Nine laboratories agreed to share sample molecular lab reports, and two clinical research coordinators affiliated with OHCRN participated as data curators.

Exposure

LLM-based Extraction of Information (LEI), an automated data extraction pipeline, was developed using regular expressions, Trie search, and LLMs to extract data from molecular lab reports and structure it for inclusion in OHCRN’s database.
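
To make the rule-based part of such a pipeline concrete, the following is a minimal sketch of how a regular expression and a Trie lookup could propose candidate values from report text. It is not LEI's implementation: the abstract does not say how the three components are combined, and the field names, the simplified HGVS-like pattern, the gene list, and the sample sentence are all hypothetical. The LLM step is left out entirely.

```python
import re

# Hypothetical, deliberately simplified pattern for coding-DNA variants
# written in an HGVS-like style (e.g. "c.68_69delAG"); an assumption for
# illustration, not a pattern taken from LEI.
HGVS_CODING = re.compile(
    r"\bc\.\d+(?:_\d+)?(?:[ACGT]>[ACGT]|del[ACGT]*|dup[ACGT]*|ins[ACGT]+)"
)


class Trie:
    """Character-level trie for case-insensitive lookup of known terms."""

    def __init__(self):
        self.root = {}

    def insert(self, term):
        node = self.root
        for ch in term.upper():
            node = node.setdefault(ch, {})
        node[None] = term  # None marks end-of-term and stores the canonical form

    def find_all(self, text):
        """Return the longest whole-word dictionary match at each word start."""
        upper = text.upper()  # assumes ASCII-like text, so offsets are unchanged
        hits, i = [], 0
        while i < len(upper):
            if i > 0 and upper[i - 1].isalnum():
                i += 1  # not at the start of a word
                continue
            node, j, best = self.root, i, None
            while j < len(upper) and upper[j] in node:
                node = node[upper[j]]
                j += 1
                at_word_end = j == len(upper) or not upper[j].isalnum()
                if None in node and at_word_end:
                    best = (j, node[None])
            if best:
                hits.append(best[1])
                i = best[0]
            else:
                i += 1
        return hits


def extract_candidates(report_text, gene_trie):
    """Collect rule-based candidates; an LLM step could refine or extend them."""
    return {
        "gene_symbols": gene_trie.find_all(report_text),
        "hgvs_coding": HGVS_CODING.findall(report_text),
    }


gene_trie = Trie()
for symbol in ["BRCA1", "BRCA2", "ATM"]:  # hypothetical dictionary
    gene_trie.insert(symbol)

sample = "BRCA1 c.68_69delAG (p.Glu23Valfs*17) detected. No variant in BRCA2."
candidates = extract_candidates(sample, gene_trie)
```

In a design like this, the deterministic matchers handle values with a fixed vocabulary or syntax, which leaves free-text fields to the LLM and gives curators a cheap cross-check on its output.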

Main Outcomes and Measures

LEI was evaluated by F1-score on the extraction of 18 entity types. These scores were compared against those of 15 extraction tools in the biomedical domain. Extraction time was also measured and compared against manual extraction time.
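
As a reminder of what the metric measures, the sketch below computes F1 for one entity type from sets of extracted and gold-standard values, then averages across types. The exact-match criterion, the unweighted (macro) average, and the example values are assumptions; the abstract does not state how matches were counted or how the average was taken.

```python
def f1_score(predicted, gold):
    """F1 for one entity type, treating values as sets and using exact match."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0  # also covers the case where either set is empty
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def macro_f1(per_type):
    """Unweighted mean of per-type F1; per_type maps name -> (predicted, gold)."""
    scores = [f1_score(pred, gold) for pred, gold in per_type.values()]
    return sum(scores) / len(scores)


# Hypothetical values for two entity types.
example = {
    "gene_symbol": (["BRCA1", "BRCA2"], ["BRCA1", "ATM"]),
    "hgvs_coding": (["c.68_69delAG"], ["c.68_69delAG"]),
}
average = macro_f1(example)
```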

Results

LEI matched or surpassed the extraction quality of existing LLM-based methods. Reference tools showed F1-scores around 70%, while LEI achieved an average F1-score of 87.4%. LEI reduced extraction time approximately 2-fold (14.88 / 7.59 ≈ 1.96), with an average of 7.59 minutes per report, including curator review of the results, compared with 14.88 minutes per report for manual extraction.

Conclusions and Relevance

LEI facilitates standardized, accurate, and efficient healthcare data extraction from unstructured texts, significantly improving the current OHCRN workflow. By automating the extraction process, LEI allows expert curators to focus on validating results rather than performing manual data entry. LEI’s simple interface enables researchers to easily guide extraction tasks and supports adaptability across diverse biomedical scenarios. Future improvements in accuracy may be achieved through fine-tuning techniques and ongoing advancements in LLM technologies.
