LLM-based data extraction for a large cancer registry, the Ontario Hereditary Cancer Research Network
Abstract
Importance
Manually extracting data from genomic lab reports for online registries and databases is time-consuming for staff such as clinical research coordinators. Automated tools, especially large language models (LLMs), can reduce this burden. Efficient and accurate data processing is crucial for building a reliable database.
Objective
To streamline the data extraction and curation process for genetic testing lab reports using an LLM-based approach.
Design
Nine sample molecular lab reports were selected for manual data extraction by two expert curators. The process was timed, and the results served as the gold standard for validating automated extraction. Eighteen fields from the OHCRN data model were selected as extraction targets.
Setting
The study was conducted within OHCRN, which unifies research, genomic, and clinical patient data from clinics and laboratories across Ontario, Canada.
Participants
Nine laboratories agreed to share sample molecular lab reports, and two clinical research coordinators affiliated with OHCRN participated as data curators.
Exposure
LLM-based Extraction of Information (LEI), an automated data extraction pipeline, was developed using regular expressions, trie search, and LLMs to extract data from molecular lab reports and structure it for inclusion in OHCRN’s database.
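To illustrate how the two rule-based components named above might work together, the following is a minimal sketch, not LEI's actual implementation. The variant pattern, the gene list, and the function names (`build_trie`, `find_terms`, `extract`) are all assumptions made for this example; the LLM step is omitted.

```python
import re

# Hypothetical pattern for a coding-DNA variant such as "c.68_69delAG".
# Illustrative only; it is not taken from LEI and covers few HGVS forms.
CDNA_VARIANT = re.compile(
    r"c\.\d+(?:_\d+)?(?:[ACGT]>[ACGT]|del[ACGT]*|dup[ACGT]*|ins[ACGT]+)"
)


def build_trie(terms):
    """Build a character trie; the key None marks the end of a stored term."""
    root = {}
    for term in terms:
        node = root
        for ch in term:
            node = node.setdefault(ch, {})
        node[None] = term
    return root


def find_terms(text, trie):
    """Return (start, term) for the longest whole-word dictionary term at each position."""
    hits = []
    i = 0
    while i < len(text):
        # Only start at a word boundary, so "ATM" is not found inside "TREATMENT".
        if i > 0 and text[i - 1].isalnum():
            i += 1
            continue
        node, match, j = trie, None, i
        while j < len(text) and text[j] in node:
            node = node[text[j]]
            j += 1
            at_boundary = j == len(text) or not text[j].isalnum()
            if None in node and at_boundary:
                match = node[None]  # keep the longest match seen so far
        if match:
            hits.append((i, match))
            i += len(match)
        else:
            i += 1
    return hits


def extract(text, gene_trie):
    """Combine the dictionary lookup and the regular expression into one record."""
    variant = CDNA_VARIANT.search(text)
    return {
        "genes": [term for _, term in find_terms(text, gene_trie)],
        "cdna_variant": variant.group(0) if variant else None,
    }


# Hypothetical gene dictionary and report sentence.
GENE_TRIE = build_trie(["BRCA1", "BRCA2", "ATM"])
REPORT = "Pathogenic variant detected in BRCA1: c.68_69delAG (heterozygous)."
RESULT = extract(REPORT, GENE_TRIE)
```

In a pipeline of this kind, fields with a fixed vocabulary or a regular surface form could be handled by rules like these, leaving free-text fields to the LLM; how LEI actually divides the work is not described in this abstract.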
Main Outcomes and Measures
LEI was evaluated by its F1-score on the extraction of 18 entity types. These scores were compared against those of 15 extraction tools in the biomedical domain. Extraction time was also measured and compared against manual extraction times.
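For readers unfamiliar with the metric, the sketch below shows one common way to compute an F1-score per entity type and average it across types. It assumes exact-match comparison against the gold-standard values and an unweighted (macro) average; the abstract does not state which matching rule or averaging scheme the study used, and the field names and values here are hypothetical.

```python
def f1_score(gold, predicted):
    """F1 for one entity type, comparing sets of exact-match values."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    if true_positives == 0:
        # Also covers the case where both sets are empty (scored 0.0 here by convention).
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def macro_f1(gold_by_field, predicted_by_field):
    """Unweighted mean of per-field F1 over every field in the gold standard."""
    scores = [
        f1_score(values, predicted_by_field.get(field, []))
        for field, values in gold_by_field.items()
    ]
    return sum(scores) / len(scores)


# Hypothetical gold-standard and extracted values for two fields.
GOLD = {"gene": ["BRCA1", "BRCA2"], "cdna_variant": ["c.68_69delAG"]}
PREDICTED = {"gene": ["BRCA1"], "cdna_variant": ["c.68_69delAG"]}
AVERAGE_F1 = macro_f1(GOLD, PREDICTED)
```

Here the `gene` field has precision 1.0 and recall 0.5, so its F1 is 2/3, while `cdna_variant` scores 1.0; the macro average is therefore 5/6.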
Results
LEI matched or surpassed the quality of existing LLM-based extraction methods. Reference tools showed F1-scores around 70%, while LEI achieved an average score of 87.4%. LEI reduced extraction time by approximately 2-fold (14.88 / 7.59 ≈ 1.96), with an average time of 7.59 minutes per report including results review by curators, compared to 14.88 minutes per report for manual extraction.
Conclusions and Relevance
LEI facilitates standardized, accurate, and efficient healthcare data extraction from unstructured texts, significantly improving the current OHCRN workflow. By automating the extraction process, LEI allows expert curators to focus on validating results rather than performing manual data entry. LEI’s simple interface enables researchers to easily guide extraction tasks and supports adaptability across diverse biomedical scenarios. Future improvements in accuracy may be achieved through fine-tuning techniques and ongoing advancements in LLM technologies.