A Scalable Method for Validated Data Extraction from Electronic Health Records with Large Language Models
Abstract
Purpose
Extracting and structuring relevant clinical information from electronic health records (EHRs) remains a challenge due to the heterogeneity of systems, documents, and documentation practices. Large Language Models (LLMs) provide an approach to processing semi-structured and unstructured EHR data, enabling classification, extraction, and standardization.
Methods
Medical documents are processed through a structured data pipeline to generate normalized FHIR data. Unstructured data undergo preprocessing, including optical character recognition, document parsing, text chunking, and embedding. Embedding enables search and classification, which facilitate document retrieval for extraction. LLMs perform named entity recognition and relation extraction, with outputs mapped to FHIR R4 and OMOP and harmonized with pre-structured data for interoperability. Model performance is evaluated through human validation and automated consistency checks. Iterative refinement, error analysis, and standardized schema selection optimize the extracted data for use in analytics and downstream workflows.
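To make the mapping step concrete, the following is a minimal sketch of how one extracted medication record might be turned into a FHIR R4 MedicationStatement-shaped resource. The `ExtractedMedication` class, its field names, and the `to_medication_statement` function are hypothetical and chosen for illustration; the abstract does not specify the actual extraction schema or mapping code. Coding to a terminology such as RxNorm is omitted, so only free-text `text` elements are populated.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ExtractedMedication:
    """Hypothetical shape of one LLM-extracted medication record."""
    patient_id: str
    drug_name: str
    start_date: Optional[str] = None  # ISO 8601 date, if stated in the note
    end_date: Optional[str] = None
    indication: Optional[str] = None
    discontinuation_reason: Optional[str] = None


def to_medication_statement(med: ExtractedMedication) -> dict:
    """Build a FHIR R4 MedicationStatement-shaped dict (illustrative only)."""
    resource = {
        "resourceType": "MedicationStatement",
        # Assumption: a stated discontinuation reason implies "stopped";
        # otherwise the status is left as "unknown".
        "status": "stopped" if med.discontinuation_reason else "unknown",
        "medicationCodeableConcept": {"text": med.drug_name},
        "subject": {"reference": f"Patient/{med.patient_id}"},
    }
    period = {}
    if med.start_date:
        period["start"] = med.start_date
    if med.end_date:
        period["end"] = med.end_date
    if period:
        resource["effectivePeriod"] = period
    if med.indication:
        resource["reasonCode"] = [{"text": med.indication}]
    if med.discontinuation_reason:
        resource["statusReason"] = [{"text": med.discontinuation_reason}]
    return resource


# Example record with made-up values.
example = ExtractedMedication(
    patient_id="example-1",
    drug_name="carboplatin",
    start_date="2021-03-01",
    indication="ovarian cancer",
    discontinuation_reason="neutropenia",
)
statement = to_medication_statement(example)
```

Optional fields are only emitted when the source document states them, which keeps absent information distinguishable from extracted information when completeness is later compared against pre-structured data.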
Results
An LLM-based schema for medication extraction, validated on 34 patients, achieved ∼95% accuracy and F1-score across 7 data fields for 5,789 extracted medications. Deployment across 11,115 patients extracted 2.6 million medication records, increasing total medications by 27% and distinct drug ingredients by 31% over structured data in the EHR. LLM extraction increased oncology medications by 60%, distinct oncology therapies by 64%, and the number of patients with structured oncology medication data by 33%. The LLM enhanced data completeness, improving availability of indication for prescription (61% vs. 31%) and discontinuation reason (17% vs. 0%), outperforming pre-structured data in key clinical variables.
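For readers less familiar with the reported metrics, the sketch below shows the standard definitions of precision, recall, and F1-score computed from true-positive, false-positive, and false-negative counts for a single data field. The function name and the counts are hypothetical; the abstract does not state how matches were counted or how scores were aggregated across the 7 fields.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Standard precision, recall, and F1 from raw counts for one field."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts for one field, not taken from the study.
p, r, f1 = precision_recall_f1(tp=95, fp=5, fn=5)
```

F1 is the harmonic mean of precision and recall, so it drops sharply if either the extraction misses many true values or produces many spurious ones.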
Conclusion
An LLM-powered extraction process that employs embeddings, machine learning classification, schema-based extraction, and mapping of extracted information to healthcare data standards achieves a significant gain in clinically relevant information over the pre-structured data available in the EHR.