SPELL: A Scalable NLP Method Using Regular Expressions and Large Language Models for Clinical Information Extraction
Abstract
Objective
Electronic health records (EHRs) contain valuable information for clinical research and decision-making. However, leveraging these data remains challenging due to data heterogeneity, inconsistent documentation, missing information, and evolving terminology, especially within unstructured clinical notes. We developed SPELL (Snippet-Primed rEgex LLM Pipeline), a scalable natural language processing (NLP) workflow to systematically extract structured clinical insights from large volumes of clinical narratives.
Materials and Methods
Our platform employs a hybrid approach combining regular expressions (regex) to rapidly identify relevant textual snippets with locally hosted large language models (LLMs) for accurate clinical interpretation. All data processing occurs securely within institutional computational environments. The modular Python-based workflow facilitates adaptation across institutions and is optimized for computational efficiency, supporting high-throughput processing even in resource-limited settings. We quantified computational scalability (elapsed time, out-of-memory events, GPU temperature, and energy consumed) and audited retrieval recall using clinician-annotated regex-negative notes enriched with relevant structured metadata.
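The regex-then-LLM flow described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the pattern `BLOOD_LOSS`, the 60-character window, and the functions `extract_snippets`, `run_pipeline`, and `fake_llm` are hypothetical names chosen for illustration, and `fake_llm` is a stand-in for a call to a locally hosted model.

```python
import re
from typing import Callable, Dict, Iterator, List, Optional, Tuple

# Hypothetical pattern for blood-loss mentions; not the regex used in the paper.
BLOOD_LOSS = re.compile(r"\b(?:EBL|QBL|blood loss)\b", re.IGNORECASE)


def extract_snippets(note: str, pattern: re.Pattern, window: int = 60) -> Iterator[str]:
    """Yield up to `window` characters of context on each side of every match."""
    for match in pattern.finditer(note):
        start = max(0, match.start() - window)
        end = min(len(note), match.end() + window)
        yield note[start:end]


def run_pipeline(
    notes: Dict[str, str],
    pattern: re.Pattern,
    ask_llm: Callable[[str], Optional[int]],
    window: int = 60,
) -> List[Tuple[str, Optional[int]]]:
    """Send only regex-matched snippets to the model; regex-negative notes are skipped."""
    results = []
    for note_id, text in notes.items():
        for snippet in extract_snippets(text, pattern, window):
            results.append((note_id, ask_llm(snippet)))
    return results


def fake_llm(snippet: str) -> Optional[int]:
    """Stand-in for a local LLM call: returns the first volume in mL, if any."""
    found = re.search(r"(\d+)\s*m[lL]\b", snippet)
    return int(found.group(1)) if found else None


notes = {
    "a": "Delivery uncomplicated. EBL 350 mL. Patient stable.",
    "b": "No relevant findings.",
}
print(run_pipeline(notes, BLOOD_LOSS, fake_llm))
```

In this sketch the model is never invoked for note "b", which is the property that lets snippet-level inference save time relative to full-document inference. It also shows why retrieval recall has to be audited separately: a relevant note that the regex misses never reaches the model.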
Results
The pipeline efficiently processed 31 million clinical reports spanning 1976–2024 from eight affiliated hospitals. By analyzing targeted snippets rather than entire documents, our approach reduced processing time by 68% compared to traditional full-document LLM inference, and by >95% compared to manual physician annotation. Accuracy was rigorously validated across three obstetric tasks: extraction of numerical values (blood loss volumes), dates (estimated due dates), and diagnoses (hemolysis, elevated liver enzymes, and low platelets [HELLP] syndrome). Task-level performance included 94–98% exact-match accuracy for the three concepts on curated snippets. Generalizability was investigated using the publicly available MT Samples corpus (5,013 notes, 40 specialties), yielding 84% accuracy for ventricular tachycardia detection with markedly fewer false positives.
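As a rough illustration of the exact-match metric reported above, the following sketch scores extracted values against clinician reference labels. The function name `exact_match_accuracy` and the toy values are assumptions made for illustration; the paper's actual scoring rules (for example, how values are normalized before comparison) are not specified in this abstract.

```python
from datetime import date
from typing import Hashable, Sequence


def exact_match_accuracy(predicted: Sequence[Hashable], reference: Sequence[Hashable]) -> float:
    """Fraction of items where the extracted value equals the reference value exactly."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not reference:
        raise ValueError("cannot score an empty reference set")
    hits = sum(p == r for p, r in zip(predicted, reference))
    return hits / len(reference)


# Toy data only, one list per task type: numeric values, dates, and diagnoses.
volumes_ml = exact_match_accuracy([350, 500, None, 800], [350, 500, 1000, 800])
due_dates = exact_match_accuracy(
    [date(2024, 3, 1), date(2024, 5, 9)],
    [date(2024, 3, 1), date(2024, 5, 10)],
)
hellp = exact_match_accuracy([True, False, False], [True, False, False])
print(volumes_ml, due_dates, hellp)
```

Exact match is strict by design: a due date that is off by one day, or a missing volume, counts as an error, so normalizing values into typed objects (integers, `date`, booleans) before comparison keeps formatting differences from being scored as mistakes.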
Discussion and Conclusions
A hybrid regex→snippet→LLM approach delivers accurate, privacy-preserving, and computationally efficient extraction for unstructured EHR data. By focusing inference on snippets and deploying local, open-weights models, SPELL aligns with institutional data governance requirements while enabling scalable clinical informatics studies across diverse extraction tasks.
Summary Statement
We developed SPELL, a scalable NLP pipeline combining regex and locally hosted LLMs for efficient information extraction from clinical narratives.