Scalable Identification of Clinically Relevant COPD Documents: A Lightweight NLP Model for Large-Scale EHR Datasets

Mohammed Al-Garadi
Sharon E. Davis
Michael E. Matheny
Dax Westerman
Adrienne K. Conger
Bradley W. Richmond
Thomas A. Lasko
Iben M. Ricket
Laura M. Paulin
Jeremiah R. Brown
Ruth M. Reeves

Abstract

Background

The widespread adoption of electronic health records (EHRs) has generated large volumes of clinical notes. Machine learning algorithms and large language models (LLMs) are trained on these resources but are susceptible to noise, that is, irrelevant or non-informative content within them. This sensitivity can lead to significant problems, including performance degradation and inaccurate predictions or “hallucinations.” This study addresses a critical challenge in clinical informatics: efficiently filtering millions of documents for relevance before they are processed by advanced language models, particularly in resource-constrained environments. We present a novel framework for determining document relevance in clinical settings, using a chronic obstructive pulmonary disease (COPD) dataset.

Methods

We developed a novel framework that uses weak supervision and domain-expert heuristics to generate “silver standard” labels for training data, alongside expert-annotated “gold standard” labels, creating two datasets used to optimize the model during the development phase and to evaluate it in the subsequent testing phase. Various text representation techniques were evaluated, including Bag-of-Words, TF-IDF, lightweight document embeddings, compression-based features, and UMLS concept extraction. These representations were used to train Random Forest, XGBoost, and K-Nearest Neighbors classifiers. Models were optimized on a small expert-annotated dataset and evaluated on a held-out test set.
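The general pattern described above (heuristic silver labels, a text representation, then a classifier) can be illustrated with a minimal, standard-library-only sketch. Everything specific here is an assumption made for illustration: the keyword lists, the toy notes, and the helper names (`silver_label`, `fit_idf`, `tfidf_vector`, `knn_predict`) are hypothetical and are not the authors' heuristics or code. TF-IDF with a nearest-neighbor vote is used only because it is short to write; the abstract reports that lightweight document embeddings with a Random Forest performed best.

```python
import math
import re
from collections import Counter

# Hypothetical keyword heuristics; the study's expert rules are not given in the abstract.
RELEVANT_TERMS = ("copd", "emphysema", "spirometry", "bronchodilator")
IRRELEVANT_TERMS = ("appointment reminder", "billing", "fax cover")


def silver_label(text):
    """Weak-supervision vote: 1 = relevant, 0 = not relevant, None = abstain."""
    lowered = text.lower()
    pos = sum(term in lowered for term in RELEVANT_TERMS)
    neg = sum(term in lowered for term in IRRELEVANT_TERMS)
    if pos == neg:
        return None  # heuristics disagree or are silent, so no silver label
    return 1 if pos > neg else 0


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def fit_idf(docs):
    """Smoothed inverse document frequency for every token seen in docs."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(tokenize(doc)))
    return {tok: math.log((1 + n) / (1 + count)) + 1.0 for tok, count in df.items()}


def tfidf_vector(text, idf):
    """L2-normalized sparse TF-IDF vector; tokens unseen in training are dropped."""
    counts = Counter(tokenize(text))
    vec = {tok: c * idf[tok] for tok, c in counts.items() if tok in idf}
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {tok: v / norm for tok, v in vec.items()} if norm else {}


def cosine(a, b):
    # Both vectors are already unit length, so the dot product is the cosine.
    return sum(v * b.get(tok, 0.0) for tok, v in a.items())


def knn_predict(text, train_vecs, train_labels, idf, k=3):
    """Majority vote among the k most similar training documents."""
    query = tfidf_vector(text, idf)
    ranked = sorted(
        zip(train_vecs, train_labels),
        key=lambda pair: cosine(query, pair[0]),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy notes, invented for this sketch.
notes = [
    "Patient with COPD exacerbation, started bronchodilator therapy.",
    "Spirometry shows obstruction consistent with emphysema.",
    "Appointment reminder: please arrive 15 minutes early.",
    "Billing statement for recent visit.",
    "Follow-up visit, no complaints.",
]

# Keep only notes on which the heuristics produced a silver label.
train = [(n, y) for n in notes if (y := silver_label(n)) is not None]
idf = fit_idf([n for n, _ in train])
train_vecs = [tfidf_vector(n, idf) for n, _ in train]
train_labels = [y for _, y in train]

prediction = knn_predict(
    "Worsening emphysema, spirometry repeated.", train_vecs, train_labels, idf, k=1
)
```

In this toy run the last note receives no silver label and is excluded from training, which mirrors how weak supervision typically handles documents the heuristics cannot decide. At realistic scale one would more likely reach for a library such as scikit-learn than hand-rolled vectors.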

Results

The combination of lightweight document embedding with a Random Forest classifier demonstrated the best performance, achieving a precision of 0.75, recall of 0.89, and F1-score of 0.81 (95% CI: 0.76-0.87) for identifying relevant COPD documents. This significantly outperformed baseline heuristics (precision: 0.70, recall: 0.38, F1-score: 0.50, 95% CI: 0.43-0.56) and other tested methods.
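As a quick consistency check on the figures above, the F1-score is the harmonic mean of precision and recall. The short sketch below recomputes it from the rounded values reported in the abstract; the function name `f1_from_pr` is only an illustrative choice, and because the inputs are already rounded to two decimals the recomputed values can differ slightly from the published ones.

```python
def f1_from_pr(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded precision and recall as reported in the abstract.
best_f1 = f1_from_pr(0.75, 0.89)      # about 0.814, matching the reported 0.81
baseline_f1 = f1_from_pr(0.70, 0.38)  # about 0.493, versus the reported 0.50
```

The best model's value reproduces the reported 0.81. The baseline value comes out slightly below the reported 0.50, a gap small enough to be explained by rounding of the published precision and recall.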

Conclusion

Our study presents a novel framework for identifying COPD-relevant clinical documents using lightweight embedding and machine learning. This approach effectively filters pertinent documents, enhancing information retrieval precision. The framework’s scalability and minimal annotation needs make it promising for diverse healthcare applications, potentially optimizing clinical outcomes through efficient document selection for data-driven decision support systems.

Version published to 10.1101/2025.04.22.25326240v1 on medRxiv
Apr 25, 2025

Longitudinal Masked Representation Learning for Pulmonary Nodule Diagnosis from Language Embedded EHRs

This article has 9 authors:
1. Thomas Z. Li
2. John M. Still
3. Lianrui Zuo
4. Yihao Liu
5. Aravind R. Krishnan
6. Kim L. Sandler
7. Fabien Maldonado
8. Thomas A. Lasko
9. Bennett A. Landman
This article has no evaluations
Latest version May 11, 2025

Automated Insomnia Phenotyping from Electronic Health Records: Leveraging Large Language Models to Decode Clinical Narratives

This article has 11 authors:
1. Guillermo Lopez-Garcia
2. Davy Weissenbacher
3. Matthew Stadler
4. Karen O’Connor
5. Dongfang Xu
6. Lauren Gryboski
7. Jared Heavens
8. Noor Abu-el-Rub
9. Diego R. Mazzotti
10. Subhajit Chakravorty
11. Graciela Gonzalez-Hernandez
This article has no evaluations
Latest version Jun 3, 2025

Verifiable Summarization of Electronic Health Records Using Large Language Models to Support Chart Review

This article has 14 authors:
1. Ritchie Verma
2. Emily Alsentzer
3. Zachary Strasser
4. Leslie Chang
5. Kirollos Roman
6. Esteban Gershanik
7. Camellia Hernandez
8. Miguel Linares
9. Jorge Rodriguez
10. Durga Thakral
11. Ozan Unlu
12. Jacqueline You
13. Li Zhou
14. David Bates
This article has no evaluations
Latest version Jun 3, 2025
