Extracting social determinants of health from electronic health records: development and comparison of rule-based and large language models-based methods
Abstract
Objectives
Social determinants of health (SDoH) are critical drivers of health outcomes but are often under-documented in structured electronic health record data. This study aimed to develop and evaluate scalable methods for extracting seven SDoH domain categories and 23 subcategories from unstructured clinical notes using both rule-based and large language model (LLM)-based approaches.
Methods
We constructed a gold-standard SDoH corpus comprising clinical text segments from 171 patients in the Mass General Brigham Research Patient Data Registry. We developed a rule-based system (RBS) and compared its performance with that of seven OpenAI models (GPT-4o, GPT-4.1, GPT-4.1-mini, o4-mini, GPT-5, GPT-5-mini, and o3) under zero-shot and few-shot settings with multiple prompting strategies. We also implemented ensemble models that combine RBS and LLM outputs via late fusion.
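As an illustration of late fusion, the sketch below merges per-segment label sets from the two systems after each has made its own predictions. The abstract does not state which fusion rule was used, so the union rule here is an assumption, and the function name, segment IDs, and label names are hypothetical.

```python
from typing import Dict, Set

def late_fusion_union(
    rbs_labels: Dict[str, Set[str]],
    llm_labels: Dict[str, Set[str]],
) -> Dict[str, Set[str]]:
    """Combine per-segment SDoH labels from two systems by set union.

    Assumption: a label is kept if either system predicted it. This
    favors recall; the fusion rule used in the study may differ.
    """
    fused: Dict[str, Set[str]] = {}
    # Cover every segment seen by either system.
    for segment_id in rbs_labels.keys() | llm_labels.keys():
        fused[segment_id] = (
            rbs_labels.get(segment_id, set()) | llm_labels.get(segment_id, set())
        )
    return fused

# Hypothetical predictions for two text segments.
rbs = {"seg1": {"housing"}, "seg2": set()}
llm = {"seg1": {"housing", "employment"}, "seg2": {"social_support"}}
fused = late_fusion_union(rbs, llm)
```

A union rule trades some precision for recall. An intersection rule, or one that trusts the RBS only on labels where it is known to be precise, would shift that balance the other way.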
Results
The RBS achieved the highest precision for SDoH domain categories (0.97) but substantially lower recall (0.62). GPT-based models outperformed the RBS in overall F1 score, with GPT-5 and GPT-5-mini (few-shot) achieving the best domain-level F1 of 0.88 and o4-mini achieving the highest subcategory-level F1 of 0.79. The RBS-GPT ensemble improved domain-level F1 to 0.89, with balanced precision (0.90) and recall (0.89). Model performance was consistent across demographic groups.
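For readers relating the precision and recall figures above to F1, the sketch below computes F1 as the harmonic mean of the two. It assumes the reported precision and recall can be combined directly. If the study averaged scores across categories, its reported F1 values need not match this calculation exactly. The RBS value computed here is derived from the figures above and is not reported in the abstract.

```python
def f1_score(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both are 0
    return 2 * precision * recall / (precision + recall)

# RBS domain-level precision and recall from the abstract.
rbs_f1 = f1_score(0.97, 0.62)       # about 0.76

# RBS-GPT ensemble precision and recall from the abstract.
ensemble_f1 = f1_score(0.90, 0.89)  # about 0.89
```

Under that assumption, the high-precision, low-recall RBS lands at roughly 0.76, which is consistent with the GPT-based models outperforming it on F1 despite its precision advantage.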
Conclusion
State-of-the-art GPT models with advanced reasoning capabilities, including the recently released “mini” models (e.g., o4-mini and GPT-5-mini), demonstrated robust performance for SDoH extraction without fine-tuning and outperformed rule-based NLP. Integrating rule-based and LLM approaches further enhanced performance. Our results provide a scalable, cost-efficient framework for accurate identification of SDoH from clinical text, supporting downstream population health research and clinical informatics applications.