MedError: A Machine-Assisted Framework for Systematic Error Analysis in Clinical Concept Extraction

Hongfang Liu
Sunyang Fu
Qiuhao Lu
Jaerong Ahn
Fang Chen
Hanyun Yin
Julia Wen
Zhiyi Yue
Taylor Harrison
Jiang Jun
Xiaoyang Ruan
Ming Huang
Andrew Wen
Liwei Wang
Min Ji Kwak
Nahid Rianon
Yanshan Wang
Ruihong Huang



Abstract

Error analysis is a critical step in evaluating and improving clinical concept extraction models, the most common clinical natural language processing (NLP) task. Unlike corpus annotation, which follows standardized protocols for creating gold-standard datasets, error analysis requires nuanced judgment grounded in both clinical expertise and NLP knowledge. This is especially important given the heterogeneity of clinical text, where variations in documentation style, note structure, and terminology can substantially influence model behavior. Despite its importance, there is currently no standardized, user-level framework to support systematic error analysis in clinical concept extraction tasks. In this study, we developed and validated MedError, a machine-assisted, human-in-the-loop framework designed to standardize and enhance error analysis for clinical concept extraction. We collected and manually curated a corpus of 1,187 unique errors from 4,237 notes across three distinct hospitals. Errors were categorized using our previously validated error taxonomy; the corpus comprises 480 false negatives and 707 false positives spanning 25 error types and 48 clinical concept categories. We evaluated three proprietary and three open-source large language models (LLMs) on automatically classifying these errors under two predefined category schemes containing 26 and 15 categories. MedError integrates best practices in error analysis, LLM-assisted classification and reasoning, and a user-friendly interface to enable more efficient, reproducible, and context-aware error analysis. The framework supports both single-site and federated multisite error analysis, facilitating the effective translation of clinical NLP systems into real-world settings.
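The abstract separates errors into false negatives and false positives before assigning taxonomy types. A minimal sketch of how such errors can be surfaced is shown below. It compares gold and predicted concept mentions under exact span-and-category matching, which is an assumption since the paper may use relaxed matching. The `Mention` schema, `find_errors`, and `count_errors` are hypothetical names for illustration. The sketch does not implement the 25-type taxonomy or the LLM classification step.

```python
from __future__ import annotations

from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    """Hypothetical schema for one concept mention (frozen, so it is hashable)."""
    note_id: str
    start: int      # character offset where the mention begins
    end: int        # character offset where the mention ends
    category: str   # clinical concept category, e.g. "Disease"


def find_errors(gold: set[Mention], predicted: set[Mention]) -> dict[str, set[Mention]]:
    """Exact-match comparison (an assumption, not necessarily the paper's criterion).

    Predictions with no identical gold mention are false positives;
    gold mentions the model did not reproduce are false negatives.
    Using sets also deduplicates identical errors.
    """
    return {
        "false_positive": predicted - gold,
        "false_negative": gold - predicted,
    }


def count_errors(errors: dict[str, set[Mention]]) -> Counter:
    """Tally how many errors fall under each error kind."""
    return Counter({kind: len(mentions) for kind, mentions in errors.items()})


# Illustrative toy data (invented, not from the MedError corpus).
gold = {Mention("n1", 0, 8, "Disease"), Mention("n1", 20, 27, "Medication")}
predicted = {Mention("n1", 0, 8, "Disease"), Mention("n1", 30, 35, "Medication")}

errors = find_errors(gold, predicted)
print(count_errors(errors))
```

Under exact matching, a prediction with the correct span but the wrong category counts as both a false positive and a false negative. That is one reason a downstream taxonomy, such as the one the abstract describes, is needed to explain *why* each error occurred.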

Version published to 10.21203/rs.3.rs-7151650/v1 on Research Square
Sep 17, 2025
