Combining Clinician Expertise with Prompt Engineering enhances Small Language Models Reliability for Cancer Entity Recognition in Electronic Health Records

Federica Corso
Vittoria Peppoloni
Laura Mazzeo
Giuseppe Leone
Luana Passos
Vanja Mišković
Justin Armanini
Alberto Ferrarin
Isabella Catharina Wiest
Fabian Wolf
Giulia Montelatici
Rebecca Romanò
Ambrosini Paolo
Tommaso Capoccia
Stefano Natangelo
Simone Rota
Paola Andena
Marta De Ponti
Alessandra Russo
Giulia Stasi
Leonardo Provenzano
Andrea Spagnoletti
Marco Meazza Prina
Chiara Cavalli
Claudia Giani
Roberta Serino
Michele Borracino
Chiara Bonalume
Rosa Maria di Mauro
Claudia Agosta
Andra Diana Dumitrascu
Giorgia Di Liberti
Giulia Corrao
Teresa Beninato
Monica Ganzinelli
Mario Occhipinti
Marta Brambilla
Claudia Proto
Jakob Nicholas Kather
Alessandra Laura Giulia Pedrocchi
Filippo De Braud
Giuseppe Lo Russo
Paolo Baili
Arsela Prelaj


Abstract

Real-world data (RWD), largely stored in unstructured electronic health records (EHRs), are critical for understanding complex diseases like cancer. However, extracting structured information from these narratives is challenging due to linguistic variability, semantic complexity, and privacy concerns. This study evaluates the performance of four locally deployable small language models (SLMs), namely LLaMA, Mistral, BioMistral, and MedLLaMA, for information extraction (IE) from Italian EHRs within the APOLLO 11 trial on non-small cell lung cancer (NSCLC). We examined three prompting strategies (zero-shot, few-shot, and annotated few-shot) in both English and Italian, and involved clinicians with varying levels of expertise to assess the impact of prompt design on accuracy. Results show that general-purpose models (e.g., LLaMA 3.1 8B) outperform biomedical models in most tasks, particularly in extracting binary features. Multiclass variables such as TNM staging, PD-L1, and ECOG were more difficult to extract due to implicit language and lack of standardization. Few-shot prompting and native-language inputs significantly improved performance and reduced hallucinations. Clinical expertise enhanced consistency in annotation, particularly among students using annotated examples. The study confirms that privacy-preserving SLMs can be deployed locally for efficient and secure cancer data extraction. The findings highlight the need for hybrid systems that combine SLMs with expert input and underline the importance of aligning clinical documentation practices with SLM capabilities. This is the first study to benchmark SLMs on Italian EHRs and to investigate the role of clinical expertise in prompt engineering, offering valuable insights for the future integration of SLMs into real-world clinical workflows.

Version published to 10.1101/2025.10.16.25337917 on medRxiv
Oct 21, 2025

Related articles

Evaluating Language Models for Biomedical Fact-Checking: A Benchmark Dataset for Cancer Variant Interpretation Verification

This article has 15 authors:
1. Caralyn Reisle
2. Cameron J. Grisdale
3. Kilannin Krysiak
4. Arpad M. Danos
5. Mariam Khanfar
6. Erin Pleasance
7. Jason Saliba
8. Melika Hanos
9. Nilan V. Patel
10. Asmita Jain
11. Joshua F McMichael
12. Ajay C. Venigalla
13. Malachi Griffith
14. Obi L. Griffith
15. Steven J. M. Jones
This article has no evaluations
Latest version Sep 15, 2025

MedError: A Machine-Assisted Framework for Systematic Error Analysis in Clinical Concept Extraction

This article has 18 authors:
1. Hongfang Liu
2. Sunyang Fu
3. Qiuhao Lu
4. Jaerong Ahn
5. Fang Chen
6. Hanyun Yin
7. Julia Wen
8. Zhiyi Yue
9. Taylor Harrison
10. Jiang Jun
11. Xiaoyang Ruan
12. Ming Huang
13. Andrew Wen
14. Liwei Wang
15. Min Ji Kwak
16. Nahid Rianon
17. Yanshan Wang
18. Ruihong Huang
This article has no evaluations
Latest version Sep 17, 2025

From text to tables: Zero-shot extraction of structured clinical data from free-text CT scan reports using foundational large language models

This article has 10 authors:
1. Alex Hongslo
2. Amulya Gupta
3. Quynh Nguyen
4. Jake Caldwell
5. Ben Choi
6. Christopher J. Harvey
7. Jeffrey Thompson
8. Diego Mazzotti
9. Zijun Yao
10. Amit Noheria
This article has no evaluations
Latest version Oct 7, 2025
