Generation and Evaluation of Realistic Synthetic Clinical Progress Notes for Prostate Cancer using Large Language Models
Abstract
Background and Objective
Access to real-world electronic health records (EHRs) remains limited by privacy, governance and annotation constraints, hindering the development of clinical natural language processing models. Realistic synthetic progress notes may provide EHR-like corpora that preserve clinically rigorous information on diagnoses, treatments, symptoms, imaging, laboratory findings and therapeutic trajectories without relying directly on sensitive patient records. This study evaluates whether large language models (LLMs) can generate realistic Spanish prostate cancer progress notes from published case reports, preserving clinical content, temporality and hospital-style conventions.
Methods
We compiled 109 Spanish prostate cancer case reports from the biomedical literature and characterised their clinical content using Spanish biomedical named-entity recognition (NER) models, complemented by rule-based extraction of prostate-specific antigen (PSA) values and Gleason scores. GPT-5.4 Nano, Qwen 3.5:35B A3B and GLM-5 were used to generate EHR-style progress notes from these case reports under plain-text and entity-enriched prompting strategies, in both zero-shot and few-shot settings. Evaluation combined lexical and semantic similarity metrics with structured LLM-as-a-judge assessment using Claude Sonnet 4.6, binary safety screening and expert clinical review.
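As an illustration of what rule-based extraction of PSA values and Gleason scores could look like, the following is a minimal sketch using regular expressions. The patterns, the function names `extract_psa` and `extract_gleason`, and the example sentence are assumptions made for illustration only; the abstract does not describe the actual rules used in the study.

```python
import re

# Hypothetical patterns: the study's real extraction rules are not given
# in the abstract. These cover only a few common surface forms.

# "PSA de 12,5 ng/ml", "PSA: 0.03 ng/mL", "PSA total 8"
PSA_RE = re.compile(r"\bPSA\b[^\d\n]{0,30}?(\d+(?:[.,]\d+)?)", re.IGNORECASE)

# "Gleason 4+3", "Gleason 4+3=7", "Gleason 7 (4+3)", "Gleason 8"
GLEASON_RE = re.compile(
    r"\bGleason\b[^\d\n]{0,20}?"
    r"(?:(\d)\s*\+\s*(\d)(?:\s*=\s*(\d{1,2}))?"          # primary+secondary[=total]
    r"|(\d{1,2})(?:\s*\(\s*(\d)\s*\+\s*(\d)\s*\))?)",    # total [(primary+secondary)]
    re.IGNORECASE,
)


def extract_psa(text):
    """Return PSA values as floats, accepting a Spanish decimal comma.

    Thousands separators (e.g. "1.234,5") are not handled in this sketch.
    """
    return [float(m.group(1).replace(",", ".")) for m in PSA_RE.finditer(text)]


def extract_gleason(text):
    """Return a list of dicts with primary, secondary and total Gleason score."""
    results = []
    for m in GLEASON_RE.finditer(text):
        if m.group(1):
            # Form "4+3" or "4+3=7"
            primary, secondary = int(m.group(1)), int(m.group(2))
            total = int(m.group(3)) if m.group(3) else primary + secondary
        else:
            # Form "7" or "7 (4+3)"
            total = int(m.group(4))
            primary = int(m.group(5)) if m.group(5) else None
            secondary = int(m.group(6)) if m.group(6) else None
        results.append({"primary": primary, "secondary": secondary, "total": total})
    return results


# Invented example sentence, not taken from the study corpus.
example = (
    "Paciente con PSA de 12,5 ng/ml. Biopsia: adenocarcinoma Gleason 7 (4+3). "
    "PSA posterior: 0.03 ng/mL."
)
psa_values = extract_psa(example)
gleason_scores = extract_gleason(example)
```

Rules of this kind are easy to audit, which may be why they are used alongside NER models for tightly formatted values, but they would need extending for units, ranges and negated mentions before use on real notes.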
Results
All models preserved substantial clinical content, although lexical-overlap metrics showed variable agreement with semantic and clinical quality assessments, reflecting the abstractive nature of the task. Entity-enriched prompting improved lexical and semantic alignment, but did not consistently improve clinical safety. Qwen 3.5:35B A3B was unstable under entity-enriched few-shot prompting, showing increased safety-critical errors and contradictions. GPT-5.4 Nano achieved strong automatic scores but showed isolated clinical inconsistencies. GLM-5 showed the most robust overall profile and performed close to human-authored notes in expert review.
Conclusions
LLMs can generate clinically plausible Spanish prostate cancer progress notes from published case reports under controlled conditions. These findings support the potential use of EHR-like synthetic corpora for clinical NLP, although reliability remains model- and prompt-dependent. Expert validation and safety-oriented evaluation are therefore necessary before downstream use or clinical deployment.