Generation and Evaluation of Realistic Synthetic Clinical Progress Notes for Prostate Cancer using Large Language Models


Abstract

Background and Objective

Access to real-world electronic health records (EHRs) remains limited by privacy, governance and annotation constraints, hindering the development of clinical natural language processing models. Realistic synthetic progress notes may provide EHR-like corpora that preserve clinically rigorous information on diagnoses, treatments, symptoms, imaging, laboratory findings and therapeutic trajectories without relying directly on sensitive patient records. This study evaluates whether large language models (LLMs) can generate realistic Spanish prostate cancer progress notes from published case reports, preserving clinical content, temporality and hospital-style conventions.

Methods

We compiled 109 Spanish prostate cancer case reports from the biomedical literature and characterised their clinical content using Spanish biomedical named-entity recognition (NER) models, complemented by rule-based extraction of prostate-specific antigen (PSA) values and Gleason scores. GPT-5.4 Nano, Qwen 3.5:35B A3B and GLM-5 were used to generate EHR-style progress notes from these case reports under plain-text and entity-enriched prompting strategies, in both zero-shot and few-shot settings. Evaluation combined lexical and semantic similarity metrics with a structured LLM-as-a-judge assessment using Claude Sonnet 4.6, binary safety screening and expert clinical review.
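
As an illustration of what rule-based extraction of PSA values and Gleason scores could look like, the following is a minimal sketch using regular expressions. The abstract does not describe the actual rules used in the study, so the patterns, the function names `extract_psa` and `extract_gleason`, and the sample sentence are all assumptions made for this example.

```python
import re

# Hypothetical patterns: the study's real extraction rules are not given in the abstract.
# PSA: the token "PSA", up to 30 non-digit characters, then a number that may use
# a comma or a dot as the decimal separator (both occur in Spanish clinical text).
PSA_RE = re.compile(r"\bPSA\b[^\d\n]{0,30}?(\d+(?:[.,]\d+)?)", re.IGNORECASE)

# Gleason: the token "Gleason", optionally a total followed by "(", then the
# primary and secondary grades written as "4+3" or "4 + 3".
GLEASON_RE = re.compile(
    r"\bGleason\b[^\d\n]{0,20}?(?:\d{1,2}\s*\(\s*)?(\d)\s*\+\s*(\d)",
    re.IGNORECASE,
)


def extract_psa(text):
    """Return every PSA value found in the text as a float."""
    return [float(m.group(1).replace(",", ".")) for m in PSA_RE.finditer(text)]


def extract_gleason(text):
    """Return (primary, secondary, total) tuples for each Gleason score found."""
    scores = []
    for m in GLEASON_RE.finditer(text):
        primary, secondary = int(m.group(1)), int(m.group(2))
        scores.append((primary, secondary, primary + secondary))
    return scores


# Invented sample sentence, not taken from the study corpus.
note = (
    "Varón de 68 años con PSA de 12,5 ng/ml. "
    "Biopsia: adenocarcinoma Gleason 7 (4+3). "
    "Tras tratamiento, PSA 0.8 ng/ml."
)
psa_values = extract_psa(note)
gleason_scores = extract_gleason(note)
```

On the sample sentence, `psa_values` holds the two PSA readings in order of appearance and `gleason_scores` holds one tuple with the primary grade, secondary grade and their sum. A production rule set would likely also need to handle units, dates attached to each measurement, and Gleason scores reported only as a total.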

Results

All models preserved substantial clinical content, although lexical-overlap metrics showed variable agreement with semantic and clinical quality assessments, reflecting the abstractive nature of the task. Entity-enriched prompting improved lexical and semantic alignment, but did not consistently improve clinical safety. Qwen 3.5:35B A3B was unstable under entity-enriched few-shot prompting, showing increased safety-critical errors and contradictions. GPT-5.4 Nano achieved strong automatic scores but showed isolated clinical inconsistencies. GLM-5 showed the most robust overall profile and performed close to human-authored notes in expert review.

Conclusions

LLMs can generate clinically plausible Spanish prostate cancer progress notes from published case reports under controlled conditions. These findings support the potential use of EHR-like synthetic corpora for clinical NLP, although reliability remains model- and prompt-dependent. Expert validation and safety-oriented evaluation are therefore necessary before downstream use or clinical deployment.
