Agent-Based Large Language Model System for Extracting Structured Data from Breast Cancer Synoptic Reports: A Dual-Validation Study

Abstract

Objective

To develop and validate an agent-based Large Language Model (LLM) system for extracting structured data from breast cancer synoptic pathology reports, and to assess the performance gap between synthetic and real-world validation.

Materials and Methods

We developed a modular AI agent-based framework employing sequential specialized LLMs for parsing pathology reports and extracting structured data. We normalized College of American Pathologists (CAP) cancer protocols into 8 sections, 86 subsections, and 229 discrete fields. Seven leading LLMs (gemini-2.5-pro, llama3.3-70b, phi4-14b, deepseek-r1 14B/70B, gemma3-27b, gemini-2.0-flash-lite) were validated using dual evaluation: synthetic validation (864 controlled test cases) and real-world ground truth (6,651 annotated fields from 90 pathology reports).
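As an illustration of the field-level scoring that real-world ground truth evaluation implies, the following minimal sketch compares extracted values against annotated values for one report. The exact-match rule after whitespace and case normalization, the function name `field_accuracy`, and the field names are assumptions made for illustration; the abstract does not state the study's matching criteria.

```python
from typing import Dict, Optional


def _normalize(value: Optional[str]) -> str:
    """Lowercase and collapse whitespace; treat a missing value as empty."""
    if value is None:
        return ""
    return " ".join(str(value).lower().split())


def field_accuracy(
    predicted: Dict[str, Optional[str]],
    annotated: Dict[str, Optional[str]],
) -> float:
    """Fraction of annotated fields whose extracted value matches.

    Assumption: a field is correct only on an exact match after
    normalization. A field absent from `predicted` counts as an empty
    value, so it is wrong unless the annotation is also empty.
    """
    if not annotated:
        raise ValueError("no annotated fields to score")
    correct = sum(
        1
        for name, gold in annotated.items()
        if _normalize(predicted.get(name)) == _normalize(gold)
    )
    return correct / len(annotated)


# Hypothetical report with four annotated fields.
annotated = {
    "tumor_size_mm": "12",
    "histologic_grade": "Grade 2",
    "er_status": "Positive",
    "margins": "Negative",
}
# Hypothetical model output: one wrong value, one missing field.
predicted = {
    "tumor_size_mm": "12",
    "histologic_grade": "grade 2",
    "er_status": "Negative",
}

score = field_accuracy(predicted, annotated)
print(f"field accuracy: {score:.2f}")
```

Under this rule a field the model fails to return is penalized the same way as a wrong value, which is one plausible reading of "missing" annotated fields.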

Results

Synthetic validation demonstrated strong performance (accuracy: 93.8-99.0%). Real-world evaluation revealed field extraction accuracy ranging from 61.8% to 87.7%, demonstrating a substantial “reality gap” with accuracy drops of 11-32 percentage points. The gemini-2.5-pro model achieved the highest real-world accuracy (87.7%). Model size did not predict performance: the 14B-parameter deepseek-r1 (77.6%) outperformed its 70B-parameter counterpart (70.4%).
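The reported figures can be cross-checked with simple arithmetic. The sketch below uses only numbers stated in the abstract; pairing the highest synthetic accuracy with the highest real-world accuracy (and lowest with lowest) is an assumption, since per-model synthetic results are not given here.

```python
# Accuracy figures (percent) as reported in the abstract.
SYNTHETIC_RANGE = (93.8, 99.0)
REAL_WORLD_RANGE = (61.8, 87.7)
DEEPSEEK_R1_14B = 77.6
DEEPSEEK_R1_70B = 70.4


def missed_pct(accuracy_pct: float) -> float:
    """Percentage of annotated fields not extracted correctly."""
    return round(100.0 - accuracy_pct, 1)


# Share of fields missed by the best and worst real-world results.
best_missed = missed_pct(REAL_WORLD_RANGE[1])
worst_missed = missed_pct(REAL_WORLD_RANGE[0])

# Gap bounds, ASSUMING the extremes of the two ranges pair up.
smallest_gap = round(SYNTHETIC_RANGE[1] - REAL_WORLD_RANGE[1], 1)
largest_gap = round(SYNTHETIC_RANGE[0] - REAL_WORLD_RANGE[0], 1)

# Advantage of the 14B deepseek-r1 over the 70B variant, in points.
size_advantage = round(DEEPSEEK_R1_14B - DEEPSEEK_R1_70B, 1)

print(best_missed, worst_missed, smallest_gap, largest_gap, size_advantage)
```

Under that pairing assumption the results agree with the stated drop of 11-32 percentage points, and the 14B model leads its 70B counterpart by 7.2 points.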

Discussion

The substantial performance degradation from synthetic to real-world data underscores the complexity of authentic clinical documentation. Smaller models can achieve competitive or superior accuracy, reducing computational costs. Because the models missed 12-38% of annotated fields, with even the best model missing about 12%, mandatory human verification is essential for clinical deployment.

Conclusion

While LLM-based extraction systems show promise for pathology data extraction, synthetic validation alone provides false confidence. Rigorous real-world ground truth evaluation with expert annotation is essential before clinical deployment. These systems are best positioned as screening tools with mandatory human oversight rather than autonomous decision-making systems.
