Employing Consensus-Based Reasoning with Locally Deployed LLMs for Enabling Structured Data Extraction from Surgical Pathology Reports

Abstract

Surgical pathology reports provide essential diagnostic information critical for cancer staging, treatment planning, and cancer registry documentation. However, their writing styles and formats vary widely, reflecting each pathologist's stylistic choices, institutional norms, and practices largely inherited from residency training. This unstructured nature and variability across tumor types and institutions pose significant hurdles for automated data extraction in large-scale analyses. To overcome these challenges, we present a consensus-driven, reasoning-based framework that adapts multiple locally deployed large language models (LLMs) to extract both standard diagnostic variables (site, laterality, histology, stage, grade, and behavior) and organ-specific biomarkers. Each LLM generates structured outputs accompanied by justifications, which are then evaluated for accuracy and coherence by three separate reasoning models (DeepSeek-R1-large, Qwen3-32B, and QwQ-32B). Final consensus values are determined through aggregation, and board-certified pathologists conducted expert validation. The framework was applied to over 6,100 pathology reports from The Cancer Genome Atlas (TCGA) spanning 10 organ systems and to 510 reports from Moffitt Cancer Center. For the TCGA dataset, automated evaluation demonstrated a mean accuracy of 84.9% ± 7.3%, with histology (89.0%), site (88.3%), and behavior (87%) showing the highest extraction accuracy averaged across all models. Expert review of 138 randomly selected reports confirmed high agreement for behavior (100.0%), histology (99%), grade (97%), and site (95%) in the TCGA dataset, with slightly lower performance for stage (88%) and laterality (87%). In Moffitt Cancer Center reports (brain, breast, and lung), accuracy remained high (88.2% ± 7.2%), with behavior (99%), histology and laterality (96% each), grade (93%), and site (91%) achieving strong agreement. Biomarker extraction achieved 70.6% ± 8.1% overall accuracy; TP53 (85%) in brain tumors, Ki-67 (68%) in breast cancer, and ROS1 (82%) in lung cancer showed the highest accuracy within their respective organ systems. Inter-evaluator agreement analysis revealed high concordance (correlations > 0.89) across the three evaluation models. Statistical analyses revealed significant main effects of model type (F=1716.82, p<0.001), variable (F=3236.68, p<0.001), and organ system (F=1946.43, p<0.001), as well as a model × variable × organ interaction (F=24.74, p<0.001), emphasizing the role of clinical context in model performance. These results highlight the potential of stratified, multi-organ evaluation frameworks with multi-evaluator consensus for LLM benchmarking in clinical applications. Overall, this consensus-based approach demonstrates that locally deployed LLMs can provide a transparent, accurate, and auditable solution for integration into real-world pathology workflows such as synoptic reporting and cancer registry abstraction.
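
The abstract describes extractor LLMs whose outputs are scored by three evaluator models before consensus aggregation, but it does not specify the aggregation rule. The following is a minimal Python sketch of one plausible implementation, in which each extracted value is weighted by the sum of the evaluator scores it receives and the highest-supported value wins. All function names, model keys, and score values here are illustrative assumptions, not the authors' code.

```python
from collections import defaultdict

# Evaluator identifiers are placeholders for the three reasoning models
# named in the abstract (DeepSeek-R1-large, Qwen3-32B, QwQ-32B).
EVALUATORS = ("deepseek-r1", "qwen3-32b", "qwq-32b")

def consensus_value(proposals, evaluator_scores):
    """Pick a final value for one diagnostic variable.

    proposals: {extractor_model: extracted_value}
    evaluator_scores: {(extractor_model, evaluator): score in [0, 1]}
    Returns the value with the highest total evaluator support
    (ties resolved arbitrarily by max()).
    """
    support = defaultdict(float)
    for extractor, value in proposals.items():
        # Sum the three evaluators' accuracy/coherence scores for this output.
        support[value] += sum(
            evaluator_scores.get((extractor, ev), 0.0) for ev in EVALUATORS
        )
    return max(support, key=support.get) if support else None

# Example: three extractors propose a histology value for one report.
proposals = {
    "model_a": "invasive ductal carcinoma",
    "model_b": "invasive ductal carcinoma",
    "model_c": "ductal carcinoma in situ",
}
scores = {
    ("model_a", "deepseek-r1"): 0.9, ("model_a", "qwen3-32b"): 0.8, ("model_a", "qwq-32b"): 0.9,
    ("model_b", "deepseek-r1"): 0.7, ("model_b", "qwen3-32b"): 0.8, ("model_b", "qwq-32b"): 0.6,
    ("model_c", "deepseek-r1"): 0.4, ("model_c", "qwen3-32b"): 0.5, ("model_c", "qwq-32b"): 0.3,
}
print(consensus_value(proposals, scores))  # -> invasive ductal carcinoma
```

Score-weighted voting of this kind degenerates to simple majority voting when all evaluator scores are equal, so it is a strict generalization of the plain vote; the paper itself should be consulted for the exact rule used.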
