Comprehensive Structured Abstraction of Pathology Reports Is Now Feasible Using Local Large Language Models

Nan-Haw Chow
Han Chang
Hung-Kai Chen
Chen-Yuan Lin
Ying-Lung Liu
Po-Yen Tseng
Li-Ju Shiu
Yen-Wei Chu
Pau-Choo Chung
Kai-Po Chang


Abstract

Pathology reports contain the most detailed descriptions of cancer diagnoses, yet their unstructured format has long limited large-scale reuse for cancer registries and population surveillance. Prior applications of large language models (LLMs) have therefore focused on narrow extraction tasks, reflecting a persistent implementation trilemma: comprehensive abstraction, strict data privacy, and computational feasibility could not be achieved simultaneously in real-world clinical settings. Given current LLM capabilities, this trilemma can now be resolved. We show that recent open-weight LLMs enable reliable, full-length, schema-bound abstraction of pathology reports on standard on-premise hardware. We present a model-agnostic framework implemented using DSPy, a declarative framework for structured LLM pipelines, in which deterministic, programmatic prompting co-designed with pathologists enables end-to-end structured abstraction. Across 893 real-world pathology reports spanning ten major cancer types, the system achieved a mean exact-match accuracy of 94.3% across 193 CAP-aligned registry fields, including complex variable-length structures such as surgical margins, lymph nodes, and breast biomarkers. All processing was performed locally on a single workstation-class GPU, ensuring data privacy without sacrificing completeness or feasibility. Independent external validation using TCGA pathology reports confirmed robust generalizability.
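The headline metric, mean exact-match accuracy across registry fields, can be illustrated with a minimal sketch. The abstract does not specify how values were normalized or how the mean was aggregated, so the code below assumes a light normalization of strings (trimming and lowercasing) and a macro-average over fields. The field names (`histologic_type`, `margins`) and function names are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def _normalize(value):
    """Canonicalize a field value before comparison (assumed rule, not the paper's)."""
    if isinstance(value, str):
        return value.strip().lower()
    if isinstance(value, list):
        # Variable-length structures (e.g. margins) are compared element by element, in order.
        return [_normalize(v) for v in value]
    if isinstance(value, dict):
        return {k: _normalize(v) for k, v in value.items()}
    return value


def exact_match_by_field(predictions, references):
    """Return {field: fraction of reports whose predicted value equals the reference}."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    hits = defaultdict(int)
    totals = defaultdict(int)
    for pred, ref in zip(predictions, references):
        for field, gold in ref.items():
            totals[field] += 1
            # A field missing from the prediction counts as a miss.
            if field in pred and _normalize(pred[field]) == _normalize(gold):
                hits[field] += 1
    return {field: hits[field] / totals[field] for field in totals}


def mean_exact_match(per_field):
    """Macro-average: every field contributes equally (an assumption)."""
    return sum(per_field.values()) / len(per_field)


# Hypothetical two-report example.
references = [
    {"histologic_type": "Invasive ductal carcinoma",
     "margins": [{"site": "deep", "status": "negative"}]},
    {"histologic_type": "Lobular carcinoma", "margins": []},
]
predictions = [
    {"histologic_type": "invasive ductal carcinoma ",
     "margins": [{"site": "deep", "status": "negative"}]},
    {"histologic_type": "Ductal carcinoma", "margins": []},
]

per_field = exact_match_by_field(predictions, references)
print(per_field)
print(mean_exact_match(per_field))
```

Under exact matching, a variable-length field such as a list of margins counts as correct only when every element agrees, which is why such fields are the hardest to score well on. A micro-average over all field instances would be an equally plausible reading of "mean exact-match accuracy".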

Version published to 10.1101/2025.10.21.25338475 on medRxiv
Oct 23, 2025

