Comprehensive Structured Abstraction of Pathology Reports Is Now Feasible Using Local Large Language Models
Abstract
Pathology reports contain the most detailed descriptions of cancer diagnoses, yet their unstructured format has long limited large-scale reuse for cancer registries and population surveillance. Prior applications of large language models (LLMs) have therefore focused on narrow extraction tasks, reflecting a persistent implementation trilemma: comprehensive abstraction, strict data privacy, and computational feasibility could not be achieved simultaneously in real-world clinical settings. Given current LLM capabilities, this trilemma can now be resolved. We show that recent open-weight LLMs enable reliable, full-length, schema-bound abstraction of pathology reports on standard on-premise hardware. We present a model-agnostic framework implemented using DSPy, a declarative framework for structured LLM pipelines, in which deterministic, programmatic prompting co-designed with pathologists enables end-to-end structured abstraction. Across 893 real-world pathology reports spanning ten major cancer types, the system achieved a mean exact-match accuracy of 94.3% across 193 CAP-aligned registry fields, including complex variable-length structures such as surgical margins, lymph nodes, and breast biomarkers. All processing was performed locally on a single workstation-class GPU, ensuring data privacy without sacrificing completeness or feasibility. Independent external validation using TCGA pathology reports confirmed robust generalizability.
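To make the headline metric concrete, the following is a minimal sketch of how a mean exact-match accuracy over registry fields might be computed, including variable-length fields such as surgical margins. It is an illustration only, not the authors' evaluation code. It assumes that accuracy is computed per field and then averaged across fields, and that values are compared after light normalization (case and surrounding whitespace); the abstract does not state either detail. The field names (`histologic_type`, `margins`) and the functions `normalize`, `field_accuracies`, and `mean_exact_match` are hypothetical.

```python
from typing import Any


def normalize(value: Any) -> Any:
    """Canonicalize a value so trivial formatting differences are not counted as errors.

    Assumption: case and surrounding whitespace are ignored; the source does not
    specify its normalization rules.
    """
    if isinstance(value, str):
        return value.strip().lower()
    if isinstance(value, list):
        return [normalize(v) for v in value]
    if isinstance(value, dict):
        return {k: normalize(v) for k, v in value.items()}
    return value


def field_accuracies(predicted: list[dict], gold: list[dict]) -> dict[str, float]:
    """Return the exact-match accuracy of each field over the reports that define it."""
    fields = sorted({name for report in gold for name in report})
    result = {}
    for name in fields:
        # Only score reports whose gold annotation contains this field.
        pairs = [(p.get(name), g[name]) for p, g in zip(predicted, gold) if name in g]
        hits = sum(normalize(p) == normalize(g) for p, g in pairs)
        result[name] = hits / len(pairs)
    return result


def mean_exact_match(predicted: list[dict], gold: list[dict]) -> float:
    """Average the per-field accuracies (assumed macro-average across fields)."""
    accuracies = field_accuracies(predicted, gold)
    return sum(accuracies.values()) / len(accuracies)


# Hypothetical toy data: two reports, two fields, one of them variable-length.
gold = [
    {"histologic_type": "Invasive ductal carcinoma",
     "margins": [{"site": "anterior", "status": "negative"}]},
    {"histologic_type": "Invasive lobular carcinoma", "margins": []},
]
predicted = [
    {"histologic_type": "invasive ductal carcinoma ",
     "margins": [{"site": "anterior", "status": "negative"}]},
    {"histologic_type": "Invasive lobular carcinoma",
     "margins": [{"site": "deep", "status": "negative"}]},
]

print(field_accuracies(predicted, gold))
print(mean_exact_match(predicted, gold))
```

In this sketch a variable-length field counts as correct only when the whole list matches, so one spurious margin entry makes the entire field wrong for that report. That is a deliberately strict reading of "exact match"; a per-item scoring scheme would give different numbers.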