Automating cancer registry abstraction with an autonomous, resource-efficient AI for multi-cancer pathology reports: a model development and validation study
Abstract
Background
The manual abstraction of data from unstructured pathology reports for cancer registries is a major bottleneck in oncology informatics. While Large Language Models (LLMs) show promise, their adoption is hindered by prohibitive computational requirements and data privacy concerns. There is an urgent need for an accessible, robust, and autonomous system deployable at the front lines of healthcare.
Methods
We developed a fully autonomous, multi-stage AI workflow that operates on a single professional-grade GPU. The system first deploys a detector that triages raw clinical documents to identify cancer surgical reports requiring cancer registry entry, followed by a classifier that routes each pathology report to one of ten cancer-specific information extraction engines. This privacy-preserving pipeline is powered by a locally deployed open-weight LLM (gpt-oss:20b) and programmed with DSPy, a self-optimizing framework for LLMs. The end-to-end system was validated on 893 uncurated, real-world cancer surgical reports, with a gold standard generated by two board-certified pathologists.
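The three-stage control flow described above (detect, classify, route to an extractor) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names `run_pipeline` and `RegistryResult`, the keyword-based stand-in stages, and the field name `margin_status` are all hypothetical. In the actual system each stage would presumably wrap a call to the locally hosted LLM.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional

# Hypothetical stage signatures: each stage is any callable over the report text.
Detector = Callable[[str], bool]             # cancer surgical report needing registry entry?
Classifier = Callable[[str], str]            # cancer type label used for routing
Extractor = Callable[[str], Dict[str, str]]  # registry fields for one cancer type


@dataclass
class RegistryResult:
    is_cancer_surgical_report: bool
    cancer_type: Optional[str] = None
    fields: Optional[Dict[str, str]] = None


def run_pipeline(
    report: str,
    detector: Detector,
    classifier: Classifier,
    extractors: Dict[str, Extractor],
) -> RegistryResult:
    """Triage the document, classify the cancer type, then route to an extractor."""
    if not detector(report):
        # Stage 1 rejects documents that do not require registry entry.
        return RegistryResult(False)
    cancer_type = classifier(report)
    extractor = extractors.get(cancer_type)
    if extractor is None:
        # No engine registered for this type: report the label, extract nothing.
        return RegistryResult(True, cancer_type=cancer_type)
    return RegistryResult(True, cancer_type, extractor(report))


# Toy keyword rules standing in for the LLM-backed stages (assumption for demo only).
def toy_detector(text: str) -> bool:
    return "carcinoma" in text.lower()


def toy_classifier(text: str) -> str:
    return "breast" if "breast" in text.lower() else "other"


def toy_breast_extractor(text: str) -> Dict[str, str]:
    status = "negative" if "margins negative" in text.lower() else "unknown"
    return {"margin_status": status}


TOY_EXTRACTORS: Dict[str, Extractor] = {"breast": toy_breast_extractor}

if __name__ == "__main__":
    demo = "Breast, left, mastectomy: invasive carcinoma. Margins negative."
    print(run_pipeline(demo, toy_detector, toy_classifier, TOY_EXTRACTORS))
```

Keeping the stages as interchangeable callables means a rejected document never reaches the extractors, and adding another cancer type only requires registering one more entry in the routing table.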
Findings
The triage detector achieved an accuracy of 96.6% and an F1 score of 97.9% in detecting cancer surgical reports, and the cancer-type classifier routed reports with an accuracy of 97.8% and a macro-averaged F1 score of 95.5%. Across ten distinct malignancies, the pipeline achieved a mean exact-match accuracy of 93.9% ± 8.6% for 196 registry fields. Critical oncologic information, including margin positivity and key biomarkers for breast cancer, was extracted with acceptable fidelity (90–100%). The entire autonomous process was executed at an acceptable speed.
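For readers unfamiliar with the reported metrics, the sketch below shows one conventional way to compute accuracy, per-class and macro-averaged F1, and per-field exact-match accuracy. It is illustrative only: the function names are hypothetical, and the abstract does not say whether the ± value is a standard deviation across fields or across cancer types, nor how missing fields were scored, so those choices are assumptions here.

```python
from statistics import mean, stdev
from typing import Dict, List, Sequence


def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Fraction of predictions that equal the gold label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_for_label(y_true: Sequence[str], y_pred: Sequence[str], label: str) -> float:
    """F1 for one class: 2*TP / (2*TP + FP + FN), or 0.0 if the class never occurs."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of per-class F1 over every label seen in gold or predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    return mean(f1_for_label(y_true, y_pred, label) for label in labels)


def exact_match_by_field(
    gold: List[Dict[str, str]], pred: List[Dict[str, str]]
) -> Dict[str, float]:
    """Per-field share of reports whose extracted value equals the gold value exactly.

    Assumption: a field is scored only on reports where the gold record has it.
    """
    fields = sorted({name for record in gold for name in record})
    return {
        name: mean(
            float(g[name] == p.get(name)) for g, p in zip(gold, pred) if name in g
        )
        for name in fields
    }


def summarize(per_field: Dict[str, float]) -> str:
    """Mean and sample standard deviation across fields (assumed meaning of the ±)."""
    values = list(per_field.values())
    spread = stdev(values) if len(values) > 1 else 0.0
    return f"{mean(values):.1%} ± {spread:.1%}"
```

Macro averaging weights every cancer type equally, so rare types count as much as common ones, which is one plausible reason a macro-averaged F1 can sit below the overall routing accuracy.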
Interpretation
This work provides a deployable, resource-efficient blueprint for a “digital registrar” robust enough for real-world clinical workflows. This marks a pivotal step toward transforming global cancer surveillance from a slow, retrospective process into a near real-time system.
Funding
National Science and Technology Council (NSTC) of Taiwan