A Multicancer AI Framework for Comprehensive Cancer Surveillance from Pathology Reports
Abstract
Cancer registries are essential infrastructure for population-level surveillance, yet existing AI efforts remain constrained by narrow task scope and dependence on proprietary systems. We present a model-agnostic, privacy-first framework that makes cancer registration scalable and globally accessible. Running entirely on local, low-cost hardware, the system performs end-to-end abstraction of unstructured pathology reports, integrating multi-step reasoning with a DSPy-based prompting engine co-designed with pathologists. In validation across ten cancer types, it achieved 96.6% cancer type triage accuracy and 94.3% mean extraction accuracy across 193 CAP-aligned fields, and it captured complex variable-length data for surgical margins, lymph nodes, and breast biomarkers with high fidelity. The framework resolves the clinical AI "implementation trilemma" of balancing comprehensive scope, strict privacy, and computational feasibility. By restoring data completeness on accessible workstation GPUs, it provides a democratized blueprint for unbiased surveillance that unlocks the rich diagnostic granularity of pathology reports for real-time public health action.
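The two-stage flow described in the abstract (triage the cancer type, then extract the fields defined for that type) can be illustrated with a minimal sketch. Everything here is hypothetical: the `SCHEMAS` dictionary, the field names, and the keyword-based stand-ins for the local language model are illustrative assumptions, not the authors' DSPy implementation. The scoring function also assumes that "mean extraction accuracy" is per-field accuracy averaged over fields, which the abstract does not specify.

```python
from typing import Callable, Dict, List

# Hypothetical per-cancer field schemas. The framework in the abstract
# covers 193 CAP-aligned fields; these two short lists are placeholders.
SCHEMAS: Dict[str, List[str]] = {
    "breast": ["histologic_type", "er_status"],
    "colorectal": ["histologic_type", "margin_status"],
}


def abstract_report(
    report: str,
    triage: Callable[[str], str],
    extract: Callable[[str, str], str],
) -> Dict[str, str]:
    """Stage 1 picks the cancer type; stage 2 fills that type's fields."""
    cancer_type = triage(report)
    result = {"cancer_type": cancer_type}
    for field in SCHEMAS.get(cancer_type, []):
        result[field] = extract(report, field)
    return result


def mean_field_accuracy(
    predictions: List[Dict[str, str]], gold: List[Dict[str, str]]
) -> float:
    """Per-field accuracy averaged over fields (an assumed definition)."""
    correct: Dict[str, int] = {}
    total: Dict[str, int] = {}
    for pred, ref in zip(predictions, gold):
        for field, value in ref.items():
            total[field] = total.get(field, 0) + 1
            if pred.get(field) == value:
                correct[field] = correct.get(field, 0) + 1
    if not total:
        return 0.0
    return sum(correct.get(f, 0) / n for f, n in total.items()) / len(total)


# Keyword-based stand-ins for calls to a locally hosted model (assumption).
def keyword_triage(report: str) -> str:
    return "breast" if "breast" in report.lower() else "colorectal"


def keyword_extract(report: str, field: str) -> str:
    text = report.lower()
    if field == "er_status":
        return "positive" if "er positive" in text else "negative"
    if field == "margin_status":
        return "involved" if "margin involved" in text else "clear"
    return "carcinoma" if "carcinoma" in text else "unknown"
```

Passing the model calls in as plain callables keeps the sketch model-agnostic: a stub can be swapped for a real local model without changing the pipeline or the scoring code.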