Retrospective Validation of an Artificial Intelligence System for Diagnostic Assessment of Prostate Biopsies on the ProMort Cohort: Study Protocol


Abstract

Introduction

Prostate cancer diagnosis and treatment planning depend on accurate histopathological assessment of needle biopsies, particularly through the Gleason scoring system. The inherently subjective nature of Gleason grading creates variability between pathologists, potentially resulting in suboptimal patient management decisions. These reproducibility challenges extend beyond Gleason scoring to other critical diagnostic and prognostic markers, including cancer volume quantification and detection of cribriform morphology patterns and perineural invasion. Artificial intelligence (AI) applications in digital pathology have emerged as promising solutions for enhancing diagnostic consistency and accuracy, with recent research demonstrating that automated systems can match expert-level performance in prostate biopsy evaluation. Nevertheless, comprehensive validation studies have revealed concerning limitations in model generalisability when deployed across different clinical environments and patient populations. Recent systematic reviews revealed widespread risk-of-bias limitations and insufficient external validation in AI diagnostic studies, highlighting the need for accumulated evidence of generalisability before clinical implementation. Rigorous external validation with preregistered protocols using independent datasets from diverse clinical settings remains essential to establish the reliability and safety of AI-assisted prostate pathology systems.

Methods and analysis

This study protocol establishes a framework for the retrospective external validation of an AI system developed for prostate biopsy assessment, to be conducted on case-control samples from the ProMort study, nested within the National Prostate Cancer Register (NPCR) of Sweden (1998-2015). The primary aim is to evaluate the AI model's diagnostic accuracy and Gleason grading performance using datasets fully independent of any model development or previously used validation cohorts. The diversity of the validation samples, spanning multiple geographic regions, temporal collection periods, and reference standards, allows evaluation of model robustness across varied clinical contexts. Secondary aims encompass evaluating AI performance in cancer length estimation and in detection of cribriform patterns and perineural invasion. This protocol delineates procedures for data collection, reference standard clarification, and prespecified statistical analyses, ensuring comprehensive validation and reliable performance assessment. The study design conforms to the CLAIM and STARD-AI reporting guidelines and to recognised best practices for AI validation in medical imaging.
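The abstract does not enumerate the prespecified metrics, so the following is only an illustrative sketch of how AI-pathologist concordance and diagnostic accuracy might be summarised, assuming standard choices such as quadratically weighted Cohen's kappa for grade groups and ROC AUC for cancer detection; the function and variable names are hypothetical and not taken from the protocol.

```python
# Illustrative sketch only: metric choices (quadratically weighted kappa, ROC AUC)
# and all variable names are assumptions for demonstration, not the study's
# prespecified analysis plan.
from sklearn.metrics import cohen_kappa_score, roc_auc_score

def concordance_metrics(reference_grades, ai_grades, reference_cancer, ai_cancer_prob):
    """Compute example agreement and discrimination metrics per biopsy core.

    reference_grades / ai_grades: grade groups (0 = benign, 1-5).
    reference_cancer: binary cancer/benign labels from the reference standard.
    ai_cancer_prob: AI-predicted probability of cancer.
    """
    # Quadratically weighted kappa penalises large grade disagreements more heavily.
    kappa = cohen_kappa_score(reference_grades, ai_grades, weights="quadratic")
    # ROC AUC summarises cancer-vs-benign discrimination across thresholds.
    auc = roc_auc_score(reference_cancer, ai_cancer_prob)
    return {"weighted_kappa": kappa, "auc": auc}

# Toy data only (not study results):
print(concordance_metrics(
    reference_grades=[0, 1, 2, 3, 5],
    ai_grades=[0, 1, 3, 3, 4],
    reference_cancer=[0, 1, 1, 1, 1],
    ai_cancer_prob=[0.1, 0.7, 0.8, 0.9, 0.95],
))
```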

Ethics and dissemination

Data collection and usage were approved by the Swedish Regional Ethics Review Board and the Swedish Ethical Review Authority (permits 2012/1586-31/1, 2016/613-31/2, 2019-01395, 2019-05220). The study adheres to the Declaration of Helsinki principles, and findings will be made available in open access peer-reviewed publications.

STRENGTHS AND LIMITATIONS

  • This study incorporates case-control subsamples from Sweden’s largest clinical prostate cancer database (the National Prostate Cancer Register, NPCR), capturing a broad spectrum of variation across Swedish regions.

  • The validation dataset encompasses samples collected from 1998 to 2015, representing one of the first AI validation studies to systematically evaluate performance across such an extensive temporal range, capturing evolving histological sample preparation techniques and changing population characteristics.

  • A consistent scanning and annotation platform during digitisation eliminates equipment-related technical variation, while standardised annotation protocols among pathologists ensure traceable and reliable reference standards.

  • The case-control design, with 50% cancer-related mortality, may create spectrum and prevalence bias, limiting comparison with typical clinical populations and with other AI studies (see the illustrative sketch after this list).

  • Differences between the diagnostic reporting guidelines applied to the AI model's training data and those applied to our validation dataset may introduce systematic discrepancies that affect the interpretation of AI-pathologist concordance measurements.
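As a rough illustration of the prevalence point above, positive predictive value falls sharply as cancer prevalence drops even when sensitivity and specificity are unchanged; the numbers below are purely hypothetical and not derived from the protocol or study data.

```python
# Hypothetical illustration of prevalence (spectrum) bias: with fixed sensitivity
# and specificity, positive predictive value depends strongly on disease prevalence.
# All values are assumptions for demonstration, not study results.
def ppv(sensitivity, specificity, prevalence):
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

# Enriched case-control mix vs. lower-prevalence clinical settings.
for prevalence in (0.5, 0.2, 0.05):
    print(f"prevalence={prevalence:.0%}  PPV={ppv(0.95, 0.90, prevalence):.2f}")
```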
