Coreset Selection for Scalable and Efficient Prostate Cancer Grading in Digital Pathology

Mario Verdicchio
Aram Movsisyan
Valentina Brancato
Marco Aiello
Anna Shahinyan


Abstract

Background: Digital pathology generates vast amounts of whole-slide images (WSIs), offering unprecedented opportunities for AI-driven precision medicine. However, the development of robust and generalizable models faces systemic challenges, primarily the scarcity and high cost of high-quality manual annotations from expert clinicians. Efficient selection of representative subsets of data (coresets) can reduce annotation effort and computational cost while maintaining model performance. This study introduces AIMS (An Informative Meaningful Subset), a coreset selection strategy, and validates it on a use case of prostate cancer (PCa) grading.
Methods: This study uses a dataset of 187 H&E-stained WSIs from prostatectomy sections, tiled at 20× magnification with 512×512 pixel patches. AIMS employs a convolutional autoencoder to learn a compact latent representation of these tiles, followed by a geometry- and activation-driven subsampling strategy. The feature space of mean activation vectors is partitioned, and representative coresets (0.5%, 1%, and 2.5% of the full dataset) are selected via cosine similarity. Performance is evaluated using ConvNeXt (trained from scratch) and ResNet18 (with pre-trained UNI embeddings), comparing AIMS-selected subsets to random sampling across datasets with varying characteristics. Model performance is assessed using F1-macro, F1-micro, and F1-weighted scores, AUC, and Cohen’s kappa, with additional evaluation of cross-scanner generalization.
Results: Coreset selection via AIMS consistently outperformed random sampling. ConvNeXt trained on AIMS-selected subsets achieved competitive performance using only a fraction of the original data (F1-macro = [0.724-0.782] and AUC = [0.921-0.945]). Cross-scanner evaluation demonstrated robustness to hardware variability. Similar performance was obtained using AIMS-selected tiles in the UNI feature space (F1-macro = [0.735-0.742] and AUC = [0.941-0.942]).
Conclusion: AIMS-based targeted coreset selection can achieve promising classification performance, representing a viable strategy for substantially reducing annotation costs. This data-efficient approach offers a practical and scalable strategy for deploying AI in digital pathology, particularly in data-constrained environments, and represents a step toward real-world clinical applicability of AI-driven PCa diagnosis and grading.
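
As an illustration of the selection step described in the Methods, the following is a minimal sketch, not the authors' implementation. The abstract does not say how the feature space is partitioned or how cosine similarity is applied, so the sketch assumes a spherical k-means-style partition and picks, in round-robin order, the tiles most similar to each partition's centroid. All function names, the number of partitions, and the synthetic feature vectors are hypothetical.

```python
import math
import random


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def partition_by_cosine(features, n_parts, n_iter=10, seed=0):
    """Assumed partitioning step: k-means-style clustering under cosine similarity."""
    rng = random.Random(seed)
    centroids = [list(features[i]) for i in rng.sample(range(len(features)), n_parts)]

    def nearest(vec):
        return max(range(n_parts), key=lambda k: cosine_similarity(vec, centroids[k]))

    for _ in range(n_iter):
        assignment = [nearest(f) for f in features]
        for k in range(n_parts):
            members = [f for f, a in zip(features, assignment) if a == k]
            if members:  # keep the old centroid if a partition becomes empty
                centroids[k] = mean_vector(members)
    assignment = [nearest(f) for f in features]
    return assignment, centroids


def select_coreset(features, fraction, n_parts=4, seed=0):
    """Return sorted indices of a coreset holding roughly `fraction` of the tiles."""
    budget = max(1, round(fraction * len(features)))
    assignment, centroids = partition_by_cosine(features, n_parts, seed=seed)

    # Rank the tiles of each partition by similarity to that partition's centroid.
    ranked = []
    for k in range(n_parts):
        members = [i for i, a in enumerate(assignment) if a == k]
        members.sort(key=lambda i: cosine_similarity(features[i], centroids[k]),
                     reverse=True)
        ranked.append(members)

    # Take tiles round-robin across partitions so every region is represented.
    selected, depth = [], 0
    while len(selected) < budget:
        added = False
        for members in ranked:
            if depth < len(members) and len(selected) < budget:
                selected.append(members[depth])
                added = True
        if not added:
            break
        depth += 1
    return sorted(selected)


# Synthetic stand-in for per-tile mean activation vectors (hypothetical data).
rng = random.Random(42)
features = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(400)]
coreset = select_coreset(features, fraction=0.025)
print(len(coreset), "of", len(features), "tiles selected")
```

In practice the feature vectors would come from the trained autoencoder rather than a random generator, and the fractions of interest would be the 0.5%, 1%, and 2.5% budgets reported in the abstract.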

Version published to 10.21203/rs.3.rs-7751673/v1 on Research Square
Oct 1, 2025

Prognosis Prediction in Bladder Cancer Pathological Images Based on Nuclear Structure Encoding

This article has 7 authors:
1. Bo Guan
2. Yuan Gao
3. Feng Wang
4. Guangdi Chu
5. Jianchang Zhao
6. Haitao Niu
7. Jianmin Li
This article has no evaluations. Latest version: Dec 29, 2025

UniSkin-Net: A Unified Multi-Task Framework for Skin Cancer Segmentation, Classification, and Detection

This article has 5 authors:
1. Eman Abdullah Aldakheel
2. Mohammed Zakariah
3. Syed Umar Amin
4. Parul Dubey
5. Zafar Iqbal Khan
This article has no evaluations. Latest version: Dec 22, 2025

Deep Learning for Preoperative MRI-Based Endometrial Cancer Staging Prediction

This article has 5 authors:
1. Caili Gong
2. Yetong Qi
3. Ying Su
4. Tianjiao Li
5. Yongfeng Wei
This article has no evaluations. Latest version: Dec 12, 2025
