Real-World Benchmarking and Validation of Foundation Model Transformers for Endometrial Cancer Subtyping from Histopathology

Vincent M. Wagner
Casey M. Cosgrove
Stephanie J. Chen
Daniel T. Griffin
Megan I. Samuelson
Michael J. Goodheart
Jesus Gonzalez-Bosquet

Read the full article

Discuss this preprint

Start a discussion What are Sciety discussions?

Listed in

This article is not in any list yet, why not save it to one of your lists.

Abstract

To evaluate whether open-source histopathology foundation model pipelines, paired with attention-based multiple instance learning (MIL), can accurately classify molecular subtypes of endometrial cancer (EC) from whole-slide images (WSIs) and maintain performance in a real-world, independent cohort.

Methods

We assembled a public discovery cohort of 815 patients (1,195 WSIs) from The Cancer Genome Atlas and Clinical Proteomic Tumor Analysis Consortium, and an independent external cohort of 720 patients (1,357 WSIs) with molecular subtyping determined by mismatch repair immunohistochemistry plus TP53 and POLE sequencing. Four ImageNet-pretrained convolutional neural networks (CNNs) and six open-source foundation encoders using two MIL aggregation strategies (TransMIL and CLAM) were benchmarked within the STAMP pipeline. Models were trained with five-fold cross-validation and evaluated on an independent cohort. Macro–area under the receiver operating characteristic curve (AUC) was the primary outcome.

Results

In cross-validation, foundation models outperformed CNNs (macro-AUC 0.799-0.860 vs 0.715-0.829). The best configuration (Virchow2 with CLAM) achieved macro-AUC 0.860 (95%CI, 0.839-0.880), macro-F1 score 0.607, and balanced accuracy 0.647. External validation showed substantial degradation for CNNs, while foundation models retained higher discrimination (macro-AUC 0.667-0.780). UNI2 with CLAM had the highest external macro-AUC (0.780), and Virchow2 with CLAM had the best balanced accuracy (0.525). Subtype-level AUCs for UNI2 with CLAM were highest for p53abn (0.851).

Conclusions

Open-source foundation model pipelines with attention-based MIL can deliver accurate and generalizable molecular subtyping of EC directly from WSIs. These models outperform CNNs in real-world validation, supporting their potential as scalable, cost-effective tools to guide precision oncology and triage confirmatory molecular testing.

Version published to 10.1101/2025.10.10.25337691 on medRxiv
Oct 13, 2025

Inferring Clinically Relevant Molecular Subtypes of Pancreatic Cancer from Routine Histopathology Using Deep Learning

This article has 13 authors:
1. Abdul Rehman Akbar
2. Alejandro Levya
3. Ashwini Esnakula
4. Elshad Hasanov
5. Anne Noonan
6. Upender Manne
7. Vaibhav Sahai
8. Lingbin Meng
9. Susan Tsai
10. Anil Parwani
11. Wei Chen
12. Ashish Manne
13. Muhammad Khalid Khan Niazi
This article has no evaluationsLatest version Jan 16, 2026
Cross-Platform Reproducible Modeling of Breast Cancer Prognosis Using the Core-PAM50 Gene Signature

This article has 2 authors:
1. Rafael de Negreiros Botan
2. Joao Batista de Sousa
This article has no evaluationsLatest version Dec 19, 2025
Predicting gene expression from whole slide images in prostate cancer using deep learning

This article has 14 authors:
1. Anxuan Han
2. Bo Li
3. Chui Yan Mah
4. Jessica Logan
5. Yanan Wang
6. Ning Liu
7. Feargal Ryan
8. David Lynn
9. Darren Foreman
10. John O’Leary
11. Douglas Brooks
12. Jose Polo
13. Lisa Butler
14. Fuyi Li
This article has no evaluationsLatest version Feb 4, 2026

Discuss this preprint

Listed in

Abstract

Methods

Results

Conclusions

Article activity feed

Related articles

Inferring Clinically Relevant Molecular Subtypes of Pancreatic Cancer from Routine Histopathology Using Deep Learning

Cross-Platform Reproducible Modeling of Breast Cancer Prognosis Using the Core-PAM50 Gene Signature

Predicting gene expression from whole slide images in prostate cancer using deep learning