Prediction of TP53 biomarkers and survival outcomes from whole slide images using a vision transformer-based multi-instance learning framework

Abstract

Background

Accurate molecular profiling and prognostication from routine histopathology slides could transform precision oncology. We developed a Vision Transformer (ViT)-based multi-instance learning (MIL) framework that jointly performs classification across 32 solid tumour types, TP53 biomarker detection, and survival prediction directly from whole slide images (WSIs).

Methods

We curated 11,060 primary tumours from the TCGA Pan-Cancer Atlas with corresponding somatic mutation, RNA-seq, and clinical outcome data. TP53 alterations were classified as pathogenic drivers using COSMIC and hotspot annotations. WSIs underwent tissue masking, quality control, stain normalisation, and patch extraction (518 × 518 pixels) at 6× downsampling. Each patch was encoded by a ViT into a 768-dimensional embedding, and the resulting token sequence was passed to a 6-layer Transformer aggregator with learnable classification and positional embeddings. Seven task heads generated predictions for cancer type, TP53 mutation status, TP53 RNA expression level, overall survival (OS), progression-free interval (PFI), and the corresponding OS and PFI times. Training proceeded in two stages: the model was first trained on tumour tissue patches from WSIs at five magnifications, then fine-tuned on patches from all tissue regions with a content-aware strategy, updating all MIL layers for up to 150 epochs at a learning rate of 1 × 10⁻⁵. Performance was evaluated on an independent validation set of 1,729 slides using classification metrics, including the area under the receiver operating characteristic curve (AUROC), regression metrics, and concordance indices (C-index).
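The aggregation step described above can be sketched in PyTorch. This is a minimal illustration, assuming hypothetical dimensions, head names, and an attention-head count that the abstract does not specify; it is not the authors' implementation.

```python
import torch
import torch.nn as nn

class MILAggregator(nn.Module):
    """Sketch of a 6-layer Transformer aggregator over ViT patch embeddings.

    Each WSI yields a bag of 768-d patch embeddings; a learnable [CLS] token
    is prepended, positional embeddings are added, and seven linear heads read
    the aggregated [CLS] representation. Head names, hidden sizes, and the
    number of attention heads are illustrative assumptions.
    """

    def __init__(self, dim=768, n_layers=6, n_heads=8, max_patches=4096, n_classes=32):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, max_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        # Seven task heads: cancer type, TP53 mutation, TP53 RNA level,
        # OS event, PFI event, OS time, PFI time.
        self.heads = nn.ModuleDict({
            "cancer_type": nn.Linear(dim, n_classes),
            "tp53_mutation": nn.Linear(dim, 1),
            "tp53_rna": nn.Linear(dim, 1),
            "os_event": nn.Linear(dim, 1),
            "pfi_event": nn.Linear(dim, 1),
            "os_time": nn.Linear(dim, 1),
            "pfi_time": nn.Linear(dim, 1),
        })

    def forward(self, patch_embeddings):
        # patch_embeddings: (batch, n_patches, 768) from a pretrained ViT
        b, n, _ = patch_embeddings.shape
        cls = self.cls_token.expand(b, -1, -1)
        x = torch.cat([cls, patch_embeddings], dim=1) + self.pos_embed[:, : n + 1]
        x = self.encoder(x)
        cls_out = x[:, 0]  # slide-level representation from the [CLS] position
        return {name: head(cls_out) for name, head in self.heads.items()}

model = MILAggregator()
out = model(torch.randn(2, 100, 768))  # 2 slides, 100 patches each
```

Because all seven heads share one aggregated representation, a single forward pass per slide serves every task, which is what makes the combined prediction setting tractable.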

Results

The multi-resolution ViT-based MIL model achieved an AUROC of 0.775 (95% CI: 0.749–0.801) for TP53 mutation detection on the validation set, demonstrating strong overall performance across the classification and survival prediction tasks. The fine-tuned model performed robustly across tasks, with 0.7569 accuracy for cancer classification, 0.745 AUROC for TP53 mutation detection, C-indices of 0.686 and 0.650 for OS and PFI, and a mean squared error of 1.072 for TP53 RNA expression estimation. On the external validation set, the fine-tuned model attained 65.9% accuracy (95% CI: 0.636–0.681) for tumour classification and an AUROC of 0.766 (95% CI: 0.743–0.789) for TP53 mutation detection. With class-specific thresholds selected via the Youden index, every tumour class except ovarian cancer reached an AUROC above 0.88, indicating strong generalisation across the 32 tumour types and reasonable molecular profiling, though prognostic utility in surgical oncology remains limited.
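The concordance index reported for the OS and PFI tasks can be computed in a few lines. Below is a minimal pure-Python sketch of Harrell's C-index with right-censoring, using toy data for illustration (not values from the study):

```python
from itertools import combinations

def concordance_index(times, events, risks):
    """Harrell's C-index for right-censored survival data.

    A pair (i, j) is comparable when the subject with the shorter follow-up
    time had an observed event (event == 1); the pair is concordant when the
    model assigns that subject the higher risk. Ties in risk count as half.
    """
    concordant, comparable = 0.0, 0
    for i, j in combinations(range(len(times)), 2):
        # Order the pair so subject `a` has the shorter follow-up time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        if times[a] == times[b] or not events[a]:
            continue  # tied times or earlier subject censored: not comparable
        comparable += 1
        if risks[a] > risks[b]:
            concordant += 1.0
        elif risks[a] == risks[b]:
            concordant += 0.5
    return concordant / comparable

# Toy example: risk ordering perfectly matches survival ordering.
c = concordance_index(times=[2, 4, 6, 8], events=[1, 1, 0, 1], risks=[0.9, 0.7, 0.4, 0.2])
# c == 1.0 here; a random risk score gives roughly 0.5
```

A C-index of 0.686 for OS, as reported above, therefore means the model ranks about 69% of comparable patient pairs correctly by risk.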

Conclusion

A ViT-based MIL model can simultaneously infer tumour type, TP53 mutation status, and TP53 RNA expression level directly from WSIs, with performance comparable to conventional genomic assays, although prognostic performance remains limited. This integrated, slide-level approach offers a scalable pipeline for computational pathology.
