The Limitations of TabPFN for High-Dimensional RNA-seq Analysis

Summer Zhou
Vinayak Agarwal
Ashwin Gopinath
Timothy Kassis

Abstract

Tabular Prior-Data Fitted Networks (TabPFN) perform remarkably well on small-to-medium tabular datasets through in-context learning, but struggle with high-dimensional genomic data such as RNA-seq, which has tens of thousands of features. We investigate multiple approaches to adapting TabPFN for transcriptomic analysis using two benchmark datasets: Age-ARCHS4, a regression dataset derived from ARCHS4 (57,873 samples, 10,000 genes), and an Inflammatory Bowel Disease (IBD) classification dataset covering Crohn’s Disease and Ulcerative Colitis samples (2,490 samples, 10,000 genes). Our experimental design proceeds in two phases: first evaluating existing optimization methods, then testing novel adaptations, including (1) self-supervised embedding learning and (2) Bulk-Former integration. We show that under equal training conditions (500 features, 10,000 samples), TabPFN outperforms classical baselines such as random forest and XGBoost. However, when classical methods use the full feature set while TabPFN adaptations attempt to handle higher-dimensional data, all TabPFN variants consistently underperform the naive baseline. Our findings reveal fundamental limitations in current approaches to adapting TabPFN for genomic applications: architectural modifications paradoxically degrade performance, while intelligent metadata-based subgrouping emerges as the most effective strategy for deploying TabPFN on biological data.
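
The equal-conditions setup mentioned in the abstract (500 features, 10,000 samples) can be illustrated with a minimal sketch. The abstract does not say how features or samples were chosen, so the variance ranking and uniform random subsampling below are assumptions, and the function name `constrain_to_budget` is hypothetical rather than taken from the preprint.

```python
import random
from statistics import pvariance


def constrain_to_budget(rows, max_features=500, max_samples=10_000, seed=0):
    """Shrink a samples-by-genes matrix to a fixed sample and feature budget.

    rows: list of equal-length lists of numbers (one inner list per sample).
    Returns (reduced_rows, kept_column_indices, kept_row_indices).
    ASSUMPTION: variance ranking and uniform subsampling are illustrative
    choices, not the selection procedure used in the preprint.
    """
    rng = random.Random(seed)
    n_rows = len(rows)
    n_cols = len(rows[0]) if rows else 0

    # Uniformly subsample samples only when the budget is exceeded.
    if n_rows > max_samples:
        row_idx = sorted(rng.sample(range(n_rows), max_samples))
    else:
        row_idx = list(range(n_rows))

    # Rank genes by population variance over the retained samples.
    variances = [pvariance([rows[i][j] for i in row_idx]) for j in range(n_cols)]
    ranked = sorted(range(n_cols), key=lambda j: variances[j], reverse=True)

    # Keep the top-ranked genes, restored to their original column order.
    col_idx = sorted(ranked[:max_features])

    reduced = [[rows[i][j] for j in col_idx] for i in row_idx]
    return reduced, col_idx, row_idx
```

In practice the column indices would be computed on the training split only and then reused for the test split, so that feature selection does not leak information from held-out samples.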

Version published to 10.1101/2025.08.15.670537 on bioRxiv
Aug 21, 2025
