Towards a Cytometry Foundation Model: Interpretable Sample-level Predictive Modelling via Pretrained Transformers
This article has been Reviewed by the following groups
Listed in
- Evaluated articles (Arcadia Science)
Abstract
Foundation models have transformed scientific data modelling across domains, yet flow cytometry has lacked one. Despite the abundance of high-dimensional cellular data, automated analysis remains bottlenecked by marker variability: prior studies are typically confined to fixed marker panels and homogeneous data, limiting scalability and generalisation due to architectural constraints. We present the Generalised Pretrained Cytometry Transformer (GPCT), an interpretable framework designed to learn from heterogeneous marker panels for sample-level predictive modelling. Through a novel cytometry-specific pretraining regime, GPCT learns transferable cellular representations that achieve high classification accuracy across diverse datasets. Notably, pretraining significantly boosts performance on data-scarce downstream tasks, marking a pivotal step towards a cytometry foundation model. Furthermore, GPCT maintains interpretability and identifies the specific cell subsets most influential to its predictions. This enables direct biological validation of learned patterns and provides a data-driven basis for refining traditional gating strategies.
Article activity feed
-
This dual-masking formulation drives the model to learn robust representations by predicting masked values from complementary perspectives.
In Dataset 1 they have 16 panels, and the leave-one-panel-out drop is <8%. That's the better evidence for robustness.
Then you claim pretraining drives "robust representations despite marker inconsistency" based on the KO task, where Dataset 2 has a completely consistent panel.
Those two claims aren't using the same evidence base and shouldn't be merged into one conclusion.
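To make the dual-masking idea concrete, here is a minimal sketch of what such a scheme could look like. The excerpt does not specify GPCT's actual masking details, so the two "perspectives" below (masking whole cells vs. masking individual marker values), the masking fractions, and the placeholder value are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def dual_mask(events, cell_frac=0.15, marker_frac=0.15):
    """Apply two complementary masks to a cells x markers matrix.

    Hypothetical sketch of a dual-masking objective: one mask hides
    entire cells (rows), the other hides individual marker values
    (entries), so a model must reconstruct values from both views.
    """
    n_cells, n_markers = events.shape
    cell_mask = rng.random(n_cells) < cell_frac                   # whole-row mask
    value_mask = rng.random((n_cells, n_markers)) < marker_frac   # per-entry mask
    combined = value_mask | cell_mask[:, None]
    masked = events.copy()
    masked[combined] = 0.0   # placeholder value for masked entries
    return masked, combined

events = rng.normal(size=(1000, 8))   # toy sample: 1000 cells, 8 markers
masked, mask = dual_mask(events)
```

A reconstruction loss would then be computed only on the entries where `mask` is true.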
-
Notably, large-scale pretraining yielded considerable performance gains in small-data settings, attributable to robust cellular representations that recover biological signals despite marker inconsistency.
If you're concluding this based on per-class AUC while the markers are inconsistent, isn't this claim sketchy at best?
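For reference, per-class AUC in a multi-class setting is usually computed one-vs-rest, per class. The sketch below uses entirely synthetic labels and scores (the class counts only loosely mirror the 72-sample KO setting) to show the metric itself, not any result from the paper.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Synthetic 5-class labels: 72 samples with made-up per-class counts.
y_true = np.repeat(np.arange(5), [15, 15, 14, 14, 14])
rng = np.random.default_rng(1)
scores = rng.random((72, 5))
scores /= scores.sum(axis=1, keepdims=True)   # pseudo-probabilities

# One-vs-rest AUC per class, computed by binarising the labels.
per_class_auc = np.array([
    roc_auc_score((y_true == k).astype(int), scores[:, k]) for k in range(5)
])
```

With random scores these AUCs hover around 0.5; the point is only the shape of the evaluation, which says nothing by itself about whether the representation is robust to marker inconsistency.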
-
These results demonstrate the potential for a cytometry foundation model via large-scale GPCT pretraining.
Oh just claiming potential. ok nvm!
-
Dataset 1: Longitudinal mouse immunophenotype dataset

As part of a long-running mutagenesis project to investigate novel genetic causes of immune dysfunction [18], flow cytometry phenotypes for over forty thousand C57BL/6 mice were obtained at the Australian Phenomics Facility between 1995 and 2015. The data comprise predominantly eight-colour experiments with varying marker/antibody/fluorophore combinations, yet most samples include a backbone of six common markers (IgM, IgD, B220, CD44, CD4, CD3) (Supplementary Tables 10 and 8).

In the present analysis, we have chosen a subset of 14,014 flow cytometry samples (6,978 female, 7,036 male) with a consistent gender metadata label and mostly pan-leukocyte marker panels. Sexual dimorphism rarely produces landmark cell populations readily detectable by manual analysis of flow cytometry data. However, this has proven a tractable problem with the application of neural networks [19], with discriminative signals usually subtle and dispersed across multiple cell populations.

Dataset 2: Knockout Mouse Project immunophenotype dataset

The Knockout Mouse Project (KOMP) [20] generated mouse strains harbouring gene knockouts for the majority of genes in the mouse genome, accompanied by phenotype data including flow cytometry information for a subset of mutant mouse lines. For our purposes, we focus on a subset of samples subjected to a flow cytometry assay of a T cell immunophenotyping panel [21] (Supplementary Table 3). Despite containing nearly 7000 samples, this dataset poses a classic lack-of-data problem, as each knockout (KO) is represented by only 10 to 20 samples. As most knockouts in this dataset were found to lack discernible cellular phenotypes [21], we selected just 5 knockout lines with clear mutant phenotypes characterised by the original study. This yields 72 samples (Supplementary Table 9) for a 5-class KO classification task.
So you have 14k samples for Dataset 1 with only a slight male/female imbalance, but just 72 samples for Dataset 2 because only 5 knockout lines were selected? Also, "most knockouts in this dataset were found to lack discernible cellular phenotypes": is that not concerning if you want to claim general ability, something that can be built on for flow cytometry?
Pre-training distribution has a significant impact on downstream utility.
-
We evaluated the impact of cross-dataset pretraining on model generalisation using two configurations. The first model, the D1 encoder (Experiments A and B), was trained exclusively on Dataset 1. The second, the generic encoder (Experiment C), was pretrained on combined training data from Datasets 1 and 2 before downstream training on Dataset 1 only. Results in Fig. 2b (1) demonstrate that including even a small fraction of Dataset 2 in the pretraining phase significantly improved downstream generalisation to Dataset 2 testing samples.
What exactly do the pre-training distributions look like? Whats the exact mix? Is dataset 1 sufficiently different from dataset 2, specifically as it relates to sample quality and number of samples?
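Since the excerpt never states the exact mix, here is a hypothetical sketch of what "including even a small fraction of Dataset 2" could mean operationally. The `d2_fraction` knob, the sample stand-ins, and the pool construction are all illustrative assumptions, not the paper's procedure.

```python
import numpy as np

rng = np.random.default_rng(2)

def mix_pretraining_pool(d1, d2, d2_fraction=0.1):
    """Combine all of Dataset 1 with a sampled fraction of Dataset 2.

    Hypothetical sketch: `d2_fraction` controls how much of Dataset 2
    is folded into the generic encoder's pretraining pool.
    """
    n2 = int(round(d2_fraction * len(d2)))
    idx = rng.choice(len(d2), size=n2, replace=False)
    pool = list(d1) + [d2[i] for i in idx]
    rng.shuffle(pool)
    return pool

d1 = [("D1", i) for i in range(100)]   # stand-ins for Dataset 1 sample IDs
d2 = [("D2", i) for i in range(70)]    # stand-ins for Dataset 2 sample IDs
pool = mix_pretraining_pool(d1, d2, d2_fraction=0.1)
```

Even a sketch like this makes the reviewer's question concrete: the downstream effect plausibly depends on both the fraction and on how different the two datasets are in quality and size, neither of which is reported in the excerpt.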
-
In this regard, GPCT can be interpreted through the attention mechanism used by the decoder: during inference, each attention head in the multi-head attention layer assigns a weight to every cell, representing its relative contribution to the decision-making process. These weights serve as a quantitative measure of per-cell “importance”, and while they are typically averaged across heads per layer for visualisation, each layer may capture distinct patterns that reflect the model’s internal processing steps.
Interesting concept to make them cell level. Why not clusters of cells?
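The head-averaged per-cell importance described in the excerpt, and the cluster-level view the comment asks about, can both be sketched in a few lines. The attention tensor layout, cluster assignments, and normalisation below are illustrative assumptions, not GPCT's actual implementation.

```python
import numpy as np

def cell_importance(attn, layer=-1):
    """Average per-cell attention across heads for one decoder layer.

    attn: array of shape (layers, heads, n_cells), where each head row
    holds that head's attention weight over cells. Averaging over heads
    (as described for visualisation) gives one importance score per cell.
    """
    per_cell = attn[layer].mean(axis=0)   # average over heads
    return per_cell / per_cell.sum()      # renormalise to sum to 1

rng = np.random.default_rng(3)
attn = rng.random((4, 8, 500))            # toy: 4 layers, 8 heads, 500 cells
imp = cell_importance(attn)

# Cluster-level importance (the comment's suggestion): sum per-cell
# weights within hypothetical cluster assignments.
clusters = rng.integers(0, 10, size=500)
cluster_imp = np.bincount(clusters, weights=imp, minlength=10)
```

Aggregating to clusters this way is a straightforward post hoc step, so per-cell attention and cluster-level importance need not be competing designs.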
-
under the same 7-fold cross validation setting as training mode B.
Why make k=7?
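One arithmetic observation on the k=7 question: with 72 samples across 5 KO classes of 10-20 samples each, 7 folds leave roughly 10 test samples per fold while most classes still appear in every training split. The per-class counts below are made up for illustration; the paper's actual split is not shown in the excerpt.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Illustrative 5-class labels: 72 samples with made-up per-class counts.
y = np.repeat(np.arange(5), [20, 16, 14, 12, 10])

# Stratified 7-fold split, so class proportions are preserved per fold.
skf = StratifiedKFold(n_splits=7, shuffle=True, random_state=0)
fold_sizes = [len(test_idx) for _, test_idx in skf.split(np.zeros(len(y)), y)]
```

Whether 7 was chosen for this reason or simply for consistency with training mode B is not stated, so the reviewer's question stands.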