Towards a Cytometry Foundation Model: Interpretable Sample-level Predictive Modelling via Pretrained Transformers
This article has been Reviewed by the following groups
Listed in
- Evaluated articles (Arcadia Science)
Abstract
Foundation models have transformed scientific data modelling across domains, yet flow cytometry has lacked one. Despite the abundance of high-dimensional cellular data, automated analysis remains bottlenecked by marker variability: prior studies are typically confined to fixed marker panels and homogeneous data, limiting scalability and generalisation due to architectural constraints. We present the Generalised Pretrained Cytometry Transformer (GPCT), an interpretable framework designed to learn from heterogeneous marker panels for sample-level predictive modelling. Through a novel cytometry-specific pretraining regime, GPCT learns transferable cellular representations that achieve high classification accuracy across diverse datasets. Notably, pretraining significantly boosts performance on data-scarce downstream tasks, marking a pivotal step towards a cytometry foundation model. Furthermore, GPCT maintains interpretability and identifies the specific cell subsets most influential to its predictions. This enables direct biological validation of learned patterns and provides a data-driven basis for refining traditional gating strategies.
Article activity feed
-
This dual-masking formulation drives the model to learn robust representations by predicting masked values from complementary perspectives.
In Dataset 1 they have 16 panels, and the leave-one-panel-out drop is <8%. That's the better evidence for robustness.
Then you claim pretraining drives "robust representations despite marker inconsistency" based on the KO task, where Dataset 2 has a completely consistent panel.
Those two claims aren't using the same evidence base and shouldn't be merged into one conclusion.
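To make the dual-masking idea concrete, here is a minimal sketch of what such a scheme could look like. The excerpt does not specify GPCT's actual masking details, so the two "perspectives" below (masking whole cells vs. masking individual marker values), the masking fractions, and the placeholder value are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def dual_mask(events, cell_frac=0.15, marker_frac=0.15):
    """Apply two complementary masks to a cells x markers matrix.

    Hypothetical sketch of a dual-masking objective: one mask hides
    entire cells (rows), the other hides individual marker values
    (entries), so a model must reconstruct values from both views.
    """
    n_cells, n_markers = events.shape
    cell_mask = rng.random(n_cells) < cell_frac                   # whole-row mask
    value_mask = rng.random((n_cells, n_markers)) < marker_frac   # per-entry mask
    combined = value_mask | cell_mask[:, None]
    masked = events.copy()
    masked[combined] = 0.0   # placeholder value for masked entries
    return masked, combined

events = rng.normal(size=(1000, 8))   # toy sample: 1000 cells, 8 markers
masked, mask = dual_mask(events)
```

A reconstruction loss would then be computed only on the entries where `mask` is true.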
-
Notably, large-scale pretraining yielded considerable performance gains in small-data settings, attributable to robust cellular representations that recover biological signals despite marker inconsistency.
If you're concluding this based on per-class AUC while the markers are inconsistent, isn't this claim sketchy at best?
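For reference, per-class AUC in a multi-class setting is usually computed one-vs-rest, per class. The sketch below uses entirely synthetic labels and scores (the class counts only loosely mirror the 72-sample KO setting) to show the metric itself, not any result from the paper.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Synthetic 5-class labels: 72 samples with made-up per-class counts.
y_true = np.repeat(np.arange(5), [15, 15, 14, 14, 14])
rng = np.random.default_rng(1)
scores = rng.random((72, 5))
scores /= scores.sum(axis=1, keepdims=True)   # pseudo-probabilities

# One-vs-rest AUC per class, computed by binarising the labels.
per_class_auc = np.array([
    roc_auc_score((y_true == k).astype(int), scores[:, k]) for k in range(5)
])
```

With random scores these AUCs hover around 0.5; the point is only the shape of the evaluation, which says nothing by itself about whether the representation is robust to marker inconsistency.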
-
These results demonstrate the potential for a cytometry foundation model via large-scale GPCT pretraining.
Oh just claiming potential. ok nvm!
-
Dataset 1: Longitudinal mouse immunophenotype dataset

As part of a long-running mutagenesis project to investigate novel genetic causes of immune dysfunction [18], flow cytometry phenotypes for over forty thousand C57BL/6 mice were obtained at the Australian Phenomics Facility between 1995 and 2015. The data comprise predominantly eight-colour experiments with varying marker/antibody/fluorophore combinations, yet most samples include a backbone of six common markers (IgM, IgD, B220, CD44, CD4, CD3) (Supplementary Tables 10 and 8).

In the present analysis, we have chosen a subset of 14,014 flow cytometry samples (6,978 female, 7,036 male) with a consistent gender metadata label and mostly pan-leukocyte marker panels. Sexual dimorphism rarely produces landmark cell populations readily detectable by manual analysis of flow cytometry data. However, this has proven a tractable problem with the application of neural networks [19], with discriminative signals usually subtle and dispersed across multiple cell populations.

Dataset 2: Knockout Mouse Project immunophenotype dataset

The Knockout Mouse Project (KOMP) [20] generated mouse strains harbouring gene knockouts for the majority of genes in the mouse genome, accompanied by phenotype data including flow cytometry information for a subset of mutant mouse lines. For our purposes, we focus on a subset of samples subjected to a flow cytometry assay of a T cell immunophenotyping panel [21] (Supplementary Table 3). Despite containing nearly 7000 samples, this dataset poses a classic lack-of-data problem, as each knockout (KO) is represented by only 10 to 20 samples. As most knockouts in this dataset were found to lack discernible cellular phenotypes [21], we selected just 5 knockout lines with clear mutant phenotypes characterised by the original study. This yields 72 samples (Supplementary Table 9) for a 5-class KO classification task.
So you have 14k samples for Dataset 1 with only a slight male/female imbalance, but just 72 samples for Dataset 2 because only 5 knockout lines were selected? Also, "most knockouts in this dataset were found to lack discernible cellular phenotypes": is that not concerning if you want to claim general ability, something that can be built on for flow cytometry?
Pre-training distribution has a significant impact on downstream utility.
-
We evaluated the impact of cross-dataset pretraining on model generalisation using two configurations. The first model, the D1 encoder (Experiments A and B), was trained exclusively on Dataset 1. The second, the generic encoder (Experiment C), was pretrained on combined training data from Datasets 1 and 2 before downstream training on Dataset 1 only. Results in Fig. 2b (1) demonstrate that including even a small fraction of Dataset 2 in the pretraining phase significantly improved downstream generalisation to Dataset 2 testing samples.
What exactly do the pre-training distributions look like? Whats the exact mix? Is dataset 1 sufficiently different from dataset 2, specifically as it relates to sample quality and number of samples?
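Since the excerpt never states the exact mix, here is a hypothetical sketch of what "including even a small fraction of Dataset 2" could mean operationally. The `d2_fraction` knob, the sample stand-ins, and the pool construction are all illustrative assumptions, not the paper's procedure.

```python
import numpy as np

rng = np.random.default_rng(2)

def mix_pretraining_pool(d1, d2, d2_fraction=0.1):
    """Combine all of Dataset 1 with a sampled fraction of Dataset 2.

    Hypothetical sketch: `d2_fraction` controls how much of Dataset 2
    is folded into the generic encoder's pretraining pool.
    """
    n2 = int(round(d2_fraction * len(d2)))
    idx = rng.choice(len(d2), size=n2, replace=False)
    pool = list(d1) + [d2[i] for i in idx]
    rng.shuffle(pool)
    return pool

d1 = [("D1", i) for i in range(100)]   # stand-ins for Dataset 1 sample IDs
d2 = [("D2", i) for i in range(70)]    # stand-ins for Dataset 2 sample IDs
pool = mix_pretraining_pool(d1, d2, d2_fraction=0.1)
```

Even a sketch like this makes the reviewer's question concrete: the downstream effect plausibly depends on both the fraction and on how different the two datasets are in quality and size, neither of which is reported in the excerpt.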
-
In this regard, GPCT can be interpreted through the attention mechanism used by the decoder: during inference, each attention head in the multi-head attention layer assigns a weight to every cell, representing its relative contribution to the decision-making process. These weights serve as a quantitative measure of per-cell “importance”, and while they are typically averaged across heads per layer for visualisation, each layer may capture distinct patterns that reflect the model’s internal processing steps.
Interesting concept to make them cell level. Why not clusters of cells?
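The head-averaged per-cell importance described in the excerpt, and the cluster-level view the comment asks about, can both be sketched in a few lines. The attention tensor layout, cluster assignments, and normalisation below are illustrative assumptions, not GPCT's actual implementation.

```python
import numpy as np

def cell_importance(attn, layer=-1):
    """Average per-cell attention across heads for one decoder layer.

    attn: array of shape (layers, heads, n_cells), where each head row
    holds that head's attention weight over cells. Averaging over heads
    (as described for visualisation) gives one importance score per cell.
    """
    per_cell = attn[layer].mean(axis=0)   # average over heads
    return per_cell / per_cell.sum()      # renormalise to sum to 1

rng = np.random.default_rng(3)
attn = rng.random((4, 8, 500))            # toy: 4 layers, 8 heads, 500 cells
imp = cell_importance(attn)

# Cluster-level importance (the comment's suggestion): sum per-cell
# weights within hypothetical cluster assignments.
clusters = rng.integers(0, 10, size=500)
cluster_imp = np.bincount(clusters, weights=imp, minlength=10)
```

Aggregating to clusters this way is a straightforward post hoc step, so per-cell attention and cluster-level importance need not be competing designs.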
-
under the same 7-fold cross validation setting as training mode B.
Why make k=7?
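One arithmetic observation on the k=7 question: with 72 samples across 5 KO classes of 10-20 samples each, 7 folds leave roughly 10 test samples per fold while most classes still appear in every training split. The per-class counts below are made up for illustration; the paper's actual split is not shown in the excerpt.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Illustrative 5-class labels: 72 samples with made-up per-class counts.
y = np.repeat(np.arange(5), [20, 16, 14, 12, 10])

# Stratified 7-fold split, so class proportions are preserved per fold.
skf = StratifiedKFold(n_splits=7, shuffle=True, random_state=0)
fold_sizes = [len(test_idx) for _, test_idx in skf.split(np.zeros(len(y)), y)]
```

Whether 7 was chosen for this reason or simply for consistency with training mode B is not stated, so the reviewer's question stands.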