Weakly supervised learning uncovers phenotypic signatures in single-cell data
This article has been reviewed by the following groups
Listed in
- Evaluated articles (Arcadia Science)
Abstract
To deliver clinically relevant insights from large patient cohorts profiled with single-cell technologies, a key challenge is to relate sample-level and single-cell measurements. We present MultiMIL, a deep learning framework that applies attention-based multiple-instance learning for phenotype prediction and cell state identification. We applied MultiMIL to peripheral blood mononuclear cells from COVID-19 patients, the Human Lung Cell Atlas, and a spatial proteomics breast cancer dataset, demonstrating how our model can be utilized to find phenotype-associated cell states, learn phenotype-informed sample representations, and expand disease signatures.
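As a rough illustration of the attention-based multiple-instance learning idea mentioned in the abstract, the sketch below pools per-cell embeddings into a single sample-level embedding using softmax attention weights. It follows the generic attention-pooling formulation (a score per cell from a small tanh layer, normalized with softmax) and is not taken from the MultiMIL code. The function name `attention_pool`, the parameters `V` and `w`, and the random toy data are all hypothetical, and in a real model `V` and `w` would be learned rather than sampled.

```python
import math
import random


def attention_pool(cells, V, w):
    """Pool cell embeddings into one sample embedding with attention weights.

    cells: list of n cell embeddings, each a list of d floats
    V:     hidden x d weight matrix (hypothetical parameter, random here)
    w:     weight vector of length hidden (hypothetical parameter, random here)
    Returns (attention, pooled): one weight per cell and the weighted mean.
    """
    scores = []
    for h in cells:
        # Unnormalized score for this cell: w^T tanh(V h)
        hidden = [math.tanh(sum(v * x for v, x in zip(row, h))) for row in V]
        scores.append(sum(wi * ui for wi, ui in zip(w, hidden)))

    # Numerically stable softmax over the cells of one sample (the "bag")
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    attention = [e / total for e in exps]

    # Sample-level embedding: attention-weighted sum of cell embeddings
    dim = len(cells[0])
    pooled = [sum(a * h[j] for a, h in zip(attention, cells)) for j in range(dim)]
    return attention, pooled


# Toy data: 5 cells with 4-dimensional embeddings (random, for illustration only)
random.seed(0)
d, hidden_dim, n_cells = 4, 3, 5
cells = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n_cells)]
V = [[random.gauss(0, 1) for _ in range(d)] for _ in range(hidden_dim)]
w = [random.gauss(0, 1) for _ in range(hidden_dim)]

attention, sample_embedding = attention_pool(cells, V, w)
```

In this kind of setup, the per-cell attention weights are what would be inspected to find phenotype-associated cells, while the pooled vector would feed a sample-level classifier.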
Article activity feed
-
We also investigated how consistent the cell attention scores are across cross-validation splits, com-
The paper notes that MultiMIL relies on batch-corrected embeddings to handle technical confounders, and that explicitly adding covariates such as age and sex didn't improve performance. But beyond technical batch effects, patients often have heterogeneous biological states, such as unique inflammatory signatures, co-morbidities, or disease subtypes, that are real biology but not common to the disease state. These could still be predictive in smaller cohorts without reflecting shared disease mechanisms. Does the attention mechanism have any inherent safeguard against overfitting to patient-specific biological features? In the stability analysis comparing embeddings (scVI versus scGPT), were there cases where the model consistently attended to features that were biologically real but unique to specific patients rather than to the shared phenotype?
-