Unlocking biological insight from single-cell data with an interpretable dual-stream foundation model
Abstract
Deep learning foundation models are revolutionizing single-cell biology, yet learning holistic and discriminative representations from complex, high-dimensional data remains a central challenge. Although Transformer-based single-cell language models have shown significant progress, they typically rely on a single input-encoding scheme, which loses critical gene expression information and hinders the effective learning of global cellular representations. To address these challenges, we introduce scDMC, a dual-stream contrastive pre-training framework designed to synergistically optimize information fidelity at both the gene and cell levels. Pre-trained on only 2 million cells, far fewer than the datasets used by mainstream models, scDMC achieves state-of-the-art performance on multiple benchmark tasks, including cell annotation, clustering, and data integration. More importantly, we demonstrate that scDMC can uncover functional gene modules and infer cell-type-specific regulatory networks in a data-driven manner, and that it exhibits a high degree of biological interpretability. This work demonstrates an efficient pre-training approach that paves the way for the next generation of powerful and interpretable single-cell foundation models, promising to accelerate the pace of biological discovery.