Decoding sequence determinants of gene expression in diverse cellular and disease states
Abstract
Sequence-to-function models that predict gene expression from genomic DNA sequence have proven valuable for many biological tasks, including understanding cis-regulatory syntax and interpreting non-coding genetic variants. However, current state-of-the-art models have been trained largely on bulk expression profiles from healthy tissues or cell lines, and have not learned the properties of precise cell types and states that are captured in large-scale single-cell transcriptomic datasets. Thus, they lack the ability to perform these tasks at the resolution of specific cell types or states across diverse tissue and disease contexts. To address this gap, we present Decima, a model that predicts the cell type- and condition-specific expression of a gene from its surrounding DNA sequence. Decima is trained on single-cell or single-nucleus RNA sequencing data from over 22 million cells, and successfully predicts the cell type-specific expression of unseen genes based on their sequence alone. Here, we demonstrate Decima's ability to reveal the cis-regulatory mechanisms driving cell type-specific gene expression and its changes in disease, to predict non-coding variant effects at cell type resolution, and to design regulatory DNA elements with precisely tuned, context-specific functions.