Decoding the gene regulatory landscape through multimodal learning of protein-DNA interactions
Abstract
The identity of a cell is governed by regulatory proteins binding to the genome to control gene expression. Mapping these genome-wide binding events across thousands of proteins and cell types is essential for understanding development and disease at scale, yet has remained a major experimental and computational barrier. Here we present Chromnitron, a multimodal foundation model that learns the rules of protein-DNA binding from protein sequence, DNA sequence, and context-specific chromatin states. Unlike prior single-task and multi-task learning approaches, Chromnitron implements a multimodal learning framework that accurately predicts the binding landscape for proteins and cell types not seen during training. Using Chromnitron, we discovered and experimentally validated new protein regulators of T cell exhaustion. Chromnitron also uncovered previously uncharacterized dynamic shifts in the binding landscape of regulatory proteins during neurogenesis. This marks a critical step toward a predictive model of interpretable gene regulatory programs across cell types, enabling rapid discovery of regulatory circuits and identification of new therapeutic targets.