Billion-Scale Deciphering of Human Gene Regulatory Grammar

Abstract

Predicting how DNA sequence specifies gene expression remains a core challenge across regulatory genomics. Most predictive assays and models depend on native genomic DNA, constraining the full biochemical engineering space for assessing and designing new sequences. Here, we address this gap with a scalable experimental–computational platform that rapidly generates million-scale sequence-to-expression datasets that directly link degenerate sequences to their function in human cells.

We built degenerate libraries of 200-bp promoter cassettes and performed pooled stable integration of up to 10¹² unique constructs, enabling the curation of million-scale sequence-to-expression datasets by fluorescently sorting billions of human cells. Biophysical modeling of transcription-factor occupancy on the data using position weight matrices reveals a broad spectrum of correlations between factor abundance and expression levels, with some co-abundances reaching Pearson’s r ≈ 0.99, consistent with cooperative and probabilistic regulation. Leveraging the dataset, we trained sequence-to-expression deep learning models that predict held-out expression with Pearson’s r ≈ 0.4, converge on shared sequence determinants, and agree strongly with each other (Pearson’s r = 0.93), indicating reproducible sequence-expression relationships. Finally, with minimal retraining the models generalize to an independently generated dataset collected under distinct sorting conditions, transferring sequence rules across contexts. Our platform enables repeated, rapid studies and supports deeper mechanistic insight while providing baseline models for forward design of human regulatory elements, advancing prediction beyond genomic-DNA-anchored methods.
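
For readers unfamiliar with position-weight-matrix occupancy modeling, the following is a minimal illustrative sketch of the general idea, not the authors' pipeline. It scores each window of a sequence against a PWM, converts the log-odds score to a binding probability with a logistic function, sums those probabilities into an expected occupancy, and computes Pearson's r between two lists of values. The names `TOY_PWM`, `occupancy`, and `pearson_r`, the uniform background, the parameter `mu`, and the forward-strand-only scan are all assumptions made for illustration.

```python
import math

# Hypothetical 4-position PWM (per-base probabilities at each position).
# Illustrative only; this is not a motif taken from the study.
TOY_PWM = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
]
BACKGROUND = 0.25  # assumed uniform base composition


def log_odds(pwm, window):
    """Log-odds score of one window against the PWM."""
    return sum(math.log(col[base] / BACKGROUND) for col, base in zip(pwm, window))


def occupancy(pwm, seq, mu=0.0):
    """Expected number of bound sites on the forward strand.

    Each window contributes a logistic binding probability; `mu` is an
    assumed offset standing in for factor concentration.
    """
    k = len(pwm)
    total = 0.0
    for i in range(len(seq) - k + 1):
        score = log_odds(pwm, seq[i:i + k])
        total += 1.0 / (1.0 + math.exp(mu - score))
    return total


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)
```

In such a setup, one would compute `occupancy` for each factor across many sequences and pass the resulting list, together with measured expression values, to `pearson_r`. A fuller model would also scan the reverse complement and use curated motifs.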
