Learning sequence-function relationships with scalable, interpretable Gaussian processes

Abstract

Understanding the relationship between biological sequences, such as DNA, RNA or protein sequences, and their resulting phenotypes is one of the central goals of genetics. This task is complicated by epistasis, i.e., the context dependence of mutational effects. Advances in high-throughput phenotyping now make it possible to study these relationships at unprecedented scale, generating large datasets that measure phenotypes for tens or hundreds of thousands of sequences. However, standard regression models for analyzing such datasets often make unrealistic assumptions about the generalizability of mutational effects and epistatic coefficients across genetic backgrounds. Deep neural networks offer greater flexibility but suffer from limited interpretability and lack uncertainty quantification. Here, we introduce a family of interpretable Gaussian process models for sequence-function relationships that capture epistasis through flexible prior distributions that generalize classical theoretical models from the fitness landscape literature. In particular, these priors are parameterized by interpretable site-, allele-, and mutation-specific factors controlling the degree to which specific mutations decrease the predictability of the effects of other mutations. Using GPU acceleration to scale to large protein, RNA, and genome-wide SNP datasets, our models consistently deliver superior predictive performance while yielding interpretable parameters that both recover known features and uncover novel epistatic interactions. Overall, our methods provide new insights into the structure of the genotype-phenotype map and offer scalable, interpretable approaches for exploring complex genetic interactions across diverse biological systems.
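To make the abstract's description of site-specific epistatic factors concrete, below is a minimal, illustrative sketch (not the authors' actual model or code) of Gaussian process regression on integer-encoded sequences. It assumes a simple product kernel in which a hypothetical per-site factor theta[s] in (0, 1] discounts the covariance between two sequences whenever they differ at site s, so small theta[s] plays the role of a mutation reducing the predictability of other mutations' effects.

```python
# Illustrative sketch only: a GP over sequence-function maps with one
# interpretable factor per site. This is an assumed kernel form, not the
# paper's exact prior.
import numpy as np

def site_factor_kernel(X1, X2, theta):
    """Covariance between integer-encoded sequences.

    X1: (n, L) array, X2: (m, L) array, theta: (L,) per-site factors in (0, 1].
    k(x, x') = prod_s [ theta[s] if x_s != x'_s else 1 ].
    """
    # mismatches[i, j, s] is True where sequence i and sequence j differ at site s
    mismatches = (X1[:, None, :] != X2[None, :, :]).astype(float)
    # sum log-factors over mismatched sites, then exponentiate, for numerical stability
    return np.exp(mismatches @ np.log(theta))  # (n, m) covariance matrix

def gp_posterior_mean(X_train, y_train, X_test, theta, noise_var=0.1):
    """Standard GP regression posterior mean under the kernel above."""
    K = site_factor_kernel(X_train, X_train, theta)
    K_star = site_factor_kernel(X_test, X_train, theta)
    alpha = np.linalg.solve(K + noise_var * np.eye(len(y_train)), y_train)
    return K_star @ alpha

# Toy usage: four length-3 sequences with binary alleles and hypothetical factors
rng = np.random.default_rng(0)
X_train = rng.integers(0, 2, size=(4, 3))
y_train = rng.normal(size=4)
theta = np.array([0.9, 0.5, 0.7])   # site 2 assumed most epistatic (smallest factor)
print(gp_posterior_mean(X_train, y_train, X_train[:2], theta))
```

In this toy parameterization, theta[s] close to 1 means mutations at site s barely change what is known about the rest of the landscape, while theta[s] near 0 means they largely reset it; the paper's models additionally allow allele- and mutation-specific factors and use GPU acceleration to scale such kernels to large datasets.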
