Rapid discovery of new-to-nature protein domains by novelty-first forcing of language models

Abstract

Estimates of the existence and extent of physically permissible protein structures beyond those found in nature vary widely. As predicted-structure databases swell thanks to abundant sequence data, and generative protein design models concurrently grow in their power to propose new aspects of protein structure, these questions, along with the question of which essential features (e.g. stability, function, robustness) distinguish natural domains from novel ones, have been cast in even sharper relief. We demonstrate that protein language models (PLMs) can simultaneously innovate in sequence and structure to suggest new-to-nature protein domains displaying supersecondary and tertiary elements outside of categorized CATH superfamilies. By developing and applying two orthogonal processes for obtaining compact and globular folds from PLMs without bias from other physicochemical or functional constraints, we discover putative novel domains that emerge in parallel with known natural ones at rates far exceeding those obtainable by bioinformatic mining of structure databases. Computational characterization of these domain candidates indicates that many exhibit reasonable folding thermodynamics and kinetics, suggesting that natural protein structure-space is far from biophysically complete. These results point away from stability as the definitive selective force behind the observed landscape of real protein folds, and imply that many unrealized folds may be equally consistent with the structural rules of protein-based life.
