Toward De Novo Protein Design from Natural Language

Abstract

Programming biological function—designing bespoke proteins that execute specific tasks on demand—is a foundational goal of molecular engineering. Yet, current protein design paradigms remain fundamentally limited, typically requiring either an existing protein to evolve from, or deep, family-specific expertise to guide the design process. Here we introduce Pinal, a generative model that overcomes this barrier by translating functional descriptions in natural language directly into diverse and active proteins. This capability is built upon a 16-billion-parameter foundation model trained on an unprecedented synthetic corpus of 1.7 billion protein-text pairs, enabling it to ground functional language in the biophysical principles of protein structure. To provide definitive experimental validation, we tasked Pinal with designing four proteins from distinct functional classes: a fluorescent protein, a polyethylene terephthalate hydrolase, an alcohol dehydrogenase, and a metabolic H-protein. Remarkably, all four designs were functionally active and the two Pinal-designed enzymes achieved catalytic turnover for their respective reactions. Notably, the Pinal-designed H-protein even surpassed its natural counterpart, exhibiting 1.7-fold higher performance. Our results establish that natural language can serve as a programmable instruction set for biology, democratizing protein design and shifting the paradigm from the incremental modification of existing molecules to the direct creation of function from a conceptual description.

Article activity feed

  1. Multiple sequence alignment further demonstrated that the key catalytic site, cofactor binding site, and metal ion binding site were highly conserved in the long-chain Fe2+-containing ADH sequences, suggesting that Pinal is capable of designing enzyme sequences that retain critical catalytic activity sites based solely on natural language input (Figure 7B).

    It would be interesting to design a sequence predicted to be inactive (e.g., by mutating a key catalytic residue or the metal ion binding site) and then confirm its inactivity experimentally, to demonstrate that the model can distinguish functional from non-functional sequences. Similarly, it would be interesting to compare the ProTrek score (or any of the other ranks/scores) against measured enzymatic activity to see whether the two correlate.

  2. These sequences ranked highest across all evaluation metrics.

    The ADH validation confirming enzyme-dependent activity is a promising proof of concept. It will be interesting to see how Pinal performs on proteins with more complex functions. I also think it would be interesting to test a predicted inactive mutant as an additional control.
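
The score-versus-activity comparison suggested in the first comment could be run as a rank correlation, which only asks whether higher-scored designs tend to be more active and does not assume a linear relationship. The following is a minimal sketch using only the Python standard library. The function names and the input values are hypothetical placeholders for illustration; they are not data or code from the preprint, and real inputs would be the per-design scores and the measured activities.

```python
from math import sqrt
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j over the run of values tied with the value at position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one input is constant")
    return cov / sqrt(var_x * var_y)


# Placeholder values only (NOT from the preprint): one score and one measured
# activity per hypothetical design, in the same order.
placeholder_scores = [0.1, 0.2, 0.3, 0.4]
placeholder_activity = [1.0, 2.0, 3.0, 4.0]

# A perfectly monotone relationship gives a coefficient of 1.0.
print(spearman(placeholder_scores, placeholder_activity))
```

With only a handful of tested designs, the coefficient would be a rough indication at best, so the inactive-mutant control proposed in both comments remains the more direct test.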