Decoding the Molecular Language of Proteins with Evolla

This article has been Reviewed by the following groups

Read the full article See related articles

Discuss this preprint

Start a discussion What are Sciety discussions?

Abstract

Proteins, nature's intricate molecular machines, are the products of billions of years of evolution and play fundamental roles in sustaining life. Yet, deciphering their molecular language - understanding how sequences and structures encode biological functions - remains a cornerstone challenge. Here, we introduce Evolla, an interactive protein-language model designed to transcend static classification by interpreting protein function through natural language queries. Trained on 546 million protein-text pairs and refined via Direct Preference Optimization, Evolla couples high-dimensional molecular representations with generative semantic decoding. Benchmarking establishes Evolla's superiority over general large language models in functional inference, demonstrates zero-shot performance parity with the state-of-the-art supervised model, and exposes remote functional relationships invisible to conventional alignment. We validate Evolla through two distinct applications: identifying candidate eukaryotic signature proteins in Asgard archaea, with functional Vps4 homologs validated via yeast complementation; and interactively discovering a novel deep-sea polyethylene terephthalate (PET) hydrolase, \textit{Ps}PETase, confirmed to degrade plastic films. These results position Evolla not merely as a predictor, but as a generative engine capable of complex hypothesis formulation, shifting the paradigm from static annotation to interactive, actionable discovery. The Evolla online service is available at \url{http://www.chat-protein.com/}.

Article activity feed