Generating All-Atom Protein Structure from Sequence-Only Training Data


Abstract

Generative models for protein design are gaining interest for their potential scientific impact. However, protein function is mediated by many modalities, and simultaneously generating multiple modalities remains a challenge. We propose PLAID (Protein Latent Induced Diffusion), a method for multimodal protein generation that learns and samples from the latent space of a predictor, mapping from a more abundant data modality (e.g., sequence) to a less abundant one (e.g., crystallography structure). Specifically, we address the all-atom structure generation setting, which requires producing both the 3D structure and 1D sequence to define side-chain atom placements. Importantly, PLAID only requires sequence inputs to obtain latent representations during training, enabling the use of sequence databases for generative model training and augmenting the data distribution by 2 to 4 orders of magnitude compared to experimental structure databases. Sequence-only training also allows access to more annotations for conditioning generation. As a demonstration, we use compositional conditioning on 2,219 functions from Gene Ontology and 3,617 organisms across the tree of life. Despite not using structure inputs during training, generated samples exhibit strong structural quality and consistency. Function-conditioned generations learn side-chain residue identities and atomic positions at active sites, as well as hydrophobicity patterns of transmembrane proteins, while maintaining overall sequence diversity. Model weights and code are publicly available at github.com/amyxlu/plaid.
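The core idea of training a diffusion model in a predictor's latent space can be sketched as follows. This is a minimal, hypothetical illustration (not the PLAID implementation): `encode_sequence` stands in for the frozen predictor encoder that maps a sequence to a latent vector, and the denoiser is a placeholder. Only the standard DDPM-style forward noising and noise-prediction loss are shown.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_sequence(seq, d=8):
    # Stand-in for a frozen sequence-to-structure predictor's encoder
    # (hypothetical): maps a sequence to a d-dimensional latent vector.
    vec = np.zeros(d)
    for i, aa in enumerate(seq):
        vec[i % d] += ord(aa)
    return vec / max(len(seq), 1)

# Standard linear noise schedule over T diffusion steps.
T = 100
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

def noise_latent(z0, t):
    # Forward diffusion q(z_t | z_0): scale the clean latent and add noise.
    eps = rng.standard_normal(z0.shape)
    zt = np.sqrt(alpha_bar[t]) * z0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return zt, eps

# Training step: the denoiser sees only latents derived from sequences,
# so no structure inputs are needed during training.
z0 = encode_sequence("MKTAYIAKQR")       # latent from a sequence alone
t = 50
zt, eps = noise_latent(z0, t)
eps_hat = np.zeros_like(eps)             # placeholder denoiser prediction
loss = np.mean((eps_hat - eps) ** 2)     # noise-prediction (epsilon) loss
print(loss >= 0.0)
```

At sampling time, the learned denoiser would be run in reverse from pure noise, and the predictor's decoder would map the generated latent back to an all-atom structure and sequence.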