Evolutionary-scale prediction of atomic-level protein structure with a language model


Abstract

Recent advances in machine learning have leveraged evolutionary information in multiple sequence alignments to predict protein structure. We demonstrate direct inference of full atomic-level protein structure from primary sequence using a large language model. As language models of protein sequences are scaled up to 15 billion parameters, an atomic-resolution picture of protein structure emerges in the learned representations. This results in an order-of-magnitude acceleration of high-resolution structure prediction, which enables large-scale structural characterization of metagenomic proteins. We apply this capability to construct the ESM Metagenomic Atlas by predicting structures for >617 million metagenomic protein sequences, including >225 million that are predicted with high confidence, which gives a view into the vast breadth and diversity of natural proteins.
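
For readers who want to see what "direct inference of full atomic-level protein structure from primary sequence" looks like in practice, here is a minimal sketch using the publicly released fair-esm package and its documented ESMFold entry points (`esm.pretrained.esmfold_v1`, `infer_pdb`). This is an illustration of the capability the abstract describes, not code from the article; it assumes the package is installed (e.g., `pip install "fair-esm[esmfold]"`) and a CUDA GPU is available, and the sequence is an arbitrary placeholder.

```python
import torch
import esm

# Load the ESMFold model from the fair-esm package (downloads weights on first use).
model = esm.pretrained.esmfold_v1()
model = model.eval().cuda()

# Placeholder amino-acid sequence; substitute any protein of interest.
sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"

# Single-sequence structure prediction: no MSA or template search is performed.
with torch.no_grad():
    pdb_string = model.infer_pdb(sequence)

# Write the predicted atomic coordinates as a PDB file.
with open("result.pdb", "w") as f:
    f.write(pdb_string)
```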

Article activity feed

  1. Fig. S7 shows results at different MSA depth thresholds. After filtering, there are 104 sequences with MSA depth ≤ 100, 70 with MSA depth ≤ 10, and 22 with MSA depth = 1. Beyond the constraint that no template has TM-score > 0.5, no filtering on the number of templates is performed (a sketch of this filtering logic follows the thread below).

    It would be interesting to know whether the proteins whose structures still cannot be predicted have anything in common. For example, are they more likely to come from certain environments or environmental conditions (e.g., low-temperature, high-temperature, or high-salt samples)? Also, is it possible to take any of the environmental conditions into account in the structure prediction itself? For example, if samples came from a hydrothermal vent at 90 °C, would that be useful in any of the predictions?
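
The filtering described in the response above (exclude any target with a template at TM-score > 0.5, then bucket the survivors by MSA depth) can be summarized in code. The sketch below is illustrative only: the `Target` record, its field names, and the helper functions are hypothetical stand-ins, not the authors' actual pipeline.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Hypothetical record for one evaluation target; field names are assumptions.
@dataclass
class Target:
    name: str
    msa_depth: int                      # number of sequences retrieved for the MSA
    template_tm_scores: List[float] = field(default_factory=list)

def passes_template_filter(t: Target, tm_cutoff: float = 0.5) -> bool:
    # Keep a target only if no template exceeds the TM-score cutoff; the
    # number of sub-cutoff templates is deliberately left unrestricted.
    return all(tm <= tm_cutoff for tm in t.template_tm_scores)

def bucket_by_msa_depth(
    targets: List[Target], thresholds: Tuple[int, ...] = (100, 10, 1)
) -> Dict[str, int]:
    # Count filtered targets at each MSA-depth threshold, mirroring the
    # groupings reported for Fig. S7 (depth <= 100, <= 10, and exactly 1).
    kept = [t for t in targets if passes_template_filter(t)]
    counts: Dict[str, int] = {}
    for th in thresholds:
        if th == 1:
            counts["depth = 1"] = sum(t.msa_depth == 1 for t in kept)
        else:
            counts[f"depth <= {th}"] = sum(t.msa_depth <= th for t in kept)
    return counts
```

Note that the buckets are nested rather than disjoint (every depth-1 target also counts toward the ≤ 10 and ≤ 100 bins), matching how the counts in the response above are stated.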