Metagenomic-scale analysis of the predicted protein structure universe
Abstract
Protein structure prediction breakthroughs, notably AlphaFold2 and ESMFold, have led to an unprecedented influx of computationally derived structures. The AlphaFold Protein Structure Database now provides over 200 million models, while the ESM Metagenomic Atlas includes more than 600 million predictions from uncultured microbes. Here, we combine these two resources into AFESM, an 821-million-entry dataset, and cluster its entries using a two-step pipeline based on sequence and structure similarity, yielding 5.12 million non-singleton structural clusters. We identify common ancestors and biomes for these clusters to explore their environmental diversity and specificity, and we investigate their domain composition for structural novelties. Initial ESMFold-based predictions revealed no novel domain folds, and re-predicting 2.3 million proteins with AlphaFold2 yielded only one new fold, suggesting near-saturation of the domain space and limitations of the predictors. Nevertheless, we discovered many previously unseen domain combinations, highlighting how ESMatlas expands coverage of the known protein fold space. In particular, we find 11,941 multi-domain architectures not observed before, underscoring the importance of metagenomic data for illuminating underexplored regions of the protein structural universe.
Availability
An interactive webserver and data are available at afesm.foldseek.com.
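Illustrative sketch
The two-step clustering idea mentioned in the abstract (group by sequence similarity first, then merge the resulting representatives by structure similarity) can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not the authors' pipeline: `difflib.SequenceMatcher` stands in for a real sequence-identity measure, the structure scores are a hypothetical hand-written table, and the protein IDs, sequences, and thresholds are invented for illustration.

```python
from difflib import SequenceMatcher


def greedy_cluster(items, similar):
    """Assign each item to the first existing representative it is similar to."""
    clusters = {}  # representative -> list of members (representative included)
    for item in items:
        for rep in clusters:
            if similar(rep, item):
                clusters[rep].append(item)
                break
        else:
            clusters[item] = [item]  # no match: item starts its own cluster
    return clusters


def two_step_cluster(seqs, struct_score, seq_min=0.8, struct_min=0.5):
    """Step 1: cluster by sequence. Step 2: cluster step-1 representatives by structure."""
    def seq_similar(a, b):
        # Stand-in for sequence identity (assumption, not a real aligner).
        return SequenceMatcher(None, seqs[a], seqs[b]).ratio() >= seq_min

    def struct_similar(a, b):
        return struct_score(a, b) >= struct_min

    by_sequence = greedy_cluster(list(seqs), seq_similar)
    by_structure = greedy_cluster(list(by_sequence), struct_similar)
    # Expand each structural cluster back to all of its sequence-level members.
    return {
        rep: [member for r in reps for member in by_sequence[r]]
        for rep, reps in by_structure.items()
    }


# Hypothetical toy data: IDs, sequences, and structure scores are invented.
seqs = {
    "p1": "MKTAYIAKQR",
    "p2": "MKTAYIAKQK",  # near-identical sequence to p1
    "p3": "GSHMLEDPVA",  # different sequence, but similar structure to p1
    "p4": "WWWWCCCCHH",  # unrelated to the others
}
pair_scores = {frozenset(("p1", "p3")): 0.7}  # assumed structure similarity in [0, 1]


def struct_score(a, b):
    return pair_scores.get(frozenset((a, b)), 0.0)


clusters = two_step_cluster(seqs, struct_score)
non_singleton = sum(1 for members in clusters.values() if len(members) > 1)
print(clusters, non_singleton)
```

In this toy run, p2 joins p1 in the sequence step and p3 joins p1 only in the structure step, which is the reason for running the second pass: it recovers relationships that sequence similarity alone misses. Clusters with a single member, such as p4 here, are the singletons excluded from the cluster count quoted in the abstract.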