Metagenomic-scale analysis of the predicted protein structure universe
Abstract
Breakthroughs in protein structure prediction, notably AlphaFold2 and ESMFold, have led to an unprecedented influx of computationally derived structures. The AlphaFold Protein Structure Database provides over 200 million predictions, while the ESM Metagenomic Atlas (ESMatlas) includes over 600 million predictions from uncultured microbes. We combine these into AFESM, an 820-million-entry dataset, and cluster the entries using a scalable pipeline based on sequence and structure similarity, yielding 5.12 million non-singleton structural clusters. We identify common ancestors and biomes for these clusters to explore their environmental diversity and specificity, and we investigate their structural novelties. From non-singleton clusters unique to ESMatlas, we identified 12 novel domain folds; repredicting a subset (~45%) of low-quality domains with ColabFold yielded 33 additional novel folds, which underscores the importance of prediction quality in structural novelty discovery. We also identified 11,941 previously unseen domain combinations, highlighting the untapped structural diversity of metagenomic data and its importance for illuminating underexplored regions of the protein structural universe. An interactive webserver and data are available at afesm.foldseek.com.
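The abstract mentions keeping non-singleton clusters and identifying a common ancestor for each. As a rough illustration only, the sketch below shows one way such a step could look, assuming each entry carries a root-to-leaf taxonomic lineage and taking the common ancestor to be the longest shared lineage prefix. The function names, the input layout, and the toy data are hypothetical and are not taken from the AFESM pipeline.

```python
from collections import defaultdict


def lowest_common_ancestor(lineages):
    """Return the longest prefix shared by all root-to-leaf lineages."""
    if not lineages:
        return []
    shared = []
    # zip(*lineages) walks the lineages rank by rank, stopping at the shortest one.
    for ranks in zip(*lineages):
        if all(rank == ranks[0] for rank in ranks):
            shared.append(ranks[0])
        else:
            break
    return shared


def cluster_lcas(assignments, lineage_of):
    """Map each non-singleton cluster to the common ancestor of its members.

    assignments: entry id -> cluster id (hypothetical layout)
    lineage_of:  entry id -> list of taxa from root to leaf (hypothetical layout)
    """
    members = defaultdict(list)
    for entry, cluster in assignments.items():
        members[cluster].append(entry)
    return {
        cluster: lowest_common_ancestor([lineage_of[e] for e in entries])
        for cluster, entries in members.items()
        if len(entries) > 1  # singleton clusters are dropped
    }


# Toy data, invented for illustration.
assignments = {"p1": "c1", "p2": "c1", "p3": "c2"}
lineage_of = {
    "p1": ["Bacteria", "Proteobacteria", "Gammaproteobacteria"],
    "p2": ["Bacteria", "Proteobacteria", "Alphaproteobacteria"],
    "p3": ["Archaea"],
}
lcas = cluster_lcas(assignments, lineage_of)
print(lcas)
```

In this toy case the two members of cluster c1 diverge at the class level, so their common ancestor resolves to the phylum, while the singleton c2 is excluded.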