BFVD—a large repository of predicted viral protein structures

Abstract

The AlphaFold Protein Structure Database (AFDB) is the largest repository of accurately predicted structures with taxonomic labels. Despite providing predictions for over 214 million UniProt entries, the AFDB does not cover viral sequences, severely limiting their study. To address this, we created the Big Fantastic Virus Database (BFVD), a repository of 351 242 protein structures predicted by applying ColabFold to the viral sequence representatives of the UniRef30 clusters. By utilizing homology searches across two petabases of assembled sequencing data, we improved 36% of these structure predictions beyond ColabFold’s initial results. BFVD holds a unique repertoire of protein structures as over 62% of its entries show no or low structural similarity to existing repositories. We demonstrate how a substantial fraction of bacteriophage proteins, which remained unannotated based on their sequences, can be matched with similar structures from BFVD. In that, BFVD is on par with the AFDB, while holding nearly three orders of magnitude fewer structures. BFVD is an important virus-specific expansion to protein structure repositories, offering new opportunities to advance viral research. BFVD can be freely downloaded at bfvd.steineggerlab.workers.dev and queried using Foldseek and UniProt labels at bfvd.foldseek.com.

Article activity feed

  1. The taxid for each BFVD sequence was retrieved from UniProt and its full lineage - from NCBI (31). The Sankey plot based on this information (Fig. 1a) was generated with Pavian (32). For each taxonomic rank, only the ten most abundant taxa were included in the plot.

    It would be interesting to see a side-by-side comparison to the PDB and ViralZone. Where does BFVD have more coverage? Where do ViralZone or the PDB have more depth (more sequences per sequence represented in BFVD)?

  2. To that end, the 3,002 sequences longer than this threshold were split, resulting in 6,730 sequence fragments

    As noted in my other comment on this step, it would be interesting to know more details about how you did this splitting.

  3. Looking ahead, we aim to expand BFVD by predicting viral multimer structures, taking advantage of their compact genome size, and making them searchable using Foldseek Multimer (30).

    Would you consider expanding it to all viral proteins in UniProt, or to a higher clustering resolution than UniRef30 (such as UniRef50 or UniRef90)?

    In general, it would be interesting to know more about this trade-off. Some questions I would love to know the answers to:

    1. How much computational cost and database size did you save by using UniRef30 over all UniProt viral proteins? It looks like you predicted 351K structures that represent 3M sequences. Is 3M too many to predict?
    2. How taxonomically mixed are the UniRef30 clusters? For example, I looked up the representative sequence for the PB1 protein in influenza A and it is the PB1 sequence from influenza B, so they're very closely related. Is this usually the case because viruses are so diverse?
    3. Do the PDB and ViralZone provide more resolution for the viruses they do cover? For example, if I'm more interested in eukaryotic viruses, will I get more exact results using one of those databases because they don't reduce the sequence set as much as UniRef30 does?
  4. To demonstrate BFVD’s utility, we repeated and extended a part of a recent study by Say et al. (14) that annotated putative bacteriophages within metagenomically assembled contigs from wastewater. Say et al. developed a pipeline for enhanced annotations by integrating structural information from the AFDB with sequence data. Here, we applied the steps of their pipeline to one of the metagenomic samples from their study: the Granulated Activated Carbon sample 6 (GAC6). In addition to using the AFDB like they did, we included BFVD and ViralZone as reference databases for structural similarity search (Fig. 1h). Like Say et al., we found that the sequence-similarity based tool Bakta (28) could annotate on average 8% of the putative bacteriophage proteins on each contig, while Foldseek with the AFDB as reference annotated on average 51% of them. By using BFVD, we could annotate a comparable fraction of 46% of the putative bacteriophage proteins, despite the tremendous size difference between the AFDB and BFVD. However, when we searched the sample structures against the combined structure set of the AFDB and BFVD, we observed only a marginal increase in annotation performance. This suggests that the AFDB likely includes some BFVD bacteriophage structures indirectly, through prophages embedded in bacterial genomes covered by the AFDB. While ViralZone improved Bakta’s annotations, its contribution was limited compared to the AFDB and BFVD, likely due to its focus on eukaryotic viruses.

    I think it could be interesting to repeat this experiment but with a metagenome where the viruses of interest are not bacteriophages. As written, this doesn't really highlight the benefit of BFVD.

    It may also be interesting to report the additional metadata you receive from annotating with BFVD instead of AFDB. If the phage structures come from hits to prophages, AFDB would presumably provide "host" information while BFVD would provide viral taxonomy (or at least taxonomy of sequences in the cluster that have a hit).

  5. Indeed, among the low-confidence structure predictions (pLDDT < 50), the majority (78%) had fewer than 30 homologs.

    Maybe I'm misunderstanding, but I thought in the previous paragraph you stated that most of these sequences had high pLDDT, so were high confidence? That makes this sentence confusing, as well as the following one.

  6. Focusing on the shortest proteins (≤ 70 residues), we found that 99% of them were singletons. Unlike longer proteins, only 4% of the shortest proteins exhibited low confidence scores (pLDDT < 50). This is consistent with a previous report of high pLDDTs in sequences shorter than 100 residues (26).

    Would you be willing to add a summary sentence here? I take this to mean that the structures are predicted with high confidence but that most of them are unique (singletons)?

    There is some evidence in humans that short ORFs (<100 amino acids) are evolutionarily young and not shared between closely related species, leading to the hypothesis that they may be a reservoir of functional innovation. I'm curious whether anything similar has been posited about the evolution of these short viral proteins, or whether the tools aren't accurate enough in this case to put forth these types of ideas.

  7. To limit the computational demand of structure prediction, we split 3,002 sequences longer than 1,500 residues (< 1% of all) into 6,730 sequence fragments.

    Can you state more clearly what you did here? Did you split them into 1,500-residue chunks, divide them into halves or thirds, or do something more clever such as relying on domain annotations?

  8. a database of 67,715 predicted protein structures from 4,463 species of eukaryotic viruses.

    It would be helpful to know here whether they subsampled to representative genomes or clustered sequences and picked representatives to better compare against the approach taken here.
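
On the splitting question raised in comments 2 and 7: the quoted passages do not say how the 3,002 long sequences were cut, so the following is only a minimal sketch of the two simplest options the comment mentions, not the authors' method. The function names `split_fixed` and `split_even` are hypothetical; the 1,500-residue threshold and the counts 3,002 and 6,730 come from the quoted text. The reported counts imply roughly 2.24 fragments per long sequence on average, which by itself does not distinguish between the two schemes.

```python
from math import ceil


def split_fixed(seq: str, max_len: int = 1500) -> list[str]:
    """Cut into consecutive chunks of at most max_len residues.

    The last chunk can be much shorter than the others.
    """
    return [seq[i:i + max_len] for i in range(0, len(seq), max_len)]


def split_even(seq: str, max_len: int = 1500) -> list[str]:
    """Use the fewest fragments that fit under max_len, of near-equal length."""
    if not seq:
        return []
    n = ceil(len(seq) / max_len)          # minimum number of fragments
    base, extra = divmod(len(seq), n)     # first `extra` fragments get +1 residue
    fragments, start = [], 0
    for i in range(n):
        end = start + base + (1 if i < extra else 0)
        fragments.append(seq[start:end])
        start = end
    return fragments


# Toy sequence of 3,200 residues (hypothetical input, not from the preprint).
toy = "A" * 3200
fixed_lengths = [len(f) for f in split_fixed(toy)]
even_lengths = [len(f) for f in split_even(toy)]

# Average fragments per long sequence, from the counts quoted in the preprint.
mean_fragments = 6730 / 3002
```

With fixed-size chunks the toy sequence leaves a short 200-residue tail, whereas even division gives three fragments of about 1,067 residues each. A domain-aware split, the third option in the comment, would need boundary annotations that this sketch does not model.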