BFVD—a large repository of predicted viral protein structures

Rachel Seongeun Kim
Eli Levy Karin
Milot Mirdita
Rayan Chikhi
Martin Steinegger

This article has been Reviewed by the following groups

Listed in

Evaluated articles (Arcadia Science)

Abstract

The AlphaFold Protein Structure Database (AFDB) is the largest repository of accurately predicted structures with taxonomic labels. Despite providing predictions for over 214 million UniProt entries, the AFDB does not cover viral sequences, severely limiting their study. To address this, we created the Big Fantastic Virus Database (BFVD), a repository of 351 242 protein structures predicted by applying ColabFold to the viral sequence representatives of the UniRef30 clusters. By utilizing homology searches across two petabases of assembled sequencing data, we improved 36% of these structure predictions beyond ColabFold’s initial results. BFVD holds a unique repertoire of protein structures as over 62% of its entries show no or low structural similarity to existing repositories. We demonstrate how a substantial fraction of bacteriophage proteins, which remained unannotated based on their sequences, can be matched with similar structures from BFVD. In that, BFVD is on par with the AFDB, while holding nearly three orders of magnitude fewer structures. BFVD is an important virus-specific expansion to protein structure repositories, offering new opportunities to advance viral research. BFVD can be freely downloaded at bfvd.steineggerlab.workers.dev and queried using Foldseek and UniProt labels at bfvd.foldseek.com.

Version published to 10.1093/nar/gkae1119
Nov 22, 2024
Arcadia Science
Oct 7, 2024

Methods

Would you be willing to provide the code you used for this project as a GitHub repo or a gist?

Arcadia Science
Oct 7, 2024

colabfold_envdb_202108

Where is this available?

Arcadia Science
Oct 7, 2024

The taxid for each BFVD sequence was retrieved from UniProt and its full lineage - from NCBI (31). The Sankey plot based on this information (Fig. 1a) was generated with Pavian (32). For each taxonomic rank, only the ten most abundant taxa were included in the plot.

It would be interesting to see a side-by-side comparison to PDB and ViralZone. Where does BFVD have more coverage? Where does ViralZone or PDB have more depth (more sequences per represented sequence in BFVD)?

Arcadia Science
Oct 7, 2024

To that end, the 3,002 sequences longer than this threshold were split, resulting in 6,730 sequence fragments

As commented above, it would be interesting to know more details on how you did this splitting.

Arcadia Science
Oct 7, 2024

taxnomic

typo

Arcadia Science
Oct 7, 2024

https://gwdu111.gwdg.de/compbiol/uniclust/2023_02/

This URL is not found when I click on it

Arcadia Science
Oct 7, 2024
Looking ahead, we aim to expand BFVD by predicting viral multimer structures, taking advantage of their compact genome size, and making them searchable using Foldseek Multimer (30).

Would you consider expanding it to all viral proteins in UniProt, or to a higher clustering resolution than UniRef30 (such as UniRef50 or UniRef90)?

In general, it would be interesting to know more about this trade off. Some questions I would love to know the answers to:

1. How much computational cost and database size did you save by using UniRef30 over all UniProt viral proteins? It looks like you did 351K that represent 3M sequences. Is 3M too many to do?

2. How taxonomically mixed are the UniRef30 clusters? For example, I looked at the representative sequence for a protein PB1 in influenza A and it's sequence PB1 in influenza B, so they're very closely related. Is this usually the case because viruses are so diverse?

3. Do PDB and ViralZone provide more resolution for the viruses they do cover? For example, if I'm more interested in eukaryotic viruses, will I get more exact results using one of those databases because they don't reduce down so much (e.g. UniRef30)?
Arcadia Science
Oct 7, 2024

To demonstrate BFVD’s utility, we repeated and extended a part of a recent study by Say et al. (14) that annotated putative bacteriophages within metagenomically assembled contigs from wastewater. Say et al. developed a pipeline for enhanced annotations by integrating structural information from the AFDB with sequence data. Here, we applied the steps of their pipeline to one of the metagenomic samples from their study: the Granulated Activated Carbon sample 6 (GAC6). In addition to using the AFDB like they did, we included BFVD and ViralZone as reference databases for structural similarity search (Fig. 1h). Like Say et al., we found that the sequence-similarity based tool Bakta (28) could annotate on average 8% of the putative bacteriophage proteins on each contig, while Foldseek with the AFDB as reference annotated on average 51% of them. By using BFVD, we could annotate a comparable fraction of 46% of the putative bacteriophage proteins, despite the tremendous size difference between the AFDB and BFVD. However, when we searched the sample structures against the combined structure set of the AFDB and BFVD, we observed only a marginal increase in annotation performance. This suggests that the AFDB likely includes some BFVD bacteriophage structures indirectly, through prophages embedded in bacterial genomes covered by the AFDB. While ViralZone improved Bakta’s annotations, its contribution was limited compared to the AFDB and BFVD, likely due to its focus on eukaryotic viruses.
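
The "on average ... on each contig" percentages in the passage above read as a mean of per-contig fractions rather than a single pooled fraction over all proteins. That reading is an assumption, and the two can differ when contigs vary in size. A minimal Python sketch of the per-contig version, with a hypothetical function name and input layout that are not taken from the paper or from Say et al.:

```python
from collections import defaultdict
from statistics import fmean


def mean_annotated_fraction(proteins):
    """Mean over contigs of (annotated proteins / all proteins on that contig).

    `proteins` is an iterable of (contig_id, is_annotated) pairs, one per
    putative protein. This input layout is hypothetical.
    """
    totals = defaultdict(int)  # proteins seen per contig
    hits = defaultdict(int)    # annotated proteins per contig
    for contig, annotated in proteins:
        totals[contig] += 1
        hits[contig] += bool(annotated)
    if not totals:
        raise ValueError("no proteins given")
    # Each contig contributes equally, regardless of how many proteins it has.
    return fmean(hits[c] / totals[c] for c in totals)
```

With two contigs, one having 1 of 2 proteins annotated and the other 1 of 1, this gives 0.75, whereas pooling all three proteins would give 2/3.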

I think it could be interesting to repeat this experiment but with a metagenome where the viruses of interest are not bacteriophages. As written, this doesn't really highlight the benefit of BFVD.

It may also be interesting to report the additional metadata you receive from annotating with BFVD instead of AFDB. If the phage structures come from hits to prophages, AFDB would presumably provide "host" information while BFVD would provide viral taxonomy (or at least taxonomy of sequences in the cluster that have a hit).

Arcadia Science
Oct 7, 2024
Typo I think
Arcadia Science
Oct 7, 2024

Indeed, among the low-confidence structure predictions (pLDDT < 50), the majority (78%) had fewer than 30 homologs.

Maybe I'm misunderstanding, but I thought in the previous paragraph you stated that most of these sequences had high pLDDT, so were high confidence? That makes this sentence confusing, as well as the following one.

Arcadia Science
Oct 7, 2024

Focusing on the shortest proteins (≤ 70 residues), we found that 99% of them were singletons. Unlike longer proteins, only 4% of the shortest proteins exhibited low confidence scores (pLDDT < 50). This is consistent with a previous report of high pLDDTs in sequences shorter than 100 residues (26).

Would you be willing to add a summary sentence here? I take this to mean that the structures are highly confident but that they are very unique?

There is some evidence in humans that short ORFs (<100 amino acids) are evolutionarily young and not shared between closely related species, leading to the hypothesis that they may be a reservoir of functional innovation. I'm curious whether anything similar has been posited about the evolution of these short viral proteins, or whether the tools aren't accurate enough in this case to put forth these types of ideas.

Arcadia Science
Oct 7, 2024

70.17, indicating medium confidence

Would you be willing to report mean and sd here as well?
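
As an illustration of the summary requested above, here is a minimal Python sketch that reads per-residue pLDDT from a predicted structure and reports mean, median and standard deviation across structures. It relies on AlphaFold2 and ColabFold writing pLDDT into the B-factor column of their PDB output; the function names are hypothetical, and the excerpt does not say which statistic the quoted 70.17 is.

```python
import statistics


def mean_plddt(pdb_text: str) -> float:
    """Average pLDDT of one structure, read from the B-factor field
    (PDB columns 61-66) of its C-alpha ATOM records."""
    values = [
        float(line[60:66])
        for line in pdb_text.splitlines()
        if line.startswith("ATOM") and line[12:16].strip() == "CA"
    ]
    if not values:
        raise ValueError("no CA atoms found")
    return statistics.fmean(values)


def summarize(per_structure: list[float]) -> dict[str, float]:
    """Mean, median and sample standard deviation of per-structure averages."""
    return {
        "mean": statistics.fmean(per_structure),
        "median": statistics.median(per_structure),
        "sd": statistics.stdev(per_structure),  # needs at least two values
    }
```

Calling `summarize` on the list of `mean_plddt` results for all structures would give the three numbers side by side.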

Arcadia Science
Oct 7, 2024

To limit the computational demand of structure prediction, we split 3,002 sequences longer than 1,500 residues (< 1% of all) into 6,730 sequence fragments.

Can you state more clearly what you did here? Did you split them into 1,500-residue chunks, divide them into halves or thirds, or do something more clever like relying on domain annotations?
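
To make the options concrete, here is a minimal Python sketch of one of them: splitting into the fewest near-equal fragments that each stay within the threshold. This is an assumption for illustration only, not the authors' procedure. The quoted text gives just the 1,500-residue threshold and the counts, and the name `split_sequence` is hypothetical.

```python
import math

MAX_LEN = 1500  # threshold quoted in the excerpt above


def split_sequence(seq: str, max_len: int = MAX_LEN) -> list[str]:
    """Split `seq` into the fewest near-equal fragments of at most `max_len`.

    Hypothetical strategy; the paper's actual splitting rule is not stated
    in the quoted text.
    """
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    if len(seq) <= max_len:
        return [seq]
    n_parts = math.ceil(len(seq) / max_len)
    base, extra = divmod(len(seq), n_parts)
    fragments, start = [], 0
    for i in range(n_parts):
        # The first `extra` fragments take one additional residue.
        size = base + (1 if i < extra else 0)
        fragments.append(seq[start:start + size])
        start += size
    return fragments
```

For reference, 6,730 fragments from 3,002 sequences is about 2.2 fragments per sequence on average.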

Arcadia Science
Oct 7, 2024

a database of 67,715 predicted protein structures from 4,463 species of eukaryotic viruses.

It would be helpful to know here whether they subsampled to representative genomes or clustered sequences and picked representatives to better compare against the approach taken here.

Arcadia Science
Oct 7, 2024

(e.g., 16, 17)

These citations are missing their hyperlinks

Version published to 10.1101/2024.09.08.611582v1 on bioRxiv
Sep 9, 2024

