Viro3D: a comprehensive database of virus protein structure predictions

Abstract

Viruses are intracellular parasites of organisms from all domains of life. They infect and cause disease in humans, animals and plants but also play crucial roles in the ecology of microbial communities. Tolerance to genetic change, high mutation rates, adaptation to hosts and immune escape have driven high divergence of viral genes, hampering their functional annotation and phylogenetic inference. Protein structure is more conserved than sequence and can be used to search for distant homologs and to analyse the evolution of divergent proteins. Structures of viral proteins are traditionally underrepresented in public databases, but recent advances in protein structure prediction allow us to address this issue. Combining two state-of-the-art approaches, AlphaFold2-ColabFold and ESMFold, we predicted models for 85,000 proteins from 4,400 human and animal viruses, expanding the structural coverage of viral proteins 30-fold compared to experimental structures. We also performed structural and network analyses of the models to demonstrate their utility for functional annotation and inference of distant phylogenetic relationships. Taking this approach, we examined the deep evolutionary history of viral class-I fusion glycoproteins, gaining insights into the origins of the coronavirus spike protein. To enable further discoveries, we have created Viro3D ( https://viro3d.cvr.gla.ac.uk/ ), a virus species-centred protein structure database. It allows users to search, browse and download protein models from a virus of interest and explore similar structures present in other virus species. This resource will facilitate fundamental molecular virology and investigation of virus evolution, and may enable structure-informed design of therapies and vaccines.

Article activity feed

  1. Viro3D is fully searchable and browsable here: https://viro3d.cvr.gla.ac.uk/.

    Thank you for sharing this awesome resource! It is a really useful dataset.

    I tried it out using EBNA1 (UniProt ID P03211), and initially struggled to find the protein in the database. Ultimately, I had the easiest time finding the protein via sequence search or by searching for the virus name "Epstein-Barr" and then navigating to the protein from there. The "Protein Name" and "Protein ID" searches were less intuitive.

    Here are my general thoughts on what might make it easier for outside researchers to use. Of course, feel free to take them or leave them:

    1. It would be helpful to rename the search bar terms with more specific titles, such that Protein ID = GenBank protein ID and Protein Name = GenBank gene name.

    2. It may be useful to add another search term, "GenBank Protein Product", where people can search for products, since right now that is combined with the "Protein Name" search (i.e. EBNA-1 and BKRF1 are a protein product and a gene name respectively, but both work as search terms through Protein Name).

    3. Also, if it isn't too much work, it may be worth enabling search by UniProt ID!

    Thanks again for pulling all this together. Beyond my initial struggles with the search function, I found the database to be fast and visually intuitive, and I anticipate using it more in the future!

  2. genomic termini are hotspots for evolutionary innovation

    This is an interesting pattern! A boring explanation may be that gene predictions are worse at genome termini, making a protein look novel when in fact it is an artifact. It might be worth looking into a few examples of these on a genome-by-genome level as a case study.

  3. 10.9% from metagenomic and environmental samples

    These could represent hits to other viruses, which are abundant in metagenomes. If you are trying to track the origin of these proteins, it would likely be better to query only against sequenced whole genomes rather than community samples, with the caveat that WGS data can still contain viral contaminants from a variety of sources (endogenous, infecting, sample prep, environmental, etc.).

  4. 15.9% from Bacteria

    I wonder if the bacterial hits are also from integrated viruses. Phages share many signature proteins with eukaryotic viruses (e.g. HK97 MCPs, RTs, PolBs). It might be worth quickly looking at the annotations of the hits to see if they can be explained that way, or if this mostly reflects deep evolutionary relationships between cellular bacterial proteins and viral proteins (as shown in https://www.pnas.org/doi/10.1073/pnas.2120620119).

  5. for example we achieved a 90% increase in the number of identified RdRp structures.

    Were these previously proteins of unknown function? Or were they annotated as something else?

  6. Four of these RTs are in a community of RdRp clusters (reflecting their shared ancestry) and therefore can be found using an RdRp probe, while one is clustered with other RTs.

    This brings up an interesting point. When clusters are used to propagate annotations, how do you decide which annotation is correct? For example, where you state above that you achieve a 90% increase in RdRp structures, how do you know that they are true RdRps rather than RTs?

  7. Predicted Viro3D structures for IAV proteome alongside the best matching counterparts from Nomburg et al. and BFVD. Ribbon diagrams are colour-coded by pLDDT confidence as denoted in the key.

    I'm surprised to see how visually different the Nomburg structures appear from the Viro3D structures. Can you clarify whether these structures were predicted from the exact same protein sequences? I know that for BFVD they likely were not, but I think the Nomburg dataset should allow for an apples-to-apples comparison. If they are the same sequences, it would be helpful to provide TM-scores for the comparisons. Visually the models look different, but it's hard to tell how much of that is true divergence and how much is due to them being positioned slightly differently in space.
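
To make the TM-score request concrete, here is a minimal sketch of the standard TM-score formula evaluated on coordinates that are already aligned residue-by-residue and already superposed. The function names `tm_d0` and `tm_score` are hypothetical, and the 0.5 Å floor on d0 is an assumption added to keep the scale positive for very short chains. A real comparison would use a tool such as TM-align or US-align, which also searches for the optimal superposition; this sketch does not.

```python
import math

def tm_d0(target_length: int) -> float:
    """Length-dependent distance scale d0 from the TM-score definition."""
    d0 = 1.24 * max(target_length - 15, 0) ** (1.0 / 3.0) - 1.8
    # Assumption: floor d0 at 0.5 so short chains do not give d0 <= 0.
    return max(d0, 0.5)

def tm_score(coords_a, coords_b, target_length: int) -> float:
    """TM-score for paired, pre-superposed C-alpha coordinates.

    coords_a and coords_b are equal-length sequences of (x, y, z) tuples,
    where position i in one list is aligned to position i in the other.
    The sum is normalised by the length of the target protein.
    """
    if len(coords_a) != len(coords_b):
        raise ValueError("aligned coordinate lists must have equal length")
    d0 = tm_d0(target_length)
    total = 0.0
    for a, b in zip(coords_a, coords_b):
        d = math.dist(a, b)  # Euclidean distance between aligned residues
        total += 1.0 / (1.0 + (d / d0) ** 2)
    return total / target_length
```

Because the score depends only on distances after superposition, it removes the "positioned differently in space" ambiguity that a side-by-side ribbon figure leaves open.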

  8. structure prediction workflow of Nomburg et al.

    Do you have a hypothesis as to specifically what difference in their prediction workflow leads to lower model quality? From a brief glance, it seems like they are also using ColabFold. Is your model improvement coming from also using ESMFold and then selecting the highest confidence model?
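
If the selection step works the way this question supposes, it could look something like the sketch below. This is only an illustration of "predict with both methods, keep the more confident model"; the function name `pick_best_model`, the dictionary layout, and the use of mean per-residue pLDDT as the selection criterion are all assumptions, not the authors' documented procedure.

```python
from statistics import mean

def pick_best_model(plddt_by_method):
    """Return (method, mean_pLDDT) for the most confident prediction.

    plddt_by_method maps a method name (e.g. "colabfold", "esmfold") to the
    list of per-residue pLDDT values of that method's model. Methods that
    produced no model (empty list) are skipped.
    """
    scores = {m: mean(v) for m, v in plddt_by_method.items() if v}
    if not scores:
        raise ValueError("no successful prediction for this record")
    best = max(scores, key=scores.get)
    return best, scores[best]
```

Under this scheme a record with a failed or low-confidence prediction from one method still gets the other method's model, which would be one way for a two-method workflow to outperform a single-method one.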

  9. Functionally annotated structure similarity network of viral proteins.

    Is it possible to explore this network in the Viro3D database (seems like no)? Or could you provide the network file with node annotations that was used to create this figure as supp data? It may be interesting for users to be able to explore these clusters based on functions of interest.

  10. For records with protein length greater than 1,236 aa following settings were used: --max-tokens-per-batch 1 --chunk-size 128

    I think the numbers here and in the next sentence exceed the maximums in ESMFold. Does the chunk size mean you're folding these in 128-amino-acid chunks and then joining the structure together?

  11. Due to memory constraints on the GPU, we were unable to predict models for records with protein length greater than 2,840 aa.

    Can you estimate how much memory you would need?

  12. 14.4% of the protein records form singleton clusters and potentially represent structural novelty. Our species-focussed approach captured the genomic context of all predicted structures and, therefore, allowed us to investigate the genome positions of singleton and non-singleton clusters (Figure 2g).

    Is there an association with protein length here? Or with pLDDT?
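
One simple way to check this would be to summarise length and confidence separately for singleton and non-singleton records, as in the sketch below. The record fields `cluster_size`, `length` and `mean_plddt` are hypothetical names chosen for illustration, not fields confirmed to exist in the Viro3D data.

```python
from statistics import median

def summarise_by_singleton(records):
    """Compare singleton and non-singleton records.

    records is an iterable of dicts with the (hypothetical) keys
    'cluster_size', 'length' and 'mean_plddt'. Returns the record count,
    median length and median pLDDT for each non-empty group.
    """
    groups = {"singleton": [], "non_singleton": []}
    for rec in records:
        key = "singleton" if rec["cluster_size"] == 1 else "non_singleton"
        groups[key].append(rec)

    summary = {}
    for key, rows in groups.items():
        if rows:  # skip empty groups to avoid a median of nothing
            summary[key] = {
                "n": len(rows),
                "median_length": median(r["length"] for r in rows),
                "median_plddt": median(r["mean_plddt"] for r in rows),
            }
    return summary
```

If singletons turned out to be markedly shorter or lower in pLDDT, that would suggest some of the apparent structural novelty reflects short or poorly predicted records.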

  13. Consistent with the notion that viruses are a source of novel protein folds 29, the majority of viral proteins do not share detectable homology with cellular life, with only 17.8% of clusters (3,393 out of 19,067) having significant structural similarity to proteins in the AFDB.

    Is it straightforward to download these sets from your website or API?

  14. We also used this approach to expand functional annotation, propagating sequence-based annotation using structural clusters and structural network. Out of 85,162 protein records, 65.6% have at least partial Pfam annotation (Figure 2d). By propagating Pfam annotations to unannotated cluster members, we expanded the functional coverage by 3.99% (3,395 records). The propagation of Pfam annotation to clusters that do not have any annotated members using the structural network expanded the functional coverage by an additional 3.37% (2,870 records, Figure S5).

    Do you have a test to determine whether these expanded annotations are accurate?
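
One possible test is a leave-one-out check on records that already have annotations: hide each known label, predict it from the rest of its cluster, and count how often the prediction matches. The sketch below assumes a simple majority-vote propagation rule and a single label per record; both are simplifications for illustration, and the function names and data layout are hypothetical rather than taken from the manuscript.

```python
from collections import Counter

def propagate(clusters, annotations):
    """Assign each unannotated member the majority label of its cluster.

    clusters: dict mapping cluster id -> list of record ids.
    annotations: dict mapping record id -> label (e.g. a Pfam family).
    Returns a dict of newly inferred labels only.
    """
    inferred = {}
    for members in clusters.values():
        votes = Counter(annotations[m] for m in members if m in annotations)
        if not votes:
            continue  # cluster has no annotated member to propagate from
        top_label, _ = votes.most_common(1)[0]
        for m in members:
            if m not in annotations:
                inferred[m] = top_label
    return inferred

def leave_one_out_accuracy(clusters, annotations):
    """Fraction of known labels recovered when each is hidden in turn.

    Only clusters with at least two annotated members contribute.
    Returns None if no cluster qualifies.
    """
    correct = 0
    total = 0
    for members in clusters.values():
        annotated = [m for m in members if m in annotations]
        if len(annotated) < 2:
            continue
        for held_out in annotated:
            votes = Counter(annotations[m] for m in annotated if m != held_out)
            predicted, _ = votes.most_common(1)[0]
            total += 1
            correct += int(predicted == annotations[held_out])
    return correct / total if total else None
```

A low recovery rate in mixed clusters, for example ones containing both RdRp and RT members, would flag where propagated annotations should be treated with caution.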

  15. This likely reflects the sequence selection and structure prediction strategy underlying BFVD 28.

    Can you add another sentence or clause of detail here (sequence length, etc.) explaining what this means, so the reader doesn't have to go to the BFVD paper to understand this point?

  16. Due to compute limitations in predicting longer proteins, ESMFold yielded slightly fewer successful predictions – 84,964 protein records (27.0 million residues, covering 92.3% of amino acid residues).

    It's not clear from the phrasing whether this is an inherent limitation of ESMFold or whether it could be overcome with a larger computer.

  17. We relied on data from the International Committee on Taxonomy of Viruses (ICTV) Virus Metadata Resource (VMR), which provides a comprehensive list of virus species and representative isolates, along with their GenBank accession numbers and host associations.

    Would it be possible to add information from the KEGG Virus-Host DB to get higher-resolution host information? https://www.genome.jp/virushostdb/

  18. ICTV Virus Metadata Resource VMR MSL38 v2

    Is this the datasheet you used? VMR_MSL39.v2_20240920.xlsx

    Would you be willing to provide the link to the datasheet in the methods section?

  19. It allows users to search, browse and download protein models from a virus of interest and explore similar structures present in other virus species.

    Would it be possible to enable download/visualization by host species? For example, it would be nice to be able to download all structures for viruses that infect humans.