Using artificial intelligence to document the hidden RNA virosphere

Abstract

RNA viruses are diverse and abundant components of global ecosystems. The metagenomic identification of RNA viruses is currently limited to those that exhibit sequence similarity to known viruses. Consequently, the detection of highly divergent viruses with poor sequence similarity to known viruses remains a challenging task. We developed a deep learning algorithm, termed LucaProt, to identify highly divergent RNA-dependent RNA polymerase (RdRP) sequences in 10,487 metatranscriptomes from diverse global ecosystems. LucaProt integrates both sequence and structural information to accurately and efficiently detect RdRP sequences. With this approach we identified 161,979 putative RNA virus species and 180 RNA virus supergroups, among which only 21 contained members of phyla or classes currently defined by the International Committee on Taxonomy of Viruses, and which included many groups that were either undescribed or poorly characterized in previous studies. The newly identified RNA viruses were present in diverse ecological settings, including the air, hot springs and hydrothermal vents, and both virus diversity and abundance varied substantially among ecosystems. We also identified the longest RNA virus genome (nido-like virus) documented to date, at 47,250 nucleotides. This study marks the beginning of a new era of virus discovery, providing computational tools that will help expand our understanding of the global RNA virosphere and of virus evolution.

Article activity feed

  1. For all 10,487 data sets generated and collected for this study, reads were assembled de novo into contigs using MEGAHIT v1.2.8 [45] with default parameters

    It would be really interesting to see the alignment rates -- e.g. what fraction of each sample's reads assembled, and whether this varies by biome. This would give us some idea of whether other viral reads were left on the table because they did not assemble.

  2. That the 180 RNA viral superclades identified represented RNA-based organisms was verified by multiple lines of evidence.

    Did you do any sort of contamination screen here to see if any of your hits were off target or had homology to other sequences? Either against BLAST nt/nr or against metagenomes or something?

  3. Independently to the deep-learning approach, we applied a more conventional approach (i.e., “ClstrSearch”) that clustered all proteins based on their sequence homology and then used BLAST or HMM models to identify any resemblance to viral RdRPs or non-RdRP proteins.

    Did you do validation here? We've recently done something similar and noticed that we have to filter our DIAMOND (BLAST-equivalent) results to 90% identity, or else we get a ton of off-target hits.

  4. The latter approach is distinguished from previous BLAST or HMM based approaches because it queries on protein clusters (i.e., alignments) instead of individual sequences, which greatly reduced both the false positive and negative rates of virus identification.

    Clever. Reminds me of NCBI's new clustered nr database.

  5. The major AI algorithm used here (i.e., “LucaProt”) is a deep learning, transformer-based model established based on sequence and structural features of 5,979 well-characterized RdRPs and 229,434 non-RdRPs. LucaProt had high accuracy (0.03% false positives) and specificity (0.20% false negatives) on the test data set (Fig. 1b, Extended Data Fig. 4).

    Nice! I have two questions about this.

    1. Are there any problems that could arise in training because this training set is so unbalanced?
    2. How do your input RdRPs compare to those used in Serratus?
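
To put the first question above in numbers, here is a minimal Python sketch of how unbalanced the quoted training set is. The two counts come from the manuscript excerpt. Everything else is an assumption for illustration: the function name `balanced_class_weights` and the labels `rdrp` and `non_rdrp` are hypothetical, and inverse-frequency weighting is just one common mitigation (the same formula scikit-learn uses for `class_weight="balanced"`). Nothing in the excerpt says whether LucaProt used any reweighting.

```python
# Counts quoted from the manuscript excerpt; all names below are illustrative.
N_RDRP = 5_979        # well-characterized RdRPs (positive class)
N_NON_RDRP = 229_434  # non-RdRPs (negative class)


def balanced_class_weights(counts):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}


counts = {"rdrp": N_RDRP, "non_rdrp": N_NON_RDRP}
total = sum(counts.values())

# Number of negatives for every positive in the training data.
imbalance_ratio = N_NON_RDRP / N_RDRP

# Accuracy of a trivial model that labels every sequence as non-RdRP.
majority_baseline_accuracy = N_NON_RDRP / total

# Hypothetical mitigation: weight each class inversely to its frequency.
weights = balanced_class_weights(counts)

print(f"negatives per positive: {imbalance_ratio:.1f}")
print(f"all-negative baseline accuracy: {majority_baseline_accuracy:.4f}")
for label, weight in weights.items():
    print(f"weight[{label}] = {weight:.3f}")
```

With roughly 38 non-RdRPs per RdRP, a model that never predicts RdRP would already be about 97.5% accurate, so overall accuracy alone says little here. The per-class false positive and false negative rates that the authors report are the more informative figures.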