Using artificial intelligence to document the hidden RNA virosphere
This article has been Reviewed by the following groups
Listed in
- Evaluated articles (Arcadia Science)
Article activity feed
For all 10,487 data sets generated and collected for this study, reads were assembled de novo into contigs using MEGAHIT v1.2.8 (ref. 45) with default parameters.
It would be really interesting to see the alignment rates, e.g. what fraction of the reads in each sample assembled, and whether this varies by biome. This would give us some idea of whether other viral reads were left on the table due to non-assembly.
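One way to look at this is sketched below: a minimal Python example, not taken from the manuscript, that turns per-sample read counts into an assembled fraction and summarizes it by biome. The record fields (`sample`, `biome`, `total_reads`, `mapped_reads`) and the function names are hypothetical; the counts are assumed to come from mapping each sample's reads back to its own contigs.

```python
from collections import defaultdict
from statistics import median


def assembled_fraction(mapped_reads, total_reads):
    """Fraction of a sample's reads that map back to its assembled contigs."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return mapped_reads / total_reads


def summarize_by_biome(records):
    """Return the median assembled fraction per biome.

    `records` is an iterable of dicts with the hypothetical keys
    'sample', 'biome', 'total_reads' and 'mapped_reads'.
    """
    by_biome = defaultdict(list)
    for rec in records:
        frac = assembled_fraction(rec["mapped_reads"], rec["total_reads"])
        by_biome[rec["biome"]].append(frac)
    return {biome: median(fracs) for biome, fracs in by_biome.items()}


# Made-up counts, for illustration only.
example_records = [
    {"sample": "s1", "biome": "soil", "total_reads": 100, "mapped_reads": 50},
    {"sample": "s2", "biome": "soil", "total_reads": 100, "mapped_reads": 70},
    {"sample": "s3", "biome": "marine", "total_reads": 100, "mapped_reads": 90},
]
biome_summary = summarize_by_biome(example_records)
```

A biome with a consistently low median would be a candidate for having unassembled viral reads worth a second look.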
-
That the 180 RNA viral superclades identified represented RNA-based organisms was verified by multiple lines of evidence.
Did you do any sort of contamination screen here to see if any of your hits were off target or had homology to other sequences? Either against BLAST nt/nr or against metagenomes or something?
-
Independently to the deep-learning approach, we applied a more conventional approach (i.e., “ClstrSearch”) that clustered all proteins based on their sequence homology and then used BLAST or HMM models to identify any resemblance to viral RdRPs or non-RdRP proteins.
Did you do validation here? We've recently done something similar and noticed that we have to filter our DIAMOND (BLAST-equivalent) results to 90% identity, or else we get a ton of off-target hits.
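For concreteness, the kind of filter described in this comment might look like the sketch below. It assumes hits in the standard 12-column BLAST tabular layout (which DIAMOND can also write) and treats the 90% figure as a minimum percent identity; the function name and the example hit lines are made up for illustration and are not from the manuscript.

```python
# Column names of the standard 12-column BLAST tabular format.
BLAST6_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def filter_hits_by_identity(lines, min_identity=90.0):
    """Keep tabular hits whose percent identity is at least `min_identity`."""
    kept = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        if len(fields) < len(BLAST6_COLUMNS):
            raise ValueError(f"expected 12 tab-separated fields, got {len(fields)}")
        hit = dict(zip(BLAST6_COLUMNS, fields))
        if float(hit["pident"]) >= min_identity:
            kept.append(hit)
    return kept


# Made-up hits, for illustration only.
example_hits = [
    "contig_1\tRdRP_A\t95.2\t300\t14\t0\t1\t300\t10\t309\t1e-150\t520",
    "contig_2\tRdRP_B\t42.0\t280\t150\t5\t3\t282\t20\t299\t1e-20\t95.1",
]
kept_hits = filter_hits_by_identity(example_hits)
```

With the two example lines, only the `contig_1` hit passes the threshold.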
-
The latter approach is distinguished from previous BLAST or HMM based approaches because it queries on protein clusters (i.e., alignments) instead of individual sequences, which greatly reduced both the false positive and negative rates of virus identification.
Clever. Reminds me of NCBI's new clustered nr database.
-
The major AI algorithm used here (i.e., “LucaProt”) is a deep learning, transformer-based model established based on sequence and structural features of 5,979 well-characterized RdRPs and 229,434 non-RdRPs. LucaProt had high accuracy (0.03% false positives) and specificity (0.20% false negatives) on the test data set (Fig. 1b, Extended Data Fig. 4).
Nice! I have two questions about this.
- Are there any problems that could arise in training because this training set is so unbalanced?
- How do your input RdRPs compare to those used in Serratus?
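To make the imbalance question concrete, here is a small sketch using only the two class sizes quoted above (5,979 RdRPs and 229,434 non-RdRPs). It shows the accuracy a trivial "always non-RdRP" classifier would reach, one common mitigation (inverse-frequency class weights), and balanced accuracy as a metric that is less sensitive to class sizes. None of this is claimed to be what LucaProt does; the function names are hypothetical.

```python
N_RDRP = 5_979        # positives, from the quoted passage
N_NON_RDRP = 229_434  # negatives, from the quoted passage


def majority_baseline_accuracy(n_pos, n_neg):
    """Accuracy of a classifier that always predicts the larger class."""
    return max(n_pos, n_neg) / (n_pos + n_neg)


def inverse_frequency_weights(counts):
    """Per-class weights n_total / (n_classes * n_class).

    Rarer classes receive larger weights in a weighted loss.
    """
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}


def balanced_accuracy(tp, fn, tn, fp):
    """Mean of sensitivity and specificity."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


baseline = majority_baseline_accuracy(N_RDRP, N_NON_RDRP)
weights = inverse_frequency_weights({"RdRP": N_RDRP, "non-RdRP": N_NON_RDRP})
# The same always-negative classifier, scored with balanced accuracy.
trivial_balanced = balanced_accuracy(tp=0, fn=N_RDRP, tn=N_NON_RDRP, fp=0)
```

On these counts the always-negative baseline already scores about 97.5% plain accuracy while its balanced accuracy is 0.5, which is why per-class error rates like the ones reported are more informative than a single accuracy figure.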
-