Using artificial intelligence to document the hidden RNA virosphere

Xin Hou
Yong He
Pan Fang
Shi-Qiang Mei
Zan Xu
Wei-Chen Wu
Jun-Hua Tian
Shun Zhang
Zhen-Yu Zeng
Qin-Yu Gou
Gen-Yang Xin
Shi-Jia Le
Yin-Yue Xia
Yu-Lan Zhou
Feng-Ming Hui
Yuan-Fei Pan
John-Sebastian Eden
Zhao-Hui Yang
Chong Han
Yue-Long Shu
Deyin Guo
Jun Li
Edward C. Holmes
Zhao-Rong Li
Mang Shi

Version published to 10.1016/j.cell.2024.09.027
Nov 1, 2024
Arcadia Science
May 10, 2023

For all 10,487 data sets generated and collected for this study, reads were assembled de novo into contigs using MEGAHIT v1.2.8 (ref. 45) with default parameters.

It would be really interesting to see the alignment rates, e.g. what fraction of each sample's reads assembled, and whether this varies by biome. That would give us some idea of whether other viral reads were left on the table due to non-assembly.
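The per-sample check suggested in this comment could be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: it assumes each sample already has a count of reads that map back to its own contigs (from any read mapper), and the field names and counts below are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def assembled_fraction(mapped_reads: int, total_reads: int) -> float:
    """Fraction of a sample's reads that map back to its assembled contigs."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return mapped_reads / total_reads


def summarize_by_biome(samples):
    """Average the assembled fraction over the samples within each biome."""
    per_biome = defaultdict(list)
    for s in samples:
        per_biome[s["biome"]].append(assembled_fraction(s["mapped"], s["total"]))
    return {biome: mean(fracs) for biome, fracs in per_biome.items()}


# Hypothetical counts, for illustration only (not data from the study).
samples = [
    {"sample": "S1", "biome": "soil", "mapped": 600, "total": 1000},
    {"sample": "S2", "biome": "soil", "mapped": 400, "total": 1000},
    {"sample": "S3", "biome": "marine", "mapped": 900, "total": 1000},
]
summary = summarize_by_biome(samples)
print(summary)
```

A biome with a consistently low fraction would be a candidate for having unassembled, and therefore unsearched, viral reads.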

Arcadia Science
May 10, 2023

That the 180 RNA viral superclades identified represented RNA-based organisms was verified by multiple lines of evidence.

Did you do any sort of contamination screen here to see if any of your hits were off target or had homology to other sequences? Either against BLAST nt/nr or against metagenomes or something?

Arcadia Science
May 10, 2023

Independently to the deep-learning approach, we applied a more conventional approach (i.e., “ClstrSearch”) that clustered all proteins based on their sequence homology and then used BLAST or HMM models to identify any resemblance to viral RdRPs or non-RdRP proteins.

Did you do validation here? We've recently done something similar and noticed that we have to filter our DIAMOND (BLAST-equivalent) results to 90% identity, or else we get a ton of off-target hits.
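The kind of identity filter described in this comment might look like the sketch below. It assumes hits are stored in the standard 12-column BLAST tabular layout (outfmt 6), which DIAMOND can also write; the 90% cut-off is the commenter's figure, not a threshold from the preprint, and the example rows are made up.

```python
import csv
import io

# Column order of the standard 12-column BLAST tabular format (outfmt 6).
COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def filter_hits(handle, min_identity=90.0):
    """Keep only hits whose percent identity is at least min_identity."""
    reader = csv.DictReader(handle, fieldnames=COLUMNS, delimiter="\t")
    return [row for row in reader if float(row["pident"]) >= min_identity]


# Hypothetical hits, for illustration only.
example = (
    "q1\trefA\t95.2\t300\t14\t0\t1\t300\t5\t304\t1e-150\t520\n"
    "q1\trefB\t42.0\t280\t150\t6\t10\t289\t20\t299\t1e-20\t95\n"
    "q2\trefC\t90.0\t150\t15\t0\t1\t150\t1\t150\t1e-70\t260\n"
)
kept = filter_hits(io.StringIO(example))
for hit in kept:
    print(hit["qseqid"], hit["sseqid"], hit["pident"])
```

A strict identity threshold trades sensitivity for precision, so it would also discard the divergent homologs that a search for novel RdRPs is meant to find; the right cut-off depends on the goal.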

Arcadia Science
May 10, 2023

The latter approach is distinguished from previous BLAST or HMM based approaches because it queries on protein clusters (i.e., alignments) instead of individual sequences, which greatly reduced both the false positive and negative rates of virus identification.

Clever. Reminds me of NCBI's new clustered nr database.

Arcadia Science
May 10, 2023
The major AI algorithm used here (i.e., “LucaProt”) is a deep learning, transformer-based model established based on sequence and structural features of 5,979 well-characterized RdRPs and 229,434 non-RdRPs. LucaProt had high accuracy (0.03% false positives) and specificity (0.20% false negatives) on the test data set (Fig. 1b, Extended Data Fig. 4).

Nice! I have two questions about this.

Are there any problems that could arise in training because this training set is so unbalanced?

How do your input RdRPs compare to those used in Serratus?
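To make the imbalance question concrete, the sketch below uses only the two class sizes quoted above. It computes the accuracy of a trivial classifier that always predicts "non-RdRP", and inverse-frequency class weights, which are one common mitigation. Whether LucaProt used any such weighting is not stated in the quoted text, so this is an illustration rather than a description of the authors' training setup.

```python
# Class sizes quoted from the preprint.
N_RDRP = 5_979
N_NON_RDRP = 229_434


def balanced_class_weights(counts):
    """Inverse-frequency weights: total / (number_of_classes * class_count)."""
    total = sum(counts.values())
    k = len(counts)
    return {label: total / (k * n) for label, n in counts.items()}


counts = {"RdRP": N_RDRP, "non-RdRP": N_NON_RDRP}
weights = balanced_class_weights(counts)

# Accuracy of always predicting the majority class, which shows why
# plain accuracy says little on a data set this skewed.
majority_baseline_accuracy = max(counts.values()) / sum(counts.values())

print(f"negatives per positive: {N_NON_RDRP / N_RDRP:.1f}")
print(f"majority-class baseline accuracy: {majority_baseline_accuracy:.4f}")
print({label: round(w, 3) for label, w in weights.items()})
```

Because the baseline accuracy is already very high, per-class error rates such as the false positive and false negative rates quoted above are more informative than overall accuracy.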
Version published to 10.1101/2023.04.18.537342 on bioRxiv
Apr 18, 2023

Rapid Phylogenomic Analysis of Thousands Outbreak‐Causing Viral Genomes Using Covary

This article has 1 author:
1. Marvin I. De los Santos
This article has no evaluations. Latest version Dec 22, 2025
Divergent Bacteriophages from Wastewater Reveal an Open Pan-Genome with No Shared Gene Families

This article has 4 authors:
1. Malihe Hamidzade
2. Kimia Sharifian
3. Seyed Jalal Kiani
4. Alieza Mohebbi
This article has no evaluations. Latest version Dec 19, 2025
Decrypting viral dark matter through key proteins using an NLP-enhanced framework

This article has 10 authors:
1. Zhihua Du
2. Min Li
3. Kaihuang Lin
4. Bo Xing
5. Yuehua Ou
6. Wenchen Song
7. Jie Chen
8. Junhua Li
9. Jianqiang Li
10. Minfeng Xiao
This article has no evaluations. Latest version Jan 13, 2026
