Genomic Similarity of Nucleotides in SARS CoronaVirus using K-Means Unsupervised Learning Algorithm

Jairaj Singh

Listed in

Evaluated articles (ScreenIT)

Abstract

The sharp increase in the number of coronaviruses discovered and coronavirus genomes sequenced gives us a great opportunity to perform genomic and bioinformatic analyses on this family of viruses. Coronaviruses possess the largest genomes (26.4 to 31.7 kb) among all known RNA viruses, with G + C contents varying from 32% to 43%. Phylogenetically, three genera, Alphacoronavirus, Betacoronavirus and Gammacoronavirus, with Betacoronavirus consisting of subgroups A, B and C, were known to exist, and a fourth genus, Deltacoronavirus, has since been added. In this situation, efficient classification of all virus data becomes highly important, because it supports suitable planning, containment and treatment. The objective of this paper is to classify SARS coronavirus nucleotide sequences based on parameters such as sequence length, percentage similarity between sequences, open and closed gaps in the sequence due to multiple mutations, and several others. Doing so should allow us to predict accurately the similarity of the SARS-CoV-2 virus to other coronaviruses, such as the Wuhan coronavirus, the bat coronavirus and the pneumonia virus, and help us better understand the taxonomy of the coronavirus family.
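
The G + C content quoted in the abstract is the share of guanine and cytosine bases in a genome. A minimal sketch of that calculation, assuming the genome is available as a plain nucleotide string; the function name `gc_content` and the toy fragment are illustrative and do not come from the article:

```python
def gc_content(sequence: str) -> float:
    """Return the percentage of G and C among the unambiguous bases A, C, G, T/U."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T") + seq.count("U")
    total = gc + at  # ambiguous symbols such as N are ignored
    if total == 0:
        raise ValueError("sequence contains no A/C/G/T/U bases")
    return 100.0 * gc / total


# Made-up 10-base fragment, for illustration only (not real viral data).
toy_fragment = "ATGCGCATTA"
print(gc_content(toy_fragment))
```

Applied to a full genome read from a FASTA file, a value in the 32% to 43% range would be consistent with the figures given above.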

SUMMARY

In addition to the abstract above, the following points summarize the article:

The article discusses an application of machine learning in the field of virology.
It aims to classify the SARS-CoV-2 virus against the already known sequences of the bat coronavirus, the Wuhan Seafood Market pneumonia virus and the Wuhan coronavirus.
To predict the similarity of SARS-CoV-2 to the other viruses listed above, the K-Means unsupervised learning algorithm has been chosen.
The data set used is MN997409.1-4NY0T82X016-Alignment-HitTable.csv, found on www.kaggle.com (complete link in the references section). [17]
The results have been validated using a simple data-correlation technique, namely Spearman's rank correlation coefficient.
I also discuss future work using deep neural networks that could help predict new virus sequences and effectively find any similarity with already discovered viruses.
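
The validation step named in the summary, Spearman's rank correlation coefficient, is the Pearson correlation computed on ranks rather than raw values. The sketch below is one possible pure-Python version, assuming two equally long numeric columns; the helper names and the toy numbers are invented for illustration and are not the author's code or data:

```python
from math import sqrt


def average_ranks(values):
    """Assign 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed inputs."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have the same length, at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined for a constant input")
    return cov / sqrt(var_x * var_y)


# Hypothetical columns: percent identity and bit score of four alignments.
identity = [99.9, 96.2, 88.0, 80.3]
bit_score = [55000.0, 48000.0, 30000.0, 21000.0]
print(round(spearman_rho(identity, bit_score), 6))
```

A value near +1 or -1 indicates a strongly monotonic relationship between the two columns; which pairs of columns the article actually compares is not stated in this summary.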

SciScore for 10.1101/2020.10.12.336339:

Please note, not all rigor criteria are appropriate for all manuscripts.

Table 1: Rigor

Institutional Review Board Statement	not detected.
Randomization	Centroids are first allotted randomly and then by running for a certain number of iterations the K-Means algorithm fixes the centroids and allocates the clusters.
Blinding	not detected.
Power Analysis	not detected.
Sex as a biological variable	not detected.
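
The randomization entry in the table above matches the usual K-Means loop: pick random starting centroids, assign every point to its nearest centroid, recompute the centroids, and repeat until the assignments stop changing. A minimal pure-Python sketch of that loop follows; it is not the author's implementation, and the two features (percent identity, gap opens) and their values are invented for illustration:

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iterations=100, seed=0):
    """Cluster points into k groups; returns (centroids, labels)."""
    rng = random.Random(seed)
    # Random initialization: k distinct data points become the first centroids.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids are fixed
        labels = new_labels
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels


# Hypothetical (percent identity, gap opens) pairs, not taken from the data set.
hits = [(99.9, 0), (99.8, 1), (99.5, 0), (88.1, 12), (87.5, 15), (89.0, 10)]
centroids, labels = kmeans(hits, k=2)
print(labels)
```

Because the features here are on different scales, a real analysis would probably standardize them before clustering; that step is left out to keep the sketch short.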

Table 2: Resources

Software and Algorithms
Sentences	Resources
It is an accession-based data originally downloaded by performing a BLAST exhaustive search from NCBI(National Center for Biotechnology Information) database.	BLAST suggested: (BLASTX, RRID:SCR_001653)
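
For readers who want to inspect such a BLAST hit table, the sketch below parses one with the Python standard library. It assumes the file follows the common 12-column BLAST tabular layout without a header row; that layout, the column names in `COLUMNS`, and the two sample rows (with placeholder subject accessions) are assumptions for illustration and should be checked against the actual file:

```python
import csv
import io

# Assumed column order of a 12-column BLAST hit table; names are illustrative.
COLUMNS = [
    "query_acc", "subject_acc", "pct_identity", "alignment_length",
    "mismatches", "gap_opens", "q_start", "q_end", "s_start", "s_end",
    "evalue", "bit_score",
]
FLOAT_COLUMNS = ("pct_identity", "evalue", "bit_score")
INT_COLUMNS = (
    "alignment_length", "mismatches", "gap_opens",
    "q_start", "q_end", "s_start", "s_end",
)


def read_hit_table(handle):
    """Parse hit-table rows from a text handle into a list of dicts."""
    rows = []
    for record in csv.reader(handle):
        if not record or record[0].startswith("#"):
            continue  # skip blank lines and comment lines
        if len(record) != len(COLUMNS):
            raise ValueError(f"expected {len(COLUMNS)} fields, got {len(record)}")
        row = dict(zip(COLUMNS, record))
        for name in FLOAT_COLUMNS:
            row[name] = float(row[name])
        for name in INT_COLUMNS:
            row[name] = int(row[name])
        rows.append(row)
    return rows


# Made-up rows with placeholder subject accessions, not real BLAST output.
sample = io.StringIO(
    "MN997409.1,SUBJECT_A.1,99.990,29882,3,0,1,29882,1,29882,0.0,55166\n"
    "MN997409.1,SUBJECT_B.1,88.120,29000,3300,45,1,29000,10,29050,0.0,35000\n"
)
rows = read_hit_table(sample)
print(len(rows), rows[0]["pct_identity"], rows[1]["gap_opens"])
```

To read the real file, the same function could be called on `open(path, newline="")`, where `path` points to the downloaded CSV.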

Results from OddPub: We did not detect open data. We also did not detect open code. Researchers are encouraged to share open data when possible (see Nature blog).

Results from LimitationRecognizer: An explicit section about the limitations of the techniques employed in this study was not found. We encourage authors to address study limitations.

Results from TrialIdentifier: No clinical trial numbers were referenced.

Results from Barzooka: We did not find any issues relating to the usage of bar graphs.

Results from JetFighter: We did not find any issues relating to colormaps.

Results from rtransparent:

No conflict of interest statement was detected. If there are no conflicts, we encourage authors to explicitly state so.
No funding statement was detected.
No protocol registration statement was detected.

Version published to 10.1101/2020.10.12.336339 on bioRxiv
Oct 12, 2020
