(Machine) Learning the mutation signatures of SARS-CoV-2: a primer for predictive prognosis

Sunil Nagpal
Nishal Kumar Pinna
Divyanshu Srivastava
Rohan Singh
Sharmila S. Mande

Abstract

Motivation

The continuous emergence of new variants through the appearance, accumulation and disappearance of mutations is a hallmark of many viral diseases. SARS-CoV-2 and its variants have exerted tremendous pressure on global healthcare systems owing to their life-threatening and debilitating effects. The sheer plurality of variants and the huge scale of genome sequence data available for COVID-19 have made it harder to trace mutations of concern. That same scale, however, provides an opportunity to treat SARS-CoV-2 genomes and the mutations therein as ‘big data records’ and to classify the variants comprehensively through the (machine) learning of mutation patterns. The unprecedented sequencing effort and tracing of disease outcomes provide excellent ground for identifying important mutations by developing machine-learned models, or severity classifiers, from the mutation profile of SARS-CoV-2. This is expected to give significant impetus to efforts aimed not only at identifying mutations of concern but also at exploring the potential of mutation-driven predictive prognosis of SARS-CoV-2.

Results

We describe how a graduated approach of building severity-specific machine learning classifiers, using only the mutation corpus of SARS-CoV-2 genomes, can potentially lead to the identification of important mutations and guide the prognosis of infection. We demonstrate how model-derived important mutations and Shapley values can be used to identify significant mutations of concern and to develop sparse models of outcome classification. A total of 77,284 outcome-traced SARS-CoV-2 genomes were employed in this study, representing a corpus of 30,346 unique nucleotide mutations and 18,647 amino acid mutations. Machine learning models for graduated classifiers of the target outcomes, namely ‘Asymptomatic, Mild, Symptomatic/Moderate, Severe and Fatal’, were built in accordance with the TRIPOD guidelines for predictive prognosis. Shapley values for the important mutations of each model were used to select significant mutations, leading to the identification of fewer than 20 outcome-driving mutations per classifier. We additionally describe the significance of adopting a ‘temporal modeling approach’ to benchmark predictive prognosis for continuously evolving pathogens. Chronologically distinct sampling is important for evaluating how accurately models trained on ‘past data’ classify the prognosis linked with future genomes, which may carry new mutations. We conclude that while a machine learning approach can play a vital role in identifying relevant mutations, caution should be exercised in using mutation signatures for predictive prognosis in cases where new mutations have accumulated alongside previously observed mutations of concern.
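As an illustration of the chronologically distinct sampling described above, the following is a minimal sketch rather than the authors' pipeline. It assumes each genome is reduced to a collection date, a set of mutation names and an outcome label. The toy records, the cutoff date and the helper names (`temporal_split`, `encode`, `novel_mutations`) are all hypothetical.

```python
from datetime import date

# Hypothetical toy records: (collection date, set of mutations, outcome label).
# Mutation names and outcomes are illustrative only, not taken from the study.
records = [
    (date(2020, 4, 1), {"D614G"}, "Mild"),
    (date(2020, 9, 15), {"D614G", "A222V"}, "Severe"),
    (date(2021, 1, 10), {"D614G", "N501Y"}, "Mild"),
    (date(2021, 5, 20), {"D614G", "N501Y", "L452R"}, "Severe"),
    (date(2021, 7, 2), {"D614G", "P681R", "L452R"}, "Fatal"),
]


def temporal_split(rows, cutoff):
    """Split rows into 'past' (before cutoff) and 'future' (on or after cutoff)."""
    past = [r for r in rows if r[0] < cutoff]
    future = [r for r in rows if r[0] >= cutoff]
    return past, future


def encode(mutations, vocabulary):
    """Binary presence/absence vector over a fixed, ordered mutation vocabulary."""
    return [1 if m in mutations else 0 for m in vocabulary]


def novel_mutations(past, future):
    """Mutations carried by future genomes that never occur in the past set."""
    seen = set().union(*(r[1] for r in past))
    return {m for r in future for m in r[1]} - seen


# Assumed cutoff date; a model would be trained on 'past' and scored on 'future'.
past, future = temporal_split(records, cutoff=date(2021, 3, 1))

# The feature vocabulary is fixed from the past data only, as it would be
# for any model trained before the future genomes were sequenced.
vocabulary = sorted(set().union(*(r[1] for r in past)))
X_future = [encode(r[1], vocabulary) for r in future]
unseen = novel_mutations(past, future)

print(vocabulary)      # features available to a model trained on the past
print(X_future)        # future genomes as that model would see them
print(sorted(unseen))  # mutations such a model has no feature for
```

In this sketch the future genomes lose their new mutations when they are encoded against the past vocabulary, which is one way to see why performance measured on a random split may not carry over to later genomes.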

Contact

sharmila.mande@tcs.com

Supplementary information

Supplementary data are enclosed.

ScreenIT
Sep 2, 2021
SciScore for 10.1101/2021.08.30.458244:
Please note, not all rigor criteria are appropriate for all manuscripts.
Table 1: Rigor
NIH rigor criteria are not applicable to paper type.
Table 2: Resources
No key resources detected.
Results from OddPub: We did not detect open data. We also did not detect open code. Researchers are encouraged to share open data when possible (see Nature blog).
Results from LimitationRecognizer: An explicit section about the limitations of the techniques employed in this study was not found. We encourage authors to address study limitations.
Results from TrialIdentifier: No clinical trial numbers were referenced.
Results from Barzooka: We did not find any issues relating to the usage of bar graphs.
Results from JetFighter: We did not find any issues relating to colormaps.
Results from rtransparent:
- Thank you for including a conflict of interest statement. Authors are encouraged to include this statement when submitting to a journal.
- No funding statement was detected.
- No protocol registration statement was detected.
Results from scite Reference Check: We found no unreliable references.
About SciScore
SciScore is an automated tool that is designed to assist expert reviewers by finding and presenting formulaic information scattered throughout a paper in a standard, easy to digest format. SciScore checks for the presence and correctness of RRIDs (research resource identifiers), and for rigor criteria such as sex and investigator blinding.
Version published to 10.1101/2021.08.30.458244v1 on bioRxiv
Aug 31, 2021

Related articles

Accurate detection of pathogenic structural variants guided by multi-platform comparison

This article has 10 authors:
1. Nico Alavi
2. M-Hossein Moeinzadeh
3. Jakob Hertzberg
4. Uira Souto Melo
5. Lion Ward Al Raei
6. Paolo Infantino
7. Maryam Ghareghani
8. Marco Savarese
9. Stefan Mundlos
10. Martin Vingron
This article has no evaluations. Latest version: May 27, 2025

Unraveling the Immunogenicity of P.1 Variant of SARS-CoV-2 Emerged in Manaus, Amazonas: Insights from Molecular Dynamics and Machine-Learning Mutation Analysis

This article has 9 authors:
1. Micael D.L. de Oliveira
2. Jonathas N. da Silva
3. Isabelle B. Cordeiro
4. Caroline Honaiser Lescano
5. Ana Carolina O. Lima
6. Nathália S. Faria
7. Adriana Malheiro
8. Emersom S. Lima
9. Kelson M.T. de Oliveira
This article has no evaluations. Latest version: Jun 9, 2025

Genetic and Immunological Profiling of Recent SARS-CoV-2 Omicron Variants: Insights into Immune Evasion and Infectivity in Monoinfections and Coinfections

This article has 6 authors:
1. Nadine Alvarez
2. Irene Gonzalez-Jimenez
3. Risha Rasheed
4. Kira Goldgirsh
5. Steven Park
6. David S Perlin
This article has no evaluations. Latest version: May 29, 2025
