A Pan-Coronavirus Vaccine Candidate: Nine Amino Acid Substitutions in the ORF1ab Gene Attenuate 99% of 365 Unique Coronaviruses: A Comparative Effectiveness Research Study

Abstract

Background

The COVID-19 pandemic has been a watershed event. Industry and governments have reacted, investing over US$105 billion in vaccine research.1 The ‘Holy Grail’ is a universal, pan-coronavirus vaccine to protect humankind from future SARS-CoV-2 variants and the thousands of similar coronaviruses with pandemic potential.2 This paper proposes a new vaccine candidate that appears to attenuate SARS-CoV-2 variants enough to render them safe to use as a vaccine. Moreover, the results indicate it may be efficacious against 99% of the 365 coronaviruses studied. The research model is wet-dry-wet: it originated in genomic sequencing laboratories, evolved to computational modeling, and the candidate now requires validation back in a wet lab.

Objectives

This study’s purpose was to test the hypothesis that machine learning applied to sequenced coronavirus genomes could identify which amino acid substitutions likely attenuate the viruses, yielding a safe and effective pan-coronavirus vaccine candidate. The candidate is now eligible for pre-clinical and then clinical testing. If validated, it would constitute a traditional attenuated virus vaccine to protect against hundreds of coronaviruses, including the many future SARS-CoV-2 variants predicted to arise from continual recombination in unvaccinated populations and to spread through modern mass travel.

Methods

This was an in silico comparative effectiveness research study that used machine learning to examine trinucleotide functions in nonstructural proteins of 365 novel coronavirus genomes. Sequences of 7,097 codons in the ORF1ab gene were collected from viruses infecting 68 species at 65 global locations, as reported to the US National Institutes of Health. The data were transformed twice using proprietary methods to enable machine learning ingestion, mapping, and interpretation. The set of 2,590,405 data points (365 genomes × 7,097 codons) was randomly divided into three cohorts: 255 observations (70%) for training and two cohorts of 55 observations (15%) each for testing. Machine learning models were trained in the statistical programming language R and compared to identify which mixture of the 7.097 × 10^23 possible amino-acid-location combinations would attenuate SARS-CoV-2 and other coronaviruses that have infected humans.
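The cohort sizes and the data-point total described above can be checked with a minimal sketch. It is written in Python for illustration, although the study itself reports using R. The seed, the function name `split_genomes`, and the use of integer indices as stand-ins for genomes are all assumptions, and the study's proprietary transformations are not reproduced.

```python
import random

N_GENOMES = 365               # coronavirus genomes reported in the study
N_CODONS = 7_097              # ORF1ab codons per genome reported in the study
SPLIT_SIZES = (255, 55, 55)   # training, first test, second test cohorts


def split_genomes(n_genomes=N_GENOMES, sizes=SPLIT_SIZES, seed=0):
    """Randomly partition genome indices into three disjoint cohorts.

    The seed and the index-based representation are illustrative
    assumptions; the source does not describe how the split was drawn.
    """
    if sum(sizes) != n_genomes:
        raise ValueError("cohort sizes must add up to the number of genomes")
    indices = list(range(n_genomes))
    random.Random(seed).shuffle(indices)  # reproducible shuffle
    train_end = sizes[0]
    test_a_end = train_end + sizes[1]
    return indices[:train_end], indices[train_end:test_a_end], indices[test_a_end:]


train, test_a, test_b = split_genomes()
total_data_points = N_GENOMES * N_CODONS

print(len(train), len(test_a), len(test_b))  # 255 55 55
print(total_data_points)                     # 2590405
```

Fixing the cohort sizes rather than computing them from percentages avoids rounding ambiguity: 70% of 365 is 255.5, so the reported 255/55/55 split is taken as given.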

Results

Contests of machine-learning algorithms identified nine amino-acid point substitutions in the ORF1ab gene that likely attenuate 98.98% of 365 (361) novel coronaviruses. Notably, seven of the substitutions replace the amino acid alanine. Most of the locations (5 of 9) are in nonstructural proteins (NSPs) 2 and 3. The substitutions are alanine to (1) valine at codon 4273; (2) leucine at codon 5077; (3) phenylalanine at codon 2001; (4) leucine at codon 372; (5) proline at codon 354; (6) phenylalanine at codon 2811; (7) phenylalanine at codon 4703; (8) leucine to serine at codon 2333; and (9) threonine to alanine at codon 5131.
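The nine substitutions listed above can be written down as data, as in the following minimal Python sketch. It assumes the codon numbers are 1-based positions in the ORF1ab polyprotein, which the abstract does not state explicitly. The `Substitution` class, the helper functions, and the glycine-filled toy sequence are hypothetical placeholders for illustration; no real genome is used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Substitution:
    codon: int  # assumed 1-based codon position in ORF1ab
    ref: str    # one-letter code of the original amino acid
    alt: str    # one-letter code of the replacement amino acid


# The nine substitutions, in the order the abstract lists them.
SUBSTITUTIONS = (
    Substitution(4273, "A", "V"),  # alanine -> valine
    Substitution(5077, "A", "L"),  # alanine -> leucine
    Substitution(2001, "A", "F"),  # alanine -> phenylalanine
    Substitution(372, "A", "L"),   # alanine -> leucine
    Substitution(354, "A", "P"),   # alanine -> proline
    Substitution(2811, "A", "F"),  # alanine -> phenylalanine
    Substitution(4703, "A", "F"),  # alanine -> phenylalanine
    Substitution(2333, "L", "S"),  # leucine -> serine
    Substitution(5131, "T", "A"),  # threonine -> alanine
)


def apply_substitutions(protein, substitutions=SUBSTITUTIONS):
    """Return a copy of `protein` with each substitution applied.

    Raises ValueError if a position is out of range or does not hold
    the expected original residue.
    """
    residues = list(protein)
    for sub in substitutions:
        index = sub.codon - 1  # convert 1-based codon to 0-based index
        if index >= len(residues):
            raise ValueError(f"codon {sub.codon} is beyond the sequence end")
        if residues[index] != sub.ref:
            raise ValueError(f"codon {sub.codon} is not {sub.ref}")
        residues[index] = sub.alt
    return "".join(residues)


def toy_reference(length=7_097):
    """Build a placeholder sequence (all glycine) carrying the expected
    original residues at the nine positions. Not a real ORF1ab sequence."""
    residues = ["G"] * length
    for sub in SUBSTITUTIONS:
        residues[sub.codon - 1] = sub.ref
    return "".join(residues)


reference = toy_reference()
edited = apply_substitutions(reference)
from_alanine = sum(1 for sub in SUBSTITUTIONS if sub.ref == "A")

print(len(SUBSTITUTIONS), from_alanine)  # 9 7
```

Checking the original residue before each edit guards against applying the list to a sequence numbered under a different coordinate convention.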

Conclusions

The primary outcome is a new, highly promising pan-coronavirus vaccine candidate based on nine amino-acid substitutions in the ORF1ab gene. The secondary outcome is evidence that sequences of wet-dry lab collaborations (here, machine learning analysis of viral genomes informing codon functions) may discover new, broader, and more stable vaccine candidates more quickly and inexpensively than traditional methods.

Article activity feed

  1. SciScore for 10.1101/2022.04.28.489618:

    Please note, not all rigor criteria are appropriate for all manuscripts.

    Table 1: Rigor

    NIH rigor criteria are not applicable to paper type.

    Table 2: Resources

    Software and Algorithms
    Sentences | Resources
    Information’s (NCBI) database, Nucleotide, which amalgamates genomic sequences from GenBank, RefSeq, Third-party Annotation (TPA), and Protein Data Bank (PDB) databases.23 Therein, nucleotides were sequenced in 365 novel coronaviruses, of which 167 infected humans and 198 infected other species, including bats, birds, camels, civets, cows, and pigs. | RefSeq (suggested: None)

    Results from OddPub: We did not detect open data. We also did not detect open code. Researchers are encouraged to share open data when possible (see Nature blog).


    Results from LimitationRecognizer: We detected the following sentences addressing limitations in the study:
    Limitations: While this novel method can precisely locate the codons that enable transmission of emerging viruses to humans at the furthest point upstream at the root amino-acid level, the trinucleotide sequences do not reveal how they work. Therefore, it is a subject area of essential future research to study the structure of these molecules to understand their downstream systems and methods of functioning. Additional future research to evaluate how this method works for other viruses would also be valuable. Understanding the downstream repercussions on proteins of these precise edits, or point mutations, is essential but beyond the scope of this study. Moreover, there is no guarantee that changing the amino acids at the predicted codon locations would result in a less infective or virulent virus strain. While such changes frequently cause a loss of function, the goal anticipated, sometimes the functional result is minimal or none.43 Most of all, like all good moist lab collaborations, it would be invaluable to return these insights to a wet lab to determine in animal models whether editing or substitution of these codons has the effect the machine learning analysis suggests: to introduce loss of function in viruses, rendering them unlikely or incapable of infecting humans, or doing so with significantly less virulence.

    Results from TrialIdentifier: No clinical trial numbers were referenced.


    Results from Barzooka: We did not find any issues relating to the usage of bar graphs.


    Results from JetFighter: We did not find any issues relating to colormaps.


    Results from rtransparent:
    • Thank you for including a conflict of interest statement. Authors are encouraged to include this statement when submitting to a journal.
    • No funding statement was detected.
    • No protocol registration statement was detected.

    Results from scite Reference Check: We found no unreliable references.


    About SciScore

    SciScore is an automated tool that is designed to assist expert reviewers by finding and presenting formulaic information scattered throughout a paper in a standard, easy to digest format. SciScore checks for the presence and correctness of RRIDs (research resource identifiers), and for rigor criteria such as sex and investigator blinding. For details on the theoretical underpinning of rigor criteria and the tools shown here, including references cited, please follow this link.