What if we perceive SARS-CoV-2 genomes as documents? Topic modelling using Latent Dirichlet Allocation to identify mutation signatures and classify SARS-CoV-2 genomes



Abstract

Topic modeling is frequently employed to discover structures (or patterns) in a corpus of documents, and its utility in text mining and document retrieval across scientific fields is well known. Latent Dirichlet Allocation (LDA), an unsupervised machine learning approach, uses a generative statistical model to identify latent (hidden) topics in document collections and the words that define one or more topics. Here we describe how SARS-CoV-2 genomic mutation profiles can be structured into a ‘Bag of Words’ so that LDA can identify mutation signatures (topics) and their probabilistic distribution across genomes. Topic models were generated from ~47000 novel coronavirus genomes (treated as documents); coherence optimization identified 16 amino acid mutation signatures and 18 nucleotide mutation signatures (equivalent to topics) in this corpus. Treating genomes as documents also helped identify contextual nucleotide mutation signatures in the form of conventional N-grams (e.g. bi-grams and tri-grams). We validated the LDA-derived signatures against previously reported recurrent mutations and phylogenetic clades. Additionally, we report the geographical distribution of the identified mutation signatures on the global map. We therefore propose non-phylogenetic, albeit classical, approaches such as topic modeling and other data-centric pattern mining algorithms to supplement efforts to understand the genomic diversity of evolving SARS-CoV-2 genomes (and other pathogens/microbes).
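
To make the "genomes as documents" idea concrete, the following is a minimal sketch of how a mutation profile might be turned into a bag of words with optional N-grams. It is an illustration under stated assumptions, not the authors' pipeline: the function names, the genome IDs, and the example mutation labels are hypothetical, and it assumes that mutations are listed in genomic order and that an N-gram joins consecutive mutations in that order. The abstract does not say how the contextual N-grams were actually built.

```python
from collections import Counter


def ngrams(tokens, n):
    """Join every run of n consecutive tokens into one N-gram token."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bag_of_words(mutations, max_n=1):
    """Count unigrams up to max_n-grams for one genome (one 'document').

    Assumption: `mutations` is ordered by genomic position.
    """
    bag = Counter()
    for n in range(1, max_n + 1):
        bag.update(ngrams(mutations, n))
    return bag


# Hypothetical genomes; IDs and mutation lists are illustrative only.
genomes = {
    "genome_A": ["C241T", "C3037T", "C14408T", "A23403G"],
    "genome_B": ["C241T", "C3037T", "A23403G"],
}

# One bag per genome, with unigrams and bi-grams.
corpus = {gid: bag_of_words(muts, max_n=2) for gid, muts in genomes.items()}

for gid, bag in corpus.items():
    print(gid, dict(bag))
```

Each resulting bag plays the role of a document's word counts, which is the input form that LDA implementations generally expect.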

Article activity feed

  1. SciScore for 10.1101/2020.08.20.258772:

    Please note, not all rigor criteria are appropriate for all manuscripts.

    Table 1: Rigor

    NIH rigor criteria are not applicable to paper type.

    Table 2: Resources

    Software and Algorithms
    Sentence: NextStrain’s Augur pipeline was employed with default parameters to align the sequences against the reference Wuhan/WIV04/2019 (EPI_ISL_402124) [11].
    Resource: Augur (suggested: None)

    Sentence: Topic (mutation signatures) modeling through Latent Dirichlet Allocation and hyper-parameter tuning: Python’s Gensim library was employed to estimate topic (mutation signature) models for ~47000 SARS-CoV-2 genomes through online variational Bayes (VB) algorithm as described previously [13,14].
    Resource: Python’s (suggested: PyMVPA, RRID:SCR_006099)

    Sentence: Further hyperparameter optimization was performed for a range of alpha and beta measures (between 0.001 – 0.1, step size of 0.009) and the number of topics, in order to maximize the coherence score, and optimal values for all three parameters were obtained using the grid-search algorithm [15]. 2.5 Implementation: The entire implementation was executed in a 20 core Xeon 51 series 2.4GHz machine with 64GB RAM in a Python v3.7.6 kernel with Gensim v3.8.3 and Scikit-learn v0.23.1 for topic modelling using LDA.
    Resources:
      Python (suggested: IPython, RRID:SCR_001658)
      Gensim (suggested: None)
      Scikit-learn (suggested: scikit-learn, RRID:SCR_002577)
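
The hyperparameter search quoted above (alpha and beta between 0.001 and 0.1 in steps of 0.009, plus the number of topics, chosen to maximize coherence) can be sketched roughly as follows. This is a hedged illustration, not the authors' code: `prior_grid`, `grid_search`, and `gensim_coherence` are hypothetical names, the `c_v` coherence measure and `random_state=0` are assumptions (the quoted text does not name a coherence measure), and Gensim calls the beta prior `eta`. The Gensim import is deferred so that the grid logic runs without the library installed.

```python
from itertools import product


def prior_grid(start=0.001, stop=0.1, step=0.009):
    """Evenly spaced prior values from start to stop inclusive (12 values)."""
    n = int(round((stop - start) / step)) + 1
    return [round(start + i * step, 3) for i in range(n)]


def grid_search(score_fn, topic_counts, alphas, etas):
    """Return (best_score, num_topics, alpha, eta) over the full grid."""
    best = None
    for k, a, e in product(topic_counts, alphas, etas):
        score = score_fn(k, a, e)
        if best is None or score > best[0]:
            best = (score, k, a, e)
    return best


def gensim_coherence(texts, num_topics, alpha, eta):
    """Fit one LDA model with Gensim and return its coherence.

    Assumptions: 'c_v' coherence and random_state=0 are illustrative choices.
    `texts` is a list of token lists, one list per genome.
    """
    from gensim.corpora import Dictionary
    from gensim.models import CoherenceModel, LdaModel

    dictionary = Dictionary(texts)
    bow = [dictionary.doc2bow(t) for t in texts]
    lda = LdaModel(corpus=bow, id2word=dictionary, num_topics=num_topics,
                   alpha=alpha, eta=eta, random_state=0)
    cm = CoherenceModel(model=lda, texts=texts, dictionary=dictionary,
                        coherence="c_v")
    return cm.get_coherence()


# Toy scoring function standing in for a real coherence computation.
def toy_score(k, a, e):
    return -((k - 16) ** 2) - (a - 0.01) ** 2 - (e - 0.1) ** 2


priors = prior_grid()
best = grid_search(toy_score, range(10, 21), priors, priors)
print(best)
```

With real data, `score_fn` could be `lambda k, a, e: gensim_coherence(texts, k, a, e)`. An exhaustive grid of this size fits many models, so the range of topic counts is worth keeping small.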

    Results from OddPub: We did not detect open data. We also did not detect open code. Researchers are encouraged to share open data when possible (see Nature blog).


    Results from LimitationRecognizer: An explicit section about the limitations of the techniques employed in this study was not found. We encourage authors to address study limitations.

    Results from TrialIdentifier: No clinical trial numbers were referenced.


    Results from Barzooka: We did not find any issues relating to the usage of bar graphs.


    Results from JetFighter: We did not find any issues relating to colormaps.


    Results from rtransparent:
    • Thank you for including a conflict of interest statement. Authors are encouraged to include this statement when submitting to a journal.
    • No funding statement was detected.
    • No protocol registration statement was detected.

    About SciScore

    SciScore is an automated tool that is designed to assist expert reviewers by finding and presenting formulaic information scattered throughout a paper in a standard, easy to digest format. SciScore checks for the presence and correctness of RRIDs (research resource identifiers), and for rigor criteria such as sex and investigator blinding. For details on the theoretical underpinning of rigor criteria and the tools shown here, including references cited, please follow this link.