What if we perceive SARS-CoV-2 genomes as documents? Topic modelling using Latent Dirichlet Allocation to identify mutation signatures and classify SARS-CoV-2 genomes

Abstract

Topic modeling is frequently employed for discovering structures (or patterns) in a corpus of documents, and its utility in text-mining and document-retrieval tasks across scientific fields is well established. Latent Dirichlet Allocation (LDA), an unsupervised machine learning approach, is widely used to identify latent (hidden) topics in document collections and to determine the words that define each topic through a generative statistical model. Here we describe how SARS-CoV-2 genomic mutation profiles can be structured into a ‘Bag of Words’ so that LDA can identify mutation signatures (topics) and their probabilistic distribution across genomes. Topic models were generated from ~47000 novel coronavirus genomes (treated as documents); coherence optimization led to the identification of 16 amino acid mutation signatures and 18 nucleotide mutation signatures (equivalent to topics) in the corpus. Treating genomes as documents also enabled identification of contextual nucleotide mutation signatures in the form of conventional N-grams (e.g. bi-grams and tri-grams). We validated the LDA-derived signatures against previously reported recurrent mutations and phylogenetic clades. Additionally, we report the geographical distribution of the identified mutation signatures on a global map. We therefore propose non-phylogenetic, albeit classical, approaches such as topic modeling and other data-centric pattern-mining algorithms as a supplement to ongoing efforts to understand the genomic diversity of the evolving SARS-CoV-2 genomes (and other pathogens/microbes).
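The ‘Bag of Words’ framing can be illustrated with a minimal sketch. It assumes each genome's mutation profile is already available as a list of mutation tokens ordered by genomic position. The genome IDs, the token-to-genome assignments, and the helper names (`ngrams`, `build_vocabulary`, `bag_of_words`) are hypothetical and are not taken from the preprint.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

# Hypothetical mutation profiles: genome ID -> mutations ordered by position.
# These toy profiles are illustrative only, not data from the study.
profiles: Dict[str, List[str]] = {
    "genome_A": ["C241T", "C3037T", "C14408T", "A23403G"],
    "genome_B": ["C241T", "C3037T", "A23403G", "G28881A"],
    "genome_C": ["C8782T", "T28144C"],
}


def ngrams(tokens: List[str], n: int) -> List[str]:
    """Join each run of n adjacent mutations into one contextual token."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_vocabulary(documents: Iterable[List[str]]) -> Dict[str, int]:
    """Assign an integer ID to every distinct token, in first-seen order."""
    vocab: Dict[str, int] = {}
    for tokens in documents:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab


def bag_of_words(tokens: List[str], vocab: Dict[str, int]) -> List[Tuple[int, int]]:
    """Return sorted (token_id, count) pairs; word order is discarded."""
    counts = Counter(vocab[t] for t in tokens if t in vocab)
    return sorted(counts.items())


# Each genome "document" holds its single mutations plus its bi-grams.
documents = {gid: muts + ngrams(muts, 2) for gid, muts in profiles.items()}
vocab = build_vocabulary(documents.values())
corpus = {gid: bag_of_words(toks, vocab) for gid, toks in documents.items()}

for gid, bow in corpus.items():
    print(gid, bow)
```

The `(token_id, count)` layout is the same shape that Gensim's `Dictionary.doc2bow` produces, so a corpus built this way could in principle be passed to an LDA implementation. Whether the authors tokenized mutations or built N-grams in exactly this manner is not stated in the abstract.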

SciScore for 10.1101/2020.08.20.258772:

Please note, not all rigor criteria are appropriate for all manuscripts.

Table 1: Rigor

NIH rigor criteria are not applicable to paper type.

Table 2: Resources

Software and Algorithms
Sentences	Resources
NextStrain’s Augur pipeline was employed with default parameters to align the sequences against the reference Wuhan/WIV04/2019 (EPI_ISL_402124)11.	Augur suggested: None
Topic (mutation signatures) modeling through Latent Dirichlet Allocation and hyper-parameter tuning: Python’s Gensim library was employed to estimate topic (mutation signature) models for ~47000 SARS-CoV-2 genomes through online variational Bayes (VB) algorithm as described previously13,14.	Python’s suggested: (PyMVPA, RRID:SCR_006099)
Further hyperparameter optimization was performed for a range of alpha and beta measures (between 0.001 and 0.1, step size of 0.009) and the number of topics, in order to maximize the coherence score, and optimal values for all three parameters were obtained using the grid-search algorithm15. 2.5 Implementation: The entire implementation was executed on a 20 core Xeon 51 series 2.4GHz machine with 64GB RAM in a Python v3.7.6 kernel with Gensim v3.8.3 and Scikit-learn v0.23.1 for topic modelling using LDA.	Python suggested: (IPython, RRID:SCR_001658) Gensim suggested: None Scikit-learn suggested: (scikit-learn, RRID:SCR_002577)
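The grid search quoted in the table above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the range is taken to include both endpoints (giving 12 values per prior), the candidate topic counts are hypothetical because the excerpt does not list them, and the `c_v` coherence measure in `gensim_coherence` is an assumed choice. Gensim names the beta prior `eta`.

```python
from itertools import product
from typing import Callable, Iterable, List, Tuple

# 0.001 to 0.1 in steps of 0.009, built from integers to avoid float drift.
# Including both endpoints is an assumption; it yields 12 values.
PRIOR_GRID: List[float] = [(1 + 9 * i) / 1000 for i in range(12)]


def grid_search(
    score_fn: Callable[[int, float, float], float],
    topic_counts: Iterable[int],
    alphas: Iterable[float] = PRIOR_GRID,
    betas: Iterable[float] = PRIOR_GRID,
) -> Tuple[float, int, float, float]:
    """Return (best_score, num_topics, alpha, beta) over the full grid."""
    best = None
    for k, alpha, beta in product(topic_counts, alphas, betas):
        score = score_fn(k, alpha, beta)
        if best is None or score > best[0]:
            best = (score, k, alpha, beta)
    if best is None:
        raise ValueError("empty parameter grid")
    return best


def gensim_coherence(corpus, dictionary, texts, num_topics, alpha, beta):
    """Coherence of one LDA fit. Needs Gensim, imported lazily on call."""
    from gensim.models import CoherenceModel, LdaModel

    lda = LdaModel(
        corpus=corpus,
        id2word=dictionary,
        num_topics=num_topics,
        alpha=alpha,
        eta=beta,  # Gensim's name for the topic-word (beta) prior
        random_state=0,
    )
    # "c_v" is an assumed coherence measure, not confirmed by the source.
    model = CoherenceModel(
        model=lda, texts=texts, dictionary=dictionary, coherence="c_v"
    )
    return model.get_coherence()


if __name__ == "__main__":
    # Toy scorer standing in for a real coherence computation.
    def toy(k: int, a: float, b: float) -> float:
        return -(abs(k - 3) + abs(a - 0.01) + abs(b - 0.1))

    print(grid_search(toy, topic_counts=[2, 3, 4]))
```

With real data, `score_fn` could be a lambda that closes over a corpus, dictionary, and tokenized texts and calls `gensim_coherence`. An exhaustive grid of this size means 144 model fits per candidate topic count, so it is worth estimating the runtime before starting.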

Results from OddPub: We did not detect open data. We also did not detect open code. Researchers are encouraged to share open data when possible (see Nature blog).

Results from LimitationRecognizer: An explicit section about the limitations of the techniques employed in this study was not found. We encourage authors to address study limitations.

Results from TrialIdentifier: No clinical trial numbers were referenced.

Results from Barzooka: We did not find any issues relating to the usage of bar graphs.

Results from JetFighter: We did not find any issues relating to colormaps.

Results from rtransparent:

Thank you for including a conflict of interest statement. Authors are encouraged to include this statement when submitting to a journal.
No funding statement was detected.
No protocol registration statement was detected.

Related articles

Retrieval-Based AI Framework for Viral Genomic Analysis

ST-LDAW: A Topic-Model and Damped Weighted Least-Squares Method for Integrative Deconvolution of Single-Cell and Spatial Transcriptomics

Random forests in corpus research: A systematic review