Understand research hotspots surrounding COVID-19 and other coronavirus infections using topic modeling
This article has been reviewed by the following groups
Listed in
- Evaluated articles (ScreenIT)
Abstract
Background
Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is a virus that causes severe respiratory illness in humans and is responsible for the current global outbreak of novel coronavirus disease (COVID-19). This study aimed to characterize publications on coronaviruses, including COVID-19, using topic modeling.
Methods
We extracted all abstracts from the COVID-19 Open Research Dataset, which contains 35,092 coronavirus-related publications published up to March 20, 2020, and retained the most informative words. Using Latent Dirichlet Allocation (LDA) modeling, we trained a topic model on the corpus, analyzed the semantic relationships between topics, and compared the topic distribution between COVID-19 and other CoV infections.
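As a rough illustration of the LDA step described in the Methods, the following is a minimal sketch rather than the authors' pipeline. The abstract does not say which library, inference method, or hyperparameters were used, so the collapsed Gibbs sampler, the function name fit_lda, the values of alpha and beta, and the toy corpus are all assumptions made for this example.

```python
import random


def fit_lda(docs, num_topics, alpha=0.1, beta=0.01, iterations=200, seed=0):
    """Fit LDA by collapsed Gibbs sampling on tokenized documents.

    Returns (doc_topic, topic_word, vocab). Each row of doc_topic and
    topic_word is a probability distribution that sums to 1.
    The hyperparameters are illustrative assumptions, not values from the paper.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)

    # Count tables used by the sampler.
    n_dk = [[0] * num_topics for _ in docs]      # topic counts per document
    n_kw = [[0] * V for _ in range(num_topics)]  # word counts per topic
    n_k = [0] * num_topics                       # total tokens per topic

    # Random initial topic assignment for every token.
    z = []
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_d.append(k)
            n_dk[d][k] += 1
            n_kw[k][word_id[w]] += 1
            n_k[k] += 1
        z.append(z_d)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                k = z[d][i]
                # Remove the token's current assignment from the counts.
                n_dk[d][k] -= 1
                n_kw[k][v] -= 1
                n_k[k] -= 1
                # Unnormalized conditional probability of each topic for this token.
                weights = [
                    (n_dk[d][t] + alpha) * (n_kw[t][v] + beta) / (n_k[t] + V * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                z[d][i] = k
                n_dk[d][k] += 1
                n_kw[k][v] += 1
                n_k[k] += 1

    # Smoothed estimates of the document-topic and topic-word distributions.
    doc_topic = [
        [(n_dk[d][k] + alpha) / (len(doc) + num_topics * alpha)
         for k in range(num_topics)]
        for d, doc in enumerate(docs)
    ]
    topic_word = [
        [(n_kw[k][v] + beta) / (n_k[k] + V * beta) for v in range(V)]
        for k in range(num_topics)
    ]
    return doc_topic, topic_word, vocab


# Hypothetical toy "abstracts", already reduced to informative words.
toy_abstracts = [
    "patients fever cough clinical symptoms".split(),
    "clinical patients pneumonia symptoms fever".split(),
    "vaccine antibody immune response vaccine".split(),
    "immune antibody vaccine protein response".split(),
]

doc_topic, topic_word, vocab = fit_lda(toy_abstracts, num_topics=2)
for k, dist in enumerate(topic_word):
    top = sorted(range(len(vocab)), key=lambda v: -dist[v])[:3]
    print("topic", k, [vocab[v] for v in top])
```

On a real corpus a maintained library would be the more practical choice. The sampler is written out here only to show what "training a topic model" involves: every word token is repeatedly reassigned to a topic in proportion to how common that topic is in its document and how common the word is in that topic.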
Results
Eight topics emerged overall: clinical characterization, pathogenesis research, therapeutics research, epidemiological study, virus transmission, vaccines research, virus diagnostics, and viral genomics. Current COVID-19 research places more emphasis on clinical characterization, epidemiological study, and virus transmission. In contrast, the topics of diagnostics, therapeutics, vaccines, genomics, and pathogenesis account for less than 10%, or even less than 4%, of all COVID-19 publications, much lower shares than among publications on other CoV infections.
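The comparison of topic shares between COVID-19 and other CoV publications could be computed along the following lines. This is a sketch under stated assumptions: the abstract does not say how publications were assigned to topics, so labeling each document by its highest-probability topic is a choice made here, and the function name topic_shares and all example values are hypothetical.

```python
from collections import Counter


def topic_shares(doc_topic, groups):
    """Share of documents per dominant topic, computed separately per group.

    Assumption: each document is labeled with its highest-probability topic.
    """
    counts = {}
    for dist, group in zip(doc_topic, groups):
        dominant = max(range(len(dist)), key=dist.__getitem__)
        counts.setdefault(group, Counter())[dominant] += 1
    return {
        group: {topic: n / sum(c.values()) for topic, n in sorted(c.items())}
        for group, c in counts.items()
    }


# Hypothetical document-topic distributions for six publications.
example_doc_topic = [
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.2, 0.6, 0.2],
    [0.1, 0.2, 0.7],
    [0.1, 0.7, 0.2],
    [0.3, 0.3, 0.4],
]
example_groups = [
    "COVID-19", "COVID-19", "COVID-19",
    "other CoV", "other CoV", "other CoV",
]

shares = topic_shares(example_doc_topic, example_groups)
for group, dist in shares.items():
    print(group, dist)
```

A topic that never dominates any document in a group is simply absent from that group's dictionary, which corresponds to a share of zero.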
Conclusions
These results identified knowledge gaps in the area of COVID-19 and offered directions for future research.
Article activity feed
SciScore for 10.1101/2020.03.26.20044164:
Please note, not all rigor criteria are appropriate for all manuscripts.
Table 1: Rigor
NIH rigor criteria are not applicable to paper type.
Table 2: Resources
Software and Algorithms
Sentence: CORD-19 contains all research about COVID-19, SARS-CoV-2, and other CoVs (e.g. SARS, MERS, etc.) up to March 20, 2020, including over 44,000 scholarly articles, from the following sources: 1) PubMed’s PMC open access corpus; 2) a corpus maintained by the WHO; and 3) bioRxiv and medRxiv pre-prints.
Resource: bioRxiv, suggested: (bioRxiv, RRID:SCR_003933)
Sentence: LDA modeling is performed and visualized on the entire corpus via the Python language.
Resource: Python, suggested: (IPython, RRID:SCR_001658)
Results from OddPub: We did not detect open data. We also did not detect open code. Researchers are encouraged to share open data when possible (see Nature blog).
Results from LimitationRecognizer: We detected the following sentences addressing limitations in the study:
This present work has several limitations that must be acknowledged when interpreting the findings. First, as a computer model, topic modeling has difficulties with understanding nuances and subtext. We identified eight major topics with frequent appearances, whereas some specific topics such as virus natural history and human-animal interface were buried due to low proportions. Future work will attempt to develop an approach to detect and visualize the conceptual sub-domains. Second, it is also necessary to be aware that the granularity of terms used to label topics may vary a little for different topics. We have tried our best to keep them at the same level through carefully curating the top words in each topic, as well as reviewing text intention of the corresponding publications. Third, the LDA model only provides a cross-sectional profile of the coronavirus research, which is methodologically different from academic disciplines such as evidence-based medicine and clinical assessment that synthesize evidence centered on a specific question. But this study may facilitate these disciplines by informing what is already available and what is urgently required for the COVID-19 research. Future research could also study the occurrence of topics over time and analyze links to historic events and virus characteristics, in order to better understand possible temporal patterns. Notwithstanding its limitations, this work is the first to thoroughly assess research output of coronavirus ...
Results from TrialIdentifier: No clinical trial numbers were referenced.
Results from Barzooka: We did not find any issues relating to the usage of bar graphs.
Results from JetFighter: We did not find any issues relating to colormaps.
Results from rtransparent:
- Thank you for including a conflict of interest statement. Authors are encouraged to include this statement when submitting to a journal.
- Thank you for including a funding statement. Authors are encouraged to include this statement when submitting to a journal.
- No protocol registration statement was detected.