Predicting Emerging Themes in Rapidly Expanding COVID-19 Literature With Unsupervised Word Embeddings and Machine Learning: Evidence-Based Study

This article has been Reviewed by the following groups

Read the full article See related articles

Abstract

Evidence from peer-reviewed literature is the cornerstone for designing responses to global threats such as COVID-19. In massive and rapidly growing corpuses, such as COVID-19 publications, assimilating and synthesizing information is challenging. Leveraging a robust computational pipeline that evaluates multiple aspects, such as network topological features, communities, and their temporal trends, can make this process more efficient.

Objective

We aimed to show that new knowledge can be captured and tracked using the temporal change in the underlying unsupervised word embeddings of the literature. Further imminent themes can be predicted using machine learning on the evolving associations between words.

Methods

Frequently occurring medical entities were extracted from the abstracts of more than 150,000 COVID-19 articles published on the World Health Organization database, collected on a monthly interval starting from February 2020. Word embeddings trained on each month’s literature were used to construct networks of entities with cosine similarities as edge weights. Topological features of the subsequent month’s network were forecasted based on prior patterns, and new links were predicted using supervised machine learning. Community detection and alluvial diagrams were used to track biomedical themes that evolved over the months.

Results

We found that thromboembolic complications were detected as an emerging theme as early as August 2020. A shift toward the symptoms of long COVID complications was observed during March 2021, and neurological complications gained significance in June 2021. A prospective validation of the link prediction models achieved an area under the receiver operating characteristic curve of 0.87. Predictive modeling revealed predisposing conditions, symptoms, cross-infection, and neurological complications as dominant research themes in COVID-19 publications based on the patterns observed in previous months.

Conclusions

Machine learning–based prediction of emerging links can contribute toward steering research by capturing themes represented by groups of medical entities, based on patterns of semantic relationships over time.

Article activity feed

  1. SciScore for 10.1101/2021.01.14.21249855: (What is this?)

    Please note, not all rigor criteria are appropriate for all manuscripts.

    Table 1: Rigor

    NIH rigor criteria are not applicable to paper type.

    Table 2: Resources

    Software and Algorithms
    SentencesResources
    Word frequency distributions were visualized as chatter plots using ggplot2 package [7].
    ggplot2
    suggested: (ggplot2, RRID:SCR_014601)
    Nine proximity scores based upon network topology were computed using the NetworkX package[15] and a first order difference of these series were taken in order to capture temporal trends.
    NetworkX
    suggested: (NetworkX, RRID:SCR_016864)

    Results from OddPub: We did not detect open data. We also did not detect open code. Researchers are encouraged to share open data when possible (see Nature blog).


    Results from LimitationRecognizer: We detected the following sentences addressing limitations in the study:
    However, one limitation of the current tool is that the networks have been analysed with limited frequency of entities, primarily diseases. The future work in this direction can expand to include other medical entities such as genes, drugs or adverse drug effects. Advancement in the architecture of temporal link prediction can include larger data and complex models like RNN and LSTM to predict the links. We do not train deep learning models as the training data points are limited and these models would tend to overfit. However, the dashboard has surfaced as a great tool for a high-level study of the COVID-19 literature.

    Results from TrialIdentifier: No clinical trial numbers were referenced.


    Results from Barzooka: We did not find any issues relating to the usage of bar graphs.


    Results from JetFighter: We did not find any issues relating to colormaps.


    Results from rtransparent:
    • Thank you for including a conflict of interest statement. Authors are encouraged to include this statement when submitting to a journal.
    • Thank you for including a funding statement. Authors are encouraged to include this statement when submitting to a journal.
    • No protocol registration statement was detected.

    About SciScore

    SciScore is an automated tool that is designed to assist expert reviewers by finding and presenting formulaic information scattered throughout a paper in a standard, easy to digest format. SciScore checks for the presence and correctness of RRIDs (research resource identifiers), and for rigor criteria such as sex and investigator blinding. For details on the theoretical underpinning of rigor criteria and the tools shown here, including references cited, please follow this link.