Identification of Risk Factors and Symptoms of COVID-19: Analysis of Biomedical Literature and Social Media Data


Abstract

In December 2019, the COVID-19 outbreak started in China and rapidly spread around the world. The lack of a vaccine or an optimized intervention made it important to characterize risk factors and symptoms for the early identification and successful treatment of patients with COVID-19.

Objective

This study aims to investigate and analyze biomedical literature and public social media data to understand the association of risk factors and symptoms with the various outcomes observed in patients with COVID-19.

Methods

Through semantic analysis, we collected 45 retrospective cohort studies, which evaluated 303 clinical and demographic variables across 13 different outcomes of patients with COVID-19, and 84,140 Twitter posts from 1036 COVID-19–positive users. We applied machine learning tools for biomedical information extraction to identify mentions of uncommon or novel symptoms in tweets. We then examined and compared the two data sets to broaden our picture of the risk factors and symptoms related to COVID-19.
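As a rough illustration of the symptom-mention step, the sketch below uses plain dictionary matching as a simple stand-in for the machine learning extraction tools described above. It is not the authors' pipeline: the lexicon, its terms, and the function name `find_symptoms` are all hypothetical.

```python
import re

# Hypothetical symptom lexicon; the terms are illustrative only and are not
# taken from the study.
SYMPTOM_LEXICON = {
    "fever": ["fever", "high temperature"],
    "cough": ["cough", "coughing"],
    "anosmia": ["loss of smell", "can't smell", "cannot smell"],
}


def find_symptoms(text, lexicon=SYMPTOM_LEXICON):
    """Return the sorted canonical symptoms whose phrases occur in `text`."""
    lowered = text.lower()
    found = set()
    for canonical, phrases in lexicon.items():
        for phrase in phrases:
            # Word boundaries keep a phrase from matching inside a longer word.
            if re.search(r"\b" + re.escape(phrase) + r"\b", lowered):
                found.add(canonical)
                break  # one matching phrase is enough for this symptom
    return sorted(found)
```

A trained entity recognizer could, in principle, also surface terms that are absent from any fixed lexicon, which a dictionary matcher like this one cannot do.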

Results

From the biomedical literature, approximately 90% of clinical and demographic variables showed inconsistent associations with COVID-19 outcomes. Consensus analysis identified 72 risk factors that were specifically associated with individual outcomes. From the social media data, 51 symptoms were characterized and analyzed. By comparing social media data with biomedical literature, we identified 25 novel symptoms that were specifically mentioned in tweets but had not previously been well characterized. Furthermore, certain combinations of symptoms were frequently mentioned together in social media.
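One simple way to surface symptoms that are mentioned together is to count, for each pair of symptoms, how many users mention both. The sketch below assumes each user has already been reduced to a list of symptom labels. The function name `count_symptom_pairs` and the sample data are hypothetical, and this is not necessarily the analysis the authors performed.

```python
from collections import Counter
from itertools import combinations


def count_symptom_pairs(symptoms_per_user):
    """Count how many users mention each unordered pair of symptoms."""
    pair_counts = Counter()
    for symptoms in symptoms_per_user:
        # Deduplicate and sort so each pair is counted once per user and
        # always appears in the same (alphabetical) order.
        unique = sorted(set(symptoms))
        pair_counts.update(combinations(unique, 2))
    return pair_counts


# Made-up example input: one list of symptom labels per user.
example_users = [
    ["fever", "cough"],
    ["cough", "fever", "fatigue"],
    ["fatigue"],
]
example_counts = count_symptom_pairs(example_users)
```

Calling `example_counts.most_common()` would then list the pairs from most to least frequently co-mentioned.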

Conclusions

Identified outcome-specific risk factors, symptoms, and combinations of symptoms may serve as surrogate indicators to identify patients with COVID-19 and predict their clinical outcomes in order to provide appropriate treatments.

Article activity feed

  1. SciScore for 10.1101/2020.05.17.20104729:

    Please note, not all rigor criteria are appropriate for all manuscripts.

    Table 1: Rigor

    Institutional Review Board Statement: not detected.
    Randomization: not detected.
    Blinding: not detected.
    Power Analysis: not detected.
    Sex as a biological variable: In terms of sex, when there were more males compared to females, we assumed there was a positive association with an outcome based on the case studies of sex and age of COVID-19 patients in Italy [8] and New York City [9] (as of April 14, 2020).

    Table 2: Resources

    Software and Algorithms
    Sentence: Scispacy is a Python package to handle scientific document and extracts medical and clinical terminology [7].
    Resource: Python; suggested: (IPython, RRID:SCR_001658)
    Sentence: The model was trained on publicly available domain-specific corpus of medical notes which consists of 1,500 PubMed articles with over 10,000 disease and related chemical terms.
    Resource: PubMed; suggested: (PubMed, RRID:SCR_004846)

    Results from OddPub: We did not detect open data. We also did not detect open code. Researchers are encouraged to share open data when possible (see Nature blog).


    Results from LimitationRecognizer: We detected the following sentences addressing limitations in the study:
    One of the limitations of our study is the self-reported nature of social media data and the lack of more detailed information from the patients. We observed that 55% of social media users who were positive for COVID-19 mentioned symptoms, and 8% mentioned potential comorbidities. Thus, only 63% of social media users indicated any form of COVID-19 conditions, which means that at least 37% of users could be either false positives (they were not COVID-19 positive users) or asymptomatic patients. Alternatively, it is possible that we have not captured all of the COVID-19 positive patients in our social media collection due to the limited amount of keyword searches. Nevertheless, various articles have indicated that between 4% and 78% of all COVID-19 positive patients were asymptomatic [17], and this seems to vary widely based on age of patients, test location, and time of testing after infection [18–21]. Thus, our research is in line with other studies demonstrating the vast range of COVID-19 patients who show or report no symptoms. It should also be noted that Twitter was the source of social media data that we examined, and perhaps more symptoms would be discovered if we analyzed other various sources. Twitter does have a wide, representative user base around the world, and provides open source information that can be easily gathered, but future research could examine alternate social media sources. Although social media may lack depth of patient information, it provides an effectiv...

    Results from TrialIdentifier: No clinical trial numbers were referenced.


    Results from Barzooka: We did not find any issues relating to the usage of bar graphs.


    Results from JetFighter: We did not find any issues relating to colormaps.


    Results from rtransparent:
    • Thank you for including a conflict of interest statement. Authors are encouraged to include this statement when submitting to a journal.
    • Thank you for including a funding statement. Authors are encouraged to include this statement when submitting to a journal.
    • No protocol registration statement was detected.

    About SciScore

    SciScore is an automated tool designed to assist expert reviewers by finding and presenting formulaic information scattered throughout a paper in a standard, easy-to-digest format. SciScore checks for the presence and correctness of RRIDs (research resource identifiers), and for rigor criteria such as sex and investigator blinding. For details on the theoretical underpinning of rigor criteria and the tools shown here, including references cited, please follow this link.