Sexually Transmitted Disease–Related Reddit Posts During the COVID-19 Pandemic: Latent Dirichlet Allocation Analysis

Abstract

Sexually transmitted diseases (STDs) are common and costly, impacting approximately 1 in 5 people annually. Reddit, the sixth most used internet site in the world, is a user-generated social media discussion platform that may be useful in monitoring discussion about STD symptoms and exposure.

Objective

This study sought to identify patterns in, and gain insights into, STD-related discussions on Reddit over the course of the COVID-19 pandemic.

Methods

We extracted posts from Reddit from March 2019 through July 2021. We used a topic modeling method, Latent Dirichlet Allocation, to identify the most common topics discussed in the Reddit posts. We then used word clouds, qualitative topic labeling, and spline regression to characterize the content and distribution of the topics observed.
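
As an illustration of the topic modeling step, the following is a minimal sketch of Latent Dirichlet Allocation fitted by collapsed Gibbs sampling in plain Python. It is not the authors' pipeline: the abstract does not state which implementation or hyperparameters were used, so the function names (`lda_gibbs`, `top_words`), the values of `alpha` and `beta`, and the toy tokenized posts are all assumptions made for the example.

```python
import random


def lda_gibbs(docs, n_topics, n_iter=100, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA; `docs` is a list of token lists."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)
    # Count tables: tokens per (document, topic), per (topic, word), per topic.
    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [[0] * n_words for _ in range(n_topics)]
    topic_total = [0] * n_topics
    # Start from a random topic assignment for every token.
    assignments = []
    for d, doc in enumerate(docs):
        row = []
        for w in doc:
            k = rng.randrange(n_topics)
            row.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(row)
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # Unnormalized conditional probability of each topic.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta)
                    / (topic_total[t] + n_words * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1
    # Smoothed topic-word distributions; each row sums to 1.
    phi = [
        [(topic_word[k][v] + beta) / (topic_total[k] + n_words * beta)
         for v in range(n_words)]
        for k in range(n_topics)
    ]
    return vocab, phi, doc_topic


def top_words(vocab, phi, n=3):
    """Return the n highest-probability words for each topic."""
    return [
        [vocab[v] for v in sorted(range(len(vocab)), key=lambda v: -row[v])[:n]]
        for row in phi
    ]


# Hypothetical, already-tokenized posts (not taken from the study data).
toy_docs = [
    ["herpes", "outbreak", "symptoms", "herpes"],
    ["testing", "clinic", "results", "testing"],
    ["herpes", "symptoms", "outbreak"],
    ["clinic", "testing", "results", "negative"],
]
vocab, phi, doc_topic = lda_gibbs(toy_docs, n_topics=2)
for k, words in enumerate(top_words(vocab, phi)):
    print(f"topic {k}: {', '.join(words)}")
```

The top words per topic are what a qualitative labeling step would read. On a real corpus, an established library implementation would normally be preferred over a hand-written sampler.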

Results

Our extraction resulted in 24,311 total posts. Latent Dirichlet Allocation topic modeling showed that with 8 topics for each time period, we achieved high coherence values (pre–COVID-19=0.41, prevaccination=0.42, and postvaccination=0.44). Although most topic categories remained the same over time, the relative proportion of topics changed and new topics emerged. Spline regression revealed that, for some key terms, the percentage of posts varied in ways that coincided with the pre–COVID-19 and post–COVID-19 periods, whereas for other terms it was uniform across the study periods.
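
The quantity tracked for each key term, the percentage of posts mentioning it over time, can be sketched as follows. This is only an assumption about how such a series might be built: the monthly grouping, the case-insensitive substring matching, the function name `monthly_term_share`, and the toy posts are all hypothetical.

```python
from collections import defaultdict
from datetime import date


def monthly_term_share(posts, terms):
    """Return {(year, month): percent of posts mentioning any of `terms`}."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    lowered = [t.lower() for t in terms]
    for posted_on, text in posts:
        key = (posted_on.year, posted_on.month)
        totals[key] += 1
        body = text.lower()
        # Crude substring match; a real pipeline would tokenize first.
        if any(t in body for t in lowered):
            hits[key] += 1
    return {key: 100.0 * hits[key] / totals[key] for key in sorted(totals)}


# Hypothetical posts as (date, text) pairs, not taken from the study data.
toy_posts = [
    (date(2020, 3, 2), "Worried about chlamydia symptoms"),
    (date(2020, 3, 15), "Where can I get tested?"),
    (date(2020, 4, 1), "Chlamydia test came back positive"),
]
share = monthly_term_share(toy_posts, ["chlamydia"])
print(share)
```

A series like `share` is the kind of input a spline regression over time would smooth.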

Conclusions

Our study’s use of Reddit is a novel way to gain insights into STD symptoms experienced, potential exposures, testing decisions, common questions, and behavior patterns (eg, during lockdown periods). For example, a reduction in STD screening may result in negative health outcomes due to missed cases, which also affects onward transmission. As Reddit use is anonymous, users may discuss sensitive topics in greater detail and more freely than in clinical encounters. Data from anonymous Reddit posts may be leveraged to enhance the understanding of the distribution of disease and the need for targeted outreach or screening programs. This study provides evidence that Reddit is a feasible and useful source for enhancing the understanding of sexual behaviors, STD experiences, and needed health engagement with the public.

Article activity feed

  1. SciScore for 10.1101/2022.02.13.22270890:

    Please note, not all rigor criteria are appropriate for all manuscripts.

    Table 1: Rigor

    NIH rigor criteria are not applicable to paper type.

    Table 2: Resources

    Software and Algorithms
    Sentence: The pushshift.io Reddit API was used for searching Reddit comments and submissions.[11] Reddit’s official API (Reddit 2021) was used to collect posts and associated metadata (date) from r/STD and r/sexualhealth from March 2019 to July 2021 resulting in 24,311 posts.[10] Only English posts were included in the analysis.
    Resource: Reddit; suggested: (reddit, RRID:SCR_011983)

    Sentence: After data preprocessing was complete, each string was passed to the WordCloud function in Python to generate a wordcloud.[18] For WordCloud visualization, we chose three etiologic terms (chlamydia, gonorrhea, syphilis) and three of the most common terminologies from topic search: herpes/HSV/HPV (as a single topic, due to correlations), diagnosis/testing, STI/STD.
    Resource: Python; suggested: (IPython, RRID:SCR_001658)

    Sentence: The plots were created using ggplot2 package in R.[19] For spline regression, we used cubic B-spline basis with two boundary knots and one interior knot placed at the median of the observed data values.
    Resource: ggplot2; suggested: (ggplot2, RRID:SCR_014601)
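
The spline setup quoted above (a cubic B-spline basis with two boundary knots and one interior knot at the median) can be sketched with the Cox-de Boor recursion. This is an illustration under stated assumptions, not the authors' code: they worked in R and do not name the function used, the helper names `cubic_knots` and `bspline_basis` are hypothetical, and the predictor is assumed here to be a month index running from 0 (March 2019) to 28 (July 2021).

```python
from statistics import median


def cubic_knots(values):
    """Knot vector for a cubic basis: repeated boundary knots, median inside."""
    lo, hi, mid = min(values), max(values), median(values)
    return [lo] * 4 + [mid] + [hi] * 4


def bspline_basis(x, knots, degree=3):
    """Evaluate every B-spline basis function at x (Cox-de Boor recursion)."""
    # Degree-0 functions are indicators of half-open knot intervals.
    basis = [
        1.0 if knots[i] <= x < knots[i + 1] else 0.0
        for i in range(len(knots) - 1)
    ]
    # Close the last non-empty interval so x equal to the upper boundary counts.
    if x == knots[-1]:
        last = max(i for i in range(len(knots) - 1) if knots[i] < knots[i + 1])
        basis[last] = 1.0
    for d in range(1, degree + 1):
        nxt = []
        for i in range(len(knots) - d - 1):
            left_den = knots[i + d] - knots[i]
            right_den = knots[i + d + 1] - knots[i + 1]
            # Terms with a zero denominator (repeated knots) are defined as 0.
            left = (x - knots[i]) / left_den * basis[i] if left_den > 0 else 0.0
            right = (
                (knots[i + d + 1] - x) / right_den * basis[i + 1]
                if right_den > 0
                else 0.0
            )
            nxt.append(left + right)
        basis = nxt
    return basis


# Assumed predictor: month index, 0 = March 2019 through 28 = July 2021.
months = list(range(29))
knots = cubic_knots(months)
design = [bspline_basis(m, knots) for m in months]
print(len(design), "rows x", len(design[0]), "basis columns")
```

Each row of `design` holds the five basis values for one month. Regressing a monthly percentage on these columns by least squares would give the fitted spline curve; that fitting step is left out of the sketch.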

    Results from OddPub: We did not detect open data. We also did not detect open code. Researchers are encouraged to share open data when possible (see Nature blog).


    Results from LimitationRecognizer: An explicit section about the limitations of the techniques employed in this study was not found. We encourage authors to address study limitations.

    Results from TrialIdentifier: No clinical trial numbers were referenced.


    Results from Barzooka: We did not find any issues relating to the usage of bar graphs.


    Results from JetFighter: We did not find any issues relating to colormaps.


    Results from rtransparent:
    • Thank you for including a conflict of interest statement. Authors are encouraged to include this statement when submitting to a journal.
    • Thank you for including a funding statement. Authors are encouraged to include this statement when submitting to a journal.
    • No protocol registration statement was detected.

    Results from scite Reference Check: We found no unreliable references.


    About SciScore

    SciScore is an automated tool designed to assist expert reviewers by finding and presenting formulaic information scattered throughout a paper in a standard, easy-to-digest format. SciScore checks for the presence and correctness of RRIDs (research resource identifiers) and for rigor criteria such as sex and investigator blinding.