Influence of social determinants of health and county vaccination rates on machine learning models to predict COVID-19 case growth in Tennessee



Abstract

The SARS-CoV-2 (COVID-19) pandemic has exposed health disparities throughout the USA, particularly among racial and ethnic minorities. As a result, there is a need for data-driven approaches to pinpoint the unique constellation of clinical and social determinants of health (SDOH) risk factors that give rise to poor patient outcomes following infection in US communities.

Methods

We combined county-level COVID-19 testing data, COVID-19 vaccination rates and SDOH information in Tennessee. Between February and May 2021, we trained machine learning models on a semimonthly basis using these datasets to predict COVID-19 incidence in Tennessee counties. We then analyzed SDOH data features at each time point to rank the impact of each feature on model performance.

Results

Our results indicate that COVID-19 vaccination rates play a crucial role in determining future COVID-19 disease risk. Beginning in mid-March 2021, higher vaccination rates significantly correlated with lower COVID-19 case growth predictions. Further, as the relative importance of COVID-19 vaccination data features grew, the importance of demographic SDOH features such as age, race and ethnicity decreased, while the impact of socioeconomic and environmental factors, including access to healthcare and transportation, increased.

Conclusion

Incorporating a data framework to track the evolving patterns of community-level SDOH risk factors could provide policy-makers with additional data resources to improve health equity and resilience to future public health emergencies.

Article activity feed

  1. This Zenodo record is a permanently preserved version of a PREreview. You can view the complete PREreview at https://prereview.org/reviews/5551162.

    This review is the result of a virtual, live-streamed preprint journal club organized and hosted by PREreview and OHSU's BioData Club. The discussion was joined by 9 people, including OHSU researchers and the event organizing team.

     

    Wylezinski et al. investigated the impact of clinical and social determinants of health (SDOH) risk factors on COVID-19 case growth in Tennessee. To that end, they used a variety of publicly available data to train machine learning (ML) models to predict COVID-19 rates, and ranked each SDOH factor's impact on the models' performance. The study reports that COVID-19 case growth reflects disparities in socioeconomic, environmental, demographic, and health outcome factors (particularly mental health). Such approaches could benefit community, policy, and research responses to disease outbreaks, COVID-19, and health disparities more generally. Our group particularly appreciated the use of openly available datasets. We do, however, have major concerns about the SDOH risk factors used, and the description of the dataset and the ML models needs more detail. Major and minor concerns, as well as some suggestions on how to address them, are listed below.

     

    Major concerns and feedback

    1. The SDOH risk factors caused confusion because they were not clearly defined, and no explanation was given of how they were measured. Additionally, several distinct SDOH risk factors, such as race/ethnicity and poverty, were combined with no explanation. We believe that a table listing all SDOH risk factors, their definitions, their sources, and how they were measured would greatly help with understanding the results and their implications for policy.
    2. The description of the data is sparse. A descriptive table of the data used for training the models would be helpful. Also, the feature sets used for the ML models seem to be highly correlated, but there is no description of the methods used to account for such variables. Stating the results for each ML model used would be beneficial. Along the same lines, the manuscript mentions that multiple methods were used for training, but it is unclear which method is illustrated in the figures. We recommend stating that explicitly in the text and figure captions. While the methods section states that cross-validation and hold-out methods were used, there are no further details on how they were implemented. We recommend stating the cross-validation method used and the number of samples that were held out. We also suggest naming the software package used to implement the ML models and, if existing packages were used, citing the source.
    3. Some of the data are difficult to interpret (see minor comments below) and therefore it is hard to evaluate whether the conclusions are supported by the findings. Some conclusions seem overstated when correlating vaccination status and infection rates. We think it would be appropriate to present the conclusions with more transparency and acknowledgment regarding limitations. It would also be helpful to show that counties with low infection rates are color-coded for vaccination rates (complementary to Supplementary Figure 1C).
    4. There is not sufficient detail to allow reproduction and validation of the study without requesting more information from the authors, and it is unclear what qualifies as a "reasonable request". We would have found it helpful to see information on the software package used, input samples and summary statistics, input features used, etc.
    5. In Figure 1, the color coding and size are difficult to interpret. It would be useful to have the captions better explain how the reader is supposed to interpret the visualization. Related to this, it is unclear what the difference (conceptually) is between blue and black dots. Figure 1 could be reconfigured, maybe with the addition of numbers and/or a smaller summary figure, to better display data and consistency. 
    6. There are inconsistencies between the data presented in Figure 1 and the text. Specifically, the 'race and ethnicity' SDOH risk factor increases in the figure while the text states that it decreases, and 7 timepoints are shown in the figure while the text mentions that 13 timepoints were taken. We recommend resolving these inconsistencies for easier understanding.
    7. SDOH risk factors in Figure 1 could be correlated, but there seem to be no methods to control for this. It would be helpful to understand whether this possible confounding effect was controlled for.
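
    To illustrate the kind of reporting suggested in points 2 and 7, here is a minimal sketch, not the authors' method. It flags highly correlated feature pairs with a pairwise Pearson check and produces explicit k-fold train/test splits whose sizes can be reported. The feature names, the 0.8 threshold, the choice of 5 folds, and the synthetic values are all assumptions made for illustration; only the count of 95 Tennessee counties is factual.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # a constant feature has no defined correlation
    return cov / (sx * sy)


def correlated_pairs(features, threshold=0.8):
    """Return (name_a, name_b, r) for feature pairs with |r| >= threshold."""
    names = sorted(features)
    flagged = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(features[a], features[b])
            if abs(r) >= threshold:
                flagged.append((a, b, round(r, 3)))
    return flagged


def kfold_indices(n_samples, k=5, seed=0):
    """Shuffle sample indices and return k (train, test) index splits."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        test = sorted(folds[i])
        train = sorted(j for f in folds[:i] + folds[i + 1:] for j in f)
        splits.append((train, test))
    return splits


# Synthetic, hypothetical county-level features (one value per county).
rng = random.Random(42)
n_counties = 95  # Tennessee has 95 counties
poverty = [rng.uniform(5, 30) for _ in range(n_counties)]
features = {
    "poverty_rate": poverty,
    # Built to track poverty closely, so the pair should be flagged.
    "uninsured_rate": [0.5 * p + rng.gauss(0, 0.5) for p in poverty],
    "median_age": [rng.uniform(30, 50) for _ in range(n_counties)],
}

print("Highly correlated pairs:", correlated_pairs(features))
for fold, (train, test) in enumerate(kfold_indices(n_counties, k=5)):
    print(f"fold {fold}: {len(train)} train counties, {len(test)} held out")
```

    Reporting the flagged pairs alongside the fold sizes would let readers see which features overlap and exactly how many samples were held out in each evaluation.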

     

    Minor concerns and feedback

    1. It is unclear whether data and variable definitions were consistent across the datasets used. Any differences in data collection between samples may require controlling for site. It would be helpful to state in the methods section whether all features in Figure 1 were used in the ML model.
    2. It is unclear how county data was aggregated into groups. We recommend either being more descriptive or finding a simpler way to present some of the data (e.g., the top 5 counties with the highest increase or decrease in infection rates).
    3. The ethical concerns surrounding this study have not been adequately discussed. Specifically, it would be important to discuss and acknowledge the study questions and its methodology in respect to recommendations and discussions about public health research involving race (e.g., https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2837428). It seems unclear whether and how using multiple datasets reduces implicit bias from the researchers or from the datasets. It would be useful if there were a more extended explanation on the matter.
    4. Is there any reason why the Supplementary figures are not the main figures in the paper? We recommend moving them to the main paper as they present key information needed to interpret the results.
    5. In Supplementary Figure 2C, the legend reads as biased because it uses "Best" and "Worst" rather than the language used elsewhere, "Highest" and "Lowest". We recommend updating this.
    6. Supplementary Figure 2 could be replaced with a textual description, possibly with an added figure showing the vaccination rate for each SDOH risk factor used.
    7. We recommend stating study limitations clearly in the conclusions. For instance, the different coronavirus variants were not taken into account in the prediction, possibly because of a lack of data, yet those undoubtedly would impact infection rates. Similarly, there is no explanation for why the specific time frame right after the onset of vaccine rollout was chosen; this period of time is rather short and involves a relatively small amount of data. It would be important to reflect on how these choices might have impacted the predictive models.

     

    We thank the authors for posting this work as a preprint and hope our feedback will help improve the next version of the manuscript. 

     

    Acknowledgments 

    The organizing team is grateful to all the participants of the PREreview + BioData Club's Open Reviewers Workshop. We especially thank those who engaged in the live-streamed preprint journal club discussion held during our last module of the workshop. It was a pleasure to have such a lively group.

  2. SciScore for 10.1101/2021.07.28.21260814:

    Please note, not all rigor criteria are appropriate for all manuscripts.

    Table 1: Rigor

    NIH rigor criteria are not applicable to paper type.

    Table 2: Resources

    No key resources detected.


    Results from OddPub: We did not detect open data. We also did not detect open code. Researchers are encouraged to share open data when possible (see Nature blog).


    Results from LimitationRecognizer: An explicit section about the limitations of the techniques employed in this study was not found. We encourage authors to address study limitations.

    Results from TrialIdentifier: No clinical trial numbers were referenced.


    Results from Barzooka: We did not find any issues relating to the usage of bar graphs.


    Results from JetFighter: We did not find any issues relating to colormaps.


    Results from rtransparent:
    • Thank you for including a conflict of interest statement. Authors are encouraged to include this statement when submitting to a journal.
    • Thank you for including a funding statement. Authors are encouraged to include this statement when submitting to a journal.
    • No protocol registration statement was detected.

    Results from scite Reference Check: We found no unreliable references.


    About SciScore

    SciScore is an automated tool that is designed to assist expert reviewers by finding and presenting formulaic information scattered throughout a paper in a standard, easy to digest format. SciScore checks for the presence and correctness of RRIDs (research resource identifiers), and for rigor criteria such as sex and investigator blinding. For details on the theoretical underpinning of rigor criteria and the tools shown here, including references cited, please follow this link.