Explainable machine learning models to understand determinants of COVID-19 mortality in the United States

Piyush Mathur
Tavpritesh Sethi
Anya Mathur
Kamal Maheshwari
Jacek B Cywinski
Ashish K Khanna
Simran Dua
Frank Papay

Abstract

Background

COVID-19 is now one of the leading causes of mortality among adults in the United States in 2020. Multiple epidemiological models have been built, often based on limited data, to understand the spread and impact of the pandemic. However, many geographic and local factors may have played an important role in higher morbidity and mortality in certain populations.

Objective

The goal of this study was to develop machine learning models to understand the relative association of socioeconomic, demographic, travel, and health care characteristics of different states across the United States with COVID-19 mortality.

Methods

Using multiple public data sets, 24 variables linked to COVID-19 disease were chosen to build the models. Two independent machine learning models using CatBoost regression and random forest were developed. SHAP feature importance and a Boruta algorithm were used to elucidate the relative importance of features on COVID-19 mortality in the United States.
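
SHAP attributions build on Shapley values from cooperative game theory. The sketch below is not the authors' code: it computes exact Shapley values for a toy linear model by enumerating feature coalitions, using only the Python standard library. The feature names echo the abstract, but the weights, inputs, and baseline are made-up assumptions, and the real SHAP library uses faster, model-specific algorithms rather than brute-force enumeration.

```python
from itertools import combinations
from math import factorial

# Hypothetical state-level features (names echo the abstract; all numbers are made up).
FEATURES = ["population_density", "nursing_home_beds", "foreign_travel"]


def toy_model(x):
    """Stand-in for a fitted regressor: a fixed linear function (illustrative only)."""
    weights = {"population_density": 2.0, "nursing_home_beds": 1.0, "foreign_travel": 0.5}
    return sum(weights[name] * x[name] for name in FEATURES)


def shapley_values(model, x, baseline):
    """Exact Shapley values; features outside a coalition take their baseline value."""
    n = len(FEATURES)

    def value(coalition):
        # Evaluate the model with coalition features from x and the rest from baseline.
        mixed = {name: (x[name] if name in coalition else baseline[name]) for name in FEATURES}
        return model(mixed)

    phi = {}
    for name in FEATURES:
        others = [f for f in FEATURES if f != name]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size: |S|! (n - |S| - 1)! / n!
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                with_feature = value(set(subset) | {name})
                without_feature = value(set(subset))
                total += weight * (with_feature - without_feature)
        phi[name] = total
    return phi


x = {"population_density": 3.0, "nursing_home_beds": 2.0, "foreign_travel": 4.0}
baseline = {"population_density": 1.0, "nursing_home_beds": 1.0, "foreign_travel": 1.0}
phi = shapley_values(toy_model, x, baseline)
```

For a linear model, each attribution reduces to the weight times the feature's distance from the baseline, and the attributions sum to the difference between the prediction at `x` and the prediction at `baseline`. Ranking features by the magnitude of such attributions, averaged over many observations, is the general idea behind a SHAP feature importance plot.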

Results

Feature importances from both models, CatBoost and random forest, consistently showed that high population density, the number of nursing homes, the number of nursing home beds, and foreign travel were the strongest predictors of COVID-19 mortality. The percentage of African Americans in the population was also of high importance in predicting COVID-19 mortality, whereas the racial majority (primarily Caucasian) was not. Both models fitted the data well, with training R² values of 0.99 and 0.88, respectively. The effects of median age, median income, climate, and disease mitigation measures on COVID-19-related mortality remained unclear.
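
For reference, R² (the coefficient of determination) is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below computes it in plain Python on made-up numbers, not the study's data. A training R² is computed on the same observations the model was fitted to, so it describes goodness of fit rather than performance on unseen data.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((yt - mean_y) ** 2 for yt in y_true)               # total sum of squares
    return 1.0 - ss_res / ss_tot


# Made-up observed and predicted values, for illustration only.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 5.0]
score = r_squared(y_true, y_pred)
```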

Conclusions

COVID-19 policy making will need to take population density, pre-existing medical care and state travel policies into account. Our models identified and quantified the relative importance of each of these for mortality predictions using machine learning.

ScreenIT
Mar 1, 2021
SciScore for 10.1101/2020.05.23.20110189:
Please note, not all rigor criteria are appropriate for all manuscripts.
Table 1: Rigor
NIH rigor criteria are not applicable to paper type.
Table 2: Resources
No key resources detected.
Results from OddPub: We did not detect open data. We also did not detect open code. Researchers are encouraged to share open data when possible (see Nature blog).
Results from LimitationRecognizer: An explicit section about the limitations of the techniques employed in this study was not found. We encourage authors to address study limitations.
Results from TrialIdentifier: No clinical trial numbers were referenced.
Results from Barzooka: We did not find any issues relating to the usage of bar graphs.
Results from JetFighter: We did not find any issues relating to colormaps.
Results from rtransparent:
- Thank you for including a conflict of interest statement. Authors are encouraged to include this statement when submitting to a journal.
- Thank you for including a funding statement. Authors are encouraged to include this statement when submitting to a journal.
- No protocol registration statement was detected.
About SciScore
SciScore is an automated tool that is designed to assist expert reviewers by finding and presenting formulaic information scattered throughout a paper in a standard, easy to digest format. SciScore checks for the presence and correctness of RRIDs (research resource identifiers), and for rigor criteria such as sex and investigator blinding. For details on the theoretical underpinning of rigor criteria and the tools shown here, including references cited, please follow this link.
Version published to 10.1101/2020.05.23.20110189 on medRxiv
May 26, 2020
