An objective framework for evaluating unrecognized bias in medical AI models predicting COVID-19 outcomes

Hossein Estiri
Zachary H Strasser
Sina Rashidian
Jeffrey G Klann
Kavishwar B Wagholikar
Thomas H McCoy
Shawn N Murphy

This article has been Reviewed by the following groups

Read the full article

Discuss this preprint

Start a discussion What are Sciety discussions?

Listed in

Evaluated articles (ScreenIT)

Abstract

Objective

The increasing translation of artificial intelligence (AI)/machine learning (ML) models into clinical practice brings an increased risk of direct harm from modeling bias; however, bias remains incompletely measured in many medical AI applications. This article aims to provide a framework for objective evaluation of medical AI from multiple aspects, focusing on binary classification models.

Materials and Methods

Using data from over 56 000 Mass General Brigham (MGB) patients with confirmed severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), we evaluate unrecognized bias in 4 AI models developed during the early months of the pandemic in Boston, Massachusetts that predict risks of hospital admission, ICU admission, mechanical ventilation, and death after a SARS-CoV-2 infection purely based on their pre-infection longitudinal medical records. Models were evaluated both retrospectively and prospectively using model-level metrics of discrimination, accuracy, and reliability, and a novel individual-level metric for error.

Results

We found inconsistent instances of model-level bias in the prediction models. From an individual-level aspect, however, we found most all models performing with slightly higher error rates for older patients.

Discussion

While a model can be biased against certain protected groups (ie, perform worse) in certain tasks, it can be at the same time biased towards another protected group (ie, perform better). As such, current bias evaluation studies may lack a full depiction of the variable effects of a model on its subpopulations.

Conclusion

Only a holistic evaluation, a diligent search for unrecognized bias, can provide enough information for an unbiased judgment of AI bias that can invigorate follow-up investigations on identifying the underlying roots of bias and ultimately make a change.

Version published to 10.1093/jamia/ocac070
May 12, 2022
ScreenIT
Nov 3, 2021
SciScore for 10.1101/2021.10.28.21265629: (What is this?)
Please note, not all rigor criteria are appropriate for all manuscripts.
Table 1: Rigor
Ethics not detected.
Sex as a biological variable not detected.
Randomization not detected.
Blinding not detected.
Power Analysis not detected.
Table 2: Resources
No key resources detected.
Results from OddPub: Thank you for sharing your code.
Results from LimitationRecognizer: An explicit section about the limitations of the techniques employed in this study was not found. We encourage authors to address study limitations.
Results from TrialIdentifier: No clinical trial numbers were referenced.
Results from Barzooka: We did not find any issues relating to the usage of bar graphs.
Results from JetFighter: We did not find any issues relating to colormaps.
Results from rtransparent:
- Thank you for including a …
SciScore for 10.1101/2021.10.28.21265629: (What is this?)
Please note, not all rigor criteria are appropriate for all manuscripts.
Table 1: Rigor
Ethics not detected.
Sex as a biological variable not detected.
Randomization not detected.
Blinding not detected.
Power Analysis not detected.
Table 2: Resources
No key resources detected.
Results from OddPub: Thank you for sharing your code.
Results from LimitationRecognizer: An explicit section about the limitations of the techniques employed in this study was not found. We encourage authors to address study limitations.
Results from TrialIdentifier: No clinical trial numbers were referenced.
Results from Barzooka: We did not find any issues relating to the usage of bar graphs.
Results from JetFighter: We did not find any issues relating to colormaps.
Results from rtransparent:
Thank you for including a conflict of interest statement. Authors are encouraged to include this statement when submitting to a journal.
Thank you for including a funding statement. Authors are encouraged to include this statement when submitting to a journal.
No protocol registration statement was detected.
Results from scite Reference Check: We found no unreliable references.
About SciScore
SciScore is an automated tool that is designed to assist expert reviewers by finding and presenting formulaic information scattered throughout a paper in a standard, easy to digest format. SciScore checks for the presence and correctness of RRIDs (research resource identifiers), and for rigor criteria such as sex and investigator blinding. For details on the theoretical underpinning of rigor criteria and the tools shown here, including references cited, please follow this link.
Read the original source
Version published to 10.1101/2021.10.28.21265629 on medRxiv
Oct 30, 2021

Responsible AI for Sepsis Prediction: Bridging the Gap Between Machine Learning Performance and Clinical Trust

This article has 6 authors:
1. Thiago Q. Oliveira
2. Leandro A. Carvalho
3. Flávio R. C. Sousa
4. João B. F. Filho
5. Khalil F. Oliveira
6. Daniel A. B. Tavares
This article has no evaluationsLatest version Jan 30, 2026
Comparative Evaluation of SHAP and LIME for Clinical Interpretability in Postoperative Cardiac Surgery Mortality Prediction Models

This article has 4 authors:
1. Telmo Miguel-Medina
2. Susel Góngora Alonso
3. Isabel de la Torre Díez
4. Mª Lourdes del Río Solá
This article has no evaluationsLatest version Dec 10, 2025
ICU Mortality and LOS Prediction Models Using MachineLearning Based on Both Real and Simulated Data

This article has 3 authors:
1. Girma Neshir Alemneh
2. Hirut Bekele Ashagrie
3. Lemlem Kassa Tegegne
This article has no evaluationsLatest version Jan 14, 2026

Ethics	not detected.
Sex as a biological variable	not detected.
Randomization	not detected.
Blinding	not detected.
Power Analysis	not detected.

This article has been Reviewed by the following groups

Discuss this preprint

Listed in

Abstract

Objective

Materials and Methods

Results

Discussion

Conclusion

Article activity feed

Related articles

Responsible AI for Sepsis Prediction: Bridging the Gap Between Machine Learning Performance and Clinical Trust

Comparative Evaluation of SHAP and LIME for Clinical Interpretability in Postoperative Cardiac Surgery Mortality Prediction Models

ICU Mortality and LOS Prediction Models Using MachineLearning Based on Both Real and Simulated Data