Diagnostic Accuracy of Web-Based COVID-19 Symptom Checkers: Comparison Study

Nicolas Munsch
Alistair Martin
Stefanie Gruarin
Jama Nateqi
Isselmou Abdarahmane
Rafael Weingartner-Ortner
Bernhard Knapp

This article has been Reviewed by the following groups

Read the full article

Discuss this preprint

Start a discussion What are Sciety discussions?

Listed in

Evaluated articles (ScreenIT)

Abstract

A large number of web-based COVID-19 symptom checkers and chatbots have been developed; however, anecdotal evidence suggests that their conclusions are highly variable. To our knowledge, no study has evaluated the accuracy of COVID-19 symptom checkers in a statistically rigorous manner.

Objective

The aim of this study is to evaluate and compare the diagnostic accuracies of web-based COVID-19 symptom checkers.

Methods

We identified 10 web-based COVID-19 symptom checkers, all of which were included in the study. We evaluated the COVID-19 symptom checkers by assessing 50 COVID-19 case reports alongside 410 non–COVID-19 control cases. A bootstrapping method was used to counter the unbalanced sample sizes and obtain confidence intervals (CIs). Results are reported as sensitivity, specificity, F1 score, and Matthews correlation coefficient (MCC).

Results

The classification task between COVID-19–positive and COVID-19–negative for “high risk” cases among the 460 test cases yielded (sorted by F1 score): Symptoma (F1=0.92, MCC=0.85), Infermedica (F1=0.80, MCC=0.61), US Centers for Disease Control and Prevention (CDC) (F1=0.71, MCC=0.30), Babylon (F1=0.70, MCC=0.29), Cleveland Clinic (F1=0.40, MCC=0.07), Providence (F1=0.40, MCC=0.05), Apple (F1=0.29, MCC=-0.10), Docyet (F1=0.27, MCC=0.29), Ada (F1=0.24, MCC=0.27) and Your.MD (F1=0.24, MCC=0.27). For “high risk” and “medium risk” combined the performance was: Symptoma (F1=0.91, MCC=0.83) Infermedica (F1=0.80, MCC=0.61), Cleveland Clinic (F1=0.76, MCC=0.47), Providence (F1=0.75, MCC=0.45), Your.MD (F1=0.72, MCC=0.33), CDC (F1=0.71, MCC=0.30), Babylon (F1=0.70, MCC=0.29), Apple (F1=0.70, MCC=0.25), Ada (F1=0.42, MCC=0.03), and Docyet (F1=0.27, MCC=0.29).

Conclusions

We found that the number of correctly assessed COVID-19 and control cases varies considerably between symptom checkers, with different symptom checkers showing different strengths with respect to sensitivity and specificity. A good balance between sensitivity and specificity was only achieved by two symptom checkers.

Version published to 10.2196/21299
Oct 6, 2020
Version published to 10.2196/preprints.21299
Jun 15, 2020
ScreenIT
Jun 9, 2020
SciScore for 10.1101/2020.05.22.20109777: (What is this?)
Please note, not all rigor criteria are appropriate for all manuscripts.
Table 1: Rigor
Institutional Review Board Statement not detected.
Randomization not detected.
Blinding not detected.
Power Analysis not detected.
Sex as a biological variable not detected.
Table 2: Resources
No key resources detected.
Results from OddPub: We did not detect open data. We also did not detect open code. Researchers are encouraged to share open data when possible (see Nature blog).
Results from LimitationRecognizer: An explicit section about the limitations of the techniques employed in this study was not found. We encourage authors to address study limitations.
Results from TrialIdentifier: No clinical trial numbers were referenced.
Results from Barzooka: We found bar graphs of continuous data. We recommend …
SciScore for 10.1101/2020.05.22.20109777: (What is this?)
Please note, not all rigor criteria are appropriate for all manuscripts.
Table 1: Rigor
Institutional Review Board Statement not detected.
Randomization not detected.
Blinding not detected.
Power Analysis not detected.
Sex as a biological variable not detected.
Table 2: Resources
No key resources detected.
Results from OddPub: We did not detect open data. We also did not detect open code. Researchers are encouraged to share open data when possible (see Nature blog).
Results from LimitationRecognizer: An explicit section about the limitations of the techniques employed in this study was not found. We encourage authors to address study limitations.
Results from TrialIdentifier: No clinical trial numbers were referenced.
Results from Barzooka: We found bar graphs of continuous data. We recommend replacing bar graphs with more informative graphics, as many different datasets can lead to the same bar graph. The actual data may suggest different conclusions from the summary statistics. For more information, please see Weissgerber et al (2015).
Results from JetFighter: We did not find any issues relating to colormaps.
Results from rtransparent:
Thank you for including a conflict of interest statement. Authors are encouraged to include this statement when submitting to a journal.
Thank you for including a funding statement. Authors are encouraged to include this statement when submitting to a journal.
No protocol registration statement was detected.
About SciScore
SciScore is an automated tool that is designed to assist expert reviewers by finding and presenting formulaic information scattered throughout a paper in a standard, easy to digest format. SciScore checks for the presence and correctness of RRIDs (research resource identifiers), and for rigor criteria such as sex and investigator blinding. For details on the theoretical underpinning of rigor criteria and the tools shown here, including references cited, please follow this link.
Read the original source
Version published to 10.1101/2020.05.22.20109777 on medRxiv
May 26, 2020

ReviewAid: An Open-Source Tool for Efficient PICO-Based Screening and Data Extraction in Systematic Reviews

This article has 2 authors:
1. Vihaan Sahu
2. Mohith Balakrishnan
This article has no evaluationsLatest version Jan 5, 2026
Classifying 25 Misinterpretations of Statistical Tests: A Comparison of Six Large Language Models

This article has 3 authors:
1. Alessandro Rovetta
2. Lucia Castaldo
3. Mohammad Ali Mansournia
This article has no evaluationsLatest version Jan 20, 2026
A Hybrid Pharmacovigilance Method for National-Scale Comorbidity Discovery: Association Rules with FDA-Approved PRR/Chi-square and EBGM Validation.

This article has 1 author:
1. Kaossara Osseni
This article has no evaluationsLatest version Dec 24, 2025

Institutional Review Board Statement	not detected.
Randomization	not detected.
Blinding	not detected.
Power Analysis	not detected.
Sex as a biological variable	not detected.

This article has been Reviewed by the following groups

Discuss this preprint

Listed in

Abstract

Objective

Methods

Results

Conclusions

Article activity feed

Related articles

ReviewAid: An Open-Source Tool for Efficient PICO-Based Screening and Data Extraction in Systematic Reviews

Classifying 25 Misinterpretations of Statistical Tests: A Comparison of Six Large Language Models

A Hybrid Pharmacovigilance Method for National-Scale Comorbidity Discovery: Association Rules with FDA-Approved PRR/Chi-square and EBGM Validation.