Interactive Evaluation of an Adaptive-Questioning Symptom Checker Using Standardized Clinical Vignettes

Abstract

Objective

To evaluate the triage performance and history-taking quality of an adaptive-questioning symptom checker (CareRoute) using an interactive protocol on standardized clinical vignettes (Semigran et al., BMJ 2015; 45 cases).

Methods

Each session began with only the presenting complaint; CareRoute asked follow-up questions adaptively, and the evaluator answered concisely per the vignette. At the end of questioning, CareRoute issued a triage recommendation. We compared CareRoute’s issued triage with the reference triage and computed history-taking quality from normalized features derived from each vignette’s Condensed Format. History-taking quality comprised (i) elicitation coverage—the percentage of a vignette’s normalized features obtained through questioning, and (ii) elicitation fraction—the proportion of surfaced normalized features (elicited or volunteered) that were obtained through questioning. Primary outcomes were triage concordance and history-taking quality; the secondary outcome was user burden (time spent answering questions). We did not evaluate possible diagnoses, though CareRoute issues them.
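
A minimal illustration of these two metrics, assuming each vignette's normalized features are represented as a set of strings (all names below are illustrative, not taken from the CareRoute implementation):

```python
# Illustrative sketch: elicitation coverage and elicitation fraction for one vignette.
def history_taking_metrics(vignette_features, elicited, volunteered):
    """vignette_features: all normalized features in the vignette's Condensed Format.
    elicited: features obtained through the symptom checker's questioning.
    volunteered: features surfaced without being asked (e.g., in the presenting complaint).
    """
    vignette_features = set(vignette_features)
    elicited = set(elicited) & vignette_features
    surfaced = elicited | (set(volunteered) & vignette_features)

    # Elicitation coverage: share of the vignette's features obtained via questioning.
    coverage = len(elicited) / len(vignette_features) if vignette_features else 0.0
    # Elicitation fraction: among surfaced features, share obtained via questioning.
    fraction = len(elicited) / len(surfaced) if surfaced else 0.0
    return coverage, fraction

# Hypothetical example: 10 features, 6 elicited, 2 volunteered but never asked about.
cov, frac = history_taking_metrics(
    vignette_features=[f"feature_{i}" for i in range(10)],
    elicited=[f"feature_{i}" for i in range(6)],
    volunteered=["feature_6", "feature_7"],
)
print(f"coverage={cov:.0%}, fraction={frac:.0%}")  # coverage=60%, fraction=75%
```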

Results

Exact 3-tier triage concordance was 88.9% (40/45; 95% CI 76.5–95.2%). Elicitation coverage had a median of 67% (IQR 60–71%), and elicitation fraction had a median of 70% (IQR 62–75%). CareRoute asked a median of 19 questions overall (IQR 16–20), with urgency-conditioned questioning: Emergency Care median 10 questions (IQR 4–14), Doctor Visit median 19 questions (IQR 18–20), Self Care median 19 questions (IQR 17–20).

Conclusions

In an interactive, vignette-constrained evaluation starting from only the presenting complaint, CareRoute achieved high 3-tier triage concordance (88.9%) with no under-triage on Emergency-reference vignettes, while eliciting most normalized features (median elicitation coverage 67%; median elicitation fraction 70%) with acceptable user burden via urgency-conditioned questioning.

Article activity feed

  1. This Zenodo record is a permanently preserved version of a PREreview. You can view the complete PREreview at https://prereview.org/reviews/17298288.

    This review is the result of a virtual, collaborative live review discussion organized and hosted by PREreview and JMIR Publications on September 18, 2025. The discussion was joined by 18 people: 2 facilitators from the PREreview Team, 1 member of the JMIR Publications team, 1 author, and 14 live review participants. The authors of this review have dedicated additional asynchronous time over the course of two weeks to help compose this final report using the notes from the Live Review. We thank all participants who contributed to the discussion and made it possible for us to provide feedback on this preprint.

    Summary

    Artificial Intelligence (AI) is rapidly transforming healthcare and is integrated into many clinical applications, including symptom checkers that help users make informed care decisions. The study aimed to evaluate the triage performance and history-taking quality of an adaptive-questioning symptom checker called CareRoute, which is designed to help improve health outcomes and reduce healthcare costs. The study has three objectives: first, to evaluate CareRoute's triage accuracy and safety using an interactive protocol that begins with only the presenting complaint; second, to evaluate CareRoute's ability to elicit key clinical features through adaptive questioning; and third, to establish a reproducible methodology for evaluating the history-taking quality of symptom checkers.

    With the use of 45 standardized clinical vignettes (Semigran set, BMJ 2015), the authors compared the platform's triage recommendations against reference standards and introduced reproducible metrics to assess history-taking quality. A physician evaluator answered CareRoute's follow-up questions. To measure the quality of history-taking, the authors introduced two new metrics: elicitation coverage and elicitation fraction. They also recorded the duration of each session and the number of questions asked. The results showed that CareRoute matched expert triage decisions in 88.9% of cases, correctly identified all emergencies with no under-triage, and used urgency-aware questioning to remain efficient. Emergency cases required fewer questions and less time, while doctor visits and self-care cases involved longer interactions. 

    In summary, the findings show that CareRoute performed strongly and highlight the importance of measuring history-taking quality when evaluating symptom checkers. The study is timely given the rapid rise of digital health tools, and it makes a valuable contribution by proposing a reproducible framework for evaluating adaptive-questioning tools and benchmarking future digital health applications. However, reliance on a single evaluator and a modest vignette sample size limit generalizability and may not fully reflect broader real-world use. Further work is needed to validate the results across users and healthcare contexts.

    List of major concerns and feedback

    1. All the evaluation questions were answered by the same physician (marked as PM in the preprint), who is also one of the co-founders of the CareRoute app and therefore highly familiar with its functionality. This may introduce a positive bias. One of the major questions we are left with is whether the results might have differed if additional or independent physicians had been involved in the evaluation. The authors could include additional independent evaluators or add anonymized assessments to reduce this bias.

    2. The statistical methods are not fully reported: the tools used, any thresholds applied, and how the confidence intervals were computed are not stated, which makes it difficult for others to assess or reproduce the analysis. More statistical transparency is recommended (see the illustrative sketch below).
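
      For context, the abstract reports a 95% CI of 76.5–95.2% for the 40/45 exact triage concordance; this appears consistent with a Wilson score interval, as the sketch below illustrates, but the manuscript should state the method explicitly. (The code is a generic illustration, not taken from the preprint.)

      ```python
      # Illustrative check: a Wilson score interval for 40/45 reproduces 76.5%-95.2%.
      from math import sqrt

      def wilson_ci(successes, n, z=1.96):
          """Two-sided Wilson score interval for a binomial proportion."""
          p = successes / n
          center = (p + z**2 / (2 * n)) / (1 + z**2 / n)
          half_width = (z / (1 + z**2 / n)) * sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
          return center - half_width, center + half_width

      lo, hi = wilson_ci(40, 45)
      print(f"{lo:.1%} - {hi:.1%}")  # 76.5% - 95.2%
      ```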

    3. Please compare the proposed metrics of elicitation coverage and elicitation fraction with metrics proposed by other authors, such as recall rate and efficiency rate. (Ben-Shabat, N., Sharvit, G., Meimis, B., Ben Joya, D., Sloma, A., Kiderman, D., Shabat, A., Tsur, A. M., Watad, A., & Amital, H. (2022). Assessing data gathering of chatbot-based symptom checkers - a clinical vignettes study. International journal of medical informatics, 168, 104897. https://doi.org/10.1016/j.ijmedinf.2022.104897)

    4. Please consider discussing ethical issues such as: 

      • What influence could automated triage have on real-world healthcare systems? Could it replace humans? Could it misguide patients?

      • Will healthcare systems need to adapt to triage and history-taking apps? In what ways would healthcare systems need to adapt to implement triage / history-taking applications successfully?

      • Could CareRoute increase or reduce the digital divide? Accessibility and inclusion: there is no mention of how accessible the tool is to people with low health literacy, disabilities, or language barriers.

      • How practical is it for a person experiencing an emergency condition to interact with the CareRoute app?

      • AI transparency: no mention is made of how CareRoute arrives at its triage conclusions.

    5. Page 3, 1st paragraph: the preprint states that "CareRoute provides four triage levels (Emergency Care, Urgent Care, Doctor Visit, Self Care), but our analysis uses a conservative 3-tier mapping that collapses Urgent Care to Doctor Visit." It is not clear why this mapping was applied; please clarify, as it could explain differences from the original results of Semigran et al. (2015). A minimal sketch of the mapping as described is shown below.
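
      A minimal sketch of the described mapping, assuming the four tier names quoted above (the code itself is our illustration, not from the preprint):

      ```python
      # Collapse CareRoute's 4 triage levels into the 3-tier scheme used for scoring:
      # Urgent Care is mapped onto Doctor Visit; the other levels are unchanged.
      THREE_TIER_MAP = {
          "Emergency Care": "Emergency Care",
          "Urgent Care": "Doctor Visit",   # the collapse whose rationale we ask the authors to explain
          "Doctor Visit": "Doctor Visit",
          "Self Care": "Self Care",
      }

      def to_three_tier(careroute_level: str) -> str:
          return THREE_TIER_MAP[careroute_level]

      assert to_three_tier("Urgent Care") == "Doctor Visit"
      ```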

    6. You may consider elaborating on how the internal validity of the vignette data (e.g., its relevance, reliability, effectiveness, and completeness) was strengthened. Clarifying this would emphasize the role of the review process in shaping the quality of the data collected and analyzed. For instance, it could be helpful to describe any systematic methods applied to prevent data saturation, any techniques used to identify or remove potentially biased elements from the vignettes, and the strategies used to enhance generalisability. Making these steps explicit would improve methodological transparency and strengthen the rigor and robustness of the findings. You might find the following reference useful in framing this discussion: Spalding NJ, Phillips T. (2007). Exploring the use of vignettes: from validity to trustworthiness. Qualitative Health Research, 17, 954–962. https://doi.org/10.1177/1049732307306187

    List of minor concerns and feedback

    1. The results can be hard to follow because of the limited number of visuals and tables. We recommend adding more visual summaries, as they would make the findings clearer and more engaging.

      • Section 2.2.1 "Normalized features: example mapping" could be visualized as a figure.

      • Consider changing "3.4 Case example: Kidney stones" into a figure.

    2. Some sentences in the methods and discussion are too long and a bit wordy. It is recommended to shorten them and add smoother transitions to make the manuscript more readable.

    3. Reference 11, "Evaluating the use of digital symptom checkers in primary care: A mixed-methods study," cannot be found online (e.g., via Google Scholar). Please check this reference.

      • There are similar articles: El-Osta, A., Webber, I., Alaa, A., Bagkeris, E., Mian, S., Taghavi Azar Sharabiani, M., & Majeed, A. (2022). What is the suitability of clinical vignettes in benchmarking the performance of online symptom checkers? An audit study. BMJ open, 12(4), e053566. https://doi.org/10.1136/bmjopen-2021-053566 

      • If generative AI was used in the process of writing or for any other component of the manuscript, please declare its use.

    4. Can the results be transferred to other countries or healthcare systems? Cultural and language bias should be considered or mentioned as a limitation; this is especially important for global implementation.

    Concluding remarks

    We thank the authors of the preprint for posting their work openly for feedback. We also thank all participants of the Live Review call for their time and for engaging in the lively discussion that generated this review.

    Competing interests

    The authors declare that they have no competing interests.

    Use of Artificial Intelligence (AI)

    The authors declare that they did not use generative AI to come up with new ideas for their review.