Asking the Right Questions: Evaluating Diagnostic Dialogue with Q4Dx


Abstract

Physician–patient interactions about diagnosis and treatment typically begin with incomplete, uncertain, and noisy information about the patient's symptoms and medical history, communicated in non-specialist language. Physicians must therefore elicit critical information through an adaptive sequence of focused questions, each contingent on the patient's preceding responses, with the goal of reaching an accurate diagnosis while minimizing the number and length of questions posed. The same requirement applies to an AI assistant engaging with patients: the system must strategically choose its next questions so that it gathers the information needed for an accurate diagnosis efficiently.

To the best of our knowledge, no standardized benchmark currently exists for assessing the strategic inquiry capabilities of human or AI-assisted physicians. In this study, we present the Q4Dx benchmark to address this gap. The benchmark comprises synthetically generated patient narratives derived from curated symptom and disease data. To systematically evaluate performance under varying informational constraints, we generate multiple versions of each case, exposing 100%, 80%, and 50% of the relevant symptom information along with the ground-truth diagnosis.

We further simulate patient–physician interactions using GPT-4.1 and GPT-4o-mini agents to generate plausible sequences of clinician questions, patient responses, and intermediate diagnostic hypotheses. For each sequence, we measure the accuracy of the initial diagnostic hypothesis (Zero-shot Diagnostic Accuracy, ZDA) as well as task-specific metrics, including Mean Questions to Correct Diagnosis (MQD) and Inquiry Sequence Efficiency (ISE), which captures the rate of convergence to the correct diagnostic outcome.

Q4Dx provides a reusable framework for benchmarking large language models in goal-directed clinical dialogue and lays the groundwork for future AI-assisted diagnostic training tools. The dataset and benchmark are publicly available at: https://github.com/MaiWert/MedQDx.
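To make the three metrics concrete, the sketch below shows one plausible way they could be computed from simulated dialogues. This is an illustration, not the paper's implementation: the `Dialogue` data structure, the exact-match comparison of hypotheses, and in particular the 1/(1 + questions) form of ISE are assumptions made here for clarity, since the abstract only describes ISE as a rate-of-convergence measure.

```python
"""Illustrative sketch (not from the paper) of how Q4Dx-style metrics
might be computed over a set of simulated physician-patient dialogues.
Field names and the ISE formula are assumptions for illustration."""

from dataclasses import dataclass
from typing import List


@dataclass
class Dialogue:
    # Diagnostic hypothesis recorded after each question round;
    # index 0 is the zero-shot hypothesis made before any follow-ups.
    hypotheses: List[str]
    ground_truth: str


def zda(dialogues: List[Dialogue]) -> float:
    """Zero-shot Diagnostic Accuracy: fraction of cases where the
    initial hypothesis already matches the ground-truth diagnosis."""
    return sum(d.hypotheses[0] == d.ground_truth for d in dialogues) / len(dialogues)


def mqd(dialogues: List[Dialogue]) -> float:
    """Mean Questions to Correct Diagnosis, averaged over the dialogues
    that eventually reach the correct diagnosis (0 = correct zero-shot)."""
    counts = [d.hypotheses.index(d.ground_truth)
              for d in dialogues if d.ground_truth in d.hypotheses]
    return sum(counts) / len(counts) if counts else float("inf")


def ise(dialogues: List[Dialogue]) -> float:
    """Inquiry Sequence Efficiency (assumed form): reward fast
    convergence by scoring each solved case as 1 / (1 + questions
    needed), and unsolved cases as 0."""
    scores = []
    for d in dialogues:
        if d.ground_truth in d.hypotheses:
            scores.append(1.0 / (1 + d.hypotheses.index(d.ground_truth)))
        else:
            scores.append(0.0)
    return sum(scores) / len(scores)
```

Under this assumed scoring, a dialogue whose zero-shot hypothesis is already correct contributes 0 questions to MQD and a perfect score of 1.0 to ISE, so the two task-specific metrics reward the same behavior from opposite directions: fewer questions and faster convergence.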
