Asking the Right Questions: Evaluating Diagnostic Dialogue with Q4Dx


Abstract

Physician–patient interactions about diagnosis and treatment typically begin with incomplete, uncertain, and noisy information about the patient's symptoms and medical history, communicated in non-specialist language. Physicians must therefore elicit critical information through an adaptive sequence of focused questions, each contingent on the patient's preceding responses, with the goal of reaching an accurate diagnosis while minimizing the number and length of questions posed. The same requirement applies to an AI assistant engaging with patients: the system must strategically choose its next questions so that it gathers the information needed for an accurate diagnosis efficiently.

To the best of our knowledge, no standardized benchmark currently exists for assessing the strategic inquiry capabilities of human or AI-assisted physicians. In this study, we present the Q4Dx benchmark to address this gap. The benchmark comprises synthetically generated patient narratives derived from curated symptom and disease data. To systematically evaluate performance under varying informational constraints, we generate multiple versions of each case, exposing 100%, 80%, and 50% of the relevant symptom information along with the ground-truth diagnosis.

We further simulate patient–physician interactions using GPT-4.1 and GPT-4o-mini agents to generate plausible sequences of clinician questions, patient responses, and intermediate diagnostic hypotheses. For each sequence, we measure the accuracy of the initial diagnostic hypothesis (Zero-shot Diagnostic Accuracy, ZDA) as well as task-specific metrics, including Mean Questions to Correct Diagnosis (MQD) and Inquiry Sequence Efficiency (ISE), which captures the rate of convergence to the correct diagnostic outcome.

Q4Dx provides a reusable framework for benchmarking large language models in goal-directed clinical dialogue and lays the groundwork for future AI-assisted diagnostic training tools. The dataset and benchmark are publicly available at: https://github.com/MaiWert/MedQDx.
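To make the three metrics concrete, the sketch below shows one plausible way they could be computed from simulated dialogues. This is an illustration, not the paper's implementation: the `Dialogue` data structure, the exact-match comparison of hypotheses, and in particular the 1/(1 + questions) form of ISE are assumptions made here for clarity, since the abstract only describes ISE as a rate-of-convergence measure.

```python
"""Illustrative sketch (not from the paper) of how Q4Dx-style metrics
might be computed over a set of simulated physician-patient dialogues.
Field names and the ISE formula are assumptions for illustration."""

from dataclasses import dataclass
from typing import List


@dataclass
class Dialogue:
    # Diagnostic hypothesis recorded after each question round;
    # index 0 is the zero-shot hypothesis made before any follow-ups.
    hypotheses: List[str]
    ground_truth: str


def zda(dialogues: List[Dialogue]) -> float:
    """Zero-shot Diagnostic Accuracy: fraction of cases where the
    initial hypothesis already matches the ground-truth diagnosis."""
    return sum(d.hypotheses[0] == d.ground_truth for d in dialogues) / len(dialogues)


def mqd(dialogues: List[Dialogue]) -> float:
    """Mean Questions to Correct Diagnosis, averaged over the dialogues
    that eventually reach the correct diagnosis (0 = correct zero-shot)."""
    counts = [d.hypotheses.index(d.ground_truth)
              for d in dialogues if d.ground_truth in d.hypotheses]
    return sum(counts) / len(counts) if counts else float("inf")


def ise(dialogues: List[Dialogue]) -> float:
    """Inquiry Sequence Efficiency (assumed form): reward fast
    convergence by scoring each solved case as 1 / (1 + questions
    needed), and unsolved cases as 0."""
    scores = []
    for d in dialogues:
        if d.ground_truth in d.hypotheses:
            scores.append(1.0 / (1 + d.hypotheses.index(d.ground_truth)))
        else:
            scores.append(0.0)
    return sum(scores) / len(scores)
```

Under this assumed scoring, a dialogue whose zero-shot hypothesis is already correct contributes 0 questions to MQD and a perfect score of 1.0 to ISE, so the two task-specific metrics reward the same behavior from opposite directions: fewer questions and faster convergence.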
