ChatGPT Is Still Not Good Enough at Giving Care-Seeking Advice, or Is It?
Abstract
Artificial intelligence tools like ChatGPT are increasingly used by patients to support their care-seeking decisions, but the accuracy of newer models remains unclear. We evaluated 16 ChatGPT models using 45 validated vignettes, each prompted ten times per model (7,200 total assessments). Each model classified the vignettes as requiring emergency care, non-emergency care, or self-care. We evaluated accuracy against each case’s gold-standard solution, examined the variability across trials, and tested algorithms that aggregate multiple recommendations to improve accuracy. o1-mini achieved the highest accuracy (78%), but we did not observe an overall improvement with newer models, although reasoning models (e.g., o4-mini) became more accurate at identifying self-care cases. Selecting the lowest urgency level across multiple trials improved accuracy by 4 percentage points. Although newer models slightly outperform laypeople, their accuracy remains insufficient for standalone use. However, exploiting output variability with aggregation algorithms can improve the performance of these models.
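The lowest-urgency aggregation rule mentioned in the abstract can be illustrated with a minimal Python sketch. The label strings, the function names (`lowest_urgency`, `accuracy`), and the toy vignettes are assumptions made for illustration only. They are not the study's code or data, and the abstract does not specify how the authors implemented the rule.

```python
# Assumed ordering from least to most urgent; label strings are illustrative.
URGENCY_ORDER = ["self-care", "non-emergency", "emergency"]
RANK = {label: i for i, label in enumerate(URGENCY_ORDER)}


def lowest_urgency(recommendations):
    """Return the least urgent label among repeated trials for one vignette."""
    if not recommendations:
        raise ValueError("need at least one recommendation")
    return min(recommendations, key=lambda label: RANK[label])


def accuracy(predictions, gold):
    """Fraction of vignettes whose predicted label matches the gold standard."""
    correct = sum(predictions[case] == gold[case] for case in gold)
    return correct / len(gold)


# Study design arithmetic from the abstract: models x vignettes x repetitions.
total_assessments = 16 * 45 * 10  # 7,200

# Hypothetical toy data (NOT study results): repeated outputs per vignette.
trials = {
    "case_a": ["emergency", "non-emergency", "non-emergency"],
    "case_b": ["non-emergency", "self-care", "non-emergency"],
}
gold = {"case_a": "non-emergency", "case_b": "self-care"}

# Aggregate the repeated trials into one recommendation per vignette.
aggregated = {case: lowest_urgency(outputs) for case, outputs in trials.items()}
toy_accuracy = accuracy(aggregated, gold)
```

In this toy example the rule recovers both gold-standard labels. On real data it trades over-triage for a higher risk of under-triage, so the 4-percentage-point gain reported in the abstract should not be assumed to transfer to other settings.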