Real-World Evaluation of Artificial Intelligence (AI) Chatbots for Providing Sexual Health Information: A Consensus Study Using Clinical Queries


Abstract

Introduction: Artificial Intelligence (AI) chatbots could provide information on sensitive topics, including sexual health, to the public. However, their performance relative to human clinicians, and how it varies across different chatbots, remains understudied, particularly in sexual health. This study evaluated three AI chatbots - two prompt-tuned (Alice and Azure) and one standard chatbot (ChatGPT by OpenAI) - in providing sexual health information, compared with human clinicians.

Methods: We analysed 195 anonymised sexual health questions received by the Melbourne Sexual Health Centre phone line. A panel of experts evaluated responses to these questions from nurses and the three AI chatbots, presented in a blinded order, using a consensus-based approach. Performance was assessed on overall correctness and five specific measures: guidance, accuracy, safety, ease of access, and provision of necessary information. We conducted subgroup analyses for clinic-specific (e.g., opening hours) and general sexual health questions, and a sensitivity analysis excluding questions that Azure could not answer.

Results: Alice demonstrated the highest overall correctness (85.2%; 95% confidence interval (CI), 82.1%-88.0%), followed by Azure (69.3%; 95% CI, 65.3%-73.0%) and ChatGPT (64.8%; 95% CI, 60.7%-68.7%). The prompt-tuned chatbots outperformed the base ChatGPT across all measures. Azure achieved the highest safety score (97.9%; 95% CI, 96.4%-98.9%), indicating the lowest risk of providing potentially harmful advice. In subgroup analysis, all chatbots performed better on general sexual health questions than on clinic-specific queries. Sensitivity analysis showed a narrower performance gap between Alice and Azure when questions Azure could not answer were excluded.

Conclusions: Prompt-tuned AI chatbots provided sexual health information more accurately than the base ChatGPT, and their high safety scores were particularly noteworthy. However, all AI chatbots were susceptible to generating incorrect information. These findings suggest that AI chatbots could serve as adjuncts to human healthcare providers in delivering sexual health information, while highlighting the need for continued refinement and human oversight. Future research should focus on larger-scale evaluations and real-world implementations.
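The abstract reports binomial proportions with 95% confidence intervals (e.g., overall correctness of 85.2%, 95% CI 82.1%-88.0%). As a rough illustration of how intervals of this kind can be computed, the sketch below uses a Wilson score interval; the abstract does not state the paper's actual CI method, denominator, or raw counts, so the function name wilson_ci and the example counts are assumptions, not the authors' analysis.

# Minimal sketch (Python), assuming a Wilson score interval for a binomial proportion.
# The counts below are illustrative only and do not reproduce the paper's figures.
from math import sqrt

def wilson_ci(successes: int, n: int, z: float = 1.959964) -> tuple[float, float]:
    """Wilson score 95% confidence interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return centre - half, centre + half

# Example: 166 of 195 responses judged correct (~85%).
low, high = wilson_ci(166, 195)
print(f"proportion {166/195:.1%}, 95% CI {low:.1%}-{high:.1%}")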
