Comparative Evaluation of Viral Hepatitis Question Responses: ChatGPT-4.5 Outperforms Three Established Models
Abstract
Background: Viral hepatitis is a major global public health problem affecting millions of people, and the public needs accurate information to understand the disease correctly. This study evaluated four large language models (LLMs), Gemini-2.0, Claude-3.5-sonnet, ChatGPT-4.5, and ChatGPT-4, and compared their responses to questions about viral hepatitis to determine whether ChatGPT-4.5 outperformed the other three models in this field.

Methods: This comparative evaluation study, conducted at Nanjing Drum Tower Hospital from March to April 2025, examined 52 questions about viral hepatitis spanning four domains: concepts, risk factors, diagnosis, and prevention and treatment. Each model's responses to these questions were first rated on a three-point scale of good, borderline, or poor. Responses were then scored from 1 to 5 on each of five criteria: relevance, comprehensiveness, accuracy, safety, and readability.

Results: ChatGPT-4.5 achieved the highest performance, with 89.1% of its responses rated as good, significantly outperforming Claude-3.5-sonnet (71.15% good), Gemini-2.0 (62.82% good), and ChatGPT-4 (50.64% good). Statistical analysis confirmed the superior performance of ChatGPT-4.5 in all evaluated dimensions. Consistent with this, ChatGPT-4.5 scored highest on all five criteria: relevance, comprehensiveness, accuracy, safety, and readability.

Conclusions: ChatGPT-4.5 demonstrates superior performance in addressing viral hepatitis queries compared with the other three models. Its high reliability makes it a valuable tool for healthcare professionals and patients by improving information accessibility.
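The "percentage rated good" figures in the Results come from tallying labels on the three-point scale described in the Methods. As an illustration only, a minimal Python sketch of that tally might look like the following. The model names, the ratings, and the `percent_good` helper are all hypothetical; this is not the study's data or its analysis code.

```python
from collections import Counter

# Hypothetical ratings on the three-point scale (good / borderline / poor).
# These values are made up for illustration and are NOT data from the study.
ratings = {
    "model_a": ["good", "good", "borderline", "good", "poor"],
    "model_b": ["good", "borderline", "borderline", "poor", "poor"],
}


def percent_good(labels):
    """Return the percentage of labels equal to 'good', rounded to 2 decimals."""
    counts = Counter(labels)
    return round(100 * counts["good"] / len(labels), 2)


# Percentage of 'good' ratings per (hypothetical) model.
summary = {model: percent_good(labels) for model, labels in ratings.items()}
print(summary)
```

The same tally could be repeated within each of the four question domains, or replaced by a mean over the 1 to 5 scores for the five secondary criteria.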