Zero-Shot Evaluation of Kimi K2 on Pediatric Clinical Cases

Gianluca Mondillo
Mariapia Masino
Simone Colosimo
Alessandra Perrotta
Vittoria Frattolillo
Fabio Giovanni Abbate

Abstract

Background

The application of large language models (LLMs) in pediatric medicine requires rigorous performance evaluation prior to clinical implementation.

Objective

To evaluate the accuracy of the Kimi K2 model in analyzing pediatric clinical cases using a zero-shot approach.

Methods

A total of 2,249 multiple-choice questions based on pediatric clinical cases, with patients ranging in age from 1 day to 16 years, were extracted from the MedQA dataset and analyzed. The model was tested via API with standardized parameters, temperature set to zero, and zero-shot prompts. Accuracy was calculated by comparing the model’s responses with the dataset’s ground truth.
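The scoring step described in the methods can be illustrated with a minimal sketch. It is not the authors' code: the function names, the assumption that options are labeled A to E, and the assumption that a well-formed response is a single option letter are all hypothetical, chosen only to show how accuracy and format compliance could be computed against a ground-truth key.

```python
import re
from typing import Dict, List, Optional

# Assumption: answer options are labeled A-E and a well-formed response
# consists of one letter, optionally followed by "." or ")".
_CHOICE_PATTERN = re.compile(r"\s*([A-E])[.)]?\s*")


def extract_choice(response: str) -> Optional[str]:
    """Return the option letter, or None if the response is not well formed."""
    match = _CHOICE_PATTERN.fullmatch(response)
    return match.group(1) if match else None


def score(responses: List[str], ground_truth: List[str]) -> Dict[str, float]:
    """Compare model responses with the answer key (hypothetical helper)."""
    if not ground_truth or len(responses) != len(ground_truth):
        raise ValueError("responses and ground_truth must be non-empty and equal in length")
    choices = [extract_choice(r) for r in responses]
    total = len(ground_truth)
    correct = sum(c == t for c, t in zip(choices, ground_truth))
    well_formed = sum(c is not None for c in choices)
    return {
        "total": total,
        "correct": correct,
        "accuracy": correct / total,          # share of answers matching the key
        "format_rate": well_formed / total,   # share of responses in the required format
    }


if __name__ == "__main__":
    # Toy data for illustration only, not taken from the study.
    demo = score(["A", "B.", " C ", "I think D"], ["A", "C", "C", "D"])
    print(demo)
```

A response that cannot be parsed counts as incorrect and lowers the format rate, so the two figures are reported separately.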

Results

Kimi K2 achieved an overall accuracy of 78.39%, corresponding to 1,763 correct answers out of 2,249 total, with 100% of responses in the required format.

Conclusions

The model demonstrates competitive performance for medical education and diagnostic support, while still having limitations that require human clinical supervision.

Version published to 10.1101/2025.07.29.25332368 on medRxiv
Jul 29, 2025

Related articles

Benchmark Evaluation of Multi-Modal Large Language Models for Ophthalmic Diagnosis

This article has 10 authors:
1. Weihua Yang
2. Shoujun Huang
3. Junhong Chen
4. Jiaoman Wang
5. Ping Zhang
6. Wending Du
7. Yuan Hong
8. Dexing Kong
9. Wei Lou
10. Wei Chi
This article has no evaluations.
Latest version: Jul 23, 2025

Comparison of Multimodal Large Language Models and Physicians for Medical Diagnosis Using NEJM Image Challenge Cases: Cross-sectional Study

This article has 6 authors:
1. Chiyu Sheng
2. Shumin Shen
3. Lin Wang
4. Wei Chen
5. Shanghu Wang
6. Nianfei Wang
This article has no evaluations.
Latest version: Sep 1, 2025

Closing the Pediatric Divide: A Performance Analysis of the GPT-5 Family in Medical Diagnostics

This article has 6 authors:
1. Gianluca Mondillo
2. Fabio Giovanni Abbate
3. Mariapia Masino
4. Simone Colosimo
5. Alessandra Perrotta
6. Vittoria Frattolillo
This article has no evaluations.
Latest version: Aug 29, 2025
