Classifying Domains and Benchmarking GPT-4: A Portuguese Dataset for Medical AI Q&A
Abstract
Artificial Intelligence (AI), particularly large language models (LLMs), has demonstrated remarkable capabilities in addressing complex tasks, including professional-level medical question answering. While standardized benchmarks like the USMLE have been widely used to evaluate LLM performance in English, there is a significant gap in evaluating these models in other languages, such as Portuguese. To address this, we present a curated dataset derived from the Teste de Progresso (TP), a widely adopted Brazilian progress test used to assess medical knowledge across six key domains: Basic Sciences, Internal Medicine, Surgery, Obstetrics and Gynecology, Public Health, and Pediatrics. The dataset consists of 720 multiple-choice questions spanning five years (2019–2023). We demonstrate two primary applications of this dataset. First, we benchmark the performance of GPT-4, which achieved an overall accuracy of 90% across the six medical domains, with the highest performance in Internal Medicine and the lowest in Public Health (80%). Second, we develop a classification model based on BERTimbau, achieving an overall accuracy of 94% in categorizing questions into their respective medical domains. Our results highlight the utility of the dataset both for benchmarking AI models and for automating medical question classification. This work emphasizes the importance of creating domain-specific datasets in underrepresented languages, such as Portuguese, to advance AI-driven medical applications, ensure equitable access to AI technologies, and address linguistic and cultural gaps in healthcare education.
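To make the second application concrete, the snippet below is a minimal sketch (not the authors' published code) of how a BERTimbau model could be fine-tuned for six-way domain classification with the Hugging Face Transformers library. The CSV file name, the column names "question_text" and "domain", and all hyperparameters are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch: fine-tuning BERTimbau to classify TP questions into
# the six medical domains. File name, column names, and hyperparameters
# below are hypothetical placeholders, not the authors' actual setup.
import pandas as pd
import torch
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from torch.utils.data import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL_NAME = "neuralmind/bert-base-portuguese-cased"  # public BERTimbau checkpoint


class TPDataset(Dataset):
    """Wraps tokenized question texts and integer-encoded domain labels."""

    def __init__(self, texts, labels, tokenizer):
        self.encodings = tokenizer(list(texts), truncation=True,
                                   padding=True, max_length=512)
        self.labels = list(labels)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item


# Hypothetical dataset layout: one row per question with its domain label.
df = pd.read_csv("teste_de_progresso.csv")
encoder = LabelEncoder()
df["label"] = encoder.fit_transform(df["domain"])  # six domains -> 0..5

train_df, eval_df = train_test_split(
    df, test_size=0.2, stratify=df["label"], random_state=42)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=len(encoder.classes_))

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="tp-domain-classifier",
                           num_train_epochs=3,
                           per_device_train_batch_size=8),
    train_dataset=TPDataset(train_df["question_text"], train_df["label"], tokenizer),
    eval_dataset=TPDataset(eval_df["question_text"], eval_df["label"], tokenizer),
)
trainer.train()
print(trainer.evaluate())  # reports eval loss; add a metrics function for accuracy
```

Per-domain accuracy for the GPT-4 benchmark can be computed in the same way, by grouping model answers by domain and comparing them to the official answer keys.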