Classifying Domains and Benchmarking GPT-4: A Portuguese Dataset for Medical AI Q&A
Abstract
Artificial Intelligence (AI), particularly large language models (LLMs), has demonstrated remarkable capabilities in addressing complex tasks, including professional-level medical question answering. While standardized benchmarks like the USMLE have been widely used to evaluate LLM performance in English, there is a significant gap in evaluating these models in other languages, such as Portuguese. To address this, we present a curated dataset derived from the Teste de Progresso (TP), a widely adopted Brazilian progress test used to assess medical knowledge across six key domains: Basic Sciences, Internal Medicine, Surgery, Obstetrics and Gynecology, Public Health, and Pediatrics. The dataset consists of 720 multiple-choice questions spanning five years (2019–2023). We demonstrate two primary applications of this dataset. First, we benchmark the performance of GPT-4, which achieved an overall accuracy of 90% across the six medical domains, with the highest performance in Internal Medicine and the lowest in Public Health (80%). Second, we develop a classification model based on BERTimbau, achieving an overall accuracy of 94% in categorizing questions into their respective medical domains. Our results highlight the utility of the dataset both for benchmarking AI models and for automating medical question classification. This work emphasizes the importance of creating domain-specific datasets in underrepresented languages, such as Portuguese, to advance AI-driven medical applications, ensure equitable access to AI technologies, and address linguistic and cultural gaps in healthcare education.
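To make the second application concrete, the snippet below is a minimal sketch (not the authors' published code) of how a BERTimbau model could be fine-tuned for six-way domain classification with the Hugging Face Transformers library. The CSV file name, the column names "question_text" and "domain", and all hyperparameters are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch: fine-tuning BERTimbau to classify TP questions into
# the six medical domains. File name, column names, and hyperparameters
# below are hypothetical placeholders, not the authors' actual setup.
import pandas as pd
import torch
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from torch.utils.data import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL_NAME = "neuralmind/bert-base-portuguese-cased"  # public BERTimbau checkpoint


class TPDataset(Dataset):
    """Wraps tokenized question texts and integer-encoded domain labels."""

    def __init__(self, texts, labels, tokenizer):
        self.encodings = tokenizer(list(texts), truncation=True,
                                   padding=True, max_length=512)
        self.labels = list(labels)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item


# Hypothetical dataset layout: one row per question with its domain label.
df = pd.read_csv("teste_de_progresso.csv")
encoder = LabelEncoder()
df["label"] = encoder.fit_transform(df["domain"])  # six domains -> 0..5

train_df, eval_df = train_test_split(
    df, test_size=0.2, stratify=df["label"], random_state=42)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=len(encoder.classes_))

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="tp-domain-classifier",
                           num_train_epochs=3,
                           per_device_train_batch_size=8),
    train_dataset=TPDataset(train_df["question_text"], train_df["label"], tokenizer),
    eval_dataset=TPDataset(eval_df["question_text"], eval_df["label"], tokenizer),
)
trainer.train()
print(trainer.evaluate())  # reports eval loss; add a metrics function for accuracy
```

Per-domain accuracy for the GPT-4 benchmark can be computed in the same way, by grouping model answers by domain and comparing them to the official answer keys.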