Evaluating Large Language Models’ Performance in FDA Regulatory Science
Abstract
Background
Clinical and population-level decision-making relies on the systematic evaluation of extensive regulatory evidence. FDA drug reviews provide detailed information on clinical trial design, enrollment criteria, sample size, randomization, comparators, endpoints, and indications, but extracting these data is resource-intensive and time-consuming. Generative artificial intelligence large language models (LLMs) may accelerate the extraction and synthesis of such information. This study compares the performance of three LLMs (ChatGPT-4o, Gemini 2.5 Pro, and DeepSeek R1) in extracting and synthesizing regulatory and clinical information to inform FDA decision-making, using antibiotics approved for complicated urinary tract infections (cUTIs) between 2010 and 2025 as the test case.

Methods
The LLMs were evaluated using general (short, direct) and detailed (structured, guidance-referencing) prompts across five domains: accuracy (precision and recall), explanation quality, error type (hallucination, misclassification, and omission), efficiency (response time, correct answers per second, and seconds per correct answer), and consistency between duplicate runs. Two investigators independently reviewed outputs against FDA guidance, resolving discrepancies by consensus. Statistical analyses included χ², Wilcoxon, and Kruskal–Wallis tests with false discovery rate correction.

Results
Among 324 responses, accuracy differed significantly across models (χ², p < 0.001), with Gemini 2.5 Pro achieving the highest accuracy (66.7%), followed by ChatGPT-4o (51.9%) and DeepSeek R1 (37.0%). General prompts outperformed detailed prompts (59.3% vs 44.4%; p = 0.011). Gemini 2.5 Pro produced the highest-quality explanations and the most consistent outputs, while ChatGPT-4o had the shortest response times and the highest efficiency. Hallucination was the most frequent error type across models.

Conclusions
LLMs showed variable capability in extracting regulatory and clinical trial information. Gemini 2.5 Pro demonstrated the strongest overall performance, ChatGPT-4o was faster but less accurate, and DeepSeek R1 underperformed across most domains. These findings highlight both the promise and the limitations of LLMs in regulatory science and support their complementary use with human review to streamline evidence synthesis and inform FDA decision-making.
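As a minimal sketch of the kind of statistical comparison described in Methods, the Python snippet below runs a χ² test of independence on a model-by-accuracy contingency table and applies a Benjamini–Hochberg false discovery rate correction (via scipy and statsmodels). The per-model correct/incorrect counts are hypothetical, chosen only to roughly match the accuracies reported in Results under an assumed even split of 108 responses per model; the additional p-values fed to the FDR step are placeholders, not values from the paper.

from scipy.stats import chi2_contingency
from statsmodels.stats.multitest import multipletests

# Hypothetical correct/incorrect counts per model (108 responses each, 324 total),
# chosen only to approximate the accuracies reported in Results.
counts = {
    "Gemini 2.5 Pro": (72, 36),   # ~66.7% correct
    "ChatGPT-4o": (56, 52),       # ~51.9% correct
    "DeepSeek R1": (40, 68),      # ~37.0% correct
}
table = [list(row) for row in counts.values()]

# Chi-square test of independence: does accuracy differ across the three models?
chi2, p_chi2, dof, _ = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p_chi2:.4g}")

# Illustrative Benjamini-Hochberg FDR correction over a family of placeholder
# p-values (e.g., one test per evaluation domain).
raw_p = [p_chi2, 0.011, 0.04, 0.20, 0.65]
reject, p_adj, _, _ = multipletests(raw_p, alpha=0.05, method="fdr_bh")
for p, q, r in zip(raw_p, p_adj, reject):
    print(f"raw p = {p:.3f}  FDR-adjusted p = {q:.3f}  significant: {r}")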