Artificial Intelligence in Biomedical Data Analysis: A Comparative Assessment of Large Language Models for Automated Clinical Trial Interpretation and Statistical Evaluation
Abstract
Background
Clinical trials provide evidence of the efficacy and safety of experimental treatment regimens. Analysis of data from these trials is a time-intensive process traditionally requiring advanced multidisciplinary expertise in biomedicine, clinical research, biostatistics and data science. Large language models (LLMs), such as OpenAI’s GPT-4 and Google’s Gemini Advanced, present new opportunities for data analysis in medical research by leveraging natural language understanding and data interpretation capabilities.
Objective
This study investigates the ability of LLMs to analyze and report clinical trial results, starting with de-identified individual patient data. Here, we evaluate two LLMs for their ability to recapitulate the analysis of a clinical trial that evaluated LY2510924 in combination with carboplatin and etoposide for the treatment of extensive-stage small cell lung cancer (ES-SCLC). The main objectives are to (i) assess whether LLMs can be effectively used without specialized machine learning training and (ii) compare LLM-driven analyses to those conducted by experienced data scientists.
Methods
Data from the Project Data Sphere (PDS) platform were used, and multiple investigators employed both ChatGPT and Gemini Advanced for analysis. A chain-of-thought (CoT) prompting framework was applied to guide the LLMs through a systematic evaluation of baseline characteristics, progression-free survival (PFS), overall survival (OS), safety data, and biomarker information. Results were compared across investigators and LLMs to assess consistency.
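A stepwise CoT prompt sequence of the kind described above can be sketched as follows (the wording, step names, and helper function here are hypothetical illustrations, not the prompts actually used in the study):

```python
# Hypothetical sketch of a chain-of-thought (CoT) prompt sequence for
# guiding an LLM through a clinical trial analysis. The step wording
# and build_prompt helper are illustrative, not the study's prompts.
COT_STEPS = [
    "Step 1: Summarize baseline characteristics by treatment arm.",
    "Step 2: Estimate progression-free survival (PFS) per arm and compare arms.",
    "Step 3: Estimate overall survival (OS) per arm and compare arms.",
    "Step 4: Tabulate adverse events by grade and treatment arm.",
    "Step 5: Summarize the available biomarker information.",
]

def build_prompt(step: str, context: str) -> str:
    """Combine one analysis step with the shared trial context."""
    return f"{context}\n\nTask: {step}\nExplain your reasoning step by step."

context = "You are analyzing de-identified patient-level clinical trial data."
prompts = [build_prompt(step, context) for step in COT_STEPS]
```

Issuing the steps one at a time, rather than as a single omnibus request, is what lets investigators inspect the model's intermediate reasoning at each stage before moving on.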
Results
While LLMs could process the trial data and generate relevant insights, discrepancies were observed across the investigators’ analyses, particularly in primary and secondary endpoints. One investigator found a significant improvement in PFS with LY2510924, contradicting the other investigators’ results. Variations in objective response rate (ORR) and adverse event analyses further highlighted inconsistencies between the two LLMs. These discrepancies may be due to differences in LLM capabilities and behaviors, prompting strategies, and potential model drift over time.
Conclusion
LLMs such as ChatGPT-4 and Gemini Advanced offer promising capabilities in clinical data analysis, though variability in results underscores the need for tailored CoT frameworks and specialized prompting strategies. Addressing issues such as model drift and ensuring consistent model versions are crucial for reliable application. LLMs also show potential to accelerate clinical research by drafting clinical trial reports, but further refinements are needed to ensure accuracy and consistency in their application. The observed discrepancies across LLM results, and in comparison to the expert-authored trial report, highlight the need for highly trained subject matter experts to review and revise LLM-generated clinical trial analyses.