Large Language Models for Supporting Clear Writing and Detecting Spin in Randomized Controlled Trials in Oncology

Abstract

Importance

Accurate interpretation of randomized controlled trial (RCT) results is essential for guiding clinical practice in oncology. Reporting “spin” can misrepresent treatment efficacy, potentially leading to suboptimal clinical decisions. Standardized methods could help detect such misleading reporting.

Objective

To determine whether large language models (LLMs) can accurately classify oncology RCTs as positive or negative based on primary endpoint achievement when provided with different sections of the trial report, thereby assessing their utility in identifying potential spin in conclusions.

Design

Methodological evaluation using LLMs and human annotations of previously published clinical trials.

Setting

Random sample of RCTs from seven major medical journals published between 2005 and 2023.

Participants

A total of 250 two-arm oncology RCT reports with a single primary endpoint were randomly selected from the specified journals and publication years.

Exposure(s)

Human annotators independently classified trials based on primary endpoint results before LLM evaluation. Three commercial LLMs (GPT-3.5 Turbo, GPT-4o, and o1) then classified trials from four different text inputs: (1) conclusion only, (2) methods and conclusion, (3) methods, results, and conclusion, and (4) title and full abstract.
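
The abstract does not include the classification code, but the setup can be illustrated with a minimal sketch. The example below assumes the OpenAI Python SDK and an illustrative prompt; the authors' actual prompts, parsing, and model parameters are not specified here.

```python
# Minimal sketch, assuming the OpenAI Python SDK and an illustrative
# prompt; not the authors' actual pipeline.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = (
    "You will receive text from an oncology randomized controlled trial "
    "report. Classify the trial as 'positive' if the prespecified primary "
    "endpoint was met, otherwise 'negative'. Answer with a single word."
)

def classify_trial(section_text: str, model: str = "gpt-4o") -> str:
    """Classify one input condition, e.g. the conclusion alone
    or the title plus full abstract."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": section_text},
        ],
    )
    return response.choices[0].message.content.strip().lower()
```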

Main Outcome(s) and Measure(s)

Performance of LLMs in classifying trials as positive or negative, primarily measured using the F1 score.
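
The F1 score is the harmonic mean of precision and recall, F1 = 2PR / (P + R), with trials labeled positive treated as the positive class. A minimal scoring step against the human annotations might look like the following sketch (scikit-learn assumed; the label lists are illustrative, not study data):

```python
# Minimal sketch: scoring LLM labels against human annotations with
# scikit-learn. The label lists are illustrative, not study data.
from sklearn.metrics import f1_score

human_labels = ["positive", "negative", "positive", "positive"]
llm_labels   = ["positive", "negative", "negative", "positive"]

# "positive" is the positive class, matching the paper's framing.
print(f"F1 = {f1_score(human_labels, llm_labels, pos_label='positive'):.3f}")
```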

Results

The analysis included 250 RCT reports; based on human annotation, 146 (58.4%) were positive and 104 (41.6%) were negative. Of the three models, o1 demonstrated the highest performance across all input conditions, achieving F1 scores of 0.932 (conclusion only), 0.960 (methods and conclusion), 0.980 (methods, results, and conclusion), and 0.970 (title and full abstract). Trials incorrectly classified as positive when only the conclusion was provided showed recurring patterns: absence of primary endpoint data in the conclusion, emphasis on secondary or subgroup findings, or unclear distinction between primary and secondary endpoints.

Conclusions and Relevance

LLMs can accurately classify oncology RCT outcomes. Discrepancies between classifications based on conclusions versus more complete text indicate potential spin. This approach could serve as a valuable supplementary tool for researchers, reviewers, and editors to enhance transparency and critical appraisal of oncology trial reporting, though further validation is required, especially for trials with more complex designs.
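
Operationally, the spin signal described here reduces to a comparison across input conditions. A hedged sketch of that decision rule follows (the function name and labels are hypothetical, not the authors' pipeline):

```python
# Minimal sketch of an assumed decision rule, not the authors' pipeline:
# flag a trial when its conclusion reads more favorably than fuller text.
def flag_potential_spin(conclusion_label: str, full_text_label: str) -> bool:
    """True when the conclusion-only classification is positive but the
    classification from more complete text (e.g., methods, results, and
    conclusion) is negative."""
    return conclusion_label == "positive" and full_text_label == "negative"

# Example: conclusion emphasizes secondary findings while the full
# report shows the primary endpoint was not met.
assert flag_potential_spin("positive", "negative")
assert not flag_potential_spin("positive", "positive")
```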

Key Points

Question

Can large language models (LLMs) accurately classify oncology randomized controlled trials (RCTs) as positive or negative based on primary endpoint achievement, and can this help identify potential “spin” in trial conclusions?

Findings

In this methodological evaluation of 250 oncology RCTs, the o1 LLM achieved high accuracy in classifying trials based on the title and full abstract, outperforming classifications based on the conclusion alone. Discrepancies between classifications using conclusions versus more complete text often indicated patterns that could be considered spin.

Meaning

LLMs show promise as a supplementary tool to detect potential spin by identifying inconsistencies between conclusions and overall results.
