Fact-Checks Can Help Inoculate LLMs Against Disinformation

Abstract

Large language models (LLMs) have become the first source millions consult when evaluating political claims, yet they were trained on an open internet that state-sponsored disinformation operations were designed to pollute. We audit four frontier models across 2,268 responses to 63 fabrications from eight documented influence operations and find that even a single published fact-check can shift a model from hedging about a fabrication to rejecting it outright. Overall, models correctly rejected only 81% of fabrications while repeating disinformation in 3% of cases. Critically, the remaining 16% of responses would have left users unable to determine whether the fabricated claims were true or false. Fact-checks may present an avenue for addressing this vulnerability. Narratives that had been debunked by at least one IFCN-certified fact-checking organization were correctly rejected as disinformation 93% of the time, compared to just 76% of the time for unchecked narratives. Exploratory analyses suggest that this mechanism operates through training data, with models echoing the specific vocabulary of published corrections when available during their training windows. We conclude by discussing how fact-checks designed for human audiences appear to serve as effective protection for LLMs, converting both model uncertainty and outright disinformation into definitive rejections.
