AI Cannot Be Reproduced: Medical AI Needs a Culture of Logging, Not Reruns

Abstract

Large language models (LLMs), including generative AI, are rapidly being adopted for tasks ranging from manuscript drafting to clinical decision support. Yet the evaluation and governance of these systems still rely heavily on a traditional notion of “reproducibility”: if one runs the same procedure under the same conditions, one should obtain the same result. For probabilistic and highly environment-dependent LLMs, bit-identical reruns are neither realistic nor especially useful. Drawing on OpenAI’s work on weight-sparse transformers and recent proposals such as MedLog for event-level clinical AI logging, this article argues that medical AI—particularly AI-assisted writing and explanation—should shift its primary evaluative target from strict reproducibility to auditability. In practice, this means designing layered “log cultures” that record how systems were used, with appropriate granularity for different levels of risk. Even if future models achieve substantially higher technical reproducibility, meaningful verification of individual outputs will still depend on access to their provenance: what model, in which version, was run with what inputs and settings, under whose oversight. Journals such as NEJM AI are well positioned to move from merely asking, “Was AI used?” to setting concrete expectations about what must be logged, retained, and, where appropriate, shared. Such standards would guide clinicians, institutions, vendors, and payers in building log practices that make AI use contestable rather than opaque.
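To make the idea of event-level provenance concrete, the sketch below shows one possible shape for such a log record, assuming a simple append-only JSON Lines store. The field names, risk tiers, and identifiers (e.g., AIUsageEvent, log_event, ai_usage_log.jsonl) are illustrative assumptions for this sketch, not a published MedLog schema or journal requirement.

```python
"""
Minimal sketch of an event-level AI-usage log record.
Assumptions: field names, risk-tier values, and the JSON Lines store
are illustrative, not drawn from MedLog or any existing standard.
"""

import hashlib
import json
from dataclasses import dataclass, asdict, field
from datetime import datetime, timezone


def _sha256(text: str) -> str:
    # Hash inputs/outputs so provenance can be checked later
    # without retaining the raw (possibly sensitive) text itself.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


@dataclass
class AIUsageEvent:
    # What model, in which version, was run...
    model_name: str        # vendor's model identifier
    model_version: str     # exact version or checkpoint label
    # ...with what inputs and settings...
    input_hash: str        # hash of the prompt/context actually sent
    settings: dict         # decoding parameters (temperature, seed, ...)
    output_hash: str       # hash of the text returned to the user
    # ...under whose oversight.
    operator_id: str       # clinician or author who reviewed the output
    use_context: str       # e.g. "manuscript_drafting", "clinical_decision_support"
    risk_tier: str         # coarse granularity knob: "low", "moderate", "high"
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def log_event(event: AIUsageEvent, path: str = "ai_usage_log.jsonl") -> None:
    # Append one event per line; retention and sharing policy
    # are institutional decisions outside this sketch.
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(event)) + "\n")


if __name__ == "__main__":
    event = AIUsageEvent(
        model_name="example-llm",          # hypothetical identifiers
        model_version="2025-01-preview",
        input_hash=_sha256("draft discussion section for sepsis cohort study"),
        settings={"temperature": 0.7, "seed": 12345},
        output_hash=_sha256("...model output text..."),
        operator_id="author-03",
        use_context="manuscript_drafting",
        risk_tier="low",
    )
    log_event(event)
```

In this sketch the granularity of what is hashed, stored verbatim, or omitted would be tuned to the risk tier, matching the article's point that logging expectations should scale with the stakes of the use case.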
