Citation Hallucination Determines Success: An Empirical Comparison of Six Medical AI Research Systems


Abstract

Large language model (LLM) systems can now generate complete research manuscripts, yet their reliability in clinical medicine - where citation accuracy and reporting standards carry direct consequences - has not been systematically assessed. We introduce MedResearchBench, a benchmark of three clinical epidemiology tasks built on NHANES data, and use it to evaluate six AI research systems across six quality dimensions. Evaluation combines programmatic citation verification, rule-based reporting compliance checks, and multi-model LLM judging, providing a more discriminative assessment than conventional single-judge approaches. Citation integrity emerged as the decisive quality dimension. Hallucination rates ranged from 2.9% to 36.8% across systems, and a hard-rule threshold on per-task citation scores capped four of six systems' total scores at the penalty ceiling. Adding a multi-agent citation verification and repair pipeline to the best-performing system improved its citation integrity score from 40.0 to 90.9 and raised the weighted total from 68.9 to 81.8. Strikingly, a single-model evaluation ranked this system last (55.5), while our three-tier framework ranked it first (81.8) - a complete reversal that exposes the limitations of subjective LLM-only evaluation. These results suggest that programmatic citation verification should be a core metric in future evaluations of AI scientific writing systems, and that multi-agent quality assurance can bridge the gap between fluent text generation and trustworthy scholarship.

Article activity feed

  1. This Zenodo record is a permanently preserved version of a PREreview. You can view the complete PREreview at https://prereview.org/reviews/19453657.

    Shi et al. have written an interesting and timely piece on the reliability of large language models (LLMs) in producing medical research manuscripts. They introduce MedResearchBench, a benchmarking tool to assess the reliability of LLM outputs, and report on how different LLMs perform on their programmatic benchmark.

    Major issues

    • The biggest issue is that the authors find that the proposed benchmark's best 'signal' is citation integrity, as the LLMs had high hallucination rates (from 2.9% to 36.8% across systems) and detecting hallucinated references was therefore a fingerprint of low-integrity or problematic research.

    • This is undermined, however, by the authors' parallel finding that adding a "citation verification and repair pipeline to the best-performing system improved its citation integrity score from 40.0 to 90.9". The authors acknowledge this as a limitation, but still report that "multi-agent quality assurance can bridge the gap between fluent text generation and trustworthy scholarship". The findings do not seem consistent with this conclusion; rather, they suggest the opposite: that programmatic assurance is easily gamed.

    • In addition, the manuscript does not address whether the replacement references are relevant to the claims they support — only that they exist (either in CrossRef or PubMed). So the benchmark is easily gamed in ways that may not improve integrity.

    • This is also consistent with our group's research, which found, for example, that programmatic assessment tools including iThenticate can be trivially gamed by introducing syntactic alterations into an LLM-based workflow: https://doi.org/10.1186/s12916-025-04569-y

    • Similar points have been made on the 'arms race' between low-integrity actors and publishers by Marcus Munafo and George Davey Smith: https://doi.org/10.1371/journal.pbio.3003660

    • The authors may wish to consider reframing their findings by setting them in the context of this arms race.
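    For concreteness, the existence-only check described above can be sketched as a two-step lookup: a cheap syntactic DOI test followed by a CrossRef query. This is an illustrative sketch, not the authors' actual pipeline; the function names are hypothetical, and the lookup uses CrossRef's public `/works/{doi}` endpoint.

    ```python
    import re
    import urllib.error
    import urllib.parse
    import urllib.request

    # Cheap syntactic filter: does the string even look like a DOI?
    DOI_RE = re.compile(r"^10\.\d{4,9}/\S+$")

    def looks_like_doi(ref: str) -> bool:
        return bool(DOI_RE.match(ref.strip()))

    def exists_in_crossref(doi: str, timeout: float = 10.0) -> bool:
        """Existence-only check against CrossRef's public /works/{doi} route.
        A 200 response means *a* record exists; it says nothing about
        whether that record is relevant to the claim it is cited for."""
        url = "https://api.crossref.org/works/" + urllib.parse.quote(doi, safe="")
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.status == 200
        except urllib.error.HTTPError:
            return False
    ```

    Note that `exists_in_crossref` confirms only that the DOI resolves to some record, which is exactly the gap identified above: existence is not relevance.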

    Minor issues

    • There are also some limitations of the manuscript that may be worth flagging. The manuscript reports a bimodal distribution for 'with' and 'without' hallucinations, but n = 7 is too few to support distributional claims of this kind.

    • There is little detail on the STROBE compliance scoring (D4) methodology, and all seven tools score highly. It might be worth specifying what 'automated text detection' under D4 entails.

    • There are some numerical inconsistencies in the report. For example, in Table 3 the hallucination rate computes to 4.08% (1 failed + 1 corrupted, divided by 49), but the paper reports 2.9%. This appears to be a formula error in the last column.

    • The authors could usefully acknowledge the limitations of D2 (numerical fidelity), and in particular whether low scores under D2 could reflect regex quality as much as numerical accuracy. Similarly, D6 does not offer meaningful differentiation; could this be because of how the rubric was defined (too coarse in its granularity)? It could also be worth assessing inter-task variability more systematically.

    • There is also a potential conflict of interest that could be discussed (or at least made explicit): the authors chose their own benchmarks, set their own thresholds, selected the competing systems, and did not pre-register the workflow. This is not necessarily a problem, but it may be worth mentioning, and pre-registering a workflow can assist with transparency around why particular choices were made or models selected.

    • A very minor point: the computing environment is reported, but this is unnecessary detail, since the whole work (unless I am mistaken) is conducted via API calls, so the operator environment is independent of the findings.
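    The Table 3 discrepancy flagged above is easy to recompute. This is a sketch under the review's reading of the table (1 failed and 1 corrupted reference out of 49); the counts come from the bullet point, not from re-deriving the paper's data.

    ```python
    # Recompute the hallucination rate as (failed + corrupted) / total,
    # using the counts the review reads off Table 3.
    failed, corrupted, total = 1, 1, 49
    rate = 100 * (failed + corrupted) / total
    print(f"{rate:.2f}%")  # prints "4.08%", not the reported 2.9%
    ```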

    Overall, however, I enjoyed reading this work and wish the authors success in their endeavors.

    Competing interests

    The author declares that they have no competing interests.

    Use of Artificial Intelligence (AI)

    The author declares that they did not use generative AI to come up with new ideas for their review.