Accuracy and hallucination of DeepSeek and ChatGPT in scientific figure interpretation and reference retrieval


Abstract

Artificial intelligence-based large language models (AI-LLMs) are increasingly used in biomedical research, but concerns remain regarding their accuracy and reliability, particularly in interpreting scientific data and generating references. This study assessed the performance of three AI-LLMs—DeepSeek-R1, ChatGPT-4o, and Deep Research—in interpreting scientific figures and retrieving bibliographic references. Fifteen figures were analyzed using five parameters: relevance, clarity, depth, focus, and coherence. Reference accuracy was evaluated across seven topics, and hallucination scores were calculated based on errors in titles, DOIs, journals, authors, or publication dates. ChatGPT-4o significantly outperformed DeepSeek-R1 in image interpretation (p < 0.001). In reference retrieval, DeepSeek-R1 had the highest hallucination rate (91.43%), while ChatGPT-4o and Deep Research had lower rates (39.14% and 26.57%, respectively), with Deep Research producing the most accurate references. Although ChatGPT-4o and Deep Research showed better overall performance, the presence of hallucinations in all models highlights the need to carefully verify AI-generated content in academic contexts and improve AI reference generation tools.
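The abstract does not state the exact scoring rule, but below is a minimal sketch of how such a hallucination rate could be computed, assuming a reference counts as hallucinated when any of the five checked fields (title, DOI, journal, authors, or publication date) contains an error; the field names and the any-error criterion are illustrative assumptions, not the authors' published method.

```python
from dataclasses import dataclass, fields

@dataclass
class ReferenceCheck:
    """Error flags for one AI-generated reference (True = field is wrong).

    Assumed fields follow the abstract: title, DOI, journal, authors, date.
    """
    title: bool = False
    doi: bool = False
    journal: bool = False
    authors: bool = False
    date: bool = False

def hallucination_rate(checks: list[ReferenceCheck]) -> float:
    """Percent of references with an error in at least one bibliographic field."""
    bad = sum(
        any(getattr(c, f.name) for f in fields(ReferenceCheck)) for c in checks
    )
    return 100.0 * bad / len(checks)

# Example: one fully correct reference and one with a fabricated DOI -> 50.0
print(hallucination_rate([ReferenceCheck(), ReferenceCheck(doi=True)]))
```

Under this assumed rule, a rate like DeepSeek-R1's 91.43% would mean that roughly nine in ten generated references contained at least one erroneous bibliographic field.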