Accuracy and hallucination of DeepSeek and ChatGPT in scientific figure interpretation and reference retrieval
Abstract
Artificial intelligence-based large language models (AI-LLMs) are increasingly used in biomedical research, but concerns remain regarding their accuracy and reliability, particularly in interpreting scientific data and generating references. This study assessed the performance of three AI-LLMs (DeepSeek-R1, ChatGPT-4o, and Deep Research) in interpreting scientific figures and retrieving bibliographic references. Fifteen figures were analyzed using five parameters: relevance, clarity, depth, focus, and coherence. Reference accuracy was evaluated across seven topics, and hallucination scores were calculated based on errors in titles, DOIs, journals, authors, or publication dates. ChatGPT-4o significantly outperformed DeepSeek-R1 in figure interpretation (p < 0.001). In reference retrieval, DeepSeek-R1 had the highest hallucination rate (91.43%), whereas ChatGPT-4o and Deep Research had lower rates (39.14% and 26.57%, respectively), with Deep Research producing the most accurate references. Although ChatGPT-4o and Deep Research performed better overall, the presence of hallucinations in all models underscores the need to carefully verify AI-generated content in academic contexts and to improve AI reference-generation tools.
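To make the scoring concrete, below is a minimal sketch of how a hallucination rate like those reported above could be computed. It assumes, since the abstract does not specify the exact rule, that a generated reference counts as hallucinated if any checked bibliographic field (title, DOI, journal, authors, or publication date) differs from the ground-truth record; the `Reference` type and field names are illustrative, not the authors' implementation.

```python
# Hedged sketch of a hallucination-rate calculation, assuming a reference is
# flagged if any checked bibliographic field deviates from ground truth.
from dataclasses import dataclass

FIELDS = ("title", "doi", "journal", "authors", "year")

@dataclass
class Reference:
    title: str
    doi: str
    journal: str
    authors: str
    year: int

def is_hallucinated(generated: Reference, ground_truth: Reference) -> bool:
    """Flag the reference if any bibliographic field fails to match."""
    return any(getattr(generated, f) != getattr(ground_truth, f) for f in FIELDS)

def hallucination_rate(pairs: list[tuple[Reference, Reference]]) -> float:
    """Percentage of generated references with at least one incorrect field."""
    flagged = sum(is_hallucinated(g, t) for g, t in pairs)
    return 100.0 * flagged / len(pairs)
```

Under this any-field-wrong rule, a model that fabricates only DOIs would score the same per reference as one that invents entire citations; finer-grained, per-field error rates would distinguish the two.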