Benchmarking generative AI tools for literature retrieval and summarization in genomic variant interpretation
Abstract
Background
Generative AI is increasingly used to extract structured information across domains, but its reliability in academic and clinical research, where precision and accuracy are essential, remains largely unexplored. This study evaluates the ability of Large Language Model (LLM)-based tools to generate accurate, literature-based summaries of human genomic variants, with a focus on real-world usability.
Results
We benchmarked five open-access generative AI platforms (ChatGPT with GPT-4o, MistralAI, VarChat, Perplexity, and ScholarAI) across 40 curated variants, evenly split between somatic and germline settings. For each variant, summary reports were generated and blindly evaluated by domain experts against five defined metrics. VarChat emerged as the top-ranked tool, showing the highest summarization accuracy, citation relevance, and robustness against hallucinations. GPT-4o consistently ranked second and remained notably robust where the available literature was scarce. Perplexity and ScholarAI, despite being literature-focused, ranked lowest across most metrics. Tool performance was strongly influenced by the availability of peer-reviewed literature, confirming that current generative models remain sensitive to data scarcity.
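For illustration, a minimal sketch of how blinded expert scores of this kind might be aggregated into a per-tool ranking. The scoring scale, the placeholder metric names beyond the three reported above, and the `rank_tools` helper are assumptions for illustration only, not taken from the study.

```python
from statistics import mean

# Assumed metric list: three metrics are named in the abstract, the other
# two are placeholders (the study defines five in total).
METRICS = [
    "summarization_accuracy",
    "citation_relevance",
    "hallucination_robustness",
    "metric_4",  # placeholder: not named in the abstract
    "metric_5",  # placeholder: not named in the abstract
]

def rank_tools(scores: dict[str, dict[str, list[float]]]) -> list[tuple[str, float]]:
    """scores[tool][metric] holds one blinded expert score per variant.

    Returns tools sorted by their mean score across all metrics and variants.
    """
    overall = {
        tool: mean(mean(per_variant) for per_variant in by_metric.values())
        for tool, by_metric in scores.items()
    }
    return sorted(overall.items(), key=lambda kv: kv[1], reverse=True)

# Toy example with two tools and two of the 40 variants (scores are invented):
example = {
    "VarChat": {m: [4.5, 4.0] for m in METRICS},
    "GPT-4o": {m: [4.0, 3.5] for m in METRICS},
}
for tool, score in rank_tools(example):
    print(f"{tool}: {score:.2f}")
```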
Conclusions
Our findings highlight the heterogeneous performance of current generative AI tools in genomic variant interpretation workflows. While some platforms already provide useful outputs, reliable integration into basic and clinical research requires expert validation and domain-specific fine-tuning. This work provides the first curated benchmark for assessing LLM-generated content in variant genomics and underscores the need for caution when using these tools to support variant interpretation.