Evaluating Large Language Models for Gene-to-Phenotype Mapping: The Critical Role of Full-Text Database Access

Abstract

Transformer-based large language models (LLMs) have demonstrated significant potential in the biological and medical fields, owing to their ability to learn effectively from large-scale, diverse datasets and to perform a wide range of downstream tasks. However, LLMs are limited by issues such as information-processing inaccuracies and data confabulation. These limitations hinder their utility for literature searches and other tasks that require accurate and comprehensive extraction of information from extensive scientific literature. In this study, we evaluated the performance of various LLMs in accurately retrieving peer-reviewed literature and mapping correlations between 198 genes and six phenotypes: bone formation, cartilage formation, fibrosis, cell proliferation, tendon formation, and ligament formation. Our analysis included three types of models: (1) standard transformer-based LLMs (ChatGPT-4o and Gemini 1.5 Pro); (2) specialized LLMs with dedicated custom databases of peer-reviewed articles (SciSpace and ScholarAI); and (3) specialized LLMs without dedicated databases (PubMedGPT and ScholarGPT). Using human-curated gene-to-phenotype mappings as the ground truth, we found that specialized LLMs with dedicated databases achieved the highest accuracy (>80%) in gene-to-phenotype mapping. In addition, these models provided relevant peer-reviewed publications supporting each gene-to-phenotype correlation. These findings underscore the importance of database augmentation and specialization in enhancing the reliability and utility of LLMs for biomedical research applications.
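
The evaluation protocol described in the abstract amounts to comparing each model's predicted gene-to-phenotype assignments against a curated reference and scoring the agreement. The sketch below illustrates one plausible form of that accuracy computation; the gene names, phenotype assignments, and scoring scheme are illustrative assumptions, not the authors' actual dataset or code.

```python
# Sketch of the accuracy metric implied by the abstract: compare an
# LLM's predicted gene-to-phenotype labels against a human-curated
# ground truth. All entries below are hypothetical examples.

PHENOTYPES = [
    "bone formation", "cartilage formation", "fibrosis",
    "cell proliferation", "tendon formation", "ligament formation",
]

# Hypothetical curated ground truth: gene -> set of associated phenotypes.
ground_truth = {
    "RUNX2": {"bone formation"},
    "SOX9": {"cartilage formation", "cell proliferation"},
}

# Hypothetical predictions from one LLM, in the same format.
predictions = {
    "RUNX2": {"bone formation"},
    "SOX9": {"cartilage formation"},
}

def mapping_accuracy(truth: dict, preds: dict) -> float:
    """Fraction of (gene, phenotype) decisions the model gets right,
    counting both asserted and denied associations."""
    correct = total = 0
    for gene, true_set in truth.items():
        pred_set = preds.get(gene, set())
        for phenotype in PHENOTYPES:
            total += 1
            if (phenotype in pred_set) == (phenotype in true_set):
                correct += 1
    return correct / total

print(f"accuracy: {mapping_accuracy(ground_truth, predictions):.2%}")
```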
