Exposing the Molecular Reaction Blind Spots of LLMs with PathwayQA
Abstract
Proteins mediate a large portion of cellular activity, and understanding protein pathways can yield novel biological insights. Large language models (LLMs) have become increasingly adept at performing inference tasks across different fields of science and engineering. These models could facilitate the analysis of protein networks and help generate hypotheses about protein interactions in a scalable and accessible manner. However, the performance of LLMs in inferring protein-mediated biochemical reactions remains understudied. Here, we evaluate nine LLMs on reasoning over protein pathways included in the curated Reactome database. We find that all nine models struggle to infer the products of a reaction when given its reactants and enzymes. GPT-4o mini performed best with a median recovery score of 0.6667, but no model surpassed the baseline strategy of parroting reactants back as predicted products. Most LLMs also performed poorly when inferring whether a protein pathway is associated with a human disease, with an average accuracy of 0.5980; DeepSeek 7B Chat performed best with an accuracy of 0.9100. This study highlights an area where LLMs still struggle to make correct inferences and provides an opportunity for further work in developing biological LLMs. We also provide a novel question-answer dataset, PathwayQA, based on the Reactome database. PathwayQA can be used to benchmark and improve model performance on reasoning over protein-interaction networks. PathwayQA is available at https://github.com/Helix-Research-Lab/PathwayQA.