Exposing the Molecular Reaction Blind Spots of LLMs with PathwayQA
Abstract
Proteins mediate a large portion of cellular activity, and understanding protein pathways can yield novel biological insights. Large language models (LLMs) have become increasingly adept at performing inference tasks across different fields of science and engineering. These models could facilitate the analysis of protein networks and help generate hypotheses about protein interactions in a scalable and accessible manner. However, the performance of LLMs in inferring protein-mediated biochemical reactions remains understudied. Here, we evaluate nine LLMs on reasoning over protein pathways included in the curated Reactome database. We find that all nine models struggle to infer the products of a reaction when given its reactants and enzymes. GPT-4o mini performed best with a median recovery score of 0.6667, but no model surpassed the baseline strategy of parroting reactants back as predicted products. Most LLMs also performed poorly when inferring whether a protein pathway is associated with a human disease, with an average accuracy of 0.5980; DeepSeek 7B Chat performed best with an accuracy of 0.9100. This study highlights an area where LLMs still struggle to make correct inferences and provides an opportunity for further work in developing biological LLMs. We also provide a novel question-answer dataset, PathwayQA, based on the Reactome database. PathwayQA can be used to benchmark and improve model performance on reasoning over protein-interaction networks. PathwayQA is available at https://github.com/Helix-Research-Lab/PathwayQA.