Benchmarking of signaling networks generated by large language models

Jeevan Tewari
Benjamin W Dahl
Jeffrey J Saucerman

Curated by eLife

eLife Assessment

The authors address a hard question and propose a pipeline for using Large Language Models to reconstruct signalling networks as well as to benchmark future models. The findings are valuable for a defined subfield, as the proposed framework allows for assessing such approaches systematically. The overall support is solid, although the present evaluation remains limited in scope and would benefit from a wider range of networks and performance metrics.

Abstract

Computational models of signaling networks provide frameworks for predicting how molecular cues guide cell decisions, but they are typically limited by manual curation from incomplete literature. Here, we test whether general-purpose large language models (LLMs) generate accurate models of signaling networks. We find that general-purpose LLMs generate 24-58% of the reactions of literature-curated networks for cardiomyocyte hypertrophy, myofibroblast activation, and mechano-signaling, and predict network responses to perturbations with accuracies of 5-26%. While current general-purpose LLMs generate signaling networks with limited accuracy, this study provides a pipeline and benchmarks to guide future improvements.

eLife
Jan 22, 2026

Reviewer #1 (Public review):

Summary:

Large language models (LLMs) have developed rapidly in recent years and are already contributing to progress across scientific fields. The manuscript addresses a specific question: whether LLMs can accurately infer signaling networks from gene lists. However, the evaluation is inadequate owing to the four major weaknesses described below. Despite these limitations, the authors conclude that current general-purpose LLMs lack adequate accuracy, a point that is already widely recognized. The manuscript's key contribution should instead be concrete recommendations for developing specialized LLMs for this task, yet such recommendations are entirely absent. Developing such specialized LLMs would be highly valuable, as they could substantially reduce the time researchers need to analyze signaling networks.

Strengths:

The manuscript raises a good question: whether current LLMs can accurately generate signaling networks from gene lists.

Weaknesses:

(1) The authors evaluate LLM performance using only three signaling networks: "hypertrophy", "fibroblast", and "mechanosignaling". Given the large number of well-established signaling pathways available, this is not a comprehensive assessment. Moreover, the analysis need not be restricted to signaling networks. Other network types, including metabolic and transcriptional regulatory networks, are already accessible in well-known databases such as KEGG, Reactome, BioCyc, WikiPathways, and Pathway Commons. Including these additional networks would substantially strengthen the evaluation.

(2) In LLM evaluation, the authors use the gene lists that exactly match those in their "ground truth" networks, thereby fixing the set of nodes and evaluating only the predicted edges. However, in practical research, the relevant genes or nodes are not fully known. A more realistic assessment would therefore include gene lists with both genes present in the ground-truth network and additional genes absent from it, to evaluate the ability of the LLM to exclude irrelevant genes.

(3) The authors report only the recall/sensitivity of the LLM, without assessing specificity. In practical applications, if an LLM generates a large number of incorrect interactions that greatly exceed the correct ones, researchers may be misled or may lose confidence in the LLM output. Therefore, a comprehensive evaluation must include both sensitivity and specificity. Furthermore, it would be informative to check whether some of the "false positives" might in fact represent biologically plausible interactions that are absent from the manually curated "ground truth". Manually generated "ground truth" can overlook genuine interactions, and the ability of LLMs to recover such missing edges could be particularly valuable. This may even represent one of the most important potential contributions of LLMs.

(4) It is widely known that applying differential equation models to highly complex biological networks, such as the three networks in the manuscript, is meaningless, because these systems involve a large number of parameters whose values can drastically alter the results. In a remark widely attributed to John von Neumann: "with four parameters I can fit an elephant, and with five I can make him wiggle his trunk." Thus, the evaluation of LLMs on "logic-based differential equation models" does not make much sense.
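
The sensitivity and specificity requested in points (2) and (3) can be computed directly from edge sets. The following is a minimal sketch, not the authors' pipeline: the function name `edge_metrics`, the placeholder node names, and the example edges are all hypothetical. It assumes edges are directed pairs over a fixed node list, so that the true negatives are all ordered node pairs absent from both networks.

```python
from itertools import permutations

def edge_metrics(nodes, truth_edges, predicted_edges):
    """Compare predicted directed edges against a curated reference network."""
    truth = set(truth_edges)
    pred = set(predicted_edges)
    # Assumed universe of candidate edges: every ordered pair of distinct nodes.
    universe = set(permutations(nodes, 2))

    tp = len(truth & pred)             # edges in both networks
    fp = len(pred - truth)             # predicted but not curated
    fn = len(truth - pred)             # curated but not predicted
    tn = len(universe - truth - pred)  # absent from both networks

    return {
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
        # Candidates for manual review as possibly genuine missing edges.
        "false_positives": sorted(pred - truth),
    }

# Hypothetical four-node example with placeholder names.
nodes = ["A", "B", "C", "D"]
truth = [("A", "B"), ("B", "C"), ("C", "D")]
pred = [("A", "B"), ("B", "C"), ("A", "D")]
metrics = edge_metrics(nodes, truth, pred)
```

Under the same assumptions, the decoy-gene test in point (2) amounts to appending genes that are absent from the reference network to `nodes`: any predicted edge touching a decoy is then counted as a false positive. The `false_positives` list is the natural starting point for checking whether some unmatched edges are biologically plausible.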

Reviewer #2 (Public review):

Summary:

The authors evaluate whether commonly used LLMs (ChatGPT, Claude and Gemini) can reconstruct signalling networks and predict effects of network perturbations, and propose a pipeline for benchmarking future models. Across three phenotypes (hypertrophy, fibroblast signalling, and mechanosignalling), LLMs capture upstream ligand-receptor interactions and conserved crosstalk but fail to recover downstream transcriptional programmes. Logic-based simulations show that LLM-derived networks underperform compared to manually curated models. The authors also propose that their pipeline can be used for benchmarking future models aimed at reconstructing signalling networks.

Strengths:

The authors compare the outcomes from three LLMs with three manually curated and validated models. Additionally, they have investigated gene network reconstruction in the context of three distinct phenotypes. Using logic-based modelling, the authors assessed how LLM-derived networks predict perturbation effects, providing functional validation beyond network overlap.
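
One simple way to score perturbation predictions of this kind is the fraction of perturbation-readout pairs whose qualitative response matches the reference. The sketch below is an illustration under stated assumptions rather than the authors' scoring method: the function name, the three response labels, and the placeholder perturbations and readouts are hypothetical, and a pair that the candidate network cannot predict (for example, because a node is missing) is assumed to count as incorrect.

```python
def perturbation_accuracy(predicted, reference):
    """Fraction of reference (perturbation, readout) pairs predicted correctly.

    Both arguments map (perturbation, readout) to a qualitative response:
    "increase", "decrease", or "no change". Pairs missing from `predicted`
    are scored as wrong.
    """
    if not reference:
        return 0.0
    matches = sum(
        predicted.get(pair) == response for pair, response in reference.items()
    )
    return matches / len(reference)

# Hypothetical reference responses with placeholder names.
reference = {
    ("stim1", "out1"): "increase",
    ("stim1", "out2"): "decrease",
    ("stim2", "out1"): "no change",
    ("stim2", "out2"): "increase",
}
# Hypothetical predictions from a candidate network; one pair is missing.
predicted = {
    ("stim1", "out1"): "increase",
    ("stim1", "out2"): "increase",
    ("stim2", "out1"): "decrease",
}
accuracy = perturbation_accuracy(predicted, reference)
```

Scoring missing pairs as wrong penalizes networks that omit nodes; restricting the denominator to shared pairs would instead measure accuracy only where a prediction exists.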

Weaknesses:

The authors have used legacy models for all three LLMs, and the study would benefit from testing the current versions of the LLMs (ChatGPT 5.2, Claude 4.5 and Gemini 2.5). Additional metrics such as node coverage, node invention, direction accuracy and sign accuracy would be useful to make robust comparisons across models.
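
The additional metrics named above are not defined in the review, so the sketch below adopts one plausible set of definitions as assumptions: node coverage is the fraction of reference nodes present in the predicted network; node invention is the fraction of predicted nodes absent from the reference; direction accuracy is the fraction of predicted edges with the reference orientation, among predicted edges whose node pair is linked in the reference in either direction; sign accuracy is the fraction with the matching activation or inhibition sign, among correctly oriented edges. The function name and example edges are hypothetical.

```python
def network_metrics(truth_edges, predicted_edges):
    """Edges are (source, target, sign) with sign +1 (activation) or -1 (inhibition)."""
    truth_nodes = {n for s, t, _ in truth_edges for n in (s, t)}
    pred_nodes = {n for s, t, _ in predicted_edges for n in (s, t)}
    truth_sign = {(s, t): sign for s, t, sign in truth_edges}
    pred_sign = {(s, t): sign for s, t, sign in predicted_edges}

    # Predicted edges whose node pair is linked in the reference, either way round.
    paired = [e for e in pred_sign if e in truth_sign or (e[1], e[0]) in truth_sign]
    # Of those, edges pointing the same way as the reference.
    same_direction = [e for e in paired if e in truth_sign]
    # Of those, edges that also carry the same sign.
    same_sign = [e for e in same_direction if pred_sign[e] == truth_sign[e]]

    def ratio(numerator, denominator):
        return numerator / denominator if denominator else 0.0

    return {
        "node_coverage": ratio(len(truth_nodes & pred_nodes), len(truth_nodes)),
        "node_invention": ratio(len(pred_nodes - truth_nodes), len(pred_nodes)),
        "direction_accuracy": ratio(len(same_direction), len(paired)),
        "sign_accuracy": ratio(len(same_sign), len(same_direction)),
    }

# Hypothetical example: one reversed edge, one wrong sign, one invented node "X".
truth = [("A", "B", 1), ("B", "C", 1), ("C", "D", -1)]
pred = [("A", "B", 1), ("C", "B", 1), ("C", "D", 1), ("A", "X", 1)]
m = network_metrics(truth, pred)
```

Reporting these four values per model and per network would separate errors of omission, hallucinated nodes, reversed causality, and mislabeled regulation, which a single overlap percentage conflates.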

Version published to 10.7554/elife.109709.1 on eLife
Jan 22, 2026
Version published to 10.7554/elife.109709 on eLife
Jan 22, 2026
Version published to 10.1101/2025.07.28.667217 on bioRxiv
Jul 29, 2025
