BioReason-Pro: Advancing Protein Function Prediction with Multimodal Biological Reasoning


Abstract

Protein function annotation is fundamental to understanding biological mechanisms, designing therapeutics, and advancing biomedical research. Current computational methods either rely on shallow sequence similarity or treat function prediction as isolated classification tasks, failing to capture the integrative reasoning across sequence, structure, domains, and interactions that expert biologists perform to infer function. We introduce BioReason-Pro, the first multimodal reasoning large language model (LLM) for protein function prediction that integrates protein embeddings with biological context to generate structured reasoning traces. A key input to BioReason-Pro is the set of GO term predictions made by GO-GPT, our autoregressive transformer that captures hierarchical and cross-aspect dependencies among GO terms. BioReason-Pro is trained via supervised fine-tuning on synthetic reasoning traces generated by GPT-5 for over 130K proteins and further optimized through reinforcement learning. It achieves an Fmax of 73.6% on GO term prediction and an LLM judge score of 8/10 on functional summaries, substantially outperforming previous methods. Evaluations with human protein experts show that BioReason-Pro annotations are preferred over ground-truth UniProt annotations in 79% of cases. Remarkably, BioReason-Pro de novo predicted experimentally confirmed binding partners, with per-residue attention localizing to the exact contact residues resolved in cryo-EM structures of those complexes. Together, GO-GPT and BioReason-Pro establish a framework for protein function prediction that combines precise ontology modeling with interpretable biological reasoning.
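For readers unfamiliar with the Fmax score reported above: it is the protein-centric maximum F-measure used in CAFA-style GO evaluations, computed by sweeping a score threshold and taking the best harmonic mean of average precision and recall. The sketch below is an illustrative implementation of that standard metric, not code from the paper; the data structures (`pred`, `truth`) are assumptions.

```python
def fmax(pred, truth, thresholds=None):
    """Protein-centric Fmax (CAFA-style), a minimal sketch.

    pred:  {protein_id: {go_term: score}}  -- predicted terms with confidence scores
    truth: {protein_id: set(go_term)}      -- experimentally annotated terms
    """
    if thresholds is None:
        thresholds = [t / 100 for t in range(1, 100)]
    best = 0.0
    for t in thresholds:
        precisions, recalls = [], []
        for prot, true_terms in truth.items():
            predicted = {g for g, s in pred.get(prot, {}).items() if s >= t}
            # Precision is averaged only over proteins with >= 1 prediction at t;
            # recall is averaged over all benchmark proteins.
            if predicted:
                precisions.append(len(predicted & true_terms) / len(predicted))
            recalls.append(len(predicted & true_terms) / len(true_terms))
        if not precisions:
            continue
        p = sum(precisions) / len(precisions)
        r = sum(recalls) / len(recalls)
        if p + r > 0:
            best = max(best, 2 * p * r / (p + r))
    return best
```

At a threshold that keeps only the correct high-confidence term, precision and recall both reach 1.0, so a single well-ranked prediction yields Fmax = 1.0 for that protein.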

Article activity feed

  1. BioReason-Pro has several important limitations. The model was trained on synthetic reasoning traces generated by GPT-5 (Singh et al., 2025), which may contain subtle reasoning errors that propagate into the model. Furthermore, training requires proteins with experimental GO annotations, a resource that remains costly and limited in throughput to produce. Reasoning quality is heavily influenced by the availability of recognizable protein domains and degrades for proteins that lack identifiable InterPro annotations (Blum et al., 2024). Performance also degrades for extremely short peptides below 50 amino acids, where limited sequence information constrains the model’s ability to ground its functional predictions. For synthetic proteins that lack identifiable domains, such as the EvoAcr sequences (Merchant et al., 2025), predictions become heavily dependent on the organism label, producing divergent functional annotations and interaction partners across organisms for the same sequence. That said, several of these predictions were biologically coherent with known phage-encoded effector biology, suggesting that BioReason-Pro can nominate plausible hypotheses even in this challenging regime. LLM-based evaluation with GPT-5.1 may harbor systematic biases, and human expert evaluation covered 162 proteins, a sample size that limits statistical power for fine-grained comparisons. The model is also computationally expensive, requiring sequential inference through ESM3 (Hayes et al., 2024), GO-GPT, and the reasoning LLM. Finally, whether BioReason-Pro learns genuine biological reasoning or sophisticated imitation of reasoning patterns remains an open scientific question.

    I appreciate this transparency! Could you explain what practical implications these limitations have, and whether they limit the utility of this approach?

  2. We employed GPT-5.1, the most capable model available at the time of evaluation, as an expert judge,

    As you note later on, there is some concern about bias here, given that GPT-5 was also used to generate the training data. Do you have further plans to mitigate this possible bias?