Large language model inference of macromolecular complex composition via model consensus and experimental data integration

Mikhail Zhernevskii
Daniel Lynch
Vasilii Gorbunov
Yue Bao
Dmitry Korkin

Abstract

Large language models (LLMs) are poised to reshape how biologists retrieve specialized knowledge at scale. Yet their performance on deep, domain-specific queries is poorly characterized, because much biological information resides in structured databases or large experimental datasets rather than in free text. One such gap in cellular biology lies in identifying major macromolecular complexes, conserved biological units essential to many cellular processes. Cataloging large complexes, such as the ribosome or RNA polymerase, along with their constituent genes is a significant challenge for LLMs because they tend to hallucinate and to produce incomplete or inconsistent lists of components. Here, we systematically evaluate six state-of-the-art LLMs on the task of retrieving the gene components of 91 protein complexes and develop an integrative framework that combines LLM output consensus with experimental multi-omics data to reconcile and filter model responses. We found that two extensions of a basic single-LLM baseline, (i) aggregating LLM outputs into a consensus and (ii) integrating LLM predictions with the experimental data, each improved retrieval accuracy. Furthermore, a consensus of LLM outputs integrated with the incomplete experimental data using a graph-theoretic approach achieved the highest accuracy (F1 score of 82.5%), compared with the best stand-alone LLM (F1 score of 76.4%). These results show that optimized integration of predictions from multiple LLMs with high-throughput experimental data can support scalable, semi-automated curation of specialized biological resources, and they provide a general template for benchmarking and deploying LLMs on structured knowledge retrieval tasks in molecular biology.
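To make the consensus idea and the F1 metric concrete, here is a minimal sketch. It is not the authors' pipeline: the abstract does not specify the aggregation rule, so a simple vote threshold (`min_votes`) is assumed, the graph-theoretic integration with multi-omics data is not modeled, and the gene labels and model outputs are hypothetical toy values.

```python
from collections import Counter


def consensus(predictions, min_votes):
    """Keep genes named by at least `min_votes` models (assumed voting rule)."""
    votes = Counter(gene for pred in predictions for gene in set(pred))
    return {gene for gene, n in votes.items() if n >= min_votes}


def f1_score(predicted, reference):
    """F1 of a predicted gene set against a reference gene set."""
    true_positives = len(predicted & reference)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(reference)
    return 2 * precision * recall / (precision + recall)


# Hypothetical reference complex and three hypothetical model answers.
reference = {"GENE_A", "GENE_B", "GENE_C", "GENE_D"}
model_outputs = [
    {"GENE_A", "GENE_B", "GENE_C", "GENE_X"},  # one spurious, one missing
    {"GENE_A", "GENE_B", "GENE_D"},            # one missing
    {"GENE_A", "GENE_C", "GENE_D", "GENE_Y"},  # one spurious, one missing
]

merged = consensus(model_outputs, min_votes=2)
single_scores = [f1_score(pred, reference) for pred in model_outputs]
merged_score = f1_score(merged, reference)
```

In this toy case the spurious genes each receive a single vote and are dropped, while every true component is named by at least two models, so the merged set scores higher than any individual answer. Real model outputs will not separate this cleanly, which is where the experimental data integration described in the abstract comes in.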

Version published to 10.64898/2026.05.20.726735 on bioRxiv
May 23, 2026


Related articles

Partner determination from protein sequences using class information with CLAPP

Just Add Structure: Protein Language Models Combined with Structural Equivariance Excel at Protein Tasks

Efficient and Tidy Manipulation of Annotated Matrix Data with plyxp