Genolator: A Multimodal Large Language Model Fusing Natural Language, Genomic, and Structural Tokens for Protein Function Interpretation
Abstract
Background: Decoding the genetic code to unveil genome functionality is a monumental task that would greatly advance our understanding of disease mechanisms and the development of targeted treatments. Although large language models (LLMs) have transformed natural language processing across diverse domains, translating the complex language of DNA into human-readable form remains challenging due to the complexity of genomic data and the unexplored regions of the human genome. Current (genomic) language models tend to have a solid understanding either of natural language or of the genomic code; models fusing both aspects are largely lacking.

Results: Here we present Genolator, a multimodal large language model that integrates embeddings from DNA sequences, amino acid sequences, and protein structures with natural language queries. Fine-tuned on over 370,000 question-answer pairs generated from abstracted Gene Ontology (GO) terms, Genolator effectively answers queries about protein subcellular localization, molecular function, and biological processes. Evaluation demonstrates high accuracy in confirming or denying protein-function associations, outperforming baseline models such as openly available general-purpose LLMs like GPT-4.1 as well as smaller domain-specific models that integrate knowledge from foundation models like Evo2 and ESM-2. Exploration of Genolator's hidden states reveals a biologically and linguistically plausible organization of its learned representations.

Conclusion: Genolator enhances access to genomic information by enabling natural language interaction with protein data, facilitating biological discovery and clinical research. It represents a step toward bridging genomic code and human language through the integration of a multimodal LLM.
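The fusion idea described above, embedding several biological modalities into the token space of a language model, can be illustrated with a minimal sketch. All names, dimensions, and the per-modality linear projections below are illustrative assumptions, not Genolator's actual architecture or parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-modality embedding sizes (illustrative, not from the paper).
DIMS = {"dna": 512, "protein_seq": 1280, "structure": 384}
D_MODEL = 768  # assumed hidden size of the language model

# One learned linear projection per modality maps its embedding
# into the LLM's token-embedding space (random weights stand in
# for trained parameters here).
projections = {m: rng.standard_normal((d, D_MODEL)) * 0.02 for m, d in DIMS.items()}

def fuse(modality_embeddings):
    """Project each modality embedding and stack the results as prefix tokens
    that would be prepended to the tokenized natural language query."""
    tokens = [emb @ projections[m] for m, emb in modality_embeddings.items()]
    return np.stack(tokens)  # shape: (n_modalities, D_MODEL)

# Example: one embedding vector per modality, e.g. from Evo2 / ESM-2-style encoders.
inputs = {m: rng.standard_normal(d) for m, d in DIMS.items()}
prefix = fuse(inputs)
print(prefix.shape)  # (3, 768)
```

In this sketch each modality contributes a single prefix token; a real multimodal LLM would typically emit several tokens per modality and learn the projections jointly with the language model during fine-tuning.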