Multilingual transfer ability: Finding a Rosetta Stone between DNA Language and Natural Language

Abstract

This study explores whether Large Language Models (LLMs) can transfer abstract structural reasoning capabilities from natural language to genetic language, which lacks explicit semantics, and thereby serve as a "Rosetta Stone" connecting the two domains. We tested this hypothesis with a dual experimental design: first, a standard LLM fine-tuned on a natural-language similarity task (PAWS-X) was used to assess biological sequence similarity; second, a custom model pre-trained on a multimodal corpus (natural language, DNA, and protein) was fine-tuned in the same manner to determine the correct alignment of DNA-protein coding pairs. The results show that the transfer of basic similarity judgment succeeded (with accuracy up to 89%), while on the more complex coding-alignment task the multimodal pre-trained model achieved 81% zero-shot accuracy. This study confirms that abstract structural pattern recognition can transfer between the two languages, that its effectiveness depends strongly on the structural similarity of the tasks, and that multimodal pre-training is key to enabling complex rule transfer, establishing a new paradigm for applying LLMs to biological discovery.
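
The first experiment described above can be illustrated with a minimal sketch, assuming a Hugging Face Transformers setup: fine-tune a sentence-pair classifier on PAWS-X, then apply the same classifier unchanged to DNA sequence pairs to probe whether the learned similarity judgment transfers. The backbone model, hyperparameters, and DNA pairs below are illustrative assumptions, not the authors' actual configuration.

```python
# Sketch of the similarity-transfer experiment (illustrative, not the authors' code).
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)
import torch

model_name = "bert-base-multilingual-cased"  # assumed backbone
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# 1) Fine-tune on PAWS-X (English split) sentence-pair similarity.
paws = load_dataset("paws-x", "en")

def encode(batch):
    return tokenizer(batch["sentence1"], batch["sentence2"],
                     truncation=True, padding="max_length", max_length=128)

train_ds = paws["train"].map(encode, batched=True).rename_column("label", "labels")
train_ds.set_format("torch", columns=["input_ids", "attention_mask", "labels"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="paws_ft", num_train_epochs=1,
                           per_device_train_batch_size=16),
    train_dataset=train_ds,
)
trainer.train()

# 2) Zero-shot transfer: score DNA sequence pairs with the same classifier.
#    These toy pairs (a near-identical mutant vs. an unrelated sequence) are
#    hypothetical stand-ins for the biological similarity benchmark.
dna_pairs = [
    ("ATGGCCATTGTAATGGGCCGC", "ATGGCCATTGTAATGGGCCGA"),  # near-duplicate
    ("ATGGCCATTGTAATGGGCCGC", "TTACGGAACCTTGGAACCTTA"),  # unrelated
]

model.eval()
with torch.no_grad():
    for a, b in dna_pairs:
        inputs = tokenizer(a, b, return_tensors="pt", truncation=True).to(model.device)
        probs = torch.softmax(model(**inputs).logits, dim=-1)
        print(f"P(similar) = {probs[0, 1]:.3f}")
```

In this setup, the DNA sequences are never seen during fine-tuning; any separation between the near-duplicate and unrelated pairs reflects transfer of the abstract similarity judgment rather than domain-specific knowledge.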
