Multilingual transfer ability: Finding a Rosetta Stone between DNA Language and Natural Language

Abstract

This study explores whether Large Language Models (LLMs) can transfer abstract structural reasoning capabilities from natural language to genetic language, which lacks explicit semantics, and thereby serve as a "Rosetta Stone" connecting the two domains. We tested this hypothesis with a dual experimental design: first, a standard LLM fine-tuned on a natural-language similarity task (PAWS-X) was used to assess biological sequence similarity; second, a custom model pre-trained on a multimodal corpus (natural language, DNA, and protein) was fine-tuned in the same manner to determine the correct alignment of DNA-protein coding pairs. The results show that the transfer of basic similarity judgment succeeded (with accuracy up to 89%), while on the more complex coding-alignment task the multimodal pre-trained model achieved 81% zero-shot accuracy. This study confirms that abstract structural pattern recognition can transfer between the two languages, that its effectiveness depends strongly on the structural similarity of the tasks, and that multimodal pre-training is key to enabling complex rule transfer, establishing a new paradigm for applying LLMs to biological discovery.
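
The first experiment described above can be illustrated with a minimal sketch, assuming a Hugging Face Transformers setup: fine-tune a sentence-pair classifier on PAWS-X, then apply the same classifier unchanged to DNA sequence pairs to probe whether the learned similarity judgment transfers. The backbone model, hyperparameters, and DNA pairs below are illustrative assumptions, not the authors' actual configuration.

```python
# Sketch of the similarity-transfer experiment (illustrative, not the authors' code).
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)
import torch

model_name = "bert-base-multilingual-cased"  # assumed backbone
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# 1) Fine-tune on PAWS-X (English split) sentence-pair similarity.
paws = load_dataset("paws-x", "en")

def encode(batch):
    return tokenizer(batch["sentence1"], batch["sentence2"],
                     truncation=True, padding="max_length", max_length=128)

train_ds = paws["train"].map(encode, batched=True).rename_column("label", "labels")
train_ds.set_format("torch", columns=["input_ids", "attention_mask", "labels"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="paws_ft", num_train_epochs=1,
                           per_device_train_batch_size=16),
    train_dataset=train_ds,
)
trainer.train()

# 2) Zero-shot transfer: score DNA sequence pairs with the same classifier.
#    These toy pairs (a near-identical mutant vs. an unrelated sequence) are
#    hypothetical stand-ins for the biological similarity benchmark.
dna_pairs = [
    ("ATGGCCATTGTAATGGGCCGC", "ATGGCCATTGTAATGGGCCGA"),  # near-duplicate
    ("ATGGCCATTGTAATGGGCCGC", "TTACGGAACCTTGGAACCTTA"),  # unrelated
]

model.eval()
with torch.no_grad():
    for a, b in dna_pairs:
        inputs = tokenizer(a, b, return_tensors="pt", truncation=True).to(model.device)
        probs = torch.softmax(model(**inputs).logits, dim=-1)
        print(f"P(similar) = {probs[0, 1]:.3f}")
```

In this setup, the DNA sequences are never seen during fine-tuning; any separation between the near-duplicate and unrelated pairs reflects transfer of the abstract similarity judgment rather than domain-specific knowledge.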
