Life as a Function: Why Transformer Architectures Struggle to Gain Genome-Level Foundational Capabilities


Abstract

Recent advances in generative models for nucleotide sequences have shown promise, but their practical utility remains limited. In this study, we explore DNA as a complex functional representation of evolutionary processes and assess the ability of transformer-based models to capture this complexity. Through experiments with both synthetic and real DNA sequences, we demonstrate that current transformer architectures, particularly auto-regressive models relying on next-token prediction, struggle to effectively learn the underlying biological functions. Our findings suggest that these models face inherent limitations that cannot be overcome with scale, highlighting the need for alternative approaches that incorporate evolutionary constraints and structural information. We propose potential future directions, including the integration of topological methods or a switch of modelling paradigm, to enhance the generation of genomic sequences.

Article activity feed

  1. Figure axes are Cartesian coordinates of 2-dimensional vectors representing each nucleotide and then summed up to get the 2D line as by Yau et al. [21].

    I think it would be helpful to briefly describe how these 2D lines are constructed, as they may be an unfamiliar way of visualizing sequence data for many biologists.
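    To make this concrete, a construction along the following lines could be described in a sentence or two. The sketch below reflects my understanding only; the per-nucleotide vectors are placeholders, and the actual assignment is whatever Yau et al. [21] specify.

    ```python
    import matplotlib.pyplot as plt

    # Placeholder 2D vectors for each nucleotide; the real mapping follows
    # Yau et al. [21] and may assign different directions to each base.
    NUCLEOTIDE_VECTORS = {
        "A": (1.0, 1.0),
        "T": (1.0, -1.0),
        "G": (-1.0, 1.0),
        "C": (-1.0, -1.0),
    }

    def sequence_to_2d_line(seq):
        """Cumulatively sum the per-nucleotide vectors to trace a 2D curve."""
        xs, ys = [0.0], [0.0]
        for base in seq:
            dx, dy = NUCLEOTIDE_VECTORS[base]
            xs.append(xs[-1] + dx)
            ys.append(ys[-1] + dy)
        return xs, ys

    xs, ys = sequence_to_2d_line("ATGCGTACGTTAGC")
    plt.plot(xs, ys)
    plt.xlabel("x")
    plt.ylabel("y")
    plt.show()
    ```

    Even a brief verbal version of this (each base maps to a fixed 2D vector, and the curve is the cumulative sum along the sequence) would orient readers unfamiliar with the representation.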

  2. As model size increases, output accuracy seems to improve, with the biggest model showing a generated sequence without repeats. This behavior arises from the abundance of training samples for Enterovirus C (4141) compared with Gallivirus A (15).

    Could it instead be that larger models are simply less prone to getting stuck in repeats?

  3. In contrast, in panel B, the two largest models learn the Enterovirus C function very well, suggesting memorisation rather than generalisation,

    From the figure, it doesn't look like the models are simply recapitulating the true sequence. I think a more direct comparison like a sequence alignment between the true and generated sequence would be more informative here.
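    Such an alignment is cheap to produce; a minimal sketch using Biopython's pairwise aligner (the sequences here are short placeholders standing in for the true and generated genomes):

    ```python
    from Bio import Align

    # Placeholder sequences; in practice these would be the reference
    # Enterovirus C genome and the model-generated sequence.
    true_seq = "ATGACCATTGGCAATCTCAGT"
    generated_seq = "ATGACCATAGGCAATCTCAGT"

    aligner = Align.PairwiseAligner()
    aligner.mode = "global"

    alignment = aligner.align(true_seq, generated_seq)[0]
    print(alignment)                 # aligned sequences with gap characters
    print("score:", alignment.score)
    ```

    A percent-identity figure derived from such an alignment would make the memorisation claim much easier to evaluate than visual comparison of the 2D curves.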

  4. Said differently, although mutations at any replication point might be Brownian and viral-cell interactions chaotic, viral genomes adapted to a host (i.e. a particular niche) approximate a temporary Lyapunov stable point to minimize system-wide energy expenditure

    This feels pretty vague and hand-wave-y. The term "Lyapunov stable point" is mentioned without establishing or discussing the mathematical conditions or assumptions that would justify using it. Also, phrases like "system-wide energy expenditure" are not defined; what is meant by "energy", and why would viruses adapt in order to minimize it?
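    For reference, the standard condition the term carries, stated for a generic dynamical system since the manuscript does not specify the state space or dynamics being invoked:

    ```latex
    % Lyapunov stability of an equilibrium x* of dx/dt = f(x):
    % trajectories that start close enough stay close for all time.
    \[
      \forall \varepsilon > 0 \;\; \exists \delta > 0 : \quad
      \| x(0) - x^{*} \| < \delta
      \;\Longrightarrow\;
      \| x(t) - x^{*} \| < \varepsilon \quad \text{for all } t \ge 0 .
    \]
    ```

    At a minimum, the authors would need to say what the state vector and dynamics are, and over what timescale the "temporary" stability is meant to hold.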

  5. Two, its constraints are functional not sequential, backward not forward looking, developmental not constructive

    This is so succinct that it may be hard for readers to understand what is being referred to here. If these constraints are important, I think it would be worth elaborating on them a little bit.

  6. From this foundation, we can define Ω to be a summed characteristic across 𝑁 entities

    I appreciate that this presentation is intentionally abstract, but I think it would be helpful to explain more explicitly what omega represents here. In other words, what is the nature of the "characteristics" z_i? I think readers may assume that these characteristics refer to phenotypes, but later on, it becomes clear that these "characteristics" actually refer to something like "genomic state" (if not simply a genotype).
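    If I am reading the definition correctly, Ω is simply the sum of the per-entity states; written out (my reconstruction of the notation, with z_i taken as the genomic state of entity i rather than a phenotype):

    ```latex
    % Omega as a summed characteristic over N entities,
    % with z_i the (genomic) state of entity i.
    \[
      \Omega \;=\; \sum_{i=1}^{N} z_{i}
    \]
    ```

    Stating explicitly at this point that the z_i live in sequence/genotype space rather than phenotype space would prevent the misreading described above.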