Life as a Function: Why Transformer Architectures Struggle to Gain Genome-Level Foundational Capabilities


Abstract

Recent advances in generative models for nucleotide sequences have shown promise, but their practical utility remains limited. In this study, we explore DNA as a complex functional representation of evolutionary processes and assess the ability of transformer-based models to capture this complexity. Through experiments with both synthetic and real DNA sequences, we demonstrate that current transformer architectures, particularly auto-regressive models relying on next-token prediction, struggle to effectively learn the underlying biological functions. Our findings suggest that these models face inherent limitations that cannot be overcome with scale, highlighting the need for alternative approaches that incorporate evolutionary constraints and structural information. We propose potential future directions, including the integration of topological methods or a switch of modelling paradigm, to enhance the generation of genomic sequences.

Article activity feed

  1. Figure axes are Cartesian coordinates of 2-dimensional vectors representing each nucleotide and then summed up to get the 2D line as by Yau et al. [21].

    I think it would be helpful to briefly describe how these 2D lines are constructed, as they may be an unfamiliar way of visualizing sequence data for many biologists.
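    To make this concrete, a construction along the following lines could be described in a sentence or two. The sketch below reflects my understanding only; the per-nucleotide vectors are placeholders, and the actual assignment is whatever Yau et al. [21] specify.

    ```python
    import matplotlib.pyplot as plt

    # Placeholder 2D vectors for each nucleotide; the real mapping follows
    # Yau et al. [21] and may assign different directions to each base.
    NUCLEOTIDE_VECTORS = {
        "A": (1.0, 1.0),
        "T": (1.0, -1.0),
        "G": (-1.0, 1.0),
        "C": (-1.0, -1.0),
    }

    def sequence_to_2d_line(seq):
        """Cumulatively sum the per-nucleotide vectors to trace a 2D curve."""
        xs, ys = [0.0], [0.0]
        for base in seq:
            dx, dy = NUCLEOTIDE_VECTORS[base]
            xs.append(xs[-1] + dx)
            ys.append(ys[-1] + dy)
        return xs, ys

    xs, ys = sequence_to_2d_line("ATGCGTACGTTAGC")
    plt.plot(xs, ys)
    plt.xlabel("x")
    plt.ylabel("y")
    plt.show()
    ```

    Even a brief verbal version of this (each base maps to a fixed 2D vector, and the curve is the cumulative sum along the sequence) would orient readers unfamiliar with the representation.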

  2. As model size increases, output accuracy seems to improve, with the biggest model showing a generated sequence without repeats. This behavior arises from the abundance of training samples for Enterovirus C (4141) compared with Gallivirus A (15).

    Could it instead be that larger models are simply less prone to getting stuck in repeats?

  3. In contrast, in panel B, the two largest models learn the Enterovirus C function very well, suggesting memorisation rather than generalisation,

    From the figure, it doesn't look like the models are simply recapitulating the true sequence. I think a more direct comparison like a sequence alignment between the true and generated sequence would be more informative here.
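    Such an alignment is cheap to produce; a minimal sketch using Biopython's pairwise aligner (the sequences here are short placeholders standing in for the true and generated genomes):

    ```python
    from Bio import Align

    # Placeholder sequences; in practice these would be the reference
    # Enterovirus C genome and the model-generated sequence.
    true_seq = "ATGACCATTGGCAATCTCAGT"
    generated_seq = "ATGACCATAGGCAATCTCAGT"

    aligner = Align.PairwiseAligner()
    aligner.mode = "global"

    alignment = aligner.align(true_seq, generated_seq)[0]
    print(alignment)                 # aligned sequences with gap characters
    print("score:", alignment.score)
    ```

    A percent-identity figure derived from such an alignment would make the memorisation claim much easier to evaluate than visual comparison of the 2D curves.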

  4. Said differently, although mutations at any replication point might be Brownian and viral-cell interactions chaotic, viral genomes adapted to a host (i.e. a particular niche) approximate a temporary Lyapunov stable point to minimize system-wide energy expenditure

    This feels pretty vague and hand-wave-y. The term "Lyapunov stable point" is mentioned without establishing or discussing the mathematical conditions or assumptions that would justify using it. Also, phrases like "system-wide energy expenditure" are not defined; what is meant by "energy", and why would viruses adapt in order to minimize it?
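    For reference, the standard condition the term carries, stated for a generic dynamical system since the manuscript does not specify the state space or dynamics being invoked:

    ```latex
    % Lyapunov stability of an equilibrium x* of dx/dt = f(x):
    % trajectories that start close enough stay close for all time.
    \[
      \forall \varepsilon > 0 \;\; \exists \delta > 0 : \quad
      \| x(0) - x^{*} \| < \delta
      \;\Longrightarrow\;
      \| x(t) - x^{*} \| < \varepsilon \quad \text{for all } t \ge 0 .
    \]
    ```

    At a minimum, the authors would need to say what the state vector and dynamics are, and over what timescale the "temporary" stability is meant to hold.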

  5. Two, its constraints are functional not sequential, backward not forward looking, developmental not constructive

    This is so succinct that it may be hard for readers to understand what is being referred to here. If these constraints are important, I think it would be worth elaborating on them a little bit.

  6. From this foundation, we can define Ω to be a summed characteristic across 𝑁 entities

    I appreciate that this presentation is intentionally abstract, but I think it would be helpful to explain more explicitly what omega represents here. In other words, what is the nature of the "characteristics" z_i? I think readers may assume that these characteristics refer to phenotypes, but later on, it becomes clear that these "characteristics" actually refer to something like "genomic state" (if not simply a genotype).
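    If I am reading the definition correctly, Ω is simply the sum of the per-entity states; written out (my reconstruction of the notation, with z_i taken as the genomic state of entity i rather than a phenotype):

    ```latex
    % Omega as a summed characteristic over N entities,
    % with z_i the (genomic) state of entity i.
    \[
      \Omega \;=\; \sum_{i=1}^{N} z_{i}
    \]
    ```

    Stating explicitly at this point that the z_i live in sequence/genotype space rather than phenotype space would prevent the misreading described above.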