MINA: linear probes reveal coding-sequence family signal in frozen DNA encoders

Austin Wijaya
Hayden Leung
Hyoungjoon Yoo

Abstract

Motivation

Frozen DNA encoders are often used as genomic feature extractors, but downstream fine-tuning does not show what information is already linearly accessible in their unchanged embeddings. We introduce MINA (Model Interrogation of Nucleotide Architectures), a lightweight probing benchmark for testing whether frozen DNA embeddings can recover (i) a 5-way protein-family label for each gene and (ii) the 1,536-dimensional GenePT embedding of each gene’s natural-language summary. We compare recoverability between canonical coding sequence and TSS-centred genomic contexts.
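To make the probing idea concrete, the following is a minimal sketch of a linear probe: a softmax-regression classifier trained on fixed feature vectors while the encoder that produced them is never updated. The abstract does not specify the probe's exact form, so the function names, learning rate, and toy 2-dimensional "embeddings" below are illustrative assumptions, not the MINA pipeline.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def scores(W, b, x):
    # One linear score per class: W[c] . x + b[c].
    return [sum(w * v for w, v in zip(W[c], x)) + b[c] for c in range(len(W))]

def fit_linear_probe(X, y, n_classes, lr=0.5, epochs=200):
    # Full-batch gradient descent on cross-entropy loss.
    # X holds precomputed (frozen) embeddings; only W and b are learned.
    dim, n = len(X[0]), len(X)
    W = [[0.0] * dim for _ in range(n_classes)]
    b = [0.0] * n_classes
    for _ in range(epochs):
        gW = [[0.0] * dim for _ in range(n_classes)]
        gb = [0.0] * n_classes
        for x, label in zip(X, y):
            p = softmax(scores(W, b, x))
            for c in range(n_classes):
                err = p[c] - (1.0 if c == label else 0.0)
                gb[c] += err
                for j in range(dim):
                    gW[c][j] += err * x[j]
        for c in range(n_classes):
            b[c] -= lr * gb[c] / n
            for j in range(dim):
                W[c][j] -= lr * gW[c][j] / n
    return W, b

def predict(W, b, x):
    s = scores(W, b, x)
    return max(range(len(s)), key=s.__getitem__)

# Toy stand-ins for frozen embeddings (hypothetical data, three classes).
X = [[1.0, 0.1], [0.9, -0.1], [0.1, 1.0], [-0.1, 0.9], [-1.0, -0.9], [-0.9, -1.0]]
y = [0, 0, 1, 1, 2, 2]
W, b = fit_linear_probe(X, y, n_classes=3)
predictions = [predict(W, b, x) for x in X]
```

Because the probe has only a weight matrix and a bias, any label it recovers must already be linearly separable in the embedding space, which is the property the benchmark is designed to test.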

Results

Across 3,244 human protein-coding genes from five families, frozen encoders recovered the family-annotation target most clearly from coding sequence (CDS). NT-v2 with meanD pooling reached macro-F1 0.828 / κ 0.821, compared with macro-F1 0.672 / κ 0.702 for a CDS 4-mer baseline. Alignment to GenePT natural-language descriptions was weaker than family-label recovery. Replacing CDS with 196,608 bp TSS-centred windows substantially reduced performance across all four encoders, indicating that the recoverable signal is primarily coding-sequence family signal rather than generic gene-function signal from arbitrary genomic context.
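For readers unfamiliar with the quantities reported above, the sketch below shows one plausible way to build 4-mer frequency features from a sequence and to compute macro-F1 and Cohen's κ from predicted labels. The function names and the handling of ambiguous bases are assumptions made for illustration; the preprint's actual baseline and evaluation code may differ.

```python
from collections import Counter
from itertools import product

def kmer_frequencies(seq, k=4):
    # Normalized counts over all 4**k DNA k-mers in lexicographic order.
    # Windows containing non-ACGT characters (e.g. N) are ignored (an assumption).
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    vocab = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[m] for m in vocab)
    return [counts[m] / total if total else 0.0 for m in vocab]

def macro_f1(y_true, y_pred):
    # Unweighted mean of per-class F1 scores.
    labels = sorted(set(y_true) | set(y_pred))
    per_class = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        per_class.append(2 * tp / denom if denom else 0.0)
    return sum(per_class) / len(per_class)

def cohen_kappa(y_true, y_pred):
    # Agreement between labels and predictions, corrected for chance agreement.
    n = len(y_true)
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if expected == 1.0:
        return 0.0  # degenerate case: kappa is undefined, report 0 by convention here
    return (observed - expected) / (1.0 - expected)

features = kmer_frequencies("ACGTACGT")   # 256-dimensional vector
f1 = macro_f1([0, 0, 1, 1], [0, 1, 1, 1])
kappa = cohen_kappa([0, 0, 1, 1], [0, 1, 1, 1])
```

Macro-F1 weights each of the five families equally regardless of size, while κ discounts the agreement expected by chance, which is why the two numbers are reported together.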

Availability and implementation

Source code is available at https://github.com/Austin-Senna/dna_to_text and requires Python ≥3.11.

Contact

asw2215@columbia.edu

Supplementary information

Supplementary tables, figures, and reproducibility details are included at the end of this preprint.

Version published to bioRxiv on May 28, 2026 (DOI: 10.64898/2026.05.25.727711)

