MINA: linear probes reveal coding-sequence family signal in frozen DNA encoders

Abstract

Motivation

Frozen DNA encoders are often used as genomic feature extractors, but downstream fine-tuning does not show what information is already linearly accessible in their unchanged embeddings. We introduce MINA (Model Interrogation of Nucleotide Architectures), a lightweight probing benchmark for testing whether frozen DNA embeddings can recover (i) a 5-way protein-family label for each gene and (ii) the 1,536-dimensional GenePT embedding of each gene’s natural-language summary. We compare recoverability between canonical coding sequence and TSS-centred genomic contexts.

Results

Across 3,244 human protein-coding genes from five families, frozen encoders recovered the family-annotation target most clearly from coding sequence. NT-v2 with meanD pooling reached macro-F1 0.828 / κ 0.821, compared with macro-F1 0.672 / κ 0.702 for a CDS 4-mer baseline. Alignment to GenePT natural-language descriptions was weaker. Replacing CDS with 196,608 bp TSS-centred windows substantially reduced performance across all four encoders, indicating that the recoverable signal is primarily coding-sequence family signal rather than generic gene-function signal from arbitrary genomic context.
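To make the baseline and the reported metrics concrete, the following is a minimal sketch of a 4-mer frequency featurizer together with macro-F1 and a kappa score. It is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the abstract does not specify how the 4-mer baseline was normalized or which classifier it fed, and κ is assumed here to be Cohen's kappa.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(seq, k=4):
    """Normalized k-mer frequencies over ACGT (vector of length 4**k).

    Assumption: windows containing other symbols (e.g. N) are ignored.
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[km] for km in kmers)
    return [counts[km] / total if total else 0.0 for km in kmers]


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for lab in labels:
        tp = sum(t == lab and p == lab for t, p in zip(y_true, y_pred))
        fp = sum(t != lab and p == lab for t, p in zip(y_true, y_pred))
        fn = sum(t == lab and p != lab for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: agreement corrected for chance agreement."""
    n = len(y_true)
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    expected = sum(true_counts[l] * pred_counts[l] for l in true_counts) / n ** 2
    return (observed - expected) / (1 - expected) if expected != 1 else 0.0
```

In a probing setup of this kind, the 256-dimensional vector from `kmer_frequencies` would stand in for the frozen-encoder embedding as input to the same linear classifier, so that both feature sets are scored with identical metrics.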

Availability and implementation

Source code: https://github.com/Austin-Senna/dna_to_text ; Python ≥3.11.

Contact

asw2215@columbia.edu

Supplementary information

Supplementary tables, figures, and reproducibility details are included at the end of this preprint.
