MINA: linear probes reveal coding-sequence family signal in frozen DNA encoders
Abstract
Motivation
Frozen DNA encoders are often used as genomic feature extractors, but downstream fine-tuning does not show what information is already linearly accessible in their unchanged embeddings. We introduce MINA (Model Interrogation of Nucleotide Architectures), a lightweight probing benchmark for testing whether frozen DNA embeddings can recover (i) a 5-way protein-family label for each gene and (ii) the 1,536-dimensional GenePT embedding of each gene’s natural-language summary. We compare recoverability between canonical coding sequence and TSS-centred genomic contexts.
Results
In 3,244 human protein-coding genes from five families, frozen encoders recovered the family-annotation target most clearly from coding sequence. NT-v2 with meanD pooling reached macro-F1 0.828 / κ 0.821, compared with macro-F1 0.672 / κ 0.702 for a CDS 4-mer baseline. Alignment to GenePT natural-language descriptions was weaker. Replacing CDS with 196,608 bp TSS-centred windows substantially reduced performance across all four encoders, indicating that the recoverable signal is primarily coding-sequence family signal rather than generic gene-function signal from arbitrary genomic context.
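For readers unfamiliar with the baseline and metrics quoted above, the following is a minimal sketch of how a 4-mer frequency vector, macro-F1, and κ might be computed. It is not taken from the MINA code base: the function names `kmer_frequencies`, `macro_f1`, and `cohen_kappa` are hypothetical, κ is assumed here to be Cohen's kappa, and the handling of non-ACGT characters is an illustrative choice.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(seq, k=4):
    """Normalized k-mer frequencies over the ACGT alphabet (4**k entries).

    Assumption: windows containing characters other than A/C/G/T are ignored.
    """
    seq = seq.upper()
    vocab = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[m] for m in vocab)
    return [counts[m] / total if total else 0.0 for m in vocab]


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(y_true)
    labels = set(y_true) | set(y_pred)
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    p_e = sum(true_counts[c] * pred_counts[c] for c in labels) / (n * n)
    return (p_o - p_e) / (1 - p_e) if p_e != 1 else 0.0
```

In a setup like the one described, each coding sequence would be mapped to a 256-dimensional vector by `kmer_frequencies`, a linear classifier would be fitted on those vectors, and its held-out predictions would be scored with `macro_f1` and `cohen_kappa`. Macro averaging weights the five families equally regardless of size, while κ discounts agreement expected by chance.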
Availability and implementation
Source code: https://github.com/Austin-Senna/dna_to_text (Python ≥3.11).
Contact
asw2215@columbia.edu
Supplementary information
Supplementary tables, figures, and reproducibility details are included at the end of this preprint.