In-Context Learning in Genomic Language Models as a Biological Evaluation Task
Abstract
We study whether base (non-instruction-tuned) genomic language models (gLMs) exhibit in-context learning (ICL) on DNA. Using an adapted NanoGPT trained on multiple Escherichia coli references with BPE tokenization, we frame promoter completion as a genomic ICL task: a 1,000 bp upstream context (prompt) conditions autoregressive generation of downstream bases. We introduce an intrinsic evaluation suite that quantifies overall, compositional, structural, and local-consistency similarity between generated and ground-truth promoter sequences, alongside loss and GC% diagnostics. Preliminary results suggest the base model learns aggregate nucleotide patterns and motif-ordering signals, while position-wise fidelity remains limited. We discuss tokenization–compression trade-offs, scaling behavior, and cross-species transfer directions for evaluating emergent behavior in genomic models.

Research Track: CSCI-RTCB
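The evaluation suite is only summarized above; as a purely illustrative sketch (not the authors' implementation), the Python snippet below shows the kind of intrinsic diagnostics the abstract names: GC% content, k-mer (compositional) similarity, and position-wise fidelity between a generated downstream sequence and its ground-truth counterpart. All function names and the toy sequences are hypothetical.

```python
from collections import Counter

def gc_content(seq: str) -> float:
    """Fraction of G/C bases in a DNA sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / max(len(seq), 1)

def kmer_profile(seq: str, k: int = 3) -> Counter:
    """Counts of overlapping k-mers as a simple compositional summary."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def cosine_similarity(p: Counter, q: Counter) -> float:
    """Cosine similarity between two k-mer count vectors."""
    keys = set(p) | set(q)
    dot = sum(p[key] * q[key] for key in keys)
    norm_p = sum(v * v for v in p.values()) ** 0.5
    norm_q = sum(v * v for v in q.values()) ** 0.5
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0

def positionwise_identity(a: str, b: str) -> float:
    """Fraction of aligned positions where the two sequences agree."""
    n = min(len(a), len(b))
    return sum(a[i] == b[i] for i in range(n)) / n if n else 0.0

# Hypothetical usage: `generated` would come from autoregressive sampling
# conditioned on a 1,000 bp upstream prompt; `reference` is the ground-truth
# downstream promoter region. The toy strings here are placeholders.
generated = "ATGCGCTATAATGCGC"
reference = "ATGCGTTATAATGCGA"
print("GC% (generated / reference):", gc_content(generated), gc_content(reference))
print("3-mer cosine similarity:", cosine_similarity(kmer_profile(generated), kmer_profile(reference)))
print("Position-wise identity:", positionwise_identity(generated, reference))
```

In this sketch, the compositional metric (k-mer cosine similarity) can remain high even when position-wise identity is low, which mirrors the abstract's observation that aggregate nucleotide patterns are captured while per-position fidelity remains limited.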