In-Context Learning in Genomic Language Models as a Biological Evaluation Task

Abstract

We study whether base (non-instruction-tuned) genomic language models (gLMs) exhibit in-context learning (ICL) on DNA. Using an adapted NanoGPT trained on multiple Escherichia coli references with BPE tokenization, we frame promoter completion as a genomic ICL task: a 1,000 bp upstream context (prompt) conditions autoregressive generation of downstream bases. We introduce an intrinsic evaluation suite that quantifies overall, compositional, structural, and local-consistency similarity between generated and ground-truth promoter sequences, alongside loss and GC% diagnostics. Preliminary results suggest the base model learns aggregate nucleotide patterns and motif-ordering signals, while position-wise fidelity remains limited. We discuss tokenization–compression trade-offs, scaling behavior, and cross-species transfer directions for evaluating emergent behavior in genomic models.

Research Track: CSCI-RTCB
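As an illustration of the kind of intrinsic diagnostics described above, the sketch below computes GC% and a k-mer compositional similarity between a generated promoter sequence and its ground-truth counterpart. The function names, the choice of cosine similarity over 3-mer counts, and the toy sequences are assumptions for illustration, not the authors' actual evaluation suite.

```python
# Hypothetical sketch of intrinsic sequence diagnostics (GC% and k-mer
# compositional similarity). These are illustrative choices, not the
# paper's implementation.
from collections import Counter


def gc_percent(seq: str) -> float:
    """Percentage of G/C bases in a DNA sequence."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return 100.0 * sum(base in "GC" for base in seq) / len(seq)


def kmer_profile(seq: str, k: int = 3) -> Counter:
    """Counts of all overlapping k-mers in the sequence."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def compositional_similarity(generated: str, reference: str, k: int = 3) -> float:
    """Cosine similarity between k-mer count vectors of two sequences."""
    p, q = kmer_profile(generated, k), kmer_profile(reference, k)
    keys = set(p) | set(q)
    dot = sum(p[m] * q[m] for m in keys)
    norm = (sum(v * v for v in p.values()) ** 0.5) * (sum(v * v for v in q.values()) ** 0.5)
    return dot / norm if norm else 0.0


# Toy example comparing a generated completion against the ground truth
gen = "ATGCGTACGTTAGC"
ref = "ATGCGTACGTAAGC"
print(f"GC% (generated): {gc_percent(gen):.1f}")
print(f"3-mer cosine similarity: {compositional_similarity(gen, ref):.3f}")
```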
