Mut-BPE: A Modified BPE Strategy Improves Variant Effect Prediction

Abstract

Byte Pair Encoding (BPE) is widely used in genome foundation models because it compresses long DNA sequences into fewer tokens. However, its variable-length tokens often span multiple nucleotides, which limits the model's sensitivity to single-nucleotide variations, an essential requirement for accurate Variant Effect Prediction (VEP). We introduce Mut-BPE, a training-free, plug-and-play tokenization strategy that augments BPE with explicit single-nucleotide resolution at variant sites. Mut-BPE preserves the efficiency of BPE while improving its ability to represent subtle genomic alterations. To evaluate its effectiveness, we applied Mut-BPE to DNABERT-2 and conducted extensive experiments across six diverse datasets spanning gene expression, pathogenicity, and trait-associated variants, under both zero-shot and fine-tuning settings. Mut-BPE consistently outperformed conventional BPE tokenization, yielding significant improvements in both AUROC and AUPRC, particularly on imbalanced datasets. These results highlight Mut-BPE as a practical enhancement for genomic foundation models in VEP tasks. Code availability: https://anonymous.4open.science/r/Mut-BPE-1182
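
The abstract does not spell out how single-nucleotide resolution is enforced at variant sites, so the following is only a minimal sketch of one possible reading: tokenize the flanking sequence on each side of the variant separately and emit the variant base as its own token. Everything here is an assumption for illustration, not the authors' implementation: `TOY_VOCAB`, `toy_tokenize` (a greedy longest-match segmenter standing in for a real BPE tokenizer such as the one used by DNABERT-2), and `variant_aware_tokenize` are all hypothetical names.

```python
from typing import Callable, List

# Hypothetical toy vocabulary standing in for a learned BPE vocabulary.
TOY_VOCAB = {"A", "C", "G", "T", "AC", "GT", "ACG", "TAC"}


def toy_tokenize(seq: str) -> List[str]:
    """Greedy longest-match segmentation (a stand-in for real BPE, not BPE itself)."""
    max_len = max(len(tok) for tok in TOY_VOCAB)
    tokens: List[str] = []
    i = 0
    while i < len(seq):
        # Try the longest candidate first, then shrink until a vocabulary hit.
        for size in range(min(max_len, len(seq) - i), 0, -1):
            piece = seq[i:i + size]
            if piece in TOY_VOCAB:
                tokens.append(piece)
                i += size
                break
        else:
            raise ValueError(f"no token covers {seq[i]!r}")
    return tokens


def variant_aware_tokenize(
    seq: str,
    variant_pos: int,
    tokenize: Callable[[str], List[str]] = toy_tokenize,
) -> List[str]:
    """Assumed scheme: tokenize each flank on its own and keep the
    nucleotide at `variant_pos` (0-based) as a standalone token."""
    left, base, right = seq[:variant_pos], seq[variant_pos], seq[variant_pos + 1:]
    return tokenize(left) + [base] + tokenize(right)


ref = "ACGTACGT"
alt = "ACTTACGT"  # G>T substitution at 0-based position 2

plain_ref = toy_tokenize(ref)
plain_alt = toy_tokenize(alt)
aware_ref = variant_aware_tokenize(ref, 2)
aware_alt = variant_aware_tokenize(alt, 2)
```

In this toy case, plain segmentation absorbs the reference base into the multi-nucleotide token `ACG` and yields a different number of tokens for the two alleles, whereas the variant-aware version produces sequences of equal length that differ only in the single token at the variant site.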
