seqLens: optimizing language models for genomic predictions
Abstract
Understanding evolutionary variation in genomic sequences through the lens of language modeling has the potential to revolutionize biological research. Yet to maximize the utility of language modeling in genomics, we must overcome computational challenges in tokenization and model architecture so that models can capture diverse genomic features across evolutionary timescales. In this study, we investigated key elements of genomic language modeling (gLM), including tokenization, pretraining datasets, fine-tuning approaches, pooling methods, and domain adaptation, and applied the resulting language models to diverse genomic data. We gathered two evolutionarily distinct pretraining datasets: one consisting of 19,551 reference genomes, including over 18,000 prokaryotic genomes (115B nucleotides) with the remainder eukaryotic, and another, more balanced dataset of 1,354 genomes, comprising 1,166 prokaryotic and 188 eukaryotic reference genomes (180B nucleotides). We trained five byte-pair encoding tokenizers and pretrained 52 gLMs, systematically comparing architectures, hyperparameters, and classification heads. We introduce seqLens, a family of models based on disentangled attention with relative positional encoding, which outperforms similarly sized models in 13 of 19 phenotypic prediction benchmarks. We further explored continual pretraining, domain adaptation, and parameter-efficient fine-tuning methods to assess trade-offs between computational efficiency and accuracy. Our findings demonstrate that relevant pretraining data significantly boosts performance, alternative pooling techniques can enhance classification, tokenizers with larger vocabulary sizes can hurt generalization, and gLMs are capable of capturing evolutionary relationships. These insights provide a foundation for optimizing genomic language models to identify diverse evolutionary genomic features and improve genome annotations.