Genomic Language Models (gLMs) decode bacterial genomes for improved gene prediction and translation initiation site identification
Abstract
Accurate bacterial gene prediction is essential for understanding microbial functions and advancing biotechnology. Traditional methods based on sequence homology and statistical models often struggle with complex genetic variations and novel sequences because of their limited ability to interpret the “language of genes.” To overcome these challenges, we explore Genomic Language Models (gLMs), inspired by Large Language Models (LLMs) in Natural Language Processing, to enhance bacterial gene prediction. These models learn patterns and contextual dependencies within genetic sequences, much as LLMs do for human language. We employ a transformer, specifically DNABERT, in a two-stage framework: the first stage identifies Coding Sequence (CDS) regions, and the second refines those predictions by identifying the correct Translation Initiation Sites (TIS). DNABERT is fine-tuned on a curated set of complete bacterial genomes from NCBI, with a k-mer tokenizer for sequence processing. Our results show that the resulting model, GeneLM, significantly improves gene prediction accuracy. Compared with Prodigal, a leading prokaryotic gene finder, GeneLM reduces missed CDS predictions while increasing matched annotations. More notably, its TIS predictions surpass traditional methods when tested against experimentally verified sites. GeneLM demonstrates the power of gLMs in decoding genetic information, achieving state-of-the-art performance in bacterial genome analysis. This advancement highlights the potential of language models to revolutionize genome annotation, outperforming conventional tools and enabling more precise genetic insights.
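To illustrate the k-mer tokenization step mentioned in the abstract, the following is a minimal sketch of how a DNA sequence might be split into overlapping k-mers before being passed to a DNABERT-style model. The function name `kmer_tokenize`, the choice of k = 6, and the stride of 1 are assumptions made for illustration; the abstract does not state which k the authors used, and this is not their implementation.

```python
def kmer_tokenize(sequence: str, k: int = 6) -> list[str]:
    """Split a DNA sequence into overlapping k-mers (stride 1).

    Hypothetical helper for illustration; k=6 is an assumed default,
    not a value taken from the abstract.
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    seq = sequence.upper()
    # A sequence shorter than k yields no tokens.
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


# A 9-nucleotide toy sequence yields 9 - 6 + 1 = 4 overlapping 6-mers.
tokens = kmer_tokenize("ATGGCGTAA", k=6)
print(" ".join(tokens))  # ATGGCG TGGCGT GGCGTA GCGTAA
```

Overlapping k-mers let each token carry some local context, at the cost of a token sequence nearly as long as the input itself.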