Influ-BERT: A Domain-Adaptive Genomic Language Model for Advancing Influenza A Virus Research
Abstract
Influenza A Virus (IAV) poses a persistent threat to global public health due to its broad host adaptability, frequent antigenic variation, and potential for cross-species transmission. Accurate identification of IAV subtypes is essential for effective epidemic surveillance and precise disease control. Here, we present Influ-BERT, a domain-adaptive pretrained model based on the Transformer architecture. Adapted from DNABERT-2, Influ-BERT was developed using a dedicated corpus of approximately 900,000 influenza genome sequences. We constructed a custom Byte Pair Encoding (BPE) tokenizer and employed a two-stage training strategy of domain-adaptive pretraining followed by task-specific fine-tuning. This approach significantly enhanced IAV subtype identification performance. Experimental results demonstrate that Influ-BERT outperforms both traditional machine learning approaches and general genomic language models, such as DNABERT-2 and MegaDNA, in IAV subtype identification. The model achieved F1-scores consistently above 97% and exhibited stable performance gains for subtypes underrepresented in sequencing data, including H5N8, H5N1, H7N9, and H9N2. Beyond subtype identification, Influ-BERT was successfully applied to additional tasks, including respiratory virus identification, IAV pathogenicity prediction, and identification of IAV genomic fragments and functional genes, demonstrating robust performance throughout. Further interpretability analysis using sliding-window perturbation confirmed that the model focuses on biologically significant genomic regions, providing insight into its improved predictive capability.
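As a minimal sketch of how such a custom BPE tokenizer might be trained, the snippet below uses the HuggingFace `tokenizers` library to learn a vocabulary directly from raw nucleotide sequences. The vocabulary size, special tokens, and file name are illustrative assumptions, not details confirmed by the paper.

```python
# Sketch: training a BPE tokenizer on raw influenza genome sequences.
# Assumes one nucleotide sequence per line in influenza_genomes.txt
# (hypothetical file); vocab size 4096 mirrors DNABERT-2's setting,
# but the paper's actual value may differ.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
trainer = BpeTrainer(
    vocab_size=4096,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)
tokenizer.train(files=["influenza_genomes.txt"], trainer=trainer)
tokenizer.save("influ_bert_bpe.json")

# The saved tokenizer can then drive masked-language-model pretraining
# (the domain-adaptation stage) before task-specific fine-tuning.
print(tokenizer.encode("ATGGAGAAAATAGTGCTTCTT").tokens)
```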
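The sliding-window perturbation analysis can likewise be sketched: mask one genomic window at a time and record how much the predicted probability of the target class drops. The window size, stride, 'N'-masking scheme, and the `transformers` classification interface below are assumptions for illustration; the paper's exact protocol may differ.

```python
# Sketch: sliding-window perturbation for interpretability.
# Assumes a fine-tuned HuggingFace sequence-classification model;
# window, stride, and the 'N'-masking scheme are illustrative choices.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

def sliding_window_importance(model, tokenizer, seq, label_id,
                              window=50, stride=25, mask_char="N"):
    """Score each window by the drop in the target-class probability
    when that window is replaced with ambiguous 'N' bases."""
    model.eval()
    scores = []
    with torch.no_grad():
        enc = tokenizer(seq, return_tensors="pt", truncation=True)
        base = torch.softmax(model(**enc).logits, dim=-1)[0, label_id].item()
        for start in range(0, max(1, len(seq) - window + 1), stride):
            perturbed = seq[:start] + mask_char * window + seq[start + window:]
            enc = tokenizer(perturbed, return_tensors="pt", truncation=True)
            prob = torch.softmax(model(**enc).logits, dim=-1)[0, label_id].item()
            # Larger probability drop => more influential genomic region.
            scores.append((start, base - prob))
    return scores

# Usage (hypothetical checkpoint name):
# model = AutoModelForSequenceClassification.from_pretrained("influ-bert-subtype")
# tok = AutoTokenizer.from_pretrained("influ-bert-subtype")
# scores = sliding_window_importance(model, tok, genome_seq, label_id=0)
```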
Contact
songshh@big.ac.cn (Song S), atrv@lncc.br (Ana Tereza Ribeiro Vasconcelos)