Influ-BERT: An Interpretable Model for Enhancing Low-Frequency Influenza A virus Subtype Recognition

Rongye Ye
Lun Li
Shuhui Song


Abstract

Influenza A virus (IAV) poses a continuous threat to global public health due to its wide host adaptability, high-frequency antigenic variation, and potential for cross-species transmission. Accurate recognition of IAV subtypes is crucial for early pandemic warning. Here, we propose Influ-BERT, a domain-adaptive pretraining model based on the transformer architecture. Building on DNABERT-2, we constructed a dedicated corpus of approximately 900,000 influenza genome sequences, developed a custom Byte Pair Encoding (BPE) tokenizer, and employed a two-stage training strategy involving domain-adaptive pretraining followed by task-specific fine-tuning. This approach significantly enhanced recognition performance for low-frequency subtypes. Experimental results demonstrate that Influ-BERT outperforms traditional machine learning methods and general genomic language models (DNABERT-2, MegaDNA) in subtype recognition, achieving a substantial improvement in F1-score, particularly for subtypes H5N8, H5N1, H7N9, and H9N2. Furthermore, sliding window perturbation analysis revealed the model’s specific focus on key regions of the IAV genome, providing interpretable evidence supporting the observed performance gains.
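The sliding window perturbation analysis mentioned above can be illustrated with a minimal sketch. The abstract does not state the window size, stride, masking scheme, or scoring function the authors used, so everything below is an assumption: `sliding_window_perturbation` and `toy_score` are hypothetical names, windows are masked with `N`, and the toy scorer stands in for a fine-tuned classifier's confidence in the correct subtype.

```python
from typing import Callable, List, Tuple


def sliding_window_perturbation(
    sequence: str,
    score_fn: Callable[[str], float],
    window: int = 50,
    stride: int = 25,
    mask_char: str = "N",
) -> List[Tuple[int, int, float]]:
    """Mask one window at a time and record how much the score drops.

    Returns a list of (start, end, score_drop) tuples. A large drop
    suggests the model relies on that region of the sequence.
    Note: positions after the last full window are not perturbed.
    """
    baseline = score_fn(sequence)
    results = []
    last_start = max(len(sequence) - window, 0)
    for start in range(0, last_start + 1, stride):
        end = min(start + window, len(sequence))
        # Replace the window with the mask character, keep the rest intact.
        perturbed = sequence[:start] + mask_char * (end - start) + sequence[end:]
        results.append((start, end, baseline - score_fn(perturbed)))
    return results


def toy_score(seq: str) -> float:
    # Hypothetical stand-in for a model's probability of the true subtype:
    # it only "recognizes" the sequence if an illustrative motif is intact.
    return 1.0 if "GATTACA" in seq else 0.0


# Synthetic sequence with the motif at positions 40-46 (not real IAV data).
sequence = "ACGT" * 10 + "GATTACA" + "ACGT" * 10
drops = sliding_window_perturbation(sequence, toy_score, window=10, stride=5)
important = [(start, end) for start, end, drop in drops if drop > 0]
print(important)
```

In practice `score_fn` would wrap a forward pass of the fine-tuned model, and the per-window drops could be plotted along the genome segment to show which regions the prediction depends on.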

Availability

Source code is written in PyTorch and available at https://github.com/oooo111/Influenza-BERT and https://huggingface.co/rongye1/Influenza_BERT under the MIT license.

Contact

songshh@big.ac.cn (Song S).

Biographical note

Rongye Ye is currently a master’s student at the Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation.

Lun Li is an assistant professor at the Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation.

Shuhui Song is a professor at the Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation.

Version published to 10.1101/2025.07.31.667841 on bioRxiv
Aug 2, 2025

Related articles

NucEL: Single-Nucleotide ELECTRA-Style Genomic Pre-training for Efficient and Interpretable Representations

This article has 3 authors:
1. Ke Ding
2. Brian John Parker
3. Jiayu Wen
This article has no evaluations
Latest version Aug 17, 2025

Mapping antigenic evolution of influenza A virus using deep learning-based prediction of hemagglutination inhibition titers

This article has 6 authors:
1. Bingyi Yang
2. Yifan Yin
3. Lin Wang
4. Tim K. Tsang
5. Nicholas C. Wu
6. Henrik Salje
This article has no evaluations
Latest version Aug 23, 2025

Accurate and scalable multi-disease classification from adaptive immune repertoires

This article has 8 authors:
1. Natnicha Jiravejchakul
2. Ayan Sengupta
3. Songling Li
4. Debottam Upadhyaya
5. Mara A. Llamas-Covarrubias
6. Florian Hauer
7. Soichiro Haruna
8. Daron M. Standley
This article has no evaluations
Latest version Aug 16, 2025
