Interpretable sequence-based machine learning consolidates candidate H3N2 hemagglutinin antigenic sites

Austin G. Meyer
Mauricio Santillana

Abstract

Vaccine strain selection for seasonal influenza A(H3N2) depends on knowing which hemagglutinin (HA) substitutions are most likely to erode neutralizing antibody recognition, yet published antigenic site sets disagree substantially on which positions matter most. We applied interpretable gradient-boosted tree models with SHAP-based site attribution to two complementary hemagglutination inhibition (HI) datasets, the Neher/Bedford benchmark dataset and the larger WHO Collaborating Centre dataset (referred to here as WIC), to produce a more consolidated ranking of candidate antigenic positions. Models trained on the Neher/Bedford benchmark dataset recover the canonical cluster-transition sites established by prior analyses. Moreover, after filtering the WIC dataset for confounding factors, our models recover the majority of positions from four major prior reference sets (Koel, Neher/Bedford, Harvey, and Shah) and improve concordance between rankings derived from the Neher/Bedford and WIC datasets. Rankings from our models also agree more strongly with models trained to predict sampling time or passage identity than with standard evolutionary metrics used to detect diversifying selection. Our results show that interpretable sequence-based models can provide a more integrative ranking of candidate antigenic positions across different data sources and modeling approaches. This work should aid efforts to prioritize H3N2 substitutions for epidemic surveillance.

Significance Statement

Every year, health authorities must update the seasonal flu vaccine to account for mutations in influenza A(H3N2) that allow the virus to escape existing immunity. Knowing which specific positions in the hemagglutinin protein drive this immune escape is essential for evaluating newly emerging variants, but published studies disagree substantially on which positions matter most. We show that interpretable machine learning models applied to two hemagglutination inhibition datasets, the Neher/Bedford benchmark dataset and the larger WHO Collaborating Centre dataset (WIC), can help to resolve these disagreements. The models recover canonical cluster-transition sites from the Neher/Bedford benchmark data, and our analysis of the WIC data improves concordance across several prior rankings produced from distinct datasets and modeling approaches. The resulting rankings provide a practical, consolidated reference for prioritizing the hemagglutinin mutations most likely to affect vaccine effectiveness.
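
As a rough illustration of what recovering a reference set and measuring concordance between two rankings can mean in practice, the sketch below ranks positions by a per-site importance score, computes the fraction of a reference set found among the top-ranked positions, and computes a Spearman rank correlation between two rankings. Everything in it is an assumption made for illustration: the function names, the choice of top-k recovery and Spearman correlation as measures, and the toy scores and position numbers are hypothetical and are not taken from the article, which does not specify its metrics here.

```python
def rank_positions(scores):
    """Sort positions from highest to lowest importance score.

    `scores` maps an HA position to a per-site importance value
    (for example a mean absolute attribution; hypothetical here).
    Ties are broken by position number so the order is deterministic.
    """
    return sorted(scores, key=lambda pos: (-scores[pos], pos))


def recovery_fraction(ranking, reference_set, top_k):
    """Fraction of reference positions found in the top_k of a ranking."""
    top = set(ranking[:top_k])
    reference = set(reference_set)
    return len(top & reference) / len(reference)


def spearman(ranking_a, ranking_b):
    """Spearman rank correlation over positions shared by both rankings.

    Positions are re-ranked within the shared subset, so there are no
    ties and the closed-form formula 1 - 6*sum(d^2)/(n*(n^2-1)) applies.
    """
    shared = set(ranking_a) & set(ranking_b)
    order_a = [p for p in ranking_a if p in shared]
    order_b = [p for p in ranking_b if p in shared]
    n = len(order_a)
    if n < 2:
        raise ValueError("need at least two shared positions")
    rank_b = {p: i for i, p in enumerate(order_b)}
    d2 = sum((i - rank_b[p]) ** 2 for i, p in enumerate(order_a))
    return 1 - 6 * d2 / (n * (n * n - 1))


# Toy importance scores for two hypothetical datasets (made-up values).
scores_a = {145: 0.9, 156: 0.7, 189: 0.5, 193: 0.3, 50: 0.1}
scores_b = {145: 0.8, 189: 0.6, 156: 0.4, 193: 0.2, 50: 0.05}
reference = {145, 156, 189}  # placeholder reference set, not a published one

ranking_a = rank_positions(scores_a)
ranking_b = rank_positions(scores_b)

print("ranking A:", ranking_a)
print("ranking B:", ranking_b)
print("recovery of reference in top 3 of A:", recovery_fraction(ranking_a, reference, 3))
print("Spearman concordance A vs B:", spearman(ranking_a, ranking_b))
```

Restricting the correlation to shared positions keeps the comparison well defined when two rankings cover different sets of sites, at the cost of ignoring positions that only one dataset scores.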

Version published to 10.64898/2026.04.28.721429 on bioRxiv
May 1, 2026

Related articles

EpitopeGNN: A Graph Neural Network for Influenza A Virus Hemagglutinin Subtype Classification Based on 3D Structure

This article has 4 authors:
1. Andrey Timofeev
2. Alexander Anufriev
3. Oleg Ergashev
4. Irina Isakova-Sivak
This article has no evaluations. Latest version Apr 27, 2026

A sequence-based proactive intelligence for influenza antigenic profiling improves vaccine strain selection

This article has 14 authors:
1. Yihao Chen
2. Ying Xu
3. Yanhui Cheng
4. Xianzhi Qi
5. Tian Bai
6. Jiaying Yang
7. Huanle Luo
8. Xiangjun Du
9. Lin Zhu
10. Lei Yang
11. Mang Shi
12. Dayan Wang
13. Zhaorong Li
14. Yuelong Shu
This article has no evaluations. Latest version Apr 21, 2026

Resolution of recursive data corruption to transform T-cell epitope discovery

This article has 9 authors:
1. Grzegorz Preibisch
2. Michał Tyrolski
3. Piotr Kucharski
4. Stanislaw Giziński
5. Piotr Grzegorczyk
6. Sungho Moon
7. Sangwoo Kim
8. Balyn Zaro
9. Anna Gambin
This article has no evaluations. Latest version Apr 1, 2026
