Interpretable sequence-based machine learning consolidates candidate H3N2 hemagglutinin antigenic sites

Abstract

Vaccine strain selection for seasonal influenza A(H3N2) depends on knowing which hemagglutinin (HA) substitutions are most likely to erode neutralizing antibody recognition, yet published antigenic site sets disagree substantially on which positions matter most. We applied interpretable gradient-boosted tree models with SHAP-based site attribution to two complementary hemagglutination inhibition (HI) datasets to produce a more consolidated ranking of candidate antigenic positions. Models trained on a Neher/Bedford benchmark dataset recover the canonical cluster-transition sites established by prior analyses. Moreover, after filtering the WIC dataset for confounding factors, our models recover the majority of positions from four major prior reference sets (Koel, Neher/Bedford, Harvey, and Shah) and improve concordance between rankings derived from the Neher/Bedford and WIC datasets. Rankings from our models also agree more strongly with models trained to predict sampling time or passage identity than with standard evolutionary metrics used to detect diversifying selection. Our results show that interpretable sequence-based models can provide a more integrative ranking of candidate antigenic positions across different data sources and modeling approaches. This work should aid efforts to prioritize H3N2 substitutions for epidemic surveillance.
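
The two comparisons the abstract relies on, concordance between position rankings from different datasets and recovery of a prior reference set, can be illustrated with a minimal sketch. It is not the authors' pipeline. The position labels, scores, and function names below are hypothetical, and the per-position scores merely stand in for an importance measure such as a mean absolute SHAP value. Spearman rank correlation is assumed here as the concordance measure; the abstract does not say which statistic was used.

```python
from statistics import mean


def rank_positions(scores):
    """Map each position to its rank (1 = highest score); ties share the average rank."""
    ordered = sorted(scores, key=lambda p: -scores[p])
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        # Extend j over the run of positions tied with position i.
        while j + 1 < len(ordered) and scores[ordered[j + 1]] == scores[ordered[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for p in ordered[i:j + 1]:
            ranks[p] = average_rank
        i = j + 1
    return ranks


def spearman(scores_a, scores_b):
    """Spearman correlation over positions scored in both inputs (Pearson on ranks)."""
    shared = sorted(set(scores_a) & set(scores_b))
    ranks_a = rank_positions({p: scores_a[p] for p in shared})
    ranks_b = rank_positions({p: scores_b[p] for p in shared})
    xa = [ranks_a[p] for p in shared]
    xb = [ranks_b[p] for p in shared]
    ma, mb = mean(xa), mean(xb)
    cov = sum((x - ma) * (y - mb) for x, y in zip(xa, xb))
    var_a = sum((x - ma) ** 2 for x in xa)
    var_b = sum((y - mb) ** 2 for y in xb)
    if var_a == 0 or var_b == 0:
        return float("nan")  # undefined when one ranking is entirely tied
    return cov / (var_a * var_b) ** 0.5


def recovered_fraction(scores, reference_sites, top_k):
    """Fraction of a reference site set found among the top_k scored positions."""
    top = set(sorted(scores, key=lambda p: -scores[p])[:top_k])
    reference = set(reference_sites)
    return len(top & reference) / len(reference)


# Hypothetical importance scores for five placeholder positions in two datasets.
dataset_a = {"p1": 0.40, "p2": 0.25, "p3": 0.20, "p4": 0.10, "p5": 0.05}
dataset_b = {"p1": 0.35, "p2": 0.10, "p3": 0.30, "p4": 0.20, "p5": 0.05}
reference_set = {"p1", "p3", "p9"}  # hypothetical prior reference set

concordance = spearman(dataset_a, dataset_b)
recovery = recovered_fraction(dataset_a, reference_set, top_k=3)
```

A higher `concordance` means the two datasets order the shared positions more similarly, and `recovery` reports how much of a reference set falls within the top-ranked positions. Both depend on choices the abstract leaves unstated, such as the cutoff `top_k`.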

Significance Statement

Every year, health authorities must update the seasonal flu vaccine to account for mutations in influenza A(H3N2) that allow the virus to escape existing immunity. Knowing which specific positions in the hemagglutinin protein drive this immune escape is essential for evaluating newly emerging variants, but published studies disagree substantially on which positions matter most. We show that interpretable machine learning models applied to two hemagglutination inhibition datasets, the Neher/Bedford benchmark dataset and the larger WHO Collaborating Centre dataset, can help resolve these disagreements. The models recover canonical cluster-transition sites from the Neher/Bedford benchmark data, and our analysis of the WIC data improves concordance across several prior rankings produced from distinct datasets and modeling approaches. The resulting rankings provide a practical, consolidated reference for prioritizing the hemagglutinin mutations most likely to affect vaccine effectiveness.
