Explainable machine learning reveals evolutionary signals in Influenza hemagglutinin

Abstract

Identifying amino acid changes that lead to phenotypic change is a central problem in viral surveillance. Common metrics of protein evolution, such as site-wise evolutionary rates and entropy, measure variability rather than phenotypic importance. Here I show that supervised, explainable machine learning models provide a complementary approach that could serve to date and classify sequences, identify mutations important for host adaptation, and control directly for confounding covariates such as sampling date and geography of origin. I curated 39,121 hemagglutinin (H3) protein sequences from GISAID, with passage annotations and associated sample metadata, to create models of sequence change. Gradient boosted decision trees were trained on encoded amino acids plus latitude, longitude, and date; SHAP values quantified site importance. The passage classifier achieved 81% overall accuracy (balanced accuracy = 0.77), distinguished egg-grown from unpassaged isolates with nearly 90% recall, and recovered both known and novel adaptive substitutions. A separate regressor, trained solely on unpassaged sequences, predicted sample collection date with R² = 0.98 and a mean absolute error of 74.5 days. Crucially, the sites the models identified as most important were strongly enriched for experimentally validated antigenic sites, and the passage model ranked these functionally critical residues far more effectively than traditional evolutionary metrics did. Across both tasks, correlations between SHAP values and standard evolutionary metrics were strong (0.63 ≤ ρ ≤ 0.9), indicating a close connection between importance and variability whose strength depends on model specification. These results demonstrate that explainable machine learning can reveal important substitutions and deliver tree-free molecular dating, and that it may transform passage metadata from a nuisance into an experimental probe.
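
To make concrete what "a measure of variability" means here, the following is a minimal sketch of site-wise Shannon entropy computed over a toy protein alignment. It is an illustration only, not the author's pipeline: the function name `sitewise_entropy` and the four short sequences are hypothetical and do not come from the GISAID data described above.

```python
import math
from collections import Counter


def sitewise_entropy(alignment):
    """Return the Shannon entropy (in bits) of each column of an alignment.

    `alignment` is a list of equal-length strings, one per sequence.
    This is a hypothetical helper written for illustration.
    """
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all sequences must have the same aligned length")

    entropies = []
    for column in zip(*alignment):  # iterate over sites, not sequences
        counts = Counter(column)
        total = len(column)
        # Sum of -p * log2(p) over the residues observed at this site.
        h = sum(-(n / total) * math.log2(n / total) for n in counts.values())
        entropies.append(h)
    return entropies


# Toy, made-up alignment: site 0 is invariant, sites 1 and 2 are split
# evenly between two residues, and site 3 has one minority residue.
toy_alignment = ["QKLP", "QKIP", "QRLP", "QRIS"]
for site, h in enumerate(sitewise_entropy(toy_alignment)):
    print(f"site {site}: {h:.3f} bits")
```

Entropy depends only on how residue frequencies are distributed at a site. It carries no information about passage history, collection date, or any other label, which is why a supervised importance score such as a SHAP value can rank sites differently.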
