Explainable machine learning reveals evolutionary signals in Influenza hemagglutinin

Austin G. Meyer

Abstract

Identifying the amino acid changes that lead to phenotypic change is a central problem in viral surveillance. Common metrics of protein evolution, such as site-wise evolutionary rates and entropy, measure variability rather than phenotypic importance. Here I show that supervised, explainable machine learning models provide a complementary approach that could serve to date and classify sequences, identify important mutations for host adaptation, and control directly for confounding covariates like sampling date and geography of origin. I curated 39,121 hemagglutinin (H3) protein sequences from GISAID with passage annotations and associated sample metadata to create models of sequence change. Gradient boosted decision trees were trained with encoded amino acids plus latitude, longitude, and date; SHAP values quantified site importance. The passage classifier achieved 81% overall accuracy (balanced accuracy = 0.77), distinguishing egg-grown from unpassaged isolates with nearly 90% recall and recovering both known and novel adaptive substitutions. A separate regressor, trained solely on unpassaged sequences, predicted sample collection date with R² = 0.98 and a mean absolute error of 74.5 days. Crucially, the sites identified as most important by the models showed a strong enrichment for experimentally validated antigenic sites, with the passage model ranking these functionally critical residues far more effectively than traditional evolutionary metrics. Across both tasks, correlations between SHAP values and standard evolutionary metrics were strong (0.63 ≤ ρ ≤ 0.9), indicating a close connection between importance and variability, depending on model specification. These results demonstrate that explainable machine learning can reveal important substitutions, deliver tree-free molecular dating, and may transform passage metadata from a nuisance into an experimental probe.
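Two quantities named in the abstract can be made concrete with a small sketch: site-wise entropy, one of the variability metrics the models are compared against, and balanced accuracy, which is the mean of per-class recall and can therefore sit below overall accuracy when classes are imbalanced. The code below is a minimal illustration, not the author's pipeline. The function names, the four-residue toy alignment, and the toy passage labels are all hypothetical and do not come from the GISAID data.

```python
import math
from collections import Counter


def site_entropy(alignment):
    """Return the Shannon entropy (in bits) of each column of an alignment."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("sequences must be aligned to equal length")
    entropies = []
    for column in zip(*alignment):
        counts = Counter(column)
        total = len(column)
        # Sum -p * log2(p) over the residues observed at this site.
        h = sum(-(n / total) * math.log2(n / total) for n in counts.values())
        entropies.append(h)
    return entropies


def balanced_accuracy(y_true, y_pred):
    """Return the unweighted mean of per-class recall."""
    recalls = []
    for label in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == label]
        hits = sum(1 for i in idx if y_pred[i] == label)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)


# Hypothetical toy alignment: sites 0 and 3 are invariant, sites 1 and 2 vary.
toy_alignment = ["QKLT", "QKIT", "QRLT", "QKLT"]
toy_entropy = site_entropy(toy_alignment)

# Hypothetical toy labels: an imbalanced two-class passage problem.
toy_true = ["egg", "egg"] + ["unpassaged"] * 8
toy_pred = ["egg", "unpassaged"] + ["unpassaged"] * 8
toy_accuracy = sum(t == p for t, p in zip(toy_true, toy_pred)) / len(toy_true)
toy_balanced = balanced_accuracy(toy_true, toy_pred)
```

In this toy case overall accuracy is 0.9 while balanced accuracy is 0.75, because the minority class is recalled only half the time. Entropy is zero at the invariant sites regardless of whether they matter phenotypically, which is the limitation of variability metrics that the abstract points to.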

Version published to 10.1101/2025.09.21.677610 on bioRxiv
Sep 23, 2025

Related articles

Machine Learning–Driven Discovery of Host Genetic Factors for Paratuberculosis in Goats Within the One Health Framework

This article has 11 authors:
1. Yalçın Yaman
2. Ahmet ESER
3. Devran Coşkun
4. Ramazan Aymaz
5. Yiğit Emir Kişi
6. Murat Keleş
7. Serdar Yağcı
8. Özgül Gülaydın
9. Serkan Süleyman Şengül
10. Kıvanç İrak
11. Memiş Bolacalı
This article has no evaluations. Latest version Jan 30, 2026

Rapid Phylogenomic Analysis of Thousands Outbreak‐Causing Viral Genomes Using Covary

This article has 1 author:
1. Marvin I. De los Santos
This article has no evaluations. Latest version Dec 22, 2025

Unlocking the genomic landscape for antimicrobial domain discovery with a two-stage progressive residue-level annotation model

This article has 13 authors:
1. Peilin Xie
2. Xingchen Liu
3. Lantian Yao
4. Zhihao Zhao
5. Anming Yang
6. Jiahui Guan
7. Zijun Jiao
8. Zhihong Liu
9. Junwen Wang
10. Tzong-Yi Lee
11. Zigang Li
12. Bingyu Cui
13. Ying-Chih Chiang
This article has no evaluations. Latest version Dec 11, 2025
