Predicting unknown viral hosts with Dynamic Positive-Unlabeled learning
Abstract
Most emerging infectious diseases are zoonoses, i.e. they originate in animals, but our knowledge of host-pathogen links remains scant. AI models have been used to predict unknown zoonotic hosts, but they face challenges from biased data and the absence of confirmed negative host-pathogen associations. Here, we introduce the Dynamic Positive-Unlabeled (DPU) learning framework, an extension of classical Positive-Unlabeled learning that enables Graph Neural Networks to predict missing links in incomplete networks. DPU learning combines two components: a propensity score model that estimates the likelihood of observing existing links, and a classifier that predicts true link existence. This approach corrects predictions for sampling bias and recognizes that a missing link may reflect either a true absence of association or a gap in data collection. We applied DPU learning to predict associations between 5,330 wild mammalian species and 33 viral families worldwide, leveraging phylogeographic relationships between mammals, observed mammal-virus association patterns, mammalian life-history traits, and genetic features of the viruses. The approach showed high validation performance, providing unbiased and accurate estimates of pathogen distribution across species. DPU learning emerges as a valuable tool to support strategic, data-driven surveillance for proactive zoonotic risk mitigation.
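To illustrate how a propensity score and a link classifier can be combined, the sketch below uses the standard identity from propensity-weighted Positive-Unlabeled learning. It is not the authors' implementation: the function names are hypothetical, and it assumes that a link can only be observed if it truly exists, so that the probability of observing a link equals the true-link probability multiplied by the propensity. Under that assumption, the probability that an unobserved link is nevertheless real follows from Bayes' rule.

```python
import numpy as np


def observed_link_probability(p_true, propensity):
    """P(link observed) = P(link exists) * P(observed | exists).

    Assumes no false positives: a link is never recorded unless it exists.
    """
    p_true = np.asarray(p_true, dtype=float)
    propensity = np.asarray(propensity, dtype=float)
    return p_true * propensity


def posterior_true_given_unobserved(p_true, propensity):
    """P(link exists | link not observed), by Bayes' rule.

    Numerator: the link exists but was missed, p * (1 - e).
    Denominator: the link was not observed for any reason, 1 - p * e.
    Undefined when p * e == 1 (such a link would always be observed).
    """
    p_true = np.asarray(p_true, dtype=float)
    propensity = np.asarray(propensity, dtype=float)
    return p_true * (1.0 - propensity) / (1.0 - p_true * propensity)


# Hypothetical values for three unobserved host-virus pairs:
# same classifier score, different sampling effort (propensity).
p = np.array([0.5, 0.5, 0.5])
e = np.array([0.0, 0.5, 1.0])
posterior = posterior_true_given_unobserved(p, e)
```

With no sampling effort (propensity 0) the absence of a record carries no information and the posterior equals the classifier score; with perfect sampling (propensity 1) an unobserved link is taken to be a true absence. Intermediate propensities interpolate between the two readings of a missing link described in the abstract.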