An Improved Dataset for Predicting Mammal Infecting Viruses from Genetic Sequence Information

Tyler Reddy
Austin Schneider
Aaron R Hall
Adam Witmer
Nick Hengartner

Abstract

There have been several attempts to develop machine learning (ML) models that identify human-infecting viruses from their genomic sequences, with varying degrees of success. Direct comparison between models is problematic because they are typically trained and evaluated on different datasets, with different data splitting schemes, features, and performance metrics. In this paper we present a standardized dataset of mammal-infecting and non-infecting viral pathogens, refined from the previous work of Mollentze et al. to include the latest literature evidence, roughly doubling the number of curated host-virus records available to the community, and adding two new host target labels, primate and mammal. The new host labels were included for several reasons: previous reports indicate that classification performance is better at broader taxonomic ranks; more data may be available for primate infection, which might serve as a suitable proxy for zoonotic potential; and the broader labels may avoid false positives for human infection that stem from absence of evidence. On this dataset, we report the performance of eight machine learning models for predicting mammal-infecting viruses from their genomic sequences. We find that randomly assigning cases in our improved dataset to training and test sets, rather than using the original training/test assignments of Mollentze et al., increases the overall average ROC AUC for predicting human infection from 0.663 ± 0.070 to 0.784 ± 0.013, consistent with the reduction in phylogenetic distance between the training and test sets (relative entropy change from 3.00 to 0.08). The broadest host category, mammal infection, can be predicted most reliably, at 0.850 ± 0.020. We share our improved dataset and code to enable standardized comparisons of machine learning methods for predicting human host infection.
Overall, we present preliminary evidence that classification of virus host infection is more tractable at higher taxonomic ranks; that, unsurprisingly, reducing the phylogenetic distance between training and test sets can improve predictive performance; and that peptide k-mer features appear to harm out-of-sample model performance. We are left with the question of whether models for virus host prediction can reasonably be expected to perform well in out-of-sample scenarios, given the likelihood that viruses do not share a common ancestor.
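
The abstract summarizes the gap between training and test sets as a relative entropy. As a rough illustration only, the sketch below computes a relative entropy (Kullback-Leibler divergence) between the distributions of a categorical label, such as viral family, in two splits. The choice of viral family as the label, the direction of the divergence (test relative to train), the base-2 logarithm, the additive smoothing, and the function name `relative_entropy` are all assumptions made for this example; the abstract does not specify how the reported values were obtained.

```python
import math
from collections import Counter


def relative_entropy(train_labels, test_labels, pseudocount=1.0):
    """Return D(test || train) in bits over categorical labels.

    Hypothetical helper: additive smoothing (pseudocount) keeps the
    divergence finite when a category is missing from one split.
    """
    categories = sorted(set(train_labels) | set(test_labels))
    train_counts = Counter(train_labels)
    test_counts = Counter(test_labels)
    # Smoothed totals so that each distribution still sums to 1.
    train_total = len(train_labels) + pseudocount * len(categories)
    test_total = len(test_labels) + pseudocount * len(categories)

    divergence = 0.0
    for category in categories:
        p = (test_counts[category] + pseudocount) / test_total
        q = (train_counts[category] + pseudocount) / train_total
        divergence += p * math.log2(p / q)
    return divergence


# Toy splits (made-up labels, not taken from the dataset).
skewed_train = ["Coronaviridae"] * 8 + ["Flaviviridae"] * 2
skewed_test = ["Coronaviridae"] * 2 + ["Flaviviridae"] * 8

skewed_divergence = relative_entropy(skewed_train, skewed_test)
matched_divergence = relative_entropy(skewed_train, skewed_train)
print(f"skewed split:  {skewed_divergence:.3f} bits")
print(f"matched split: {matched_divergence:.3f} bits")
```

In this toy case the split with mismatched family composition yields a larger divergence than the matched one, which is the qualitative pattern the abstract associates with the original versus the randomized assignments.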

Version published to 10.1101/2025.09.17.676952 on bioRxiv
Sep 20, 2025

Related articles

Machine Learning Models in Classifying, Predicting and Managing COVID-19 Severity

This article has 10 authors:
1. Larysa Sydorchuk
2. Maksym Sokolenko
3. Miroslav Škoda
4. Denys Nevinskyi
5. Yaroslav Vyklyuk
6. Ruslan Sydorchuk
7. Alina Sokolenko
8. Ludmila Sokolenko
9. Andrii Sydorchuk
10. Oleksandr Sokolenko
This article has no evaluations
Latest version Jan 27, 2026
Retrieval-Based AI Framework for Viral Genomic Analysis

This article has 3 authors:
1. Ahmed M. Fahmy
2. Melissa Ayad
3. Hassan M. Ahmed
This article has no evaluations
Latest version Jan 29, 2026
