An Improved Dataset for Predicting Mammal Infecting Viruses from Genetic Sequence Information
Abstract
There have been several attempts to develop machine learning (ML) models to identify human-infecting viruses from their genomic sequences, with varying degrees of success. Direct comparison between models is problematic because they are typically trained and evaluated on different datasets with different baselines, outcomes, and features. In this paper we present a standardized dataset of mammal-infecting and non-infecting viral pathogens, refined from the previous work of Mollentze et al. to include the latest literature evidence and new host target labels: primate and mammal. On this dataset, we report the performance of eight machine learning models for predicting mammal-infecting viruses from their genomic sequences. We find that randomly assigning cases in our improved dataset to training/testing sets, compared with the original training/testing assignments in Mollentze et al., increases the overall average ROC AUC for predicting human infection from 0.663 ± 0.070 to 0.784 ± 0.013, while the broadest host category, mammal infection, can be predicted most reliably, at 0.850 ± 0.020. We share our improved dataset and code to enable standardized comparisons of machine learning methods for predicting human host infections.
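
As an illustration of the evaluation described above, the following minimal sketch computes ROC AUC on repeated random training/testing splits and summarizes it as a mean and spread. It is not the authors' code. The function names, the `score_fn` model hook, the number of repeats, the test fraction, and the reading of "±" as a sample standard deviation are all assumptions made for this example.

```python
import random
import statistics


def roc_auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def random_split(n_items, test_fraction, rng):
    """Shuffle item indices and return (train_idx, test_idx)."""
    idx = list(range(n_items))
    rng.shuffle(idx)
    n_test = max(1, int(round(n_items * test_fraction)))
    return idx[n_test:], idx[:n_test]


def repeated_auc(labels, score_fn, n_repeats=5, test_fraction=0.2, seed=0):
    """Mean and sample standard deviation of ROC AUC over random splits.

    score_fn(train_idx, test_idx) is a hypothetical hook: it should fit a
    model on the training items and return one score per test item.
    """
    rng = random.Random(seed)
    aucs = []
    for _ in range(n_repeats):
        train_idx, test_idx = random_split(len(labels), test_fraction, rng)
        test_labels = [labels[i] for i in test_idx]
        if len(set(test_labels)) < 2:
            continue  # AUC is undefined when the test set has a single class
        aucs.append(roc_auc(test_labels, score_fn(train_idx, test_idx)))
    if not aucs:
        raise ValueError("no split contained both classes")
    spread = statistics.stdev(aucs) if len(aucs) > 1 else 0.0
    return statistics.mean(aucs), spread
```

Calling `repeated_auc(labels, score_fn)` with binary host labels (1 for infecting, 0 for non-infecting) returns a `(mean, spread)` pair in the same form as the values reported above. Any classifier can be plugged in through `score_fn`.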