Challenges in predicting protein-protein interactions of understudied viruses: Arenavirus-Human interactions
Abstract
Understanding protein-protein interactions (PPIs) between viral and human proteins is crucial for uncovering infection mechanisms and identifying potential therapeutic targets. Generalizing PPI prediction models to understudied viruses remains a significant challenge. In this work, we use arenavirus-human PPIs to illustrate the difficulties of model generalization, which are compounded by a lack of both positive and negative data. We use transfer learning to investigate arenavirus-human PPIs, starting from models trained on better-studied virus-human and human-human interactions. We also curate four types of negative-sample datasets and assess their impact on model performance. Although the overall accuracies (93-99%) and AUPRC scores (0.8-0.9) appear promising, further analysis shows that these metrics can be misleading because of data leakage, data bias, and overfitting, especially for under-represented viral proteins. We expose these gaps and assess the impact of data imbalance using standard k-fold cross-validation and independent blind testing with a balanced dataset, in which accuracy drops below 50%. We propose a viral protein-specific evaluation framework that groups viral proteins into majority and minority classes according to their representation in the dataset and compares model performance across these groups using balanced accuracy. This framework offers a more robust evaluation of model generalizability, addressing biases inherent in standard evaluation techniques and paving the way for more reliable PPI prediction models for understudied viruses.
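The viral protein-specific evaluation described above could be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. The abstract does not state how the majority/minority cutoff is chosen, so the `min_count` threshold, the function names, the viral protein labels (`viral_A`, `viral_B`), and the toy labels and predictions are all assumptions made up for this example.

```python
from collections import Counter


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    recalls = []
    for cls in sorted(set(y_true)):
        idx = [i for i, y in enumerate(y_true) if y == cls]
        correct = sum(1 for i in idx if y_pred[i] == cls)
        recalls.append(correct / len(idx))
    return sum(recalls) / len(recalls)


def evaluate_by_representation(viral_ids, y_true, y_pred, min_count):
    """Score majority and minority viral proteins separately.

    min_count is a hypothetical cutoff: a viral protein with at least this
    many examples is treated as 'majority', otherwise as 'minority'.
    """
    counts = Counter(viral_ids)
    groups = {"majority": [], "minority": []}
    for i, vid in enumerate(viral_ids):
        key = "majority" if counts[vid] >= min_count else "minority"
        groups[key].append(i)

    scores = {}
    for name, idx in groups.items():
        if idx:  # skip a group that has no examples
            scores[name] = balanced_accuracy(
                [y_true[i] for i in idx], [y_pred[i] for i in idx]
            )
    return scores


# Toy data (invented): one well-represented viral protein, one rare one.
viral_ids = ["viral_A"] * 6 + ["viral_B"] * 2
y_true = [1, 1, 1, 0, 0, 0, 1, 0]  # 1 = interacting pair, 0 = non-interacting
y_pred = [1, 1, 1, 0, 0, 0, 0, 0]  # hypothetical model predictions

overall_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
scores = evaluate_by_representation(viral_ids, y_true, y_pred, min_count=3)

print(overall_accuracy)  # 0.875
print(scores)            # {'majority': 1.0, 'minority': 0.5}
```

In this toy case the overall accuracy of 0.875 hides the fact that the model is no better than chance on the rare viral protein, which is the kind of gap a per-group balanced accuracy is meant to surface.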