upsAI: A high-accuracy machine learning classifier for predicting Plasmodium falciparum var gene upstream groups
Abstract
Plasmodium falciparum erythrocyte membrane protein 1 (PfEMP1), encoded by the hypervariable var gene family, is central to malaria pathogenesis, influencing both disease severity and immune evasion. Classifying var genes into upstream groups (upsA, upsB, upsC, upsE) is important for understanding parasite biology and clinical outcomes, but remains challenging, especially with partial sequences, such as the DBLα tag or RNA-Seq assemblies.
We developed upsAI, a machine learning-based classifier trained on 2,530 curated var genes, to accurately assign upstream groups using sequence features from different partial gene regions. We compared seven methods, including support vector machines, random forests, XGBoost and HMMER models. The best upsAI model achieves an overall accuracy of 83% for DBLα-tag sequences and 92% for full-length var genes, significantly outperforming existing tools. Further, we propose a new model to distinguish between internal and subtelomeric var genes with high accuracy and scalability.
upsAI is available at https://github.com/sii-scRNA-Seq/upsAI, providing a robust and efficient resource for large-scale var gene analysis. It can classify var genes from 20 genomes in under one second.
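To make the idea of assigning an upstream group from sequence features concrete, the following is a minimal sketch and not the upsAI implementation. The abstract does not specify which features or model upsAI uses, so everything here is an assumption for illustration: 3-mer frequencies stand in for the sequence features, a simple nearest-centroid rule stands in for the compared classifiers (support vector machines, random forests, XGBoost, HMMER), and the training sequences are synthetic toy strings, not real var gene data.

```python
from collections import Counter
from itertools import product
import math

K = 3  # k-mer length; an illustrative assumption, not taken from upsAI
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]


def kmer_profile(seq, k=K):
    """Return normalized k-mer frequencies (one value per possible k-mer)."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # k-mers containing non-ACGT characters (e.g. N) are ignored.
    total = sum(counts[km] for km in KMERS) or 1
    return [counts[km] / total for km in KMERS]


def train_centroids(labelled):
    """Average the k-mer profiles of each group: (sequence, group) pairs in,
    {group: mean profile} out."""
    sums, sizes = {}, Counter()
    for seq, group in labelled:
        acc = sums.setdefault(group, [0.0] * len(KMERS))
        for i, value in enumerate(kmer_profile(seq)):
            acc[i] += value
        sizes[group] += 1
    return {g: [v / sizes[g] for v in acc] for g, acc in sums.items()}


def predict(seq, centroids):
    """Assign the group whose centroid is closest in Euclidean distance."""
    profile = kmer_profile(seq)
    return min(centroids, key=lambda g: math.dist(profile, centroids[g]))


# Synthetic toy sequences; the labels are placeholders, not real annotations.
toy_training = [
    ("ATATATATATATAT", "upsA"),
    ("ATATTATATATAAT", "upsA"),
    ("GCGCGCGCGCGC", "upsB"),
    ("GGCGCGCCGCGC", "upsB"),
]
centroids = train_centroids(toy_training)
print(predict("ATATATTATA", centroids))  # upsA
print(predict("GCGCCGCG", centroids))    # upsB
```

For real analyses, the released tool at the repository above is the appropriate choice; the sketch only shows the general shape of a feature-extraction step followed by a classifier.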