The GenPPI tool enhanced Protein Interaction Network Generation with Machine Learning-Based Protein Similarity Inference

Abstract

The study of protein interactions remains a promising route to a better understanding of disease. At first glance, the term may suggest interactions between the proteins of a host and those of a pathogen. Such interactions are mainly predicted with machine learning algorithms, because explicit rules for them are difficult to derive. However, interactions among proteins encoded in the same genome are also critical, for instance to understand how a pathogen survives, starts, or maintains an infection. Interactions within a genome can be analyzed deterministically, at the cost of substantial computing resources. The first version of our software, GenPPI, explores interaction networks in genomes mainly through the known rules of conserved neighborhood and conserved phylogenetic profiles. Despite its speed, it underrepresented the core pangenome, because the simplistic algorithm used to build it missed protein pairs sharing less than 90% amino acid identity. The present work describes enhancements to GenPPI in determining homology between protein pairs, which is one of the principal bottlenecks in creating ab initio interaction networks from genomes and the first step in inferring conserved neighborhood and conserved phylogenetic profiles across all genomes under analysis. The improvement relies on the Random Forest algorithm applied to biophysical features: ten amino acid propensity indexes are used to calculate sixty features for each protein in a genome. We built a training data set of homologous and non-homologous proteins from nine complete proteomes of critical bacteria. This large set of representative genomes allowed the classifier to recognize similar proteins with more than 65% amino acid identity, an average result obtained over dozens of validation runs.
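The feature step described above (ten propensity indexes, sixty features per protein, hence six features per index) can be illustrated with a minimal Python sketch. The abstract does not name the ten indexes or the six per-index features, so everything below is an assumption made for illustration: the Kyte-Doolittle hydropathy scale stands in for one index, and the six summary statistics and the function names `index_features` and `protein_vector` are hypothetical, not GenPPI's actual implementation.

```python
from statistics import mean, pstdev

# Kyte-Doolittle hydropathy scale, used here only as a stand-in for one
# amino acid propensity index (assumption: the abstract does not list the ten).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def index_features(sequence, index):
    """Return six summary statistics of one index over a protein sequence.

    The choice of statistics is hypothetical; residues missing from the
    index (for example 'X') are skipped.
    """
    values = [index[aa] for aa in sequence.upper() if aa in index]
    if not values:
        raise ValueError("sequence has no residues covered by the index")
    return {
        "mean": mean(values),
        "std": pstdev(values),
        "min": min(values),
        "max": max(values),
        "range": max(values) - min(values),
        "frac_positive": sum(v > 0 for v in values) / len(values),
    }


def protein_vector(sequence, indexes):
    """Concatenate the six features of every index into one flat vector."""
    vector = []
    for index in indexes:
        vector.extend(index_features(sequence, index).values())
    return vector
```

With ten distinct indexes, `protein_vector` returns sixty numbers per protein, which is the shape of input a Random Forest classifier for homologous versus non-homologous pairs would need; how the features of two proteins are combined into one pair-level example is not stated in the abstract and is left open here.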
This strategy produced more comprehensive and accurate protein interaction networks and can be applied to genomes of different organisms. We tested the new GenPPI on the bacterium Buchnera aphidicola. The resulting network overlapped 62% with the interactions documented in the STRING database, surpassing the previous GenPPI version, which reached less than 50%. With alternative GenPPI parameters we achieved full overlap, albeit at the cost of also predicting interactions absent from STRING. These results underscore the software's potential as a flexible tool for research in biomedicine and other scientific fields, balancing precision, completeness, and a lower density of interaction networks. GenPPI is available at https://genppi.facom.ufu.br/
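The comparison against STRING can be sketched as a set operation on undirected edges. This is a minimal illustration under one assumption: that "overlap" means the fraction of reference interactions also present in the predicted network. The function names and the toy edge lists are hypothetical and do not reproduce the evaluation actually used.

```python
def normalize(edges):
    """Turn (a, b) pairs into a set of undirected edges, dropping self-loops."""
    return {frozenset(pair) for pair in edges if pair[0] != pair[1]}


def overlap_with_reference(predicted, reference):
    """Return (overlap fraction, number of predicted edges absent from reference).

    The overlap fraction is the share of reference edges recovered by the
    prediction (assumed definition).
    """
    pred = normalize(predicted)
    ref = normalize(reference)
    if not ref:
        raise ValueError("reference network has no edges")
    shared = pred & ref
    return len(shared) / len(ref), len(pred - ref)


# Toy example with made-up protein identifiers.
predicted = [("a", "b"), ("b", "c"), ("c", "d")]
reference = [("b", "a"), ("c", "d"), ("d", "e")]
fraction, extra = overlap_with_reference(predicted, reference)
```

Reporting the second value alongside the first makes the trade-off visible: a parameter set can reach full overlap while also adding many edges that the reference database does not contain.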
