Combining amino acid frequency and 1D convolutional neural network embeddings for the identification of protein-protein interactions using a random forest classifier
Discuss this preprint
Start a discussion What are Sciety discussions?Listed in
This article is not in any list yet, why not save it to one of your lists.Abstract
Predicting protein–protein interactions is a fundamental problem in molecular biology. Experimental approaches for identifying protein-protein interactions are time-consuming and labor-intensive, motivating the development of efficient computational alternatives, including machine learning-based methods. However, conventional machine learning methods often rely on manually engineered features that require substantial domain expertise. In this study, we propose a two-stage framework to address these limitations. In the first stage, a one-dimensional convolutional neural network autoencoder is used to automatically learn latent representations from protein sequences. The quality of these features is evaluated through reconstruction error, reflecting how accurately the model reconstructs the original sequence. In the second stage, these learned features are combined with amino acid frequency-based features to form a hybrid feature set for predicting protein-protein interactions. A systematic comparison is performed between models trained on frequency features alone and those using a hybrid representation. The comparison showed that incorporating one-dimensional convolutional neural network-derived latent features improved the models’ performance of predicting protein-protein interactions. The dataset was split into training, validation, and test sets. Nested cross-validation was employed, with inner loops for hyperparameter tuning and outer loops for model selection. The random forest classifier achieved the best performance, with a mean receiver operating characteristic-area under curve of 0.91 and a test F1-score of 0.87. These results highlight the effectiveness of integrating deep feature learning with ensemble methods for predicting protein-protein interactions and build upon previous work focused on this fundamental problem.
Author Summary
Protein-protein interactions are fundamental in all biological processes. However, predicting these interactions is a key problem in molecular biology. Computational approaches have been tested to address this problem. We applied a mix of machine learning and deep learning to gain insight into the qualities of proteins that engage in interaction. First, we trained a deep learning model, which automatically learned the primary sequence and characters related thereto, reducing bias in the actual prediction process. We combined these features, or latent representations, with amino acid frequency features of protein sequences, and called the two together “hybrid features.” Then we performed a systematic comparison of amino acid frequency features-only with hybrid features, among four different machine learning classifiers. Our results suggest that the random forest classifier performed best among all four classifiers at predicting interactions between proteins. We propose that this approach could be used to improve efficiency in testing protein-protein interactions at the bench and may have applications to other biologically relevant molecular interactions.