Combining amino acid frequency and 1D convolutional neural network embeddings for the identification of protein-protein interactions using a random forest classifier

Nisa A. Sindhi
Nikhil M. Pawar
Jamie D. Dixson
Dana M. García

Abstract

Predicting protein-protein interactions is a fundamental problem in molecular biology. Experimental approaches for identifying these interactions are time-consuming and labor-intensive, motivating the development of efficient computational alternatives, including machine learning-based methods. However, conventional machine learning methods often rely on manually engineered features that require substantial domain expertise. In this study, we propose a two-stage framework to address these limitations. In the first stage, a one-dimensional convolutional neural network (1D CNN) autoencoder automatically learns latent representations from protein sequences. The quality of these features is evaluated through reconstruction error, which reflects how accurately the model reconstructs the original sequence. In the second stage, the learned features are combined with amino acid frequency-based features to form a hybrid feature set for predicting protein-protein interactions. We systematically compared models trained on frequency features alone with models trained on the hybrid representation; incorporating the 1D CNN-derived latent features improved the models’ performance in predicting protein-protein interactions. The dataset was split into training, validation, and test sets. Nested cross-validation was employed, with inner loops for hyperparameter tuning and outer loops for model selection. The random forest classifier achieved the best performance, with a mean area under the receiver operating characteristic curve (ROC-AUC) of 0.91 and a test F1-score of 0.87. These results highlight the effectiveness of integrating deep feature learning with ensemble methods for predicting protein-protein interactions and build upon previous work on this fundamental problem.
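The nested cross-validation described in the abstract can be illustrated with a minimal index-splitting sketch. The fold counts, the seeding scheme, and the function names `k_fold_indices` and `nested_cv_splits` are assumptions made for illustration; the abstract does not state the number of folds or the software the authors used. The point shown is only that inner folds are drawn entirely from the outer training portion, so the outer test fold never influences hyperparameter tuning.

```python
import random


def k_fold_indices(n_items, k, seed=0):
    """Shuffle the indices 0..n_items-1 and deal them into k near-equal folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def nested_cv_splits(n_items, outer_k=5, inner_k=3, seed=0):
    """Yield (outer_train, outer_test, inner_splits) for each outer fold.

    inner_splits is a list of (inner_train, inner_val) pairs built only
    from outer_train, so the outer test fold stays unseen during tuning.
    Fold counts here are illustrative, not taken from the preprint.
    """
    outer_folds = k_fold_indices(n_items, outer_k, seed)
    for i, outer_test in enumerate(outer_folds):
        # Everything outside the current outer fold is available for training.
        outer_train = [idx for j, fold in enumerate(outer_folds)
                       if j != i for idx in fold]
        # Inner folds are positions into outer_train, mapped back to sample ids.
        inner_folds = k_fold_indices(len(outer_train), inner_k, seed + i + 1)
        inner_splits = []
        for m, val_positions in enumerate(inner_folds):
            inner_val = [outer_train[p] for p in val_positions]
            inner_train = [outer_train[p] for n, fold in enumerate(inner_folds)
                           if n != m for p in fold]
            inner_splits.append((inner_train, inner_val))
        yield outer_train, outer_test, inner_splits
```

In use, a classifier's hyperparameters would be chosen by scoring candidates on each `inner_val` set, and the chosen configuration would then be refit on `outer_train` and scored once on `outer_test`.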

Author Summary

Protein-protein interactions are fundamental to all biological processes, yet predicting them remains a key problem in molecular biology. Computational approaches have been tested to address this problem. We applied a mix of machine learning and deep learning to gain insight into the qualities of proteins that engage in interaction. First, we trained a deep learning model that automatically learned features of the primary sequence and its related characteristics, reducing bias in the actual prediction process. We combined these learned features, or latent representations, with amino acid frequency features of protein sequences and called the combination “hybrid features.” We then systematically compared frequency-only features with hybrid features across four different machine learning classifiers. Our results suggest that the random forest classifier performed best of the four at predicting interactions between proteins. We propose that this approach could be used to improve efficiency in testing protein-protein interactions at the bench and may have applications to other biologically relevant molecular interactions.
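As a rough illustration of what a hybrid feature vector might look like, the sketch below computes amino acid frequencies for two sequences and joins them with latent vectors. Several things here are assumptions rather than details from the preprint: the 20-letter standard alphabet, the handling of non-standard residues (ignored), the concatenation order for a protein pair, and the names `amino_acid_frequencies` and `hybrid_pair_features`. The latent vectors are placeholders standing in for the autoencoder's output.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes


def amino_acid_frequencies(sequence):
    """Return a 20-element list of relative residue frequencies.

    Characters outside the standard alphabet (e.g. X, gaps) are ignored,
    which is an assumption of this sketch.
    """
    seq = sequence.upper()
    counts = [seq.count(aa) for aa in AMINO_ACIDS]
    total = sum(counts)
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [c / total for c in counts]


def hybrid_pair_features(seq_a, seq_b, latent_a, latent_b):
    """Concatenate frequency and latent features for a candidate protein pair.

    latent_a and latent_b are placeholders for learned embeddings; simple
    concatenation is one plausible way to combine them, not a confirmed one.
    """
    return (amino_acid_frequencies(seq_a) + list(latent_a)
            + amino_acid_frequencies(seq_b) + list(latent_b))


# Hypothetical example: two short peptides with made-up 3-dimensional latents.
features = hybrid_pair_features("MKTAYIAK", "GAVLIPFW",
                                [0.1, -0.2, 0.3], [0.0, 0.5, -0.1])
```

A frequency-only baseline would use the same function with empty latent vectors, which makes the comparison between the two feature sets straightforward to set up.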

Version published to 10.64898/2026.05.15.725340 on bioRxiv
May 18, 2026

Related articles

Protein Function Prediction with Pretrained ProtT5 Embeddings and Gradient Boosting

This article has 2 authors:
1. Jett Appel
2. Nathan Butcher
This article has no evaluations
Latest version Apr 28, 2026

GL-E2EATP: improving protein-ATP binding residue prediction using global and local embedding of protein language model

This article has 7 authors:
1. Bing Rao
2. Jie Bai
3. Maha A. Thafar
4. Somayah Albaradei
5. Kamran Arshad
6. Apilak Worachartcheewanh
7. Muhammad Arif
This article has no evaluations
Latest version Mar 26, 2026

emb2dis: a novel protein disorder prediction tool based on ResNets, dilated convolutions & protein language models

This article has 8 authors:
1. S.A. Duarte
2. M. Mehdiabadi
3. L.A. Bugnon
4. M.C. Aspromonte
5. D. Piovesan
6. D.H. Milone
7. S.C.E. Tosatto
8. G. Stegmayer
This article has no evaluations
Latest version Apr 1, 2026
