Investigating cross-organism prediction of prokaryotic essential proteins using unsupervised language model and ensemble strategy

Abstract

Background: Cross-organism prediction of essential proteins significantly improves the discovery of new drugs and the ability to re-engineer microorganisms. Although current machine learning and deep learning methods predict essential proteins well, most do not address whether they apply to cross-organism prediction. This applicability matters because a useful predictor must perform well on organisms other than those on which it was trained.

Results: In this study, we develop 'DeepPEP', an approach based on a large language model and bidirectional long short-term memory that reliably transfers essential protein annotations between distantly related organisms. We curate 66 prokaryotic datasets and use them to systematically investigate DeepPEP's capability in cross-organism prediction. First, we explore pair-wise prediction, in which a model trained on data from one organism is used to predict essential proteins in another. Our findings reveal a correlation between prediction performance and evolutionary distance, which prompted us to consider training on species more closely related to the target. However, subsequent investigation indicates that integrating training sets from multiple organisms yields better performance. We then devise a scenario that closely resembles real-world cross-organism applications. In this scenario, DeepPEP performs comparably to the state-of-the-art tool Geptop 2.0 and is more sensitive in predicting species-specific essential proteins. Finally, we present a case study that illustrates DeepPEP's efficacy in predicting essential proteins in novel genomes.

Conclusions: Our proposed model represents a valuable strategy for cross-organism prediction of prokaryotic essential proteins. Moreover, the scenario we establish, which closely resembles real-world applications, can serve as a benchmark for evaluating the performance of future models.
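
The two evaluation setups mentioned in the abstract, pair-wise prediction and training on several organisms at once, can be illustrated with a minimal sketch of how the train/test splits might be enumerated. This is not the authors' code: the function names and the organism labels are hypothetical, the abstract does not state how the multi-organism training sets were actually composed, and the leave-one-organism-out scheme below is only one plausible reading.

```python
from itertools import permutations
from typing import Iterator, List, Tuple


def pairwise_splits(organisms: List[str]) -> Iterator[Tuple[str, str]]:
    """Yield every ordered (train_organism, test_organism) pair."""
    # Ordered pairs: training on A and testing on B differs from the reverse.
    return permutations(organisms, 2)


def pooled_splits(organisms: List[str]) -> Iterator[Tuple[List[str], str]]:
    """Yield (train_organisms, test_organism), holding out one organism at a time.

    Assumption: the pooled training set is every organism except the target.
    """
    for held_out in organisms:
        train = [name for name in organisms if name != held_out]
        yield train, held_out


# Hypothetical organism labels, used only for illustration.
organisms = ["org_A", "org_B", "org_C"]
pairs = list(pairwise_splits(organisms))
pooled = list(pooled_splits(organisms))

for train_org, test_org in pairs:
    print(f"pair-wise: train on {train_org}, test on {test_org}")
for train_orgs, test_org in pooled:
    print(f"pooled: train on {train_orgs}, test on {test_org}")
```

In either setup, the held-out organism's proteins never appear in training, which is what makes the evaluation cross-organism rather than within-organism.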
