SignalGen: A Protein Language Model Based AI Agent For Optimal Signal Peptide Prediction
Abstract
Signal peptides are short amino acid sequences attached to the N-termini of mature proteins. They play a decisive role in both the expression and the localization of mature proteins. While sequence-based machine learning (ML) models have been developed to identify signal peptide sequences given the full or mature protein sequence, no model has yet been created to design optimal signal peptides while taking the localization of the mature protein into account. Here, we develop an ML model that takes the mature protein sequence, organism, and localization as inputs, encodes and processes them through a Latent Residual Transformer (LRT), and outputs optimal signal peptide sequences for enhanced expression of the mature protein, including proteins that are non-native to the organism or designed de novo. The model is trained on data from the UniProt database available up to July 2025. Benchmarking shows that the model performs well in predicting signal peptides for both human and non-human proteins from the UniProt database. Furthermore, the model is implemented with an artificial intelligence (AI) agent to make it more accessible to the general scientific community. Findings from this study provide a framework for predicting optimal signal peptides for non-native expression of viral and bacterial vaccine candidates in human cells and for enhanced expression of de novo proteins.
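
To make the input/output framing concrete, the following is a minimal sketch of how one training example could be assembled from an annotated precursor protein. It is an illustration under stated assumptions, not the authors' pipeline: the names `Example`, `split_precursor`, `make_example`, and `encode_inputs`, the `<org:...>` and `<loc:...>` tag format, and the toy sequence are all hypothetical. The only facts taken from the abstract are that the signal peptide sits at the N-terminus and that the model conditions on mature sequence, organism, and localization.

```python
from dataclasses import dataclass
from typing import List, Tuple

# The 20 standard amino acids (one-letter codes).
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


@dataclass(frozen=True)
class Example:
    """One hypothetical training record: conditioning inputs plus target."""
    organism: str
    localization: str
    mature: str          # model input
    signal_peptide: str  # model target


def split_precursor(precursor: str, sp_length: int) -> Tuple[str, str]:
    """Split a precursor into (signal peptide, mature protein).

    The signal peptide is the first `sp_length` residues at the N-terminus.
    """
    if not 0 < sp_length < len(precursor):
        raise ValueError("sp_length must fall strictly inside the precursor")
    return precursor[:sp_length], precursor[sp_length:]


def make_example(precursor: str, sp_length: int,
                 organism: str, localization: str) -> Example:
    """Validate the sequence and build an Example."""
    precursor = precursor.upper()
    unknown = set(precursor) - AMINO_ACIDS
    if unknown:
        raise ValueError(f"non-standard residues: {sorted(unknown)}")
    sp, mature = split_precursor(precursor, sp_length)
    return Example(organism, localization, mature, sp)


def encode_inputs(ex: Example) -> List[str]:
    """Assumed token layout: organism tag, localization tag, then residues."""
    return [f"<org:{ex.organism}>", f"<loc:{ex.localization}>", *ex.mature]


# Toy precursor (not a real protein): 10-residue signal peptide + 5 residues.
example = make_example("MKKLLAAAQADEFGH", 10, "human", "secreted")
tokens = encode_inputs(example)
```

In this sketch the conditioning tags are simply prepended to the residue tokens; how the actual model encodes organism and localization is not specified in the abstract.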