BERT-T6: Towards High-accuracy T6SS Bacterial Toxin Identification Using Protein Language Model
Abstract
Type VI secretion system effectors (T6SEs) target the cell wall, membranes and nucleic acids, killing bacteria or impairing host cell defense mechanisms. Accurate identification of T6SEs would help clarify how these bacteria exert virulence through type VI secretion systems, and would advance the understanding of bacterial pathogenesis more broadly. Although several traditional machine learning-based and deep learning-based tools have been developed to distinguish T6SEs from non-T6SEs, there is still room for improvement. To obtain robust features for model construction, we successively investigated various classic sequence-based features and embeddings from pre-trained transformer-based protein language models. Building upon the model that incorporates ProtBert embeddings, we employed a transfer learning approach to fine-tune the ProtBert protein language model on a downstream T6SE classification task. The resulting BERT-T6 model significantly outperforms the baseline models. More importantly, with an accuracy of 0.959, a sensitivity of 0.909, a specificity of 0.973, a precision of 0.905, an F1-score of 0.907 and a Matthews correlation coefficient (MCC) of 0.881, our model achieves performance competitive with state-of-the-art binary and multi-class predictors. This work highlights the effectiveness of using BERT with transfer learning for T6SE prediction. BERT-T6 provides a robust and precise approach for identifying T6SEs and offers promise for enhancing studies of bacterial virulence mechanisms.
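For readers less familiar with the six evaluation metrics quoted in the abstract, the following is a minimal sketch of how they are conventionally computed from the counts of a binary confusion matrix. The function names `binary_metrics` and `f1_from` and the example counts are illustrative assumptions, not the authors' evaluation code or data.

```python
import math


def f1_from(precision, sensitivity):
    """F1-score as the harmonic mean of precision and sensitivity (recall)."""
    return 2 * precision * sensitivity / (precision + sensitivity)


def binary_metrics(tp, fp, tn, fn):
    """Standard binary-classification metrics from confusion-matrix counts.

    tp/fn: T6SEs predicted as T6SE / non-T6SE.
    tn/fp: non-T6SEs predicted as non-T6SE / T6SE.
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    # MCC is defined as 0 when any marginal total is empty.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1_from(precision, sensitivity),
        "mcc": mcc,
    }


# Hypothetical counts for illustration only; NOT the paper's test set.
example = binary_metrics(tp=90, fp=10, tn=290, fn=10)

# Internal-consistency check on the values reported in the abstract:
# the harmonic mean of precision 0.905 and sensitivity 0.909.
reported_f1 = round(f1_from(0.905, 0.909), 3)
```

The last line recomputes the F1-score from the precision and sensitivity reported above; it agrees with the stated value of 0.907. Because T6SE datasets typically contain far fewer effectors than non-effectors, MCC is a useful complement to accuracy, since it uses all four cells of the confusion matrix.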