PlasGO: enhancing GO-based function prediction for plasmid-encoded proteins based on genetic structure
Abstract
Plasmid, as a mobile genetic element, plays a pivotal role in facilitating the transfer of traits, such as antimicrobial resistance, among the bacterial community. Annotating plasmid-encoded proteins with the widely used Gene Ontology (GO) vocabulary is a fundamental step in various tasks, including plasmid mobility classification. However, GO prediction for plasmid-encoded proteins faces two major challenges: the high diversity of functions and the limited availability of high-quality GO annotations. Thus, we introduce PlasGO, a tool that leverages a hierarchical architecture to predict GO terms for plasmid proteins. PlasGO utilizes a powerful protein language model to learn the local context within protein sentences and a BERT model to capture the global context within plasmid sentences. Additionally, PlasGO allows users to control the precision by incorporating a self-attention confidence weighting mechanism. We rigorously evaluated PlasGO and benchmarked it against six state-of-the-art tools in a series of experiments. The experimental results collectively demonstrate that PlasGO has achieved commendable performance. PlasGO significantly expanded the annotations of the plasmid-encoded protein database by assigning high-confidence GO terms to over 95% of previously unannotated proteins, showcasing impressive precision of 0.8229, 0.7941, and 0.8870 for the three GO categories, respectively, as measured on the novel protein test set.
Article activity feed
-
This work has been peer reviewed in *GigaScience* (see paper), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
Reviewer name: **David Burstein** Version: Revision 1
Review content: The authors thoroughly answered all my questions and addressed all the raised concerns. I have no further comments, and I congratulate them on a well executed study.
-
This work has been peer reviewed in GigaScience (see paper), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
Reviewer name: **Nguyen Quoc Khanh Le** Version: Revision 1
Review content: No further comments to authors.
-
This work has been peer reviewed in *GigaScience* (see paper), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
Reviewer name: **David Burstein**
Review content:
In this paper, the authors introduce "PlasGO," a language model for GO annotation of plasmid proteins. The model takes into account two levels of representation: (1) the amino acid level, producing embeddings of the analyzed proteins based on a foundation protein language model, and (2) the plasmid gene level, where the aa-based embeddings are considered as part of a language model representing each protein in the genetic context in which it is encoded. This approach leverages the modular organization of different functions on plasmid genomes. Benchmarking performed by the authors against other deep-learning GO annotation algorithms demonstrates a considerable improvement of PlasGO over existing methods. The research is timely, well-performed, and clearly explained.
Main issues:
- The authors acknowledge that only a relatively small portion of the proteins in their database have GO term annotations, which may limit the model's ability to learn plasmid patterns effectively. As they correctly point out, an iterative approach could be useful to improve performance. Specifically, high-confidence GO annotations predicted by PlasGO could be used as input for another round of prediction, and this process can be repeated until no new reliable predictions are produced. Given that the authors have all the data and models required to run such an iterative search, I would warmly recommend doing so and reporting if and how the predictions improve.
- The gLM model (Hwang et al.) is highly similar to PlasGO in terms of the general approach, combining protein embedding (ESM2 in gLM) with genomic contextual data. Discussing the differences between the approaches and comparing their performances would provide important context and highlight the novelty of PlasGO.
- The agreement of the PlasGO prediction with the GO terms retrieved from sequence databases ("ground truth") was determined by calculating the ratio of terms shared between the high-confidence predictions and ground truth, divided by the number of high-confidence predictions. This measure is asymmetrical and might generate over-optimistic results. At the extreme, if the algorithm produces a very large number of predictions, this value will tend to be very high just because there are many more GO terms predicted than GO terms in the ground truth. I strongly recommend using a symmetrical measure, such as the Jaccard index.
- The methodology for calculating average precision and recall is potentially skewed. The authors compute average precision over proteins with at least one annotation, ignoring proteins lacking annotation (instead of counting these as misclassifications). This approach makes sense given that numerous plasmid proteins lack GO annotations. However, the average recall is calculated across all proteins (N). For unannotated proteins, the correct classification is not defined. Since these cases are also considered in the measure of recall, I assume PlasGO high-confidence predictions were considered correct. This seems like a problematic assumption that might lead to skewed results. I would therefore suggest that unannotated proteins be omitted from the recall calculation, as was done in the precision calculation.
- The authors identify and filter out "elusive" GO terms that are difficult to predict. This is reasonable in the scope of this paper, but since it is still a central limitation of PlasGO, I would suggest discussing (even if not implementing) approaches to improve the predictions in these challenging cases.
- In Figures 8 and 9, a perfect AUPR of 1 is reported in several cases. Such perfect classification performance is highly unusual. The result should be double-checked and, if it persists, the underlying reasons for it should be discussed.
- The masking approach during training is not entirely clear. If I understand correctly, annotated proteins are being masked during prediction. This is expected to lead to the loss of a lot of contextual information. On the other hand, during training, the unannotated proteins are masked, losing potentially informative sequence data. I would suggest splitting complete plasmids between train/test/validation sets, and if needed, performing cross-validation to cover the entire dataset. This way for each plasmid the entire protein sequence and context information will be used.
- There seems to be somewhat of a contradiction between the two following statements appearing in the paper: (1) "CaLM, despite being a pre-trained PLM, did not surpass the top three tools using ProtTrans, which is consistent with the results reported in CaLM's paper" and (2) "Experimental results demonstrate that the protein representations derived from CaLM outperform other PLMs in the classification of GO terms." Furthermore, other PLMs, such as ESM, performed better at GO annotation prediction according to the CaLM paper. These might have been more appropriate for this task. CodonBERT, a codon-based PLM also based on ProtTrans, could also have been a suitable alternative.
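To make the concern about an asymmetrical agreement measure concrete, the following minimal Python sketch compares two one-sided overlap ratios with the Jaccard index on toy data. The function name `overlap_scores`, the dictionary keys, and the example term sets are illustrative assumptions. They are not part of PlasGO or of the manuscript's evaluation code.

```python
def overlap_scores(predicted, truth):
    """Compare two sets of GO terms with one-sided and symmetric measures.

    Hypothetical helper for illustration only.
    """
    predicted, truth = set(predicted), set(truth)
    shared = predicted & truth
    union = predicted | truth
    return {
        # One-sided: blind to ground-truth terms that were never predicted.
        "shared_over_predicted": len(shared) / len(predicted) if predicted else 0.0,
        # One-sided: blind to predicted terms that are not in the ground truth.
        "shared_over_truth": len(shared) / len(truth) if truth else 0.0,
        # Symmetric: penalizes both missed and surplus terms.
        "jaccard": len(shared) / len(union) if union else 0.0,
    }


# Toy ground truth and two toy prediction sets (identifiers are placeholders).
truth = {"GO:0003677", "GO:0006310"}
under = overlap_scores({"GO:0003677"}, truth)
over = overlap_scores(
    {"GO:0003677", "GO:0006310", "GO:0015074", "GO:0004803"}, truth
)

print(under)
print(over)
```

In this toy case each one-sided ratio reaches 1.0 in one of the two scenarios (too few or too many predicted terms), whereas the Jaccard index is 0.5 in both, because it accounts for missed and surplus terms alike.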
Minor issues:
- To improve the reading flow of the paper, consider reordering the ablation section to precede the "Performance on the RefSeq test set" section, since the ablation studies section provides the rationale for the choices of architecture and foundation protein language model.
- "We initially downloaded all available plasmids from the NCBI RefSeq database" - I would suggest specifying the query or approach used to acquire all plasmids from RefSeq.
- I would recommend using the term "protein embedding" instead of "protein token," which may be misleading. The term "token embeddings" used in Figure 3 is more accurate than "protein token," and yet "protein embeddings" is probably the most accurate term in this case.
- Figure 1: To provide an accurate depiction of representative plasmids, I suggest including unannotated genes in Figure 1.
- Figure 4: "Global average pooling" was misspelled.
- Figure 10: "The prediction precision of PlasGO is determined by calculating the ratio of the number of proteins in set A that are also present in set B to the total number of predicted high-confidence proteins (|A|)". If I understand the figure correctly, it should be "number of GO terms" instead of "number of proteins" in both cases.
- A figure (or supplementary figure) depicting one of the plasmids with some of the high-confidence predictions in the case study section (along the same lines as Figure 1 but with a distinction between previously known and unknown annotations) could enhance the clarity and impact of the results.
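The main-issue point about averaging recall over all proteins can be illustrated with a small Python sketch. It assumes, purely for illustration, that an unannotated protein would be credited with a fixed recall value when it is kept in the denominator. The function `averaged_recall`, the parameter `unannotated_recall`, and the toy proteins and term labels are hypothetical and do not come from the manuscript.

```python
def averaged_recall(predictions, truth, unannotated_recall=None):
    """Mean per-protein recall over a set of proteins.

    unannotated_recall=None skips proteins without ground-truth terms.
    A number assigns that recall to each such protein (an assumed
    convention, used here only to show its effect on the average).
    """
    recalls = []
    for protein, true_terms in truth.items():
        predicted = predictions.get(protein, set())
        if true_terms:
            recalls.append(len(predicted & true_terms) / len(true_terms))
        elif unannotated_recall is not None:
            recalls.append(unannotated_recall)
    return sum(recalls) / len(recalls) if recalls else 0.0


# Toy data: p3 and p4 have no ground-truth annotation (labels are placeholders).
truth = {"p1": {"GO:A", "GO:B"}, "p2": {"GO:C"}, "p3": set(), "p4": set()}
predictions = {"p1": {"GO:A"}, "p2": {"GO:D"}, "p3": {"GO:E"}, "p4": {"GO:F"}}

excluded = averaged_recall(predictions, truth)
included = averaged_recall(predictions, truth, unannotated_recall=1.0)

print(excluded, included)
```

With these toy values the average recall is 0.25 when unannotated proteins are omitted and 0.625 when they are counted as fully correct, which shows how strongly the choice of denominator can move the reported number.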
-
This work has been peer reviewed in GigaScience (see paper), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
Reviewer name: **Nguyen Quoc Khanh Le**
Review content:
- The manuscript introduces PlasGO, which leverages a hierarchical architecture for GO term prediction in plasmid-encoded proteins. However, the novelty of the approach could be questioned. While the combination of protein language models and BERT for GO prediction is innovative, similar methods have been applied in other contexts.
- The study heavily relies on data from the RefSeq database, yet there is limited discussion on the quality and completeness of this data. The manuscript should address potential biases introduced by incomplete or incorrect GO annotations in the database. Moreover, the study uses protein sequences of up to 1K length, which might exclude relevant longer sequences, potentially limiting the model's applicability to all plasmid-encoded proteins.
- The manuscript claims that PlasGO can generalize well to novel proteins, but this claim is based on a specific dataset. The model's generalizability should be tested on more diverse and independent datasets, including plasmids from different bacterial species or environmental contexts.
- While the model's performance is quantitatively evaluated, the interpretability of the results remains unclear. The study should include an analysis of how well the model's predictions align with known biological functions and pathways. Additionally, it would be helpful to include examples where PlasGO provides novel insights that other models do not, thereby demonstrating its practical utility.
- The manuscript does not provide detailed information on the computational resources required to train and run PlasGO. Given the complexity of the model, there are potential concerns about its scalability, particularly for larger plasmid datasets or in settings with limited computational power.
- The manuscript compares PlasGO with several state-of-the-art tools, but the comparison might not be fully exhaustive. Additionally, statistical significance tests for performance differences should be provided to support the comparative analysis.
- Language models have been used in previous bioinformatics studies, e.g., PMID: 37381841, PMID: 38636332. Therefore, the authors are encouraged to cite more such works in this description to attract a broader readership.
- The study should discuss any ethical considerations related to the use of public datasets, particularly regarding data privacy and consent if any sensitive data is involved. Furthermore, the potential commercial implications of the PlasGO tool, especially if it is used for proprietary research, should be addressed.
- While the manuscript mentions that PlasGO's code will be made available, it is crucial to ensure that all aspects of the research are fully reproducible.
- The hierarchical architecture and the use of extensive training data might lead to overfitting, especially given the high dimensionality of the input features. The manuscript should discuss the measures taken to prevent overfitting, such as regularization techniques, dropout, or cross-validation strategies.
- The study could benefit from a more detailed discussion on the practical implications of using PlasGO in real-world plasmid research. How can this tool be integrated into existing workflows for plasmid function prediction? What are the potential limitations in practical applications?
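One way the requested significance testing of performance differences could be approached is a paired sign-flip permutation test on matched scores of two tools. The Python sketch below is a generic illustration under stated assumptions: the function name `paired_sign_flip_test`, the choice of this particular test, and the toy scores are not taken from the manuscript, and exhaustive enumeration is only practical for a small number of paired observations.

```python
from itertools import product


def paired_sign_flip_test(scores_a, scores_b):
    """Two-sided exact sign-flip permutation test on paired scores.

    Under the null hypothesis that both tools perform equally, the sign of
    each paired difference is arbitrary, so all 2**n sign patterns are
    enumerated. Returns the fraction of patterns whose absolute summed
    difference is at least as large as the observed one.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("score lists must have the same length")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    extreme = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        statistic = abs(sum(s * d for s, d in zip(signs, diffs)))
        # Small tolerance guards against floating-point round-off.
        if statistic >= observed - 1e-12:
            extreme += 1
    return extreme / total


# Toy per-category scores for two hypothetical tools on the same test items.
tool_a = [0.9, 0.8, 0.85, 0.7, 0.95, 0.6]
tool_b = [0.8, 0.7, 0.75, 0.6, 0.85, 0.5]

p_value = paired_sign_flip_test(tool_a, tool_b)
print(p_value)
```

Because the test is paired, it only uses the differences between the two tools on the same items, which matches the setting where all tools are benchmarked on a shared test set.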