Evolutionary conservation of sequence motifs at sites of protein modification


  1. Note: This rebuttal was posted by the corresponding author to Review Commons. Content has not been altered except for formatting.


    Reply to the reviewers

    General Statements:

    We were very pleased with and appreciative of the reviewers’ comments and constructive suggestions for improving the manuscript. In response to their suggestions, we have added new text to better emphasize the importance of the question, the novelty of our approach, the significance of the results, and the potential for future discovery.

    To summarize our key findings, we have identified 3,500 instances where, despite their shared ancestry, only one of two paralogous proteins undergoes a specific post-translational modification. By comparing adjoining sequences across 1012 isolates of the same yeast species, we determined that sequence conservation near sites of modification is greater than at sites that are not modified. We postulate that these differences in sequence are partly responsible for the differences in post-translational modifications, and that differences in modification allow duplicated proteins to be differentially regulated. These differences may account for their retention after 100M years of evolution.

    Our analysis is clearly distinct from earlier investigations. In particular, we use new and substantially larger proteomics datasets reporting multiple types of post-translational modifications, new tools to analyze protein structure (AlphaFold), as well as new and expanded protein interactome datasets. Perhaps most importantly, we rely entirely on *in-species* sequence conservation data, with particular emphasis on duplicated proteins. Finally, we developed a custom algorithm (CoSMoS.c.) and web site that quantifies sequence conservation, in an automated fashion, across all 1012 unique strain isolates.
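
    As a rough illustration of what quantifying conservation near a site across isolates can look like, the sketch below averages per-column agreement over a window of pre-aligned isolate sequences. It is not the CoSMoS.c. implementation: the function names, the `flank` parameter, the majority-residue scoring rule, and the toy sequences are all assumptions made for illustration.

```python
from collections import Counter


def column_conservation(column):
    """Fraction of isolates that carry the most common residue in one column."""
    counts = Counter(column)
    return counts.most_common(1)[0][1] / len(column)


def window_conservation(isolate_seqs, site, flank=3):
    """Mean column conservation over positions site-flank..site+flank (0-based).

    Assumes the isolate sequences are already aligned to the same length;
    the window is clipped at the sequence ends.
    """
    length = len(isolate_seqs[0])
    if any(len(seq) != length for seq in isolate_seqs):
        raise ValueError("sequences must be aligned to the same length")
    start = max(0, site - flank)
    end = min(length, site + flank + 1)
    scores = [
        column_conservation([seq[i] for seq in isolate_seqs])
        for i in range(start, end)
    ]
    return sum(scores) / len(scores)


# Toy alignments (hypothetical): four isolates of a modified target
# and four isolates of its unmodified paralog.
target_isolates = ["MKRSPTLA", "MKRSPTLA", "MKRSPTLA", "MKRSPTLA"]
paralog_isolates = ["MKQSATLA", "MKRSPSLA", "MKQSGTLA", "MKRSATLA"]
site = 3  # 0-based index of the serine shared by both proteins

target_score = window_conservation(target_isolates, site, flank=2)
paralog_score = window_conservation(paralog_isolates, site, flank=2)
print(target_score, paralog_score)
```

    In this toy case the window around the modified site is invariant across isolates, while the same window in the paralog varies, which is the kind of difference the analysis looks for.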

    We propose that in-species comparisons of paralogs will prove to be more reliable than cross-species comparisons of orthologous proteins and/or in-species comparisons of non-homologous proteins. Comparison of paralogs is powerful because they are likely to have similar structures and functions, due to their shared evolutionary origin. Comparison within a single species is powerful because it avoids non-biological sources of uncertainty, such as potential alignment errors and any accompanying structural differences. Thus, by comparing unique modifications in closely-related gene products and across closely-related strain isolates, investigators using CoSMoS.c. will be better able to predict new enzyme-substrate relationships, identify new motifs for post-translational modifications, and prioritize mechanistic investigations of those modifications.

    All of the reviewers asked that we explain the motivation for the design choice, compare our design with those used in earlier studies, add new controls for the effects of protein abundance, and provide examples of how our novel approach may be useful to investigators who study post-translational modifications. We are pleased to report that we were able to address all of these issues with revised text, additional references, two new control experiments, and real-world examples of individual paralog-paralog comparisons that have been useful in the past.

    Finally, we have changed the title to: *Differential modification of protein paralogs reveals conserved sequence determinants of post-translational modification*

    And we have changed the running title to: In-species evolution of protein modification sites

    Reply to the Reviewers:

    Reviewer #1 (Evidence, reproducibility and clarity (Required)):

    Summary: This paper reports bioinformatics analysis of population variation in PTM sites in paralogs from the yeast whole-genome-duplication. If I understand it correctly, the main finding is that modified sites show less population variation than paralogous unmodified sites. The results are largely in line with what is expected based on previous studies, though the authors do not present their results in that context.

    Major comments:

    1. The study benefits from two clever design choices:

    First, comparison of sites between paralogs is a very powerful test for an evolutionary hypothesis because paralogous sites are expected to have relatively similar structural context. Second, use of within species polymorphism data is much less susceptible to alignment errors that can be an issue for longer evolutionary comparisons.

    However, these design choices are not discussed or motivated by the authors. Nor are they compared to the designs of previous studies. Examples of previous studies (PMID: 22588506, PMID: 21273632, PMID: 20594336, PMID: 24465218, PMID: 22889910, PMID: 20368267, PMID: 28054638).

    We were very pleased with and appreciative of the reviewer’s comments and constructive suggestions for improving the manuscript. We have added nearly all references suggested by the reviewer, as well as new text describing the central findings of these papers, as follows:

    “Most importantly, and in contrast with previous studies, we restricted our analysis to modified and unmodified pairs of paralogous proteins. This represents a very powerful test for the hypothesis because paralogs have a shared evolutionary history and are expected to have similar secondary structures. Moreover, the use of within-species polymorphism data is much less susceptible to the alignment errors that often occur with longer evolutionary comparisons.”

    and

    “Our analysis is clearly distinct from - and complementary to - earlier investigations of post-translational modifications in yeasts. … Our analysis builds on these foundational studies, by considering new and substantially larger proteomics datasets, multiple additional types of post-translational modifications, new and sophisticated models of protein structure, large-scale kinase interactome data, and *in-species* sequence conservation data – with particular emphasis on duplicated proteins.

    We propose that in-species comparisons of paralogs will prove to be more reliable than cross-species comparisons of orthologous proteins, or in-species comparisons of non-homologous proteins. Comparison of paralogs is powerful because they are likely to have similar structures and functions, due to their shared evolutionary origin (56, 58, 68). Comparison within a single species is powerful because it allows us to avoid important non-biological sources of uncertainty, such as potential alignment errors and unknown structural or functional differences.”

    2. One essential control that needs to be added is how much of the effect the authors observe can be explained by protein abundance. In yeast, protein abundance is strongly negatively correlated with evolutionary rate, and is strongly positively correlated with identification of PTMs in MS and other assays (extensively discussed in some of the previous studies I listed above). The authors need to assess whether their findings are due to the slow evolution of highly expressed proteins, and the detection bias for these proteins in PTM identification experiments. As far as I could tell this was not discussed by the authors.

    This point was also raised by Reviewer #2. We have added additional text stating that detection of PTMs by mass spectrometry is correlated with protein abundance. In addition, and as suggested by the reviewers, we have now done a control experiment using cross-study conservation of PTMs and limiting our comparison to proteins of similar abundance. By both methods, and as detailed below, we were able to confirm our original findings:

    “We then reanalyzed our data to account for possible effects of protein abundance, which in cross species comparisons was observed to negatively correlate with evolutionary rate and positively correlate with modification detection by mass spectrometry (39). Accordingly, we restricted our in-species analysis to a subset of 270 paralog pairs that have similar ( 100 instances each of phosphorylation, ubiquitylation and succinylation, where the target and paralog have the same amino acid, but only the target is modified. Even with this restricted dataset, we obtained similar results for all three types of analysis (Dataset S9). We also considered the potential effect of false positives and false negatives among the reported modification sites. False positives can result from ambiguous assignments, as might arise through misidentification of modified sites within peptides that contain multiple potential sites of modification. False negatives can result from difficulties in detecting modifications in poorly expressed proteins (39), or an overly strict reliance on high confidence sites. We then further restricted the data to only include modifications identified in multiple studies. After applying this additional filter, we were left with > 100 instances of phosphorylation. Once again, we obtained similar results for Symmetric Average Score and One-sided Average Score analysis, but not for Chemical Similarity Average Score, which is further restricted by splitting the data into five chemical categories (Dataset S10).”
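
    The two restrictions described in this passage (paralog pairs of similar abundance, then modifications reported in multiple studies) can be sketched as a simple filter. This is a minimal illustration, not the analysis code: the `SitePair` record, the `max_fold` and `min_studies` parameters, and every value in the toy data are hypothetical, and the abundance threshold actually used is not recoverable from the text above.

```python
from dataclasses import dataclass


@dataclass
class SitePair:
    """One modified target site and its unmodified paralogous site (hypothetical record)."""
    target: str
    paralog: str
    target_abundance: float   # e.g. molecules per cell
    paralog_abundance: float
    n_studies: int            # number of studies reporting the modification


def similar_abundance(pair, max_fold):
    """True if the two abundances differ by at most max_fold (an assumed cutoff)."""
    high = max(pair.target_abundance, pair.paralog_abundance)
    low = min(pair.target_abundance, pair.paralog_abundance)
    return low > 0 and high / low <= max_fold


def filter_pairs(pairs, max_fold, min_studies=1):
    """Keep pairs with similar abundance whose modification has enough study support."""
    return [
        p for p in pairs
        if similar_abundance(p, max_fold) and p.n_studies >= min_studies
    ]


# Toy data with placeholder protein names and made-up numbers.
pairs = [
    SitePair("A1", "A2", 1000.0, 800.0, 3),
    SitePair("B1", "B2", 5000.0, 200.0, 2),
    SitePair("C1", "C2", 300.0, 350.0, 1),
]

abundance_matched = filter_pairs(pairs, max_fold=2.0)
multi_study = filter_pairs(pairs, max_fold=2.0, min_studies=2)
print([p.target for p in abundance_matched])
print([p.target for p in multi_study])
```

    Applying the study-count filter on top of the abundance filter shrinks the dataset further, which matches the passage's remark that fewer instances remain after the additional restriction.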

    3. A major weakness of the paper is its lack of focus. It includes a rambling historical introduction and discussion that omits discussion of the relevant recent research directly related to the questions at hand. For example, the paper describes historical work on phosphorylase, but gives not a single example of a paralog pair with a polymorphic PTM site identified in their study. The authors introduce gene duplication in a very general way, even though several papers have focused specifically on evolution of protein regulation in paralogs (e.g., PMID: 20080574, PMID: 27003913, PMID: 25474245). The paper of Nguyen Ba et al. 2014 (PMID: 25474245) seems especially relevant, as in addition to performing a genome-wide analysis, their abstract reads "We examine changes in constraints on known regulatory sequences and show that for the Rck1/Rck2, Fkh1/Fkh2, Ace2/Swi5 paralogs, they are associated with previously characterized differences in posttranslational regulation." It seems that the results of that study could be directly compared to the analysis performed here.

    This point was also raised by Reviewer 2. At the suggestion of the reviewers, we have moved or removed discussion of these foundational studies of PTM mapping and added discussion of well-characterized examples of paralog pairs with polymorphic PTM sites, based on the references provided, as follows:

    “We propose that in-species comparisons of paralogs will prove to be more reliable than cross-species comparisons of orthologous proteins, or in-species comparisons of non-homologous proteins. Comparison of paralogs is powerful because they are likely to have similar structures and functions, due to their shared evolutionary origin (56, 58, 68). Comparison within a single species is powerful because it allows us to avoid important non-biological sources of uncertainty, such as potential alignment errors and unknown structural or functional differences. This is supported by a small number of prior studies, which compared four sets of paralogous proteins in yeast - Rck1 v. Rck2, Fkh1 v. Fkh2, Ace2 v. Swi5 (68), and Boi1 v. Boi2 (58), and concluded that divergence in short linear motifs is likely responsible for differences in phosphorylation. While paralogs are far less common in other organisms, a similar conclusion emerged from a comparison of predicted sites of phosphorylation in mammalian p53, p63 and p73 (69).

    Our analysis of differentially-modified pairs of paralogous proteins revealed that the most common modifications – phosphorylation, ubiquitylation and acylation but not N-glycosylation – occur within regions of high sequence conservation. Further studies will benefit from the availability of our search algorithm, CoSMoS.c. For example, when studying a particular protein kinase, CoSMoS.c. can be used to identify specific motifs near potentially modified serines, threonines and tyrosines (Table 2). When studying a particular substrate of ubiquitylation, CoSMoS.c. can be used to prioritize conserved versus non-conserved sequences flanking potentially modified lysines. For rare modifications, CoSMoS.c. can also be used to locate highly conserved regions as the starting points for finding new sequence motifs. Thus, by comparing unique modifications in closely-related gene products and across closely-related strain isolates, we can prioritize mechanistic investigations of modifications that are likely to have functional importance, identify recognition motifs for specific modifying enzymes, and better predict new enzyme-substrate relationships.”

    Reviewer #1 (Significance (Required)):

    The significance is hard to assess because the research is not given proper context and motivation.

    I believe the study could be of interest to researchers studying cell signalling and its evolution, as well as those interested in gene family diversification. However, as written, no specific examples are given or clear hypotheses tested, making the paper seem largely descriptive.

    My keywords: molecular evolution, signalling, intrinsically disordered regions, computational biology

    Reviewer #2 (Evidence, reproducibility and clarity (Required)):

    Summary

    The authors of this work study how S. cerevisiae paralogue pairs are differentially modified with respect to five major PTM classes: phosphorylation, ubiquitination, mono-acetylation, N-glycosylation, and succinylation. Emphasis is placed on paralogue pairs where a modification is found in only one of the two paralogues at homologous positions. A conservation analysis is then performed across 1011 S. cerevisiae isolates to check for differences in conservation between the modified target and its unmodified paralogue. The authors claim that, for most of the PTM classes, modified targets tend to be more conserved than their unmodified paralogues. Phosphorylation sites between paralogue pairs were also compared using AlphaFold2 and a database of kinase interactions (YeastKID), revealing differential interactions between paralogues but no significant structural differences.

    We were very pleased with and appreciative of the reviewer’s comments and constructive suggestions for improving the manuscript.

    Major:

    1. A major issue with this work is that the problem of 'false negatives' for PTM detection is never adequately addressed or controlled for. As the authors allude to in the manuscript, the number of PTM sites detected is likely far below the number that exists and this is especially a problem for the less well characterised PTM classes. How then can the authors be confident that an 'unmodified' site is truly unmodified and not just undetected? The authors can refer to Freschi et al., 2011 (MSB) for a method that controls for the false negative (FN) PTM detection rate by comparing cross-study conservation with cross-study reproducibility.
    2. The second point follows closely from the first. The issue is that MS-based PTM detection is generally biased towards abundant proteins, and protein abundance also correlates strongly with evolutionary rate, with more abundant proteins tending to have higher conservation. Taken together, these two relationships could explain the observation that the modified paralogue tends to be more conserved than the 'unmodified' paralogue. The authors should try and control for the effect of protein abundance on the results observed; for example, by checking if the results/conclusions change when restricting the analysis to paralogue pairs with similar abundances.
    3. Alongside false negatives, there is the cognate issue of false positives and mislocalised PTM sites (see Lanz et al., 2021, EMBO Reports). If possible, the authors should check to see if their conclusions change when restricting the analysis to high-confidence PTM sites identified from multiple sources and/or validated by low throughput experimental assays.

    This point was also raised by Reviewer #1. To address the concern, we have now done a new control analysis, one that uses only those modifications identified in multiple studies and comparing only proteins of similar (“We then reanalyzed our data to account for possible effects of protein abundance, which in cross species comparisons was observed to negatively correlate with evolutionary rate and positively correlate with modification detection by mass spectrometry (39). Accordingly, we restricted our in-species analysis to a subset of 270 paralog pairs that have similar ( 100 instances each of phosphorylation, ubiquitylation and succinylation, where the target and paralog have the same amino acid, but only the target is modified. Even with this restricted dataset, we obtained similar results for all three types of analysis (Dataset S9). We also considered the potential effect of false positives and false negatives among the reported modification sites. False positives can result from ambiguous assignments, as might arise through misidentification of modified sites within peptides that contain multiple potential sites of modification. False negatives can result from difficulties in detecting modifications in poorly expressed proteins (39), or an overly strict reliance on high confidence sites. We then further restricted the data to only include modifications identified in multiple studies. After applying this additional filter, we were left with > 100 instances of phosphorylation. Once again, we obtained similar results for Symmetric Average Score and One-sided Average Score analysis, but not for Chemical Similarity Average Score, which is further restricted by splitting the data into five chemical categories (Dataset S10).”

    4) The authors define conservation here using 1011 wild and domesticated yeast isolates within one species (S. cerevisiae). While this is clearly valuable information, this reviewer wonders why orthologues from closely related species were not also leveraged to assess the evolutionary rate, as is traditionally done for studies on PTM evolution? Is there a strong rationale for this? Using more distantly-related genomes could give more statistical power for the detection of weak differences in selective constraint between paralogues.

    We believe that a major strength of our study is the reliance on in-species sequence conservation data – with particular emphasis on duplicated proteins. To better emphasize this point, we have added new text as follows:

    “Most importantly, and in contrast with previous studies, we restricted our analysis to modified and unmodified pairs of paralogous proteins. This represents a very powerful test for the hypothesis because paralogs have a shared evolutionary history and are expected to have similar secondary structures. Moreover, the use of in-species polymorphism data is much less susceptible to the alignment errors that often occur with longer evolutionary comparisons.”

    and

    “We propose that in-species comparisons of paralogs will prove to be more reliable than cross-species comparisons of orthologous proteins, or in-species comparisons of non-homologous proteins. Comparison of paralogs is powerful because they are likely to have similar structures and functions, due to their shared evolutionary origin (56, 58, 68). Comparison within a single species is powerful because it allows us to avoid important non-biological sources of uncertainty, such as potential alignment errors and unknown structural or functional differences.”

    Minor:

    1. Both the Introduction and Discussion describe PTMs and the evolution of gene duplication in very general terms. However, literature concerning the evolution of PTMs and specifically the evolution of PTMs following gene duplication has been largely ignored. These studies give the most relevant context to this work and should be described and cited. Freschi et al., 2011 (Molecular Systems Biology) and Ba et al., 2014 (PloS Computational Biology) are particularly relevant.

    We have added references suggested by the reviewer, as well as new text describing the central findings of these papers, as follows:

    “Our analysis is clearly distinct from - and complementary to - earlier investigations of post-translational modifications in yeasts. Previous analysis showed that duplicated proteins in Saccharomyces cerevisiae are more likely to be phosphorylated, and to have a greater number of phosphorylation sites, than non-duplicated proteins (58). The difference persisted when controlling for differences in protein abundance, coverage, essentiality, positioning within protein interaction networks and assembly into multi-protein complexes (58). When compared with a yeast species that diverged before the whole genome duplication event, it appears that the majority of phosphorylation sites in paralogs have either been lost or gained, with a strong bias toward losses (56). Subsequent cross-species comparisons noted a high degree of sequence conservation near sites of phosphorylation and other types of modification in yeasts (49, 59-65). The relationship was strongest for phosphosites with known function (49, 50, 61). A focused study of 249 unique high-confidence phosphorylation sites, targeted by 7 protein kinases in S. cerevisiae, confirmed that regions flanking sites of phosphorylation are significantly constrained, in comparison with other closely related yeast species (61). A similar relationship exists for sites phosphorylated by the cyclin-dependent protein kinase Cdk1 (66), and was the basis for predicting novel sites of phosphorylation by the cAMP-dependent protein kinase (67). Our analysis builds on these foundational studies, by considering new and substantially larger proteomics datasets, multiple additional types of post-translational modifications, new and sophisticated models of protein structure, large-scale kinase interactome data, and *in-species* sequence conservation data – with particular emphasis on duplicated proteins.

    We propose that in-species comparisons of paralogs will prove to be more reliable than cross-species comparisons of orthologous proteins, or in-species comparisons of non-homologous proteins. Comparison of paralogs is powerful because they are likely to have similar structures and functions, due to their shared evolutionary origin (56, 58, 68). Comparison within a single species is powerful because it allows us to avoid important non-biological sources of uncertainty, such as potential alignment errors and unknown structural or functional differences. This is supported by a small number of prior studies, which compared four sets of paralogous proteins in yeast - Rck1 v. Rck2, Fkh1 v. Fkh2, Ace2 v. Swi5 (68), and Boi1 v. Boi2 (58), and concluded that divergence in short linear motifs is likely responsible for differences in phosphorylation. While paralogs are far less common in other organisms, a similar conclusion emerged from a comparison of predicted sites of phosphorylation in mammalian p53, p63 and p73 (69).”

    2) While I enjoyed to a limited extent the historical perspective on PTM discovery, there is far too much text given to this overall and the writing should be made more concise by removing excessive detail. This is especially the case for the Results section, where the emphasis should be on the analysis performed by the authors.

    This point was also raised by Reviewer 1. At the suggestion of the reviewers, we have moved or removed discussion of these foundational studies of PTM mapping and added discussion of well-characterized examples of paralog pairs with polymorphic PTM sites, based on the references provided, as detailed above.

    3) Description of the methodology should be reviewed for language and clarity. In particular, the authors should explain explicitly the meaning of new terms such as 'pairing structure' and how this may confer an 'advantage / disadvantage' to target proteins -- wording that this reviewer found especially confusing and unnecessary. The authors should also be explicit about how the distributions for each test are constructed; the current wording sometimes gives the impression that a distribution is derived from a single target or paralogue instead of being derived from a set of modified targets and the corresponding set of unmodified paralogues. Another confusion is that the Distribution Mean Test is contrasted with the Paralog Pairing Test in Fig S8 and yet on page 15 the Distribution Mean Test is described as a 'paired' test, even though from the description the test seems unpaired?

    We are now more explicit about how the distributions for each test are constructed, and we have clarified the meaning of the terms 'pairing structure', 'advantage / disadvantage' and ‘Distribution Mean Test’, as follows:

    “We then performed two statistical tests: the Distribution Mean Test, which determines whether the mean of the distribution of target protein conservation scores (that is, the mean conservation score for all modified target proteins) is significantly larger than that of the unmodified paralogs, and the Paralog Pairing Test, which tests whether the pairing structure confers an advantage for the target proteins. Figure 2 presents two possible pairing structures (panels A and C) and how these can advantage (panels A and B) or disadvantage (panels C and D) target proteins...”

    “In this instance we applied a one-sided, paired Mann-Whitney-Wilcoxon Test (100), which determines whether the target protein conservation score distribution is significantly larger than the unmodified paralog conservation score distribution, without assuming that they follow a normal distribution. We used the paired test because the comparison is between the means of paired observations that have a relationship between the two groups (modified target and unmodified paralogs). Hereafter we refer to this as Distribution Mean Test.”
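
    A one-sided paired rank test of the kind described here can be sketched in plain Python. The sketch assumes that the paired test corresponds to the Wilcoxon signed-rank test, which is an interpretation rather than something the quoted text confirms. It uses a normal approximation with no tie or continuity correction, and the function name and the toy scores are hypothetical.

```python
import math


def signed_rank_one_sided(target_scores, paralog_scores):
    """One-sided Wilcoxon signed-rank test (normal approximation).

    Tests whether target scores tend to exceed their paired paralog scores.
    Returns (W_plus, p_value). Zero differences are dropped; tied absolute
    differences share their average rank. No tie or continuity correction
    is applied to the variance, so p-values are approximate.
    """
    diffs = [t - p for t, p in zip(target_scores, paralog_scores) if t != p]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")

    # Rank the absolute differences, averaging ranks within ties.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / sd
    p_value = 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail probability
    return w_plus, p_value


# Toy conservation scores (percent identity across isolates, made up):
# each position pairs one modified target with its unmodified paralog.
target = [99, 97, 98, 95, 96, 99, 94, 97]
paralog = [90, 92, 99, 85, 91, 93, 90, 88]

w_plus, p_value = signed_rank_one_sided(target, paralog)
print(w_plus, round(p_value, 4))
```

    Because each difference is taken within a pair, the test uses the pairing between a target and its own paralog rather than comparing two pooled distributions, which is the distinction the reviewer asked to have clarified.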

    4) Following on from point 2) in the 'major' section above, the authors could consider normalising the conservation scores within a protein to control for the effect of protein abundance and other potential confounders acting at the protein level.

    We have added additional text stating that detection of PTMs by mass spectrometry is correlated with protein abundance. In addition, and as suggested by the reviewers, we have now done a control experiment using cross-study conservation of PTMs and limiting our comparison to proteins of similar abundance. By both methods, and as detailed below, we were able to confirm our original findings:

    “We then reanalyzed our data to account for possible effects of protein abundance, which in cross species comparisons was observed to negatively correlate with evolutionary rate and positively correlate with modification detection by mass spectrometry (39). Accordingly, we restricted our in-species analysis to a subset of 270 paralog pairs that have similar ( 100 instances each of phosphorylation, ubiquitylation and succinylation, where the target and paralog have the same amino acid, but only the target is modified. Even with this restricted dataset, we obtained similar results for all three types of analysis (Dataset S9). We also considered the potential effect of false positives and false negatives among the reported modification sites. False positives can result from ambiguous assignments, as might arise through misidentification of modified sites within peptides that contain multiple potential sites of modification. False negatives can result from difficulties in detecting modifications in poorly expressed proteins (39), or an overly strict reliance on high confidence sites. We then further restricted the data to only include modifications identified in multiple studies. After applying this additional filter, we were left with > 100 instances of phosphorylation. Once again, we obtained similar results for Symmetric Average Score and One-sided Average Score analysis, but not for Chemical Similarity Average Score, which is further restricted by splitting the data into five chemical categories (Dataset S10).”

    5) For the analysis of motifs, departure from the BLOSUM62 expectation may just reflect the fact that many of these PTMs fall in disordered regions - which have distinct amino acid propensities -- whereas matrices like BLOSUM62 were constructed mostly from ordered protein regions.

    We have modified the Materials and Methods section to reflect this alternative, as follows:

    “If the observed changes differ substantially from expectation (BLOSUM62), this suggests the presence of selection pressure and functional importance. This might also arise from distinct amino acid propensities when comparing ordered protein regions, from which the BLOSUM62 matrices were constructed, and disordered regions, where most modifications are likely to occur. This is unlikely to impact our results, as we are comparing structurally similar paralogous proteins. In addition, we are using multiple score algorithms to support our conclusions.”

    6) The analysis of sequence motifs could be extended by scoring phosphosites with yeast position weight matrices (PWMs) for protein kinases and comparing the results between modified targets and their unmodified paralogues. This can help distinguish true positive and false negative modification differences. See Freschi et al., 2011 (Molecular Systems Biology).

    We have performed this analysis according to the reviewer’s suggestion and added new text to the Results, as follows:

    “Finally, in an initial effort to match sites of phosphorylation with protein kinases, we used the position-weight matrices (PWMs) developed by Mok et al. (56, 57). That analysis determined phosphorylation site selectivity for 61 of the 122 kinases in Saccharomyces cerevisiae and proposed empirically-derived PWMs that enable the assignment of candidate protein kinases to known sites of phosphorylation (56, 57). We applied the PWMs to our dataset, which contains sites where one of the two proteins is known to be phosphorylated and the amino acid residue is the same in both. From this dataset, we kept 190 paralogous pairs where each protein contains at least one such phosphorylation site, so that both proteins would have kinase interactions to be compared. Using the PWMs from (57), we assigned the kinase that most likely corresponds to each phosphorylation site, as implemented in (56). Out of the 190 paralogous pairs, 130 interacted with different kinases. Together, these results indicate that most kinases regulate one or the other of the protein paralogs. They suggest further that differential modifications reported here may be the result of differential interactions with modifying enzymes.”
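
    The PWM step described above (score each phosphosite window against every kinase matrix, keep the best-scoring kinase, then compare the assignments within each paralog pair) can be sketched as follows. The two toy matrices, the kinase labels, the five-residue window, and the uniform background frequency are all hypothetical. They are not the matrices of Mok et al., and residues not listed in a matrix column are simply scored as background.

```python
import math

BACKGROUND = 0.05  # assumed uniform background frequency (1/20)


def pwm_score(window, pwm):
    """Sum of log2-odds over the window; unlisted residues count as background."""
    if len(window) != len(pwm):
        raise ValueError("window and PWM must have the same length")
    return sum(
        math.log2(column.get(residue, BACKGROUND) / BACKGROUND)
        for residue, column in zip(window, pwm)
    )


def best_kinase(window, pwms):
    """Name of the kinase whose PWM scores the window highest (first wins on ties)."""
    return max(pwms, key=lambda name: pwm_score(window, pwms[name]))


def count_differing(pairs, pwms):
    """Number of paralog pairs whose sites are assigned to different kinases."""
    return sum(best_kinase(a, pwms) != best_kinase(b, pwms) for a, b in pairs)


# Toy PWMs over positions -3, -2, -1, 0, +1 around the phosphoacceptor.
toy_pwms = {
    "kinase_A": [{"R": 0.6}, {"R": 0.6}, {}, {"S": 0.5, "T": 0.5}, {}],
    "kinase_B": [{}, {}, {}, {"S": 0.5, "T": 0.5}, {"P": 0.8}],
}

# Each tuple holds the site window of one protein and of its paralog.
site_pairs = [("RRASV", "GLASP"), ("RRLSL", "RRMTI")]

n_different = count_differing(site_pairs, toy_pwms)
print(best_kinase("RRASV", toy_pwms), best_kinase("GLASP", toy_pwms), n_different)
```

    In the toy data one of the two pairs is assigned to different kinases, the same kind of tally as the 130 of 190 pairs reported in the passage.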

    Reviewer #2 (Significance (Required)):

    This work is potentially of specialist interest to researchers studying the evolution of PTMs. While the evolution of phosphorylation following gene duplication has been studied previously (Freschi et al 2011, MSB), this work considers other PTM classes and takes advantage of a much larger data set. Potentially, clear examples of paralogue PTM divergence could be used as a basis for follow-up experiments. However, the web-server as it is now is designed to facilitate the easy analysis of a single protein at a time and not comparisons across paralogue pairs.

    We have added new text to better emphasize the importance of the question, the novelty of our approach, the significance of the results, and the potential for future discovery, as follows:

    “Post-translational modifications are critical functional elements within proteins, and are therefore expected to be conserved in evolution. Here, we have identified several thousand instances where, despite a shared ancestry, only one of two paralogous proteins undergoes a specific post-translational modification. We also developed a custom algorithm that quantifies sequence conservation, in an automated fashion, across 1012 unique strain isolates. By comparing adjoining sequences in multiple isolates of the same species, we determined that sequence conservation near sites of modification is greater than at sites that are not modified. In addition, many of the modifications were associated with characteristic sequence elements nearby. We postulate that these differences in sequence conservation are partly responsible for differences in post-translational modifications, that differences in post-translational modifications allow duplicated proteins to be differentially regulated, and that these differences may account for their retention after 100M years of evolution.

    Our analysis is clearly distinct from - and complementary to - earlier investigations of post-translational modifications in yeasts. Previous analysis showed that duplicated proteins in Saccharomyces cerevisiae are more likely to be phosphorylated, and to have a greater number of phosphorylation sites, than non-duplicated proteins (58). The difference persisted when controlling for differences in protein abundance, coverage, essentiality, positioning within protein interaction networks and assembly into multi-protein complexes (58). When compared with a yeast species that diverged before the whole genome duplication event, it appears that the majority of phosphorylation sites in paralogs have either been lost or gained, with a strong bias toward losses (56). Subsequent cross-species comparisons noted a high degree of sequence conservation near sites of phosphorylation and other types of modification in yeasts (49, 59-65). The relationship was strongest for phosphosites with known function (49, 50, 61). A focused study of 249 unique high-confidence phosphorylation sites, targeted by 7 protein kinases in S. cerevisiae, confirmed that regions flanking sites of phosphorylation are significantly constrained, in comparison with other closely related yeast species (61). A similar relationship exists for sites phosphorylated by the cyclin-dependent protein kinase Cdk1 (66), and was the basis for predicting novel sites of phosphorylation by the cAMP-dependent protein kinase (67). Our analysis builds on these foundational studies, by considering new and substantially larger proteomics datasets, multiple additional types of post-translational modifications, new and sophisticated models of protein structure, large-scale kinase interactome data, and *in-species* sequence conservation data – with particular emphasis on duplicated proteins.

    We propose that in-species comparisons of paralogs will prove to be more reliable than cross-species comparisons of orthologous proteins, or in-species comparisons of non-homologous proteins. Comparison of paralogs is powerful because they are likely to have similar structures and functions, due to their shared evolutionary origin (56, 58, 68). Comparison within a single species is powerful because it allows us to avoid important non-biological sources of uncertainty, such as potential alignment errors and unknown structural or functional differences. This is supported by a small number of prior studies, which compared four sets of paralogous proteins in yeast - Rck1 v. Rck2, Fkh1 v. Fkh2, Ace2 v. Swi5 (68), and Boi1 v. Boi2 (58), and concluded that divergence in short linear motifs is likely responsible for differences in phosphorylation. While paralogs are far less common in other organisms, a similar conclusion emerged from a comparison of predicted sites of phosphorylation in mammalian p53, p63 and p73 (69).

    Our analysis of differentially-modified pairs of paralogous proteins revealed that the most common modifications – phosphorylation, ubiquitylation and acylation but not N-glycosylation – occur within regions of high sequence conservation. Further studies will benefit from the availability of our search algorithm CoSMoS.c. For example, when studying a particular protein kinase, CoSMoS.c. can be used to identify specific motifs near potentially modified serines, threonines and tyrosines (Table 2). When studying a particular substrate of ubiquitylation, CoSMoS.c. can be used to prioritize conserved versus non-conserved sequences flanking potentially modified lysines. For rare modifications, CoSMoS.c. can also be used to locate highly conserved regions as the starting points for finding new sequence motifs. Thus, by comparing unique modifications in closely-related gene products and across closely-related strain isolates, we can prioritize mechanistic investigations of modifications that are likely to have functional importance, identify recognition motifs for specific modifying enzymes, and better predict new enzyme-substrate relationships.”

    __In addition, and in response to the reviewer’s suggestion, we are currently expanding the web site to facilitate comparisons across paralogue pairs.__

    Currently, the major problems stated above 1) correction for the problem of false negatives, and 2) correction for the confounding effects of protein abundance need to be addressed before the results can be fully interpreted and evaluated.

    __As detailed above under Points 1-3, we have now done a control experiment using cross-study conservation of PTMs and limiting our comparison to proteins of similar abundance. By both methods, we were able to confirm our original findings.__

    Reviewer field of expertise: phosphosite evolution, PTM evolution, protein evolution.

    Reviewer #3 (Evidence, reproducibility and clarity (Required)):

    This manuscript describes the evolutionary conservation of yeast post-translationally modified residues and sequence motifs surrounding them.

    Reviewer #3 (Significance (Required)):

    This is not new: Beltrao (Cell 2012), Minguez (MSB 2012) and Hendriksen (2012) all show that sites of acetylation, phosphorylation and other modifications are more conserved in yeast than would be expected. Beltrao and Minguez also provide webservers (http://ptmfunc.com/ and http://ptmcode.embl.de) where conserved modified sites are linked to protein structures and protein-protein interactions.

    The novelty of this study is in studying the duplicated proteins after whole genome duplication as well as providing an online interactive server where the conservation can be retrieved in detail and different scoring functions are provided. In addition, the conservation is calculated in closely related species rather than long evolutionary distances as previous studies have done.

    I am missing a concrete example of how a researcher would use the resource that the authors introduce here, and how it is an advance to previously proposed methods. For example, are there sites found conserved in this set of more closely related organisms, that are not conserved in yeast versus metazoa? Is the more fine-grained methodology useful to detect motif sequences that can otherwise not be detected? Can the authors provide proof that indeed the conserved sites are more functional than non-conserved?

    At the moment the manuscript describes very few results and only a possible advance over previous methods; no proof is given that an actual advance is made.

    The authors should compare their work to previous work in this field.

    __We were very pleased and appreciative of the reviewer’s comments, and constructive suggestions for improving the manuscript. We have added new text to better emphasize the importance of the question, the novelty of our approach, the significance of the results, and the potential for future discovery, as follows:__

    “Post-translational modifications are critical functional elements within proteins, and are therefore expected to be conserved in evolution. Here, we have identified several thousand instances where, despite a shared ancestry, only one of two paralogous proteins undergoes a specific post-translational modification. We also developed a custom algorithm that quantifies sequence conservation, in an automated fashion, across 1012 unique strain isolates. By comparing adjoining sequences in multiple isolates of the same species, we determined that sequence conservation near sites of modification is greater than at sites that are not modified. In addition, many of the modifications were associated with characteristic sequence elements nearby. We postulate that these differences in sequence conservation are partly responsible for differences in post-translational modifications, that differences in post-translational modifications allow duplicated proteins to be differentially regulated, and that these differences may account for their retention after 100M years of evolution.

    Our analysis is clearly distinct from - and complementary to - earlier investigations of post-translational modifications in yeasts. Previous analysis showed that duplicated proteins in Saccharomyces cerevisiae are more likely to be phosphorylated, and to have a greater number of phosphorylation sites, than non-duplicated proteins (58). The difference persisted when controlling for differences in protein abundance, coverage, essentiality, positioning within protein interaction networks and assembly into multi-protein complexes (58). When compared with a yeast species that diverged before the whole genome duplication event, it appears that the majority of phosphorylation sites in paralogs have either been lost or gained, with a strong bias toward losses (56). Subsequent cross-species comparisons noted a high degree of sequence conservation near sites of phosphorylation and other types of modification in yeasts (49, 59-65). The relationship was strongest for phosphosites with known function (49, 50, 61). A focused study of 249 unique high-confidence phosphorylation sites, targeted by 7 protein kinases in S. cerevisiae, confirmed that regions flanking sites of phosphorylation are significantly constrained, in comparison with other closely related yeast species (61). A similar relationship exists for sites phosphorylated by the cyclin-dependent protein kinase Cdk1 (66), and was the basis for predicting novel sites of phosphorylation by the cAMP-dependent protein kinase (67). Our analysis builds on these foundational studies, by considering new and substantially larger proteomics datasets, multiple additional types of post-translational modifications, new and sophisticated models of protein structure, large-scale kinase interactome data, and *in-species* sequence conservation data – with particular emphasis on duplicated proteins.

    We propose that in-species comparisons of paralogs will prove to be more reliable than cross-species comparisons of orthologous proteins, or in-species comparisons of non-homologous proteins. Comparison of paralogs is powerful because they are likely to have similar structures and functions, due to their shared evolutionary origin (56, 58, 68). Comparison within a single species is powerful because it allows us to avoid important non-biological sources of uncertainty, such as potential alignment errors and unknown structural or functional differences. This is supported by a small number of prior studies, which compared four sets of paralogous proteins in yeast - Rck1 v. Rck2, Fkh1 v. Fkh2, Ace2 v. Swi5 (68), and Boi1 v. Boi2 (58), and concluded that divergence in short linear motifs is likely responsible for differences in phosphorylation. While paralogs are far less common in other organisms, a similar conclusion emerged from a comparison of predicted sites of phosphorylation in mammalian p53, p63 and p73 (69).

    Our analysis of differentially-modified pairs of paralogous proteins revealed that the most common modifications – phosphorylation, ubiquitylation and acylation but not N-glycosylation – occur within regions of high sequence conservation. Further studies will benefit from the availability of our search algorithm CoSMoS.c. For example, when studying a particular protein kinase, CoSMoS.c. can be used to identify specific motifs near potentially modified serines, threonines and tyrosines (Table 2). When studying a particular substrate of ubiquitylation, CoSMoS.c. can be used to prioritize conserved versus non-conserved sequences flanking potentially modified lysines. For rare modifications, CoSMoS.c. can also be used to locate highly conserved regions as the starting points for finding new sequence motifs. Thus, by comparing unique modifications in closely-related gene products and across closely-related strain isolates, we can prioritize mechanistic investigations of modifications that are likely to have functional importance, identify recognition motifs for specific modifying enzymes, and better predict new enzyme-substrate relationships.”

  2. Note: This preprint has been reviewed by subject experts for Review Commons. Content has not been altered except for formatting.

    Learn more at Review Commons


    Referee #3

    Evidence, reproducibility and clarity

    This manuscript describes the evolutionary conservation of yeast post-translationally modified residues and sequence motifs surrounding them.

    Significance

    This is not new: Beltrao (Cell 2012), Minguez (MSB 2012) and Hendriksen (2012) all show that sites of acetylation, phosphorylation and other modifications are more conserved in yeast than would be expected. Beltrao and Minguez also provide webservers (http://ptmfunc.com/ and http://ptmcode.embl.de) where conserved modified sites are linked to protein structures and protein-protein interactions.

    The novelty of this study is in studying the duplicated proteins after whole genome duplication as well as providing an online interactive server where the conservation can be retrieved in detail and different scoring functions are provided. In addition, the conservation is calculated in closely related species rather than long evolutionary distances as previous studies have done.

    I am missing a concrete example of how a researcher would use the resource that the authors introduce here, and how it is an advance to previously proposed methods. For example, are there sites found conserved in this set of more closely related organisms, that are not conserved in yeast versus metazoa? Is the more fine-grained methodology useful to detect motif sequences that can otherwise not be detected? Can the authors provide proof that indeed the conserved sites are more functional than non-conserved?

    At the moment the manuscript describes very few results and only a possible advance over previous methods; no proof is given that an actual advance is made.

    The authors should compare their work to previous work in this field.

  3. Note: This preprint has been reviewed by subject experts for Review Commons. Content has not been altered except for formatting.

    Learn more at Review Commons


    Referee #2

    Evidence, reproducibility and clarity

    Summary

    The authors of this work study how S. cerevisiae paralogue pairs are differentially modified with respect to five major PTM classes: phosphorylation, ubiquitination, mono-acetylation, N-glycosylation, and succinylation. Emphasis is placed on paralogue pairs where a modification is found in only one of the two paralogues at homologous positions. A conservation analysis is then performed across 1011 S. cerevisiae isolates to check for differences in conservation between the modified target and its unmodified paralogue. The authors claim that, for most of the PTM classes, modified targets tend to be more conserved than their unmodified paralogues. Phosphorylation sites between paralogue pairs were also compared using AlphaFold2 and a database of kinase interactions (YeastKID), revealing differential interactions between paralogues but no significant structural differences.

    Major:

    1. A major issue with this work is that the problem of 'false negatives' for PTM detection is never adequately addressed or controlled for. As the authors allude to in the manuscript, the number of PTM sites detected is likely far below the number that exists and this is especially a problem for the less well characterised PTM classes. How then can the authors be confident that an 'unmodified' site is truly unmodified and not just undetected? The authors can refer to Freschi et al., 2011 (MSB) for a method that controls for the false negative (FN) PTM detection rate by comparing cross-study conservation with cross-study reproducibility.
    2. The second point follows closely from the first. The issue is that MS-based PTM detection is generally biased towards abundant proteins, and protein abundance also correlates strongly with evolutionary rate, with more abundant proteins tending to have higher conservation. Taken together, these two relationships could explain the observation that the modified paralogue tends to be more conserved than the 'unmodified' paralogue. The authors should try and control for the effect of protein abundance on the results observed; for example, by checking if the results/conclusions change when restricting the analysis to paralogue pairs with similar abundances.
    3. Alongside false negatives, there is the cognate issue of false positives and mislocalised PTM sites (see Lanz et al., 2021, EMBO Reports). If possible, the authors should check to see if their conclusions change when restricting the analysis to high-confidence PTM sites identified from multiple sources and/or validated by low throughput experimental assays.
    4. The authors define conservation here using 1011 wild and domesticated yeast isolates within one species (S. cerevisiae). While this is clearly valuable information, this reviewer wonders why orthologues from closely related species were not also leveraged to assess the evolutionary rate, as is traditionally done for studies on PTM evolution? Is there a strong rationale for this? Using more distantly-related genomes could give more statistical power for the detection of weak differences in selective constraint between paralogues.
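
    The abundance control suggested in point 2 could be sketched roughly as follows: keep only paralogue pairs whose abundances lie within a chosen fold of each other, then re-check how often the modified paralogue is the more conserved one. The field names, the two-fold cutoff, and the toy numbers are assumptions for illustration only and are not taken from the manuscript.

```python
from dataclasses import dataclass


@dataclass
class ParalogPair:
    name: str
    abundance_modified: float       # abundance of the modified paralogue (hypothetical units)
    abundance_unmodified: float     # abundance of the unmodified paralogue
    conservation_modified: float    # conservation score around the modified site
    conservation_unmodified: float  # score at the homologous unmodified site


def similar_abundance(pair, max_fold=2.0):
    """True if the two abundances differ by at most `max_fold`."""
    high = max(pair.abundance_modified, pair.abundance_unmodified)
    low = min(pair.abundance_modified, pair.abundance_unmodified)
    return low > 0 and high / low <= max_fold


def fraction_modified_more_conserved(pairs):
    """Fraction of pairs in which the modified paralogue scores higher."""
    if not pairs:
        raise ValueError("no pairs to evaluate")
    wins = sum(p.conservation_modified > p.conservation_unmodified for p in pairs)
    return wins / len(pairs)


# Toy data, invented purely to exercise the functions.
PAIRS = [
    ParalogPair("A", 1000.0, 900.0, 0.95, 0.80),
    ParalogPair("B", 5000.0, 200.0, 0.99, 0.70),  # 25-fold abundance gap
    ParalogPair("C", 300.0, 450.0, 0.85, 0.90),
    ParalogPair("D", 120.0, 100.0, 0.92, 0.88),
]

matched = [p for p in PAIRS if similar_abundance(p)]
overall = fraction_modified_more_conserved(PAIRS)
restricted = fraction_modified_more_conserved(matched)
```

    If the trend seen in `overall` persists in `restricted`, abundance alone is a less likely explanation for it; a real analysis would also need a statistical test and a justified cutoff.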

    Minor:

    1. Both the Introduction and Discussion describe PTMs and the evolution of gene duplication in very general terms. However, literature concerning the evolution of PTMs and specifically the evolution of PTMs following gene duplication has been largely ignored. These studies give the most relevant context to this work and should be described and cited. Freschi et al., 2011 (Molecular Systems Biology) and Ba et al., 2014 (PloS Computational Biology) are particularly relevant.
    2. While I enjoyed to a limited extent the historical perspective on PTM discovery, there is far too much text given to this overall and the writing should be made more concise by removing excessive detail. This is especially the case for the Results section, where the emphasis should be on the analysis performed by the authors.
    3. Description of the methodology should be reviewed for language and clarity. In particular, the authors should explain explicitly the meaning of new terms such as 'pairing structure' and how this may confer an 'advantage / disadvantage' to target proteins -- wording that this reviewer found especially confusing and unnecessary. The authors should also be explicit about how the distributions for each test are constructed; the current wording sometimes gives the impression that a distribution is derived from a single target or paralogue instead of being derived from a set of modified targets and the corresponding set of unmodified paralogues. Another confusion is that the Distribution Mean Test is contrasted with the Paralog Pairing Test in Fig S8, and yet on page 15 the Distribution Mean Test is described as a 'paired' test, even though from the description the test seems unpaired.
    4. Following on from point 2) in the 'major' section above, the authors could consider normalising the conservation scores within a protein to control for the effect of protein abundance and other potential confounders acting at the protein level.
    5. For the analysis of motifs, departure from the BLOSUM62 expectation may just reflect the fact that many of these PTMs fall in disordered regions - which have distinct amino acid propensities -- whereas matrices like BLOSUM62 were constructed mostly from ordered protein regions.
    6. The analysis of sequence motifs could be extended by scoring phosphosites with yeast position weight matrices (PWMs) for protein kinases and comparing the results between modified targets and their unmodified paralogues. This can help distinguish true positive and false negative modification differences. See Freschi et al., 2011 (Molecular Systems Biology).
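
    Minor point 4 suggests normalising conservation scores within each protein. One simple possibility, given here as an assumption about what such a normalisation might look like rather than as the reviewer's or the authors' method, is a per-protein z-score: each position's score is expressed relative to the mean and standard deviation of all positions in the same protein.

```python
from statistics import mean, pstdev


def zscores_within_protein(scores):
    """Convert per-position conservation scores to z-scores within one protein."""
    mu = mean(scores)
    sigma = pstdev(scores)  # population standard deviation over all positions
    if sigma == 0:
        # A perfectly uniform protein carries no site-specific signal.
        return [0.0 for _ in scores]
    return [(s - mu) / sigma for s in scores]


def site_zscore(scores, site_index):
    """Z-score of one site relative to the rest of its protein."""
    return zscores_within_protein(scores)[site_index]


# Toy profiles: the same local peak on a high and on a low background.
well_conserved_protein = [0.9, 0.9, 1.0, 0.9, 0.8]
poorly_conserved_protein = [0.4, 0.4, 0.5, 0.4, 0.3]

z_high = site_zscore(well_conserved_protein, 2)
z_low = site_zscore(poorly_conserved_protein, 2)
```

    Here the raw scores at the site differ (1.0 versus 0.5), but the z-scores are essentially equal, so a protein-wide difference in conservation, such as one driven by abundance, no longer dominates the site-level comparison.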

    Significance

    This work is potentially of specialist interest to researchers studying the evolution of PTMs. While the evolution of phosphorylation following gene duplication has been studied previously (Freschi et al 2011, MSB), this work considers other PTM classes and takes advantage of a much larger data set. Potentially, clear examples of paralogue PTM divergence could be used as a basis for follow-up experiments. However, the web-server as it is now is designed to facilitate the easy analysis of a single protein at a time and not comparisons across paralogue pairs.

    Currently, the major problems stated above 1) correction for the problem of false negatives, and 2) correction for the confounding effects of protein abundance need to be addressed before the results can be fully interpreted and evaluated.

    Reviewer field of expertise: phosphosite evolution, PTM evolution, protein evolution.

  4. Note: This preprint has been reviewed by subject experts for Review Commons. Content has not been altered except for formatting.

    Learn more at Review Commons


    Referee #1

    Evidence, reproducibility and clarity

    Summary:

    This paper reports bioinformatics analysis of population variation in PTM sites in paralogs from the yeast whole-genome-duplication. If I understand it correctly, the main finding is that modified sites show less population variation than paralogous unmodified sites. The results are largely in line with what is expected based on previous studies, though the authors do not present their results in that context.

    Major comments:

    1. The study benefits from two clever design choices:

    First, comparison of sites between paralogs is a very powerful test for an evolutionary hypothesis because paralogous sites are expected to have relatively similar structural context. Second, use of within species polymorphism data is much less susceptible to alignment errors that can be an issue for longer evolutionary comparisons.

    However, these design choices are not discussed or motivated by the authors. Nor are they compared to the designs of previous studies. Examples of previous studies: PMID: 22588506, PMID: 21273632, PMID: 20594336, PMID: 24465218, PMID: 22889910, PMID: 20368267, PMID: 28054638.

    2. One essential control that needs to be added is how much of the effect the authors observe can be explained by protein abundance. In yeast, protein abundance is strongly negatively correlated with evolutionary rate, and is strongly positively correlated with identification of PTMs in MS and other assays (extensively discussed in some of the previous studies I listed above). The authors need to assess whether their findings are due to the slow evolution of highly expressed proteins, and the detection bias for these proteins in PTM identification experiments. As far as I could tell this was not discussed by the authors.
    3. A major weakness of the paper is its lack of focus. It includes a rambling historical introduction and discussion that omits discussion of the relevant recent research directly related to the questions at hand. For example, the paper describes historical work on phosphorylase, but gives not a single example of a paralog pair with a polymorphic PTM site identified in their study. The authors introduce gene duplication in a very general way, even though several papers have focused specifically on evolution of protein regulation in paralogs (e.g., PMID: 20080574, PMID: 27003913, PMID: 25474245). The paper of Nguyen Ba et al. 2014 (PMID: 25474245) seems especially relevant, as in addition to performing a genome-wide analysis, their abstract reads "We examine changes in constraints on known regulatory sequences and show that for the Rck1/Rck2, Fkh1/Fkh2, Ace2/Swi5 paralogs, they are associated with previously characterized differences in posttranslational regulation." It seems that the results of that study could be directly compared to the analysis performed here.

    Significance

    The significance is hard to assess because the research is not given proper context and motivation.

    I believe the study could be of interest to research studying cell signalling and its evolution, as well as those interested in gene family diversification. However, as written, no specific examples are given or clear hypotheses tested, making the paper seem largely descriptive.

    My keywords: molecular evolution, signalling, intrinsically disordered regions, computational biology