Protein evidence of unannotated ORFs in Drosophila reveals diversity in the evolution and properties of young proteins

Curation statements for this article:
  • Curated by eLife

    Evaluation Summary:

    By integrating in silico predictions and mass spectrometry, this manuscript tackles the problem of annotating the currently nameless stretches of genomic sequence that actually code for proteins. The hundreds of protein-coding fruit fly genes described here offer new inroads for studying some of the very youngest functional elements in genomes, particularly those that have recently emerged from non-coding DNA sequences. To clarify the biological significance of the present study, the authors should both highlight the genes most likely to encode functional products and conduct a comparison to published datasets that used different methods to identify such genes in fruit flies.

    (This preprint has been reviewed by eLife. We include the public reviews from the reviewers here; the authors also receive private feedback with suggested changes to the manuscript. The reviewers remained anonymous to the authors.)

This article has been Reviewed by the following groups

Abstract

De novo gene origination, where a previously nongenic genomic sequence becomes genic through evolution, is increasingly recognized as an important source of novelty. Many de novo genes have been proposed to be protein-coding, and a few have been experimentally shown to yield protein products. However, the systematic study of de novo proteins has been hampered by doubts regarding their translation without the experimental observation of protein products. Using a systematic, mass-spectrometry-first computational approach, we identify 993 unannotated open reading frames with evidence of translation (utORFs) in Drosophila melanogaster. To quantify the similarity of these utORFs across Drosophila and infer phylostratigraphic age, we develop a synteny-based protein similarity approach. Combining these results with reference datasets on tissue- and life stage-specific transcription and conservation, we identify different properties amongst these utORFs. Contrary to expectations, the fastest-evolving utORFs are not the youngest evolutionarily. We observed more utORFs in the brain than in the testis. Most of the identified utORFs may be of de novo origin, even accounting for the possibility of false-negative similarity detection. Finally, sequence divergence after an inferred de novo origin event remains substantial, suggesting that de novo proteins turn over frequently. Our results suggest that there is substantial unappreciated diversity in de novo protein evolution: many more may exist than previously appreciated; there may be divergent evolutionary trajectories, and they may be gained and lost frequently. All in all, there may not exist a single characteristic model of de novo protein evolution, but instead, there may be diverse evolutionary trajectories.

Article activity feed

  1. N-terminal acetylation as variable modifications

    This isn't critical, but I am curious how many N-terminally modified peptides were observed overall, and how that compares to the number of PSMs observed without this variable modification.

    This is an archived comment originally written by Peter Thuy-Boun

  2. using MaxQuant v. 1.6.1.0

    What were your criteria for positive protein/ORF-product identification by MS? Where applicable, were utORFs identified by more than one unique peptide? Each additional unique peptide could add a lot of confidence to utORF discovery.

    This is an archived comment originally written by Peter Thuy-Boun
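
The unique-peptide criterion raised in this comment can be illustrated with a minimal sketch. It assumes a hypothetical input of (peptide, ORF identifier) pairs and a hypothetical threshold of two unique peptides; it is not the authors' identification criterion, nor MaxQuant output parsing.

```python
from collections import defaultdict


def orfs_with_min_unique_peptides(psms, min_unique=2):
    """Return {orf_id: unique peptide count} for ORFs meeting the threshold.

    psms is an iterable of (peptide, orf_id) pairs (hypothetical format).
    A peptide counts as unique when it maps to exactly one ORF.
    """
    # Collect the set of ORFs that each distinct peptide maps to.
    peptide_to_orfs = defaultdict(set)
    for peptide, orf_id in psms:
        peptide_to_orfs[peptide].add(orf_id)

    # Count, per ORF, the distinct peptides that map to it alone.
    unique_counts = defaultdict(int)
    for orfs in peptide_to_orfs.values():
        if len(orfs) == 1:
            unique_counts[next(iter(orfs))] += 1

    return {orf: n for orf, n in unique_counts.items() if n >= min_unique}
```

Repeated spectra for the same peptide are counted once, so the threshold reflects distinct unique peptides rather than spectrum counts.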

  3. Outlook: Together, our results show that evolution of young proteins may progress along different, distinct trajectories in Drosophila. Whether similarly distinct trajectories are apparent in other model species such as yeast or mammals remains to be seen. Of note, Drosophila is a taxon of multicellular organisms with a large effective population size, so selective processes are more efficient; mammals – especially primates and Homo – are evolutionarily young and have a smaller effective population size, while yeasts are unicellular. If these factors affect general evolutionary properties, such as the selective cost of translation of lowly functional proteins and the probability of fixation by drift, it is possible that they may affect the evolution of de novo proteins. In the case of Homo, all these factors may be more favorable to the fixation of new de novo proteins, and the availability of broad and varied -omics data is unparalleled. It would therefore be an obvious extension to employ a similar approach to investigate possible utORFs and de novo proteins in humans.

    This is an interesting approach to utORF discovery and a useful reexamination of publicly available biological data resources.

    This is an archived comment originally written by Peter Thuy-Boun

  4. Accordingly, to improve total sensitivity while maintaining an acceptable FDR, we used two rounds of analysis

    Large protein sequence databases are common in metaproteomics and this paper describes a 2-step approach similar to the one used here: https://analyticalsciencejournals.onlinelibrary.wiley.com/doi/abs/10.1002/pmic.201200352

    While this approach has generated some criticism with respect to false discovery rate estimation, it can still be a useful tool for discovery.

    This is an archived comment originally written by Peter Thuy-Boun
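
As a rough illustration of the two-step idea discussed in this comment, the sketch below estimates a target-decoy FDR cutoff and uses a permissive first round to build a reduced database for a second search. It is a generic pattern under stated assumptions, not the authors' pipeline or MaxQuant's internal procedure; the function names, the (score, is_decoy, sequence_id) tuple layout, and the decoys/targets FDR estimate are all assumptions made for illustration.

```python
from itertools import groupby


def score_cutoff_at_fdr(psms, max_fdr):
    """Return the lowest score cutoff whose estimated FDR is <= max_fdr.

    psms is a list of (score, is_decoy) pairs (hypothetical format).
    FDR at a cutoff is estimated as decoys / targets among PSMs scoring
    at or above it. Returns None if no cutoff qualifies.
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    targets = decoys = 0
    cutoff = None
    # Process tied scores together so each cutoff includes all its ties.
    for score, group in groupby(ranked, key=lambda p: p[0]):
        for _, is_decoy in group:
            if is_decoy:
                decoys += 1
            else:
                targets += 1
        if targets > 0 and decoys / targets <= max_fdr:
            cutoff = score
    return cutoff


def reduced_database(first_round_psms, permissive_fdr):
    """Collect target sequence IDs passing a permissive first-round cutoff.

    first_round_psms is a list of (score, is_decoy, sequence_id) triples.
    The returned IDs would define the smaller database for round two.
    """
    cutoff = score_cutoff_at_fdr(
        [(score, is_decoy) for score, is_decoy, _ in first_round_psms],
        permissive_fdr,
    )
    if cutoff is None:
        return set()
    return {
        seq_id
        for score, is_decoy, seq_id in first_round_psms
        if score >= cutoff and not is_decoy
    }
```

The criticism mentioned above applies here too: because the second-round database is enriched for sequences that already matched, a decoy-based FDR computed against it may be optimistic unless decoys are regenerated or the estimate is otherwise corrected.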

  7. All 993 utORFs have unique genomic locations (Figure 1–Source Data 3). Unannotated translated ORFs reside in a range of genomic locations, including intergenic, intronic, or UTRs (Figure 1E).

    Did you look into any epigenetic modification states (modENCODE) in these areas? Any predictions that they might be regulatory regions or do they seem epigenetically transcriptionally active? Similarly, was there any SNP/variation data available for these regions (that might be a human genome biased question though!)?

  8. Predictions of structural disorder of utORFs suggest that while they are rather disordered, most retain a substantial proportion that is ordered. The median proportion of disordered utORFs is 24.5%.

    I am curious if you ever saw any predicted structures that were indicative of future functional domains - such as transmembrane/DNA binding/etc. Might be useful to predict future function?

  12. Reviewer #1 (Public Review):

    In this study, Zheng and Zhao identified unannotated open reading frames (ORFs) with evidence of translation in Drosophila, termed utORFs, mainly based on proteomics datasets. The authors extended their analyses to the birth and the evolutionary heterogeneity of utORFs. These analyses uncovered several types of utORFs that differ in features including transcription, age, distribution, and evolutionary conservation.

    The origin of de novo protein-coding genes is interesting. The authors' attempts to uncover utORFs from proteomics datasets are much appreciated, but crucial cross-validation is missing. Given the high potential for false positives in MS datasets, it is difficult to evaluate the evolutionary aspects of the identified ORFs. Some experimental validation is needed to confirm the translational potential of utORFs with or without start codons.

  13. Reviewer #2 (Public Review):

    Zheng & Zhao developed an advanced approach that combines a full-reading-frame search with MS-based translation evidence to find evolutionarily new genes. Several hundred previously unannotated but clearly translated genes were identified and dated for their origination. Their genomic, transcriptional, and structural properties and their ages were characterized. These findings, enabled by recent technical developments, are a significant addition to the literature on evolutionarily new genes. In addition, this study pointed out the insufficiency of present-day gene annotation in Drosophila genomes, an issue with wide impact on the Drosophila community that this manuscript should have emphasized.

  14. Reviewer #3 (Public Review):

    The goal of this work is to understand the role that previously neglected, unannotated ORFs play in the evolution of gene novelty in the Drosophila melanogaster lineage. These are ORFs that mostly code for small proteins, most of them having noncanonical start codons. The authors sought to identify translated ORFs using published MS proteomics datasets, making sure to achieve a balance between false positives and false negatives; they succeed rather convincingly. They then focused on when these ORFs first appeared and how they evolved, mainly aiming to understand whether some of them have emerged de novo and the evolutionary trajectories that they have taken.

    The major strengths of the manuscript lie in its scope, as it takes advantage of recently published data to exhaustively search the entire ORF catalogue of D. melanogaster for translation, in the application of rigorous methodologies for the identification of MS-supported ORFs and in the inference of the phylogenetic age of the ORF using a novel synteny-based approach. About this last point, however, I feel that some methodological details are missing. I understand that the genomic MSA of the D. melanogaster ORF and its orthologous region is extracted and that a search for the optimally aligning segment in the sequence of each species is conducted. Does that search include only ORFs in each orthologous region? I assume this is the case because the similarity cut-off of 2.5 is then calculated from protein alignments. If that is the case, why not use global alignments of entire ORFs? Furthermore, why is there no gap penalty used? Finally, I cannot see where the genomic similarity scoring part detailed in the methods is used, which adds to my confusion.

    Albeit not a major one, an additional weakness comes from the use of Latent Class Analysis to identify subpopulations of ORFs within the greater set, and examine their differences. I see why the authors did it and in theory, I have no objection, but given the small number of factors (8 if I'm counting correctly), it's unclear if it's worth the added level of complexity. Plus there's some potential bias involved since it requires binning continuous variables and hence defining bins. It seems to me that the authors could have achieved more or less the same by looking for specific subgroups based on criteria that they set themselves a priori.

    A crucial part of the work is the attribution of de novo origin to utORFs. Here, I find the initial analysis, wherein a single outgroup species is sufficient to invoke de novo origination, relatively unnecessary. Especially since the authors go on to state themselves that only two or more supporting outgroups can provide convincing evidence. I would add that at least two of the outgroups should be non-monophyletic. It is also unclear why an ORF needs to be present in the outgroups at all (and lacking significant similarity). Is there a limit to how small that ORF can be? If so, and if there happens to be no such ORF in a region, why would that not count as evidence?

    I feel that the authors achieve most of their aims, at least the ones that I perceive as the most important. There are, however, some findings that are not sufficiently well supported.
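
Reviewer #3's suggestion that at least two supporting outgroups be non-monophyletic can be made concrete with a small sketch. It rests on one possible reading of that suggestion: a pair of outgroups lacking similarity counts only if their most recent common ancestor also lies on the focal species' lineage, so that a single loss in a shared outgroup ancestor cannot explain both absences. The tree, the species labels, and all function names are hypothetical, and this is not the authors' procedure.

```python
def ancestors(node, parent):
    """Return the path from node up to the root, starting with node."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def mrca(a, b, parent):
    """Return the most recent common ancestor of nodes a and b."""
    seen = set(ancestors(a, parent))
    for node in ancestors(b, parent):
        if node in seen:
            return node
    raise ValueError("nodes are not in the same tree")


def supports_de_novo(focal, supporting_outgroups, parent):
    """True if two supporting outgroups do not form a clade excluding focal."""
    focal_lineage = set(ancestors(focal, parent))
    outs = sorted(supporting_outgroups)
    for i, a in enumerate(outs):
        for b in outs[i + 1:]:
            # An MRCA on the focal lineage means the pair brackets the
            # focal species, so the two absences are independent evidence.
            if mrca(a, b, parent) in focal_lineage:
                return True
    return False


# Toy child -> parent map (hypothetical): ((focal, out_A), (out_B, out_C)).
TOY_PARENT = {
    "focal": "n1", "out_A": "n1", "n1": "root",
    "out_B": "clade_BC", "out_C": "clade_BC", "clade_BC": "root",
}
```

On the toy tree, the sister pair out_B and out_C alone does not satisfy the rule, whereas out_A together with out_B does.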