Open reading frame dominance indicates protein‐coding potential of RNAs

Abstract

No abstract available

Article activity feed

  1. Note: This rebuttal was posted by the corresponding author to Review Commons. Content has not been altered except for formatting.


    Reply to the reviewers

    1. General Statements

    The comments of the reviewers were highly insightful and enabled us to greatly improve the quality of our manuscript. We provided point-by-point responses to each of the reviewers’ comments. Revisions in the text are highlighted in yellow. We hope that the revisions in the manuscript and our accompanying responses will be sufficient to make our manuscript suitable for publication.

    2. Point-by-point description of the revisions

    Reviewer #1

    - The authors provide no rationale for using the PTI score to measure the protein-coding potential of transcripts. The only attempt to justify this measure is given in the methods: "The definition of PTI score is motivated by our hypothetical concept that translation of pPTI is limited by alternate competing sPTIs." (lines 426-427, page 20). What the PTI score measures is the dominance of the largest predicted ORF over the predicted ORFs, in terms of length. It is not clear why there would be competition for translation of putative ORFs for genuine protein-coding transcripts. An alternative hypothesis, briefly touched upon in the discussion (lines 318-320) is that translation of non-functional ORFs could give rise to the production of toxic proteins, in addition to being costly in terms of energy. The authors should provide the reasoning behind the PTI score and should explain the biological mechanisms that may underlie differences between coding and non-coding transcripts.

    Thank you for your comment. We previously identified a de novo gene, NCYM, and showed that its protein has a biochemical function (Suenaga et al 2014; Suenaga et al 2020). However, NCYM was previously registered as a non-coding RNA in the public database, and an established predictor of protein-coding potential, the Coding Potential Assessment Tool (CPAT), gave NCYM a coding probability of 0.022, labeling it as a noncoding RNA (new Supplementary Figure 1B). Therefore, we sought a new indicator of coding potential by comparing NCYM with a small subset of coding and non-coding RNAs to determine whether NCYM has sequence features that would allow it to be registered as a coding transcript (data not shown). We found that predicted ORFs, other than major ORFs, seem to be short in coding RNAs. In addition, it has been reported that upstream ORFs inhibit the translation of major ORFs (Calvo et al 2009). Therefore, we hypothesized that the predicted ORFs may reduce the translation of major ORFs and have therefore become short during evolution in coding transcripts, including NCYM. The term ORF refers to an RNA sequence that is translated into an actual product; however, the biological significance of non-translated, predicted ORFs has been largely ignored and remains to be characterized. Therefore, we defined a PTI as an RNA sequence running from a start codon to a stop codon, without assuming that it results in a translated product. Thus, PTIs can be defined even in genuine non-coding RNAs. The major ORFs are often the longest PTIs (hereafter, primary PTIs or pPTIs) in coding transcripts. Thus, to investigate the importance of pPTIs relative to other PTIs (hereafter, secondary PTIs, or sPTIs) for the evolution of coding genes, we defined the PTI score as the fraction of the total PTI length occupied by the pPTI (Figure 1A–B) and hypothesized that the PTI score would be high in coding transcripts. This is the rationale for using the PTI score as an indicator of protein-coding potential, and it is now included in the revised manuscript (lines 92-115, pages 5-6).
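    The definition above (pPTI length divided by the summed length of all PTIs) can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names `find_ptis` and `pti_score` are hypothetical, and the sketch assumes forward-strand scanning of the three reading frames, ATG as the only start codon, and one PTI per stop codon (starting at the most upstream in-frame ATG). The manuscript's Methods may handle these details differently.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"


def find_ptis(seq):
    """Return (start, end) index pairs of start-to-stop segments.

    Assumptions of this sketch: forward strand only, ATG as the only
    start codon, and for each stop codon only the most upstream
    in-frame ATG is used, so segments sharing a stop are counted once.
    """
    seq = seq.upper().replace("U", "T")  # accept RNA or DNA letters
    ptis = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START_CODON:
                start = i
            elif start is not None and codon in STOP_CODONS:
                ptis.append((start, i + 3))  # include the stop codon
                start = None
    return ptis


def pti_score(seq):
    """Length of the longest PTI divided by the summed length of all PTIs."""
    lengths = [end - start for start, end in find_ptis(seq)]
    if not lengths:
        return 0.0  # no PTI found; the score is undefined, 0.0 is a placeholder
    return max(lengths) / sum(lengths)
```

    Under these assumptions, a transcript with a single PTI scores 1.0, and the score falls as competing sPTIs grow longer relative to the pPTI.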

    To examine the biological mechanism underlying the difference between coding and noncoding RNAs, we investigated the relationship between translation and PTI scores. We chose a dataset of non-coding RNAs that are translated into small proteins, derived from the databases SmProt and sORF.org. Based on ribosome profiling and mass spectrometry data, the databases include noncoding RNAs that encode small proteins (less than 100 residues) as well as mRNAs that have extra-small ORFs in addition to major ORFs. The SmProt database divides these small ORFs into three categories: upstream (uORF), small (sORF), and downstream (dORF). The definitions are based on their locations: uORFs and dORFs are located in 5’ and 3’ UTRs, respectively, and sORFs overlap with major ORFs using different reading frames (new Figure 2B). We first calculated PTI scores of lincRNAs encoding small proteins and found that the distribution of these lincRNAs shifted to higher PTI scores compared with the distribution of all lincRNAs (new Figure 2A). Therefore, lincRNA translation is correlated with higher PTI scores. Next, we examined whether PTI scores were associated with the translation occupancy of major ORFs in coding RNAs. We calculated PTI scores in mRNAs with uORFs, sORFs, or dORFs and found that the distribution of mRNAs encoding such small proteins shifted to lower PTI scores (new Figure 2C). Similar data were obtained from the sORF.org dataset (Supplementary Figure 5). These data support the idea that the PTI score is related to the occupancy of the major ORF during translation. These results are now included in the results of the revised manuscript (lines 241-271, pages 12-13).

    Translation of small proteins from noncoding RNAs seems to inhibit noncoding functions because of ribosome binding and subsequent translation. On the other hand, translation of sPTIs in coding RNAs seems to inhibit the translation of major ORFs because of competing translations (Calvo et al 2009). At the same time, however, the translation of such proteins may have the advantage of producing new functional proteins or regulatory mechanisms during evolution. Therefore, the right and left shifts of the PTI score that we observed for noncoding and coding RNAs, respectively, seem to be slightly deleterious or beneficial. As further discussed in the responses below, the overlap of the PTI score distributions of coding and noncoding transcripts was negatively correlated with the effective population size of the species. Therefore, as nearly neutral theory predicts, mutations causing such slightly deleterious or beneficial effects of translation in coding and noncoding transcripts seem to be fixed by genetic drift in species with small effective population sizes, including humans (Kimura 1968, 1983; Ohta 1992). PTI scores are thus related to translation of PTIs, and their distributions suggest a mechanism for producing bifunctional RNAs that are simultaneously coding and noncoding. This discussion has now been included in the revised manuscript (lines 487-503, pages 23-24).

    Calvo SE, Pagliarini DJ, Mootha VK. Upstream open reading frames cause widespread reduction in protein expression and are polymorphic among humans. Proc Natl Acad Sci U S A. 2009 May 5;106(18):7507-12. doi: 10.1073/pnas.0810916106. Epub 2009 Apr 16. PMID: 19372376; PMCID: PMC2669787.

    Kimura M. 1968. Evolutionary rate at the molecular level. Nature. 217(5129):624-6. PMID: 5637732. https://doi.org/10.1038/217624a0

    Kimura M. 1983. The Neutral Theory of Molecular Evolution. Cambridge: Cambridge University Press. https://doi.org/10.1093/obo/9780199941728-0132

    Ohta T. 1992. The Nearly Neutral Theory of Molecular Evolution. Annu Rev Ecol Syst. 23:263-86.

    - The presence of ORFs in transcripts has long been used as a predictor of their protein-coding potential. For example, the ORF size and the ORF coverage are part of the set of predictors implemented in CPAT (Wang et al., 2013). The PTI score is necessarily related to these methods, yet no comparison is provided. If the PTI score is to be used as a measure to classify transcripts as coding or non-coding, its performance should be compared to other classifiers, including those that use the presence of ORFs as a predictor (e.g., CPAT) but not only (e.g., PhyloCSF, based on the pattern of sequence evolution).

    Thank you for your comment. As you noted, our reasons for using the PTI score were not clearly described in the original manuscript and are now included in the Results section (lines 92-115, pages 5-6). As mentioned in response to comment 1, CPAT was not able to predict NCYM as a coding transcript (Supplementary Figure 1B). Furthermore, we intended to use this new concept to identify the RNA sequence elements that determine protein-coding potential, not to use the score as a classifier of coding or non-coding RNAs. Many studies have identified bifunctional RNAs that are simultaneously coding and noncoding (Li and Liu 2019; Huang Y et al. 2021). Moreover, neutrally evolving peptides are encoded by small ORFs of noncoding RNAs, possibly contributing to the evolutionary origin of new functional proteins (Ruiz-Orera et al. 2014). Therefore, we argue that such dichotomous classification is often misleading, because it implicitly ignores ncRNAs that encode functional or nonfunctional small proteins. Additionally, this approach has several technical problems. A training set for such a classification requires a dataset of genuine noncoding RNAs. However, it is quite difficult to define such noncoding RNAs without bias toward, for example, particular cell or tissue types, including cancer or normal cells and tissues. Increasing evidence has shown peptide translation from known noncoding RNAs (Li and Liu 2019; Huang Y et al. 2021); moreover, some of these peptides are specific to the cellular context (Dohka et al 2021). Therefore, we cannot be certain that we are identifying genuine noncoding RNAs from ribosome profiling or mass spectrometry datasets, which cover neither all cell and tissue types nor all physiological contexts.

    We agree that we need to compare PTI scores with other indicators of coding potential, such as transcript length, ORF size, and ORF coverage. ORFs of less than 100 residues have been used to define noncoding RNAs; thus, such RNAs necessarily have shorter ORF sizes relative to coding RNAs. Therefore, we calculated these indicators by focusing on noncoding RNAs that encode proteins, but not coding RNAs (new Supplementary Figure 4). The PTI score distribution shifted to the right for lincRNAs encoding small proteins, indicating that the PTI score is related to translation (new Figure 2C). In contrast, the distributions of transcript length, ORF size, and ORF coverage did not shift higher for noncoding RNAs encoding small proteins (new Supplementary Figure 4), although a slight shift to higher ORF coverage was found. Therefore, we argue that the PTI score is a better indicator of translation than transcript length, ORF size, or ORF coverage. These results are now included in the results of the revised manuscript (lines 241-255, page 12).

    - The authors compare the observed PTI score distributions with the PTI scores from random or shuffled sequences. They conclude that the PTI scores do not depend on transcript lengths but on transcript sequences (lines 122-123). However, this is not true for non-coding RNAs, for which the observed and randomized distributions are very similar. The relationship between transcript length and PTI scores should be analyzed into more detail. Are the annotated non-coding transcripts with high PTI scores particular in terms of length?

    We analyzed the length of high-PTI-score transcripts compared to all lncRNA transcripts. The average high-PTI-score with high coding potential (0.6 PTI score −29), consistent with the distribution of transcript length in lincRNAs translating small proteins (new Supplementary Figure 4C). Therefore, the high PTI scores are not simply due to the larger ORF size derived from longer transcript length, but also to the occupancy of the pPTI among all PTIs. The occupancy of the pPTI can be estimated by ORF coverage or PTI score, and transcript length (the denominator of ORF coverage) correlates with the sum of the lengths of all PTIs (the denominator of the PTI score). Thus, we need to clarify which indicator has more biological significance in terms of gene evolution. Higher PTI scores in noncoding RNAs cause the PTI score distributions of coding and noncoding transcripts to overlap in eukaryotes, especially in multicellular eukaryotes (new Figures 4 and 5). The overlaps of PTI score distributions between coding and noncoding RNAs (Opti) were positively and negatively correlated with mutation rate and effective population size, respectively, and approximated by logarithmic or exponential relationships (new Figure 6). Because the inverse of the effective population size defines the strength of genetic drift relative to the strength of selection, the overlaps quantified by Opti seem to be derived from genetic drift. These results clearly suggest that the observed PTI score distribution of noncoding RNAs is not random. In contrast, the overlap based on ORF coverage (Ocov) showed a weaker relationship with mutation rates and effective population sizes (new Supplementary Figures 8 and 9). These results suggest that ORF coverage is less related to gene evolution than the PTI score, with the weak relationship seemingly derived indirectly from the correlation with the PTI score. We have now included these results in the revised manuscript (lines 306-322, page 15).
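    For readers comparing the two occupancy measures, they can be written side by side. The symbols below are introduced here only for illustration (they are assumptions, not the manuscript's notation): the numerator is the pPTI length in both cases, and only the denominator differs, as described in the response above.

```latex
\[
\text{ORF coverage} = \frac{L_{\mathrm{pPTI}}}{L_{\mathrm{transcript}}},
\qquad
\text{PTI score} = \frac{L_{\mathrm{pPTI}}}{\sum_{i} L_{\mathrm{PTI}_i}}
\]
```

    Here the sum runs over all PTIs in the transcript, including the pPTI itself, so the PTI score equals 1 when the transcript contains a single PTI.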

    - The authors discuss in depth the correlation between PTI scores and PTI-based protein-coding potential measures (e.g., section "PTI scores correlate with protein coding potential in humans and mice", starting line 125; section "Relationship between the PTI score and protein-coding potential", starting line 243). Given that the protein-coding potential is directly derived from the PTI score distributions for coding and non-coding transcripts, it is not surprising that the two should be correlated. The significance of observing a linear or a sigmoid relationship is not clearly explained.

    As you noted, the protein-coding potential was directly derived from the PTI score distributions. Therefore, if the distribution for coding RNA shows a higher or lower PTI score compared to that of noncoding RNA, the protein-coding potential is expected to be positively or negatively correlated with the PTI score. If the distributions of coding and noncoding RNA significantly overlapped (Opti > 0.7), the protein-coding potential became constant and was not correlated with the PTI score (new Figure 7 and new Supplementary Figure 10). Thus, the PTI score is not always positively correlated with the protein-coding potential.
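    To make the argument above concrete, the following Python sketch shows one way a coding potential could be derived from two PTI score distributions. The formula is an assumption made for illustration (the rebuttal does not state it): within each PTI-score bin, the potential is taken as the relative frequency of coding transcripts divided by the sum of the coding and noncoding relative frequencies. The function names and the bin count are hypothetical.

```python
def relative_frequencies(scores, n_bins=10):
    """Histogram of scores in [0, 1], normalized to sum to 1."""
    if not scores:
        raise ValueError("scores must not be empty")
    counts = [0] * n_bins
    for s in scores:
        idx = min(int(s * n_bins), n_bins - 1)  # a score of 1.0 goes to the last bin
        counts[idx] += 1
    return [c / len(scores) for c in counts]


def coding_potential_by_bin(coding_scores, noncoding_scores, n_bins=10):
    """Assumed definition: f_coding / (f_coding + f_noncoding) per bin.

    Bins that contain no transcripts of either class return None.
    """
    f_c = relative_frequencies(coding_scores, n_bins)
    f_n = relative_frequencies(noncoding_scores, n_bins)
    return [c / (c + n) if (c + n) > 0 else None for c, n in zip(f_c, f_n)]
```

    Under this assumed definition, two identical distributions give a potential of 0.5 in every populated bin, which is one way to see why the potential becomes constant, and uncorrelated with the PTI score, when the coding and noncoding distributions largely overlap.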

    We had divided the species into three groups (sigmoidal, linear, and others) based on the intercept and slope of the linear approximation. However, considering the fit of the linear approximation, there is no essential difference between the sigmoidal and linear groups. Therefore, in the revised text, we classify the species into two groups: linear and constant (new Figure 7 and Supplementary Figure 10). We have now replaced the figures and added a new interpretation of the results in the revised manuscript (lines 341-353, pages 16-17).

    - The authors use the entire set of annotated coding and non-coding transcripts to assess the distribution of PTI scores and to define the protein-coding potential. Traditionally, for methods that aim to classify transcripts as coding or non-coding, this is done using "bona fide" coding and non-coding transcripts, which are used as training sets. The efficiency of the method can then be evaluated using a test set of transcripts. This aspect is lacking here and should be implemented.

    As we wrote in response to your comment 2, we aimed to examine which RNA sequence elements determine genuine coding RNAs, not to build a classifier of coding and noncoding RNAs. Technically, “bona fide” coding and noncoding RNAs cannot be rigorously defined, given the possible existence of unidentified bifunctional RNAs in the testing sets; more traditional approaches often eliminate such possibilities.

    - The comparisons among species are likely biased by the quality of lncRNA annotations in non-model organisms - cf. high variations among primates, which are likely driven by the annotation quality and depth.

    As written in the response to comment 3, the variation of PTI score distribution in lncRNA is not random, and overlaps with the distribution of coding RNA are negatively correlated with effective population size (new Figure 6). In addition, we found that the tissue-specific expression of lncRNA influences the PTI score distribution in multicellular eukaryotes (new Figure 8C and D and new Supplementary Figures 11 and 12). Therefore, the variation is caused, at least in part, by the specificity of gene expression, and it thus has biological significance. These results are now included in the revised manuscript (lines 383-402, pages 18-19).

    Based on these results, we expect that the lncRNA annotations derived from two major databases, Ensembl and RefSeq, are well curated and of sufficient quality to compare the PTI score distributions. Realistically, there is no database other than these two that catalogs a large number of curated lncRNAs from various species. However, we also expect that recent progress in whole genome sequencing and transcriptome analysis of vertebrates may improve the annotation of lncRNAs, including those of non-model organisms, and provide more ideal datasets for comparisons among species.

    - The differences among bacteria, archaea and eukaryotes should be discussed into more depth. In bacteria, the genuine ORF is well defined by the presence of translation signals (e.g., Shine-Dalgarno sequence). Other factors are also at work in both prokaryotes and eukaryotes, including RNA secondary structures. The relationship between these factors and the PTI score should be discussed.

    The Shine–Dalgarno sequence in bacteria and the Kozak sequence in eukaryotes have been identified as important regulatory elements for ribosome binding, but these sequences are not essential for all coding RNAs, and their significance is not well characterized, especially in noncoding RNAs that are translated. Recent research has sought to identify the determinants that regulate ribosome binding to lncRNAs using 99 characteristics, including the weight of each base at the −6 to +1 positions relative to the start codon (Kozak-like sequence) or RNA secondary structure (Zeng et al 2018). They found that transcript length is a stronger indicator than either of these characteristics for ribosome binding in human lncRNAs. Because the PTI score is a better indicator for translation of lincRNAs than transcript length (new Supplementary Figure 4C), we would argue that Kozak sequences and RNA secondary structures are not reliable indicators for ribosome binding of lncRNAs, and their significance should be limited to more specific transcript classes. Furthermore, Hata et al. recently showed that the Kozak sequence is a negative regulator of de novo gene birth in plants (Hata et al. 2021). Therefore, these sequence characteristics seem to evolve after the birth of coding transcripts and are not generally involved in new coding gene origination from noncoding RNAs.

    Zeng C, Hamada M. 2018. Identifying sequence features that drive ribosomal association for lncRNAs. BMC Genomics. 19(Suppl 10):906. PMID: 30598103; PMCID: PMC6311901. https://doi.org/10.1186/s12864-018-5275-8

    Hata T, Satoh S, Takada N, Matsuo M, Obokata J. 2021. Kozak sequence acts as a negative regulator of de novo transcription initiation of newborn coding sequences in the plant genome. Mol Biol Evol. 38:2791-2803. PMID: 33705557; PMCID: PMC8233501. https://doi.org/10.1093/molbev/msab069

    - From an evolutionary perspective, the effective population size (Ne) is also likely related to the "quality" of the ORFs. An analysis of Ne vs. the PTI score distributions would be an interesting addition to this manuscript.

    We appreciate this comment. We now include an analysis of the relationship between Ne and PTI scores by defining an indicator of the extent of overlap in the PTI score distributions between coding and noncoding transcripts. This overlapping score was calculated based on PTI scores or ORF coverage and named Opti or Ocov, respectively. Opti showed positive and negative correlations with mutation rates (Up) and effective population size (Ne), respectively (new Figure 6A), suggesting that the overlap of PTI score distributions is related to slightly deleterious or beneficial mutations fixed in populations due to genetic drift. Furthermore, using the relationship between Ne and Opti, we calculated the minimum effective population size to be approximately 1000, which is consistent with the results from conservation biology (Frankham et al. 2014). Indeed, species at risk of extinction had significantly higher Opti than species with little risk of extinction (left panel, new Figure 6B). In addition, Opti was higher for species with decreasing population sizes than for those with stable population sizes (right panel, new Figure 6B). These results are now included in the revised manuscript (lines 323-332, pages 15-16).
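    An overlap score of this kind can be illustrated with a short Python sketch. The exact definition of Opti is not given in this response, so the sketch assumes a common choice: bin both sets of PTI scores, normalize each histogram to sum to 1, and add up the bin-wise minima. The function names and the default bin count are hypothetical.

```python
def normalized_histogram(scores, n_bins=20):
    """Histogram of scores in [0, 1], normalized to sum to 1."""
    if not scores:
        raise ValueError("scores must not be empty")
    counts = [0] * n_bins
    for s in scores:
        idx = min(int(s * n_bins), n_bins - 1)  # a score of 1.0 goes to the last bin
        counts[idx] += 1
    return [c / len(scores) for c in counts]


def overlap_score(coding_scores, noncoding_scores, n_bins=20):
    """Assumed overlap measure: sum over bins of min(f_coding, f_noncoding).

    Returns 0.0 for fully separated distributions and 1.0 for identical ones.
    """
    f_c = normalized_histogram(coding_scores, n_bins)
    f_n = normalized_histogram(noncoding_scores, n_bins)
    return sum(min(c, n) for c, n in zip(f_c, f_n))
```

    With this assumed measure, the same function applied to ORF coverage values instead of PTI scores would give an analogue of Ocov.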

    Frankham R, Bradshaw CJA. 2014. Genetics in conservation management: Revised recommendations for the 50/500 rules, Red List criteria and population viability analyses. Biological Conservation. 170:56-63. https://doi.org/10.1016/j.biocon.2013.12.036

    Reviewer #1 (Significance (Required)):

    This manuscript is lacking in novelty and is not well positioned in the field. If the aim of this work is to provide a method to classify transcripts as coding or noncoding, the authors should provide detailed comparisons with existing methods (see above). If the aim is to understand what defines a genuine protein-coding transcript, then the biological mechanisms should be better described and the comparisons among species and among functional categories of genes should be further developed. The idea of using the "dominance" of the largest ORF compared to the other predicted ORFs is interesting, and provides a new element compared to existing methods that rely exclusively on ORF length and ORF coverage. I would recommend that the authors develop this idea further and discuss the advantages of using the ORF dominance compared to just the ORF length or coverage.

    Thank you for your comment. To address this, we have revised the description of our aim, which is to investigate what defines a genuine protein-coding transcript. Pursuing this aim led us to find that the extent of overlap of PTI score distributions between coding and noncoding transcripts is negatively correlated with effective population size. In addition, we have added characterizations of functional categories of high-PTI-score lncRNAs in mice (new Supplementary Tables 6 to 8) and C. elegans (new Supplementary Tables 9, 10, and 11). Comparison of ORF size and coverage with the PTI score showed that the PTI score is a better indicator for translation of lncRNAs than these indicators and has biological significance in molecular evolution because of its clear correlation with mutation rate and effective population size. These results and related descriptions are now included in the revised manuscript (lines 323-332, pages 15-16; lines 210-218, pages 10-11).

    **Referee Cross-commenting**

    I fully agree with Reviewer 2's remarks. In particular, adding ribosome profiling analyses is an excellent idea and could substantially improve the manuscript.

    We investigated the PTI scores in lncRNAs that are translated, using ribosome profiling data, and found that PTI scores correlated with translation (lines 241-271, pages 12-13). Thank you for this excellent suggestion.

    Reviewer 2

    **Major comments:**

    - some validation of their predictions of coding potential would be good to add. There are plenty of ribosome profiling experiments out there for some of the studied organisms (human, mouse, E. coli) that could be used to show that indeed some of the non-coding RNAs are misclassified and have ribosome density across the predicted open reading frames.

    Thank you for your comment. As noted in our response to Reviewer 1 above, we calculated the PTI scores of translated lncRNAs from the two databases and found that the PTI score correlates with translation of both coding and noncoding RNAs (new Figure 2 and new Supplementary Figures 4 and 5). As noted above, such translation seems to produce slightly deleterious or beneficial effects, and the underlying mutations thereby become fixed by genetic drift in species with smaller effective population sizes. These results and related discussion are now included in the revised manuscript (lines 241-271, pages 12-13; lines 323-332, pages 15-16; lines 487-503, pages 23-24).

    - the manuscript is at times difficult to follow and the implication of the statements may not be immediately clear to the readers, particularly those without formal training in bioinformatic methods; even in the abstract. Some examples: "The relationship between the PTI score and protein-coding potential was sigmoidal in most eukaryotes; however, it was linear passing through the origin in three distinct eutherian lineages, including humans". Here it is not clear what this means (without reading the paper) - and even after reading the paper the importance of noting the sigmoidal vs linear relationship of PTI vs. protein-coding potential is unclear. I would encourage the authors to double-check that they provide a clear interpretation of their results, with readers unschooled in proper statistics in mind.

    Thank you for these comments. As we noted in response to comment 4 of Reviewer 1, considering the fit of the linear approximation, there was no essential difference between the sigmoidal and linear groups. Therefore, in the revised manuscript, we classify the species into two groups: linear and constant (new Figure 7 and Supplementary Figure 10). We also propose and diagram a new gene birth model to help readers understand our interpretations more easily (Figure 9). These results and discussion are now included in the revised manuscript (lines 341-353, pages 16-17; lines 514-538, pages 24-25).

    - For the definition of PTI and protein-coding potential the authors refer to the Materials and Methods. I would encourage to explain in plain terms in the results section 1.) how they decided on this particular formalization and 2.) explain clearly what this means.

    Thank you for your suggestion. We have included a concise definition in plain terms in the revised text (lines 107-115, pages 5-6; lines 144-146, page 7).

    - The definition of protein coding potential for appears to be dependent on database classification of a transcript as either coding and non-coding. Particularly for organisms with complex transcriptomes, databases may not contain the proper information - what are the implications for their protein-coding potential score?

    Organisms with complex transcriptomes, such as multicellular organisms, present difficulties in classifying coding vs. noncoding transcripts because RNAs classified as noncoding based on proteomic data from a subset of cell types may encode functional proteins in other cell types for which proteomic data are not available. To examine whether cell types affect the PTI distribution of coding and noncoding transcripts, we analyzed transcriptomic data from five mammals (human, mouse, rat, macaque, and opossum) and found that the PTI score distributions were similar in most cell or tissue types for noncoding transcripts (new Figure 8C and Supplementary Figure 11). However, PTI score distributions for noncoding RNA in mature testes showed a rightward shift for all five species (new Figure 8C and Supplementary Figure 11).

    Furthermore, we found that tissue specificity of RNA expression was correlated with PTI score (new Figure 8D and new Supplementary Figures 12 and 13): more specific expression was associated with higher PTI scores in all five species, and the majority of the tissue-specific expression was in mature testis. Therefore, the mature testis is a special tissue that expresses noncoding RNAs with high coding potentials. These results support the hypothesis that the testis is a special organ for new gene origination (Kaessmann 2010). We have added these results and discussion to the revised manuscript (lines 383-402, pages 18-19; lines 427-434, pages 20-21; lines 435-445, page 21).

    Kaessmann H. 2010. Origins, evolution, and phenotypic impact of new genes. Genome Res, 20:1313-26. Epub 2010 Jul 22. PMID: 20651121; PMCID: PMC2945180. https://doi.org/10.1101/gr.101386.109

    - The authors completely ignore plants - would it make sense to expand their analysis to this branch of the tree of life?

    In Supplementary Figure 5 of our original manuscript (new Supplementary Figure 7), we have included the PTI score distributions from plants. We also present their overlapping scores (Opti) in the revised manuscript.

    Reviewer 2 (Significance (Required)):

    The manuscript presents an elegant way to predict protein-coding and non-coding RNAs, which may be very relevant to the study of organisms with complex transcriptomes. The audience for the manuscript at the moment may be more limited to scientists trained and working in the field of bioinformatics, but with some integration of transcriptomics and ribosome profiling data, as well as an effort to make the results accessible to scientists not trained in bioinformatics, this manuscript may be relevant and of interest to researchers working on the biology of long non-coding RNAs and translation in general. My expertise: systems biology of RNA binding proteins, transcriptomics, RNA biology.

    **Referee Cross-commenting**

    I fully agree with my co-reviewer regarding additional analyses to strengthen the manuscript.

    Thank you for these comments. We analyzed noncoding RNAs using ribosome profiling data and transcriptomes in different tissues. We found that high PTI scores correlated with translation of noncoding RNAs, and that such high-PTI-score noncoding RNAs were specifically expressed in mature testes. Because the effective population size was inversely correlated with the overlap of PTI score distributions, slightly deleterious or beneficial mutations in germ cells of mature testes seem to generate high-PTI-score noncoding RNAs as candidates for new coding genes in the next generation. This idea is consistent with the hypothesis that new coding transcripts are derived from noncoding transcripts expressed in spermatocytes and spermatids in mature testes. In addition, we found that human noncoding transcripts with high PTI scores tended to be involved in transcriptional regulation, and the target gene of MYCN was significantly enriched as the original gene. A recent study showed that binding sites for transcription factors, including MYCN, are mutational hotspots in human spermatogonia (Kaiser et al. 2021). Therefore, the PTI score offers an opportunity to integrate the concept of gene birth with classical molecular evolutionary theory, thereby contributing to our understanding of evolution.

    Kaiser VB et al. 2021. Mutational bias in spermatogonia impacts the anatomy of regulatory sites in the human genome. Genome Res. Epub ahead of print. PMID: 34417209. https://doi.org/10.1101/gr.275407.121

  2. Note: This preprint has been reviewed by subject experts for Review Commons. Content has not been altered except for formatting.


    Referee #2

    Evidence, reproducibility and clarity

    In the manuscript "Potentially translated sequences determine protein-coding potential of RNAs in cellular organisms", Suenaga and colleagues analyze the available transcriptomes from 100 prokaryotes and eukaryotes, as well as >100 viruses, to understand whether transcripts tend to be translated or not. They develop a potentially translated island (PTI) score that combines the number and length of open reading frames in a transcript. From there they develop a protein-coding potential score that combines the PTI score with database information on coding and non-coding transcripts in various organisms and that in some sense predicts whether a transcript would fall in the coding or non-coding category. The main takeaway appears to be that in prokaryotes PTI scores and protein-coding potential strongly differentiate coding and non-coding transcripts, while in eukaryotes these differences appear to be more fluid. The manuscript presents an interesting bioinformatic analysis of coding properties across the phylogenetic field and may represent an interesting resource. The audience for the manuscript at the moment may be more limited to scientists trained and working in the field of bioinformatics, but with some integration of transcriptomics and ribosome profiling data, as well as an effort to make the results accessible to scientists not trained in bioinformatics, this manuscript may be relevant and of interest to researchers working on the biology of long non-coding RNAs and translation in general.

    Major comments:

    • Some validation of their predictions of coding potential would be good to add. There are plenty of ribosome profiling experiments available for some of the studied organisms (human, mouse, E. coli) that could be used to show that some of the non-coding RNAs are indeed misclassified and have ribosome density across the predicted open reading frames.
    • The manuscript is at times difficult to follow, and the implications of its statements may not be immediately clear to readers, particularly those without formal training in bioinformatic methods; this applies even to the abstract. For example: "The relationship between the PTI score and protein-coding potential was sigmoidal in most eukaryotes; however, it was linear passing through the origin in three distinct eutherian lineages, including humans". Here it is not clear what this means (without reading the paper), and even after reading the paper the importance of the sigmoidal vs. linear relationship between PTI score and protein-coding potential is unclear. I would encourage the authors to double-check that they provide a clear interpretation of their results, with readers unschooled in statistics in mind.
    • For the definition of PTI and protein-coding potential the authors refer to the Materials and Methods. I would encourage the authors to explain in plain terms in the results section 1) how they decided on this particular formalization and 2) what it means.
    • The definition of protein-coding potential appears to depend on the database classification of a transcript as either coding or non-coding. Particularly for organisms with complex transcriptomes, databases may not contain the proper information - what are the implications for their protein-coding potential score?
    • The authors completely ignore plants - would it make sense to expand their analysis to this branch of the tree of life?

    Significance

    The manuscript presents an elegant way to predict protein-coding and non-coding RNAs, which may be very relevant to the study of organisms with complex transcriptomes.

    The audience for the manuscript at the moment may be more limited to scientists trained and working in the field of bioinformatics, but with some integration of transcriptomics and ribosome profiling data, as well as an effort to make the results accessible to scientists not trained in bioinformatics, this manuscript may be relevant and of interest to researchers working on the biology of long non-coding RNAs and translation in general.

    My expertise: systems biology of RNA binding proteins, transcriptomics, RNA biology.

    Referee Cross-commenting

    I fully agree with my co-reviewer regarding additional analyses to strengthen the manuscript.

  3. Note: This preprint has been reviewed by subject experts for Review Commons. Content has not been altered except for formatting.


    Referee #1

    Evidence, reproducibility and clarity

    Summary

    The manuscript submitted by Suenaga and co-authors presents a method to evaluate the protein-coding potential of transcripts. This method is based on an index that they name the PTI (potentially translated island) score, which represents, for each transcript, the ratio between the length of the largest predicted ORF and the sum of all the predicted ORF lengths. The authors compare PTI score distributions between transcripts classified as protein-coding and as non-coding in public nucleotide databases, for a wide range of species, including bacteria, archaea, eukaryotes and viruses. They derive from this comparison a measure of the protein-coding potential of transcripts. To validate this approach, the authors evaluated the distributions of Ka/Ks values for transcripts annotated as coding or non-coding, in various classes of PTI-based protein-coding potential. The main finding of the manuscript stems from the comparison among species: the authors find that bacteria and archaea have narrow, non-overlapping PTI distributions for coding and non-coding transcripts, while eukaryotes have broader and more overlapping PTI distributions.
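
To make the ratio described above concrete, the following is a minimal Python sketch, not the authors' implementation. It rests on several assumptions that the text here does not confirm: only the three forward reading frames are scanned, a PTI starts at the first ATG after the previous in-frame stop codon, it ends at the next in-frame stop codon (TAA, TAG or TGA), and the stop codon is counted in the length. The function names `find_pti_lengths` and `pti_score` are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code


def find_pti_lengths(seq):
    """Return the lengths (in nt) of all PTIs in the three forward frames.

    Assumption: a PTI runs from the first ATG after the previous in-frame
    stop codon to the next in-frame stop codon, stop codon included.
    """
    seq = seq.upper().replace("U", "T")  # accept RNA or DNA input
    lengths = []
    for frame in range(3):
        start = None  # index of the currently open start codon, if any
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                lengths.append(i + 3 - start)
                start = None
    return lengths


def pti_score(seq):
    """Length of the longest PTI divided by the summed length of all PTIs."""
    lengths = find_pti_lengths(seq)
    if not lengths:
        return 0.0  # assumption: transcripts without any PTI score zero
    return max(lengths) / sum(lengths)


if __name__ == "__main__":
    # One 9-nt PTI in frame 0 and one 6-nt PTI in frame 1.
    print(pti_score("ATGAAATAACATGTAA"))
```

Under these assumptions a transcript with a single PTI scores 1, and the score falls as additional PTIs take up a larger share of the total PTI length.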

    Major comments

    • The authors provide no rationale for using the PTI score to measure the protein-coding potential of transcripts. The only attempt to justify this measure is given in the methods: "The definition of PTI score is motivated by our hypothetical concept that translation of pPTI is limited by alternate competing sPTIs." (lines 426-427, page 20). What the PTI score measures is the dominance of the largest predicted ORF over the predicted ORFs, in terms of length. It is not clear why there would be competition for translation of putative ORFs for genuine protein-coding transcripts. An alternative hypothesis, briefly touched upon in the discussion (lines 318-320) is that translation of non-functional ORFs could give rise to the production of toxic proteins, in addition to being costly in terms of energy. The authors should provide the reasoning behind the PTI score and should explain the biological mechanisms that may underlie differences between coding and non-coding transcripts.
    • The presence of ORFs in transcripts has long been used as a predictor of their protein-coding potential. For example, the ORF size and the ORF coverage are part of the set of predictors implemented in CPAT (Wang et al., 2013). The PTI score is necessarily related to these methods, yet no comparison is provided. If the PTI score is to be used as a measure to classify transcripts as coding or non-coding, its performance should be compared to other classifiers, including those that use the presence of ORFs as a predictor (e.g., CPAT) but not only (e.g., PhyloCSF, based on the pattern of sequence evolution).
    • The authors compare the observed PTI score distributions with the PTI scores from random or shuffled sequences. They conclude that the PTI scores do not depend on transcript lengths but on transcript sequences (lines 122-123). However, this is not true for non-coding RNAs, for which the observed and randomized distributions are very similar. The relationship between transcript length and PTI scores should be analyzed in more detail. Are the annotated non-coding transcripts with high PTI scores particular in terms of length?
    • The authors discuss in depth the correlation between PTI scores and PTI-based protein-coding potential measures (e.g., section "PTI scores correlate with protein-coding potential in humans and mice", starting line 125; section "Relationship between the PTI score and protein-coding potential", starting line 243). Given that the protein-coding potential is directly derived from the PTI score distributions for coding and non-coding transcripts, it is not surprising that the two should be correlated. The significance of observing a linear or a sigmoid relationship is not clearly explained.
    • The authors use the entire set of annotated coding and non-coding transcripts to assess the distribution of PTI scores and to define the protein-coding potential. Traditionally, for methods that aim to classify transcripts as coding or non-coding, this is done using "bona fide" coding and non-coding transcripts, which are used as training sets. The efficiency of the method can then be evaluated using a test set of transcripts. This aspect is lacking here and should be implemented.
    • The comparisons among species are likely biased by the quality of lncRNA annotations in non-model organisms - cf. high variations among primates, which are likely driven by the annotation quality and depth.
    • The differences among bacteria, archaea and eukaryotes should be discussed in more depth. In bacteria, the genuine ORF is well defined by the presence of translation signals (e.g., Shine-Dalgarno sequence). Other factors are also at work in both prokaryotes and eukaryotes, including RNA secondary structures. The relationship between these factors and the PTI score should be discussed.
    • From an evolutionary perspective, the effective population size (Ne) is also likely related to the "quality" of the ORFs. An analysis of Ne vs. the PTI score distributions would be an interesting addition to this manuscript.
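
The shuffled-sequence comparison raised in the comments above can be sketched as follows. This is an illustrative Python sketch only: the exact randomization procedure used in the manuscript is not specified here, so a simple mononucleotide shuffle (which preserves transcript length and base composition) is assumed. The PTI rules are likewise assumptions (forward frames only, ATG start, TAA/TAG/TGA stop, stop codon counted), and the names `pti_score`, `shuffled_pti_scores` and `empirical_fraction_at_least` are hypothetical.

```python
import random

STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code


def pti_score(seq):
    """Longest PTI length over total PTI length (three forward frames)."""
    seq = seq.upper().replace("U", "T")
    lengths = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                lengths.append(i + 3 - start)
                start = None
    return max(lengths) / sum(lengths) if lengths else 0.0


def shuffled_pti_scores(seq, n_shuffles=100, seed=0):
    """PTI scores of mononucleotide-shuffled copies of seq.

    Shuffling keeps length and base composition fixed, so any difference
    from the observed score reflects the order of the nucleotides.
    """
    rng = random.Random(seed)  # seeded for reproducibility
    bases = list(seq.upper().replace("U", "T"))
    scores = []
    for _ in range(n_shuffles):
        rng.shuffle(bases)
        scores.append(pti_score("".join(bases)))
    return scores


def empirical_fraction_at_least(observed, null_scores):
    """Fraction of shuffled scores that are >= the observed score."""
    return sum(s >= observed for s in null_scores) / len(null_scores)
```

Running this separately for length-binned coding and non-coding transcripts would be one way to examine whether high-scoring non-coding transcripts differ from their shuffled counterparts at a given length.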

    Significance

    This manuscript is lacking in novelty and is not well positioned in the field. If the aim of this work is to provide a method to classify transcripts as coding or non-coding, the authors should provide detailed comparisons with existing methods (see above). If the aim is to understand what defines a genuine protein-coding transcript, then the biological mechanisms should be better described and the comparisons among species and among functional categories of genes should be further developed. The idea of using the "dominance" of the largest ORF compared to the other predicted ORFs is interesting, and provides a new element compared to existing methods that rely exclusively on ORF length and ORF coverage. I would recommend that the authors develop this idea further and discuss the advantages of using the ORF dominance compared to just the ORF length or coverage.

    Referee Cross-commenting

    I fully agree with Reviewer 2's remarks. In particular, adding ribosome profiling analyses is an excellent idea and could substantially improve the manuscript.