Structure-guided isoform identification for the human transcriptome

Curation statements for this article:
  • Curated by eLife

    eLife assessment

    This study applies AlphaFold to the CHESS selection of transcripts with the goal of generating predicted 3D protein structures and a quality measure of folding, the pLDDT score. From these data, the authors build a database for exploring the results, illustrated by several examples, including proteins for which the authors propose the pLDDT score as a measure of presumed superior biological functionality over other isoforms. These results will be highly relevant for anyone working with proteins that occur in different isoforms.

Abstract

Recently developed methods to predict three-dimensional protein structure with high accuracy have opened new avenues for genome and proteome research. We explore a new hypothesis in genome annotation, namely whether computationally predicted structures can help to identify which of multiple possible gene isoforms represents a functional protein product. Guided by protein structure predictions, we evaluated over 230,000 isoforms of human protein-coding genes assembled from over 10,000 RNA sequencing experiments across many human tissues. From this set of assembled transcripts, we identified hundreds of isoforms with more confidently predicted structure and potentially superior function in comparison to canonical isoforms in the latest human gene database. We illustrate our new method with examples where structure provides a guide to function in combination with expression and evolutionary evidence. Additionally, we provide the complete set of structures as a resource to better understand the function of human genes and their isoforms. These results demonstrate the promise of protein structure prediction as a genome annotation tool, allowing us to refine even the most highly curated catalog of human proteins. More generally, we demonstrate a practical, structure-guided approach that can be used to enhance the annotation of any genome.

Article activity feed

  1. eLife assessment

    This study applies AlphaFold to the CHESS selection of transcripts with the goal of generating predicted 3D protein structures and a quality measure of folding, the pLDDT score. From these data, the authors build a database for exploring the results, illustrated by several examples, including proteins for which the authors propose the pLDDT score as a measure of presumed superior biological functionality over other isoforms. These results will be highly relevant for anyone working with proteins that occur in different isoforms.

  2. Reviewer #1 (Public Review):

    The sequencing of a genome is the first step in identifying the functional regions of that genome. Identifying the regions that encode protein sequences (protein-coding genes) is complicated by the transcription of DNA into multiple versions of RNA (isoforms) from the same genomic locus. Often these RNA isoforms have different start and stop positions in the genome and use different sequences (exons) for the protein-coding process. Taking advantage of considerable improvements in a recently developed computer algorithm that predicts the most stable three-dimensional (3D) folding of protein sequences (AlphaFold2), Sommer et al. describe a strategy that uses this information to evaluate the multiple isoforms generated by each gene. This approach adds to the evidence from sequence conservation, synteny and co-regulated genes, and can potentially rank isoforms to aid in annotating the protein-coding human transcriptome. This capability is needed to determine the boundaries and exon sequences of genes, their evolutionary relationships to ancestral homologues, their function and the structural regions responsible for disease.

    A troubling issue with this approach is pointed out by the authors themselves, namely that many functional genes express isoforms that make proteins with poor predicted Local Distance Difference Test (pLDDT) scores. Thus, the predicted 3D structures of the proteins arising from different isoforms cannot be the only criterion used to identify the gene structure encoded in a locus. However, an isoform encoding a protein with a high pLDDT (estimated to be >80/100) is likely to help define at least a conservative set of boundaries and structures for the annotation of a gene. It would have been useful to have some overall estimate of the false positive and false negative rates of this strategy. Without this information, this approach, while useful, could be considered an incremental improvement in the annotation process.
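
    The estimate requested above could be computed along the following lines. This is only a minimal sketch, assuming a benchmark set of isoforms with independently established functional status is available; the toy values, the function name `error_rates` and the use of a single mean pLDDT per isoform are illustrative assumptions, not part of the study.

```python
def error_rates(labeled, cutoff=80.0):
    """Estimate false positive and false negative rates of a pLDDT cutoff.

    `labeled` is a list of (mean_plddt, is_functional) pairs; an isoform is
    called "functional" by the classifier when mean_plddt >= cutoff.
    """
    false_pos = sum(1 for score, functional in labeled
                    if score >= cutoff and not functional)
    false_neg = sum(1 for score, functional in labeled
                    if score < cutoff and functional)
    negatives = sum(1 for _, functional in labeled if not functional)
    positives = sum(1 for _, functional in labeled if functional)
    # Guard against empty classes so the function never divides by zero.
    fpr = false_pos / negatives if negatives else float("nan")
    fnr = false_neg / positives if positives else float("nan")
    return fpr, fnr


# Hypothetical benchmark: (mean pLDDT, known to be functional?). Made-up values.
toy_benchmark = [
    (92.0, True), (85.0, False), (60.0, True),
    (45.0, False), (88.0, True), (70.0, False),
]
fpr, fnr = error_rates(toy_benchmark)
print(f"false positive rate: {fpr:.2f}, false negative rate: {fnr:.2f}")
```

    Sweeping `cutoff` over a range of values would show how the two error rates trade off against each other, which is the information needed to judge whether 80 is a sensible threshold.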

  3. Reviewer #2 (Public Review):

    The study by Sommer et al. applies AlphaFold to the CHESS selection of transcripts with the goal of generating predicted 3D protein structures and a quality measure of folding, the pLDDT score. From these data, the authors build a database for exploring the results. In addition, they provide examples to illustrate this approach. These include proteins for which the authors propose the pLDDT score as a measure of presumed superior biological functionality over other isoforms. The authors also use the generated data to propose novel functionally relevant isoforms, e.g. in the mouse.

    The study is based on the elegant idea of aiding genome annotation through 3D structure prediction. This is a very powerful approach that allows large-scale data generation for functional interpretation. The approach appears technically sound and well executed (although, not being a protein expert, I may be missing details). However, in my opinion, the authors could make more use of the potential of their approach. From the big-data start, they seem to restrict themselves directly to interesting examples. I am missing a global analysis that shows the bigger picture of their results. Given that they have generated structures from 90,415 isoforms, each associated with a pLDDT score, conservation scores, length, expression levels and other quantifiable data listed on page 18, I would wish for a comprehensive analysis of these data and their potential before the focus narrows to a few (admittedly very nice) examples.
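
    As one small ingredient of such a global analysis, the distribution of per-isoform scores could be summarized across the whole dataset. The sketch below is hedged: it assumes one mean pLDDT value per isoform, and the band boundaries (90, 70, 50) follow the confidence bands commonly used when displaying AlphaFold models rather than anything specified in the manuscript. The scores are made up.

```python
from collections import Counter


def plddt_band(score):
    """Map a mean pLDDT (0-100) to a coarse confidence band (assumed cutoffs)."""
    if score >= 90:
        return "very high"
    if score >= 70:
        return "confident"
    if score >= 50:
        return "low"
    return "very low"


def band_counts(mean_scores):
    """Count how many isoforms fall into each confidence band."""
    return Counter(plddt_band(score) for score in mean_scores)


# Hypothetical mean pLDDT values for six isoforms (made-up numbers).
toy_scores = [95.2, 88.1, 72.4, 65.0, 43.3, 91.0]
for band, count in band_counts(toy_scores).most_common():
    print(band, count)
```

    The same grouping could be repeated after stratifying by length, conservation or expression level to see whether the high-scoring isoforms are concentrated in particular classes.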

    Furthermore, one of the weak spots of such an analysis is the relationship between foldability and functional relevance. Because disordered regions receive poor pLDDT scores, they would imply reduced relevance, which may be a misleading conclusion. While this problem may be difficult to solve with their approach, I think it still needs to be addressed and discussed throughout the paper, particularly as part of the global analysis and not just in the context of examples.
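
    One possible way to make the influence of disorder visible, sketched here under explicit assumptions, is to report the fraction of low-confidence residues alongside a mean computed only over the remaining residues. Treating residues with pLDDT below 50 as potentially disordered is a common rule of thumb, not a criterion taken from the manuscript, and the function name and values are hypothetical.

```python
def ordered_mean_plddt(per_residue, disorder_cutoff=50.0):
    """Return (mean pLDDT over residues >= cutoff, fraction of residues below it).

    The first element is None when every residue falls below the cutoff.
    """
    if not per_residue:
        raise ValueError("per_residue must contain at least one score")
    ordered = [score for score in per_residue if score >= disorder_cutoff]
    frac_low = (len(per_residue) - len(ordered)) / len(per_residue)
    ordered_mean = sum(ordered) / len(ordered) if ordered else None
    return ordered_mean, frac_low


# Hypothetical isoform: a well-predicted domain followed by a low-confidence tail.
toy_isoform = [92.0, 88.0, 90.0, 30.0, 20.0]
ordered_mean, frac_low = ordered_mean_plddt(toy_isoform)
print(ordered_mean, frac_low)
```

    An isoform with a high ordered mean but a large low-confidence fraction would then be flagged for discussion rather than simply ranked below its alternatives.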

    As a minor point, I would encourage the authors to be more explicit with some quantifications. For example, when focusing on proteins < 500 aa long, what is left out of their analysis as a result? How many isoforms will they miss? Will this introduce a bias (e.g. against scaffolding proteins, kinases like ATM, etc.)?
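
    The quantification asked for here is simple to produce once protein lengths are tabulated. The following is a minimal sketch, assuming a mapping from isoform identifier to protein length in amino acids; the identifiers and lengths are invented for illustration.

```python
def length_filter_summary(lengths, max_len=500):
    """Split isoforms by a strict length cutoff and report the excluded fraction.

    `lengths` maps isoform ID -> protein length (aa); isoforms with
    length < max_len are kept, all others are excluded.
    """
    kept = sorted(iso for iso, n in lengths.items() if n < max_len)
    excluded = sorted(iso for iso, n in lengths.items() if n >= max_len)
    frac_excluded = len(excluded) / len(lengths) if lengths else 0.0
    return kept, excluded, frac_excluded


# Hypothetical isoforms with made-up lengths.
toy_lengths = {"iso_a": 120, "iso_b": 499, "iso_c": 500, "iso_d": 3056}
kept, excluded, frac_excluded = length_filter_summary(toy_lengths)
print(f"kept {len(kept)}, excluded {len(excluded)} ({frac_excluded:.0%})")
```

    Cross-tabulating the excluded identifiers against gene families or functional annotations would reveal whether particular classes, such as large kinases or scaffolding proteins, are systematically missing.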

    Overall, I consider the idea of the paper very elegant and well executed, yet the paper focuses too much on the trees, while I, as a reader, would like to know more about the forest.