Reference protein-coding transcripts of human genes annotated using long-read transcriptome datasets
Abstract
Accumulating NGS expression datasets suggest that protein-coding genes produce numerous alternatively spliced transcripts. However, this number may be overestimated in short-read sequencing data, which often cannot resolve distinct spliced isoforms accurately and therefore introduce ambiguity. Resolving tissue-specific expression profiles is crucial for identifying bona fide translated peptide products. In this study, we used long-read NGS datasets to identify the most highly expressed protein-coding transcripts and thereby better understand the biochemical and biological functions of human protein-coding genes. Using nanopore sequencing data from 30 normal human tissues in the GSE192955 dataset, we identified 18,094 dominantly expressed representative protein-coding transcripts (Ref-Tx) from 18,557 human genes. Comparison with MANE-select transcripts revealed that 14,546 Ref-Tx matched those in the MANE-select dataset. This result indicates improved agreement between long-read transcriptome data and MANE-select transcripts. A higher proportion of Rank1 transcripts were identified as Ref-Tx in the long-read dataset. Similar patterns were observed when Ref-Tx were compared with functional APPRIS annotations. Given the importance of tissue-specific expression profiles for protein-coding transcripts, we developed a bioinformatic expression visualization tool (eCPG). This web tool integrates extensive expression information from 30 normal human tissues as well as from the GTEx project and is designed to interrogate the dominant protein-coding transcripts.
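
As a rough illustration of the two steps summarized in the abstract (choosing one dominantly expressed transcript per gene, then checking agreement with a reference set such as MANE-select), the following minimal Python sketch may help. It is not the authors' pipeline. The abstract does not state how dominance was defined, so the rule used here (highest expression summed across tissues) is an assumption, and the gene and transcript identifiers, the function names `dominant_transcripts` and `concordance`, and the toy values are all hypothetical. The final lines only restate the counts reported in the abstract as simple ratios.

```python
from collections import defaultdict


def dominant_transcripts(expression):
    """Pick one transcript per gene.

    expression maps (gene, transcript) -> list of per-tissue expression values.
    Assumption: "dominant" means the highest expression summed over tissues.
    Ties are broken by choosing the lexicographically smallest transcript ID.
    """
    totals = defaultdict(dict)
    for (gene, transcript), values in expression.items():
        totals[gene][transcript] = sum(values)
    return {
        gene: max(sorted(per_tx), key=per_tx.get)
        for gene, per_tx in totals.items()
    }


def concordance(ref_tx, reference_set):
    """Return (matches, compared) over genes present in both mappings."""
    shared = [gene for gene in ref_tx if gene in reference_set]
    matches = sum(ref_tx[gene] == reference_set[gene] for gene in shared)
    return matches, len(shared)


# Hypothetical toy data: two genes, three tissues.
expression = {
    ("GENE_A", "TX_A1"): [5.0, 1.0, 0.0],
    ("GENE_A", "TX_A2"): [2.0, 9.0, 4.0],
    ("GENE_B", "TX_B1"): [3.0, 3.0, 3.0],
}
# Hypothetical stand-in for a MANE-select style gene -> transcript mapping.
mane_like = {"GENE_A": "TX_A1", "GENE_B": "TX_B1"}

ref_tx = dominant_transcripts(expression)
matches, compared = concordance(ref_tx, mane_like)

# Ratios computed directly from the counts reported in the abstract.
pct_genes_with_ref_tx = round(18094 / 18557 * 100, 1)
pct_ref_tx_matching_mane = round(14546 / 18094 * 100, 1)

print(ref_tx, matches, compared)
print(pct_genes_with_ref_tx, pct_ref_tx_matching_mane)
```

On the reported counts, 18,094 of 18,557 genes is about 97.5%, and 14,546 of 18,094 Ref-Tx is about 80.4%. The abstract does not say whether every gene with a Ref-Tx also has a MANE-select entry, so the second figure is a plain ratio of the two reported numbers rather than a published concordance rate.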