Unsupervised reference-free inference reveals unrecognized regulated transcriptomic complexity in human single cells


Abstract

Myriad mechanisms diversify the sequence content of eukaryotic transcripts at both the DNA and RNA levels, leading to profound functional consequences. Examples of this diversity include RNA splicing and V(D)J recombination. Currently, these mechanisms are detected using fragmented bioinformatic tools that require predefining a form of transcript diversification and rely on alignment to an incomplete reference genome, which filters out unaligned sequences that may be crucial for novel discoveries. Here, we develop SPLASH+, a new analytic method that performs unified, reference-free statistical inference directly on raw sequencing reads. By integrating a micro-assembly and biological interpretation framework with the recently developed SPLASH algorithm, SPLASH+ discovers broad and novel examples of transcript diversification in single cells de novo, without the need for genome alignment or cell type metadata, which is impossible with current algorithms. Applied to 10,326 primary human single cells across 19 tissues profiled with SmartSeq2, SPLASH+ discovers a set of splicing and histone regulators with highly conserved intronic regions that are themselves targets of complex splicing regulation. Additionally, it reveals unreported transcript diversity in the heat shock protein HSP90AA1, as well as diversification in centromeric RNA expression, V(D)J recombination, RNA editing, and repeat expansion, all missed by existing methods. SPLASH+ is unbiased and highly efficient, enabling the discovery of an unprecedented breadth of RNA regulation and diversification in single cells through a new paradigm of transcriptomic analysis.

Article activity feed

  1. Some detected rRNA could represent contamination or microbiome composition, as has also been reported by a recent microbial analysis of human single cells (Mahmoudabadi, Tabula Sapiens Consortium, and Quake, n.d.) (Supplement).

    I have only worked with two single-cell data sets, so please forgive me if my comment is naive. Given that you suggest NOMAD+ works on raw single-cell reads, and given the proclivity of raw sequencing data to contain microbial contamination, do you think it would be beneficial to recommend that users screen their single-cell data for microbial contamination before running NOMAD+? One rapid way to do this would be with the sourmash gather algorithm (https://www.biorxiv.org/content/10.1101/2022.01.11.475838v2). Again, I'm not sure how contamination in single-cell data compares to that in bulk RNA-seq, but this is an interesting manuscript that profiles what is in the leftover fractions of transcriptomes that don't map to the human genome: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-018-1403-7
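
    To make the screening idea concrete, here is a minimal Python sketch of a k-mer containment check. It is not sourmash and not part of NOMAD+: sourmash gather works on subsampled FracMinHash sketches and resolves matches against many reference genomes, whereas this toy compares exact k-mer sets against a single sequence. The names `canonical_kmers` and `containment`, the value of `k`, and the example sequences are all assumptions made up for illustration.

```python
# Toy contamination screen: what fraction of the distinct k-mers in a set of
# reads also occurs in a candidate contaminant sequence?
# Illustrative only; names, k, and sequences are hypothetical.

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical_kmers(seq, k):
    """Return the set of canonical k-mers of seq.

    A k-mer and its reverse complement are treated as the same k-mer by
    keeping whichever of the two sorts first.
    """
    seq = seq.upper()
    out = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        revcomp = kmer.translate(_COMPLEMENT)[::-1]
        out.add(min(kmer, revcomp))
    return out


def containment(reads, reference, k=5):
    """Fraction of distinct read k-mers that are also found in reference."""
    read_kmers = set()
    for read in reads:
        read_kmers |= canonical_kmers(read, k)
    if not read_kmers:
        return 0.0
    ref_kmers = canonical_kmers(reference, k)
    return len(read_kmers & ref_kmers) / len(read_kmers)


if __name__ == "__main__":
    # Made-up "microbial" reference and two made-up reads.
    reference = "ACGTACGTTAGCCGATAGGCTA"
    reads = [reference[2:12], "TTTTTTTTTT"]
    print(f"containment: {containment(reads, reference):.3f}")
```

    A high containment value for a sample against a microbial sequence would flag that sample for a closer look before any reference-free analysis. Real data would call for a larger k (sourmash commonly uses k=31) and a sketching approach to keep memory manageable.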
