Characterization of Alternative Splicing During Mammalian Brain Development Reveals the Magnitude of Isoform Diversity and its Effects on Protein Conformational Changes

Abstract

Regulation of gene expression is critical for fate commitment of stem and progenitor cells during tissue formation. In the context of mammalian brain development, a plethora of studies have described how changes in the expression of individual genes characterize cell types across ontogeny and phylogeny. However, little attention was paid to the fact that different transcripts can arise from any given gene through alternative splicing (AS). Although AS is considered a key mechanism expanding transcriptome diversity during evolution, assessing its full potential on isoform diversity and protein function has been notoriously difficult. Here we capitalize on the use of a validated reporter mouse line to isolate neural stem cells, neurogenic progenitors and neurons during corticogenesis and combine the use of short- and long-read sequencing to reconstruct the full transcriptome diversity characterizing neurogenic commitment. Extending available transcriptional profiles of the mammalian brain by nearly 50,000 new isoforms, we found that neurogenic commitment is characterized by a progressive increase in exon inclusion resulting in the profound remodeling of the transcriptional profile of specific cortical cell types. Most importantly, we computationally infer the biological significance of AS on protein structure by using AlphaFold2 and revealing how radical protein conformational changes can arise from subtle changes in isoform sequence. Together, our study reveals that AS has a greater potential to impact protein diversity and function than previously thought independently from changes in gene expression.

Article activity feed

  1. Note: This response was posted by the corresponding author to Review Commons. The content has not been altered except for formatting.

    Learn more at Review Commons


    Reply to the reviewers

    'The authors do not wish to provide a response at this time.'

  2. Note: This preprint has been reviewed by subject experts for Review Commons. Content has not been altered except for formatting.


    Referee #4

    Evidence, reproducibility and clarity

    Abdullah Alieh and colleagues generate comprehensive transcriptome annotations in FACS-sorted murine cortical neural stem cells, precursor cells and neurons by combining (existing) short-read RNA-seq (SRS) data with long-read sequencing (LRS) data. They identify around 50,000 novel transcripts and show that they are enriched in neural functions and have a strong tendency of increasing inclusion of differential splicing events during differentiation. Several examples are validated by PCR. They show, by means of AlphaFold2 prediction of protein structure, that many splice isoforms likely cause either overall structural differences or switches in secondary structure.

    Major points:

    1. The authors generate data using a previously characterized mouse model. However, they need to reconfirm expression of markers for the three cell types they analyse, particularly since one is identified by lack of expression of fluorescent tags.
    2. Validation is only performed by RT-PCR on 11 novel splicing events and not at all on novel TSS and termination sites. It would greatly benefit the reliability of novel isoforms if the authors could compare them with those detected previously by LRS in neural cells, or overlay novel TSS with data such as CAGE or 3'-end sequencing.
    3. Are divergent structural regions between isoforms often within regions of low model confidence? This would impact the relevance of the discovered changes.
    4. In the Discussion, the authors assert that '...AS alone was revealed to have a much greater impact in remodeling the transcriptome [...] than previously thought and independently from changes in gene expression.' However, this latter aspect is not demonstrated. To what extent does apparent change in AS derive from differential expression of isoforms from alternative TSS?
    5. The statement in the Discussion that 'Our study supports this notion [that differential inclusion of disordered segments can affect protein-protein interaction] with a significant increase in disordered isoforms arising concomitantly with neurogenic commitment' is not supported by the results presented. The authors only show that alternatively spliced proteins in their dataset have a higher propensity for disordered regions than the proteome at large, which is not a new observation.
    6. The statement in the Discussion that structural changes ostensibly caused by alternative splicing were 'similarly the case both when the structural change occurred within the AS event as well, more remarkably, when the event was far away' is not supported by the results as presented.
    7. Supplementary material is mentioned but not included with the manuscript.
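
Major point 3 could be examined directly from the AlphaFold2 output, because AlphaFold writes its per-residue confidence score (pLDDT, 0-100) into the B-factor column of the PDB files it produces. The sketch below is a minimal standard-library illustration and not the authors' pipeline: the function names, the four-residue synthetic model and the residue range are hypothetical, and the cutoff of 70 is the conventional boundary below which AlphaFold predictions are labelled low confidence.

```python
# Minimal sketch: mean pLDDT over a residue range of an AlphaFold2 PDB model.
# Assumption: pLDDT sits in the B-factor field (columns 61-66) of ATOM records.

LOW_CONFIDENCE = 70.0  # conventional pLDDT cutoff; an assumption, adjust as needed


def plddt_by_residue(pdb_text):
    """Return {residue_number: pLDDT}, read from the C-alpha atom of each residue."""
    scores = {}
    for line in pdb_text.splitlines():
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            scores[int(line[22:26])] = float(line[60:66])
    return scores


def mean_plddt(scores, start, end):
    """Mean pLDDT over residues start..end (inclusive); None if none are present."""
    values = [s for res, s in scores.items() if start <= res <= end]
    return sum(values) / len(values) if values else None


def ca_line(resnum, plddt):
    """Build a synthetic fixed-width C-alpha ATOM record (hypothetical demo data)."""
    return (f"ATOM  {resnum:5d}  CA  ALA A{resnum:4d}    "
            f"{0.0:8.3f}{0.0:8.3f}{0.0:8.3f}{1.0:6.2f}{plddt:6.2f}")


if __name__ == "__main__":
    # Synthetic model: residues 3-4 stand in for a divergent region between isoforms.
    model = "\n".join(ca_line(i, p) for i, p in [(1, 90.0), (2, 80.0), (3, 40.0), (4, 50.0)])
    scores = plddt_by_residue(model)
    region = mean_plddt(scores, 3, 4)
    print(f"mean pLDDT of region: {region}, low confidence: {region < LOW_CONFIDENCE}")
```

Running the same two functions over the divergent region of each isoform pair would show how many of the reported structural differences fall in segments that the model itself scores as unreliable.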

    Minor points:

    1. Fig. 1A: Why are there two numbers for transcripts (70,658, 71,760) in the overlap of pipelines 1 and 2?
    2. Fig. 2F: Statements that events either low in NSC and rising, or high in NSC and declining, represent the 'least represented' isoform in NSC or N, respectively, do not seem to take into account that there may be other transcript isoforms for which inclusion of the event in question stays constant (e.g., skipped). The authors could make use of their LRS to confirm that at least for selected events.
    3. p8: How many unique new transcription start and end sites were identified?
    4. Fig. 2C: were categories selected for display (and if so, how), or are these all the categories identified?
    5. Fig. 2F-H: How many of the detected AS events, including neural microexons, are novel?
    6. Was the propensity to elicit nonsense-mediated decay taken into account when AS events were mapped to transcripts that did not contain them?
    7. How did 212 genes selected for modeling in Fig. 3 correspond to 987 isoforms? When genes comprised more than two isoforms, how were the changes in quantified properties attributed to the splicing events for which they were selected vs other isoforms or alternative translation start and stop sites?
    8. Fig. 3D: Coloring the structures by chain would make this figure easier to interpret.
    9. Details of Alphafold modeling are not provided.
    10. The authors should acknowledge that integrating SRS and LRS is a standard approach to generating annotations in organisms for which no reliable annotation exists, as well as approaches aimed at doing so to improve annotations in mammals, such as PMID: 37779246, 35468141, 32461551 etc.

    Significance

    While a combination of SRS and LRS sequencing along stages of neuronal differentiation has not been used in the same way to identify novel transcript isoforms, substantial work has been done employing LRS in neural contexts, including in single cells (e.g., work from the Tilgner, Waldmann lab).

    Although it is not entirely clear from the results presented how many of the detected AS events are novel, as opposed to transcript isoforms, their characteristics are similar to previously known neural-differential events, thus supporting their veracity. The main advance in this manuscript lies in the insights derived from structural modeling of splice isoforms, which supports the potential relevance of many splicing events. This is a question relevant for both fundamental research and clinical audiences. However, several of the authors' claims are not well supported, or else are not novel (see major points).

    This reviewer's expertise lies in the field of molecular biology of alternative splicing; they have experience with RNA-seq and structural modeling of splice variants.

  3. Note: This preprint has been reviewed by subject experts for Review Commons. Content has not been altered except for formatting.


    Referee #3

    Evidence, reproducibility and clarity

    Summary:

    Haj Abdullah Alieh et al. describe re-analysis of an existing short read RNA-Seq dataset consisting of 3 replicates of 3 FACS-sorted cell populations of the E14.5 Btg2::RFP/Tubb3::GFP mouse cortex: neural stem cells (NSC; RFP-/GFP-), neural precursors (NP; RFP+/GFP-) and neurons (N; GFP+), for the purpose of investigating alternative splicing isoform switching during neuronal cell-type specification. They generate a one replicate PacBio dataset of these same sorted cells, with the aim of identifying full-length transcript isoforms, which are difficult to discern with short-read data alone. The key conclusions are the discovery of ~50,000 novel transcript isoforms containing ~2,500 novel splice junctions; the discovery of isoform switches between NSC -> neuron that contain a high proportion of microexon inclusion events; and the finding that many of these switches are predicted by Alphafold2 to have a structural impact.

    The data is interesting and the bioinformatics approach of investigating potential impacts of splice variants on protein structure using Alphafold2 is also interesting, however at present the paper would be better presented as a resource, unless effort is undertaken to experimentally validate some potential biological findings. However, for the paper to be useful as a resource, links to newly generated data and analysis code need to be provided. The capacity for exploration of these newly identified splice isoforms, or further analysis using the new GTF, could then be one of the attractions of this work.

    Major comments:

    • Are the key conclusions convincing?
    • Should the authors qualify some of their claims as preliminary or speculative, or remove them altogether?
    • Would additional experiments be essential to support the claims of the paper? Request additional experiments only where necessary for the paper as it is, and do not ask authors to open new lines of experimentation.

    Figure 1: The discovery of ~50,000 novel transcript isoforms containing ~2,500 novel splice junctions

    As far as I can see the description of novelty is based on them being not present in either Ensembl (GRCm38.p6), NCBI_RefSeq, or Gencode (vM10) - note here the numbers are genome assembly versions and do not refer to the GTF annotation versions compared against - these should be provided as they are frequently updated. The claim is that they are not present in these references because the unique cell samples have not been analysed before. For transcript isoforms to be included in these references they must have a good level of support. I have a couple of concerns about the support for these isoforms:

    The numbers in figure 1A do not add up. For long read sequencing two pipelines are used resulting in 76,077 and 80,782 isoforms - in the venn diagram 1A the overlapping circle has two numbers of isoforms in it: 70,658 and 71,760, so it is unclear: are 70,658 isoforms found by both pipelines or 71,760? Then we are told the union of these transcripts is taken forward to the next venn diagram. However this diagram is labelled with 82,046 transcript isoforms. Pipeline 1 has labelled 5,419 unique isoforms, pipeline 2 has 9,022 unique isoforms, so 5,419 + 9,022 + 70,658 (71,760) = 85,099 (86,201), not 82,046 - perhaps some extra filtering has occurred that should be labelled/described? Again the final number of transcripts at the end of everything is off - if the 82,046 transcripts from long read are combined with the 16,070 unique to the short read this equals 98,116, not 97,240.

    The authors decide to use long read sequencing to assemble the isoforms as short-read sequencing is unreliable for assembling full length isoforms - however for their final list they merge isoforms assembled by StringTie from short read data with the isoforms assembled from the PacBio long read data. It seems likely that the isoforms detected only by short-read StringTie assembly would be unreliable and shouldn't be included in the final total.

    The authors perform only one biological replicate of PacBio long read sequencing of three different samples, so it is not possible to easily determine the reproducibility of the findings. I appreciate PacBio is expensive; the authors could consider other ways to evaluate the reproducibility - perhaps by looking at the detection of transcripts expected to be uniformly expressed between the different conditions?

    The authors provide no quality information for their PacBio sequencing run - e.g. length distribution of reads, how many reads are left after quality filtering, quality across the length of reads, etc. I do not know if most isoforms reported are supported by 5 full length isoform reads, or if it is rare in the dataset to get full length isoform reads. Is the quality comparable across the three PacBio samples? How many of the novel isoforms are supported by both short read and long read data? How many of the novel isoforms are supported only by short reads? How many isoforms are found in all three PacBio samples? Does gene expression measured with the PacBio data match the previous results of measuring gene expression in the short read data? Adding these kinds of analyses would give more confidence in the results.

    This section of methods is confusing, I don't really understand what has been done or what part of the manuscript this refers to: "Events were assigned to an inclusion isoform if their coordinates overlapped, at least partially, with an exon or to an exclusion isoform if they were located within an intron. AS events without a corresponding inclusion or exclusion isoform were assigned to an Ensembl or NCBI_RefSeq isoform using the criteria above. Only AS events assigned to at least one inclusion and one exclusion isoform were considered for further analysis."

    VastDB is a splicing database created by Manuel Irimia/Ben Blencowe containing a lot of neural samples across development - how many of the 'novel' splice sites are present in VastDB? Similarly, how many of the 'novel' splice isoforms were previously detected by Zhang et al., 2016, Cell?
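
The arithmetic behind the Figure 1A concern can be re-run in a few lines. The sketch below only uses the counts quoted in this report; the variable and function names are illustrative, and the closing observation about the two overlap numbers is a possible reading of the figure, not something the manuscript states.

```python
# Counts as quoted in the review of Figure 1A.
pipeline_1_total = 76_077
pipeline_2_total = 80_782
unique_pipeline_1 = 5_419
unique_pipeline_2 = 9_022
overlap_candidates = (70_658, 71_760)   # both numbers appear in the overlap
reported_long_read_union = 82_046
unique_short_read = 16_070
reported_final_total = 97_240


def union_size(only_a, only_b, shared):
    """Size of a two-set union from its three Venn regions."""
    return only_a + only_b + shared


if __name__ == "__main__":
    for shared in overlap_candidates:
        total = union_size(unique_pipeline_1, unique_pipeline_2, shared)
        print(f"overlap {shared}: union {total}, "
              f"excess over reported {total - reported_long_read_union}")

    merged = reported_long_read_union + unique_short_read
    print(f"long-read union + short-read unique: {merged}, "
          f"excess over reported {merged - reported_final_total}")

    # Each overlap number, added to one pipeline's unique count, gives that
    # pipeline's total. This may mean the overlap was counted once per pipeline.
    print(unique_pipeline_1 + overlap_candidates[0] == pipeline_1_total)
    print(unique_pipeline_2 + overlap_candidates[1] == pipeline_2_total)
```

If the overlap was indeed counted once from each pipeline's side, the two numbers would differ whenever isoforms match one-to-many between pipelines, which is the kind of detail the figure legend would need to state.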

    Figure 2: Over neuronal maturation the major splicing change is for cassette exons to become more included, 50% of those measured being microexons

    Overall this section is strongest; the conclusions are well supported. Figure 2D - there are no genome coordinates given to allow the reader to check the highlighted events out for themselves. Figure 2F is very confusing, consider an alternative way to present this.

    Figure 2G, the premise of this analysis is interesting! But I am confused by the numbers - in 2F it's shown that 226 exons become more included between both NSC->NP->N, so why are 441 exons plotted in 2G? Whilst I appreciate genes must be expressed in both NSCs and neurons to be able to calculate differential splicing, one thing not addressed is whether expression of a lot of these genes also goes up in neurons, i.e. could it be that when these genes are lowly expressed in NSCs their splicing is not particularly well regulated but it doesn't really matter because they are not really required in NSCs? This becomes relevant later where you start to address the functionality of isoform switches - if the gene is expressed to the same degree in NSC vs. N this would suggest that both isoforms are functional; if a gene is very lowly expressed in NSC but highly expressed in N, then maybe only the N isoform needs to be functional.

    Gene ontology methodology is not described in the methods. What were the spliced genes compared against? Given these are neural samples, lots of expressed genes will have neural functions, so is this really informing us about the alternatively spliced genes?

    The manuscript would benefit by integration of its data with other published datasets - especially with the microexons - how do these behave in other datasets of neuronal maturation (such as those from VastDB or Zhang 2016)? The authors could consider looking at motifs around regulated microexons to try and establish if any specific RBPs might be involved in this regulation, although this would benefit from follow up experiments.

    Figure 3: Exon inclusion in neuronal specific transcripts confers different structures to translated proteins, suggesting these events are important functionally

    Here, Alphafold2 is used to predict the structures of switching isoforms. Whilst this is an interesting approach to inform further experiments, presented alone it remains hypothetical. Hook2 is highlighted as one example, where inclusion of a microexon introducing two amino acids to the translated protein is predicted to cause a structural change that will impact its binding to microtubules. It's hard to determine if this really will have a functional impact without doing experiments in the lab. For this manuscript to serve as a research (rather than resource) article, it would benefit from an example experiment expressing neuronal vs. NSC Hook2 isoform in a cell line and measuring co-localisation with microtubules via IF microscopy, or something similar to address the proposed function.

    In the second half of this figure, more subtle local structural changes are investigated and the example of an alpha-helix to beta-strand switch predicted in Kctd13 is presented. The figure would benefit from showing the splicing change at the RNA level and relating that to the change seen at the protein sequence level, as it is a bit confusing - the region of deletion is labelled as 'AS REGION'; however, two amino acids preceding this box are different between the two isoforms (KVEF vs. KVRG) - so presumably the splicing change starts earlier than denoted?

    In the discussion the authors state: "While these regions are long known to exist, their structural switch was assumed to be dependent on substantial changes in their structural and sequence contexts (Gendoo and Harrison, 2011; W. Li et al., 2015) as opposed to, as observed in our study, being triggered by small perturbations within nearly identical sequence contexts." It's not clear whether these small local predictions are accurate; they would require some additional structural data to validate.

    • Are the suggested experiments realistic in terms of time and resources? It would help if you could add an estimated cost and time investment for substantial experiments. Suggestions of additional computational analysis are very realistic and shouldn't take longer than a month or two. The addition of experimental data to support Figure 3 would take considerable time and resources, potentially collaboration with other labs. Perhaps focusing on making this dataset an accessible resource would be a better route to publication.
    • Are the data and the methods presented in such a way that they can be reproduced? No, no source code, software versions or supplementary data/materials is provided.
    • Are the experiments adequately replicated and statistical analysis adequate? Having one replicate of the PacBio experiment is a bit concerning, but I am aware that it is expensive. Given they have three samples of different conditions with PacBio data perhaps showing the quality control of the libraries, reproducibility of transcripts that don't change in the three conditions, etc. would give more confidence in the data.

    Minor comments:

    • Specific experimental issues that are easily addressable. Made above.
    • Are prior studies referenced appropriately? Yes. Except for this section of introduction: "While great effort is being made to overcome these limitations, capturing cell type-specific AS dynamics that is both quantitative and comprehensive of full-length transcript information currently requires combination of both SRS and LRS performed in parallel on the same cell pool. This was seldom attempted (Gupta et al., 2018; Joglekar et al., 2021) and, to the best of our knowledge, never for specific cell types of the developing mammalian brain. Even more limiting, systematic assessment of the consequences of AS on protein structure and putative function in cell fate commitment is entirely lacking. "

    LRS has allowed for whole transcriptome determination and quantification in a number of cases, especially in non-model organisms. Below I mention some examples from human and mouse:

    • Nanopore use in GTEX + short reads: Glinos et al., 2022 Nature https://www.nature.com/articles/s41586-022-05035-y
    • PacBio SMRT-Seq + short reads human and mouse cortex: Leung et al. Cell Reports 2021 https://www.cell.com/cell-reports/pdf/S2211-1247(21)01504-7.pdf
    • PacBio IsoSeq + short reads in human and mouse sperm: Sun et al., 2021 Nature Communications https://www.nature.com/articles/s41467-021-21524-6

    Single cell long read RNA-Seq has also been described in several scenarios and is worth referencing in the introduction:

    • Samples from various mouse and human sources: Tian et al., 2021 Genome Biology https://link.springer.com/article/10.1186/s13059-021-02525-6
    • Differential isoform usage in myeloma cell lines: Phillpott et al., 2021, Nature Biotech https://www.nature.com/articles/s41587-021-00965-w
    • Single cell long read isoform analysis in human immune cells: Volden and Vollmers, 2022, Genome Biology https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02615-z

    • Are the text and figures clear and accurate? Mostly, I've highlighted where numbers in figures don't make sense to me. Generally the text could use some going over and tightening up (eg. sentence on page 12 needs revising for clarity and typo "The fact that within this helical packing resides the protein domain essential for Hook2 function to bind microtubules, implies that such a negligible AS switch by two ammino acids may result in a completely altered function. ")
    • Do you have suggestions that would help the authors improve the presentation of their data and conclusions? I have made suggestions above about figures that are unclear to me.

    Referees cross-commenting

    After reading the reviews of other reviewers, it seems we are much in agreement over the main concerns relating to this manuscript. Namely: concerns over the PacBio being single replicate, concerns over indiscriminately merging PacBio and SRS transcripts, concerns about lack of validation of the structural changes predicted by AlphaFold2. On the question of novelty and significance we also seem to be aligned.

    Significance

    • Describe the nature and significance of the advance (e.g. conceptual, technical, clinical) for the field.

    The main general findings of the work have been described elsewhere: that microexon inclusion increases in many transcripts during neuronal cell fate commitment has previously been described, and the suggestions of important isoform structural changes in Hook2 and Kctd13 are not backed up by any experimental data and so are not reliable. The description of a huge number of novel isoforms is not particularly useful because the data is not compared with similar studies, so it's not clear whether these isoforms have been found before; furthermore, we have no information about these isoforms to be able to pursue further research about them. The main output of the work would be the data and transcript annotations for other people to follow up on, but this is not provided in any accessible way. The paper might be better reframed as a resource, if it is not possible to follow up on the biological conclusions.

    • Place the work in the context of the existing literature (provide references, where appropriate).

    Previously, alternative splicing has been studied in purified cell types of the developing mouse cortex using short read sequencing eg. in Zhang 2016, Cell. In this previous study, VZ NPCs (EGFP−) and non-VZ cells (EGFP+) were isolated from E14.5 Tbr2-EGFP mouse cerebral cortex. The double reporter mouse model used in the present study allows for better cell sorting into NSC, NPC and neurons, and the long read sequencing allows for whole transcript identification, however the present study has made no effort to compare the data, so it's not clear how much new biology this leads to. In Zhang 2016, the authors also predict disruption to protein domains caused by AS, but go further to perform experiments to validate the impact of some of these predictions.

    • State what audience might be interested in and influenced by the reported findings.

    Researchers of this cell fate transition might want to look at their favourite genes to see if there are novel isoforms reported (however this is currently not possible because this information is not provided). Researchers of Hook2 or Kctd13 may want to further explore the described predicted structural changes. Researchers generally studying alternative splicing may want to include the novel isoforms in their analyses (again currently not possible because they are not provided). Generally this paper would probably be best seen as a resource.

    • Define your field of expertise with a few keywords to help the authors contextualize your point of view.

    Indicate if there are any parts of the paper that you do not have sufficient expertise to evaluate. Bioinformatics, Splicing, RBP biology

  4. Note: This preprint has been reviewed by subject experts for Review Commons. Content has not been altered except for formatting.


    Referee #2

    Evidence, reproducibility and clarity

    In this manuscript the authors attempt to characterize alternative splicing in neurogenic progenitors during corticogenesis and the consequence of such alternative splicing on protein conformation. To do this the authors used previously published short-read sequencing data from neural stem cells, neural progenitors, and neurons at E14.5 and expanded on this dataset by adding long-read sequencing data.

    Major comments:

    1. According to the methods section, new long-read sequencing data was generated for each of the NSC, NP, and N cell types. It is unclear to me how these were processed in terms of replicates. From figure 1 it seems that the samples were sequenced individually but then pooled for transcriptome assembly. It would really be helpful to understand the quality of the samples better. Are there replicates for each of the cell types included? What did the read count and transcript detection look like for each of the individual samples? Are the 3 samples really equal enough to be pooled together or will 1 sample dominate when assembling the transcriptome?
    2. On page 9, end of 2nd paragraph the authors state: "... these findings highlight the extent of AS within the neurogenic lineage underscoring its potential to regulate corticogenesis to a much greater degree than previously appreciated." Would it be possible to do a direct comparison between the number of AS detected or the type of AS detected between published data and the current paper? The authors provide a very coarse description of AS events during corticogenesis based on GO terms. The GO terms that surface are not surprising and seem not very meaningful in distinguishing the three cell types. Are there lower level GO terms that are specific to a subset of the cell populations?
    3. The authors show that cell types moving from NSC to NP to N gain exons. This raises the question of whether there is a specific set of genes that gains exons during development and/or whether there are different RNA binding proteins present in the three cell populations that could contribute to the differential splicing patterns seen in the three cell populations.

    Minor comments:

    1. What was the background chosen for gene ontology analysis?
    2. For this paper the focus was on development of neurons. Certain non-neuronal populations arise from NSC and it would be interesting to compare the non-neuronal lineage as well. To what extent is the splicing pattern a differentiation/maturation hallmark and to what extent is it specific to the neuronal lineage?
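
Minor comment 1 matters because the enrichment p-value of a GO term depends on the gene set it is tested against. The sketch below uses a standard one-sided hypergeometric test, written with the standard library only; every count in it is made up for illustration and none comes from the manuscript. It contrasts a whole-genome background with a background restricted to genes expressed in the sorted cells.

```python
from math import comb


def enrichment_p(k, n, K, N):
    """Upper-tail hypergeometric p-value P(X >= k).

    N: genes in the background, K: background genes annotated with the term,
    n: genes in the test set, k: test-set genes annotated with the term.
    """
    upper = min(n, K)
    numerator = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return numerator / comb(N, n)


if __name__ == "__main__":
    # Hypothetical counts: 200 alternatively spliced genes, 30 carrying a neural GO term.
    spliced, spliced_with_term = 200, 30

    # Background A (hypothetical): all annotated genes in the genome.
    p_genome = enrichment_p(spliced_with_term, spliced, K=1_000, N=20_000)

    # Background B (hypothetical): only genes expressed in the sorted populations,
    # where the same neural term is already more frequent.
    p_expressed = enrichment_p(spliced_with_term, spliced, K=900, N=10_000)

    print(f"genome background:    p = {p_genome:.3g}")
    print(f"expressed background: p = {p_expressed:.3g}")
```

With these invented counts the term looks far more significant against the genome than against the expressed genes, which is why the choice of background should be stated in the methods.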

    Significance

    • General assessment:
      • Strengths: This manuscript describes a potential strategy to investigate the effect of alternative splicing events on the protein output. By combining short- and long-read sequencing the authors are able to capture a wide variety of splicing events in the neuronal lineage at one timepoint during development. The modeling of potential protein structures that arise from the alternatively spliced transcripts is critical to start to understand the biological effects of alternative splicing in ever changing systems like the brain during development.
      • Limitations: Main limitations are the wet-lab experimental setup. The analysis was performed on a limited number of samples (n=1?) per cell type for just 1 time point. It is not known what the variability in AS events between individuals is, and this will limit statistical testing.
    • This manuscript is mostly a proof-of-concept but does not provide enough solid proof to claim new discoveries.
    • This manuscript serves a specialized audience interested in alternative splicing and biological effects of splicing events.
    • Field of expertise: single cell transcriptomics, long-read, alternative splicing, mouse brain development.

  5. Note: This preprint has been reviewed by subject experts for Review Commons. Content has not been altered except for formatting.


    Referee #1

    Evidence, reproducibility and clarity

    The authors FACS-sorted neuronal cells and conducted both short- and long-read sequencing to delineate the process of neurogenic differentiation. They went on to verify certain new splice junctions via RT-PCR and employed AlphaFold2 to forecast the outcomes. There are several issues the authors need to address.

    1. It's unclear why the author decided to superimpose the GTF file created by StringTie (intended for SRS) onto those generated by two distinct LRS pipelines. Given that long-read sequencing doesn't match the accuracy of NGS, which could result in discrepancies in splice junction coordinates, this approach seems questionable. Additionally, the presence of alternative start sites or polyadenylation sites could further reduce the concordance rate, as evidenced by the mere 15% transcript overlap between the methods depicted in Figure 1A. The updated version of StringTie, StringTie2, offers an improved protocol for assembling short-reads using long-read data as a guide. The author should contemplate the use of these more advanced tools rather than combining them in a potentially incompatible manner.
    2. The main text and figure legends of Figure 1 do not specify the number of replicates used.
    3. The author needs to depict alternative splicing events with gene annotations, such as those seen in a sashimi plot in panel 1C. The existing panel does not adequately differentiate whether the splice junctions presented are novel. Furthermore, the author should provide the PSI for each splicing event and contrast these with the PSI derived from RT-PCR data.
    4. In the discussion section, the author asserts that their methodology, combining Short Read Sequencing (SRS) and Long Read Sequencing (LRS), is novel. However, similar approaches have been reported in previous studies, for instance in references 10.1371/journal.pcbi.1009730 and 10.1098/rsob.220206.
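
The PSI comparison requested in point 3 can be sketched as follows. PSI (percent spliced in) is taken here in its usual sense, the inclusion signal divided by the sum of inclusion and exclusion signal; the function names, the event label and all counts and band intensities are hypothetical and serve only to show the calculation.

```python
def psi(inclusion, exclusion):
    """Percent spliced in, as a fraction in [0, 1]; None when there is no signal."""
    total = inclusion + exclusion
    if total == 0:
        return None
    return inclusion / total


def psi_difference(seq_counts, pcr_bands):
    """PSI from sequencing minus PSI from RT-PCR for one event.

    seq_counts: (inclusion junction reads, exclusion junction reads)
    pcr_bands:  (inclusion band intensity, exclusion band intensity)
    """
    psi_seq = psi(*seq_counts)
    psi_pcr = psi(*pcr_bands)
    if psi_seq is None or psi_pcr is None:
        return None
    return psi_seq - psi_pcr


if __name__ == "__main__":
    # Hypothetical event: read counts from sequencing, band intensities from a gel.
    events = {
        "example_event_NSC": ((10, 30), (20.0, 60.0)),
        "example_event_N": ((30, 10), (50.0, 50.0)),
    }
    for name, (seq_counts, pcr_bands) in events.items():
        print(name, psi(*seq_counts), psi(*pcr_bands),
              psi_difference(seq_counts, pcr_bands))
```

Reporting both values side by side for each validated event, per cell type, would let readers judge how well the sequencing-derived PSI agrees with the RT-PCR result.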

    Significance

    While the sequencing data and the integration of AlphaFold2 are new, the authors fall short of experimentally demonstrating the biological significance of their findings.