Machine learning sequence prioritization for cell type-specific enhancer design

Curation statements for this article:
  • Curated by eLife

    Evaluation Summary:

    This manuscript describes an exciting new approach for tagging and isolation of unique neuronal subpopulations, which has traditionally been challenging without the incorporation of expensive and time-consuming transgenic animal colonies. While the manuscript highlights a specific test case of this technology with neurons expressing parvalbumin, in theory this method could be applied to any neuronal or even non-neuronal cell type. Further, this approach could be applied to other model organisms for which transgenic technologies are limited, thereby facilitating research in other species.

    (This preprint has been reviewed by eLife. We include the public reviews from the reviewers here; the authors also receive private feedback with suggested changes to the manuscript. Reviewer #1 and Reviewer #3 agreed to share their names with the authors.)

Abstract

Recent discoveries of extreme cellular diversity in the brain warrant rapid development of technologies to access specific cell populations within heterogeneous tissue. Available approaches for engineering targeted technologies for new neuron subtypes are low-yield, involving intensive transgenic strain or virus screening. Here, we present Specific Nuclear-Anchored Independent Labeling (SNAIL), an improved virus-based strategy for cell labeling and nuclear isolation from heterogeneous tissue. SNAIL works by leveraging machine learning and other computational approaches to identify DNA sequence features that confer cell type-specific gene activation and then make a probe that drives an affinity purification-compatible reporter gene. As a proof of concept, we designed and validated two novel SNAIL probes that target parvalbumin-expressing (PV+) neurons. Nuclear isolation using SNAIL in wild-type mice is sufficient to capture characteristic open chromatin features of PV+ neurons in the cortex, striatum, and external globus pallidus. The SNAIL framework also has high utility for multispecies cell probe engineering; expression from a mouse PV+ SNAIL enhancer sequence was enriched in PV+ neurons of the macaque cortex. Expansion of this technology has broad applications in cell type-specific observation, manipulation, and therapeutics across species and disease models.

Article activity feed

  2. Reviewer #1 (Public Review):

    This manuscript introduces SNAIL ("Specific Nuclear-Anchored Independent Labeling"), a novel approach for identifying and tagging specific neuronal populations. The approach begins by identifying population-specific open chromatin regions (defined using single-nucleus ATAC-seq, or bulk ATAC-seq from isolated populations), which are fed to a machine learning algorithm that compares these regions between the targeted cell type and other cell types. The manuscript demonstrates the efficacy of this system by identifying thousands of novel DNA elements that are accessible in parvalbumin-expressing interneurons in the mouse cortex but not in other cortical populations. Next, SNAIL uses these sequences to drive a promoter-less AAV reporter construct expressing the nuclear-anchored SUN1-GFP. In validation experiments, the manuscript reports selective expression of GFP in PV+ cells, with enrichment ratios beating the current "gold standard" technology. Further, the manuscript highlights that cells expressing this reporter driven by two separate SNAIL-selected sequences have open chromatin signatures that are highly similar to those of PV+ interneurons isolated or identified with other approaches. Overall, the results of the manuscript are compellingly presented, and computational predictions are tested with experimental observations where appropriate (with proper controls). While the technology has not been extensively validated in other brain regions or other species, these tools will be made available to the research community and should enable nuclear tagging of other selected neuronal populations of interest. Further, as the tools described here can be delivered entirely via AAVs, this platform will enable cell type tagging in model organisms for which transgenic lines are not commonly available, or even expression of other transgenes for control of cell function or genetic perturbation.

    It should also be noted that the general idea of SNAIL (using putative enhancer elements to drive reporters or effectors in a targeted cell population) is not new. In fact, a recent report (PMID 32807948) used a similar strategy to effectively express transgenes in parvalbumin interneurons across three vertebrate species. Likewise, other reports have used open chromatin profiling to identify enhancers for AAV-mediated expression of transgenes (PMID 33789083, 33789096). Therefore, the ultimate utility of the SNAIL approach will likely depend on extension of this platform to other targeted populations, and will require further demonstration that the enhancer candidate sequences used here also effectively target PV neurons in other brain regions and in other species. However, this computationally rigorous manuscript represents a necessary and fundamental first step in this process.

  3. Reviewer #2 (Public Review):

    To achieve this goal, the authors use a series of machine learning approaches to find sequence information within differential regions of open chromatin between cell types that best predicts the cell type-specific function of enhancers. The authors first built support vector machines that could classify sequences from astrocyte versus neuron and inhibitory versus excitatory open chromatin data, then validated the classifiers on known cell type-specific promoters. They then classified PV+ neuron open chromatin regions versus non-PV cell open chromatin, and also compared them to datasets from excitatory and VIP+ neurons. Importantly, they also built CNN and SVM classifiers from clusters of single-nucleus ATAC sequencing data, which is key to extending their cell type comparisons to the broader case of cell subtypes. To test the quality of their classifiers, they compared them against the sequences of experimentally tested putative PV enhancers and evaluated the comparisons. Finally, they used their optimal sequences to generate vectors and show co-expression with PV, and they analyzed the optimal sequences for the TFs likely to drive this cell type-specific expression.
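
For readers unfamiliar with sequence classifiers of the kind summarized above, the following is a minimal sketch, not the authors' implementation. It assumes that sequences are represented by overlapping 3-mer counts and that a linear SVM is fit by plain subgradient descent on the hinge loss. The toy sequences, labels, function names, and hyperparameters are all made up for illustration.

```python
from collections import Counter

def kmer_counts(seq, k=3):
    """Count overlapping k-mers in a DNA sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def score(weights, bias, feats):
    """Linear decision value: positive means 'target cell type'."""
    return bias + sum(weights.get(kmer, 0.0) * n for kmer, n in feats.items())

def train_linear_svm(examples, k=3, lr=0.1, lam=0.01, epochs=200):
    """Fit a linear SVM (hinge loss + L2 penalty) by subgradient descent.

    examples: list of (sequence, label) pairs with label in {+1, -1}.
    """
    weights, bias = {}, 0.0
    feats = [(kmer_counts(seq, k), label) for seq, label in examples]
    for _ in range(epochs):
        for x, y in feats:
            margin = y * score(weights, bias, x)
            # L2 regularization: shrink every weight slightly each step.
            for kmer in weights:
                weights[kmer] *= 1.0 - lr * lam
            # Hinge loss is active only when the margin is below 1.
            if margin < 1.0:
                for kmer, n in x.items():
                    weights[kmer] = weights.get(kmer, 0.0) + lr * y * n
                bias += lr * y
    return weights, bias

# Hypothetical toy data: +1 sequences share a "GATA"-like motif, -1 do not.
TOY_EXAMPLES = [
    ("ACGATAAC", +1), ("ACGTTTAC", -1),
    ("TTGATAGG", +1), ("TTCCCAGG", -1),
    ("CGATACCA", +1), ("CAGGTCCA", -1),
]

weights, bias = train_linear_svm(TOY_EXAMPLES)
for seq, label in TOY_EXAMPLES:
    print(seq, label, round(score(weights, bias, kmer_counts(seq)), 2))
```

In a real analysis the positive and negative sets would be open chromatin regions from the target and comparison cell types, and performance would be judged on held-out regions rather than on the training sequences as in this toy run.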

    Bioinformatics methods represent an efficient means to achieve the goal of developing tools for increasingly specific manipulations of neuronal cell types. Machine learning tools are a particularly powerful and unbiased way to find sequence patterns, and the authors show evidence to experimentally validate their computational outcomes. The application of the method to single-cell data makes it especially useful in the context of the explosion of such data now appearing in databases. If the method turns out to be widely applicable when applied to additional datasets, this will be of significant importance for the field.

    I have no major concerns about the quality of the data or the statistical analyses. My only concerns are about how certain data are interpreted or described.

  4. Reviewer #3 (Public Review):

    Recent years have given us increasing insight into the diversity of cell types in the brain and other tissues, as well as novel approaches to target them. Often these approaches are combinations of viral vector-based transfection and cis-acting native regulatory DNA elements (i.e. enhancers). This paper addresses two important issues in the search for specificity, first providing a (largely previously described) means to increase the representation of relatively rare cell types in a sample via labeling nuclei to facilitate isolation, and then providing a semi-automated means of sorting through the increasingly large numbers of putative regulatory elements one obtains with modern epigenetic and bioinformatic methods. The authors' proposed approach is based on an in silico filter using machine learning to identify sequence characteristics in enhancers specific for particular cell types. This is a very rational approach, focusing on the functional characteristics of the enhancers. Interestingly, they do indeed find some functionally important characteristics in the form of particular TF-binding motifs common to enhancers that successfully drive transgene expression specifically in PV+ neurons.
    If the method works as well as it seems, it will give the community another valuable means to winnow down the dizzying number of hits involved in enhancer selection, which would be of great benefit to the field. This is not solely a technical manuscript, however. In addition, this work provides some insight into the functional interaction between enhancers and transcription factors in particular cell types of the brain, the complexity of which we are only starting to appreciate, so the paper has some interesting biology as well. Taken together, it has the makings of an impactful paper.
    There are some issues, however. First and foremost, as it stands now it is unclear how much the in silico filters improve upon raw chromatin accessibility alone, let alone other methods. Judging from Figure 1 Supplement 4a, the correlation between specificity in vivo and predicted activity is driven mostly by enhancers E1, E22 and E29. I would like to see the correlation scores without these enhancers. Further, in Figure 1 Supplement 4b, there seems to be a strong correlation between specificity in vivo and accessibility to begin with. Please include a comparison between these two correlations. Also, a much more convincing demonstration of added value would be to test which measure is "correct" in those few instances where the two deviate, see #4 below.
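
The sensitivity check requested above (recomputing a correlation after dropping a few influential points) can be sketched as follows. This is only an illustration under stated assumptions: the candidate names and values are invented toy numbers, not data from the manuscript, and the helper names `pearson` and `correlation_excluding` are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def correlation_excluding(points, excluded=()):
    """Correlation over all named points except those listed in `excluded`."""
    kept = [(x, y) for name, (x, y) in points.items() if name not in excluded]
    xs, ys = zip(*kept)
    return pearson(xs, ys)

# Invented toy values: (predicted activity, in vivo specificity) per candidate.
TOY_POINTS = {
    "cand1": (1.0, 1.0), "cand2": (2.0, 2.0),
    "cand3": (3.0, 2.0), "cand4": (4.0, 1.0),
    "cand5": (10.0, 10.0), "cand6": (11.0, 11.0), "cand7": (12.0, 12.0),
}

r_all = correlation_excluding(TOY_POINTS)
r_without = correlation_excluding(TOY_POINTS, excluded={"cand5", "cand6", "cand7"})
print(f"all points: r = {r_all:.3f}")
print(f"without cand5-7: r = {r_without:.3f}")
```

In this toy set the three high-scoring candidates carry the whole correlation, which is the pattern the reviewer suspects. Running the same comparison with accessibility in place of predicted activity would give the second correlation requested.
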
    Second, the authors present scATAC-seq as the only way to find enhancers. There are in fact other approaches that, if incorporated, would also improve prediction of which genomic regions are functional, active enhancers, which are "poised", and which are simply accessible, most notably ChIP-seq data, though this is without a doubt more resource intensive than an in silico filter. There are many examples, but of particular interest is the work by Ernst and Kellis (PMID: 29120462), as well as the work by Axel Visel and Len Pennacchio. They respectively use models based on additional epigenomic information and modifications associated with active enhancers for their selection, and provide good comparators as to what else is out there. However, it is important to note that this could be combined with their algorithm.
    Third, the authors gloss over what the models actually do, and do not explicitly compare them to other means of enhancer selection (this could go in the discussion). From the text it was unclear to me exactly what happened and how the rankings occurred. Furthermore, the graphs in 1B-E are insufficiently explained. Yes, it is true that there are hidden levels in , but the algorithm should be explained well enough to know what the limitations are, and whether there can be biases "baked in" towards e.g. particular kinds of TFs.
    Finally, the two novel enhancers the authors find (SC1 and SC2) are selected from the 90th percentile. There are likely several dozen or even hundreds of enhancers here, so the authors need to be more forthcoming about their selection criteria. SC1 is arguably just picked as the highest scoring hit, but what about SC2? Which characteristics led the authors to manually pick this particular enhancer? Were other enhancers tested for specificity and failed? At the end of the day, the utility and impact of SNAIL depends upon the extent to which it aids the generation of cell type-specific tools beyond the current state of the art. The authors show two examples of at least somewhat PV-specific vectors as proof of concept, but this seems a bit thin, as they could arguably have been picked from the chromatin accessibility data alone (see Figure 1 Supplement 6b vs. c). To sum up, this is a potentially quite valuable addition to the toolkit for making cell type-specific vectors, but just exactly how valuable is not yet clear (see recommendations for authors).
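
To make the selection question above concrete, here is a small sketch of a percentile-based shortlist. It assumes a nearest-rank percentile and a strict "above the cutoff" rule, neither of which is stated in the source. The candidate names and scores are hypothetical.

```python
def nearest_rank_percentile(values, pct):
    """Nearest-rank percentile (pct is an integer from 1 to 100)."""
    ordered = sorted(values)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceiling
    return ordered[rank - 1]

def shortlist(scores, pct=90):
    """Candidates scoring strictly above the pct-th percentile, best first."""
    cutoff = nearest_rank_percentile(scores.values(), pct)
    kept = [(name, s) for name, s in scores.items() if s > cutoff]
    return sorted(kept, key=lambda item: item[1], reverse=True)

# Hypothetical scores for 20 candidates: cand01 = 0.05, ..., cand20 = 1.0.
TOY_SCORES = {f"cand{i:02d}": i / 20 for i in range(1, 21)}

top = shortlist(TOY_SCORES, pct=90)
print([name for name, _ in top])
```

With N candidates, such a cutoff leaves roughly N/10 of them, so a pool of thousands still yields hundreds of hits. This is why the criteria used to choose among the remaining candidates matter.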