Genome-wide discovery of cis-regulatory elements in a large genome

Curation statements for this article:
  • Curated by eLife

Abstract

Identifying non-coding regulatory elements in the genome poses a challenge in most organisms. Classical methods rely on trial and error to test the regulatory activities of DNA fragments using reporter constructs. In large eukaryotic genomes, where cis-regulatory elements can spread over long distances, separated by large stretches of non-functional DNA, this trial and error approach is particularly challenging. Here, we generate two types of resources that can be used to narrow the search for such cis-regulatory elements in the 3.6 Gbp genome of Parhyale hawaiensis (comparable in size to the human genome). First, we use bulk ATACseq to uncover genome-wide patterns of chromatin accessibility in embryonic and adult tissues of Parhyale (whole embryos and legs), and single-nucleus ATACseq to identify regions of open chromatin in diverse cell types recovered from adult legs, including epidermal, neuronal, muscle and blood cells. Second, by sequencing the genomes of three congeneric species of Parhyale hawaiensis (P. darvishi, P. aquilina and P. plumicornis), we identify islands of sequence conservation across the genome, corresponding to DNA elements that are functionally constrained during evolution. We present an approach by which low-coverage (10-15x) short-read genome sequencing, without genome assembly, is sufficient to provide reliable maps of sequence conservation. This approach cuts the cost and labour required to generate these maps, making the identification of cis-regulatory elements more widely accessible. We demonstrate the utility of these resources by identifying cis-regulatory elements that drive robust expression of fluorescent reporters ubiquitously and in specific cell types.
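
To give a sense of scale for the low-coverage (10-15x) sequencing mentioned in the abstract, the sketch below applies the standard relation coverage = reads × read length / genome size. Only the 3.6 Gbp genome size and the 10-15x range come from the abstract. The 150 bp read length is an assumption, as is the simplification that each congener genome is about the same size as that of P. hawaiensis.

```python
import math


def reads_for_coverage(genome_size_bp: int, coverage: float, read_length_bp: int) -> int:
    """Smallest read count N such that N * read_length / genome_size >= coverage."""
    return math.ceil(coverage * genome_size_bp / read_length_bp)


GENOME_SIZE_BP = 3_600_000_000  # 3.6 Gbp, from the abstract
READ_LENGTH_BP = 150            # assumption: read length is not stated in the abstract

for cov in (10, 15):
    n_reads = reads_for_coverage(GENOME_SIZE_BP, cov, READ_LENGTH_BP)
    print(f"{cov}x coverage: about {n_reads / 1e6:.0f} million reads "
          f"({cov * GENOME_SIZE_BP / 1e9:.0f} Gbp of sequence)")
```

Under these assumptions, 10x and 15x correspond to 36 and 54 Gbp of raw sequence per species. Paired-end runs would count each mate as one read here.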

Article activity feed

  1. eLife Assessment

    This important study combines chromatin accessibility and genomic DNA sequence conservation data from low-coverage genome sequencing of related species (without assembly), for the in silico identification of cis-regulatory elements in large genomes. The approach and results are compelling and well supported by the experimental validations. The work will be of interest to researchers working in the field of gene regulation and evolution, particularly because the methodology proposed can be applied to a large variety of experimental organisms.

  2. Reviewer #1 (Public review):

    Summary:

    Forbes et al. developed an integrated approach to identify cis-regulatory elements (CREs) in the large (3.6 Gbp) genome of the crustacean Parhyale hawaiensis, addressing the challenge of pinpointing these regions among large regions of non-coding sequences. They combined ATAC-seq chromatin accessibility profiling (both bulk and single-nucleus) across embryonic and adult tissues with low-coverage genome sequencing of three congeneric species (P. aquilina, P. darvishi, P. plumicornis). Without assembling congener genomes, they mapped reads with low stringency to the P. hawaiensis reference, identifying about 55k conserved islands that overlap ATAC peaks more than expected by chance. This dual filter was used to select CRE candidates for transgenic reporter validation, yielding 6 functional elements (out of 11 tested) driving ubiquitous, neuronal, or muscle-specific expression, a major advance for non-model systems with large genomes.

    Strengths:

    Forbes et al. generated high-quality ATAC data across multiple scales. Using bulk ATAC-seq (from whole embryos and from developing and adult legs), they identified tens of thousands of open chromatin peaks across the large assembled P. hawaiensis genome. Moreover, using single-nucleus ATAC-seq from adult legs, they could resolve differentially accessible chromatin profiles across more than 15 cell types previously identified by scRNA-seq, enabling cell-type-specific candidate selection.

    Furthermore, their innovative low-coverage comparative genomics method mapped 0.46-6.4% of congener reads to P. hawaiensis without genome assembly, revealing hundreds of thousands of conserved non-coding islands, including about 55k showing conservation in all four species, far exceeding random expectation.

    Using this approach, the authors validated 6 of 11 candidate reporter constructs, which drove robust ubiquitous or tissue-specific expression. The approach succeeded where prior promoter-only screening had failed, and it provides immediately useful genetic tools for the Parhyale community.

    Weaknesses:

    The primary limitation is that functional CRE testing was performed only in P. hawaiensis. While conservation maps are valuable resources, the manuscript lacks functional validation in congener species, limiting claims about broad applicability across related genomes/species.

    The approach also failed to validate developmental CREs. None of the candidates from combined ATAC and conservation filtering drove reporter expression matching endogenous patterns. The authors appropriately hypothesize technical limits (low expression) or biological factors (long-range enhancers, shadow enhancers).

    Overall Assessment:

    Forbes et al. fully succeed with their integrated approach, which (1) generates an ATAC-seq atlas and uses it for functional CRE discovery and (2) introduces low-coverage sequencing for conservation mapping in the large 3.6 Gbp genome of Parhyale hawaiensis. Their combination of ATAC-seq chromatin accessibility profiling (bulk and single-nucleus) across embryonic and adult tissues with low-coverage genome sequencing of three congeneric species (P. aquilina, P. darvishi, P. plumicornis), without congener genome assembly, drastically shrank the CRE search space. Using this approach, the authors validated 6 of 11 candidate transgenic reporters (ubiquitous, neuronal, and muscle-specific), where prior promoter-only screening had failed.

    The low-coverage mapping innovation cuts cost and labour while snATAC-seq provides cell-type resolution, making these resources valuable for building new genetic and imaging tools in Parhyale.

    This compelling method also has the potential to enable labs with limited resources to identify and characterize regulatory elements in more non-model organisms, advancing our understanding of their evolution while establishing a scalable pipeline for large-genome systems.

  3. Reviewer #2 (Public review):

    The manuscript by Forbes, Skafida, Karapidaki et al. concerns the in silico identification of cis-regulatory elements (CREs) in large genomes using chromatin accessibility (ATAC-seq) and sequence conservation (genomic DNA sequencing) data. They exemplify this method by applying it to identify novel CREs in Parhyale hawaiensis, which they validated using reporter constructs.

    The results are convincing and are well supported by the data and validations. Identified CREs are valuable for researchers interested in the regulation of the expression of genes they control.

    The methodology on the whole is also valid, as suggested by the results and previous publications on various taxa. Sequence conservation, as stated by the authors, was long used as a method to identify regions of non-coding DNA with functional and evolutionary constraints. The same applies to ATAC-seq data, which has also been used as a proxy for functional regions in different animals such as sea urchins and amphioxus. The methodology proposed is likely to be successfully used by researchers working on a variety of experimental organisms.

    Rather than relying on existing genome assemblies, the authors use short-read sequencing to identify conserved regions. While this is not conceptually novel, such an approach is becoming increasingly viable and useful given recent advances in next-generation sequencing technology and the falling price of short-read sequencing.

    Two major weaknesses are:

    (1) The novelty of the approach and its advantages should be more explicitly stated.

    (2) The authors do not discuss in depth the strength of using a combination of two methods rather than either of the two, especially considering that previously known CREs do not overlap with conserved sequences.

  4. Reviewer #3 (Public review):

    Summary:

    Forbes et al. present a new approach for identifying cis-regulatory elements in large genomes. Using Parhyale hawaiensis, a crustacean with a large genome (~3.6 Gb, comparable in size to the human genome), the authors show that current methods for identifying cis-regulatory elements, effective in smaller genomes, are markedly inefficient in organisms with large genomes. To address this limitation, they combine bulk ATAC-seq and single-nucleus (sn) ATAC-seq to identify chromatin regions that are either ubiquitously accessible or specifically accessible in particular cell types. They further integrate comparative genomics across multiple Parhyale species (P. hawaiensis, P. aquilina, P. darvishi, and P. plumicornis), selected at appropriate phylogenetic distances (20-95 million years divergence), to pinpoint conserved open chromatin regions likely under functional constraint.

    Using this strategy, the authors predict a set of ubiquitous and cell-type-specific cis-regulatory elements. Importantly, they validate these predictions using rigorous transgenic reporter assays, convincingly demonstrating that their approach can successfully identify functional regulatory elements where previous methods had failed.

    Strengths:

    The approach introduced by Forbes et al. is conceptually straightforward, efficient, and readily transferable to other organisms. The validation experiments show not only that a substantial proportion of the predicted elements are functional, but also that the method is capable of identifying both ubiquitous and cell-type-specific regulatory elements. Given that the identification of regulatory regions remains a major bottleneck in understanding the molecular mechanisms underlying processes of development and regeneration, this work has the potential to make a significant impact in developmental and regeneration biology, particularly for studies involving non-model organisms with large genomes.

    An additional strength is the demonstration that only the genome of the focal species requires high-quality sequencing and assembly. In contrast, species used solely for comparative analysis can be sequenced at low coverage without assembly, substantially reducing costs and increasing the accessibility of the approach.

    Weaknesses:

    While the method is effective in identifying regulatory elements that are active ubiquitously or in differentiated cell types, it failed to detect elements associated with developmentally regulated genes. This may be due to trivial reasons, such as a very low level of expression of the selected genes. However, as acknowledged by the authors, it may also indicate inherent challenges in identifying regulatory elements associated with developmentally dynamic gene regulation, compared to those associated with genes expressed in differentiated cell types.

    A second limitation, also acknowledged by the authors, is the absence of chromatin conformation capture data, which would help link distal regulatory elements to their target genes. This limitation may be particularly relevant for developmentally regulated genes, where long-range regulatory interactions may be critical.

    Addressing these limitations will be an important direction for future work. Nonetheless, the approach as presented in this manuscript represents a key contribution that sets the stage for further methodological advances in the identification of cis-regulatory elements in large genomes.
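
Reviewer #1 notes that conserved islands overlap ATAC peaks more than expected by chance. The sketch below illustrates one generic way to test such a claim: a permutation test that relocates each island to a random position and recounts overlaps. It is not the authors' statistical procedure, which is not described on this page. All coordinates, the toy genome length, and the function names are hypothetical.

```python
import random
from bisect import bisect_right


def count_overlapping(islands, peaks):
    """Count islands that overlap at least one peak.

    Intervals are half-open (start, end) on a single sequence.
    `peaks` must be sorted by start and non-overlapping.
    """
    peak_starts = [start for start, _ in peaks]
    hits = 0
    for start, end in islands:
        # Index of the last peak that starts before the island ends.
        i = bisect_right(peak_starts, end - 1) - 1
        if i >= 0 and peaks[i][1] > start:
            hits += 1
    return hits


def permutation_test(islands, peaks, genome_len, n_perm=1000, seed=0):
    """Compare the observed overlap count with randomly relocated islands."""
    rng = random.Random(seed)
    observed = count_overlapping(islands, peaks)
    lengths = [end - start for start, end in islands]
    total = 0
    at_least_observed = 0
    for _ in range(n_perm):
        # Keep each island's length but draw a uniform random position.
        shuffled = []
        for length in lengths:
            start = rng.randrange(0, genome_len - length + 1)
            shuffled.append((start, start + length))
        count = count_overlapping(shuffled, peaks)
        total += count
        if count >= observed:
            at_least_observed += 1
    expected = total / n_perm
    # Add-one correction so the empirical p-value is never exactly zero.
    p_value = (at_least_observed + 1) / (n_perm + 1)
    return observed, expected, p_value


# Hypothetical toy data: a 100 kb sequence, four ATAC peaks, four conserved islands.
GENOME_LEN = 100_000
peaks = [(1_000, 1_500), (20_000, 20_600), (50_000, 50_400), (80_000, 80_500)]
islands = [(1_100, 1_200), (20_100, 20_250), (50_350, 50_450), (70_000, 70_100)]

observed, expected, p_value = permutation_test(islands, peaks, GENOME_LEN)
print(f"observed overlaps: {observed}, mean under shuffling: {expected:.2f}, "
      f"empirical p: {p_value:.4f}")
```

A real analysis would shuffle within scaffolds and might restrict random positions to mappable, non-coding sequence. Uniform placement along one sequence is a simplification made here for brevity.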