Promoter sequence and architecture determine expression variability and confer robustness to genetic variants

Curation statements for this article:
  • Curated by eLife

    eLife assessment

    This paper by Einarsson and colleagues presents a comprehensive analysis of how human genetic variability impacts both gene expression and promoter architecture. Using a new resource of CAGE data in lymphoblastoid cell lines from 108 individuals, they uncover a series of features that distinguish promoters with highly variable expression across individuals from those exhibiting low variability. The authors propose various explanations for the observed results. A few additional analyses and a more pragmatic interpretation of their data may help consolidate or refine the models proposed.

Abstract

Genetic and environmental exposures cause variability in gene expression. Although most genes are affected in a population, their effect sizes vary greatly, indicating the existence of regulatory mechanisms that could amplify or attenuate expression variability. Here, we investigate the relationship between the sequence and transcription start site architectures of promoters and their expression variability across human individuals. We find that expression variability can be largely explained by a promoter’s DNA sequence and its binding sites for specific transcription factors. We show that promoter expression variability reflects the biological process of a gene, demonstrating a selective trade-off between stability for metabolic genes and plasticity for responsive genes and those involved in signaling. Promoters with a rigid transcription start site architecture are more prone to have variable expression and to be associated with genetic variants with large effect sizes, while a flexible usage of transcription start sites within a promoter attenuates expression variability and limits genotypic effects. Our work provides insights into the variable nature of responsive genes and reveals a novel mechanism for supplying transcriptional and mutational robustness to essential genes through multiple transcription start site regions within a promoter.

Article activity feed

  1. Author Response

    Reviewer #1 (Public Review):

    Einarsson et al have produced CAGE data from EBV-immortalised lymphoblastoid cells from more than a hundred individuals from two genetically diverse African populations (YRI and LWK), and used it to study how sequence variation affects the activity of promoters at the level of expression variability and at the level of transcription start site usage within promoters across individuals.

    The dataset is very exciting, and the analyses were performed carefully and described well. The results show that promoters in the genome vary a lot with respect to their expression variability across individuals and that their level of variability is closely associated with their biological function and their sequence and architectural features. These results are often confirmatory - it is well established that promoters have different architectures associated with different sequence elements, different types of gene regulation and even differences across individual cells. In general, the multifarious observations boil down to one key distinction:

    • Regulated genes have promoters that look and act differently from those of housekeeping genes.

    We are pleased that the reviewer is as excited as we are about the unique dataset, the rigorous analyses performed, and the biological results. While we agree that housekeeping and regulated genes show apparent differences in terms of promoter variability, our analyses were not informed or guided by the expression of promoters/genes across cell types or tissues, but rather by their variability within the same cell type across individuals. It is indeed interesting that the same underlying mechanisms that cause stable expression across cell types also attenuate variability across individuals. Similarly, promoters that display cell-type restricted (regulated) expression levels also tend to be more variable within the same cell type across individuals. While one may argue that these relationships are unsurprising, they have, to the best of our knowledge, not been demonstrated before. Of note, however, while most low variable promoters regulate housekeeping genes and highly variable ones regulate regulatory genes, this is not always the case.

    While this is unsurprising, the authors then proceed to analyse other underlying differences between low variability (mostly housekeeping) and high variability (overwhelmingly regulated) promoters. Several observations have alternative and sometimes more elegant explanations if some of the previously worked out properties of housekeeping vs regulated promoters are taken into consideration:

    • The authors are keen to interpret the architectural features of ubiquitously expressed (housekeeping) promoters as selected for robustness against mutations in ensuring stable and steady expression levels.

    However, there are some known facts about both housekeeping and regulated promoters that make alternative interpretations plausible.

    • When discussing broad promoters, the authors disregard the well known fact that the most commonly used transcription start positions are those with YR sequence at (-1,+1) position. Any mutation within the span of broad promoter cluster that removes an existing YR or introduces a new one has the capacity to change both the TSS distribution pattern and overall level of expression of that promoter - but only slightly. This way, broad promoters can be viewed as adaptation not for robustness but for ability to take many mutations with small effect size that will drive any positive selection smoothly across a changing fitness landscape.

    We thank the reviewer for these remarks. We fully agree with the scenario described by the reviewer, that disruptions of TSSs may have different consequences depending on whether they occur in a broad promoter with multiple YR sequences or in a sharp promoter. However, we argue that the observation that promoters containing such flexible TSSs are not affected much upon genetic perturbations reveals robustness. By definition, robustness is the ability to produce a persistent phenotype (in our case the molecular phenotype of promoter expression) even in perturbed conditions (e.g. under the influence of natural genetic variation affecting TSS usage). The very fact that TSS disruptions will only have small effect sizes in certain promoters but not in others tells us that the unaffected or only mildly affected promoters have architectural properties that minimize the effect sizes of these disruptions and thereby cause robustness in overall promoter expression. Hence, we do not see our explanation and that of the reviewer as contradicting each other.
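
    The YR argument can be made concrete with a minimal sketch. It assumes that candidate initiation sites are simply positions where a pyrimidine (Y = C/T) at -1 is followed by a purine (R = A/G) at +1; the function names and the toy sequence are hypothetical and are not taken from the manuscript, and real TSS usage depends on far more than the YR dinucleotide.

```python
PYRIMIDINES = set("CT")  # Y
PURINES = set("AG")      # R


def yr_tss_candidates(seq):
    """Return 0-based indices of +1 positions whose -1 base is Y and +1 base is R."""
    seq = seq.upper()
    return [
        i + 1
        for i in range(len(seq) - 1)
        if seq[i] in PYRIMIDINES and seq[i + 1] in PURINES
    ]


def substitute(seq, pos, base):
    """Return seq with a single-nucleotide substitution at 0-based position pos."""
    return seq[:pos] + base + seq[pos + 1:]


# Toy "broad promoter" region (an assumption, not real data).
broad = "GCATTGCGCAGT"
before = yr_tss_candidates(broad)

# A C>G substitution at index 1 destroys the first YR dinucleotide only.
after = yr_tss_candidates(substitute(broad, 1, "G"))
lost = sorted(set(before) - set(after))
```

    In this toy region one substitution removes a single candidate site while the others remain available, which is the situation both the reviewer and the response describe for broad promoters.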

    • Indeed, the main property of low variability promoters is that there isn't a single nucleotide change (either substitution or indel) that can substantially change their activity. (In that they are clearly different from e.g. TATA-dependent promoters, where one change can abolish TBP binding or deprive the promoter of a YR dinucleotide at a suitable distance from the TATA box.) This is achieved by their dependence on broad and weak sequence signatures such as GC composition and nucleosome positioning signal. However, most such genes are not known to have a strict requirement for dosage control. On the contrary, dosage seems to be much more critical for the functional classes that in the authors' analysis show variable expression.
    • Whether it is a removal of YR dinucleotide, introduction of a new one, or the change of nucleosome positioning, it seems that the transcription level from housekeeping, low variability promoters is unaffected, or at least affected mildly enough that it is not within the statistical power of the CAGE data across different individuals to detect the difference. Rather than robustness, it can be interpreted as competition - the architecture recruits preinitiation complex at a fairly constant rate, and it is the different YR positions that "compete" for serving as transcription initiation position, with the CAGE signal reflecting the relative effectiveness of each position in that competition. If one of the YR dinucleotides is removed, often the other, neighbouring ones will be used instead. The same might happen for potential multiple nucleosome positioning signals - if one becomes less efficient at stopping a nucleosome, another will be used more often.
    • The fact that decomposed parts of housekeeping promoters add up to approximately the same expression level across individuals even when they are uncorrelated suggests that they might actually be anticorrelated - indeed, the UFSP2 plot in Figure 4E looks like the two decomposed promoters are anticorrelated. That would argue against the independence of the decomposed promoters - indeed it may again point to "competition" where the decrease in use of one will simply shift most initiation events to the other.

    We thank the reviewer for these thoughts. The reviewer has made an excellent observation regarding the correlation between decomposed promoters within low variable promoters. While decomposed promoter pairs of highly variable promoters frequently have correlated expression levels, low variable multi-modal promoters often contain decomposed promoters that have low or even negative expression correlation across individuals. We agree that negative correlation points to the possibility that these decomposed promoters are competing for the transcriptional machinery. Indeed, nucleosome positioning analysis (described below) suggests the existence of diverse chromatin accessibility configurations within low variable multi-modal promoters with low or negatively correlated decomposed promoters. This may suggest a competition between the usage of their decomposed promoters. We have revised the manuscript to better reflect this aspect, discussed the potential for YR shifts encoded within the promoter sequence, and also toned down the independence of decomposed promoters. However, regardless of whether decomposed promoters are independent (low correlation) or competing for the transcriptional machinery (negative correlation), we do not agree that this violates our conclusion of robustness. A competition between decomposed promoters within a low variable multi-modal promoter would favor the strongest decomposed promoter, and if the strongest decomposed promoter is affected by genetic perturbation (for instance through disruption of YRs or proximal TF binding) this will affect the competition and shift the dominant usage to another decomposed promoter, as suggested by the frQTL analysis, leading to minimal change in total promoter expression, i.e. a robust molecular phenotype.
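
    The link between correlation of decomposed promoters and variability of total promoter expression follows from the identity Var(A + B) = Var(A) + Var(B) + 2 Cov(A, B). The sketch below illustrates it on made-up expression values for two decomposed promoters across five individuals; the numbers and function names are assumptions for illustration only, not data or code from the study.

```python
from math import sqrt


def mean(xs):
    return sum(xs) / len(xs)


def covariance(xs, ys):
    """Sample covariance (n - 1 denominator); covariance(xs, xs) is the variance."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


def pearson(xs, ys):
    """Pearson correlation between two equally long expression vectors."""
    return covariance(xs, ys) / sqrt(covariance(xs, xs) * covariance(ys, ys))


def total_variance(xs, ys):
    """Variance of the summed (whole-promoter) expression across individuals."""
    total = [x + y for x, y in zip(xs, ys)]
    return covariance(total, total)


# Hypothetical expression of two decomposed promoters in five individuals.
dp_a = [10, 12, 8, 11, 9]
dp_b_compensating = [10, 8, 12, 9, 11]  # falls whenever dp_a rises
dp_b_coregulated = [10, 12, 8, 11, 9]   # rises and falls together with dp_a

r_comp = pearson(dp_a, dp_b_compensating)
var_comp = total_variance(dp_a, dp_b_compensating)
r_coreg = pearson(dp_a, dp_b_coregulated)
var_coreg = total_variance(dp_a, dp_b_coregulated)
```

    With identical per-promoter variances, the compensating pair yields a constant total, whereas the co-regulated pair yields a total whose variance is four times that of either decomposed promoter. Uncorrelated pairs would fall in between.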

    • In general, not everything is a result of direct evolutionary selection, and that is what should have clear landmarks of purifying selection. On the contrary, promoters, especially housekeeping promoters, have vastly different nucleotide and dinucleotide compositions across Metazoa, both at large and at relatively short distances, which means they can undergo concerted evolution as a group, which means they should be "robust" to mutations in a way that allows them to change much more and more rapidly than some other promoter architectures - especially TATA-dependent architectures whose key elements and spacing between them haven't substantially changed for more than a billion years, and possibly longer.

    We fully agree with the reviewer and have revised the manuscript to remove the evolutionary aspect of robustness. We believe our results are better interpreted with regard to the existence of inherent mechanisms of low variable multi-modal promoters that provide regulatory robustness. Indeed, the vastly different sequence composition of housekeeping promoters between species makes these properties even more interesting. We do not believe that robustness to perturbations needs to be encoded by a specific sequence signature. Rather, we observe that multimodal promoters with low variability require broad initiation regions and flexibility in the usage of TSSs. This fits well with observations in flies (Schor et al, 2017, DOI: 10.1038/ng.3791) of shifts in the shape of the promoter, which we believe to reflect shifts in decomposed promoter usage, upon genetic perturbation.

    • While housekeeping promoters are broad but mostly not among the broadest, regulated promoters can be either broad or narrow. This is also known - while narrow promoters are overrepresented for tissue-specific and non-CGI promoters, promoters of Polycomb-bound developmental genes are often broad and have large CpG islands; the latter may account for some of the broadest CAGE clusters observed in the data. It would be an interesting finding if both TATA-dependent and developmental promoters were found to be variable across individuals in a non-trivial way (the trivial way being the variability due to larger dynamic range of their expression - e.g. the expression of SIX3 in many cell types is basically zero, while the dynamic range of RPL26L1 is very limited) - this should be checked by analysing them separately.

    We agree that an analysis of the variability of developmental, Polycomb-bound promoters would be very interesting and thank the reviewer for ideas for a follow-up study. We do not feel that LCLs are the best model system for analyzing developmental promoters and therefore argue that this is out of scope in this study.

    • While broad promoters can be decomposed into subclusters with differential expression across individuals, the authors do not seem to allow for the decomposition of intertwined TSS positions within the cluster, but rather postulate hard boundaries between subclusters. This is different from e.g. overlapping maternal and zygotic promoter use (Haberle et al Nature 2014), where the distribution of the used TSS positions is different but the clusters can overlap.

    This is correct: we do not allow for overlapping decomposed promoters. We agree that the work by Haberle et al (2014, DOI: 10.1038/nature12974) on switches between maternal and zygotic TSSs is an excellent demonstration of how intertwined promoters can occur and be of biological relevance. Our analysis is based on the observation that low variable promoters are often multimodal and cannot be well explained simply by the width of promoters. This led us to decompose multimodal promoters into their sub-peak constituents. We believe that the frQTL analysis and the new decomposed promoter QTL (dprQTL) analysis clearly demonstrate the value of our approach. While it would indeed be interesting to see the results of an alternative approach for decomposition, we feel this is out of scope in this study but acknowledge that additional determinants of promoter variability may possibly be discovered using alternative strategies.
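
    To illustrate what non-overlapping decomposition with hard boundaries means, here is a deliberately simplified sketch. It is not the decomposition method used in the manuscript: it merely splits a per-position TSS count profile wherever a run of at least `min_gap` signal-free positions occurs. The function name, the `min_gap` parameter, and the counts are all hypothetical.

```python
def split_hard_boundaries(counts, min_gap=2):
    """Split a per-position count profile into non-overlapping sub-clusters.

    Returns a list of (start, end, total) tuples with inclusive 0-based
    coordinates. A new sub-cluster starts when at least `min_gap`
    consecutive positions without signal separate two positions with signal.
    """
    clusters = []
    start = None        # first position of the sub-cluster being built
    last_signal = None  # most recent position with a non-zero count
    for pos, count in enumerate(counts):
        if count <= 0:
            continue
        if start is None:
            start = pos
        elif pos - last_signal - 1 >= min_gap:
            clusters.append((start, last_signal, sum(counts[start:last_signal + 1])))
            start = pos
        last_signal = pos
    if start is not None:
        clusters.append((start, last_signal, sum(counts[start:last_signal + 1])))
    return clusters


# Hypothetical bimodal profile: two sub-peaks separated by three empty positions.
profile = [0, 5, 9, 4, 0, 0, 0, 3, 7, 0]
sub_clusters = split_hard_boundaries(profile, min_gap=2)
merged = split_hard_boundaries(profile, min_gap=4)
```

    Every position belongs to at most one sub-cluster here, so intertwined TSS distributions of the kind the reviewer mentions would not be separated by such a scheme.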

    • Both Dreos et al (PLOS Comp Biol 2016) and Haberle et al. (2014) show that one stable element of a broad promoter is the positioning signal of its first downstream nucleosome. As seen very convincingly in both Drosophila and zebrafish, the dominant TSS position of the broad promoter is highly predictive of the position of first downstream nucleosome and its underlying positioning sequence, and the most plausible interpretation is that there is an "optimal" distance from nucleosome for transcriptional initiation, resulting in the dominant (i.e. most often used) TSS position. In mammals, broad promoters are even broader than in those two species and might have multiple nucleosome positioning signals they can use. In such cases, mutations in one of the nucleosome positioning signals, or indels changing the spacing between the nucleosome and the part of sequence that contains TSS, might lead to differential use of one nucleosome signal vs other. This would be compatible with the authors' observations in low variability promoters that decomposed promoters are used to different extents in different individuals.

    We thank the reviewer for this excellent suggestion. In the revised manuscript, we have analyzed both the preference of the distance between the dominant TSS and the downstream (+1) nucleosome and the positional fuzziness of that nucleosome. We observe a clear separation between low variable multimodal promoters with highly correlated decomposed promoters and those with low correlated decomposed promoters. Interestingly, those with low correlated decomposed promoters show a much less restrictive +1 nucleosome positioning with higher fuzziness, in contrast to what we would expect from broad CGI promoters having a reported fixed +1 nucleosome positioning. While this may be unexpected, it fits well with a model of how a flexible nucleosome positioning architecture can allow differential usage of decomposed promoters. Our results suggest that an array of underlying nucleosome positioning configurations exists for these promoters across single cells, which causes fuzzy nucleosome positioning and may allow for a competition between initiation sites, which provide robustness through their compensatory usage. Interestingly, we find that these results are consistent when analyzing the relationship between transcription initiation and nucleosome positioning within a single individual. This suggests that there is an inherent mechanism of flexibility in TSS usage in these robust promoters even when there is no differential influence of genetic variants. However, to what extent TSS preference is affected by nucleosome positioning, or whether nucleosome positioning reflects TSS usage, remains unclear. We believe these results further strengthen our general conclusions and thank the reviewer for this constructive suggestion of a new analysis.

    • If we were to look for sources of difference other than the actual sequence architecture, some differences between regulated and unregulated promoters can be explained by the key difference: the regulation of regulated genes comes from outside the core promoter; the regulation of housekeeping genes is largely dependent on the intrinsic activity of the core promoter itself. This way, for example, in the absence of a causative variant in the promoter itself, the observed variability in the SIX3 promoter might not be encoded in the promoter itself - instead, enhancer responsiveness might be encoded in the promoter, and the variability itself could be due to an enhancer that can be hundreds of kilobases away. Such a scenario combined with broad promoter would likely result in decomposed promoters that are highly correlated across individuals - because they are both externally controlled by the same regulatory inputs.

    These thoughts are very much in line with our own ideas on how enhancers may influence expression variation. Here, we aimed to investigate variability from a promoter perspective and we are confident that we observe several promoter features associated with low variability. Describing these, we agree that it is important to speculate also on the added contributions by distal elements. We now acknowledge the likely added contribution by enhancers in the Discussion:

    “The promoter sequence may also encode a promoter’s intrinsic enhancer responsiveness (Arnold et al., 2017), which may influence its expression variability. Although current data cannot distinguish between direct or secondary effects, an increased variability mediated via enhancers is supported by a higher dependency on enhancer-promoter interactions for cell-type specific genes compared to housekeeping genes (Furlong and Levine, 2018; Schoenfelder and Fraser, 2019). However, compatibility differences between human promoter classes and enhancers only result in subtle effects in vitro (Bergman et al., 2022), suggesting that measurable promoter variability is likely a result of both intrinsic promoter variability and additive or synergistic contributions from enhancers. Directly modeling the influence and context-dependency of enhancers on promoter variability would therefore be important to further characterize regulatory features that may amplify gene expression variability.”

    Reviewer #2 (Public Review):

    This manuscript by Einarsson and colleagues in the Andersson lab examined how genetic variability across a population impacts both gene expression and promoter architecture in a human population. The authors generate new CAGE data in 108 lymphoblastoid cell lines (LCLs). The authors' analysis is focused on defining how DNA sequence and promoter architecture correlate with population-variation in expression across this cohort. In general, there is a lot that I like about this manuscript: The dataset will be an extremely valuable resource for the genomics community. Furthermore, the biological findings are often thoughtful and potentially interesting and significant for the community. The analysis is generally very strong and is clearly conducted by a lab that has a lot of expertise in this area. My main concerns are centered around the often unwarranted implication that DNA sequence or promoter features cause differences in variation at different genes.

    We are pleased that the reviewer is as excited as we are about the unique dataset, the rigorous analyses performed, and the biological results. In our revised manuscript we have followed the recommendations by the reviewer and:

    ● Toned down implied causal relationships and added additional interpretations to our results, including YR positional preferences

    ● Performed additional analyses on nucleosome positioning of low variable promoters, as well as genetic association testing for decomposed promoter expression

    In all, we believe these revisions substantially improved our manuscript and even strengthened our previous conclusions.

  2. eLife assessment

    This paper by Einarsson and colleagues presents a comprehensive analysis of how human genetic variability impacts both gene expression and promoter architecture. Using a new resource of CAGE data in lymphoblastoid cell lines from 108 individuals, they uncover a series of features that distinguish promoters with highly variable expression across individuals from those exhibiting low variability. The authors propose various explanations for the observed results. A few additional analyses and a more pragmatic interpretation of their data may help consolidate or refine the models proposed.

  3. Reviewer #1 (Public Review):

    Einarsson et al have produced CAGE data from EBV-immortalised lymphoblastoid cells from more than a hundred individuals from two genetically diverse African populations (YRI and LWK), and used it to study how sequence variation affects the activity of promoters at the level of expression variability and at the level of transcription start site usage within promoters across individuals.

    The dataset is very exciting, and the analyses were performed carefully and described well. The results show that promoters in the genome vary a lot with respect to their expression variability across individuals and that their level of variability is closely associated with their biological function and their sequence and architectural features. These results are often confirmatory - it is well established that promoters have different architectures associated with different sequence elements, different types of gene regulation and even differences across individual cells. In general, the multifarious observations boil down to one key distinction:

    - Regulated genes have promoters that look and act differently from those of housekeeping genes.

    While this is unsurprising, the authors then proceed to analyse other underlying differences between low variability (mostly housekeeping) and high variability (overwhelmingly regulated) promoters. Several observations have alternative and sometimes more elegant explanations if some of the previously worked out properties of housekeeping vs regulated promoters are taken into consideration:

    - The authors are keen to interpret the architectural features of ubiquitously expressed (housekeeping) promoters as selected for robustness against mutations in ensuring stable and steady expression levels. However, there are some known facts about both housekeeping and regulated promoters that make alternative interpretations plausible.

    - When discussing broad promoters, the authors disregard the well known fact that the most commonly used transcription start positions are those with YR sequence at (-1,+1) position. Any mutation within the span of broad promoter cluster that removes an existing YR or introduces a new one has the capacity to change both the TSS distribution pattern and overall level of expression of that promoter - but only slightly. This way, broad promoters can be viewed as adaptation not for robustness but for ability to take many mutations with small effect size that will drive any _positive_ selection smoothly across a changing fitness landscape.
    - Indeed, the main property of low variability promoters is that there isn't a single nucleotide change (either substitution or indel) that can substantially change their activity. (In that they are clearly different from e.g. TATA-dependent promoters, where one change can abolish TBP binding or deprive the promoter of a YR dinucleotide at a suitable distance from the TATA box.) This is achieved by their dependence on broad and weak sequence signatures such as GC composition and nucleosome positioning signal. However, most such genes are not known to have a strict requirement for dosage control. On the contrary, dosage seems to be much more critical for the functional classes that in the authors' analysis show variable expression.
    - Whether it is a removal of YR dinucleotide, introduction of a new one, or the change of nucleosome positioning, it seems that the transcription level from housekeeping, low variability promoters is unaffected, or at least affected mildly enough that it is not within the statistical power of the CAGE data across different individuals to detect the difference. Rather than robustness, it can be interpreted as competition - the architecture recruits preinitiation complex at a fairly constant rate, and it is the different YR positions that "compete" for serving as transcription initiation position, with the CAGE signal reflecting the relative effectiveness of each position in that competition. If one of the YR dinucleotides is removed, often the other, neighbouring ones will be used instead. The same might happen for potential multiple nucleosome positioning signals - if one becomes less efficient at stopping a nucleosome, another will be used more often.
    - The fact that decomposed parts of housekeeping promoters add up to approximately the same expression level across individuals even when they are uncorrelated suggests that they might actually be anticorrelated - indeed, the UFSP2 plot in Figure 4E looks like the two decomposed promoters are anticorrelated. That would argue against the independence of the decomposed promoters - indeed it may again point to "competition" where the decrease in use of one will simply shift most initiation events to the other.
    - In general, not everything is a result of direct evolutionary selection, and that is what should have clear landmarks of purifying selection. On the contrary, promoters, especially housekeeping promoters, have vastly different nucleotide and dinucleotide compositions across Metazoa, both at large and at relatively short distances, which means they can undergo concerted evolution as a group, which means they should be "robust" to mutations in a way that allows them to change much more and more rapidly than some other promoter architectures - especially TATA-dependent architectures whose key elements and spacing between them haven't substantially changed for more than a billion years, and possibly longer.

    - While housekeeping promoters are broad but mostly not among the broadest, regulated promoters can be either broad or narrow. This is also known - while narrow promoters are overrepresented for tissue-specific and non-CGI promoters, promoters of Polycomb-bound developmental genes are often broad and have large CpG islands; the latter may account for some of the broadest CAGE clusters observed in the data. It would be an interesting finding if both TATA-dependent and developmental promoters were found to be variable across individuals in a non-trivial way (the trivial way being the variability due to larger dynamic range of their expression - e.g. the expression of SIX3 in many cell types is basically zero, while the dynamic range of RPL26L1 is very limited) - this should be checked by analysing them separately.

    - While broad promoters can be decomposed into subclusters with differential expression across individuals, the authors do not seem to allow for the decomposition of intertwined TSS positions within the cluster, but rather postulate hard boundaries between subclusters. This is different from e.g. overlapping maternal and zygotic promoter use (Haberle et al Nature 2014), where the distribution of the used TSS positions is different but the clusters can overlap.

    - Both Dreos et al (PLOS Comp Biol 2016) and Haberle et al. (2014) show that one stable element of a broad promoter is the positioning signal of its first downstream nucleosome. As seen very convincingly in both Drosophila and zebrafish, the dominant TSS position of the broad promoter is highly predictive of the position of first downstream nucleosome and its underlying positioning sequence, and the most plausible interpretation is that there is an "optimal" distance from nucleosome for transcriptional initiation, resulting in the dominant (i.e. most often used) TSS position. In mammals, broad promoters are even broader than in those two species and might have multiple nucleosome positioning signals they can use. In such cases, mutations in one of the nucleosome positioning signals, or indels changing the spacing between the nucleosome and the part of sequence that contains TSS, might lead to differential use of one nucleosome signal vs other. This would be compatible with the authors' observations in low variability promoters that decomposed promoters are used to different extents in different individuals.

    - If we were to look for sources of difference other than the actual sequence architecture, some differences between regulated and unregulated promoters can be explained by the key difference: the regulation of regulated genes comes from outside the core promoter; the regulation of housekeeping genes is largely dependent on the intrinsic activity of the core promoter itself. This way, for example, in the absence of a causative variant in the promoter itself, the observed variability in the SIX3 promoter might not be encoded in the promoter itself - instead, enhancer responsiveness might be encoded in the promoter, and the variability itself could be due to an enhancer that can be hundreds of kilobases away. Such a scenario combined with broad promoter would likely result in decomposed promoters that are highly correlated across individuals - because they are both externally controlled by the same regulatory inputs.

  4. Reviewer #2 (Public Review):

    This manuscript by Einarsson and colleagues in the Andersson lab examined how genetic variability across a population impacts both gene expression and promoter architecture in a human population. The authors generate new CAGE data in 108 lymphoblastoid cell lines (LCLs). The authors' analysis is focused on defining how DNA sequence and promoter architecture correlate with population-variation in expression across this cohort. In general, there is a lot that I like about this manuscript: The dataset will be an extremely valuable resource for the genomics community. Furthermore, the biological findings are often thoughtful and potentially interesting and significant for the community. The analysis is generally very strong and is clearly conducted by a lab that has a lot of expertise in this area. My main concerns are centered around the often unwarranted implication that DNA sequence or promoter features cause differences in variation at different genes.