Cis-regulatory variants affect gene expression dynamics in yeast

Curation statements for this article:
  • Curated by eLife

    eLife logo

    Evaluation Summary:

    The authors use RNAseq in yeast hybrids to study the effect of cis-variation on evolutionary divergence in gene expression and expression dynamics. Importantly, some of the findings are further confirmed using reporter assays. This is a clever and efficient approach that allows obtaining a genome-wide view of how cis-sequence variation affects expression. What sets this study apart from previous work is that the authors use hybrids across different genetic distances, separate expression levels and dynamics by sampling across different time points during an environmental shift, and also investigate 3' sequences. The main conclusions confirm that SNPs and InDels both affect gene expression as well as dynamics, and that on average, InDels have larger effects compared to SNPs, especially on expression dynamics. Moreover, the results also reflect negative selection on expression levels, with the effect of some cis mutations compensated by other cis variation, which ultimately results in complex interactions between the different cis-acting polymorphisms. Together, the results further our understanding of how cis sequence variation supports divergence in gene expression levels and dynamics.

    (This preprint has been reviewed by eLife. We include the public reviews from the reviewers here; the authors also receive private feedback with suggested changes to the manuscript. The reviewers remained anonymous to the authors.)

This article has been Reviewed by the following groups

Read the full article

Abstract

Evolution of cis -regulatory sequences depends on how they affect gene expression and motivates both the identification and prediction of cis -regulatory variants responsible for expression differences within and between species. While much progress has been made in relating cis -regulatory variants to expression levels, the timing of gene activation and repression may also be important to the evolution of cis -regulatory sequences. We investigated allele-specific expression (ASE) dynamics within and between Saccharomyces species during the diauxic shift and found appreciable cis -acting variation in gene expression dynamics. Within-species ASE is associated with intergenic variants, and ASE dynamics are more strongly associated with insertions and deletions than ASE levels. To refine these associations, we used a high-throughput reporter assay to test promoter regions and individual variants. Within the subset of regions that recapitulated endogenous expression, we identified and characterized cis -regulatory variants that affect expression dynamics. Between species, chimeric promoter regions generate novel patterns and indicate constraints on the evolution of gene expression dynamics. We conclude that changes in cis -regulatory sequences can tune gene expression dynamics and that the interplay between expression dynamics and other aspects of expression is relevant to the evolution of cis -regulatory sequences.

Article activity feed

  1. Author Response:

    Reviewer #1:

    In this manuscript Shi and Fay investigate how natural genetic variation in cis-regulatory sequences impact gene expression dynamics, using budding yeast as a model. Much work in the field, including some landmark studies from this laboratory, have focused on allele specific expression. By contrast, relatively few have investigated the impact of natural genetic variation on the kinetics of gene expression, as the authors do here during the diauxic shift using both inter- and intra-specific hybrids. Strikingly, they find that ASE dynamics are more strongly associated with insertions and deletions than ASE levels. Using reporter assays the authors test which promoter regions and individual variants are sufficient to produce the observed dynamics of gene expression. By investigating chimeric promoter regions between species, the authors gain insight into constraints on the evolution of gene expression dynamics. This manuscript addresses an important question, the findings are novel, and the methods are appropriate. I have a couple of suggestions that I hope the authors will agree can improve their work.

    1. Line 124: I understand the focus on regulatory regions, but post transcription regulation of transcript stability can arise from many mechanisms. RNA binding proteins frequently interact with regions within an open reading frame. I understand the complications of considering coding mutations, but why exclude synonymous polymorphisms within ORFs, for example? At a bare minimum it should be noted in the text.

    We included all variants, synonymous or otherwise, within coding regions. We now state this in the methods. Coding region results were excluded from Figure 2, but are included in Table S4.

    1. In what is otherwise an exceptionally clear manuscript it took some time to understand on line 157 precisely how the 334 'regions' were defined from the 1,818 CREs. Some extra sentences would be very helpful to guide the reader here, perhaps with a figure panel to scaffold the logic.

    Some regions were excluded due to overlap with upstream genes. We have now stated this in the methods: "The intra-specific and inter-specific libraries respectively represented 334 and 452 regions upstream of 69 and 98 genes after removing regions that overlapped with upstream genes, and contained a total of 7,268 and 7,232 synthetic CRE sequences." We also modified the text in the results to indicate that the total number of CREs comes from the number of variants in the 334 regions. "The total library contained 1,818 CREs with four barcode replicates per CRE, and included all variants within 334 regions upstream of the 69 genes."

    1. In figure 4 the scale of the x-axis (time) is confusing. Most of the plots don't seem to start at t=0, but it is impossible to tell from the labeling. Because the timepoints highlighted also differ depending on the message being plotted, which is of course natural, interpreting differences in slope, etc. becomes confusing. The authors should either replot with the origins at t=0 or clearly indicate that there is a break in the axis.

    There are no breaks in the x-axes. We felt it would be misleading to put all the plots on the same time-scale, i.e. with t=0 being the first point. The reason is that glucose depletion occurs at a different time in the RNA-seq and CRE-seq experiments, both of which are shown in Figure 4. We have now added an arrow to indicate the approximate time of glucose depletion in both Figure 4 and Figure 5 in order to provide some indication of the time differences.

    1. Line 209 and 210 - I understand that the PhastCons scores did not improve the association between upstream polymorphisms and ASE dynamics, but it would be nice to hear a bit more from the authors about what this might mean. The observation is restated in the discussion but again mostly without any speculation about what it might mean before moving on to the discussion of technical limitations. If the result is true what might it mean?

    We have modified the discussion to clarify this issue: "Beyond technical differences, the absence of association with conserved sequences and binding sites could be related to differences in cis- regulatory variants underlying ASE levels versus dynamics, to the strains used in each study, or to our smaller sample size. Strain differences may be relevant since we used variants between two wild strains Oak and ChII, whereas Renganaath et al. (2020) used a wine and laboratory strain, the later of which has evolved under relaxed selection and has more deleterious variants (Gu et al., 2005; Doniger et al., 2008). Consistent with a sample size explanation, we found that PhastCons conservation scores improved the odds ratios from genome-wide logistic regression for SNPs with ASE levels and dynamics (Table S6)." Note that the last sentence has been changed to report improvement of odds ratios rather than significance of those ratios.

    Reviewer #2:

    Weaknesses:

    First, the results in the first half of the paper are not overly surprising. They boil down to "genetic variation does influence expression dynamics". This is not unexpected, given genetic variation has been shown to influence just about any cellular process studied so far. As such, the paper essentially confirms the existence of a phenomenon whose existence was not really in doubt. Fortunately, the work into causal variants in the second half of the paper does provide additional insight.

    Second, the results are somewhat descriptive. This is not uncommon for genomics work, but does leave the reader wondering how exactly a given variant may alter gene expression dynamics, especially if it neither occurs at a conserved site nor drastically changes transcription factor binding. I do understand that a deep dive into individual causal variants is outside of the already impressive scope of this paper. I nevertheless hope that one impact of this work will be future mechanistic studies of some of these variants.

    We acknowledge both of these weaknesses. Our goal was not to demonstrate the existence of expression dynamics but to determine whether patterns of variation in expression levels and dynamics are similar. While these results are descriptive we felt they were necessary to complete before testing whether cis-regulatory variants or their associated features (conservation and binding sites) differed between genes with ASE levels versus dynamics. We have edited the discussion to better put our work into perspective.

    Third, the statistical model to infer ASE strikes me as suboptimal (line 420). From how I understand the Methods section, allelic read counts are transformed to an allele frequency. This frequency is assumed to be 0.5 in the absence of ASE. ASE is then modeled as deviation from 0.5, using a linear model. This last point seems problematic. First, frequencies can only range from 0 and 1, whereas a basic linear model would be allowed to infer frequencies outside of this range. It is not clear to me that this model can properly capture the bounded nature of these data. Second, RNA-Seq data is count based, and transforming to an allele frequency loses information about the accuracy of each measurement. Specifically, genes with few reads have less power due to more stochastic counting noise. Third, the choice of weighting observation simply by the raw read counts (line 422) seems ad hoc and should be justified. More broadly, the authors could have opted for more established, count-based analysis strategies for ASE data, such as binomial tests or more advanced frameworks (e.g. beta-binomial tests as in https://www.biorxiv.org/content/10.1101/699074v2).

    We examined and have now included estimates of our false positive rate based on permutation re-sampling of the data. "Ten permutations of the data were used to validate the statistical cutoffs. Permuting the counts for the two alleles independently at each time-point yielded an average of between 0.3 and 2.0 false positive across the five hybrids at an FDR cutoff of 0.01 for the test of ASE levels. Permuting the time-points yielded an average of between 2.3 and 7.7 false positives across the five hybrids at an FDR cutoff of 0.01 for the test of ASE dynamics." To provide a comparison with a count based test we used DESeq2 to test for differences in ASE levels for the hybrid with S. paradoxus. We found 2,970 genes at an FDR cutoff of 1%, slightly more than the 2,930 genes found with the frequency test. The majority of these genes (2,530) were significant for both tests. Thus, we find that our existing statistics are valid, but we agree that count based method could be more powerful at detecting ASE levels than the frequency based test that we use. Our rationale for using the frequency based tests is as follows: We found no count based method that could detect auto-correlations whereas the Durbin- Watson auto-correlation test is applicable to allele frequencies. We wanted to use as similar a statistical framework as possible for the two tests of levels and dynamics and so we used allele frequency for both tests. To enable at least some means of taking into account the number of counts underlying the frequencies, we used counts as weights in both the Durbin-Watson test for ASE dynamics and the linear model test for levels.

    Fourth, there is only one biological replicate per hybrid, creating the risk that this one observation of the given time course may not be biologically representative. This also raises questions about how the linear model (see above) was fit without replicate data.

    For the linear model each time-point was used as a replicate measure of allele frequency. We agree that certain aspects of the data may be specific to the single time course we used. However, there are number of reasons we believe our results are biologically representative. First, we see similar patterns in each of the three intra-specific hybrids. This is the most direct evidence that the time-courses are biologically representative. Before addressing other evidence, we'd like to point out that all the technical error in the experiment (extraction, library preparation, sequencing, etc) is independent and statistically accounted for. Second, most but not all biological variation between time courses will affect both alleles. Such biological variation would include slight differences in rates of glucose depletion, rates of metabolism or other variations in the culture that easily affect gene expression. A greater concern is whether there is biological variation that differs between the two alleles. Stochastic noise in expression is a good example, is known to be common, and could cause allele differences to extend over time since RNA decay is not immediate. However, noise alone should not cause allele-specific differences at the population level since we measured the average across many cells and stochastic noise in expression is independent across cells. In summary, significant allele differences are unlikely to be specific to a single time course, although we recognize that the magnitude and time over which they change may differ between independent time courses. We did consider replication of the time-course. However, we found that the time-point at which glucose was depleted varied in replicate cultures by more than 15 minutes, which would have made it difficult to accurately align replicates based on glucose depletion.

    My final comments (these are not weaknesses but more discussion points) are about the analyses relating the number of sequence differences at a given gene to its strength of ASE (starting at line 120). The authors report significant associations, in line with previous studies. However, it is worth pointing out that this analysis makes an implicit assumption that there are multiple causal variants with effects in the same direction such that adding each variant would increase the ASE difference. The analyses cannot account for the case of multiple causal variants with effects in opposite directions. In this case, even a large number of variants could result in no net ASE. The authors' observation that the association between the number of variants and ASE is strongest for the most closely related strain pair (line 139) may be explained by this scenario. If there are many causal variants that cancel each other, having fewer variants in closely related strains reduces the opportunity for such cancellation. Given these considerations, it is actually somewhat surprising that there is any association between the number of variants at a gene and its ASE.

    We agree that the results of the logistic regression depend on divergence, and we have now added that the effects of multiple variants could cancel each other out: "Association between ASE and divergence may be weak or absent if most substitutions between species do not affect gene expression or if there are many substitutions that affect expression but they have random effects that cancel each other out." However, this appears to only occur in the inter-specific hybrids where the number of variants becomes so large that it becomes uninformative. Empirically we find that the increase in the probability of ASE with the number of variants is linear for the intra- specific hybrids (Figure 2-figure supplement 2). Thus, while this effect may be present it is not strong enough to eliminate the logistic regression signal from the intra-specific hybrids.

    Along similar lines, the authors' point (line 226 and end of the Discussion) that inter-species chimeras should lie between the two parental species unless there are epistatic interactions misses the possibility that there could be multiple causal variants with effects in different directions. Additive combinations of these may well create phenotypes more extreme than the parents. For example, say the distal promoter of a given gene has accumulated five variants that all increase expression by the same amount x, and the proximal promoter has accumulated four variants that each decrease expression by the same amount x. The net difference between species would be an increase of one x. A chimera that only has the five distal variants would show a difference of 5x without needing to evoke epistasis.

    We agree. We were assuming no relationship between the effects of the alleles and their position. Upon reflection this is not a good assumption and have revised the text accordingly:

    "Expression driven by chimeric sequences may lie within the range of the two parental species and can be used to map parental differences to the proximal or distal portion of the cis-regulatory region (Figure S8). However, chimera expression may also lie outside of the parental range if recombination brings together variants with effects in the same direction or if there are epistatic interactions between variants. Such cis-regulatory interactions are thought to be common due to binding site turnover (Zheng et al., 2011), and do not require expression divergence between the parental species."

  2. Evaluation Summary:

    The authors use RNAseq in yeast hybrids to study the effect of cis-variation on evolutionary divergence in gene expression and expression dynamics. Importantly, some of the findings are further confirmed using reporter assays. This is a clever and efficient approach that allows obtaining a genome-wide view of how cis-sequence variation affects expression. What sets this study apart from previous work is that the authors use hybrids across different genetic distances, separate expression levels and dynamics by sampling across different time points during an environmental shift, and also investigate 3' sequences. The main conclusions confirm that SNPs and InDels both affect gene expression as well as dynamics, and that on average, InDels have larger effects compared to SNPs, especially on expression dynamics. Moreover, the results also reflect negative selection on expression levels, with the effect of some cis mutations compensated by other cis variation, which ultimately results in complex interactions between the different cis-acting polymorphisms. Together, the results further our understanding of how cis sequence variation supports divergence in gene expression levels and dynamics.

    (This preprint has been reviewed by eLife. We include the public reviews from the reviewers here; the authors also receive private feedback with suggested changes to the manuscript. The reviewers remained anonymous to the authors.)

  3. Reviewer #1 (Public Review):

    In this manuscript Shi and Fay investigate how natural genetic variation in cis-regulatory sequences impact gene expression dynamics, using budding yeast as a model. Much work in the field, including some landmark studies from this laboratory, have focused on allele specific expression. By contrast, relatively few have investigated the impact of natural genetic variation on the kinetics of gene expression, as the authors do here during the diauxic shift using both inter- and intra-specific hybrids. Strikingly, they find that ASE dynamics are more strongly associated with insertions and deletions than ASE levels. Using reporter assays the authors test which promoter regions and individual variants are sufficient to produce the observed dynamics of gene expression. By investigating chimeric promoter regions between species, the authors gain insight into constraints on the evolution of gene expression dynamics. This manuscript addresses an important question, the findings are novel, and the methods are appropriate. I have a couple of suggestions that I hope the authors will agree can improve their work.

    1. Line 124: I understand the focus on regulatory regions, but post transcription regulation of transcript stability can arise from many mechanisms. RNA binding proteins frequently interact with regions within an open reading frame. I understand the complications of considering coding mutations, but why exclude synonymous polymorphisms within ORFs, for example? At a bare minimum it should be noted in the text.

    2. In what is otherwise an exceptionally clear manuscript it took some time to understand on line 157 precisely how the 334 'regions' were defined from the 1,818 CREs. Some extra sentences would be very helpful to guide the reader here, perhaps with a figure panel to scaffold the logic.

    3. In figure 4 the scale of the x-axis (time) is confusing. Most of the plots don't seem to start at t=0, but it is impossible to tell from the labeling. Because the timepoints highlighted also differ depending on the message being plotted, which is of course natural, interpreting differences in slope, etc. becomes confusing. The authors should either replot with the origins at t=0 or clearly indicate that there is a break in the axis.

    4. Line 209 and 210 - I understand that the PhastCons scores did not improve the association between upstream polymorphisms and ASE dynamics, but it would be nice to hear a bit more from the authors about what this might mean. The observation is restated in the discussion but again mostly without any speculation about what it might mean before moving on to the discussion of technical limitations. If the result is true what might it mean?

  4. Reviewer #2 (Public Review):

    Summary and Strengths:

    In this manuscript, the authors set out to determine the degree to which genetic variation among yeast strains and species influences gene expression during a large, genome-wide change in gene expression. While many studies have examined genetic influences on static expression levels, much less attention has been paid to dynamic responses. We know that different environmental conditions trigger different cellular states that engage different gene expression programs. We also know that DNA variation can have different effects in these cellular states. What we know much less about is how genetic influences shape the transition from one transcriptional state to the other.

    The authors addressed this gap with a comprehensive genomics study. First, they quantified "allele specific expression" (ASE) in several diploid hybrids among strains of the yeast Saccharomyces cerevisiae as well as hybrids between S. cerevisiae and two sister species. RNA sequencing in such hybrids can distinguish RNA molecules produced from the two parental genomes. When there are cis-acting variants influencing a given gene, their effects become detectable as a difference between the expression levels of the two parental alleles.

    The main innovation of this work is that the authors profiled ASE along a time series during a shift from fermentative yeast growth to respiration. During this shift, gene expression changes substantially. Using their time-resolved ASE profiling strategy, the authors were able to track when and how genetic differences influence these changes. This experimental design is a major strength of the paper. It is strengthened further by the inclusion of several hybrids and by dense temporal sampling. Overall, the authors succeeded in their goal of quantifying dynamic ASE.

    Second, the authors used high-throughput reporter assays to study the effects of individual DNA variants in several hundred cis-regulatory elements (CREs). Interestingly, the CREs they studied were able to capture ASE dynamics at least in part, even though the reporter system was integrated into a common locus that probably differs from the chromatin state at the native genes. This use of a complementary genomics approach is another major strength of this paper.

    One highlight result from the high-throughput assays is that many cis-regulatory elements contained multiple causal variants. Another thought-provoking result is that causal variants were neither more likely to occur at conserved nucleotides nor to cause more severe disruption of transcription factor binding sites than other variants. This result is somewhat counterintuitive given the well-established ability of conservation to mark functionally important nucleotides. As the authors state, this absence of evidence may be due to the fact that only a few handful of causal variants were found, limiting the statistical ability to detect more subtle differences in conservation or transcription factor binding sites. On the other hand, the results clearly show that there is no simple code for determining causal variants from available annotations. As the authors state, this is in line with earlier observations that much of cis-regulatory DNA variation could be evolutionarily neutral, perhaps because the effects it has on most genes are not large enough to matter for fitness. These two results are additional strengths of the paper.

    Together, the paper contains an impressive amount of work. I greatly enjoyed the complementary use of ASE and reporter assays. The experiments seem to have been executed well, and are described succinctly and clearly. The paper is an interesting overview of the effects of cis-regulatory variants on dynamic gene expression change. Its main impact on the field lies in the clear demonstration that dynamic ASE exists, as well as its quantification.

    Weaknesses:

    First, the results in the first half of the paper are not overly surprising. They boil down to "genetic variation does influence expression dynamics". This is not unexpected, given genetic variation has been shown to influence just about any cellular process studied so far. As such, the paper essentially confirms the existence of a phenomenon whose existence was not really in doubt. Fortunately, the work into causal variants in the second half of the paper does provide additional insight.

    Second, the results are somewhat descriptive. This is not uncommon for genomics work, but does leave the reader wondering how exactly a given variant may alter gene expression dynamics, especially if it neither occurs at a conserved site nor drastically changes transcription factor binding. I do understand that a deep dive into individual causal variants is outside of the already impressive scope of this paper. I nevertheless hope that one impact of this work will be future mechanistic studies of some of these variants.

    Third, the statistical model to infer ASE strikes me as suboptimal (line 420). From how I understand the Methods section, allelic read counts are transformed to an allele frequency. This frequency is assumed to be 0.5 in the absence of ASE. ASE is then modeled as deviation from 0.5, using a linear model. This last point seems problematic. First, frequencies can only range from 0 and 1, whereas a basic linear model would be allowed to infer frequencies outside of this range. It is not clear to me that this model can properly capture the bounded nature of these data. Second, RNA-Seq data is count based, and transforming to an allele frequency loses information about the accuracy of each measurement. Specifically, genes with few reads have less power due to more stochastic counting noise. Third, the choice of weighting observation simply by the raw read counts (line 422) seems ad hoc and should be justified. More broadly, the authors could have opted for more established, count-based analysis strategies for ASE data, such as binomial tests or more advanced frameworks (e.g. beta-binomial tests as in https://www.biorxiv.org/content/10.1101/699074v2 ).

    Fourth, there is only one biological replicate per hybrid, creating the risk that this one observation of the given time course may not be biologically representative. This also raises questions about how the linear model (see above) was fit without replicate data.

    My final comments (these are not weaknesses but more discussion points) are about the analyses relating the number of sequence differences at a given gene to its strength of ASE (starting at line 120). The authors report significant associations, in line with previous studies. However, it is worth pointing out that this analysis makes an implicit assumption that there are multiple causal variants with effects in the same direction such that adding each variant would increase the ASE difference. The analyses cannot account for the case of multiple causal variants with effects in opposite directions. In this case, even a large number of variants could result in no net ASE. The authors' observation that the association between the number of variants and ASE is strongest for the most closely related strain pair (line 139) may be explained by this scenario. If there are many causal variants that cancel each other, having fewer variants in closely related strains reduces the opportunity for such cancellation. Given these considerations, it is actually somewhat surprising that there is any association between the number of variants at a gene and its ASE.

    Along similar lines, the authors' point (line 226 and end of the Discussion) that inter-species chimeras should lie between the two parental species unless there are epistatic interactions misses the possibility that there could be multiple causal variants with effects in different directions. Additive combinations of these may well create phenotypes more extreme than the parents. For example, say the distal promoter of a given gene has accumulated five variants that all increase expression by the same amount x, and the proximal promoter has accumulated four variants that each decrease expression by the same amount x. The net difference between species would be an increase of one x. A chimera that only has the five distal variants would show a difference of 5x without needing to evoke epistasis.