Quantifying how post-transcriptional noise and gene copy number variation bias transcriptional parameter inference from mRNA distributions

Abstract

Transcriptional rates are often estimated by fitting the distribution of mature mRNA numbers measured using smFISH (single molecule fluorescence in situ hybridization) with the distribution predicted by the telegraph model of gene expression, which defines two promoter states of activity and inactivity. However, fluctuations in mature mRNA numbers are strongly affected by processes downstream of transcription. In addition, the telegraph model assumes one gene copy, but in experiments cells may have two gene copies as they replicate their genome during the cell cycle. While it is often presumed that post-transcriptional noise and gene copy number variation affect transcriptional parameter estimation, the size of the error introduced remains unclear. To address this issue, here we measure both mature and nascent mRNA distributions of GAL10 in yeast cells using smFISH and classify each cell according to its cell cycle phase. We infer transcriptional parameters from mature and nascent mRNA distributions, with and without accounting for cell cycle phase, and compare the results to live-cell transcription measurements of the same gene. We find that: (i) correcting for cell cycle dynamics decreases the promoter switching rates and the initiation rate, and increases the fraction of time spent in the active state, as well as the burst size; (ii) additional correction for post-transcriptional noise leads to further increases in the burst size and to a large reduction in the errors in parameter estimation. Furthermore, we outline how to correctly adjust for measurement noise in smFISH due to uncertainty in transcription site localisation when introns cannot be labelled. Simulations with parameters estimated from nascent smFISH data, corrected for cell cycle phases and measurement noise, lead to autocorrelation functions that agree with those obtained from live-cell imaging.

Article activity feed

  1. Note: This rebuttal was posted by the corresponding author to Review Commons. Content has not been altered except for formatting.

    Learn more at Review Commons


    Reply to the reviewers

    The authors do not wish to provide a response at this time.

  2. Note: This preprint has been reviewed by subject experts for Review Commons. Content has not been altered except for formatting.

    Referee #3

    Evidence, reproducibility and clarity

    Summary:

    In this work, the authors investigate the effect of using mature mRNAs instead of only nascent mRNA (located at the transcription site) when estimating transcriptional kinetics parameters from single-molecule fluorescent in situ hybridization (smFISH) experiments. The authors find that using nascent mRNA and correcting for cell cycle effects yields more accurate parameter estimates than using mature mRNAs. The authors perform smFISH experiments of the GAL10 gene in yeast to test their findings. The authors also test different methods to obtain parameter estimates in cases where there is no information about the location of the transcription site.

    Major comments:

    1. The authors make multiple claims of novelty that conflict with work described in some of their references, particularly: Skinner et al., eLife, 2016; Xu et al., Nature Methods, 2015 and Physical Review Letters, 2016 (References #26, 27 and 24 in their manuscript). I could find several instances where the scope of their claims was unclear. Below I describe some cases:

    a. The title of this paper, "accurate inference of stochastic gene expression from nascent transcript heterogeneity" could also be the summary conclusion of the three works cited above. However, later in the Introduction of the manuscript, the authors state that their goal is to "understand the impact of post-transcriptional noise and cell-to-cell variability on the accuracy of transcriptional parameters inferred from mature mRNA data," a related yet different topic. I would change the title of the manuscript to reflect their main goal better.

    b. I would make their claims of novelty more specific. For example, at the end of the abstract, the authors claim that "our novel data curation method yields a quantitatively accurate picture of gene expression." Quantifying nascent mRNA using smFISH to obtain transcription kinetic parameters has been done before (the references above are an example), as has developing the modeling tools to do so (for example, in Xu et al., Physical Review Letters, 2016). What, exactly, is the novelty in their approach? They need to make that explicit or soften their claims.

    c. In the Introduction, when discussing the effect of the cell cycle in parameter estimation, they write: "Since estimation of all transcriptional parameters (...) from nascent data as a function of the cell cycle phase has not been reported". However, the work they reference (Skinner et al., eLife, 2016) shows such measurements for multiple transcriptional parameters for different cell cycle stages. The original work may not have gone as far as the current work, but it is unclear what has been done before from the way the authors describe earlier literature.

    d. The authors develop a new formulation of the delay telegraph model to obtain kinetic parameters from the nascent RNA copy number statistics. They state in the SI that "Similar delay models have also been studied by other authors," however, the authors do not explain in which way their model differs from previous work. Does their approach have advantages over previously published models?

    2. There is a particular choice during their analysis that I find problematic. In section 2.3, the authors state "The transcription site is counted as 1 mRNA, regardless of its intensity, but has a negligible influence since the mean number of mature mRNA is much greater than 1" (the number should be spelled out). It is unclear whether that statement is true for all possible kinetic parameters. It is also hard to evaluate that claim because the authors do not show images of transcription sites that would support it. Trying to find more information, I saw images from previous work from one of the authors ("Optimized protocol for single-molecule RNA FISH to visualize gene expression in S. cerevisiae", figure 4). Those images suggest that the opposite is the case: in the cell shown, the number of mRNAs in the transcription site is not negligible but instead seems to account for most of the mRNAs in the cell. Solving this problem would require the authors to redo their analysis without making this assumption.

    3. Overall, I think the current experiments are sufficient to support their claims. Also, the description of methods and references is appropriate to allow other researchers to reproduce their observations. Finally, the experiments are replicated, and enough cells are analyzed to provide enough statistical significance to their claims.

    Minor comments:

    1. In section 2.1.3, the authors mention using an optimization package written in the Julia programming language. A reference to the package needs to be included, either an academic article or the website of the package.

    2. In the discussion, the authors state "In addition, live-cell measurements include cells in S phase, which are excluded in smFISH." I do not think that statement is correct. One would expect that a large enough sample of cells assayed with smFISH will contain a subpopulation of cells in S phase.

    3. I find the overall presentation of figures and the analysis performed not optimal to convey their points. Below are some suggestions regarding presentation (and in some cases, analysis).

    Text suggestions:

    a. The meaning of the word "inference" seems to change across the manuscript. In the title, I understand that inference means "estimation," or more explicitly, estimating model parameters from experimental or simulated data. However, in the methods section, the authors write "Mature mRNA inference" and "Nascent mRNA inference." Do they mean "Estimating/Inferring model parameters from synthetic/experimental mature/nascent mRNA datasets"?

    b. In the Introduction, the authors use three different terms for cell cycle (cell cycle position, cell cycle stage, and cell cycle phase). It is unclear to me if they are referring to the same concept.

    Presentation suggestions:

    c. I would remove Figure 2C and put it in the Supplementary information. It shows procedure details that are not fundamental to understanding their claims.

    d. I would also relegate the tables for their six datasets in figures 1 and 2 to the Supplementary material. Tables are not a very effective way to present information.

    e. I do not think that figures 1c and 2d are needed. Comparing the results from stochastic simulations and the predictions from the models is an internal control that the researchers should do to test the accuracy of their SSA implementation; it does not convey a message related to the main conclusions of their work.

    f. I like figure 4a; it conveys one of the main points: not correcting for cell cycle can lead to considerable errors in parameter estimation. I would like to see a similar plot that conveys the difference in parameter estimation when using nascent vs. mature mRNA.

    g. Why do the authors have table 1 separated from figure 4 while adding the tables to figures 1 and 2? I would be consistent and move all tables to the supplementary material.

    Significance

    As described above, some claims do not seem novel considering the references in this manuscript. This is not a problem; the authors can soften their claims to novelty without compromising their other claims. Previous works that estimated mRNA transcription kinetic parameters by quantifying nascent mRNA recognized that using mature mRNA would incur parameter estimation errors. They considered it evident that quantifying the process closer to the transcription site would improve estimates. Similarly, it was also apparent that adding missing information (the gene copy number based on cell cycle information) would improve parameter estimates. That is why it is unnecessary for the authors to present those arguments as findings. However, it is true that here the authors are interested in the level of error, not the fact that getting more accurate (or relevant) measurements will improve estimates.

    An item that the authors may want to emphasize is their finding that it is possible to correct for measurements where the identity of the transcription site is unknown. All the works that they cite in which nascent mRNA is measured use some method to localize the position of the transcription site. In mammalian cells and fly embryos, it is possible to label introns to identify mRNA located at the transcription site. That is not possible for many yeast genes or in other microorganisms.

    Which audience would be most interested in this work? I think those searching for methods to quantify transcriptional kinetics in organisms where the identity of the transcription site cannot be measured by smFISH or other novel methods such as Cas-FISH.

    I performed studies of transcriptional kinetics in bacteria during my doctorate, and I continue utilizing smFISH in my research.

    Referees cross-commenting

    I agree with the assessment from the other reviewers. One of reviewer 2's requests (to perform simulations covering the parameter space) is particularly relevant given the main goals of the authors. All reviewers noted that the method used to quantify the number of RNA at the transcription site has shortcomings that need to be addressed.

  3. Note: This preprint has been reviewed by subject experts for Review Commons. Content has not been altered except for formatting.

    Referee #2

    Evidence, reproducibility and clarity

    In the manuscript, Fu and co-authors compare the accuracy of two models that infer transcription kinetics from synthetic and experimental data. Specifically, they compare the telegraph model for mRNA and the delayed telegraph model for nascent RNA. They first provide the comparison for synthetically simulated data, and conclude that the latter exhibits higher accuracy. Next they apply the models to experimental smFISH data for a PP7-GAL10 strain, and provide a framework to estimate the number of mRNAs and to use the intensity at the transcription site to infer the number of bound polymerases during transcription (nascent RNA). For the latter, I appreciate that they account for the fact that the intensity throughout transcription will depend on the 'spatial' position of the polymerase and incorporate this into the framework to infer nascent RNA levels. Additionally, for the experimental data they infer kinetics with and without accounting for cell cycle (i.e. one or two gene copies, respectively), and by comparing to live imaging data from Donovan et al., 2019, they suggest that the model that best describes the experimental data is the delayed telegraph model for nascent RNA when the cell cycle is accounted for. Finally, they provide two approaches, called rejection and fusion, to account for potential artifacts in the estimation of nascent RNA levels from the intensity at transcription sites, and compare how these approaches affect the overall fit.

    Whereas it is important to have a systematic understanding/comparison for both models as well as for how accounting of cell cycle might improve the overall accuracy, some of the aspects of the results/estimation of values from experimental data require more thorough analysis. Specifically, below I describe points to be addressed:

    Major points:

    Comparison of the models for simulated data. In the first two sections of the results the authors compare simulations/parameter inference from the synthetic data for the telegraph-based model for mRNA and the delayed telegraph model for nascent RNA, and conclude that the latter provides better accuracy. However, based on the relationship between the mean relative error and fON, it seems to me that both models show very similar results, and the support for better accuracy with nascent RNA is unclear to me. Additionally, simulations are performed for a limited number of parameter sets, and it is unclear how well/uniformly the chosen sets cover the parameter space. I suggest that a more thorough analysis is required. One way to do so would be to perform simulations on the same set of parameters that comprehensively cover the parameter space for both models and compare mean error rates in pairwise fashion. Additionally, it might be worth comparing the error rate for each parameter separately (i.e. for sigma-on, sigma-off and the production rate of mRNAs when the promoter is on).
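
    The pairwise, per-parameter comparison suggested above could be organised along the following lines. This is only a sketch: the function name `relative_errors`, the array layout (rows are parameter sets, columns are sigma-on, sigma-off and the production rate) and all numbers are made up for illustration and are not taken from the manuscript.

```python
import numpy as np

def relative_errors(true_params, estimated_params):
    """Absolute relative error of each estimate, element by element."""
    true = np.asarray(true_params, dtype=float)
    est = np.asarray(estimated_params, dtype=float)
    return np.abs(est - true) / true

# Hypothetical values: rows = parameter sets, columns = (sigma_on, sigma_off, rho).
true_sets = [[1.0, 2.0, 10.0], [0.5, 0.5, 20.0]]
mature_fit = [[1.5, 2.0, 12.0], [0.5, 1.0, 10.0]]    # placeholder estimates from mature mRNA
nascent_fit = [[1.1, 2.2, 10.0], [0.5, 0.6, 18.0]]   # placeholder estimates from nascent mRNA

err_mature = relative_errors(true_sets, mature_fit)
err_nascent = relative_errors(true_sets, nascent_fit)

# Paired comparison on identical parameter sets, kept separate for each parameter.
paired_difference = err_mature - err_nascent
frac_nascent_better = (err_nascent < err_mature).mean(axis=0)
print(frac_nascent_better)
```

    Keeping the comparison paired (same parameter sets for both models) and per parameter avoids averaging away cases where one parameter is well constrained and another is not.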

    An additional analysis of the accuracy of the estimated values from the experimental data. When it comes to experimental data, the overall fit of any proposed model will depend on both the suitability/correctness of the model to explain the process in question and the reliability of the estimates (inputs for the model) from the experiments. Specifically, it is possible that a model (either telegraph for mRNA or delayed telegraph for nascent RNA or both) explains transcriptional kinetics fairly accurately, but the input estimates (for mRNA or nascent RNA, respectively) are biased (due to technical artifacts from the experiment and/or the approach used to estimate those values), thus affecting the overall fit of the model and the interpretation of the results.

    I appreciate that the authors address one potential artifact in estimating nascent RNA, where the intensity of nascent RNA may be overestimated if it is mistakenly confused with mRNA. I suggest that a more detailed analysis of the accuracy of both the number of mRNA molecules and the intensity of nascent RNA is required to provide better insight into how reliably those values are estimated and, accordingly, whether the models might perform poorly due to biased estimates.

    Specifically, I am wondering about the following aspects:

    Mature mRNA: A more detailed methods section covering the estimation of the background signal and spot detection is needed. Close proximity of mRNA molecules could result in underestimation of the total number of mRNAs, and it is unclear how this might affect the fit of the telegraph model. Even though smFISH has been widely used to estimate the number of mRNA molecules (as a total number of spots), the technique has mostly been applied to mammalian cells, which are considerably bigger. Additionally, using the total number of mRNA molecules to estimate transcriptional kinetics from the telegraph model seemingly requires a highly accurate estimate of that number. Taken together, it is not obvious whether potential underestimation of mRNAs via smFISH in budding yeast cells (specifically in cells with a high number of mRNAs) might lead to a misleading interpretation of the results. One way to assess whether such 'merging' takes place is to look at the distribution of intensities of cytoplasmic spots (per cell and/or for all the cells in the whole field of view). If those distributions frequently show bi/multi-modal behavior, it is worth considering whether the proposed way to estimate mRNA number is suitable for the given model organism/growth conditions/gene, and further extending the analysis on simulated data to assess the robustness of the fit of the telegraph model for mRNAs in cases where the number of mRNAs is underestimated. A more minor issue: the authors state that, for each cell, the nuclear spot with the highest intensity will count as one mRNA, and that this has a negligible influence. I would appreciate a more thorough analytical explanation for this, or an additional analysis on simulated data showing how a random +/-1 in mRNA counts might affect the results of the fit, specifically for cases with a low average mRNA estimate.

    Nascent RNA: I might be missing something, but it seems that for cells in late G2 phase, where the nucleus is either strongly elongated (and looks like an hourglass) or even appears as two separate nuclei connected by a chromatin bridge, the two copies of the gene can be spatially resolved. Might it therefore happen that two independent, separate brightest spots (one per cell) both contribute to the total estimate of nascent RNA in cases where the promoter is on simultaneously in both copies? If so, the probability of simultaneous transcription will vary with the sigma-on/off values (whether estimated in the study or taken from prior literature), and should this not be taken into account? Might this also partially explain the lower transcriptional activity in G2, which is currently attributed to dosage compensation? Or are those cells considered as two cells in G1? If so, this needs to be specified in the text. Additionally, I suggest that microscopy images be provided as a supplement to clarify how the cell cycle phase, the number of mRNAs and the intensity of nascent RNA were estimated.
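
    To make the question about simultaneous transcription concrete, here is a minimal sketch under the telegraph model. It assumes (this is an assumption, not something stated in the manuscript) that the two gene copies switch independently with identical rates, with $\sigma_{\mathrm{on}}$ denoting the off-to-on switching rate and $\sigma_{\mathrm{off}}$ the on-to-off switching rate:

```latex
\begin{align}
f_{\mathrm{ON}} &= \frac{\sigma_{\mathrm{on}}}{\sigma_{\mathrm{on}} + \sigma_{\mathrm{off}}}, \\
P(\text{both copies on}) &= f_{\mathrm{ON}}^{2}, \\
P(\text{exactly one copy on}) &= 2 f_{\mathrm{ON}} \left(1 - f_{\mathrm{ON}}\right), \\
P(\text{at least one copy on}) &= 1 - \left(1 - f_{\mathrm{ON}}\right)^{2}.
\end{align}
```

    For example, with comparable switching rates ($f_{\mathrm{ON}} \approx 0.5$) both copies would be active about a quarter of the time, so two resolved transcription sites would not be rare under these assumptions.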

    Additional experimental validation and/or discussion of the accuracy of the inference for a different range of parameters. The analysis of the experimental data consists of a single condition (i.e. galactose concentration, glu/galactose ratio; I presume highly comparable with Donovan et al., 2019), resulting in a single parameter set for transcriptional kinetics. Specifically, it is estimated that sigma on and off will be comparable for the given setup, and therefore, based on simulated data, the estimates will be somewhat reliable for the cell-cycle-accounted delayed telegraph model for nascent RNA. I wonder how in practice (i.e. estimated from the experiments) the same model will perform for a different set of parameters/different conditions. Ideally, I would suggest performing a similar experiment in which sigma on/sigma off is expected to be different. One way to achieve this with the GAL10/galactose setup is to tune the glu/gal ratio of the media. Even without a comparison to live-cell tracing, the analysis of estimated parameters for merged and cell-cycle-specific data can shed light on how suitable the model is for alternative parameters. Alternatively, if the experiment is currently not feasible, I would appreciate a more extensive discussion of the practical suitability of the cell-cycle-specific delayed telegraph model for nascent RNA for alternative sets of transcriptional parameters. Considering that the comparison was performed only against the 'simple' telegraph model, and that in the introduction the authors mention a variety of 'improved' models for mRNA that account for various sources of heterogeneity, those models might be more suitable for alternative sets of transcriptional parameters, and might be more suitable than the cell-cycle-specific delayed telegraph model for nascent RNA.

    Overall, the main statements of the paper - that cell cycle specific inference from the experimental data using delayed telegraph model from nascent RNA performs best (compared to telegraph model from mRNA or not cell cycle specific) are supported, and I agree that understanding of the limitations of the currently popular models (telegraph for mRNA and/or not accounting for cell cycle) is an important addition to the field. I would be happy to further proceed with the revision/acceptance of the paper if the comments above are addressed/considered.

    Minor comments:

    The current methods section lacks a description of the growth media, which is an important aspect to specify when it comes to budding yeast (particularly when the sugar source is different from the standard glucose and/or results are compared to another publication). In figure 2b I find the cartoon a little misleading: specifically, why is polymerase bound when the promoter is off? If it is to illustrate the case where transcription/polymerase binding occurred after the promoter is switched off, why are there no polymerases to the right of the current one (as in the case where the promoter is on)? In table 1 there is a typo in the 2nd meta-row; I suspect it should say G2?

    Significance

    This paper is somewhat outside my core expertise, although closer to the expertise of my postdoc who assisted with the review.

    The work is interesting but the generalisability of the conclusions is somewhat limited, partially by the lack of experimental validation. Nevertheless, there are interesting aspects of the study and the area of research is important.

  4. Note: This preprint has been reviewed by subject experts for Review Commons. Content has not been altered except for formatting.

    Referee #1

    Evidence, reproducibility and clarity

    Summary:

    In this study, the authors consider the problem of inferring transcription dynamics from smFISH data. They distinguish between two important experimental situations. The first one considers measurements of mature mRNAs, while the second one considers measurements of nascent mRNA through fluorescent probes targeting PP7 stem loops. The former problem has been previously dealt with extensively, but less work has been done in the context of the latter. The inference approaches are based on maximum likelihood estimation, from which point estimates for promoter-switching and transcription rates are obtained. The study focuses on steady state measurements only. The authors perform several analyses using synthetic data to understand the limitations of both approaches. They find that inference from nascent mRNA is more reliable than inference from mature mRNA distributions. Moreover, they show that accounting for different cell-cycle stages (G1 vs G2) is important and that pooling measurements across the cell-cycle can lead to quantitatively and even qualitatively different inferences. Both approaches are then used to analyze transcription in an experimental system in yeast, for which they find evidence of gene dosage compensation. I consider this an interesting and relevant study, which will appeal to the systems- and computational biology community. The paper is well written and the (computational) methods are described in detail. The experimental description is quite minimal and could profit from further details / explanations. I have several technical criticisms and questions, which I believe should be addressed before publication. Since I am a theorist, I will comment predominantly on the statistical / computational aspects.

    Major comments/questions:

    -A key reference that is missing is Fritzsch et al. Mol Syst Biol (2018). In this work, the authors have used nascent mRNA distributions and autocorrelations (obtained from live-imaging) to infer promoter- and transcription dynamics. I believe this work should be appropriately cited and discussed.

    Synthetic case study:

    -Inference and point estimates. The authors use a maximum-likelihood framework to extract point estimates of the parameters. Subsequently, relative absolute differences are used to assess the accuracy of the inference. However, as far as I have understood, this is performed for only a single simulated dataset for each considered parameter configuration. The resulting metric therefore does not really capture the inference accuracy, since it is based on a single (random) realization of the MLE. I would recommend at least repeating the inference multiple times for different realizations of the simulated dataset (per parameter configuration) to get a better feeling for the distribution of the MLE (e.g., its bias / variance). Alternatively, identifiability analyses based on the Fisher information could be performed for (some of) the different parameter configurations, although this may be computationally more demanding.
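
    As an illustration of the repeated-inference check, the sketch below estimates the bias and spread of an MLE over many simulated datasets. It deliberately swaps in a simpler model than the one used in the manuscript: the negative binomial distribution, which is the bursty limit of the telegraph model, fitted by a grid search over the shape parameter. The function names, the grid and the parameter values are all hypothetical.

```python
import math
import numpy as np

def nb_loglik(counts, r, mean):
    """Log-likelihood of a negative binomial with shape r and the given mean."""
    p = r / (r + mean)  # success probability in NumPy's parameterisation
    values, freq = np.unique(counts, return_counts=True)
    ll = 0.0
    for k, n in zip(values, freq):
        k = int(k)
        ll += n * (math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
                   + r * math.log(p) + k * math.log(1.0 - p))
    return ll

def fit_nb(counts, r_grid):
    """Grid-search MLE; for fixed r the MLE of the mean is the sample mean."""
    mean = counts.mean()
    ll = [nb_loglik(counts, r, mean) for r in r_grid]
    r_hat = r_grid[int(np.argmax(ll))]
    return r_hat, mean / r_hat  # (shape, mean burst size)

def mle_spread(r_true, burst_true, n_cells, n_repeats, seed=0):
    """Bias and standard deviation of the MLE over independent simulated datasets."""
    rng = np.random.default_rng(seed)
    r_grid = np.linspace(0.05, 10.0, 400)
    p = 1.0 / (1.0 + burst_true)  # gives mean count r_true * burst_true
    fits = np.array([fit_nb(rng.negative_binomial(r_true, p, size=n_cells), r_grid)
                     for _ in range(n_repeats)])
    truth = np.array([r_true, burst_true])
    return fits.mean(axis=0) - truth, fits.std(axis=0, ddof=1)

# Made-up configuration: 20 independent datasets of 500 cells each.
bias, sd = mle_spread(r_true=2.0, burst_true=5.0, n_cells=500, n_repeats=20)
print("bias:", bias, "sd:", sd)
```

    Reporting the bias and standard deviation per parameter configuration, rather than one relative error from one dataset, separates systematic error from sampling variability.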

    -It would be useful to include confidence intervals based on profile likelihoods also for the synthetic case study, in particular for the 6 reported datasets. I would also find it helpful to see comprehensive profile likelihood plots for the key results / parameter inferences in the supplement. This would also provide useful insights into the identifiability of the parameters.

    Experimental case study:

    -Validation against live-cell data. In the simulation of the autocorrelation function, what was the ratio of cells initialized in G1 / G2, respectively? I'd expect this to have direct influence on the simulated ACF. Moreover, a linear fit is used to correct for "non-stationary effects" in the ACF that supposedly stem from cell-cycle dynamics. First, I don't think this terminology is really accurate, since non-stationarity would lead to an ACF that depends on two parameters (tau_1 and tau_2). I suppose the goal of the linear correction is to remove slow / static population heterogeneity? If yes, wouldn't it be easier / more direct to also change the simulations to non-synchronized cell-cycles? In this case, they should also display the very slow / static components as displayed in the data, which would eliminate the need for the post-hoc correction. I was also wondering whether other statistics (e.g., mean, variance, distributions) match between the simulations and the live-cell experiment? This could provide further validation of the inferred parameters.
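
    The effect of static population heterogeneity on a pooled ACF can be illustrated with synthetic traces. The sketch below is not the authors' simulation or the live-cell analysis pipeline: it uses made-up AR(1) traces with a fixed per-cell offset, and the function name `pooled_acf` and all parameter values are assumptions chosen for illustration.

```python
import numpy as np

def pooled_acf(traces, max_lag, per_trace_mean):
    """Normalised autocorrelation pooled over cells.

    traces has shape (cells, timepoints). Each trace is centred either on its
    own mean or on the mean of the whole population.
    """
    x = np.asarray(traces, dtype=float)
    centre = x.mean(axis=1, keepdims=True) if per_trace_mean else x.mean()
    x = x - centre
    n = x.shape[1]
    cov = np.array([np.mean(x[:, : n - lag] * x[:, lag:]) for lag in range(max_lag + 1)])
    return cov / cov[0]

rng = np.random.default_rng(1)
n_cells, n_t, phi = 200, 500, 0.8
noise = rng.normal(size=(n_cells, n_t))
fluct = np.zeros((n_cells, n_t))
for t in range(1, n_t):
    fluct[:, t] = phi * fluct[:, t - 1] + noise[:, t]  # fast, decaying fluctuations

# Static cell-to-cell differences: one constant offset per cell.
traces = fluct + rng.normal(scale=2.0, size=(n_cells, 1))

acf_global = pooled_acf(traces, max_lag=30, per_trace_mean=False)
acf_local = pooled_acf(traces, max_lag=30, per_trace_mean=True)
print(acf_global[30], acf_local[30])
```

    In this toy setting the population-centred ACF plateaus well above zero at long lags, while the per-trace-centred ACF decays to roughly zero, which is the kind of slow / static component that a post-hoc correction would otherwise have to remove.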

    -If I understood correctly, the signal intensity of the measured transcription spot is normalized by the median cytoplasmic spot brightness. Since the normalized intensity of a single complete transcript is 1, the cumulative intensity should give a lower bound on the nascent mRNAs. The histograms in Fig. 4b show intensity values in the range of 30, which would mean that at least 30 transcripts contribute to the transcription spot. The total number of nucleoplasmic and cytoplasmic mRNA, however, is in the range of 10 (Fig. 3a). I am probably missing something but how can we reconcile these numbers? The authors mention that the brightest spot just counts for one transcript, but argue that this has negligible influence on mature RNA counts. Could this be a possible explanation for the mismatch?
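
    The arithmetic behind this question can be written out as follows. The sketch assumes, as described above, that the transcription-site intensity is divided by the median cytoplasmic spot intensity, and that each nascent transcript contributes at most one unit of normalised intensity. The function names and the intensity values are hypothetical.

```python
import math
import numpy as np

def normalised_site_intensity(site_intensity, cytoplasmic_spot_intensities):
    """Transcription-site intensity in units of one mature mRNA spot."""
    unit = float(np.median(cytoplasmic_spot_intensities))
    if unit <= 0:
        raise ValueError("median cytoplasmic spot intensity must be positive")
    return site_intensity / unit

def min_nascent_transcripts(normalised_intensity):
    """Smallest transcript count compatible with the site intensity,
    assuming each nascent transcript contributes at most one unit."""
    return max(0, math.ceil(normalised_intensity - 1e-9))

spots = [95.0, 100.0, 105.0, 98.0, 102.0]  # made-up cytoplasmic spot intensities
site = 3000.0                              # made-up transcription-site intensity
norm = normalised_site_intensity(site, spots)
print(norm, min_nascent_transcripts(norm))
```

    Under these assumptions a normalised intensity of 30 implies at least 30 transcripts at the site, which is the number that needs to be reconciled with a mature mRNA count of around 10.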

    Minor comments:

    -In the experimental case study, the authors argue that the "correct" inference result is the one that accounts for cell-cycle stage, while the other one is termed "incorrect". I find this terminology too strong, since every estimate is subject to uncertainty.

    -Page 2: "... in a asynchronous population" -> "... in an asynchronous population"

    -Page 7: "...parameters sets 3 and 4" -> "...parameter sets 3 and 4"

    -Figures 5a and 6a: parameter names and units should go on the y-axis.

    Significance

    Quantifying kinetic parameters from incomplete and noisy experimental data is a core problem in systems biology. I therefore consider this manuscript to be very relevant to this field. The contribution of this manuscript is largely methodological, although its potential usefulness is demonstrated using experimental data in yeast.