Fitness effects of CRISPR endonucleases in Drosophila melanogaster populations

Curation statements for this article:
  • Curated by eLife

    Evaluation Summary:

    The issue of general fitness effects in organisms expressing Cas9 enzymes as part of gene drive genetic control strategies is important, particularly in the emerging field of vector control. This manuscript reports experiments aimed at teasing apart such effects in a Drosophila model system, providing evidence that off-target effects predominate, which may be ameliorated by utilising high-fidelity nucleases, but a more detailed analysis of data and justification for some of the assumptions, especially some direct evidence of off-target cleavage, are still needed to support the authors' inferences. It is currently also not entirely clear how the lines were generated and tested. Finally, additional modelling to include scenarios where the initial frequency of the drive allele is very low (as would be the case for an actual release) would help to strengthen the conclusions.

    (This preprint has been reviewed by eLife. We include the public reviews from the reviewers here; the authors also receive private feedback with suggested changes to the manuscript. Reviewer #2 agreed to share their name with the authors.)

Abstract

Clustered Regularly Interspaced Short Palindromic Repeats (CRISPR)/Cas9 provides a highly efficient and flexible genome editing technology with numerous potential applications ranging from gene therapy to population control. Some proposed applications involve the integration of CRISPR/Cas9 endonucleases into an organism’s genome, which raises questions about potentially harmful effects to the transgenic individuals. A particularly relevant example is CRISPR-based gene drives conceived for the genetic alteration of entire populations. The performance of such drives can strongly depend on fitness costs experienced by drive carriers, yet relatively little is known about the magnitude and causes of these costs. Here, we assess the fitness effects of genomic CRISPR/Cas9 expression in Drosophila melanogaster cage populations by tracking allele frequencies of four different transgenic constructs that allow us to disentangle ‘direct’ fitness costs due to the integration, expression, and target-site activity of Cas9, from fitness costs due to potential off-target cleavage. Using a maximum likelihood framework, we find that a model with no direct fitness costs but moderate costs due to off-target effects fits our cage data best. Consistent with this, we do not observe fitness costs for a construct with Cas9HF1, a high-fidelity version of Cas9. We further demonstrate that using Cas9HF1 instead of standard Cas9 in a homing drive achieves similar drive conversion efficiency. These results suggest that gene drives should be designed with high-fidelity endonucleases and may have implications for other applications that involve genomic integration of CRISPR endonucleases.

Article activity feed

  1. Author Response

    Reviewer #1 (Public Review):

    The goal of the work was to test for direct and indirect fitness costs associated with specific types of constructs that could be used for gene drive. The authors conclude that there are no direct fitness costs associated with the presence and expression of either Cas9 or the guide RNAs but that the Cas9 is causing off-target cuts that result in loss of fitness. They also conclude that a newer form of Cas9 doesn't cause these off-target cuts. While the goal of this study is important, there are many caveats associated with the work as reported, and these limit interpretation of the results. Many of the caveats are pointed out in the discussion.

    1.a) I am specifically concerned by the fact that from what I read, a company made the transgenic lines and that there was only one transgenic line per treatment. Unless the fly line used for the insertion was completely homozygous for the chromosome where the insertion was made, the lines could have differed in fitness, due to somewhat deleterious recessives captured in one G1 but not another. This cost could have persisted for a number of generations after the crosses were made, especially in the high frequency "releases". This may not have been a real problem, but without any replication it is difficult to know.

    We apologize that this was unclear in our initial submission. We did in fact generate several transgenic lines of each construct and used independently obtained lines for each of our population cages, except for the Cas9_gRNAs construct, where four lines were used in seven population cages (replicates 1 to 4 were founded with the same line). All of these were also crossed to w1118 flies before we obtained homozygous lines, so the impact of deleterious alleles would have been minimized. We have edited the section “Generation of transgenic lines” in the Methods to clarify this.

    We also examined the possibility of fitness effects being caused by such alleles in our maximum likelihood analysis (assuming they are unlinked from the construct — otherwise they should have appeared as direct fitness effects). This model was not a good match for the data, nor was the model with direct fitness effects. Based on these results, we consider it unlikely that such deleterious alleles had a major impact on the observed frequency trajectories in our cage populations.

    1.b) My concern is reinforced by the fact that the no-Cas9, no-gRNA line goes up in frequency for the first 5 generations and then becomes stable in frequency. The loss of the fitness advantage is consistent with a fitness effect partially linked to the insertion site in that one cross but not others.

    Both of these cages were made with independent lines. We agree with the reviewer that the increase in frequency of the no-Cas9_no-gRNAs construct at the beginning of the experiment seems surprising at first. However, if an initial fitness advantage was truly driving the dynamics of this construct, we would expect that the “initial off-target model” (where fitness costs originated before the experiment) should have yielded the highest model quality in our maximum likelihood analysis, since we also allowed advantageous cut off-target alleles (i.e., fitness estimates > 1) in this model. While the maximum likelihood fitness estimate in the “initial off-target model” indeed exceeded the reference value of 1, its 95% confidence interval still included a fitness value of 1, and a neutral model actually yielded the lowest AICc value (i.e., best model quality, Table 3). We think that one possible explanation for this apparent initial frequency increase is that population cages tend to undergo larger than average fluctuations in the first one or two generations due to the smaller initial population size and potential health differences between founding fly lines (which can persist for a generation or two). We briefly note this in the manuscript methods section.
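    As a minimal sketch of the kind of model comparison described above (not the code used for the manuscript), the corrected Akaike information criterion can be computed from a model's maximized log-likelihood, its number of free parameters, and the number of observations. The log-likelihoods, parameter counts, and number of generation transitions below are hypothetical placeholders, not values from Table 3.

```python
def aicc(log_likelihood: float, n_params: int, n_obs: int) -> float:
    """Corrected Akaike information criterion; lower values indicate better model quality."""
    if n_obs - n_params - 1 <= 0:
        raise ValueError("AICc requires n_obs > n_params + 1")
    aic = 2 * n_params - 2 * log_likelihood
    # Small-sample correction term that penalizes extra parameters more strongly.
    correction = (2 * n_params * (n_params + 1)) / (n_obs - n_params - 1)
    return aic + correction


# Hypothetical maximized log-likelihoods and parameter counts (illustration only).
candidate_models = {
    "neutral": {"log_likelihood": -120.0, "n_params": 1},
    "initial off-target": {"log_likelihood": -119.5, "n_params": 2},
}
n_transitions = 40  # hypothetical number of generation transitions

scores = {
    name: aicc(m["log_likelihood"], m["n_params"], n_transitions)
    for name, m in candidate_models.items()
}
best_model = min(scores, key=scores.get)
print(scores, best_model)
```

    With these placeholder numbers, the small gain in log-likelihood of the two-parameter model does not offset its extra parameter, so the simpler model receives the lower AICc.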

    1.c) It is important to note that the starting points are cages with separate vials of the control and experimental strain. Even a small difference in development time of the two strains in the first generation could lead to an excess of homozygotes in the next generation.

    We agree. In our maximum likelihood framework, such differences in development time should show up as a viability difference (fraction of offspring that made it to adulthood in the time window of our experiment). We now note in our revised manuscript that fitness differences between genotypes could be due to longer development time rather than an increase in the juvenile death rate in Cas9_gRNAs carriers. In the “Phenotypic fitness assays” section of our revised manuscript, we additionally state that “longer development time of individuals carrying the Cas9_gRNAs construct would also have appeared as a viability cost in our cage study but not in these fitness assays.”

    1.d) I am also concerned by the fact that the main conclusion is that the decline in frequency in the Cas9-gRNA line is due to off-target cuts, but there was no sequencing to back up that conclusion. In the discussion, this problem is mentioned but dismissed. I don't see how it can be dismissed when this is a major conclusion that remains based on very indirect evidence.

    We thank the reviewer for raising this important concern, which touches on the issue of how our approach differs from previous approaches that sought to directly detect off-target cleavage through sequencing. Our approach, by contrast, seeks to provide a “direct” measurement of the fitness of an allele. While this allows us to avoid the challenging task of detecting off-target mutations in vivo through whole-genome, population-level sequencing (and then predicting their potential effects), it comes at the price that inferences about the molecular nature of these fitness effects will rely on indirect evidence. However, we want to point out that our conclusion of these fitness effects being primarily due to off-target cleavage is based on three independent lines of evidence: (i) The maximum likelihood analysis of the frequency trajectory of the Cas9_gRNAs construct, where statistical model comparison ranked the off-target effect model higher than the direct fitness costs model; (ii) The fact that we inferred fitness costs only for the Cas9_gRNAs construct but not the construct in which Cas9 was replaced with the high-fidelity Cas9HF1 endonuclease (which should have similar expression and thus, similar direct fitness costs); and (iii) The heterogeneity we observed in the frequency trajectories of the Cas9_gRNAs construct in our cages, which is consistent with a model where off-target sites accumulate over the course of the experiment yet is more difficult to reconcile with a model of direct fitness costs.

    Inspired by the reviewer’s recommendation, we wondered whether we may in fact be able to directly detect cuts at a few computationally predicted off-target sites. To this end, we performed Sanger sequencing at six sites that were computationally predicted for our Cas9_gRNAs construct by CRISPR Optimal Target Finder, which unfortunately revealed only wild-type sequences (this analysis is described in the new section “Evaluation of computationally predicted off-target sites”). However, we believe that this does not rule out off-target cutting as the primary driver of fitness costs for the Cas9_gRNAs construct, for the following reasons, which we state in the discussion section of our revised manuscript:

    “For example, our sequencing approach would not have allowed us to detect larger insertion/deletion events, which are frequently observed at on-target sites (48, 49). More likely though, we suspect that cleavage events occurred at other sites than the six computationally predicted ones. Indeed, the predictions by CRISPR Optimal Target Finder are based on cleavage specificity in cell lines, where off-target cutting is known to occur more frequently than in animals (47). All but one of the predicted off-target sites carry combinations of single nucleotide mismatches in the PAM-proximal and the distal region, which could make in-vivo cleavage less likely at these sites. Generally, our results are consistent with other studies that found off-target cleavage to frequently occur at sites which would have been difficult to predict computationally (50).”

    In a sense, our inability to detect any mutated alleles at this small set of computationally predicted off-target sites might actually highlight a key benefit of our approach: It can estimate the potential fitness costs of a construct without having to rely on accurate computational predictions of putative off-target sites or requiring the very costly approach of whole-genome, population-scale sequencing.

    Additionally, we would like to point out that while we found off-target effects to explain the empirical data best, we consider our estimate of the overall magnitude of the fitness costs of the Cas9_gRNAs construct to be one of the main conclusions of our manuscript, together with the fact that these costs were avoided when using the high-fidelity Cas9HF1 endonuclease instead. Thus, even if some readers remain skeptical about the role of off-target cleavage (and we made sure to qualify our claims on this in the Discussion section accordingly), our systematic analysis of the overall fitness effects is more robust and should be of broad interest.

    1.e) When releasing homing gene drives, the initial frequency of the transgenic line is very low, and as in the Garrood et al paper cited, it is possible for the gene drive to outpace the non-target cutting. The modeling does not address what the impact of the presumed fitness costs in this experiment would be for a replacement/suppression drive released at low frequency.

    We thank the reviewer for raising this point. It has led us to add a completely new analysis on the “Effect of off-target fitness costs on gene drive performance”, in which we now show simulation results to illustrate the effect of direct and off-target fitness effects on both modification and suppression homing drives. We have also added more discussion on how these different types of fitness costs may affect other frequency-dependent CRISPR based gene drives.
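    To make the low-frequency release question concrete, the following is a deliberately simplified toy model, not the simulation framework used in the manuscript. It assumes a single locus, random mating, no resistance alleles, a fixed germline conversion rate in heterozygotes, and direct viability costs only (it does not model off-target sites). All function names and parameter values are hypothetical.

```python
def next_drive_freq(g: float, conversion: float, w_het: float, w_hom: float) -> float:
    """Drive allele frequency among gametes of the next generation.

    g          -- drive allele frequency among the gametes forming this generation
    conversion -- fraction of wild-type alleles converted in heterozygote germlines
    w_het      -- relative viability of drive heterozygotes (wild type = 1)
    w_hom      -- relative viability of drive homozygotes (wild type = 1)
    """
    dd, dw, ww = g * g, 2 * g * (1 - g), (1 - g) ** 2  # random-mating zygotes
    mean_w = dd * w_hom + dw * w_het + ww              # mean viability
    # Homozygotes transmit only drive; heterozygotes transmit it with
    # probability (1 + conversion) / 2 instead of the Mendelian 1/2.
    return (dd * w_hom + dw * w_het * (1 + conversion) / 2) / mean_w


def simulate(g0: float, conversion: float, w_het: float, w_hom: float, generations: int) -> list:
    """Deterministic trajectory of the drive allele frequency."""
    trajectory = [g0]
    for _ in range(generations):
        trajectory.append(next_drive_freq(trajectory[-1], conversion, w_het, w_hom))
    return trajectory


# Hypothetical scenarios: a 1% release with 80% conversion.
no_cost = simulate(0.01, 0.8, 1.0, 1.0, 30)
high_cost = simulate(0.01, 0.8, 0.5, 0.25, 30)
print(no_cost[-1], high_cost[-1])
```

    In this toy model, a rare drive increases only if `w_het * (1 + conversion)` exceeds 1, which is one way to see why the size of carrier fitness costs matters most for releases at low initial frequency.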

    Reviewer #2 (Public Review):

    This paper reports a set of Drosophila population cage experiments aimed at quantifying fitness effects associated with the expression of Cas9 gene drive constructs in the absence of homing. The study attempts to deconvolve fitness effects due to the presence of the active nuclease at a genomic location from those that arise from off-target effects elsewhere in the genome: an important issue when considering gene drive strategies in the wild. To distinguish effects due to cleavage at the target site from activity elsewhere in the genome, a construct where Cas9 was replaced with a high fidelity nuclease (Cas9HF1) was employed. The experimental design compares the active nuclease-gRNA constructs targeting a site on another chromosome with no gRNA and reporter only controls, all inserted in the same locus. The Cas9 construct was assayed in 7 replicates with Cas9HF1 and controls assessed as duplicates with cages running for between 8 and 19 generations.

    2.a) There is a lack of clarity in terms of the cage set-up design. The description in the supplementary methods could clarify whether all the replicates came from a single founder and explain the differences in set-up that necessitated ignoring some first generations.

    Thank you for pointing this out. We have thoroughly revised and extended our Methods section on “Generation of transgenic lines” to clarify this point. We now explicitly mention that we generated several transgenic lines of each construct and used independently obtained lines for each of our population cages, except for the Cas9_gRNAs construct, where we used four lines in seven population cages (replicates 1 to 4 were founded with the same line).

    For the cage start conditions, we now note that “To avoid potentially confounding maternal fitness effects on the construct frequency dynamics (which could arise based on minor differences in health or age between the initial batches of flies mixed together), we excluded the first generation of five cage populations…” In general, it is quite common for this to happen in insect population cage studies (please see some examples below) and is always a very short-term effect.

    2.b) The main finding reported from this part of the work is that with the control populations the frequency of the construct remained fairly constant across the generations, but the active nuclease tended to decline. I am somewhat confused by some of the claims here. First, the authors report a "bottoming out" effect where construct frequency declines then levels off: I am not entirely convinced that Figure 2 shows this. For example, comparing replicates 4 and 5 (8 and 16 generations respectively), it looks to me that there is a steady decline at the same rate with no evidence for a plateau. Perhaps replicates 2 and 3 show "some" evidence of leveling. In addition, replicates 4, 5, 6 and 7 have similar construct starting frequencies (particularly 5 and 7, which are only a few % different) yet the former show a steady decline whereas the latter maintain the construct at a steady level. This does not appear to be consistent with the authors' explanation of higher off-target effects in populations carrying high frequencies of the construct. It would be helpful if the authors could more clearly explain the trajectories presented in Figure 2.

    We agree with the reviewer that our initial description of the raw construct frequency dynamics, based solely on visual cues, made overly strong claims (e.g., “different frequency dynamics between single replicates”) without providing quantitative statistical support. This was originally intended as a basic introduction, with our maximum likelihood analysis then providing a more rigorous assessment in the next section. To improve clarity, we have completely restructured this in our revised manuscript. We removed the comparison of Cas9_gRNAs replicates based solely on visual cues, highlighted the general heterogeneity in trajectories among replicates (without making any specific claims), and instead of the vaguely defined “bottoming out” interpretation, we now only mention the average construct frequency change for the Cas9_gRNAs construct. In addition, we now present our more rigorous maximum likelihood analysis of the construct frequency trajectories and statistical model comparison earlier in the Results section, so that all of our conclusions are now based on this statistical analysis, rather than an initial visual inspection of the curves. Please see also our comments to point 3.a) below, as reviewer 3 made very similar comments and suggestions.

    2.c) Utilising the allele frequencies obtained from the cages, 2 locus ML models were applied with the construct insertion site and an idealised off target site. They argue, correctly in my view, that fitness effects can be attributed to off target activity and not cleavage at the 3L target since the Cas9HF1 construct shows no substantive effect. In the models they assume that the presence of Cas9 in the germline (or maternally contributed) will invariably lead to cleavage at the idealised site. The model indicates that the construct insertion per se has no direct fitness costs but that off-target effects may have fitness consequences of approximately 30%, and the authors seek to support this conclusion with simulations. I found this section difficult to follow but I feel that the conclusions are supported.

    We agree with the reviewer that the “Maximum likelihood analysis” section was too dense and therefore challenging to follow, especially for non-expert readers who may not be very familiar with such methods. We have revised and extended this section. In particular, we now also provide a brief summary of the modeling approach at the beginning of the section and have added subsection titles aiming to better guide the reader through the various steps of the analysis. Furthermore, we added a table with an overview of all tested models and highlighted the best-fitting models in tables 2 and 3. We hope that this has improved the clarity of our revised manuscript.

    2.d) Direct phenotypic assays with the active Cas9 nuclease were performed, looking at viability, mating preference and fecundity. Relegating these data to the supplements is not useful. While significant effects are attributed to the Cas9-gRNA construct, the authors cannot rule out a DsRed effect and it is a shame they did not assay at least one of the control constructs. In addition, in their modelling they assume that Cas9 activity will always cleave but see no evidence for this in the heterozygote viability assay. Whether this is due to the difference in rearing conditions that the authors claim is debatable.

    We thank the reviewer for this valuable feedback. As suggested, we have moved the phenotypic assays (Methods & Results) of the Cas9_gRNAs construct to the main part of the revised manuscript. We decided to conduct phenotypic assays only for the Cas9_gRNAs construct, because it was the only one that displayed some fitness costs in our maximum likelihood analysis (in particular, the DsRed construct did not display any fitness costs in the cages). However, given more time and capacity, we agree that additional phenotypic assays would have been desirable (e.g., a larger sample size per construct and additional constructs). Regarding our choice of model for the maximum likelihood analysis, we used a highly simplified off-target approach, which was necessary given the available information.

    2.e) Finally, since the initial cage experiments suggest that the Cas9HF1 enzyme reduces off-target effects they assay this enzyme in a model homing drive, indicating that this enzyme performs as well as the regular Cas9. Again, relegation of these data to supplementary datasets is unhelpful and it would improve the manuscript if these results could be simply summarised in a figure.

    We added a figure at the end of the “Cas9HF1 homing drive” section in the Results showing the gene drive inheritance rate and the resistance allele formation rate in early embryos for the Cas9HF1 and Cas9 homing drives, respectively. The gene drive inheritance rate is the percentage of offspring with DsRed fluorescence when crossing individual gene drive heterozygotes with “wildtype” homozygotes (i.e., not carrying any gene drive allele) and is used to calculate the gene drive conversion rate (i.e., the rate at which wildtype alleles are converted to drive alleles) mentioned in the main text. We hope that this has improved the clarity of our revised manuscript.
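    The relationship between the two quantities can be sketched as follows. This is a minimal illustration, assuming that unconverted heterozygote germlines transmit the drive at the Mendelian rate of 50% and that every converted wild-type allele is transmitted as a drive allele; the function names and offspring counts are hypothetical and not taken from the manuscript.

```python
def inheritance_rate(dsred_offspring: int, total_offspring: int) -> float:
    """Fraction of offspring from drive-heterozygote x wild-type crosses that show DsRed."""
    if total_offspring <= 0:
        raise ValueError("total_offspring must be positive")
    return dsred_offspring / total_offspring


def conversion_rate(inheritance: float) -> float:
    """Fraction of wild-type alleles converted to drive alleles in the germline.

    Under the stated assumptions, inheritance = 0.5 + 0.5 * conversion,
    so conversion = 2 * inheritance - 1.
    """
    return 2.0 * inheritance - 1.0


# Hypothetical counts: 90 DsRed-positive flies among 100 scored offspring.
rate = inheritance_rate(90, 100)
print(rate, conversion_rate(rate))
```

    Under these assumptions, an inheritance rate of 50% corresponds to no conversion and an inheritance rate of 100% corresponds to complete conversion.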

    2.f) Taken together, I think this is a useful study but is presented in a way that is at times impenetrable to the non-expert. More clarity in presenting the cage and modelling data, as well as promotion of figures from supplementary material to the main manuscript would considerably aid the non-expert and provide greater confidence in the interpretations. If these issues could be clarified I feel the work provides a useful addition to the gene drive field and will help those thinking about developing such strategies, particularly relevant are the findings related to the Cas9HF1 enzyme.

    We thank the reviewer for the valuable feedback. We have significantly revised the Results as well as the Discussion, provided additional information on the modeling approach, and shifted supplementary material to the main text of the manuscript. We hope this has improved the overall clarity of the manuscript.

    Reviewer #3 (Public Review):

    The manuscript by Langmuller, Champer and colleagues reports a set of experiments and models investigating the fitness effects of transgenes in Drosophila melanogaster carrying CRISPR components to determine how useful such transgenes may be for population control. This study benefits from well-designed transgene constructs that allow the investigators to distinguish the effects of on-target and off-target Cas9 endonuclease activity, and a sophisticated maximum likelihood modeling framework that allows estimation of the fitness effects of the transgene constructs. The manuscript's major shortcoming is the absence of statistical analysis of the allele frequency data and some potentially unrealistic assumptions that went into the model.

    3.a) My first recommendation is that a statistical analysis of the allele frequency data should be included in the manuscript, rather than inferring patterns solely from visual inspection of the data. Specifically, the manuscript claims that (lines 176-180): "We found Cas9_gRNAs to be the only construct that systematically decreased in frequency across all replicate cages (Figure 2). Interestingly, the allele frequency change was not consistent with fixed direct fitness costs. Instead, the construct frequency "bottomed out" in most replicates, and this occurred more quickly when the starting frequency was higher (Figure 2)." These conclusions regarding allele frequency changes should be supported by statistical analyses. What is the uncertainty surrounding the allele frequency estimates? Some indication of this uncertainty (such as error bars) could be added to Figure 2. Which of the trajectories in Figure 2 show a statistically significant change in allele frequency over the course of the experiment? Is the increase in the frequency of the no-Cas9_no-gRNA replicates significant? What support is there for the claim that the allele frequency changes "bottomed out"? Does a non-linear model fit these data significantly better than a linear trend? What is the evidence that allele frequency decreases slowed earlier "when the starting frequency was higher"? What is the evidence that "replicates 3 and 4 ... had very different frequency dynamics"? While they started at different frequencies, the slope of those two trajectories could be statistically indistinguishable. What is the authors' interpretation of the Cas9_gRNAs replicates 6 & 7 whose trajectories did not decrease?

    We thank the reviewer for this detailed recommendation. We agree that our description of construct frequency dynamics, based solely on visual cues, did indeed make overly strong claims (e.g., regarding “different frequency dynamics”) without providing enough statistical support for these specific statements. We had originally thought that some readers would prefer that we first provide such a qualitative description of the allele frequency trajectories, prior to going into the mathematically more rigorous (but therefore also more complicated) maximum likelihood inference of fitness costs and statistical model comparison of different selection scenarios (“full inference model” vs. “construct model” vs. “off-target model”, etc.).

    In response to the reviewer’s comments, we decided to completely restructure this first part of the Results section. Specifically, we have removed our comparison of Cas9_gRNAs replicates based solely on visual cues, and also any mention of the admittedly vaguely defined “bottoming out” behavior. Instead, we now only mention the average frequency change for the Cas9_gRNAs construct across all replicates, while highlighting the heterogeneity among replicates. The maximum likelihood analysis is now introduced right after this and has also been revised extensively to improve clarity. We believe that this analysis provides a very powerful framework for the systematic inference of fitness costs and for assessing which of the different selection scenarios best explains our empirical data. This is because it combines the data from all replicates while fully accounting for the heterogeneity among them. For example, it could well be that construct frequency trajectories in individual replicates are not statistically distinguishable from neutral evolution, yet in aggregate, an inferred fitness cost of the construct becomes highly significant. Note that the maximum likelihood framework also provides confidence intervals for its estimates, based on the entirety of the data. So the question of whether a departure from a neutral model is significant comes down to whether the 95% confidence interval surrounding the fitness estimate of the given construct still includes a value of 1 (which it does for the “direct fitness” estimate of the full model, but not for the “off-target fitness” estimate, see Table 2).
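    The point about pooling replicates can be illustrated with a deliberately simplified sketch. It is not the two-locus diploid model of the manuscript: it assumes haploid (genic) selection with a single relative fitness `w`, binomial Wright-Fisher sampling of alleles, a grid search instead of a numerical optimizer, and a likelihood-ratio confidence interval (all values whose log-likelihood lies within 1.92 units of the maximum). The transition counts are hypothetical.

```python
import math


def expected_next_freq(p: float, w: float) -> float:
    """Expected construct frequency after one generation of genic selection."""
    return p * w / (p * w + (1.0 - p))


def log_binom_pmf(k: int, n: int, q: float) -> float:
    """Log of the binomial probability of k successes in n draws."""
    q = min(max(q, 1e-12), 1.0 - 1e-12)  # guard against log(0)
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(q) + (n - k) * math.log(1.0 - q))


def pooled_log_likelihood(w: float, transitions) -> float:
    """Sum the log-likelihood over all generation transitions of all replicates.

    transitions -- iterable of (construct_alleles_before, construct_alleles_after, n_alleles)
    """
    return sum(log_binom_pmf(k1, n, expected_next_freq(k0 / n, w))
               for k0, k1, n in transitions)


def mle_with_ci(transitions, grid):
    """Grid-based maximum likelihood estimate with an approximate 95% interval."""
    lls = [pooled_log_likelihood(w, transitions) for w in grid]
    best = max(lls)
    inside = [w for w, ll in zip(grid, lls) if best - ll <= 1.92]
    return grid[lls.index(best)], min(inside), max(inside)


grid = [0.5 + 0.01 * i for i in range(101)]   # candidate fitness values 0.5 .. 1.5
single = [(100, 89, 200)]                     # one hypothetical transition
pooled = single * 10                          # ten such transitions combined

print(mle_with_ci(single, grid))
print(mle_with_ci(pooled, grid))
```

    With these placeholder counts, the interval from a single transition still contains 1, whereas the interval from the pooled transitions does not, which mirrors the argument that a fitness cost can be significant in aggregate even when individual replicates are not distinguishable from neutrality.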

    Regarding the comment about error bars for the allele frequency trajectories in Figure 2, we want to point out that our construct frequency estimates are actually based on the genotype counts of all adult flies present in the given cage experiment at the specific time point. We therefore did not include uncertainty estimates in Figure 2, nor did we include sampling noise in the maximum likelihood analysis. We have now clarified this in the caption of Figure 2 and in the Methods section (“Maximum Likelihood framework for fitness cost estimation”). We also acknowledge that we still cannot rule out sampling noise completely (for example through escaped flies, phenotyping errors, or loss of frozen flies due to destruction or other issues). However, we expect that the relative contribution of these errors should be negligible compared to drift.
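    For readers who want to reproduce the frequency calculation, a minimal sketch follows. It assumes that homozygotes, heterozygotes, and wild-type flies are counted separately at an autosomal locus; the second function gives the textbook Wright-Fisher standard deviation of one generation of drift, with an effective population size that must be supplied by the user. Function names and numbers are hypothetical.

```python
import math


def construct_frequency(n_hom: int, n_het: int, n_wt: int) -> float:
    """Construct allele frequency from genotype counts of a full-cage census."""
    total = n_hom + n_het + n_wt
    if total <= 0:
        raise ValueError("at least one fly must be counted")
    # Each homozygote carries two construct alleles, each heterozygote one.
    return (2 * n_hom + n_het) / (2 * total)


def drift_sd(p: float, effective_size: float) -> float:
    """Standard deviation of the allele frequency change from one generation of drift."""
    return math.sqrt(p * (1.0 - p) / (2.0 * effective_size))


# Hypothetical census: 10 homozygotes, 20 heterozygotes, 70 wild-type flies.
p = construct_frequency(10, 20, 70)
print(p, drift_sd(p, effective_size=100))
```

    Because every adult fly is scored, the frequency itself carries no sampling error in this calculation; the drift term describes the expected generation-to-generation fluctuation.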

    The reviewer raises an interesting question: Why did the Cas9_gRNAs construct frequency not decrease in the two replicates with the highest construct starting frequency (replicate 6 and 7)? A possible explanation could be that — given a limited set of off-target sites — cut off-target alleles that impose a fitness cost will accumulate and start to independently segregate from the construct alleles very quickly in populations where the construct has a high starting frequency (and thus a higher overall rate of cleavage events). We now state this possible explanation in the section on “Construct frequency dynamics suggest moderate off-target fitness costs” of our revised manuscript.

    3.b) My second recommendation involves the assumptions that went into the maximum likelihood modeling. In particular, it strikes me as unrealistic to assume that 1) the genome contains only a single off-target site that is entirely responsible for the decrease in fitness due to Cas9 activity; and 2) that the rate of off-target mutation is as high as it is assumed to be ("In individuals that carry a construct, all uncut off-target alleles are assumed to be cut in the germline, which are then passed on to offspring that could suffer negative fitness consequences."). Regarding point 1), isn't a more realistic scenario that there are multiple off-target sites, each with a potentially different fitness consequence resulting from Cas9-induced mutations? If so, doesn't the likelihood that all off-target sites have been cut depend on the number of such sites, as multiple off-target sites should reduce the mutation rate at any single site. This possibility also suggests that there may be multiple loci with potentially deleterious Cas9-induced alleles segregating within the experimental populations. Regarding point 2), even assuming only a few potential off-target sites per genome, it seems like the rate of off-target cutting would have to be unrealistically high to approach mutating all off-target sites in the population. The conversion efficiency of the constructs used here is reported as ~80% and 60% in females and males, respectively; it seems likely that the rate of Cas9 mutation at off-target sites is lower than this efficiency for the target site. These assumptions should be justified or relaxed before claiming that mutational saturation of off-target sites is responsible for a decreasing fitness loss over the course of the experiments (after confirming that there is statistical support for the claim that the allele frequency trajectories bottom out).

    The reviewer raises a very important point: modeling only one off-target site that represents the net fitness effect of Cas9 cleavage outside the target region, together with a cut rate of 100% (i.e., the off-target site is always cut in the presence of Cas9), is highly idealized.

    (1) We agree with the reviewer that in reality, the experimental populations might have a polygenic off-target landscape, where the fitness of cleavage alleles could differ vastly within as well as between loci. However, given the limited number of data points (e.g., n=87 generation transitions for experimental populations with the Cas9_gRNAs construct), it would be extremely difficult if not impossible to disentangle the numerous parameters that would be necessary to describe such a more complex off-target scenario with our modeling approach. We have now highlighted our model choices, potential caveats, and resulting limitations in both the Discussion section and also the section “Construct frequency dynamics suggest moderate off-target fitness costs” in the Results.

    (2) Similar to the single off-target locus, our cut rate of 100 % is an idealized assumption that was chosen with the aim of reducing model complexity. As outlined above, it would be extremely hard to disentangle the cut rate from other parameters (such as the number of target sites if fitness effects are multiplicative across loci). Additionally, we would like to point out that the reported conversion efficiencies (~80% in males, ~60% in females) are not the conversion efficiencies of the constructs in the experimental populations shown in Figure 2, but of separate homing drives with a single gRNA. All constructs in the experimental populations are designed in a way that no homing can occur, and they have four gRNAs if any. We apologize for the confusion. Our revised manuscript now contains a paragraph in the “Cas9HF1 homing drive” section in the Results that highlights the differences between the constructs in the cage populations and the homing drives assessed in this study. Furthermore, we have added an additional figure that displays the individual results of the homing drive (Figure 5); we hope this improves clarity.
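
    The interplay between cut rate and number of off-target sites raised in this exchange can be illustrated with a minimal sketch. It is not the authors' model: it assumes, purely for illustration, that each off-target site is cut independently with the same per-generation probability and that the lineage is exposed to Cas9 in every generation. The function name and all parameter values are hypothetical.

```python
def prob_all_sites_cut(cut_rate: float, n_sites: int, generations: int) -> float:
    """Probability that all n_sites off-target sites on a lineage have been cut.

    Illustrative assumptions (not taken from the manuscript):
    - each site is cut independently with probability cut_rate per generation,
    - the lineage is exposed to Cas9 in every one of the generations,
    - a cut site stays cut (no reversion).
    """
    if not 0.0 <= cut_rate <= 1.0:
        raise ValueError("cut_rate must lie in [0, 1]")
    if n_sites < 1 or generations < 0:
        raise ValueError("n_sites must be >= 1 and generations >= 0")
    # Probability that one given site has been cut at least once.
    p_single = 1.0 - (1.0 - cut_rate) ** generations
    # Independence across sites: multiply the per-site probabilities.
    return p_single ** n_sites


if __name__ == "__main__":
    # Hypothetical cut rates and site counts, chosen only to show the trend.
    for cut_rate in (1.0, 0.5, 0.1):
        for n_sites in (1, 5):
            p = prob_all_sites_cut(cut_rate, n_sites, generations=5)
            print(f"cut_rate={cut_rate:.1f} n_sites={n_sites} -> {p:.3f}")
```

    Under these assumptions, a cut rate of 100 % saturates every site after a single exposure, whereas lower cut rates or more sites delay saturation, which is the scenario the reviewer asks about.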

    3.c) My third suggestion involves the correspondence between the results of the likelihood modeling and the phenotypic assays. The best fit model inferred a viability loss of 26% and no detectable effects on female choice (or male attractiveness) or fecundity. In contrast, the phenotypic assays inferred no detectable effect on viability, but a 50% reduction in male attractiveness and 25% reduction in female fecundity. I think that the authors' conclusion that "[t]hese assays broadly confirmed our previous findings" needs some context or explanation as to how these numerically discrepant findings are broadly confirming, beyond the speculation that the discrepancy in viability may be due to rearing in vials vs. population cages.

    We thank the reviewer for pointing this out. We removed the claim that the phenotypic assays “broadly confirmed our previous findings” and now highlight the differences in estimated fitness costs for males and females in the phenotypic assays as well as the discrepancy with our maximum likelihood estimates. Furthermore, we now provide additional explanations for what might be causing this phenomenon (i.e., single crosses vs. large populations, vial vs. cage, interactions between individual genotypes and the environment, delayed development of construct homozygotes being interpreted as reduced viability in the maximum likelihood analysis). We also point out the discrepancies in the Discussion of our revised manuscript and recap potential explanations.

    3.d) My fourth suggestion involves the comparison between the Cas9_gRNAs and Cas9HF1_gRNAs transgenes. The inference that off-target cuts are the major source of fitness loss for the Cas9_gRNAs construct relies heavily on the observation that there was no decrease in allele frequency for the two Cas9HF1_gRNAs replicates. It therefore seems critical to be confident in this observation, and to rule out alternative explanations as much as possible. For example, did the authors confirm that the Cas9HF1_gRNAs construct has on-target Cas9 activity levels as high as the Cas9_gRNAs construct? Although I am not certain about this (see comments in the next paragraph on this point), I think the transgene constructs used to estimate drive conversion rates are different from the constructs used for the population cage experiments; if this is correct, I think it would be helpful to provide the on-target mutation rates for the actual constructs used in the population cages.

    The reviewer is correct: the constructs in the population cages are different from the homing gene drives for which we estimated the gene drive conversion rates. However, we were able to confirm at least one mutated gRNA target site in every PCR-genotyped offspring of individuals carrying either the Cas9_gRNAs or the Cas9HF1_gRNAs construct (this is now specified in the manuscript). Thus, we did not expect a systematic difference in on-target mutation rates between the Cas9_gRNAs and Cas9HF1_gRNAs constructs. We acknowledge in the Discussion that construct performance might vary substantially with genomic sites and even organisms.

    3.e) Relatedly, I was confused about the portion of the manuscript that reports the drive conversion efficiency. The manuscript states, "As a proof-of-principle that Cas9HF1 is indeed a feasible alternative, we designed a homing drive that is identical to a previous drive (45), except that it uses Cas9HF1 instead of standard Cas9. This drive targets an artificial EGFP target locus with a single gRNA (see Methods)." Given that the rate of drive conversion was estimated by the loss of GFP, these homing drive constructs must be different from the constructs used in the population cage experiments, as those constructs targeted a site on chromosome 3L which does not contain GFP. I could not find a description of these homing constructs in the Methods - while a reader might be able to puzzle this out by reading reference #45, I think it would be helpful to explicitly describe these details in this manuscript.

    We apologize for the confusion. We have highlighted the similarities (e.g., nanos promoter, DsRed) as well as the differences (e.g., number of gRNAs) between the homing drives and the constructs in the cage populations at the beginning of the section “Cas9HF1 homing drive” in the Results. We hope this makes it clearer.

  2. Evaluation Summary:

    The issue of general fitness effects in organisms expressing Cas9 enzymes as part of gene drive genetic control strategies is important, particularly in the emerging field of vector control. This manuscript reports experiments aimed at teasing apart such effects in a Drosophila model system, providing evidence that off-target effects predominate, which may be ameliorated by utilising high-fidelity nucleases, but a more detailed analysis of data and justification for some of the assumptions, especially some direct evidence of off-target cleavage, are still needed to support the authors' inferences. It is currently also not entirely clear how the lines were generated and tested. Finally, additional modelling to include scenarios where the initial frequency of the drive allele is very low (as would be the case for an actual release) would help to strengthen the conclusions.

    (This preprint has been reviewed by eLife. We include the public reviews from the reviewers here; the authors also receive private feedback with suggested changes to the manuscript. Reviewer #2 agreed to share their name with the authors.)

  3. Reviewer #3 (Public Review):

    The manuscript by Langmuller, Champer and colleagues reports a set of experiments and models investigating the fitness effects of transgenes in Drosophila melanogaster carrying CRISPR components to determine how useful such transgenes may be for population control. This study benefits from well-designed transgene constructs that allow the investigators to distinguish the effects of on-target and off-target Cas9 endonuclease activity, and a sophisticated maximum likelihood modeling framework that allows estimation of the fitness effects of the transgene constructs. The manuscript's major shortcoming is the absence of statistical analysis of the allele frequency data and some potentially unrealistic assumptions that went into the model.

    My first recommendation is that a statistical analysis of the allele frequency data should be included in the manuscript, rather than inferring patterns solely from visual inspection of the data. Specifically, the manuscript claims that (lines 176-180): "We found Cas9_gRNAs to be the only construct that systematically decreased in frequency across all replicate cages (Figure 2). Interestingly, the allele frequency change was not consistent with fixed direct fitness costs. Instead, the construct frequency "bottomed out" in most replicates, and this occurred more quickly when the starting frequency was higher (Figure 2)." These conclusions regarding allele frequency changes should be supported by statistical analyses. What is the uncertainty surrounding the allele frequency estimates? Some indication of this uncertainty (such as error bars) could be added to Figure 2. Which of the trajectories in Figure 2 show a statistically significant change in allele frequency over the course of the experiment? Is the *increase* in the frequency of the no-Cas9_no-gRNA replicates significant? What support is there for the claim that the allele frequency changes "bottomed out"? Does a non-linear model fit these data significantly better than a linear trend? What is the evidence that allele frequency decreases slowed earlier "when the starting frequency was higher"? What is the evidence that "replicates 3 and 4 ... had very different frequency dynamics"? While they started at different frequencies, the slope of those two trajectories could be statistically indistinguishable. What is the authors' interpretation of the Cas9_gRNAs replicates 6 & 7 whose trajectories did not decrease?
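
    One simple way to attach the kind of error bars requested here is a binomial confidence interval on each frequency estimate. The sketch below uses the Wilson score interval and assumes, hypothetically, that each frequency is estimated from a known number of independently scored alleles; the counts are invented for illustration, and the interval covers sampling error only, not genetic drift between generations.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    successes: number of scored alleles carrying the construct (hypothetical input)
    n: total number of scored alleles
    z: standard normal quantile (1.96 gives an approximate 95% interval)
    """
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p_hat = successes / n
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2.0 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1.0 - p_hat) / n + z * z / (4.0 * n * n)) / denom
    return center - half_width, center + half_width


if __name__ == "__main__":
    # Invented counts: 50 construct alleles among 100 scored alleles.
    low, high = wilson_interval(50, 100)
    print(f"estimate=0.500, approx. 95% CI=({low:.3f}, {high:.3f})")
```

    Intervals of this kind could be drawn around each point in a frequency trajectory; whether two trajectories differ would still require a model of the change between generations.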

    My second recommendation involves the assumptions that went into the maximum likelihood modeling. In particular, it strikes me as unrealistic to assume that 1) the genome contains only a single off-target site that is entirely responsible for the decrease in fitness due to Cas9 activity; and 2) that the rate of off-target mutation is as high as it is assumed to be ("In individuals that carry a construct, all uncut off-target alleles are assumed to be cut in the germline, which are then passed on to offspring that could suffer negative fitness consequences."). Regarding point 1), isn't a more realistic scenario that there are multiple off-target sites, each with a potentially different fitness consequence resulting from Cas9-induced mutations? If so, doesn't the likelihood that all off-target sites have been cut depend on the number of such sites, as multiple off-target sites should reduce the mutation rate at any single site? This possibility also suggests that there may be multiple loci with potentially deleterious Cas9-induced alleles segregating within the experimental populations. Regarding point 2), even assuming only a few potential off-target sites per genome, it seems like the rate of off-target cutting would have to be unrealistically high to approach mutating all off-target sites in the population. The conversion efficiency of the constructs used here is reported as ~80% and 60% in females and males, respectively; it seems likely that the rate of Cas9 mutation at off-target sites is lower than this efficiency for the target site. These assumptions should be justified or relaxed before claiming that mutational saturation of off-target sites is responsible for a decreasing fitness loss over the course of the experiments (after confirming that there is statistical support for the claim that the allele frequency trajectories bottom out).

    My third suggestion involves the correspondence between the results of the likelihood modeling and the phenotypic assays. The best fit model inferred a viability loss of 26% and no detectable effects on female choice (or male attractiveness) or fecundity. In contrast, the phenotypic assays inferred no detectable effect on viability, but a 50% reduction in male attractiveness and 25% reduction in female fecundity. I think that the authors' conclusion that "[t]hese assays broadly confirmed our previous findings" needs some context or explanation as to how these numerically discrepant findings are broadly confirming, beyond the speculation that the discrepancy in viability may be due to rearing in vials vs. population cages.

    My fourth suggestion involves the comparison between the Cas9_gRNAs and Cas9HF1_gRNAs transgenes. The inference that off-target cuts are the major source of fitness loss for the Cas9_gRNAs construct relies heavily on the observation that there was no decrease in allele frequency for the two Cas9HF1_gRNAs replicates. It therefore seems critical to be confident in this observation, and to rule out alternative explanations as much as possible. For example, did the authors confirm that the Cas9HF1_gRNAs construct has on-target Cas9 activity levels as high as the Cas9_gRNAs construct? Although I am not certain about this (see comments in the next paragraph on this point), I think the transgene constructs used to estimate drive conversion rates are different from the constructs used for the population cage experiments; if this is correct, I think it would be helpful to provide the on-target mutation rates for the actual constructs used in the population cages.

    Relatedly, I was confused about the portion of the manuscript that reports the drive conversion efficiency. The manuscript states, "As a proof-of-principle that Cas9HF1 is indeed a feasible alternative, we designed a homing drive that is identical to a previous drive (45), except that it uses Cas9HF1 instead of standard Cas9. This drive targets an artificial EGFP target locus with a single gRNA (see Methods)." Given that the rate of drive conversion was estimated by the loss of GFP, these homing drive constructs must be different from the constructs used in the population cage experiments, as those constructs targeted a site on chromosome 3L which does not contain GFP. I could not find a description of these homing constructs in the Methods - while a reader might be able to puzzle this out by reading reference #45, I think it would be helpful to explicitly describe these details in this manuscript.

  4. Reviewer #2 (Public Review):

    This paper reports a set of Drosophila population cage experiments aimed at quantifying fitness effects associated with the expression of Cas9 gene drive constructs in the absence of homing. The study attempts to deconvolve fitness effects due to the presence of the active nuclease at a genomic location from those that arise from off-target effects elsewhere in the genome: an important issue when considering gene drive strategies in the wild. To distinguish effects due to cleavage at the target site from activity elsewhere in the genome, a construct where Cas9 was replaced with a high-fidelity nuclease (Cas9HF1) was employed. The experimental design compares the active nuclease-gRNA constructs targeting a site on another chromosome with no-gRNA and reporter-only controls, all inserted in the same locus. The Cas9 construct was assayed in 7 replicates, with Cas9HF1 and controls assessed as duplicates, and cages ran for between 8 and 19 generations.

    The design of the cage set-up is unclear. The description in the supplementary methods could clarify whether all the replicates came from a single founder, and explain the differences in set-up that necessitated ignoring some first generations.

    The main finding reported from this part of the work is that with the control populations the frequency of the construct remained fairly constant across the generations, but the active nuclease tended to decline. I am somewhat confused by some of the claims here. First, the authors report a "bottoming out" effect where construct frequency declines then levels off: I am not entirely convinced that Figure 2 shows this. For example, comparing replicates 4 and 5 (8 and 16 generations respectively), it looks to me that there is a steady decline at the same rate with no evidence for a plateau. Perhaps replicates 2 and 3 show "some" evidence of levelling. In addition, replicates 4, 5, 6 and 7 have similar construct starting frequencies (particularly 5 and 7, which are only a few % different) yet the former show a steady decline whereas the latter maintain the construct at a steady level. This does not appear to be consistent with the authors' explanation of higher off-target effects in populations carrying high frequencies of the construct. It would be helpful if the authors could more clearly explain the trajectories presented in Figure 2.

    Utilising the allele frequencies obtained from the cages, two-locus ML models were applied with the construct insertion site and an idealised off-target site. The authors argue, correctly in my view, that fitness effects can be attributed to off-target activity and not cleavage at the 3L target since the Cas9HF1 construct shows no substantive effect. In the models they assume that the presence of Cas9 in the germline (or maternally contributed) will invariably lead to cleavage at the idealised site. The model indicates that the construct insertion per se has no direct fitness costs but that off-target effects may have fitness consequences of approximately 30%, and the authors seek to support this conclusion with simulations. I found this section difficult to follow but I feel that the conclusions are supported.

    Direct phenotypic assays with the active Cas9 nuclease were performed, looking at viability, mating preference and fecundity. Relegating these data to the supplements is not useful. While significant effects are attributed to the Cas9-gRNA construct, the authors cannot rule out a DsRed effect and it is a shame they did not assay at least one of the control constructs. In addition, in their modelling they assume that Cas9 activity will always cleave but see no evidence for this in the heterozygote viability assay. Whether this is due to the difference in rearing conditions that the authors claim is debatable.

    Finally, since the initial cage experiments suggest that the Cas9HF1 enzyme reduces off-target effects they assay this enzyme in a model homing drive, indicating that this enzyme performs as well as the regular Cas9. Again, relegation of these data to supplementary datasets is unhelpful and it would improve the manuscript if these results could be simply summarised in a figure.

    Taken together, I think this is a useful study, but it is presented in a way that is at times impenetrable to the non-expert. More clarity in presenting the cage and modelling data, as well as promotion of figures from supplementary material to the main manuscript, would considerably aid the non-expert and provide greater confidence in the interpretations. If these issues could be clarified, I feel the work would provide a useful addition to the gene drive field and help those thinking about developing such strategies; the findings related to the Cas9HF1 enzyme are particularly relevant.

  5. Reviewer #1 (Public Review):

    The goal of the work was to test for direct and indirect fitness costs associated with specific types of constructs that could be used for gene drive. The authors conclude that there are no direct fitness costs associated with the presence and expression of either Cas9 or the guide RNAs but that the Cas9 is causing off-target cuts that result in loss of fitness. They also conclude that a newer form of Cas9 doesn't cause these off-target cuts. While the goal of this study is important, there are many caveats associated with the work as reported, and these limit interpretation of the results. Many of the caveats are pointed out in the discussion.

    I am specifically concerned by the fact that, from what I read, a company made the transgenic lines and that there was only one transgenic line per treatment. Unless the fly line used for the insertion was completely homozygous for the chromosome where the insertion was made, the lines could have differed in fitness, due to somewhat deleterious recessives captured in one G1 but not another. This cost could have persisted for a number of generations after the crosses were made, especially in the high frequency "releases". This may not have been a real problem, but without any replication it is difficult to know.

    My concern is reinforced by the fact that the no-Cas9, no-gRNA line goes up in frequency for the first 5 generations and then becomes stable in frequency. The loss of the fitness advantage is consistent with a fitness effect partially linked to the insertion site in that one cross but not others.

    It is important to note that the starting points are cages with separate vials of the control and experimental strain. Even a small difference in development time of the two strains in the first generation could lead to an excess of homozygotes in the next generation.

    I am also concerned by the fact that the main conclusion is that the decline in frequency in the Cas9-gRNA line is due to off-target cuts, but there was no sequencing to back up that conclusion. In the discussion, this problem is mentioned but dismissed. I don't see how it can be dismissed when this is a major conclusion that remains based on very indirect evidence.

    When releasing homing gene drives, the initial frequency of the transgenic line is very low, and as in the Garrood et al. paper cited, it is possible for the gene drive to outpace the non-target cutting. The modeling does not address what the impact of the presumed fitness costs in this experiment would be for a replacement/suppression drive released at low frequency.
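
    The low-frequency release scenario raised here can be explored with a textbook-style deterministic recursion. The sketch below is not the authors' two-locus model: it assumes random mating, an infinite population, germline conversion in heterozygotes, and a multiplicative viability cost attached directly to each drive allele, whereas the manuscript attributes costs to off-target alleles at other loci. The function names and the parameter values (conversion 0.8, cost 0.3, starting frequency 0.01) are illustrative assumptions.

```python
def next_drive_frequency(q: float, conversion: float, cost: float) -> float:
    """One generation of a simple homing-drive recursion.

    q: drive allele frequency among gametes forming this generation
    conversion: probability that the wild-type allele in a heterozygote's
        germline is converted to a drive allele (hypothetical value)
    cost: viability cost per drive allele, multiplicative (hypothetical value)
    """
    # Zygote genotype frequencies under random mating.
    f_dd, f_dw, f_ww = q * q, 2.0 * q * (1.0 - q), (1.0 - q) ** 2
    # Viabilities: cost applied once per drive allele carried.
    w_dd, w_dw, w_ww = (1.0 - cost) ** 2, 1.0 - cost, 1.0
    mean_w = f_dd * w_dd + f_dw * w_dw + f_ww * w_ww
    # Surviving heterozygotes transmit the drive with probability (1 + conversion) / 2.
    drive_gametes = f_dd * w_dd + f_dw * w_dw * (1.0 + conversion) / 2.0
    return drive_gametes / mean_w


def trajectory(q0: float, conversion: float, cost: float, generations: int) -> list[float]:
    """Drive allele frequency over time, starting from q0."""
    freqs = [q0]
    for _ in range(generations):
        freqs.append(next_drive_frequency(freqs[-1], conversion, cost))
    return freqs


if __name__ == "__main__":
    # Illustrative low-frequency release.
    for gen, q in enumerate(trajectory(0.01, conversion=0.8, cost=0.3, generations=10)):
        print(f"generation {gen}: drive frequency {q:.4f}")
```

    In this simplified setting the drive spreads from low frequency whenever the transmission advantage in heterozygotes outweighs the cost they pay; a model in which costs accrue at unlinked off-target loci could behave differently, which is the gap the reviewer identifies.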