Novel non-transposable-element regulation patterns of KZFP family reveal new drivers of its rapid evolution

This article has been Reviewed by the following groups

Read the full article

Abstract

One striking feature of the large KRAB-containing zinc finger protein (KZFP) family is its rapid expansion and divergence since its formation about 400 million years ago. However, the evolutionary characteristics of KRAB domains, C2H2 zinc fingers and the full protein of KZFPs have not been fully analyzed. As for the drivers of the rapid evolution, it’s partly due to their coevolution with transposable elements (TEs). But their diverse functions besides inhibiting TEs suggest other reasons exist. Here we address these two issues by the systematic analysis of the divergence time and diversification pattern of KZFPs at three aspects and the functional analysis of the potential target genes besides TEs. We found that old-zinc-finger-containing KZFPs tend to have varied and disordered KRAB domains, indicating there are two ways of the evolution of KZFPs, including the variation of KRAB domains and the diversification of zinc fingers. Among them, the divergence of zinc fingers mainly contributes to the rapid evolution of KZFPs. Thus, we mainly focused on the functional requirements of the evolution of zinc fingers. Different from the classical regulation pattern of this family, we found and experimentally confirmed that KZFPs tend to bind to non-TE regions and can positively regulate target genes. Although most young genes tend to be with a low expression level, young-zinc-finger-containing KZFPs tend to be highly expressed in early embryonic development or early mesoderm differentiation, indicating their particular evolutionarily novel functional roles in these two processes. We further validated a young KZFP, ZNF611, can bind to non-TE region of STK38 gene and positively regulates its expression in ESCs. The emergence of new sequence in STK38 promoter may drive the evolution of zinc fingers in ZNF611. Finally, we proposed a ‘two-way evolution model’ of KZFP family.

Article activity feed

  1. ##Author Response

    This paper analyzes the evolution of the KRAB-containing zinc finger protein (KZFP) family of proteins. While the reviewers were all interested in the topic, several major concerns came up during review. These include technical limitations of the methods chosen to analyze this challenging protein family (e.g., determination of orthology, selection analysis, and so on), and that new ideas, including claims about non-coding evolution and positive selection, are not convincingly supported by the analysis presented.

    Response: In our study, we focused on the co-evolution between zinc fingers in KZFPs and non-TE regions, not ‘non-coding regions’. Non-TE regions are located in both coding and non-coding regions.

    ###Reviewer #1:

    1. The title and abstract make it clear that the authors are trying to argue that non-coding sequence contributes to rapid evolution of the KRAB-ZFP family….

    Response: As we mentioned above, we focused on the co-evolution between zinc fingers in KZFPs and non-TE regions, not ‘non-coding regions’. Non-TE regions are located in both coding and non-coding regions.

    1. Page 6 line 122: The authors do not define, here or in the methods, what constitutes a "variant" KRAB domain.

    Response: In fact, the meaning of variant KRAB domains had been simply described here (Page 6, line 120-122). The variant KRAB domains display a very significant degree of sequence divergence from the KRAB A-box consensus sequence, and variant KRAB domains are clustered into one separated branch in the phylogenetic tree of KRAB domain A-box amino acid sequence. This description was similar to that in the reference (Helleboid et al., 2019). We will explain it more detailly in method section in the further revision of the manuscript.

    1. Page 9 lines 189-193: Does the 90% cited here refer to 90% of the ~50% that are called as "tending to bind non-TE sequence" or 90% of all KZFPs? Regardless, this point is very misleading: the fact that less than 50% of the binding sites of a KZFP is not found to overlap TEs does not mean that the KZFP only binds to non-TEs.

    Response: Here, ‘90%’ refer to 90% of all KZFPs. We did not state that ‘less than 50% of the binding sites of a KZFP is not found to overlap TEs means that the KZFP only binds to non-TEs’. Instead, we mean that they tend to bind to non-TEs.

    1. Page 11 lines 249-251: 1) It is not clear how the author defined genes as transcription factors (they also do not define the acronym), or why they included them in the analysis. 2) Additionally, the authors say that the divergence time of KZFPs is correlated with expression level but does not provide correlation values or the significance of these correlations.

    Response: 1) Since KZFPs can bind to target genes and regulate their transcription, most of them are regarded as potential transcription factors. To confirm whether the special features of KZFPs found in our study are KZFP-specific or common to all transcription factors, we compared KZFPs with other transcription factors. The data source of transcription factors was described in the method section (page 18, lines 422-424). 2)We showed the correlation values in figure 4A and the corresponding P values were listed in Figure 4–source data 1.xlsx.

    1. How were the target genes selected for qPCR validation among the KZFP targets? 2) In Fig.5 suppl. 3 the authors show that there is a fraction of genes that is only accessible in ESCs, but there is also a similar number of genes that is accessible both in ESCs and HEK293T cells, so the authors could have tried to validate some of those in both cell lines...

    Response: 1) We screened the target genes with significant changes in the expression level from ESC into endoderm or mesoderm for qPCR validation. 2) Indeed, we have performed some validations (Fig.5 suppl. 1)

    1. Page 16 lines 367-373: The conclusions that can be drawn from the ZNF611 reporter assay and associated evolutionary analysis are minimal. …There is almost no experimental methodology on how the tree was generated, how the authors overcame these issues, and how the authors identified the orthologous binding site in different species.

    Response: Sequence alignments were performed using ClustalX (version 2.1) with default parameters (Larkin et al., 2007), and the phylogenetic tree (neighbor-joining tree) was constructed using MEGAX (Kumar, Stecher, Li, Knyaz, & Tamura, 2018) with default parameters. To identify the orthologous binding site in different species, firstly, we found the ZFN611 binding site in the ZNF611 ChIP peak sequence in STK38 promoter according to the predicted ZNF611 binding motif in human. Then we compared the ZFN611 binding site within the promoter of orthologous STK38 in different species.

    17)There is not enough detail about how the human KRAB-ZFPs were identified. Bare minimum, the authors need to report thresholds used to determine if a protein's domains scored high enough to be either a KRAB or C2H2 ZF domain.

    Response: All KRAB domains and C2H2 zinc fingers in human proteins were identified using HMMER v3.1b2 with E value < 0.01. The proteins containing both a KRAB domain and C2H2 zinc fingers were defined as KZFPs. This method was not described clearly in the manuscript (page 18). We will add detailed description of that in the revision.

    1. How the authors performed the gene ontology enrichment/depletion analysis is not clear. For example, if the authors indeed prefiltered their list to remove genes that have no GO terms, that would bias the results.

    Response: This has been described in the method section (page 23, lines 530-532). The genes that haven’t GO term annotation were filtered out. It’s also needed that the genes were expressed at least in one sample. These genes were regarded as the background of the enrichment/depletion analysis. This strategy was used widely in published papers.

    ###Reviewer #2:

    It is not clear how the authors identified KRAB-ZNF genes in the 80 species analysed, nor how they defined orthology relationship of KRAB-ZNF genes.

    Response: The methods were described in lines 425-443 (page18-19). To identify the divergence time of KRAB domain in human KZFPs, protein sequences of 80 species from 80 genera in deuterostomia were downloaded from Ensembl database. All KRAB domains and C2H2 zinc fingers in proteins were identified using HMMER v3.1b2 with E value < 0.01. The proteins containing both a KRAB domain and C2H2 zinc fingers were defined as KZFPs. The divergence time of the full protein sequence was inferred according to the homology information from Ensembl Compara (Herrero et al., 2016; Vilella et al., 2009).

    it is puzzling that the divergence time of the full protein sequence can be estimated above 400 Mya, while the root of the KRAB-ZNF gene family has been assigned to the common ancestor of coelacanths, lungfish and tetrapods (Imbeault et al., 2017).

    Response: the root of the KRAB-ZNF gene family in the research (Imbeault et al., 2017) was based on the earliest appearance of the gene encoding both KRAB domain and zinc fingers. However, the divergence time of the full protein sequence based on pairwise alignments, Large-scale syntenies and Enredo-Pecan-Ortheus (EPO) multiple alignments (Herrero, et al., 2016). Thus, some of the orthologous proteins of human KZFPs do not containing a KRAB domain.

    Peaks filtering should include, at the very least, the canonical ENCODE blacklisted regions (Amemiya et al., 2019)

    Response: we have used the corresponding total input samples as controls to get credible peaks.

    Of note, numerous ChIP-seq datasets from ENCODE are listed in the method, but are not referenced or mentioned in the text. Were those included in the ChIP-seq binding sites analysis? How do the two datasets (ENCODE and Imbeault et al., 2017) relate to one another?

    Response: ChIP-seq datasets from ENCODE are included in the ChIP-seq binding sites analysis. We firstly used the ChIP-seq data in Imbeault et al., 2017, and ChIP-seq datasets from ENCODE were used as supplements of the ChIP-seq data of other KZFPs.

    No details are given regarding the method used to assign "the expression level grade" of genes to a specific category.

    Response: The threshold wasn’t described clearly in the manuscript. Genes with read counts over 10 are considered to be expressed, while genes with read counts less than 10 are considered to be unexpressed (undetected). For each dataset, we used the upper and lower quartiles of TPMs of all expressed genes to divide them into three expression level grades: low-abundant genes, the genes with TPMs lower than lower quartile; medium-abundant genes, the genes with TPMs between the lower quartile and the upper quartile; high-abundant genes, the genes with TPMs higher than the upper quartile.

    The KD efficiency of ZNF611 is really poor (<20%, Figure 6B), and prevents further conclusions on this experiment (especially since a western blot cannot be performed). We are also sceptical about the statistical analysis performed in this panel. The authors should explain in detail which t-test was used and whether it was performed on raw or normalized values.

    Response: The statistical method was described in the figure legend. We used grouped t test. And it was performed on normalized values (relative mRNA levels of predicted target genes were normalized to GAPDH).

  2. ###Reviewer #3:

    This paper gives the impression that it is two stories bundled together into one. One story is the evolution of the family and the other one is the experimental part focusing on a very specific KFZP, ZNF611. However, it is a rather weak synthesis with results of moderate interest and likely low phenotypic impact.

    The authors state that the KFZP family is coevolving with TEs and suppresses their expression. According to previous knowledge, that is why this family is evolving so fast. However, the authors argue that this fast evolution is further attributed to the fact that KFZPs also positively regulate the promoters of other non-TE genes. They have analysed published Chip-seq data toward this end. Furthermore, they have experimentally identified that a "young" KFZP, ZNF611 can bind to a promoter element of the STK38 gene and positively regulate its expression in ESCs. However, I did not see substantial experimental evidence supporting a strong phenotypic effect of this particular regulation.

  3. ###Reviewer #2:

    This work interestingly addresses the evolutionary pressures undergone by KRAB-ZNF genes. However, a large part of the manuscript is based on the analysis of pre-existing datasets, but neither exploits these data in new ways nor reveals novel findings overlooked in the original studies. The authors' findings are not a significant addition to the conclusions made by the original investigations, which are, by the way, not properly referenced and often misquoted. Moreover, when the authors attempt to build a systematic method for the identification of non-TE related / activating functions for KRAB-ZNFs, the experimental validation tends to point to few regulatory exceptions rather than general principle for the KRAB-ZNF family. The paper finishes by the analysis of a single non-TE target of a young KRAB-ZNFs, ZNF611, which is clearly not the best candidate considering the proposed model of bimodal evolution of KRAB-ZNFs (old vs. young). The picture that comes out of this manuscript is that of a patchwork of analyses that struggle to stand together as a whole.

    Major points:

    Figure 1 / Comparison of the divergence time of the full sequence, KRAB domain and zinc fingers in KZFPs: The method section is not very clear, which suggests that the authors may have done their analysis by relying on pre-existing database annotations which could bias the estimation of the divergence time.

    -It is not clear how the authors identified KRAB-ZNF genes in the 80 species analysed, nor how they defined orthology relationship of KRAB-ZNF genes. This should precede the estimation of the divergence time. Methods to infer orthology for KRAB-ZNF genes has been based the on best reciprocal hit of the full protein-sequence by Blast (Liu et al., 2014) or on KRAB-ZNF fingerprint (Imbeault et al., 2017). Is it based on Ensembl? It is known that Ensembl has a poor annotation of KRAB-ZNF genes especially in distantly related species with human. Clarification is needed regarding de novo KRAB-ZNF gene detection, annotation and comparison in the method section.

    -Related to this, it is puzzling that the divergence time of the full protein sequence can be estimated above 400 Mya, while the root of the KRAB-ZNF gene family has been assigned to the common ancestor of coelacanths, lungfish and tetrapods (Imbeault et al., 2017). In addition, some of the oldest KRAB-ZNF genes found in the human genome are ~320 Mya (Liu et al., 2014). How do the authors reconcile this with the estimation of the full protein divergence time?

    Figure 2 / The diversification pattern of KRAB domains and zinc fingers in humans: The authors suggest that old KZFPs tend to have a variant KRAB variant domain and thus are involved in non-canonical protein-protein interactions. This analysis has been entirely made in Helleboid et al., 2019, who further validated these results by identifying the interactome of these proteins by mass-spectrometry. Considering the timeline of this submission and the release of the original paper, the authors could have modified their conclusions. They could also have taken greater advantage of non-overlapping findings, such as the disordered nature of the variant KRAB domain. This is interesting but under-exploited.

    Figure 3 / KZFPs tend to bind to non-TE regions in exon and promoter: The analysis of pre-existing data from different sources come with considerable drawbacks, notably in terms of unforeseen experimental artifacts and biases, which could affect peak calling, data interpretation and conclusion. As such, KZFPs may display promiscuous binding to unrelated "opened" regions, especially when they are overexpressed in a non-native context (Amemiya et al., 2019; Marinov et al., 2014). While the authors tested different parameters of the ChIP-seq analysis pipeline, I do not see any attempts to assess the overall reliability of KZFPs peaks within open regions in the method section or in supplementary figures:

    -Peaks filtering should include, at the very least, the canonical ENCODE blacklisted regions (Amemiya et al., 2019). Additional steps of filtering should be included such as building background models that are experiment-specific and cell-type specific, as it has been done in the past (Helleboid et al., 2019; Imbeault et al., 2017; Schmitges et al., 2016). Does it change the overall proportion of peaks falling into TE/non-TE regions?

    -As emphasized in the manuscript, targets of KRAB-ZNFs are expected to be highly specific (Schmitges et al., 2016) as only few of them display similar key amino-acids in their ZFs (Figure 2E/F) and may depend on the appearance of their binding site in evolution (Figure 6). As such, only a minimal overlap of non-TE targets peaks is to be expected for different KRAB-ZNFs proteins: it is likely that non-TE targets bound by many KRAB-ZNFs may result from promiscuous binding sites. The authors should show the overlap of non-TE targets bound by different KRAB-ZNFs before and after filtering steps.

    -As a consequence, these promiscuous binding sites would skew the results of the over-/under-representation of genes in specific biological processes (as presented in Figure 3D) and gene essentiality tolerance (in Figure 3E). What would be the result of these analyses once peaks and gene lists are filtered? Similarly, what would be the result if only promiscuous binding sites were considered?

    -Of note, numerous ChIP-seq datasets from ENCODE are listed in the method, but are not referenced or mentioned in the text. Were those included in the ChIP-seq binding sites analysis? How do the two datasets (ENCODE and Imbeault et al., 2017) relate to one another?

    Figure 4 / KZFP genes encoding young zinc fingers tend to have higher expression level in early embryonic development and the ESC differentiation into mesoderm:

    -The author should refer to previous work on young KZFPs expression during human embryogenesis (Pontis et al., 2019) when they introduce this section. This is especially important since the TE-controlling function of ZNF611 has been investigated in this study, and is not discussed or mentioned in Figure 6.

    -No details are given regarding the method used to assign "the expression level grade" of genes to a specific category. Is it common arbitrary thresholds used for all genes or is it based on something similar to a z-score value ? Clarification is needed.

    Figure 5 / KZFPs can positively regulate target genes by binding to non-TE regions in endoderm or mesoderm differentiation: We would suggest the authors reorganize the figure 5 to bring their strongest evidence of KRAB-ZNFs activating function in the main figure. For instance, genes over/under-representation (Figure 5C) and essentiality (Figure 5D) are not very informative. On the other hand, the Figure 5-figure supplement 1D/E could be presented in the main figure as it reinforces the link between chromatin accessibility and regulatory activities of KRAB-ZNFs in non-TE regions. Of note, while the authors may conclude to regulatory differences between ESC and HEK293, it would be farfetched to superimpose their conclusions to mesoderm and endoderm differentiation without experimental validation. Therefore, the authors should tone down their conclusion in the corresponding section.

    For the KRAB-ZNFs functionally investigated in Figure 5-figure supplement 1D/E, the authors should highlight :

    -Their divergence time, the type of KRAB domain, their known interactors and endogenous expression levels in ESCs, HEK293, during endoderm and mesoderm differentiation (it is impossible to zoom in Figure 4).

    -The proportion of peaks falling in TE/non-TEs region and their associated chromatin accessibility in the different cell types (such as plotHeatmap function from the deepTools suite).

    -The correlation matrix of the chromatin accessibility signals in non-TE binding sites between the two cell lines should be displayed for all the KRAB-ZNFs functionally investigated.

    Figure 6 / The emergence of new sequence in STK38 promoter may drive the evolution of zinc fingers in ZNF611: While the emphasis on KZFPs divergence time and KRAB domain feature is clear in the first part of the manuscript, the shift toward the functional assessment of a young KRAB-ZNF is somehow inconsistent and should be explained.

    -As mentioned above for the KRAB-ZNFs of Figure 5-figure supplement 1D/E, ZNF611 features (divergence time,...) should be displayed in the figure or stated in the text. The number of peaks of ZNF611 in non TE/ non-TE regions should be plotted. Also, previous work on ZNF611 function during embryogenesis should be introduced in this section.

    -ZNF611 expression during mesoderm differentiation (with corresponding correlation) and ESCs should be added to Figure 6-figure supplement 1A.

    Overall, the effect of ZNF611 overexpression or knock-down appears to be mild, and should be reinforced by additional information:

    -Considering the discrepancy of the effect of ZNF611 overexpression and knock-down on the level of STK38 (Figure 6A/B): (i) a western blot analysis of ZNF611-FLAG protein levels in overexpressing cells (like in Figure 5 - figure supplement 1C) could indicate that the overexpression of the protein is actually mild compared to overexpression mRNA levels of ZNF611. Similarly, a previous study analysed the effect of ZNF611 overexpression in hESCs (Pontis et al., 2019), is STK38 upregulated in those datasets? That would reinforce the conclusions made by the authors.

    -The KD efficiency of ZNF611 is really poor (<20%, Figure 6B), and prevents further conclusions on this experiment (especially since a western blot cannot be performed). We are also sceptical about the statistical analysis performed in this panel. The authors should explain in detail which t-test was used and whether it was performed on raw or normalized values.

    -Since the BMPR2 gene remained unaffected by ZNF611 "KD" or "overexpression", could the authors show / perform the same analysis as for STK38 promoter region in Figure 1C-D for this gene?

    -The authors emphasize that ZNF611 functions in mesoderm differentiation through STK38 regulation. This analysis was conducted in the pluripotent state (hESCs). What about the differentiation potential of these cells toward the mesoderm lineage? Does it prevent STK38 upregulation?

    -The authors have shown that KRAB-ZNF effect is largely cell type dependent (figure 5 - figure supplement 1D/E), while the experimental assessment of ZNF611 was done in ESCs, the luciferase assay was performed in HEK293 (figure 6H-I). The authors should repeat the experiments in ESCs or tone down their conclusions.

    -Interestingly, RACDE use TE-related sequences to identify the binding motif of KRAB-ZNFs, suggesting that the binding motif of ZNF611 to STK38 promoter is fairly similar to its TE-derived consensus motif (figure 6F). How many binding sites of ZNF611 in non-TE region present binding sites with a close similarity to the consensus motif derived from TE-binding? Are there changes in specific DNA bases of the canonical binding site motif that could predict activating function of ZNF611 in non-TE regions?

  4. ###Reviewer #1:

    Summary:

    In this study, the authors seek to determine patterns of KRAB-ZFP family evolution and identify the factors that drive those patterns. To do so, they first annotated KRAB-ZFP genes in the human genome and determined the age of these genes in four different ways: orthology, divergence age of full protein, KRAB, and ZnF domain respectively. They found that age estimates based on the KRAB domain and Zinc finger array were older and younger, respectively, relative to full-length or orthology-based estimates of divergence, and that many human KRAB-ZFPs emerged in the eutherian common ancestor. They also determined that older KRAB-ZFPs were more likely to have variant, disordered KRAB domains, and that zinc finger arrays were most variable at the residues directly in contact with DNA. By reanalyzing existing data, the authors claim that most KRAB-ZFPs bind to non-TE regions, and that many KZFP genes are expressed during early embryonic development. They show correlative evidence that KRAB-ZFPs are capable of positively regulating gene expression, and functionally validate a single candidate gene of a KZFP using reporter gene assays. Based on this evidence, they propose a 2-way model of evolution of KRAB-ZFP evolution, where older KRAB-ZFPs are more likely to have non-TE silencing roles and thus have different patterns of evolution compared with younger KRAB-ZFPs.

    General Comments:

    While the subject of KRAB-ZFP family evolution is of interest, the data and conclusions the authors present in this manuscript are mostly confirmatory. Nearly every major conclusion of the paper, including the 2-way model of KRAB-ZFP evolution, has been extensively documented before by the Trono lab (Imbeault, et al. 2017 Nature; Helleboid, et al. 2019 EMBO J; Ecco, et al. 2017; Pontis, et al. 2019), many of which the authors cite. The conclusion that older KZFPs gained new functions not related with TEs repression (such as imprinting regulation or meiotic hotspot determination) is already well established knowledge, which goes together with the model of higher purifying selection of the zinc finger array to retain the binding specificity, while the KRAB domain loses interaction with KAP1. Furthermore, the fact that KZFPs don't only bind to TEs has also been already reported by Imbeault et al. that originally provided the datasets re-analyzed in this manuscript.

    The functional validation of ZNF611 binding to one of its target sequences is welcome and adds another example of a KRAB-ZFP that might have positive transcription regulatory function, however it is only a single KRAB-ZFP in a single assay. The finding that a KRAB-ZFP is capable of activating gene expression is also confirmatory (Ye at al. 2004; Frietze et al. 2010; Hallen et al. 2011).

    There is value in replicating existing research, but the article is not written with that in mind. One contrast with previous studies is that their reanalysis of existing ChIP-seq data showed KRAB-ZFPs primarily bind to non-TE regions. However, these findings are based on thin evidence. It is not enough to say that a KRAB-ZFP mostly binds non-TE regions because >50% of its binding sites are outside of a TE. Rather, more quantitative statistics, such as enrichment or depletion of binding in a given genomic compartment compared to a random expectation is required. Additionally, there is no evidence such as heatmaps or metaplots over a subset of peaks to further demonstrate that the peaks identified in the new analysis are any better than the previous analysis. The authors argue that the more significant p values of their peaks are indicative of better peak calls, but there is no formal comparison of true/false negative rate (such as at known binding sites). Furthermore, many TEs, which are poorly mappable, will have less significant p values simply because fewer unique reads are mapped there relative to unique sequences. More careful analysis will be needed to assess these claims.

    Finally, the paper itself is hard to read and the logic is difficult to follow, often due to a lack of sufficient detail. The methodology is also light on details, making it challenging to understand exactly what the authors did or did not do (see specific examples below). Additionally, the figures (especially Figure 1, Figure 3A, and Figure 4) are difficult to read and understand as currently presented.

    Specific Comments:

    1. The title and abstract make it clear that the authors are trying to argue that noncoding sequence contributes to rapid evolution of the KRAB-ZFP family. While this is possibly true, the authors' data, which is limited to a phylogenetic analysis of a single gene (using methodology that does not work well for highly repetitive sequences such as the KRAB-ZFP C2H2 zinc finger array) and its potential binding site. Much more analysis (such as selection analysis of more KRAB-ZFPs and their predicted or empirically determined binding sites) is required to make this claim.

    2. Page 4, lines 66-70: The authors present the two possible models of KRAB-ZFP evolution (ie: arms race/domestication model) as if they are mutually specific, when most argue they would not be. Also, the authors state: "and (2) the domestication model (Ecco et al., 2017; Pontis et al., 2019), in which KZFPs regulate domestication of TEs instead of restraining the transposition potential of TEs". This should be rephrased, because in most of the cases reported, the "domesticated TEs" have lost transposition potential and only regulatory and protein coding sequences got domesticated with new functions. If the authors were referring to the adaptation of KZFPs to non-TE related functions, this cannot be called domestication, since KZFP genes are already from the host.

    3. Page 5, lines 91-93: Here and throughout the authors use language such as "later" or "earlier" which is confusing - these should be replaced with "younger/more recent" and "older".

    4. Page 6, lines 111-115: This section is highly speculative and should be moved to discussion.

    5. Page 6 line 122: The authors do not define, here or in the methods, what constitutes a "variant" KRAB domain.

    6. Page 7 lines 129-133: The authors only inferred their conclusion, yet they state that their result is consistent with a previous study. No real evidence is provided there.

    7. Page 7 line 138-140: The authors say that the data suggests variant KRAB domains were formed gradually rather than in a burst, but their analysis is not sufficient to conclude this. Also, the only conclusion that can be drawn from Figure 2A is that the KZFPs that were clustered as "vKRAB" are on a separated branch in the tree on the left. This would mean that early in evolution some KZFP got a "vKRAB" and subsequently this gene underwent duplication and diversification, like all the other KZFP genes with "sKRAB" did.

    8. Page 9 lines 189-193: Does the 90% cited here refer to 90% of the ~50% that are called as "tending to bind non-TE sequence" or 90% of all KZFPs? Regardless, this point is very misleading: the fact that less than 50% of the binding sites of a KZFP is not found to overlap TEs does not mean that the KZFP only binds to non-TEs. Some of this non-TE binding could also be an artifact of overexpression, which has not been considered but which has been well documented (for example ZFP809, Macfarlan Lab, and PRDM9 Simon Myers lab).

    9. Lines 196-197, the authors state that they randomly selected 30 KZFPs. The authors should state in a supplementary figure which KZFPs were selected and, among them, what is the percentage of KZFPs that bind or not to TEs according to the analysis performed in the original paper (Imbault et al. 2017) and in this manuscript.

    10. Page 11 line 230: Here and throughout the rest of the document the authors use the acronym "PCGs" without defining it (outside a figure legend).

    11. Page 11 lines 234-237: Here the authors cite their use of pLI, RVIS, Shet, and dN/dS values as evidence of purifying selection. Of those, only dN/dS measures purifying selection, and the authors do not specify whether the dN/dS values they obtain are statistically significant evidence of purifying selection relative to a neutral model (likely the case when only considering chimp-human, as the authors do). Moreover, while the other measures do suggest some constraint, the differences between the KZFP-TE and KZFP-nonTE protein coding genes is very subtle. Also, they don't provide any explanation as to why, according to their claim, there should be less purifying selection for the KZFPs involved in mesoderm differentiation. Thus, the authors should temper their claims or else omit this data.

    12. Page 11 lines 249-251: It is not clear how the author defined genes as transcription factors (they also do not define the acronym), or why they included them in the analysis. Additionally, the authors say that the divergence time of KZFPs is correlated with expression level but does not provide correlation values or the significance of these correlations.

    13. Page 12 lines 266-268: This is not surprising, since TEs are generally silenced, while the rest of the genes can be either active or silent, so comparison of accessibility of cumulative TEs versus non-TEs will inevitably show open chromatin for non-TEs.

    14. Page 13 lines 280-286: Here the authors try to draw conclusions from comparing chromatin accessibility of binding sites in ESCs and 293T cells and conclude that because they are more accessible in ESCs that suggests that KRAB-ZFPs activate in conditions. In reality, it is difficult to compare epigenetic states across cell lines, especially in undifferentiated vs differentiated, making it almost impossible without genetic manipulation to determine that KRAB-ZFPs are the cause of these differences.

    15. How were the target genes selected for qPCR validation among the KZFP targets? In Fig.5 suppl. 3 the authors show that there is a fraction of genes that is only accessible in ESCs, but there is also a similar number of genes that is accessible both in ESCs and HEK293T cells, so the authors could have tried to validate some of those in both cell lines...Also, if the KZFPs are responsible for the target genes activation, why overexpression did not activate genes that are repressed in HEK293T cells? The ChIP-exo dataset used here (from Imbeault et al. 2017) was obtained from overexpression of the KZFPs in HEK293T cells, so obviously the proteins could bind to these genes in this cell line. This would rather suggest that if it's true that the tested KZFPs can promote transcriptional activation, this might be a secondary effect, since it might rely on something else making the genes already accessible and expressed in ESCs.

    16. Page 16 lines 367-373: The conclusions that can be drawn from the ZNF611 reporter assay and associated evolutionary analysis are minimal. First, the authors cloned in a large chunk of DNA (1.2kb) rather than just the predicted binding site. This is mitigated somewhat by the deletion, but the deletion construct also deletes sequence upstream of the binding sites making the results hard to interpret. Additionally, the evolutionary analysis is very weak - traditional methods to generate phylogenetic trees do not work well for repetitive sequences, such as the ZnF arrays, and the bootstrap values on the tree are poor. There is almost no experimental methodology on how the tree was generated, how the authors overcame these issues, and how the authors identified the orthologous binding site in different species.

    17. Page 18 lines 417-424: There is not enough detail about how the human KRAB-ZFPs were identified. Bare minimum, the authors need to report thresholds used to determine if a protein's domains scored high enough to be either a KRAB or C2H2 ZF domain.

    18. Page 19: Given the highly repetitive nature of KRAB-ZFPs, it is not sufficient to use the homology estimations from Ensembl to identify orthologous proteins. Other methods, such as synteny, should be used to confirm orthologs. Additionally, the authors identify homologs between different KRAB domains based on %identity, but this will likely give spurious results, as functional domains do not evolve neutrally and often have high similarity across proteins due to functional constraint. Regarding the phylogenetic analysis, there is again not enough detail to explain how the authors overcome issues with alignments and low bootstrap values - additionally, they did not perform a model test prior to constructing the tree, which can impact the final results.

    19. Page 22 lines 517-520: The authors do not elaborate why they chose FC > 1.1 or FC < 0.9 to call differentially expressed genes

    20. Page 23 lines 529-532: How the authors performed the gene ontology enrichment/depletion analysis is not clear. For example, if the authors indeed prefiltered their list to remove genes that have no GO terms, that would bias the results.

    21. Page 24 lines 552-554: For the non-targeting siRNA, it is unclear whether this is a scramble or targeting another gene (such as GFP)?

  5. ##Preprint Review

    This preprint was reviewed using eLife’s Preprint Review service, which provides public peer reviews of manuscripts posted on bioRxiv for the benefit of the authors, readers, potential readers, and others interested in our assessment of the work. This review applies only to version 1 of the manuscript.

    ###Summary

    This paper analyzes the evolution of the KRAB-containing zinc finger protein (KZFP) family of proteins. While the reviewers were all interested in the topic, several major concerns came up during review. These include technical limitations of the methods chosen to analyze this challenging protein family (e.g., determination of orthology, selection analysis, and so on), and that new ideas, including claims about non-coding evolution and positive selection, are not convincingly supported by the analysis presented.