Sim3C: simulation of HiC and Meta3C proximity ligation sequencing technologies

Abstract

Background

Chromosome conformation capture (3C) and HiC DNA sequencing methods have rapidly advanced our understanding of the spatial organization of genomes and metagenomes. Many variants of these protocols have been developed, each with their own strengths. Currently there is no systematic means for simulating sequence data from this family of sequencing protocols.

Findings

We describe a computational simulator that, given reference genome sequences and some basic parameters, will simulate HiC sequencing on those sequences. The simulator models the basic spatial structure in genomes that is commonly observed in HiC and 3C datasets, including the distance-decay relationship in proximity ligation, differences in the frequency of interaction within and across chromosomes, and the structure imposed by cells. A means to model the 3D structure of topologically associating domains (TADs) is provided. The simulator also models several sources of error common to 3C and HiC library preparation and sequencing methods, including spurious proximity ligation events and sequencing error.
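
As a rough illustration of the distance-decay and intra- versus inter-chromosomal behaviour described above, the following Python sketch draws simulated ligation pairs. It is not Sim3C's implementation: the function name `sample_pair`, the parameters `p_inter` and `min_sep`, and the choice of a 1/s decay in contact density are all assumptions made for this example.

```python
import random


def sample_pair(chrom_lengths, p_inter=0.1, min_sep=1_000, rng=None):
    """Draw one simulated ligation pair as ((chrom, pos), (chrom, pos)).

    Hypothetical sketch: p_inter is the assumed chance that a ligation joins
    two different chromosomes; intra-chromosomal separations follow an
    assumed density proportional to 1/s between min_sep and the farthest
    position reachable from the first site.
    """
    rng = rng or random.Random()
    names = list(chrom_lengths)
    lengths = [chrom_lengths[n] for n in names]
    if min(lengths) <= 2 * min_sep:
        raise ValueError("every chromosome must be longer than 2 * min_sep")

    # First site: chromosome chosen in proportion to its length.
    c1 = rng.choices(names, weights=lengths)[0]
    x1 = rng.randrange(chrom_lengths[c1])

    # Inter-chromosomal contact: second site is uniform on another chromosome.
    if len(names) > 1 and rng.random() < p_inter:
        others = [n for n in names if n != c1]
        c2 = rng.choices(others, weights=[chrom_lengths[n] for n in others])[0]
        return (c1, x1), (c2, rng.randrange(chrom_lengths[c2]))

    # Intra-chromosomal contact: inverse-CDF sampling of a 1/s density,
    # which is log-uniform on [min_sep, max_sep].
    length = chrom_lengths[c1]
    max_sep = max(x1, length - 1 - x1)
    sep = min(int(min_sep * (max_sep / min_sep) ** rng.random()), max_sep)

    # Linear chromosome: keep only the directions that stay in bounds.
    candidates = [x for x in (x1 - sep, x1 + sep) if 0 <= x < length]
    return (c1, x1), (c1, rng.choice(candidates))
```

Passing a seeded `random.Random` instance as `rng` makes a run repeatable, which is convenient when comparing analysis tools on the same simulated input.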

Conclusions

We have introduced the first comprehensive simulator for 3C and HiC sequencing protocols. We expect the simulator to have use in testing of HiC data analysis algorithms, as well as more general value for experimental design, where questions such as the required depth of sequencing, enzyme choice, and other decisions must be made in advance in order to ensure adequate statistical power to test the relevant hypotheses.

Article activity feed

  1. Abstract

    A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/gix103), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

    **Reviewer 1. Andrew Adey**

    In the manuscript by De Maere and Darling, the authors describe their computational simulator for HiC and 3C sequencing that models the 3D arrangement of chromatin and how that arrangement is conveyed via proximity ligation methods. Overall the manuscript is long and does not clearly describe the main goals of the simulator. The detail is appreciated, but not when it obfuscates the main goal of the manuscript. Also, the figures could be condensed so that there are fewer figures with more panels. That being said, I do believe the simulator that the authors have developed is very sophisticated and appears to work well, with a few exceptions. The major issue is the packaging of the method into a more concise and clear text. Below are some more specific comments:

    My first thought concerns where this simulator will be particularly useful. The authors mention that it is primarily for software tool development, that the cost of generating HiC/3C data is very high, and that many of the existing datasets are sparse. However, there are many existing datasets that are extremely rich and deep that would seem more appropriate. While I am not convinced of the utility for software development when abundant real data is publicly available, I do agree that having a means to simulate sequence read data may have other valuable applications, primarily in exploring power in deconvolving metagenomic samples.

    For the eukaryotic simulated data there is a clear stretch of signal that is perpendicular to the diagonal, as is typically observed for circular genomes, though this would not be expected for linear chromosomes (e.g. Figure 7). Does the simulator assume all chromosomes are circular? This is odd and needs to be addressed. Also on Figure 7, the authors highlight that there is a greater inter-chromosomal signal when compared to real data. Is that a good thing? I can see that it may be desirable if the goal is to generate the signal expected under the assumption that there is no chromatin organization in the genome, so that it can be used as a background model. I can see this as a potential use, but it should be more clearly stated.

    The authors describe the ability to simulate TADs. However, it is not clearly described how the TADs are decided upon. Can users specify where TADs should be located (e.g. if they have a callset of TADs and want to create data simulating them that they can then alter, such as changing one TAD to see how it affects signal nearby, so they know what to expect for an experiment where they may be altering TAD-forming loci)? Or are they only created randomly (which seems to be the case given page 8 line 212)? This could also be more clearly described by stating broadly what is done and then going into the methods of how that is accomplished.

    Figure 2 is an extremely simple and small diagram; could it not just be added into Figure 1? It seems a bit excessive for it to stand as its own figure. This goes for several other figures. Figure 8: there is no description for panels c and d. I assume c is real and d is simulated. The strong perpendicular band midway through the chromosome is observed, which is discouraging, as I have commented on for Figure 7.
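
To make the circular-versus-linear question concrete, the Python sketch below shows one way a circularity assumption changes expected contact weights: genomic separation wraps around the origin, so the two ends of the sequence become neighbours. The function names and the 1/(1 + separation) decay are assumptions for illustration only. They are not taken from Sim3C, and the sketch does not establish the cause of the pattern seen in the figures.

```python
def separation(x1, x2, length, circular=False):
    """Genomic separation between two positions on one replicon."""
    d = abs(x1 - x2)
    # On a circular replicon the shorter path may cross the origin.
    return min(d, length - d) if circular else d


def expected_matrix(n_bins, circular=False):
    """Relative contact weights under an assumed 1 / (1 + separation) decay."""
    return [
        [1.0 / (1 + separation(i, j, n_bins, circular)) for j in range(n_bins)]
        for i in range(n_bins)
    ]
```

With 8 bins, the corner entry `[0][7]` is 1/8 under the linear model and 1/2 under the circular one, while entries near the diagonal are identical in both.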

    Re-review: The major issues I had with the manuscript previously were that it was too long and may have limited interest. The authors have addressed the first point. For the second, I believe that the interest is broad enough to warrant publication.

  2. Background

    **Reviewer 2. Ming Hu**

    In this paper, the authors developed a software package Sim3C to simulate Hi-C data and other 3C-based data. This work addresses a very important research question, and has the potential to become a useful computational tool in genomics research. However, the authors need to provide more explanations and technical details to further improve the current manuscript.

    Here are my specific comments.

    Major comments:

    1. Figure 3. It would be better to plot Figure 3 on a log scale for both the x-axis and the y-axis. On a log scale, the slope of the contact probability has a direct biophysical interpretation, as described in the first Hi-C paper (Lieberman-Aiden et al, Science, 2009). I am very curious to see how the biophysical model contributes to the data generation mechanism.
    2. In the Rao et al, Cell, 2014 paper, the authors identified chromatin loops anchored by CTCF motifs. In Sim3C, the authors considered the 1D genomic distance effect and hierarchical TAD structures. It would be great if Sim3C could also take chromatin loops into consideration.
    3. Hi-C data can help to detect allele-specific chromatin interactions. Is Sim3C able to simulate allele-specific proximity ligation data?
    4. It is very important to rigorously evaluate the data reproducibility. Using Sim3C, if users simulate Hi-C data multiple times with different random seeds, would the reproducibility between two simulated datasets be comparable to the reproducibility between two real biological replicates?
    5. The authors showed simulated contact matrices of bacteria (Figure 6) and budding yeast (Figure 7). They also need to simulate both human and mouse genome-wide contact matrices, and compare the simulated contact matrices with real data.
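
Comments 1 and 4 can be made concrete with a small Python sketch: it bins simulated separations logarithmically, fits the slope of contact density against separation in log-log space, and correlates the binned densities of two runs that differ only in random seed. Everything here is an assumption for illustration (the helper names, the 1/s toy model used to generate data, and Pearson correlation as the reproducibility measure). It does not use Sim3C output, and comparing against biological replicates would require real data.

```python
import math
import random


def simulate_separations(n, min_sep, max_sep, seed):
    """Toy data: n separations with density proportional to 1/s (assumed)."""
    rng = random.Random(seed)
    return [min_sep * (max_sep / min_sep) ** rng.random() for _ in range(n)]


def log_binned_density(seps, min_sep, max_sep, n_bins=20):
    """Contact density per base pair in logarithmically spaced bins."""
    ratio = (max_sep / min_sep) ** (1 / n_bins)
    counts = [0] * n_bins
    for s in seps:
        i = min(int(math.log(s / min_sep) / math.log(ratio)), n_bins - 1)
        counts[i] += 1
    centres, density = [], []
    for i, c in enumerate(counts):
        lo, hi = min_sep * ratio ** i, min_sep * ratio ** (i + 1)
        centres.append(math.sqrt(lo * hi))  # geometric bin centre
        density.append(c / (hi - lo) / len(seps))
    return centres, density


def loglog_slope(xs, ys):
    """Least-squares slope of log10(y) against log10(x), skipping empty bins."""
    pts = [(math.log10(x), math.log10(y)) for x, y in zip(xs, ys) if y > 0]
    mx = sum(p[0] for p in pts) / len(pts)
    my = sum(p[1] for p in pts) / len(pts)
    sxy = sum((x - mx) * (y - my) for x, y in pts)
    sxx = sum((x - mx) ** 2 for x, _ in pts)
    return sxy / sxx


def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)
```

For the toy 1/s model the fitted slope should lie close to -1. The same `log_binned_density` and `pearson` calls could in principle be applied to binned contacts from two real replicates to put simulated and experimental reproducibility on the same scale.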

    Minor comments:

    1. Please replace all 'HiC' with 'Hi-C'.
    2. Page 6, line 116, "sciHiC" should be "scHi-C".