SVEngine: an efficient and versatile simulator of genome structural variations with features of cancer clonal evolution

Abstract

Background

Simulating genome sequence data with variant features facilitates the development and benchmarking of structural variant analysis programs. However, only a few data simulators provide structural variants in silico, and even fewer provide variants with different allelic fractions and haplotypes.

Findings

We developed SVEngine, an open-source tool to address this need. SVEngine simulates next-generation sequencing data with embedded structural variations. As input, SVEngine takes template haploid sequences (FASTA) and an external variant file, a variant distribution file, and/or a clonal phylogeny tree file (NEWICK). It then simulates and outputs sequence contigs (FASTAs), sequence reads (FASTQs), and/or post-alignment files (BAMs). All of these files contain the desired variants and are accompanied by BED files containing the ground truth. SVEngine's flexible design enables one to specify size, position, and allelic fraction for deletions, insertions, duplications, inversions, and translocations. SVEngine also simulates sequence data that replicate the characteristics of a sequencing library with mixed sizes of DNA insert molecules. SVEngine is highly parallelized to reduce the simulation time.
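To illustrate what a library with mixed sizes of DNA insert molecules means in practice, the following is a minimal sketch that draws insert sizes from a mixture of normal distributions. It is not SVEngine's implementation or interface: the function name `sample_insert_sizes`, the `(weight, mean, stdev)` tuples, and the example values are all assumptions made for illustration.

```python
import random


def sample_insert_sizes(libraries, n_pairs, seed=0):
    """Draw insert sizes for n_pairs read pairs from a mixture of libraries.

    libraries: list of (weight, mean, stdev) tuples. These parameters are
    hypothetical and do not correspond to any SVEngine option.
    """
    rng = random.Random(seed)
    weights = [weight for weight, _, _ in libraries]
    sizes = []
    for _ in range(n_pairs):
        # Pick one library in proportion to its weight.
        _, mean, stdev = rng.choices(libraries, weights=weights, k=1)[0]
        # Draw a size from that library; redraw if it is not positive.
        size = int(round(rng.gauss(mean, stdev)))
        while size <= 0:
            size = int(round(rng.gauss(mean, stdev)))
        sizes.append(size)
    return sizes


if __name__ == "__main__":
    # Assumed example: 70% short-insert and 30% long-insert molecules.
    mixed = sample_insert_sizes([(0.7, 300, 30), (0.3, 3000, 300)], 10)
    print(mixed)
```

A read simulator would then place each read pair so that its two ends are separated by the sampled insert size.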

Conclusions

We demonstrated the versatile features of SVEngine and its improved runtime in comparison with other available simulators. SVEngine's features include the simulation of locus-specific variant frequency designed to mimic the phylogeny of cancer clonal evolution. We validated SVEngine's accuracy by simulating genome-wide structural variants of NA12878 and a heterogeneous cancer genome. Our evaluation included checking various sequencing mapping features such as coverage change, read clipping, insert size shift, and neighboring hanging read pairs for representative variant types. Structural variant callers Lumpy and Manta and tumor heterogeneity estimator THetA2 were able to perform realistically on the simulated data. SVEngine is implemented as a standard Python package and is freely available for academic use.
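The link between a clonal phylogeny and a locus-specific variant frequency can be sketched as follows. This is an illustrative calculation under simplifying assumptions (a copy-neutral diploid locus, and a variant inherited by every clone below the branch on which it arises), not SVEngine's actual procedure. The nested-tuple tree, the clone names, and the prevalence values are hypothetical.

```python
def clade_fraction(clade, prevalence):
    """Fraction of cells belonging to any clone in the given clade.

    clade: a clone name (str) or a tuple of child clades.
    prevalence: dict mapping clone name to its fraction of all cells.
    """
    if isinstance(clade, str):
        return prevalence[clade]
    return sum(clade_fraction(child, prevalence) for child in clade)


def expected_allelic_fraction(clade, prevalence, variant_copies=1, ploidy=2):
    """Expected allelic fraction of a variant carried by every clone in clade.

    Assumes a constant ploidy at the locus in all cells (a simplification).
    """
    return clade_fraction(clade, prevalence) * variant_copies / ploidy


if __name__ == "__main__":
    # Hypothetical tree ((A,B),C) with assumed clone prevalences.
    prevalence = {"A": 0.2, "B": 0.3, "C": 0.5}
    # A heterozygous variant arising on the branch ancestral to A and B.
    print(expected_allelic_fraction(("A", "B"), prevalence))
```

Under these assumptions, a heterozygous variant shared by clones A and B is present in half of the cells and on one of two copies in each, so a quarter of the reads at that locus are expected to carry it.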

Article activity feed

  1. Now published in GigaScience doi: 10.1093/gigascience/giy081

    Authors: Li Charlie Xia (1, 2), Dongmei Ai (3), Hojoon Lee (1), Noemi Andor (1), Chao Li (3), Nancy R. Zhang (2), Hanlee P. Ji (1, 4)

    Affiliations:
    (1) Division of Oncology, Department of Medicine, Stanford University School of Medicine, Stanford, CA 94305
    (2) Department of Statistics, the Wharton School, University of Pennsylvania, Philadelphia, PA 18014
    (3) School of Mathematics and Physics, University of Science and Technology Beijing, 30 Xueyuan Road, Haidian District, Beijing 100083, P. R. China
    (4) Stanford Genome Technology Center, Stanford University, Palo Alto, CA 94304

    For correspondence: genomics_ji@stanford.edu

    A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giy081 ), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

    These peer reviews were as follows:

    Reviewer 1: http://dx.doi.org/10.5524/REVIEW.101246
    Reviewer 2: http://dx.doi.org/10.5524/REVIEW.101247
    Reviewer 3: http://dx.doi.org/10.5524/REVIEW.101248