vcfsim: flexible simulation of all-sites VCFs with missing data

Abstract

VCFs are the most widely used data format for encoding genetic variation. By design, standard VCFs do not include data from sites where all individuals are homozygous for the reference allele (“invariant sites”) and thus do not differentiate these from sites where data are completely missing. However, missing data are a key feature of biological datasets across all domains of genomics, and many recent studies have shown that missing data can introduce a variety of statistical biases in the estimation of key population genetic parameters. A solution to this limitation is to include invariant sites in a standard VCF, creating an “all-sites VCF” in which missing and invariant sites are represented explicitly. One hurdle to the wider adoption of all-sites VCFs is the lack of a reliable parameterized simulation framework for generating biologically realistic all-sites VCFs. Here, we introduce an open-source command line tool, vcfsim, that interfaces with the popular coalescent simulation platform msprime and provides convenience functions for simulating all-sites VCFs with variable levels of ploidy and missing data. We show that the post-processed VCFs generated using vcfsim align precisely with population genetic expectations (i.e., are statistically identical to raw msprime output) and can accurately introduce missing data and varying ploidy levels, including the simulation of intraindividual ploidy variation (e.g., heterogametic sex chromosomes). We suggest vcfsim will be a useful tool for the benchmarking of new software tools, the training of machine learning models, and the exploration of the effects of missing data in genomic datasets.
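
To make the distinction between invariant and missing sites concrete, the following is a minimal sketch of how an all-sites VCF body differs from a variants-only one. It is not vcfsim's implementation and does not call msprime: the function name all_sites_records, its parameters, the placeholder REF/ALT bases, and the per-genotype masking scheme are all assumptions made for illustration. Every position receives a record, invariant sites are written as homozygous reference, and genotypes are independently replaced by missing calls at a chosen rate.

```python
import random


def all_sites_records(seq_len, variants, n_samples, ploidy=2,
                      missing_rate=0.0, seed=None):
    """Yield VCF-style data lines for every position 1..seq_len.

    `variants` maps a 1-based position to a list of per-sample allele
    tuples, e.g. {3: [(0, 1), (1, 1)]}. Positions absent from `variants`
    are emitted as invariant (all samples homozygous reference).
    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    for pos in range(1, seq_len + 1):
        calls = variants.get(pos)
        # Placeholder bases: "A" as REF, "T" as ALT; "." marks no ALT allele.
        alt = "T" if calls is not None else "."
        if calls is None:
            calls = [(0,) * ploidy] * n_samples
        fields = []
        for alleles in calls:
            if rng.random() < missing_rate:
                # Missing call with the same ploidy, e.g. "./." or ".".
                fields.append("/".join("." * len(alleles)))
            else:
                fields.append("/".join(str(a) for a in alleles))
        yield "\t".join(
            ["1", str(pos), ".", "A", alt, ".", ".", ".", "GT"] + fields
        )


# One variant site at position 3 in a 5 bp region, two diploid samples.
variants = {3: [(0, 1), (1, 1)]}
records = list(all_sites_records(5, variants, n_samples=2))
for line in records:
    print(line)
```

Because the allele tuples in variants may differ in length between samples, the same sketch can represent mixed ploidy at a site, for example a hemizygous call (1,) alongside a diploid call (0, 1).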
