IGD: A simple, efficient genotype data format
Abstract
Motivation
While there are a variety of file formats for storing reference-sequence-aligned genotype data, many are complex or inefficient. Programming language support for such formats is often limited. A file format that is simple to understand and implement – yet fast and small – is helpful for research on highly scalable bioinformatics.
Results
We present the Indexable Genotype Data (IGD) file format, a simple uncompressed binary format that can be more than 100 times faster and 3.5 times smaller than vcf.gz on Biobank-scale whole-genome sequence data. The implementation for reading and writing IGD in Python is under 350 lines of code, which reflects the simplicity of the format.
Availability
A C++ library for reading and writing IGD, along with tooling to convert .vcf.gz files, can be found at https://github.com/aprilweilab/picovcf. A Python library is available at https://github.com/aprilweilab/pyigd.