IGD: A simple, efficient genotype data format


Abstract

Motivation

While there are a variety of file formats for storing reference-sequence-aligned genotype data, many are complex or inefficient. Programming language support for such formats is often limited. A file format that is simple to understand and implement – yet fast and small – is helpful for research on highly scalable bioinformatics.

Results

We present the Indexable Genotype Data (IGD) file format, a simple uncompressed binary format that can be more than 100 times faster and 3.5 times smaller than vcf.gz on Biobank-scale whole-genome sequence data. The implementation for reading and writing IGD in Python is under 350 lines of code, which reflects the simplicity of the format.
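To illustrate why an uncompressed, indexable binary layout can be read quickly, here is a minimal Python sketch. It uses a hypothetical toy layout invented for this example (a variant count, a table of byte offsets, then per-variant lists of sample indexes); it is not the actual IGD specification and does not use the pyigd or picovcf APIs. The names `write_toy` and `read_variant` are likewise assumptions made for illustration.

```python
import io
import struct

# Hypothetical toy layout (NOT the real IGD specification):
#   header:  uint64 number of variants
#   index:   one uint64 byte offset per variant
#   records: uint32 count, followed by that many uint32 sample indexes


def write_toy(variants):
    """Serialize a list of sample-index lists into the toy layout."""
    buf = io.BytesIO()
    n = len(variants)
    buf.write(struct.pack("<Q", n))
    index_pos = buf.tell()
    buf.write(b"\x00" * (8 * n))  # reserve space for the offset table
    offsets = []
    for samples in variants:
        offsets.append(buf.tell())
        buf.write(struct.pack("<I", len(samples)))
        buf.write(struct.pack(f"<{len(samples)}I", *samples))
    buf.seek(index_pos)
    buf.write(struct.pack(f"<{n}Q", *offsets))  # fill in the offset table
    return buf.getvalue()


def read_variant(data, i):
    """Read only variant i by seeking through the offset table."""
    f = io.BytesIO(data)
    (n,) = struct.unpack("<Q", f.read(8))
    if not 0 <= i < n:
        raise IndexError(i)
    f.seek(8 + 8 * i)  # jump straight to the i-th index entry
    (offset,) = struct.unpack("<Q", f.read(8))
    f.seek(offset)  # jump straight to the record itself
    (count,) = struct.unpack("<I", f.read(4))
    return list(struct.unpack(f"<{count}I", f.read(4 * count)))


data = write_toy([[0, 3], [], [1, 2, 5]])
third = read_variant(data, 2)
```

Because every record is reachable through a fixed-size offset table, a reader in this sketch can fetch one variant with two seeks and no decompression or text parsing, which is the general property that an indexable binary format offers over a compressed text format such as .vcf.gz.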

Availability

A C++ library for reading and writing IGD, along with tooling to convert .vcf.gz files, can be found at https://github.com/aprilweilab/picovcf. A Python library is available at https://github.com/aprilweilab/pyigd.
