xAtlas: Scalable small variant calling across heterogeneous next-generation sequencing experiments

This article has been reviewed by the following groups

Abstract

Motivation

The rapid development of next-generation sequencing (NGS) technologies has lowered the barriers to genomic data generation, resulting in millions of samples sequenced across diverse experimental designs. The growing volume and heterogeneity of these sequencing data complicate the further optimization of methods for identifying DNA variation, especially considering that curated high-confidence variant call sets commonly used to evaluate these methods are generally developed by reference to results from the analysis of comparatively small and homogeneous sample sets.

Results

We have developed xAtlas, an application for the identification of single nucleotide variants (SNVs) and small insertions and deletions (indels) in NGS data. xAtlas is easily scalable and enables execution and retraining with rapid development cycles. Generation of variant calls in VCF or gVCF format from BAM or CRAM alignments is accomplished in less than one CPU-hour per 30× short-read human whole genome. The retraining capabilities of xAtlas allow its core variant evaluation models to be optimized on new sample data and user-defined truth sets. Obtaining SNV and indel calls from xAtlas can be achieved more than 40 times faster than established methods while retaining the same accuracy.

Availability

Freely available under a BSD 3-clause license at https://github.com/jfarek/xatlas.

Contact

farek@bcm.edu

Supplementary information

Supplementary data are available at Bioinformatics online.

Article activity feed

  1. Motivation

    This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giac125), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

    Reviewer 1: Ruibang Luo

    In this paper, the authors proposed xAtlas, an open-source NGS variant caller. xAtlas is a fast and lightweight caller with performance comparable to other benchmarked callers. The benchmark comparison on multiple popular short-read platforms (Illumina HiSeq X and NovaSeq) demonstrated xAtlas's capacity to identify small variants rapidly with desirable performance. Although xAtlas is limited in calling multi-allelic variants, its high sensitivity (~99.75% recall for ~60x benchmarking datasets) and short runtime (<2 hours) enable xAtlas to rapidly filter candidates and serve as an important quality-control step before further analysis.

    The authors presented a detailed explanation of xAtlas's workflow and design decisions and have done complete benchmarking experiments, but there are still some points the authors need to discuss further, listed as follows:

    The authors reported the performance at multiple coverages of the HG001 sample and the benchmarking results of the HG002-4 samples by measuring concordance with the GIAB truth set (v3.3.2). I noticed that GIAB has updated the truth sets from v3.3.2 to v4.2.1 for the Ashkenazi trio. The updated version includes more difficult regions, such as segmental duplications and the Major Histocompatibility Complex (MHC), to identify previously unknown clinically relevant variants. Therefore, it would be helpful if the authors could provide a performance evaluation using the updated truth sets to give a more comprehensive picture of the proposed caller.

    In the Methods section, the authors stated the three main stages of the xAtlas variant calling process: read preprocessing, candidate identification, and candidate evaluation. In the candidate evaluation stage, the authors feed hand-crafted features (base quality, coverage, reference and alternate allele support, etc.) into a logistic regression model to classify true variants and reference calls. But in Figure 1, the main workflow of xAtlas, only model scoring is shown, and the evaluation details are not visible. It would be useful if the authors could enrich Figure 1 with more detail to ensure consistency with the Methods and facilitate reader understanding.
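The candidate-evaluation step the reviewer summarizes can be sketched as a plain logistic regression over per-site features. This is a minimal illustration only: the feature names, weights, and bias below are assumptions for the example, not xAtlas's actual trained model.

```python
import math

def logistic_score(features, weights, bias):
    """Score a candidate variant with a logistic regression model.

    Returns the model's probability that the candidate is a true variant.
    """
    z = bias + sum(w * features[name] for name, w in weights.items())
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical feature vector for one candidate SNV (illustrative values)
candidate = {
    "mean_base_quality": 32.0,    # mean base quality at the site
    "coverage": 48.0,             # total read depth
    "alt_allele_fraction": 0.47,  # alternate allele support / coverage
}
# Hypothetical trained coefficients
weights = {"mean_base_quality": 0.08, "coverage": 0.01, "alt_allele_fraction": 4.0}
p = logistic_score(candidate, weights, bias=-4.0)  # probability in (0, 1)
```

In practice such a probability would be converted to a Phred-scaled quality and thresholded to decide between a variant call and a reference call.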

    In Figure 2, the authors reported the xAtlas performance comparison on the HG001 dataset against other variant callers. I noticed that the x-axis was F1-score while the y-axis was true positives per second. Plotting these two unrelated metrics against each other might confuse readers. We suggest the authors make separate comparisons for the two metrics (for instance, precision-recall curves for accuracy measurement and a runtime comparison of the various variant callers for speed benchmarking).
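The reviewer's point is that the two axes measure independent things and are cleaner reported separately. A sketch of the two metrics, with illustrative counts (not figures from the paper):

```python
def f1_score(tp, fp, fn):
    """Accuracy axis: harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def tp_per_second(tp, runtime_seconds):
    """Throughput axis: true positives called per second of runtime."""
    return tp / runtime_seconds

# Illustrative counts for one caller on one benchmark sample
f1 = f1_score(tp=3_500_000, fp=20_000, fn=15_000)
speed = tp_per_second(tp=3_500_000, runtime_seconds=3_600)
```

Reporting F1 via precision-recall curves and speed via a runtime table keeps each comparison interpretable on its own.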

    Zheng, Zhenxian on behalf of the primary reviewer

    Reviewer 2: Jorge Duitama

    The manuscript describes a variant caller called xAtlas, which uses a logistic regression model to call SNPs after building an alignment and pileup of the reads. The manuscript is clear. The software is built with the aim of being faster than other solutions. However, I have some concerns about the method and the manuscript.

    1. Unfortunately, the biggest issue with this work is that the gain in speed is obtained with an important sacrifice in accuracy, especially for calling indels. I ran xAtlas on two different benchmark datasets, and the accuracy, especially for indels and other complex regions, was about 20% lower compared to other solutions. Although the difference was smaller, xAtlas is also less accurate than other software tools for SNV calling. It is well known that even a simple SNV caller can achieve high sensitivity and specificity (see results from https://doi.org/10.1101/gr.107524.110). However, several SNV errors can be generated by incorrect alignment of reads around indels and other complex regions. For that reason, most of the work on variant detection is focused on mechanisms to perform indel realignment or de novo mini-assembly to increase the accuracy of both SNV and indel detection. The Strelka paper is a great example of this (https://doi.org/10.1038/s41592-018-0051-x). The manuscript does not mention whether any procedure has been implemented to realign reads or to otherwise increase the accuracy of indel calling. This is critical if xAtlas is meant to be used in clinical settings.

    2. The manuscript looks outdated in terms of evaluation datasets, metrics and available tools. Since high values of standard precision and sensitivity are easy to achieve with simple SNV callers, metrics such as the false positives per million base pairs (FPPM) proposed by the developers of the synthetic diploid benchmark dataset should be used to achieve a clearer assessment of the accuracy of the different methods (https://doi.org/10.1038/s41592-018-0054-7). Regarding benchmark experiments, Syndip should also be used for benchmarking. To actually support the claim that xAtlas is reliable across heterogeneous datasets (as stated in the title), further datasets should be tested, as has been done for software tools such as NGSEP (https://doi.org/10.1093/bioinformatics/btz275). In terms of tools, both DeepVariant and NGSEP should be included in the comparisons.
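The FPPM metric the reviewer recommends normalizes false-positive counts by the size of the evaluated region, which makes error rates comparable across benchmarks of different sizes. A minimal sketch, with illustrative numbers (not results from the paper):

```python
def fppm(false_positives, region_bp):
    """False positives per million base pairs of the evaluated region.

    Unlike raw precision, this rate is comparable across benchmark
    regions of different sizes.
    """
    return false_positives / (region_bp / 1_000_000)

# e.g. 2,400 false positives over a 2.8 Gbp high-confidence region
rate = fppm(false_positives=2_400, region_bp=2_800_000_000)
```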

    3. Regarding the metrics proposed by the authors, I do not think it is good practice to merge results on accuracy and efficiency, taking into account that the accuracy in this case is lower than that of other solutions, and for clinical settings that is an important issue. The supplementary table should also report sensitivity and precision for indels, not only for SNVs.

    4. The SNV calling method, and particularly the genotyping procedure, should be described in much better detail. The manuscript describes the general pileup process, then mentions some general filters for read alignments, and then mentions that it applies logistic regression. However, it is not clear which data are used for this regression or, in general, how allele counts and quality scores are taken into account. A much deeper description of the logistic regression model should be included in the manuscript.

    5. There are better methods than PCA to show clustering of the 1000g samples. A structure analysis is more suitable for population genomics data and shows the different subpopulations more clearly.

    6. Finally, about the software: genotype calls produced by xAtlas should include a value for the genotype quality (GQ) FORMAT field to assess genotyping accuracy. For single-sample analysis the QUAL value can be used (although this is not entirely correct). However, for population VCFs, the GQ field is very important as a per-datapoint measure of genotyping quality. Regarding population VCF files, it is not clear, either from the in-line help or from the GitHub site, how population VCF files should be constructed.
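The GQ field the reviewer asks for has a standard definition in the VCF specification: the Phred-scaled confidence in the called genotype, conventionally computed as the difference between the best (smallest) and second-best PL value, capped at 99. A minimal sketch of that convention (the example PL values are hypothetical):

```python
def genotype_quality(pl):
    """GQ as commonly derived from PL values (VCF convention).

    `pl` is the list of Phred-scaled genotype likelihoods (smaller is
    more likely). GQ is the gap between the second-best and best PL,
    capped at 99 by convention.
    """
    best, second = sorted(pl)[:2]
    return min(second - best, 99)

# PL for genotypes 0/0, 0/1, 1/1 of a confidently called heterozygote
gq = genotype_quality([120, 0, 150])
```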