xAtlas: Scalable small variant calling across heterogeneous next-generation sequencing experiments

This article has been reviewed by the following groups

Abstract

Motivation

The rapid development of next-generation sequencing (NGS) technologies has lowered the barriers to genomic data generation, resulting in millions of samples sequenced across diverse experimental designs. The growing volume and heterogeneity of these sequencing data complicate the further optimization of methods for identifying DNA variation, especially considering that curated high-confidence variant call sets commonly used to evaluate these methods are generally developed by reference to results from the analysis of comparatively small and homogeneous sample sets.

Results

We have developed xAtlas, an application for the identification of single nucleotide variants (SNVs) and small insertions and deletions (indels) in NGS data. xAtlas is easily scalable and enables execution and retraining with rapid development cycles. Generation of variant calls in VCF or gVCF format from BAM or CRAM alignments is accomplished in less than one CPU-hour per 30× short-read human whole genome. The retraining capabilities of xAtlas allow its core variant evaluation models to be optimized on new sample data and user-defined truth sets. Obtaining SNV and indel calls from xAtlas can be achieved more than 40 times faster than established methods while retaining the same accuracy.

Availability

Freely available under a BSD 3-clause license at https://github.com/jfarek/xatlas.

Contact

farek@bcm.edu

Supplementary information

Supplementary data are available at Bioinformatics online.

Article activity feed

  1. Motivation

    This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giac125), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

    Reviewer 1: Ruibang Luo

    In this paper, the authors proposed xAtlas, an open-source NGS variant caller. xAtlas is a fast and lightweight caller with performance comparable to other benchmarked callers. The benchmark comparison on multiple popular short-read platforms (Illumina HiSeq X and NovaSeq) demonstrated xAtlas's capacity to identify small variants rapidly with desirable performance. Although xAtlas is limited in calling multi-allelic variants, its high sensitivity (~99.75% recall for ~60x benchmarking datasets) and short runtime (<2 hours) enable xAtlas to rapidly filter candidates and serve as an important quality-control step before further analysis.

    The authors presented a detailed explanation of xAtlas's workflow and design decisions and have done complete benchmarking experiments, but there are still some points the authors need to discuss further, listed as follows:

    The authors reported the performance at multiple coverages of the HG001 sample and the benchmarking results of the HG002-4 samples by measuring concordance with the GIAB truth set (v3.3.2). I noticed that GIAB has updated the truth sets from v3.3.2 to v4.2.1 for the Ashkenazi trio. The updated version includes more difficult regions, such as segmental duplications and the Major Histocompatibility Complex (MHC), to identify previously unknown clinically relevant variants. Therefore, it would be helpful if the authors could provide a performance evaluation using the updated truth sets to give a more comprehensive picture of the proposed caller.

    In the Methods section, the authors stated the three main stages of the xAtlas variant calling process: read preprocessing, candidate identification, and candidate evaluation. In the candidate evaluation stage, the authors feed hand-crafted features (base quality, coverage, reference and alternate allele support, etc.) into a logistic regression model to classify true variants and reference calls. But in Figure 1, the main workflow of xAtlas, only model scoring is shown, and the evaluation details are not visible. It would be useful if the authors could enrich Figure 1 with more detail to ensure consistency with the Methods and facilitate reader understanding.
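The candidate-evaluation step the reviewer summarizes can be sketched as a plain logistic regression over per-site features. This is a minimal illustration only: the feature names, weights, and bias below are assumptions for the example, not xAtlas's actual trained model.

```python
import math

def logistic_score(features, weights, bias):
    """Score a candidate variant with a logistic regression model.

    Returns the model's probability that the candidate is a true variant.
    """
    z = bias + sum(w * features[name] for name, w in weights.items())
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical feature vector for one candidate SNV (illustrative values)
candidate = {
    "mean_base_quality": 32.0,    # mean base quality at the site
    "coverage": 48.0,             # total read depth
    "alt_allele_fraction": 0.47,  # alternate allele support / coverage
}
# Hypothetical trained coefficients
weights = {"mean_base_quality": 0.08, "coverage": 0.01, "alt_allele_fraction": 4.0}
p = logistic_score(candidate, weights, bias=-4.0)  # probability in (0, 1)
```

In practice such a probability would be converted to a Phred-scaled quality and thresholded to decide between a variant call and a reference call.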

    In Figure 2, the authors reported the xAtlas performance comparison on the HG001 dataset against other variant callers. I noticed that the x-axis was F1-score while the y-axis was true positives per second. Plotting these two unrelated metrics against each other might confuse readers. We suggest the authors make separate comparisons for the two metrics (for instance, precision-recall curves for accuracy measurement and a runtime comparison of the various variant callers for speed benchmarking).
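The reviewer's point is that the two axes measure independent things and are cleaner reported separately. A sketch of the two metrics, with illustrative counts (not figures from the paper):

```python
def f1_score(tp, fp, fn):
    """Accuracy axis: harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def tp_per_second(tp, runtime_seconds):
    """Throughput axis: true positives called per second of runtime."""
    return tp / runtime_seconds

# Illustrative counts for one caller on one benchmark sample
f1 = f1_score(tp=3_500_000, fp=20_000, fn=15_000)
speed = tp_per_second(tp=3_500_000, runtime_seconds=3_600)
```

Reporting F1 via precision-recall curves and speed via a runtime table keeps each comparison interpretable on its own.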

    Zheng, Zhenxian on behalf of the primary reviewer

    Reviewer 2: Jorge Duitama

    The manuscript describes a variant caller called xAtlas, which uses a logistic regression model to call SNPs after building an alignment and pileup of the reads. The manuscript is clear. The software is built with the aim of being faster than other solutions. However, I have some concerns about the method and the manuscript.

    1. Unfortunately, the biggest issue with this work is that the gain in speed is obtained with an important sacrifice in accuracy, especially for calling indels. I ran xAtlas on two different benchmark datasets, and the accuracy, especially for indels and other complex regions, was about 20% lower compared to other solutions. Although the difference was smaller, xAtlas is also less accurate than other software tools for SNV calling. It is well known that even a simple SNV caller can achieve high sensitivity and specificity (see results from https://doi.org/10.1101/gr.107524.110). However, several SNV errors can be generated by incorrect alignment of reads around indels and other complex regions. For that reason, most of the work on variant detection is focused on mechanisms to perform indel realignment or de novo mini-assembly to increase the accuracy of both SNV and indel detection. The Strelka paper is a great example of this (https://doi.org/10.1038/s41592-018-0051-x). The manuscript does not mention whether any procedure has been implemented to realign reads or to otherwise increase the accuracy of indel calling. This is critical if xAtlas is meant to be used in clinical settings.

    2. The manuscript looks outdated in terms of evaluation datasets, metrics and available tools. Since high values of standard precision and sensitivity are easy to achieve with simple SNV callers, metrics such as the false positives per million base pairs (FPPM) proposed by the developers of the synthetic diploid benchmark dataset should be used to achieve a clearer assessment of the accuracy of the different methods (https://doi.org/10.1038/s41592-018-0054-7). Regarding benchmark experiments, Syndip should also be used for benchmarking. To actually support the claim that xAtlas is reliable across heterogeneous datasets (as stated in the title), further datasets should be tested, as has been done for software tools such as NGSEP (https://doi.org/10.1093/bioinformatics/btz275). In terms of tools, both DeepVariant and NGSEP should be included in the comparisons.
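The FPPM metric the reviewer recommends normalizes false-positive counts by the size of the evaluated region, which makes error rates comparable across benchmarks of different sizes. A minimal sketch, with illustrative numbers (not results from the paper):

```python
def fppm(false_positives, region_bp):
    """False positives per million base pairs of the evaluated region.

    Unlike raw precision, this rate is comparable across benchmark
    regions of different sizes.
    """
    return false_positives / (region_bp / 1_000_000)

# e.g. 2,400 false positives over a 2.8 Gbp high-confidence region
rate = fppm(false_positives=2_400, region_bp=2_800_000_000)
```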

    3. Regarding the metrics proposed by the authors, I do not think it is good practice to merge results on accuracy and efficiency, taking into account that the accuracy in this case is lower than that of other solutions, and for clinical settings that is an important issue. The supplementary table should also report sensitivity and precision for indels, not only for SNVs.

    4. The SNV calling method, and particularly the genotyping procedure, should be described in much better detail. The manuscript describes the general pileup process, then mentions some general filters for read alignments, and then mentions that it applies logistic regression. However, it is not clear which data are used for this regression or, in general, how allele counts and quality scores are taken into account. A much deeper description of the logistic regression model should be included in the manuscript.

    5. There are better methods than PCA to show clustering of the 1000g samples. A structure analysis is more suitable for population genomics data and shows the different subpopulations more clearly.

    6. Finally, about the software: genotype calls produced by xAtlas should include a value for the genotype quality (GQ) FORMAT field to assess genotyping accuracy. For single-sample analysis the QUAL value can be used (although this is not entirely correct). However, for population VCFs, the GQ field is very important as a per-datapoint measure of genotyping quality. Regarding population VCF files, it is not clear, either from the in-line help or from the GitHub site, how population VCF files should be constructed.
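The GQ field the reviewer asks for has a standard definition in the VCF specification: the Phred-scaled confidence in the called genotype, conventionally computed as the difference between the best (smallest) and second-best PL value, capped at 99. A minimal sketch of that convention (the example PL values are hypothetical):

```python
def genotype_quality(pl):
    """GQ as commonly derived from PL values (VCF convention).

    `pl` is the list of Phred-scaled genotype likelihoods (smaller is
    more likely). GQ is the gap between the second-best and best PL,
    capped at 99 by convention.
    """
    best, second = sorted(pl)[:2]
    return min(second - best, 99)

# PL for genotypes 0/0, 0/1, 1/1 of a confidently called heterozygote
gq = genotype_quality([120, 0, 150])
```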