ntsm: an alignment-free, ultra low coverage, sequencing technology agnostic, intraspecies sample comparison tool for sample swap detection

This article has been Reviewed by the following groups

Read the full article See related articles

Abstract

Background

Due to human error, sample swapping in large cohort studies with heterogeneous data types (e.g. mix of Oxford Nanopore, Pacific Bioscience, Illumina data, etc .) remains a common issue plaguing large-scale studies. At present, all sample swapping detection methods require costly and unnecessary (e.g. if data is only used for genome assembly) alignment, positional sorting, and indexing of the data in order to compare similarly. As studies include more samples and new sequencing data types, robust quality control tools will become increasingly important.

Findings

The similarity between samples can be determined using indexed k -mer sequence variants. To increase statistical power, we use coverage information on variant sites, calculating similarity using a likelihood ratio-based test. Per sample error rate, and coverage bias ( i . e . missing sites) can also be estimated with this information, which can be used to determine if a spatially indexed PCA-based pre-screening method can be used, which can greatly speed up analysis by preventing exhaustive all-to-all comparisons.

Conclusions

Because this tool processes raw data, is faster than alignment, and can be used on very low coverage data, it can save an immense degree of computational resources in standard QC pipelines. It is robust enough to be used on different sequencing data types, important in studies that leverage the strengths of different sequencing technologies. In addition to its primary use case of sample-swap detection, this method provides other useful information useful in QC, such as error rate and coverage bias, as well as population-level PCA ancestry analysis visualization.

Article activity feed

  1. Findings The similarity between samples can be determined using indexed k-mer sequence variants. To increase statistical power, we use coverage information on variant sites, calculating similarity using a likelihood ratio-based test. Per sample error rate, and coverage bias (i.e. missing sites) can also be estimated with this information, which can be used to determine if a spatially indexed PCA-based pre-screening method can be used, which can greatly speed up analysis by preventing exhaustive all-to-all comparisons.

    Reviewer2: Qian Zhou In this paper, the authors have presented a tool, ntsm, which utilizes the k-mer distribution information directly from raw sequencing data for sample swap detection. The approach of bypassing the reference genome alignment step and saving computational resources is commendable. Utilizing k-mers for reference-free and de novo analysis of sequencing data is a valuable application. The authors have demonstrated the impressive performance of ntsm on low coverage data through experimental results presented in the manuscript, showcasing its strengths in terms of sensitivity, accuracy. However, while ntsm eliminates the need for reference genome alignment, it still relies on a pre-defined set of variant sites and pre-built PCA rotation matrices. This raises doubts about the true reference-free nature of ntsm and raises concerns about its generalizability to other species.Major comments:1.The concept of reference-free:I believe that ntsm's approach is not truly reference-free. In order to use ntsm, it requires the use of existing high-quality population SNP sites and kmers from the human reference genome. Additionally, the population PCA results are used to assist in pairwise comparisons between samples. Both of these information can only be obtained when a reference genome is available. A true referencefree tool would be applicable to species without a reference genome, such as SPLASH (Chaung et al., 2023, Cell). ntsm can be considered as an alignment-free or kmer-based tool.2.The reduction of computational costs:NTSM differs from Somalier in its computational workflow. To compare the computational costs or time, a holistic end-to-end comparison is necessary, rather than timing individual steps such as kmer counting and sample pairwise comparison separately. Conducting an end-to-end comparison for an analysis task allows users to have a comprehensive understanding of the tool's time and cost consumption. Furthermore, when comparing software, it is important to allocate computational resources fairly. For example, ntsm utilizes 16 threads in the 'Sample comparison process' stage, while for the 'k-mer counting (ntsm) vs. alignment (somalier)' stage, tools like bwa and minimap2, which can utilize multiple threads, were run using a single thread.3.Sensitivity and Specificity:More experimental details are needed. In the section 'Sensitivity and Specificity of Sample Swaps,' were the results obtained using the 39 HPRC samples? Did it include their Hi-C data?For Fig 6, did the results come from all sequencing datasets of the 39 samples, including Illumina and ONT? Since the results was obtained using full coverage, would the threshold change at lower coverage?For Fig 7, which demonstrates ntsm's results, was PCA information used as an auxiliary? Does the use of PCA information impact Sensitivity and Specificity?4.Regarding PCA-based method:The 39 HPRC samples used in the study are actually part of the 3,202 samples from the 1000 Genomes Project. Therefore, it is important to clarify whether the PCA matrix used in the study already includes information from these 39 samples. From a rigorous experimental design perspective, a precomputed PCA matrix should not include information from the 39 samples. Otherwise, the effect of the PCA matrix on these 39 samples may be overestimated. It raises questions about whether the same results can be achieved on non-1000 Genomes Project samples.5.The applicability of the tool:In order to expand the applicability of ntsm to a wider range of species, two aspects need to be addressed:1). Provide detailed information on customizing the sites file. From the site files available in ntsm code repository on GitHub, the process of selecting variant sites seems to be more complex than what is described in the manuscript, involving more than just SNP variants.2). The sites and PCA files should be user-customizable inputs instead of being built-in. This limitation restricts the application of ntsm to other species.Minor comments:The manuscript appears to have been hastily written and requires further polish by the authors.1. In Figure 6, A and B seem to be labeled incorrectly.2. In Figure 9, the two subplots have different y-axes, one labeled "min" and the other labeled "s." Could you clarify what each subplot is illustrating?3. When mentioning HPRC for the first time, it would be helpful to provide the full name and explanation of the acronym. However, the full explanation appears in the next paragraph.4. "We then keep only purine to pyrimidine (A or T to G or C) variants, as final insurance against possible human error influencing this tool" It seems there may be a mistake or confusion in the sentence. The writer should indeed mention "A/G <-> C/T" instead of "A/T <-> G/C" to accurately describe purine to pyrimidine variants. The writer may have made an error in describing the nucleotide exchange, or it could be a typographical mistake.5. There is a typo in the formula for estimating sequencing error rate. (nm)·log(1-… …

  2. Background Due to human error, sample swapping in large cohort studies with heterogeneous data types (e.g. mix of Oxford Nanopore, Pacific Bioscience, Illumina data, etc.) remains a common issue plaguing large-scale studies. At present, all sample swapping detection methods require costly and unnecessary (e.g. if data is only used for genome assembly) alignment, positional sorting, and indexing of the data in order to compare similarly. As studies include more samples and new sequencing data types, robust quality control tools will become increasingly important.

    Reviewer1: Jianxin Wang In this manuscript, authors present a fast intra-species sample swap detecting tool, named ntsm. By counting the relevant variant k-mers from samples, it estimates the probability of each allele at sites and then uses the likelihood ratio test to detect sample swaps. Compared with the alignment-based method, Somalier, nsam performs better on low coverage data (≤5X) and is more efficient in terms of memory and computing time. The authors use PCA-based spatial index heuristic to reduce the number of sample comparisons. Of course, in my opinion, compared with the time spent on counting k-mer, the time saved by the PCA-based method is trivial. In addition, ntsm also provides other features such as error rate estimation. The tool requires population snp information, which limits its applications in practice to some extent. Overall, ntsm is a fast and practical tool for calculating intra-species sample similarity and detecting sample swaps. The writing and experiments in this paper are generally well done. There are some major and minor issues that I suggest the authors consider addressing.Major issues:The paper mentions that due to high error rates, nanopore data is difficult to analyze. Can the authors analyze the performance of ntsm under different error rate data? In general, alignment-based methods may perform better on high error rate data. This is very useful information for users to choose the tool.The authors use the PCA-based spatial index heuristic to reduce the number of pairwise comparisons. However, the relation between PCA distance and similarity score is not clear here. How to ensure that samples with similarity scores less than the threshold are within the search radius?The paper involves two metrics, say, similarity score and relatedness, to detect sample swaps. Can the authors analyze the relation between them to help readers understand the advantages and disadvantages of the two methods?Minor issues:In the "Conlusions" section, the second "useful" in the sentence "this method provides other useful information useful in QC" is redundant."R=1, p<2.2e-16" in Figure 3 is not explained.In the "Sequencing error rate estimation" section, the variable n is not explained.In Figure 9, the case of the first letter of two y-axis labels (time) is inconsistent.