ntsm: an alignment-free, ultra-low-coverage, sequencing technology agnostic, intraspecies sample comparison tool for sample swap detection

Justin Chu
Jiazhen Rong
Xiaowen Feng
Heng Li


Listed in

Evaluated articles (GigaScience)

Abstract

Background

Due to human error, sample swapping in large cohort studies with heterogeneous data types (e.g., a mix of Oxford Nanopore Technologies, Pacific Biosciences, and Illumina data) remains a common issue plaguing large-scale studies. At present, all sample swap detection methods require costly and unnecessary (e.g., if data are only used for genome assembly) alignment, positional sorting, and indexing of the data in order to compare similarity. As studies include more samples and new sequencing data types, robust quality control tools will become increasingly important.

Findings

The similarity between samples can be determined using indexed k-mer sequence variants. To increase statistical power, we use coverage information on variant sites, calculating similarity using a likelihood ratio–based test. Per-sample error rate and coverage bias (i.e., missing sites) can also be estimated from this information. These estimates can in turn be used to decide whether a spatially indexed principal component analysis (PCA)–based prescreening method is applicable; such prescreening can greatly speed up analysis by avoiding exhaustive all-to-all comparisons.
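
To make the likelihood ratio idea concrete, here is a minimal sketch, assuming each sample has already been reduced to (reference, alternate) k-mer counts per variant site. It uses a generic binomial homogeneity test, which is not necessarily the exact statistic ntsm computes; the function names and the per-site averaging are assumptions made for illustration.

```python
import math

def _loglik(alt, total, p):
    """Binomial log-likelihood (constant term dropped) of `alt` alternate
    counts out of `total`, given alternate-allele fraction `p`."""
    ref = total - alt
    ll = 0.0
    if alt:
        ll += alt * math.log(p)
    if ref:
        ll += ref * math.log(1.0 - p)
    return ll

def site_llr(a_ref, a_alt, b_ref, b_alt):
    """Likelihood ratio statistic at one site: separate allele fractions
    per sample versus a single shared (pooled) allele fraction."""
    n_a, n_b = a_ref + a_alt, b_ref + b_alt
    if n_a == 0 or n_b == 0:
        return 0.0  # no coverage in one sample: site is uninformative
    p_a, p_b = a_alt / n_a, b_alt / n_b
    p_pool = (a_alt + b_alt) / (n_a + n_b)
    separate = _loglik(a_alt, n_a, p_a) + _loglik(b_alt, n_b, p_b)
    pooled = _loglik(a_alt, n_a, p_pool) + _loglik(b_alt, n_b, p_pool)
    return 2.0 * (separate - pooled)

def similarity_score(sample_a, sample_b):
    """Mean per-site statistic over sites covered in both samples.
    Lower values mean the two samples look more alike (assumed convention)."""
    total, used = 0.0, 0
    for (a_ref, a_alt), (b_ref, b_alt) in zip(sample_a, sample_b):
        if a_ref + a_alt == 0 or b_ref + b_alt == 0:
            continue  # skip sites missing in either sample
        total += site_llr(a_ref, a_alt, b_ref, b_alt)
        used += 1
    return total / used if used else float("nan")
```

Averaging over only the sites covered in both samples is one simple way to keep the score comparable across very different coverage levels; any decision threshold on this score would have to be calibrated on real data.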

Conclusions

Because this tool processes raw data, is faster than alignment, and can be used on very low-coverage data, it can save substantial computational resources in standard quality control (QC) pipelines. It is robust enough to be used on different sequencing data types, which is important in studies that leverage the strengths of different sequencing technologies. In addition to its primary use case of sample swap detection, this method also provides information useful in QC, such as error rate and coverage bias, as well as population-level PCA ancestry analysis visualization.

GigaScience
Jul 1, 2024

Reviewer 2: Qian Zhou

In this paper, the authors have presented a tool, ntsm, which utilizes the k-mer distribution information directly from raw sequencing data for sample swap detection. The approach of bypassing the reference genome alignment step and saving computational resources is commendable. Utilizing k-mers for reference-free and de novo analysis of sequencing data is a valuable application. The authors have demonstrated the impressive performance of ntsm on low-coverage data through experimental results presented in the manuscript, showcasing its strengths in terms of sensitivity and accuracy. However, while ntsm eliminates the need for reference genome alignment, it still relies on a pre-defined set of variant sites and pre-built PCA rotation matrices. This raises doubts about the true reference-free nature of ntsm and concerns about its generalizability to other species.

Major comments:

1. The concept of reference-free: I believe that ntsm's approach is not truly reference-free. In order to use ntsm, one needs existing high-quality population SNP sites and k-mers from the human reference genome. Additionally, the population PCA results are used to assist in pairwise comparisons between samples. Both of these pieces of information can only be obtained when a reference genome is available. A truly reference-free tool would be applicable to species without a reference genome, such as SPLASH (Chaung et al., 2023, Cell). ntsm can be considered an alignment-free or k-mer-based tool.

2. The reduction of computational costs: ntsm differs from Somalier in its computational workflow. To compare computational cost or time, a holistic end-to-end comparison is necessary, rather than timing individual steps such as k-mer counting and pairwise sample comparison separately. Conducting an end-to-end comparison for an analysis task gives users a comprehensive understanding of the tool's time and cost consumption. Furthermore, when comparing software, it is important to allocate computational resources fairly. For example, ntsm utilizes 16 threads in the 'Sample comparison process' stage, while for the 'k-mer counting (ntsm) vs. alignment (somalier)' stage, tools like bwa and minimap2, which can utilize multiple threads, were run using a single thread.

3. Sensitivity and specificity: More experimental details are needed. In the section 'Sensitivity and Specificity of Sample Swaps,' were the results obtained using the 39 HPRC samples? Did they include their Hi-C data? For Fig 6, did the results come from all sequencing datasets of the 39 samples, including Illumina and ONT? Since the results were obtained using full coverage, would the threshold change at lower coverage? For Fig 7, which demonstrates ntsm's results, was PCA information used as an auxiliary? Does the use of PCA information impact sensitivity and specificity?

4. Regarding the PCA-based method: The 39 HPRC samples used in the study are actually part of the 3,202 samples from the 1000 Genomes Project. Therefore, it is important to clarify whether the PCA matrix used in the study already includes information from these 39 samples. From a rigorous experimental design perspective, a precomputed PCA matrix should not include information from the 39 samples. Otherwise, the effect of the PCA matrix on these 39 samples may be overestimated. This raises the question of whether the same results can be achieved on non-1000 Genomes Project samples.

5. The applicability of the tool: In order to expand the applicability of ntsm to a wider range of species, two aspects need to be addressed: (1) Provide detailed information on customizing the sites file. From the site files available in the ntsm code repository on GitHub, the process of selecting variant sites seems to be more complex than what is described in the manuscript, involving more than just SNP variants. (2) The sites and PCA files should be user-customizable inputs instead of being built-in. This limitation restricts the application of ntsm to other species.

Minor comments:

The manuscript appears to have been hastily written and requires further polish by the authors.

1. In Figure 6, A and B seem to be labeled incorrectly.
2. In Figure 9, the two subplots have different y-axes, one labeled "min" and the other labeled "s." Could you clarify what each subplot is illustrating?
3. When mentioning HPRC for the first time, it would be helpful to provide the full name and explanation of the acronym. However, the full explanation appears in the next paragraph.
4. "We then keep only purine to pyrimidine (A or T to G or C) variants, as final insurance against possible human error influencing this tool": there seems to be a mistake or confusion in this sentence. To describe purine to pyrimidine variants accurately, it should read "A/G <-> C/T" instead of "A/T <-> G/C". This may be an error in describing the nucleotide exchange, or it could be a typographical mistake.
5. There is a typo in the formula for estimating sequencing error rate. (nm)·log(1-… …
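
The distinction raised in minor comment 4 can be checked mechanically. The sketch below is illustrative only and is not taken from ntsm; the function names are hypothetical, and it assumes biallelic SNPs over the alphabet A, C, G, T with distinct alleles. It compares three ways of classifying a SNP: purine versus pyrimidine, A/T versus G/C, and strand-ambiguous pairs.

```python
from itertools import permutations

PURINES = {"A", "G"}  # pyrimidines are C and T
WEAK = {"A", "T"}     # A/T bases; the G/C bases are the complement set

def is_purine_pyrimidine(ref, alt):
    """True when one allele is a purine and the other a pyrimidine
    (a transversion)."""
    return (ref in PURINES) != (alt in PURINES)

def is_weak_strong(ref, alt):
    """True when one allele is A or T and the other is G or C."""
    return (ref in WEAK) != (alt in WEAK)

def is_strand_ambiguous(ref, alt):
    """True for A<->T and C<->G, where the reverse complement of the
    site carries the same pair of alleles."""
    return {ref, alt} in ({"A", "T"}, {"C", "G"})

if __name__ == "__main__":
    # Print the classification of all 12 ordered SNP substitutions.
    for ref, alt in permutations("ACGT", 2):
        print(ref, alt,
              is_purine_pyrimidine(ref, alt),
              is_weak_strong(ref, alt),
              is_strand_ambiguous(ref, alt))
```

Under these definitions, A<->G is an "A or T to G or C" change but not a purine to pyrimidine change, while A<->T is the reverse. The "A or T to G or C" set coincides with the SNPs that are not strand-ambiguous, which may be what the filter is meant to select, although the text quoted by the reviewer does not say so.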

Reviewer 1: Jianxin Wang

In this manuscript, the authors present a fast intra-species sample swap detection tool, named ntsm. By counting the relevant variant k-mers from samples, it estimates the probability of each allele at sites and then uses the likelihood ratio test to detect sample swaps. Compared with the alignment-based method, Somalier, ntsm performs better on low-coverage data (≤5X) and is more efficient in terms of memory and computing time. The authors use a PCA-based spatial index heuristic to reduce the number of sample comparisons. In my opinion, however, compared with the time spent on counting k-mers, the time saved by the PCA-based method is trivial. In addition, ntsm also provides other features such as error rate estimation. The tool requires population SNP information, which limits its applications in practice to some extent. Overall, ntsm is a fast and practical tool for calculating intra-species sample similarity and detecting sample swaps. The writing and experiments in this paper are generally well done. There are some major and minor issues that I suggest the authors consider addressing.

Major issues:

The paper mentions that due to high error rates, nanopore data is difficult to analyze. Can the authors analyze the performance of ntsm on data with different error rates? In general, alignment-based methods may perform better on high error rate data. This is very useful information for users choosing a tool.

The authors use the PCA-based spatial index heuristic to reduce the number of pairwise comparisons. However, the relation between PCA distance and similarity score is not clear here. How can one ensure that samples with similarity scores less than the threshold are within the search radius?

The paper involves two metrics, namely similarity score and relatedness, to detect sample swaps. Can the authors analyze the relation between them to help readers understand the advantages and disadvantages of the two methods?

Minor issues:

In the "Conclusions" section, the second "useful" in the sentence "this method provides other useful information useful in QC" is redundant.

"R=1, p<2.2e-16" in Figure 3 is not explained.

In the "Sequencing error rate estimation" section, the variable n is not explained.

In Figure 9, the case of the first letter of the two y-axis labels (time) is inconsistent.
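
The search-radius question can be made concrete with a small example. The following is a minimal sketch, assuming each sample has already been projected to a short tuple of PCA coordinates; it uses a uniform hash grid as a stand-in spatial index, which is not necessarily the index ntsm uses, and the function name and radius value are hypothetical. Only pairs closer than the radius are returned as candidates for the full comparison, so any true match lying farther apart in PCA space would be missed.

```python
import math
from collections import defaultdict
from itertools import product

def candidate_pairs(coords, radius):
    """Return the set of (name, other) pairs, name < other, whose PCA
    coordinates lie within `radius` of each other (Euclidean distance).

    coords: dict mapping sample name -> tuple of PCA coordinates.
    """
    def cell(point):
        # Grid cell of side `radius` that contains the point.
        return tuple(math.floor(x / radius) for x in point)

    grid = defaultdict(list)
    for name, point in coords.items():
        grid[cell(point)].append(name)

    pairs = set()
    for name, point in coords.items():
        home = cell(point)
        # Any point within `radius` must sit in the home cell or in one
        # of the cells directly adjacent to it.
        for offset in product((-1, 0, 1), repeat=len(home)):
            neighbour = tuple(i + d for i, d in zip(home, offset))
            for other in grid.get(neighbour, ()):
                if other > name and math.dist(point, coords[other]) <= radius:
                    pairs.add((name, other))
    return pairs

if __name__ == "__main__":
    samples = {"s1": (0.0, 0.0), "s2": (0.1, 0.0), "s3": (5.0, 5.0)}
    print(candidate_pairs(samples, radius=0.5))
```

With n samples, an all-to-all comparison needs n(n-1)/2 full tests, whereas the prescreen only passes on pairs that fall in neighbouring grid cells. The trade-off the reviewer points to is visible here: the radius has to be large enough that noise in the PCA projection (for example from low coverage or high error rate) cannot push two datasets from the same individual outside it.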

Version published to 10.1093/gigascience/giae024
Jan 1, 2024
Version published to 10.1101/2023.11.01.565041 on bioRxiv
Nov 3, 2023

