Are reads required? High-precision variant calling from bacterial genome assemblies
Abstract
Accurate nucleotide variant calling is essential in microbial genomics, particularly for outbreak tracking and phylogenetics. This study evaluates variant calls derived from genome assemblies compared to traditional read-based variant-calling methods, using seven closely related Staphylococcus aureus isolates sequenced on Illumina and Oxford Nanopore Technologies platforms. By benchmarking multiple assembly and variant-calling pipelines against a ground truth dataset, we found that read-based methods consistently achieved high accuracy. Assembly-based approaches performed well in some cases but were highly dependent on assembly quality, as errors in the assembly led to false-positive variant calls. These findings underscore the need for improved assembly techniques before the potential benefits of assembly-based variant calling – such as reduced computational requirements and simpler data management – can be realised.
Article activity feed
-
Thank you very much for submitting your revised manuscript to Access Microbiology and for introducing the corrections proposed by the reviewers. I am pleased to let you know that the current version is now accepted for publication. Congratulations to all authors!
-
Comments to Author
The authors of this manuscript address a timely and fundamental question in microbial genomics: can genome assemblies replace sequencing reads for accurate variant calling? Using Illumina short reads and Oxford Nanopore long reads from whole-genome sequencing of a methicillin-resistant Staphylococcus aureus strain NRS384 and its six closely related mutants, the authors benchmark different variant-calling approaches against ground-truth variants determined by comparing finished-grade reference genomes of these seven isolates. Their results show that read-based methods remain more accurate for variant calling than assembly-based methods, but the latter are feasible with error-free assemblies. The authors acknowledge that their conclusions are drawn from a specific technical configuration and a small collection of closely related S. aureus isolates. Overall, the study is well-designed and reproducible, with accessible computer code and sequence data. This study has considered a wide range of bioinformatics software and evaluated comprehensive workflows for variant calling in bacterial genomics. I would like to make the following suggestions to improve the scientific rigour and clarity of communication of this work.
= Major suggestions =
Table 1: Would it be possible to present the summary statistics under two categories, long-read-only/first assemblies (Canu, Flye, Raven, Hybracter) and short-read-only/first assemblies (Shovill, SKESA, Unicycler), rather than providing summaries across both categories? Figure 3 and Table S2 show a marked increase in the number of false-positive SNPs when the latter assembly strategy was used, which substantially reduces the precision. I therefore do not think any summary statistic in the current table can reflect such data heterogeneity or provide adequate evidence to support the choice of the best assembly-based variant-calling method, even if the conclusion could be the same.
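To illustrate the Table 1 point about pooled versus stratified summaries, here is a minimal Python sketch. The counts are hypothetical and are not taken from Table S2; only the grouping of assemblers follows the two categories named above, and the function and variable names are chosen for illustration.

```python
from collections import defaultdict

# Assemblers in the long-read-only/first category named in the comment above.
LONG_READ_FIRST = {"Canu", "Flye", "Raven", "Hybracter"}

# Hypothetical (assembler, true positives, false positives) rows.
# These numbers are invented for illustration and are NOT from Table S2.
rows = [
    ("Flye", 10, 0),
    ("Canu", 10, 1),
    ("Shovill", 10, 20),
    ("SKESA", 10, 30),
]


def precision(tp, fp):
    """Precision = TP / (TP + FP); None when no variants were called."""
    calls = tp + fp
    return tp / calls if calls else None


def summarise(rows):
    """Return precision per assembly category and for all rows pooled."""
    totals = defaultdict(lambda: [0, 0])  # key -> [TP, FP]
    for assembler, tp, fp in rows:
        group = ("long-read-first" if assembler in LONG_READ_FIRST
                 else "short-read-first")
        for key in (group, "pooled"):
            totals[key][0] += tp
            totals[key][1] += fp
    return {key: precision(tp, fp) for key, (tp, fp) in totals.items()}


summary = summarise(rows)
for key, value in sorted(summary.items()):
    print(f"{key}: precision = {value:.3f}")
```

With these invented counts, the pooled precision falls between the two category values and describes neither category well, which is the kind of heterogeneity a stratified table would expose.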
Lines 150-179: I would recommend explaining the sources of false-positive and false-negative SNPs and indels under different assembly-based variant-calling methods, or at least explaining how the excessive false-positive SNPs arose when the short-read-only/first assemblies were used. This would support the discussion of possible optimisation strategies in Lines 220-221, which would otherwise appear as conjecture.
= Minor suggestions =
Line 81: I recommend replacing "described previously" with a brief summary of methods for DNA extraction and phenotypic analysis to facilitate readers' understanding, while still citing the original paper for full details.
Lines 70-81 and Figure S1: Detailed descriptions of the isolates' phenotypes and the functions of mutations seem irrelevant to this manuscript's topic and may unnecessarily challenge readers, although these details are scientifically interesting. The sentence from Line 74 to 78 could be improved by shortening it and avoiding nested parentheses.
Line 86: Could you elaborate on what the "150 bp PE kit" was, if this information is available?
Lines 70-148: Subheadings could be added to the Methods section to improve its structure and help readers navigate this section with clear rationales.
Line 116: Perhaps you could add "wild-type NRS384" in this line to recapitulate what the "reference genome" refers to, given the diverse terms used in the Methods section.
Line 130: Would a replacement of "reference assemblies" with "ground-truth assemblies" avoid possible confusion between these assemblies and the reference genome (wild-type NRS384) used for variant calling? The same suggestion applies to "reference" entries from Row 44 to 64 in column "assembly_method" of Table S2 for clarity, although I understand that you mean reference sequences of isolates according to your comment in this column's header.
Lines 135-143: Although the answer may be obvious, could you confirm whether the mutations converting the wild-type isolate NRS384 into WalKT389A were among the true set of variants? I could not find this crucial information in either Figure 1 or the Results section.
Lines 190-191: What is the length distribution of these duplicated genomic regions across the ground-truth assemblies?
Figure 1 (Lines 96-100): Perhaps this figure fits the Results section better than the Methods section.
Figure 1B: Could you clarify that the tree was generated with option "--polytomy" (to enable the WalKT389A isolate to become a polytomy) and whether the tree was rooted on the wild-type isolate? The iqtree command in the "Build tree" section of the "Supplementary methods" (https://github.com/rrwick/Are-reads-required/blob/main/methods.md) may not be immediately apparent to readers.
Table S1: What does the column name "CPG-IH ID" mean, and what is the relevance of CPG-IH accessions to the wider research community?
Table S1: I understand that for each isolate, you used the Lander-Waterman equation, dividing the total bases of quality-processed reads by the genome length, to estimate the read depth of each sequencing method. Since the read depth is a crucial factor in this study, it would help readers understand your methods if you could clarify in this table or the manuscript how the read depths were calculated.
Table S2: Readers would find it easier to link the 756 rows of raw results in Table S2 to the five groups of VCF files described in Lines 129-134 if a column could be added to this table to represent the VCF group.
Table S2: Entries in column "variant_call_method" should not be blank for the read-based method (Rows 2-43), because the variant callers Freebayes and Clair3 were used for Illumina and ONT reads, respectively, according to Lines 103-104 in the manuscript.
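The read-depth estimate described in the Table S1 comment above (total bases of quality-processed reads divided by genome length) can be sketched in Python as follows. This is only an illustration of that arithmetic, not the authors' code; the function name and the example values are assumptions.

```python
def mean_read_depth(read_lengths, genome_length):
    """Estimate mean depth as total sequenced bases divided by genome length."""
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    return sum(read_lengths) / genome_length


# Illustrative values only (not from Table S1):
# one million 150 bp reads against a 3 Mbp genome.
illustrative_reads = [150] * 1_000_000
depth = mean_read_depth(illustrative_reads, 3_000_000)
print(f"Estimated mean depth: {depth:.1f}x")  # prints "Estimated mean depth: 50.0x"
```

Stating the calculation this explicitly in the table legend or the Methods would remove any ambiguity about whether depth was computed before or after read quality control.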
Please rate the manuscript for methodological rigour
Good
Please rate the quality of the presentation and structure of the manuscript
Good
To what extent are the conclusions supported by the data?
Strongly support
Do you have any concerns of possible image manipulation, plagiarism or any other unethical practices?
No
Is there a potential financial or other conflict of interest between yourself and the author(s)?
No
If this manuscript involves human and/or animal work, have the subjects been treated in an ethical manner and the authors complied with the appropriate guidelines?
Yes
-
Thank you very much for submitting your manuscript to Access Microbiology. It has now been reviewed by two experts in the field, whose comments are attached at the bottom of this email. They both agree that this is a valuable piece of work that tackles a timely question in the microbial genomics field. However, they have provided some suggestions to further strengthen the manuscript, which you will need to address in a revised manuscript. Please provide a revised version of the manuscript (including a tracked changes document) along with a point-by-point response to the reviewers (including the lines where each comment was addressed) within one month.
-
Comments to Author
A simple question (is it possible to use assemblies instead of raw reads for variant calling, with less storage and less computation?) triggers a rigorous scientific method to arrive at the current state of the art. I really appreciated the links in the text giving direct access to all the data in the paper. The details of the materials and methods seem to me sufficient for good reproducibility of this work. The style is simple and direct, making the paper easy to read and the conclusions easy to understand. The literature and discussion are clear and highlight the issues at stake.
Please rate the manuscript for methodological rigour
Very good
Please rate the quality of the presentation and structure of the manuscript
Very good
To what extent are the conclusions supported by the data?
Strongly support
Do you have any concerns of possible image manipulation, plagiarism or any other unethical practices?
No
Is there a potential financial or other conflict of interest between yourself and the author(s)?
No
If this manuscript involves human and/or animal work, have the subjects been treated in an ethical manner and the authors complied with the appropriate guidelines?
Yes
-