DENTIST—using long reads for closing assembly gaps at high accuracy

Arne Ludwig
Martin Pippel
Gene Myers
Michael Hiller

Abstract

Background

Long sequencing reads allow increasing contiguity and completeness of fragmented, short-read–based genome assemblies by closing assembly gaps, ideally at high accuracy. While several gap-closing methods have been developed, these methods often close an assembly gap with sequence that does not accurately represent the true sequence.

Findings

Here, we present DENTIST, a sensitive, highly accurate, and automated pipeline method to close gaps in short-read assemblies with long error-prone reads. DENTIST comprehensively determines repetitive assembly regions to identify reliable and unambiguous alignments of long reads to the correct loci, integrates a consensus sequence computation step to obtain a high base accuracy for the inserted sequence, and validates the accuracy of closed gaps. Unlike previous benchmarks, we generated test assemblies that have gaps at the exact positions where real short-read assemblies have gaps. Generating such realistic benchmarks for Drosophila (134 Mb genome), Arabidopsis (119 Mb), hummingbird (1 Gb), and human (3 Gb) and using simulated or real PacBio continuous long reads, we show that DENTIST consistently achieves a substantially higher accuracy compared to previous methods, while having a similar sensitivity.

Conclusion

DENTIST provides an accurate approach to improve the contiguity and completeness of fragmented assemblies with long reads. DENTIST's source code including a Snakemake workflow, conda package, and Docker container is available at https://github.com/a-ludi/dentist. All test assemblies as a resource for future benchmarking are at https://bds.mpi-cbg.de/hillerlab/DENTIST/.

GigaScience
Mar 9, 2022

This paper has been published by GigaScience (https://doi.org/10.1093/gigascience/giab100) and the peer reviews have been shared under a CC-BY 4.0 license. These are as follows.

**Reviewer 1. Edward Rice**

In this manuscript, the authors present a sophisticated method for closing gaps in assemblies, built around the knowledge that gaps usually occur in repetitive regions. They test their software against similar software with more realistic scenarios than previous studies, through the use of gaps from real assemblies of genomes that have other assemblies with fewer gaps, rather than randomly generated gaps. These tests convincingly demonstrate that this software is more sensitive and accurate than existing gap closers.
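
The benchmark idea described above, placing gaps where a real assembly has them rather than at random positions, can be illustrated with a minimal sketch. This is not the authors' procedure: the function name `copy_gaps`, the toy sequence, and the gap coordinates are all hypothetical, and the sketch assumes the gap positions have already been mapped onto the more complete reference as 0-based, end-exclusive intervals.

```python
def copy_gaps(reference: str, gaps: list[tuple[int, int]]) -> str:
    """Return a copy of `reference` with each (start, end) interval masked by N.

    Hypothetical helper: intervals are 0-based and end-exclusive, and are
    assumed to come from a real, more fragmented assembly of the same genome.
    """
    seq = list(reference)
    for start, end in gaps:
        if not 0 <= start < end <= len(seq):
            raise ValueError(f"gap ({start}, {end}) is outside the sequence")
        # Replace the gap interval with the same number of N characters.
        seq[start:end] = "N" * (end - start)
    return "".join(seq)


# Toy example with an invented sequence and a single invented gap.
test_assembly = copy_gaps("ACGTACGTAC", [(2, 5)])
```

Because the masked sequence is known, any sequence a gap closer inserts at these positions can be compared directly against the original.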

Given this increase in performance over existing software and the novelty of the methods, I recommend this manuscript for publication with some changes. I do have some concerns about the usability and maintainability of the software it describes, noted below, but most of the alternate options have similar issues, and the methodological advancements present in the manuscript merit publication.

The introduction seems to imply that the primary use of this software is for closing gaps in short-read assemblies where high-coverage long reads are not available due to cost. Although I do not have a statistic to back this up, it is my sense from recent genome assembly papers that long-read de novo assembly is much more the norm these days than short-read assembly. In my personal experience I have found that gap closing can sometimes greatly improve long-read assemblies as well, especially CLR assemblies of highly repetitive genomes. I recommend rewriting the introduction somewhat to make it clear that usage of this software is not limited to short-read assemblies, as these are becoming rarer and rarer.

I have some concerns about the maintainability of this code base, considering its size (>40k lines), language (D, which is not a common language in bioinformatics), and sparsity of comments in the code. Further, the use of non-standard dependencies and file formats may make it difficult to adapt the software to future advances in sequencing technology; for example, this package uses daligner to perform alignment, and so far as I can tell, daligner does not produce output in SAM format, so it may be difficult to switch to using another aligner in the future as the types of long reads available change. The fact that many of the dependencies are not maintained on bioconda is also concerning. The presence of integration tests is helpful. I apologize that this is probably not a particularly helpful comment as it's far too late to change any of these things, but still wanted to point them out.

I also have concerns about usability. The availability of a Docker file and Snakemake workflow for running this software and the thorough and mostly comprehensible documentation alleviate these concerns to some degree, but it still takes a significant amount of work to configure it for a specific cluster. The example run did not work out of the box without fixing some errors (see minor edits). To test on my own assembly, I had to edit one JSON file to choose the parameters for DENTIST itself, which required reading about the two ways to specify two required coverage parameters; one YAML file to configure the workflow options; and one YAML file to make Snakemake work with my cluster. In addition, not all clusters have Singularity, so the lack of a conda package may be a problem for some potential users. The Singularity image and Snakemake workflow make its usability far better than PBJelly, which required actually editing the source code to make it work on my cluster with conda-installable versions of its dependencies, but it is still much worse than TGS-GapCloser, which only takes a single conda command to install with all dependencies and a single command to run, and no editing of configuration files.

Minor comments:

Abstract:

"Here, we developed" -> "Here, we present"

"Highly-accurate": no hyphen

"Short read assemblies" -> "short-read assemblies" (this occurs in several other places too throughout manuscript)

Replace "right loci" with "correct loci"

Introduction:

Page 3: "High contiguity, completeness, and accuracy... is fundamental": change "is" to "are"

Page 3: avoid parentheses inside other parentheses

Page 3: I'm not sure I've ever heard of GenomicConsensus being used for gap closing, and cannot find any reference to it being used for this purpose with a quick scan of documentation. It must be capable of doing this, though, as you tested it alongside other gap closers. Could you explain this in the manuscript?

Results:

Page 4: replace "right loci" with "correct loci"

Page 4: say a little more about what makes DENTIST's "state-of-the-art" consensus module better than or different from existing consensus callers

Page 5: "real life" to "real-life"

Page 5: "high quality" to "high-quality"

Discussion:

Page 9: "long read data" -> "long-read data"

Methods:

Page 11: "genomic regions, where the number": remove comma

Page 12: "a common conflict are" to "a common conflict is"

Page 12: "less than three reads" to "fewer than three reads"

Page 14: "'copied' gaps from short read assembly" to "copied gaps from the short-read assembly"

Page 14: remove quotation marks around "disassembled"

Software:

The "small example" does not work out of the box as "dentist_v1.0.2.sif" is hard-coded into snakemake.yml but the image distributed with the example is v2.0.0.

The "read-coverage" and "ploidy" options are listed as required (unless you're using "min-coverage-reads" and "max-coverage-reads"), but they are not among the "important options" listed in the README under the "How to choose DENTIST parameters" subheading.

In the more extensive list of command-line options, the description of the "read-coverage" option is "this is used to provide good default values for -max-coverage-reads or -min-coverage-reads; both options are mutually exclusive." This tells the user how it is used by the program but gives the reader no explanation of how it should be chosen, which is important as it is one of the required options.
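
The two ways of supplying the coverage parameters, as this review reads the documentation, can be sketched as a small validation step. This is only an illustration of the reviewer's description and not DENTIST's actual option handling: the function `coverage_spec`, its return labels, the example values, and the choice to reject a configuration that supplies both pairs are assumptions. Only the four option names come from the text above.

```python
def coverage_spec(config: dict) -> str:
    """Report which of the two documented ways the coverage was specified.

    Hypothetical helper. Returns "estimate" when read-coverage and ploidy are
    given, "bounds" when min-coverage-reads and max-coverage-reads are given.
    """
    has_estimate = {"read-coverage", "ploidy"}.issubset(config)
    has_bounds = {"min-coverage-reads", "max-coverage-reads"}.issubset(config)
    if has_estimate and has_bounds:
        # Assumption of this sketch: supplying both pairs is treated as an error.
        raise ValueError("specify either read-coverage/ploidy or explicit bounds")
    if has_estimate:
        return "estimate"
    if has_bounds:
        return "bounds"
    raise ValueError("coverage parameters are missing or incomplete")


# Invented example values, for illustration only.
mode = coverage_spec({"read-coverage": 30, "ploidy": 2})
```

A check of this kind shows why the documentation needs to say how "read-coverage" should be chosen: without it, a user cannot satisfy either branch.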

The use of comments in dentist.json by putting double slashes in front of attribute strings is confusing and also not supported by the JSON specification. dentist.json would be better in YAML format because (a) YAML supports comments; (b) YAML is easier for humans to read; and (c) YAML is used for the other two configuration files necessary to run the pipeline, so for consistency it is best to have them all in the same format.
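
The point about comments can be checked with a standard JSON parser. The sketch below uses Python's built-in `json` module and invented configuration text. The option names are taken from this review, the values are hypothetical, and nothing here describes how DENTIST itself reads dentist.json.

```python
import json

# Hypothetical config text containing a line comment; values are invented.
with_line_comment = """{
  // expected coverage of the long reads
  "read-coverage": 30,
  "ploidy": 2
}"""

# Hypothetical config text where an option is "commented out" by prefixing its key.
with_prefixed_key = '{"// read-coverage": 30, "ploidy": 2}'


def parses_as_json(text: str) -> bool:
    """Return True if a strict JSON parser accepts the text."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


# A strict parser rejects the line comment but accepts the prefixed key,
# which is then an ordinary key that the application has to filter out itself.
parsed = json.loads(with_prefixed_key)
active = {key: value for key, value in parsed.items() if not key.startswith("//")}
```

In other words, a key that starts with double slashes is still data as far as JSON is concerned, so treating it as a comment is a convention of the application rather than of the format.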

Re-review: The authors have thoroughly and satisfactorily addressed all of my comments and the comments of the other reviewers. After testing the latest version, I can confidently say ease of use is much improved as it took me less than five minutes to go from zero to successfully starting a run of the example. I am therefore happy to recommend this manuscript for publication in its current format.

**Reviewer 2. Leena Salmela**

Overview: The paper presents a new tool called DENTIST for closing gaps in short-read assemblies using PacBio CLR data. Although new assemblies are nowadays most often done with PacBio HiFi data, resulting in contiguous and accurate assemblies, closing the gaps of an existing short-read assembly with long-read data is a cost-effective and therefore attractive alternative for species for which short-read assemblies are already available. The new tool is shown to be more accurate than previous tools and of comparable sensitivity.

Suggestions for revision:

1. The authors should clearly indicate in the Introduction that their tool is tested on PacBio CLR reads. It would also be good to specify in the abstract that the reads were CLR reads and not HiFi reads.
2. In the Discussion, the authors recommend "polishing" the final gap-closed assembly with Illumina reads. It would be interesting to see how much this improves the accuracy of gap closing. I would assume that the improvement on the gap sequences would be smaller than on other regions of the assembly because the gap sequences typically cover repetitive regions.
3. Last paragraph of section "Closing the gaps", page 14: DENTIST has three modes. Here it is indicated that the third mode (only use scaffolding information for conflict resolution and freely scaffold the contigs using long reads) would be the best mode for contig-only assemblies. It seems to me that the second mode would also be appropriate for this, as it also closes gaps between scaffolds (or contigs in case of lack of scaffold information). Is this so?

**Reviewer 3. Ian Korf**

The paper by Ludwig et al. demonstrates that DENTIST offers a substantial improvement in closing genomic assembly gaps. The paper is well written with a clear and concise style. I liked the way they approached the experiments with a combination of simulated and real data for both the assemblies and reads. Specifically, I applaud how they generated gaps where they actually happen. The figures are generally effective. The only exception to this is Figure 4 with the black background and inconsistent ordering of competing software. In addition to winning the bake-off against other software, they did a very useful analysis of read depth (Figure 6) and resources used (Table 2). These help future users plan their projects. From a code perspective, I like that they have put their code on GitHub. I don't think they need to have the supplemental file of command-line parameters, as anyone who wants to use the software is going to go to the GitHub repository anyway, which has a much more comprehensive explanation of usage.

Version published to 10.1093/gigascience/giab100
Jan 1, 2022
Version published to 10.1101/2021.02.26.432990v2 on bioRxiv
Dec 16, 2021
Version published to 10.1101/2021.02.26.432990v1 on bioRxiv
Feb 27, 2021