xRead: a coverage-guided approach for scalable construction of read overlapping graph

This article has been reviewed by the following groups.


Abstract

The development of long-read sequencing promises high-quality, comprehensive de novo assembly for species around the world. However, it remains challenging for genome assemblers to handle thousands of genomes, genome sizes of tens of gigabases, and terabase-scale datasets simultaneously and efficiently, which is a bottleneck for large de novo sequencing studies. A major cause is read overlapping graph construction: state-of-the-art tools often require terabytes of RAM and tens of days for large genomes. Such performance and scalability are ill-suited to the numerous samples to be sequenced. Herein, we propose xRead, an iterative overlapping graph approach that achieves high performance, scalability, and yield simultaneously. Guided by its novel read coverage-based model, xRead uses a heuristic alignment skeleton approach to implement incremental graph construction with highly controllable RAM usage and faster speed. For example, it can process the 1.28 Tb A. mexicanum dataset with less than 64 GB of RAM and a markedly lower time cost. Moreover, benchmarks on datasets from genomes of various sizes suggest that it achieves higher accuracy in overlap detection without loss of sensitivity, which also guarantees the quality of the produced graphs. Overall, xRead is well-suited to handling many datasets from large genomes, especially with limited computational resources, and may play an important role in many de novo sequencing studies.

Article activity feed

  1. This work has been peer reviewed in GigaScience (https://doi.org/10.1093/gigascience/giaf007), which carries out open, named peer review. These reviews are published under a CC-BY 4.0 license and were as follows:

    Reviewer #2: Anuradha Wickramarachchi

    Overall comments.

    Authors of the manuscript have developed an iterative overlap graph construction algorithm to support genome assembly. This is both an interesting and a demanding area of research due to very recent advancements in sequencing technologies.

    Although the text in the manuscript is interesting, the grammar must be rechecked and revised. At some points it is difficult to keep track of the content, and the reader must follow references to the supplementary material to make sense of it.

    Specific comments

    Page 1 Line 13: I believe the authors are talking about assembly sizes and not genome sizes. The sentences here could be shortened to make them easier to understand.

    Page 2 Line 19: The theoretical time complexity O(m²n²) is a bit of an overstatement, given the heuristics employed by most assemblers. For example, mash distances, minimizers, and k-mer bins exist precisely to prevent this explosion of complexity. Either acknowledge such methods or provide a range for the time complexity. It would also be interesting to know the time complexities of the methods listed in the sentence starting at Line 15.
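
    The reviewer's point can be illustrated with a minimal sketch (illustrative only; it is not the implementation used by any of the assemblers discussed, and the window/k-mer parameters are arbitrary): bucketing reads by shared minimizers means only reads that land in a common bucket become candidate overlap pairs, instead of comparing all n·(n−1)/2 read pairs.

```python
from collections import defaultdict

def minimizers(seq, k=5, w=4):
    """Return the (w,k)-minimizers of seq: in every window of w
    consecutive k-mers, keep the lexicographically smallest one."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    return {min(kmers[i:i + w]) for i in range(len(kmers) - w + 1)}

def candidate_pairs(reads, k=5, w=4):
    """Index reads by minimizer; only reads sharing a minimizer
    become candidate overlap pairs."""
    buckets = defaultdict(set)
    for rid, seq in enumerate(reads):
        for m in minimizers(seq, k, w):
            buckets[m].add(rid)
    pairs = set()
    for ids in buckets.values():
        ids = sorted(ids)
        for i in range(len(ids)):
            for j in range(i + 1, len(ids)):
                pairs.add((ids[i], ids[j]))
    return pairs
```

    With such indexing, the expensive alignment step runs only on candidate pairs, which is why the worst-case O(m²n²) bound rarely reflects practical running time.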

    Page 5 Line 11: Was this performed with overlapping windows of 1 Gb? Otherwise, the simulations may not contain reads spanning across such regions.

    Page 5 Line 14: It seems you are simulating 9 + 4 + 4 datasets. This is unclear; please break it into bullet points or separate paragraphs and explain it clearly. Include the simulator information in the table itself, perhaps by making the table landscape (in the supplementary).

    Fig 2: I believe the authors should expand their analysis to more recent and popular assemblers. For example, wtdbg2 is designed for noisy reads and not specifically for the more accurate R10/HiFi reads. So please include HiFi-asm and Flye where appropriate. Flye supports ONT out of the box and, in my experience, does produce good assemblies.

    Although you are evaluating read overlaps, it is hard to ignore assemblers themselves just because they do not produce intermediate overlap graphs.

    Page 5-9: In the benchmarks section, please include how true positives and false positives were labelled. Was this derived from the simulation data?
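
    One common convention (an assumption here, not necessarily what the authors did) is to label a reported overlap a true positive when the simulated source intervals of the two reads intersect by at least some minimum length. A minimal sketch of that labelling rule:

```python
def label_overlap(truth, r1, r2, min_ovl=500):
    """Label a reported overlap between reads r1 and r2 using simulated
    ground truth. `truth` maps read id -> (chrom, start, end), the
    genomic interval the read was simulated from. The call is a true
    positive if both reads come from the same chromosome and their
    source intervals intersect by at least `min_ovl` bases."""
    c1, s1, e1 = truth[r1]
    c2, s2, e2 = truth[r2]
    if c1 != c2:
        return "FP"
    shared = min(e1, e2) - max(s1, s2)
    return "TP" if shared >= min_ovl else "FP"
```

    Stating the rule (and the `min_ovl` threshold) explicitly in the benchmarks section would make the precision/sensitivity numbers reproducible.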

    Page 11: Use of xRead has been evaluated on genome assemblies. This is very important, and it is a bit unfortunate that existing assemblers are not very flexible in terms of plugging in new intermediate steps. It might be worth exploring the creation of a new assembler using the wtpoa2 CLI command of wtdbg2.

    Page 16: What will happen if you only capture reads from a single chromosome due to their longer length? I believe the objective is to gather the longest reads while covering as much of the whole genome as possible. Please comment on this.
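
    The concern can be made concrete with a toy sketch (purely illustrative; xRead's actual coverage model is not shown here, and the per-chromosome read cap is a crude stand-in for per-base coverage accounting): selecting reads purely by length can leave a short chromosome with no selected reads, whereas a coverage-aware greedy selection retains them.

```python
def select_by_length(reads, n):
    """Naive: take the n longest reads, ignoring where they map."""
    return sorted(reads, key=lambda r: r["end"] - r["start"], reverse=True)[:n]

def select_coverage_aware(reads, target_cov):
    """Greedy, coverage-aware: walk reads longest-first but skip a read
    once its chromosome already has `target_cov` selected reads."""
    cov = {}
    chosen = []
    for r in sorted(reads, key=lambda r: r["end"] - r["start"], reverse=True):
        if cov.get(r["chrom"], 0) < target_cov:
            chosen.append(r)
            cov[r["chrom"]] = cov.get(r["chrom"], 0) + 1
    return chosen

# Toy data: three long reads on chr1, one short read on chr2.
reads = [
    {"chrom": "chr1", "start": 0, "end": 9000},
    {"chrom": "chr1", "start": 0, "end": 8000},
    {"chrom": "chr1", "start": 0, "end": 7000},
    {"chrom": "chr2", "start": 0, "end": 1000},
]
```

    Here `select_by_length(reads, 3)` returns only chr1 reads and chr2 goes uncovered, while `select_coverage_aware(reads, 2)` caps chr1 at two reads and keeps the chr2 read.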

    Page 19: In the GitHub README the download URL was wrong. Please correct it to the latest release.

    Correct: https://github.com/tcKong47/xRead/releases/download/xRead-v1.0.0.1/xRead-v1.0.0.tar.gz

    Existing: https://github.com/tcKong47/xRead/releases/download/v1.0.0/xRead-v1.0.0.tar.gz

    The make command failed with: make: *** No rule to make target 'main.h', needed by 'main.o'. Stop.

    It seems the release does not contain the source code, but rather a compiled version. Please update the GitHub instructions to describe how to compile the code properly from a git clone.

  2. This work has been peer reviewed in GigaScience (https://doi.org/10.1093/gigascience/giaf007), which carries out open, named peer review. These reviews are published under a CC-BY 4.0 license and were as follows:

    Reviewer #1: Antoine Limasset

    The manuscript describes xRead, a novel method that enables resource-efficient overlap graph computation, based on new strategies to compute it quickly and with controlled memory usage. The authors introduce several quality metrics to assess the quality of the overlap graph and integrate their tool into NextDenovo, improving its resource usage.

    The manuscript is overall clear, although the section order can make it hard to read, as concepts are defined backward. Some typos and minor phrasing issues should be corrected.

    Remarks:

    The manuscript spends a lot of time evaluating the quality of the overlap graph, which is a very commendable and often overlooked approach. I thank the authors for this contribution. However, I have issues with the definition of a ground-truth overlap. Even if two reads do not come from successive parts of the genome, if they share, say, a very large perfect overlap, they should indeed overlap in the graph. Considering the actual biological overlap to necessarily be the best one found in the reads is a greedy strategy that could harm the final assembly. Because of this definition, I am not fully convinced by xRead's performance, which seems to rely on an overall very greedy strategy.
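
    A toy example of this point (illustrative only; it assumes exact suffix-prefix overlaps, whereas real overlappers tolerate errors): two reads sampled from distinct copies of a repeat share a long perfect overlap even though their source intervals on the genome do not intersect, so an interval-based ground truth would call a correct sequence-level overlap a false positive.

```python
def longest_suffix_prefix(a, b):
    """Length of the longest suffix of `a` equal to a prefix of `b`
    (exact match; real tools allow mismatches and indels)."""
    best = 0
    for L in range(1, min(len(a), len(b)) + 1):
        if a[-L:] == b[:L]:
            best = L
    return best

# Two copies of a 30 bp repeat at distant genomic loci (toy data):
repeat = "ACGTTGCAACGGATCCTAGGCATGCATGCA"
read1 = "TTTT" + repeat   # read ending in the first repeat copy
read2 = repeat + "GGGG"   # read starting in the second repeat copy

# Interval-based truth says "no overlap" (distant loci), yet the
# reads share a 30 bp perfect overlap that an overlapper will report.
assert longest_suffix_prefix(read1, read2) == 30
```

    Whether such repeat-induced overlaps should count as true positives is exactly the definitional choice the benchmark needs to state and defend.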

    A key selling point of the abstract is the ability of xRead to work with controlled memory usage at the expense of time and external memory usage. Showing some results on this feature would be very interesting, for example a plot of time performance as a function of the memory budget. The amount of external memory used should also be discussed.

    As far as I understand, the end goal of xRead is to perform efficient de novo assembly. The assembly results should therefore be the primary results of the manuscript, not relegated to the supplementary section. The assembly benchmark should include other assemblers, not only NextDenovo. The assembly results and their justification are not quite convincing, since the proposed assembler is only slightly more resource-efficient at the cost of degraded assembly quality. While the case studies are interesting, it is hard to avoid concluding that the overall quality is degraded compared to regular NextDenovo.