Streamlining remote nanopore data access with slow5curl

Abstract

As adoption of nanopore sequencing technology continues to advance, the need to maintain large volumes of raw current signal data for reanalysis with updated algorithms is a growing challenge. Here we introduce slow5curl, a software package designed to streamline nanopore data sharing, accessibility and reanalysis. Slow5curl allows a user to fetch a specified read or group of reads from a raw nanopore dataset stored on a remote server, such as a public data repository, without downloading the entire file. Slow5curl uses an index to quickly fetch specific reads from a large dataset in SLOW5/BLOW5 format and highly parallelised data access requests to maximise download speeds. Using all public nanopore data from the Human Pangenome Reference Consortium (>22 TB), we demonstrate how slow5curl can be used to quickly fetch and reanalyse signal reads corresponding to a set of target genes from each individual in a large cohort dataset (n = 91), minimising the time, egress costs, and local storage requirements for their reanalysis. We provide slow5curl as a free, open-source package that will reduce frictions in data sharing for the nanopore community: https://github.com/BonsonW/slow5curl
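
The index-based access described in the abstract can be illustrated with a minimal Python sketch. It assumes that the remote server honours HTTP byte-range requests and that an index maps each read ID to the byte offset and length of its record. The toy index, its values, and the function names `range_header` and `plan_requests` are hypothetical illustrations, not slow5curl's API or the real SLOW5 index format.

```python
# Hypothetical sketch of index-driven partial fetching.
# This is NOT slow5curl's code and NOT the real SLOW5 index layout.
from typing import Dict, List, Tuple

# Toy index: read ID -> (byte offset, record length in bytes). Values are made up.
ToyIndex = Dict[str, Tuple[int, int]]


def range_header(offset: int, length: int) -> str:
    """Build an HTTP Range header value for one record (end byte is inclusive)."""
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    return f"bytes={offset}-{offset + length - 1}"


def plan_requests(index: ToyIndex, read_ids: List[str]) -> List[Tuple[str, str]]:
    """Return (read_id, Range header value) pairs in the order requested."""
    plan = []
    for rid in read_ids:
        if rid not in index:
            raise KeyError(f"read {rid!r} not found in index")
        offset, length = index[rid]
        plan.append((rid, range_header(offset, length)))
    return plan


toy_index = {"read_a": (0, 100), "read_b": (100, 250), "read_c": (350, 80)}
requests_plan = plan_requests(toy_index, ["read_c", "read_a"])
for rid, header in requests_plan:
    print(rid, header)
```

Because each planned request is independent of the others, a client could issue many of them concurrently, which is the general idea behind parallelised data access.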

Article activity feed

  1. Competing Interest Statement: I.W.D. manages a fee-for-service sequencing facility at the Garvan Institute of Medical Research that is a customer of Oxford Nanopore Technologies but has no further financial relationship. H.G., J.M.F. and I.W.D. have previously received travel and accommodation expenses from Oxford Nanopore Technologies. The authors declare no other competing financial or non-financial interests.

    Reviewer 3. Guillermo Dufort y Alvarez

    This paper introduces slow5curl, a software tool that extends slow5lib, a previous tool developed by the authors. The tool allows users to retrieve raw nanopore sequencing reads from remote BLOW5 files, a novel format that has several advantages over FAST5 and POD5, which are the most widely used formats for storing raw nanopore sequencing data. BLOW5 is not yet a standard format, but this tool could encourage its adoption and the development of similar tools in the future. The paper is well written, clear, and concise, and the tool is tested on various scenarios. The GitHub repository provides clear instructions and examples for building and using the tool. My comments to the authors are:

    Major

    1. I am concerned about the main use case of the tool, which is to obtain a subset of raw nanopore reads that align to a specific region (e.g., a gene), in order to re-basecall them with a new software tool. This assumes that the alignment region of the original basecall is consistent with the new basecall, which may not be true. The new basecall sequences may align better to a different region, and some sequences that were not retrieved may align well to the desired region. This affects the precision and recall of the process. I would like the authors to address this issue, by either providing evidence that this is rare, or explaining why the tool is still useful despite this limitation.
    2. The tool depends on the availability of a BAM file for the raw reads, which is uploaded along with the BLOW5 file and its index. In the section Fetching reads from a large cohort, the authors claim that storing the raw nanopore data with its index reduces the size by 29.7% compared to FAST5. However, they do not consider the size of the BAM file, which is required for the main use case. I would like the authors to address this, by either reporting the size of the BAM files, or justifying why their size is irrelevant for this comparison.
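
    The precision and recall concern in major comment 1 can be made concrete with a small Python sketch. The read IDs and both read sets are invented for illustration. "Fetched" stands for the reads selected using the original alignments, and "relevant" stands for the reads that align to the target region after re-basecalling.

```python
# Illustrative only: the read IDs and both sets are invented, not results from the paper.
from typing import Set, Tuple


def precision_recall(fetched: Set[str], relevant: Set[str]) -> Tuple[float, float]:
    """Precision and recall of the fetched reads against the truly relevant reads."""
    true_positives = len(fetched & relevant)
    precision = true_positives / len(fetched) if fetched else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Reads selected because the ORIGINAL basecalls aligned to the target region.
fetched = {"r1", "r2", "r3", "r4"}
# Reads whose NEW basecalls align to the target region (hypothetical ground truth).
relevant = {"r2", "r3", "r4", "r5"}

p, r = precision_recall(fetched, relevant)
print(f"precision={p:.2f} recall={r:.2f}")
```

    In this toy case one fetched read no longer aligns to the region (lowering precision) and one relevant read was never fetched (lowering recall).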

    Minor

    1. In section RESULTS, in line two, delete the repeated word simple from "simple BLOW5 simple".

    Re-review. The authors correctly addressed each one of the comments I made. From my side, no further changes are needed for publication. Great work.

  2. Reviewer 2. Yunfan Fan

    Comments to Author: In this manuscript, the authors demonstrate a highly streamlined method for downloading targeted subsets of raw ONT electrical signals, for re-analysis. In my view, this will be a highly useful tool for researchers working with public nanopore data, and I hope to see its widespread adoption. The benchmarks are well-described in the manuscript, and the code is publicly available and well-documented. I have no other notes or suggestions for the authors.

  3. This work has been published in GigaScience under a CC-BY 4.0 license (https://doi.org/10.1093/gigascience/giae016), and the reviews have been published under the same license. These are as follows.

    Reviewer 1: Jan Voges

    Comments to Author: The manuscript provides a detailed overview of the proposed technology, with an emphasis on reproducibility through precise documentation of software versions and command lines. Although slow5curl is a rather simple implementation of curl-based streaming for nanopore data, it is extensively evaluated. In this way, its value to the nanopore community is made clear. I do have a few minor comments:

    Introduction: The importance of preserving raw signal data needs to be more clearly articulated. There is a view within the community that reads that have undergone high-accuracy base calling and methylation calling are sufficient for distribution and long-term storage. Clarifying the importance of raw data retention would strengthen the introduction.

    Results: Please rephrase "[…]fetch a specific read(s) […]".

    Results: It should be stated more explicitly that BLOW5 is a compressed data representation and therefore suitable for streaming.

    Results: "The simple BLOW5 simple file-structure[…]" -> "The simple BLOW5 file structure […]"

    Discussion: "[…] users must upload a single FAST5 tarball for a given datasets" -> "[…] users must upload a single FAST5 tarball for a given dataset"

    Discussion: While the SLOW5 ecosystem is described in detail, it would be beneficial to discuss whether there are any alternative solutions or technologies that provide a comparative perspective.

    Discussion: It would be interesting to discuss the possible standardization of the SLOW5 ecosystem. What is the vision? An academically centered open-source ecosystem? A proprietary system? A more "formal" standard (GA4GH, ISO/IEC)?