SqueezeCall: nanopore basecalling using a Squeezeformer network

Curation statements for this article:
  • Curated by GigaByte

    Editors Assessment:

    The accuracy of nanopore basecalling still needs to be improved. Building on recent advances in deep learning, this paper introduces SqueezeCall, a novel end-to-end tool for accurate basecalling. It uses a Squeezeformer architecture, which integrates local context extraction through convolutional layers with long-range dependency modeling via global context acquisition. Testing and peer review demonstrated that SqueezeCall outperformed traditional RNN- and Transformer-based basecallers across multiple datasets, indicating its potential to refine genomic assembly and facilitate direct detection of modified bases in future genomic analytics. Ongoing work will focus on training on highly curated datasets, including known modifications, to further increase research value. SqueezeCall is MIT licensed and available from GitHub: https://github.com/labcbb/SqueezeCall

    This evaluation refers to version 1 of the preprint

This article has been Reviewed by the following groups

Abstract

Nanopore sequencing, a third-generation sequencing technique, enables direct RNA sequencing, real-time analysis, and long-read length. Nanopore sequencers measure electrical current changes as nucleotides pass through nanopores; a basecaller identifies base sequences according to the raw current measurements. However, accurate basecalling remains challenging due to molecular variations and sequencing noise. Here, we introduce SqueezeCall, a novel Squeezeformer-based model for accurate nanopore basecalling. SqueezeCall uses convolution layers to down-sample raw signals and model local dependencies. A Squeezeformer network captures the global context, and a connectionist temporal classification (CTC) decoder with beam search generates DNA sequences. Experimental results demonstrated SqueezeCall’s ability to resist noise, improving basecalling accuracy. We trained SqueezeCall combining three types of loss, and found that all three loss types contribute to basecalling accuracy. Experiments across multiple species demonstrated the potential of a Squeezeformer-based model to improve basecalling accuracy and its superiority over recurrent neural network-based models and Transformer-based models.
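The CTC decoding step mentioned in the abstract can be illustrated with a minimal best-path (greedy) sketch. This is not the beam search decoder the authors describe, and it is not taken from the SqueezeCall code. The label layout (index 0 as blank, 1 to 4 as A, C, G, T) and the helper names are assumptions made for illustration. It shows only the standard CTC collapse rule: take the best label per frame, merge consecutive repeats, then drop blanks.

```python
from itertools import groupby

# Assumed label layout for this sketch: 0 is the CTC blank, 1-4 are bases.
BLANK = 0
ALPHABET = {1: "A", 2: "C", 3: "G", 4: "T"}


def ctc_greedy_decode(frame_scores):
    """Best-path CTC decoding: argmax per frame, merge repeats, drop blanks."""
    # Highest-scoring label index for every time step.
    best_path = [
        max(range(len(scores)), key=scores.__getitem__) for scores in frame_scores
    ]
    # Consecutive identical labels collapse into a single label.
    collapsed = [label for label, _ in groupby(best_path)]
    # Blanks only separate repeated bases; they are not emitted.
    return "".join(ALPHABET[label] for label in collapsed if label != BLANK)


def one_hot(label, num_labels=5):
    """Hypothetical helper that builds a toy score vector for one frame."""
    return [1.0 if i == label else 0.0 for i in range(num_labels)]


# Toy frame sequence: blank, A, A, blank, A, C, C, blank, G
frames = [one_hot(label) for label in [0, 1, 1, 0, 1, 2, 2, 0, 3]]
decoded = ctc_greedy_decode(frames)
```

The blank between the two runs of A is what lets the decoder emit two separate A bases. A beam search keeps several candidate paths per step instead of only the single best label.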

Article activity feed

  1. Editors Assessment:

    The accuracy of nanopore basecalling still needs to be improved. Building on recent advances in deep learning, this paper introduces SqueezeCall, a novel end-to-end tool for accurate basecalling. It uses a Squeezeformer architecture, which integrates local context extraction through convolutional layers with long-range dependency modeling via global context acquisition. Testing and peer review demonstrated that SqueezeCall outperformed traditional RNN- and Transformer-based basecallers across multiple datasets, indicating its potential to refine genomic assembly and facilitate direct detection of modified bases in future genomic analytics. Ongoing work will focus on training on highly curated datasets, including known modifications, to further increase research value. SqueezeCall is MIT licensed and available from GitHub: https://github.com/labcbb/SqueezeCall

    This evaluation refers to version 1 of the preprint

  2. ABSTRACT: Nanopore sequencing, a novel third-generation sequencing technique, offers significant advantages over other sequencing approaches, owing especially to its capabilities for direct RNA sequencing, real-time analysis, and long-read length. During nanopore sequencing, the sequencer measures changes in electrical current that occur as each nucleotide passes through the nanopores. A basecaller identifies the base sequences according to the raw current measurements. However, due to variations in DNA and RNA molecules, noise from the sequencing process, and limitations in existing methodology, accurate basecalling remains a challenge. In this paper, we introduce SqueezeCall, a novel approach that uses an end-to-end Squeezeformer-based model for accurate nanopore basecalling. In SqueezeCall, convolution layers are used to down-sample raw signals and to model local dependencies. A Squeezeformer network is employed to capture the global context. Finally, a connectionist temporal classification (CTC) decoder generates the DNA sequence using a beam search algorithm. Inspired by the Wav2vec2.0 model, we masked a proportion of the time steps of the convolution outputs before feeding them to the Squeezeformer network and replaced them with a trained feature vector shared between all masked time steps. Experimental results demonstrate that this method enhances our model’s ability to resist noise and allows for improved basecalling accuracy. We trained SqueezeCall using a combination of three types of loss: CTC-CRF loss, intermediate CTC-CRF loss, and KL loss. Ablation experiments show that all three types of loss contribute to basecalling accuracy. Experiments on multiple species further demonstrate the potential of the Squeezeformer-based model to improve basecalling accuracy and its superiority over a recurrent neural network (RNN)-based model and Transformer-based models.
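The masking step described in the abstract can be sketched as follows. This is a minimal illustration and not the authors' implementation. It assumes a Wav2vec2.0-style scheme in which each time step becomes a span start with some probability and the following steps are overwritten by one shared vector. The function names are hypothetical, the default values mirror the `mask_time_prob = 0.05` and `mask_time_length = 5` settings quoted in the reviews, and the mask vector is a fixed placeholder here (in the model it would be a trained parameter).

```python
import random


def sample_span_mask(num_steps, mask_prob=0.05, span_length=5, seed=0):
    """Return a boolean list marking which time steps are masked.

    Each time step is chosen as a span start with probability `mask_prob`;
    the `span_length` steps from that start are masked. Spans may overlap.
    """
    rng = random.Random(seed)
    mask = [False] * num_steps
    for start in range(num_steps):
        if rng.random() < mask_prob:
            for t in range(start, min(start + span_length, num_steps)):
                mask[t] = True
    return mask


def apply_mask(features, mask, mask_vector):
    """Replace every masked frame with the single shared mask vector."""
    return [
        list(mask_vector) if masked else list(frame)
        for frame, masked in zip(features, mask)
    ]


# Toy convolution output: 200 time steps with 4 features each.
features = [[float(t)] * 4 for t in range(200)]
mask_vector = [-1.0] * 4  # placeholder; a learned parameter in a real model
mask = sample_span_mask(len(features))
masked_features = apply_mask(features, mask, mask_vector)
```

Because every masked position receives the same vector, the network has to infer the hidden frames from their unmasked neighbours, which is one plausible reading of why the abstract reports better noise resistance.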

    This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.148). These reviews (including a protocol review) are as follows.

    Reviewer 1. Tao Jiang

    In this study, Zhongxu ZHU presents a novel approach combining the Squeezeformer architecture with masking techniques for nanopore basecalling, demonstrating meaningful improvements over existing methods. However, several concerns need to be addressed before publication.

    1. The rationale behind the chosen hyperparameter values (e.g., mask_time_prob = 0.05 and mask_time_length = 5) is unclear. Did the authors experiment with other hyperparameter settings? If so, please provide results or justification for selecting these specific values.
    2. The signal preprocessing methodology would benefit from a more detailed explanation. Specifically, the current description should clarify whether standard signal normalization techniques were applied to the raw current signals and detail any FFT preprocessing steps. Since nanopore sequencing signals can vary significantly between different species and experimental runs, explaining how SqueezeCall handles these variations would help other researchers implement and potentially improve upon this work. The author could consider including a flowchart or detailed pseudocode of the preprocessing pipeline.
    3. A more detailed analysis of the model's error handling would strengthen the paper. Specifically, how effectively does SqueezeCall address key challenges in nanopore sequencing, such as homopolymer errors?
    4. The manuscript requires attention to detail in presentation, such as: I) In Table 1, the mismatch rate (3.68) for the NA12878 Human Dataset is partially bolded, which should be corrected for consistency. II) On page 12, line 19, there is an unnecessary "e.g." before "SqueezeCall," which should be removed.
    5. Instances of "Error! Reference source not found" are present in the manuscript. Please resolve these citation errors to ensure clarity and credibility.
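To make the normalization question in comment 2 concrete, here is a minimal sketch of median/MAD scaling, a robust normalization commonly applied to raw nanopore current before basecalling. Whether SqueezeCall uses this exact scheme is not stated on this page, so treat it as an assumption. The function name and the 1.4826 consistency constant (which makes the MAD comparable to a standard deviation for Gaussian data) are illustrative choices.

```python
from statistics import median


def normalize_signal(raw_current, scale=1.4826):
    """Centre a raw current trace on its median and scale it by its MAD.

    Median and median absolute deviation (MAD) are far less sensitive to
    current spikes than mean and standard deviation.
    """
    med = median(raw_current)
    mad = median([abs(x - med) for x in raw_current])
    if mad == 0:
        raise ValueError("signal has zero MAD and cannot be scaled")
    return [(x - med) / (scale * mad) for x in raw_current]


# Toy trace with one large spike that would distort mean/std scaling.
raw = [80.0, 82.0, 85.0, 90.0, 300.0]
normalized = normalize_signal(raw)
```

Scaling each read by its own statistics is one way to reduce run-to-run and species-to-species offsets in the current level, which is the variation the reviewer asks about.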

    Re-review: The revised manuscript addresses most of my concerns; however, I have a few additional suggestions before recommending it for publication: 1) The newly added experimental Mask module presents only the results. Charts should be included to provide a more intuitive and visual representation of these results. 2) The images included in the Response should also be incorporated into the main text or published as supplementary materials alongside the manuscript. 3) The formulas in the manuscript are missing corresponding numbers. It is recommended to add numbers to each formula for clarity and ease of reference.

    Reviewer 2. Ximei Luo

    This manuscript describes a tool called SqueezeCall, designed for accurate nanopore basecalling. The authors compare SqueezeCall with four existing basecalling methods across 11 different datasets and report that it outperforms them in terms of basecalling accuracy. However, the study has several shortcomings and requires further clarification. Below are my comments.

    1. The current discussion and conclusion section lacks sufficient analysis of the scientific and practical value of the proposed algorithm for nanopore sequencing. To strengthen the manuscript, consider expanding the conclusion section to provide a detailed discussion on the practical applications of the tool in real-world nanopore sequencing workflows. Additionally, include potential directions for further improvement of the algorithm to inspire future research and development in this area.
    2. The figures in the manuscript are blurry and should be improved for clarity. Additionally, the layout requires better structuring and alignment, ensuring that the borders are neat and consistent. Efforts should be made to enhance the visual appeal of the figures, and the accompanying descriptions should provide sufficient detail to enable readers to understand the content by reviewing the figures alone.
    3. To enhance the showcasing of SqueezeCall's superiority, it is advisable to include one or two of the latest methods for comparison.

    Minor comments:

    1. There are instances of missing punctuation marks in sentences throughout the article. For example, the sentence on page 3, line 9, is missing a period at the end.
    2. Address the "Reference not found" issues that appear in several places in the manuscript.
    3. Number all formulas in the manuscript for easier reference and citation.
    4. Verify that all references are complete and formatted according to the target journal's guidelines.
    5. Some areas in Table 1 that necessitate emphasis through bold formatting are inaccurately labeled.
    6. Certain content in Figure 1 and Figure 2 appears redundant; consolidation is recommended to streamline the visuals.

    Reviewer 3. Yongtian Wang

    The manuscript presents SqueezeCall, an innovative approach that combines Squeezeformer architecture with masking techniques for nanopore basecalling. The work demonstrates promising accuracy improvements through comprehensive evaluation across multiple datasets, including human, lambda phage, and nine bacterial datasets. The architecture thoughtfully integrates convolution layers for signal downsampling, employs a Squeezeformer network for capturing global context, and introduces a novel masking technique inspired by Wav2vec2.0. While the research direction and initial results are valuable, several aspects could be strengthened to enhance the work's impact:

    1. Several formatting inconsistencies in the manuscript require attention for improved clarity. In Table 1, the mismatch rate (3.68) for the NA12878 Human Dataset is partially bolded, which affects the table's readability. On page 12, line 19, the redundant "e.g." before "squeezecall" should be removed. The citation system needs review as multiple instances of "Error! Reference source not found" appear throughout.
    2. The mask hyperparameter selection (mask_time_prob = 0.05 and mask_time_length = 5) requires empirical justification. Including ablation studies showing model performance with different masking probabilities (e.g., 0.01, 0.03, 0.07, 0.1) and lengths (e.g., 3, 7, 10) would provide valuable insights. This analysis could reveal whether the chosen values are optimal or if there's room for improvement. A visualization of how different masking parameters affect model performance could be particularly instructive.
    3. The error analysis could be expanded to provide deeper technical insights. The author should particularly analyze the distribution of skip and stay errors in homopolymer regions (e.g., AAAAA or GGGGG) where nanopore basecalling typically struggles.
    4. The manuscript would benefit from exploring modified base calling capabilities. The author could train and evaluate the model on datasets containing known DNA modifications (e.g., 5mC, 6mA). This could start with synthetic sequences containing known modifications and extend to well-characterized genomic regions. Even if full modified base calling is beyond the current scope, preliminary results or architectural considerations for future extension would be valuable.
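As a starting point for the homopolymer error analysis requested above, the sketch below locates homopolymer runs in a sequence. It is an illustrative helper, not part of SqueezeCall. The function name and the minimum run length of 4 are assumptions, and a full analysis would also need read-to-reference alignment to attribute skip and stay errors to each run.

```python
from itertools import groupby


def homopolymer_runs(sequence, min_length=4):
    """List (start, base, length) for every run of one base of at least min_length."""
    runs = []
    position = 0
    for base, group in groupby(sequence):
        length = sum(1 for _ in group)
        if length >= min_length:
            runs.append((position, base, length))
        position += length
    return runs


# Toy reference containing an A run of 5 and a G run of 4.
reference = "ACGAAAAATTGGGGC"
runs = homopolymer_runs(reference)
```

Comparing the run lengths found in a reference with those in the aligned basecalled read would show whether a basecaller tends to shorten or lengthen homopolymers.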