Atria: an ultra-fast and accurate trimmer for adapter and quality trimming

Abstract

With advances in next-generation sequencing, adapters attached to reads and low-quality bases directly and implicitly hinder downstream analysis. For example, they can produce false-positive single nucleotide polymorphisms (SNPs) and generate fragmented assemblies. There is a need for a fast trimming algorithm to remove adapters precisely, especially in read tails with relatively low quality. Here, we present Atria, a trimming program that matches the adapters in paired reads and finds possible overlapped regions using a fast and carefully designed byte-based matching algorithm (O(n) time with O(1) space). Atria also implements multi-threading in both sequence processing and file compression and supports single-end reads. Compared with other trimmers, Atria performs favorably in various trimming and runtime benchmarks of both simulated and real data. We also provide a fast and lightweight byte-based matching algorithm, which can be used in various short-sequence matching applications, such as primer search and seed scanning before alignment.
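To make the idea of bit-level adapter matching concrete, the following is a minimal illustrative sketch and not Atria's actual implementation. It assumes a one-hot 4-bit code per base, so that a bitwise AND of two packed sequences leaves exactly one set bit per identical base and a population count gives the number of matches. The names `CODE`, `pack`, `count_matches`, and `best_adapter_start`, the window length `k`, and the choice that `N` matches nothing are all assumptions made for this example. Python's arbitrary-precision integers are used only for readability; with fixed-width machine words each comparison would be a constant number of instructions.

```python
# Hypothetical one-hot codes: one bit per base, so AND + popcount counts matches.
CODE = {"A": 0b0001, "C": 0b0010, "G": 0b0100, "T": 0b1000, "N": 0b0000}


def pack(seq: str) -> int:
    """Pack a DNA string into an integer, 4 bits per base, first base in the lowest bits."""
    value = 0
    for i, base in enumerate(seq):
        value |= CODE[base] << (4 * i)
    return value


def count_matches(packed_read: int, packed_adapter: int, offset: int, k: int) -> int:
    """Count identical bases between a k-base adapter head and the read at `offset`."""
    mask = (1 << (4 * k)) - 1
    window = (packed_read >> (4 * offset)) & mask
    return bin(window & packed_adapter & mask).count("1")


def best_adapter_start(read: str, adapter: str, k: int = 16):
    """Scan every read offset and return (offset, matches) of the best-scoring one.

    Ties are resolved in favor of the earliest offset.
    """
    k = min(k, len(adapter))
    packed_read = pack(read)
    packed_adapter = pack(adapter[:k])
    best_pos, best_score = -1, -1
    for offset in range(len(read)):
        score = count_matches(packed_read, packed_adapter, offset, k)
        if score > best_score:
            best_pos, best_score = offset, score
    return best_pos, best_score
```

For example, `best_adapter_start(insert + adapter[:16], adapter)` reports the offset at which the adapter head begins together with the number of matching bases; a real trimmer would additionally weigh base qualities and apply decision thresholds before cutting the read.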

Article activity feed

  1. Background

    Reviewer 2. Alun Li.

    Is the language of sufficient quality? Yes

    Is there a clear statement of need explaining what problems the software is designed to solve and who the target audience is? Yes

    Is the source code available, and has an appropriate Open Source Initiative license (https://opensource.org/licenses) been assigned to the code? No. Additional Comments: There is no license in the GitHub repository.

    As Open Source Software are there guidelines on how to contribute, report issues or seek support on the code? Yes. GitHub can be used to report issues or seek support for the code.

    Is the code executable? Yes

    Is installation/deployment sufficiently outlined in the paper and documentation, and does it proceed as outlined? Yes

    Is the documentation provided clear and user friendly? Yes

    Is there a clearly-stated list of dependencies, and is the core functionality of the software documented to a satisfactory level? Yes

    Have any claims of performance been sufficiently tested and compared to other commonly-used packages? Yes

    Are there (ideally real world) examples demonstrating use of the software? Yes

    Is automated testing used or are there manual steps described so that the functionality of the software can be verified? Yes

    Additional Comments
    Any Additional Overall Comments to the Author: The paper describes Atria, an ultra-fast and accurate trimmer for adapter and quality trimming, and compares it with several published tools. The tool is demonstrated to work on sequencing data with accuracy and efficiency that are competitive with existing tools.

    There are concerns that should be addressed:

    1. The performance comparisons in Table 2 show that, with quality trimming, Atria is not markedly better than existing tools in the percentage of properly paired reads or the number of unmapped reads. Atria also offers no more features than existing tools such as Fastp, which may limit the widespread use of this software.
    2. I/O could be the main bottleneck on most hard-disk drives when performing adapter trimming with compressed input/output files, so the wall time needed to run the different tools is also a good measurement. I wonder whether there is still a significant performance advantage if the runtime benchmark is measured by wall time.
    3. Can the algorithm deal with adapter sequences of different lengths? It would be good to test the performance of the tools with increasing adapter length.
    4. L79 states that Atria is compatible with single-end data from the PacBio and Nanopore platforms, but there are no corresponding data in the paper to support this statement. Moreover, the limitations of the byte-based matching algorithm make it difficult to handle PacBio and Nanopore sequences, which have high insertion and deletion rates. If these limitations have been overcome, the paper should describe how in sufficient detail.
    5. It may be better to present the algorithm in pseudocode, especially in the sections “Matching and scoring” and “Decision rules”.
    6. L165-L168: I do not quite understand why an adapter is considered an ideal adapter when the matching score is greater than 10. Also, why is the read pair not trimmed when the matching score is less than 19? What are the authors' reasons for setting these two parameters to 10 and 19, respectively? In addition, the authors need to demonstrate that the program is robust to different adapter lengths.
    7. All symbols in the paper should be clearly identified, e.g., L115 a1, L121.
    8. L135: “Because the matching algorithm requires much less time, we implement four pairs of matching to utilize properties of paired-end reads thoroughly”. The causation here does not hold.
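The distinction raised in point 2 between CPU time and wall time can be illustrated with a small sketch. It is not taken from the paper or from any of the benchmarked tools: `timed` and `io_like_task` are hypothetical helpers, and a `time.sleep` call stands in for time spent waiting on disk. Note that `time.process_time` covers only the current process, so timing an external trimmer launched as a subprocess would require a different source of CPU time, for example the child fields of `os.times()` where the platform populates them.

```python
import time


def timed(func, *args):
    """Run func(*args) and return (result, wall_seconds, cpu_seconds)."""
    wall_start = time.perf_counter()   # elapsed real time, includes waiting
    cpu_start = time.process_time()    # CPU time of this process, excludes sleep
    result = func(*args)
    cpu_seconds = time.process_time() - cpu_start
    wall_seconds = time.perf_counter() - wall_start
    return result, wall_seconds, cpu_seconds


def io_like_task(seconds: float) -> float:
    """Stand-in for an I/O-bound step: the process waits but burns almost no CPU."""
    time.sleep(seconds)
    return seconds


if __name__ == "__main__":
    _, wall, cpu = timed(io_like_task, 0.2)
    print(f"wall = {wall:.3f} s, cpu = {cpu:.3f} s")
```

When a task is dominated by waiting, the CPU time stays far below the wall time, which is why a benchmark that reports only CPU time can hide an I/O bottleneck.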

    Recommendation: Minor Revisions

  2. Abstract

    Reviewer 1. Xingyu Liao

    Is the language of sufficient quality? Yes

    Is there a clear statement of need explaining what problems the software is designed to solve and who the target audience is? Yes

    Is the source code available, and has an appropriate Open Source Initiative license (https://opensource.org/licenses) been assigned to the code? Yes

    As Open Source Software are there guidelines on how to contribute, report issues or seek support on the code? Yes

    Is the code executable? Unable to test

    Is installation/deployment sufficiently outlined in the paper and documentation, and does it proceed as outlined? Yes

    Is the documentation provided clear and user friendly? Yes

    Is there a clearly-stated list of dependencies, and is the core functionality of the software documented to a satisfactory level? Yes

    Have any claims of performance been sufficiently tested and compared to other commonly-used packages? No

    Are there (ideally real world) examples demonstrating use of the software? No

    Is automated testing used or are there manual steps described so that the functionality of the software can be verified? Yes

    Additional Comments

    Opinion: Author Should Prepare a Major Revision.

    In this paper, the authors propose a trimming algorithm called Atria, which matches the adapters in paired-end reads and finds possible overlapped regions with a super-fast and carefully designed byte-based matching algorithm. Furthermore, Atria implements multi-threading in both sequence processing and file compression and supports single-end reads. The proposed algorithm has some significance in both theory and practical application. However, I still have some questions to discuss with the authors. My comments on the paper are as follows.

    (1) Major Comments:

    1. The authors highlight the speed and accuracy of the proposed trimming algorithm in the title of the manuscript. However, most of the content in the manuscript and supplementary material is devoted to demonstrating the advantages of the proposed algorithm in terms of speed, processing efficiency, and CPU and RAM utilization. The assessment of trimming accuracy is very limited, and only general statistics appear to be given in Table 2 of the manuscript. I personally think that the alignment rate of reads (or the number of paired-end reads) before and after trimming is not good evidence of the accuracy of a trimming algorithm. Moreover, judging from the experimental results in Table 2, Atria does not have much of an advantage in accuracy over the other methods. As the authors state in the abstract, sequence trimming is of great significance for SNP detection and sequence assembly. I would very much like to see how Atria improves these applications.
    2. The datasets used in this study seem to be unrepresentative, and most of them can be trimmed within a few to ten seconds. I think most users will not care about the difference between a few seconds and a dozen seconds. To demonstrate a significant efficiency advantage of the proposed algorithm, some large-scale datasets (such as several samples sequenced in the 1000 Genomes Project) should be used.

    (2) Minor Comments:

    3. The display of Table 2 at line 562 is incomplete.