Improving the Reliability and Quality of Nextflow Pipelines with nf-test
Abstract
The workflow management system Nextflow, together with the nf-core community, forms an essential ecosystem in bioinformatics. However, ensuring the correctness and reliability of large and complex pipelines is challenging, since a unified and automated unit-style testing framework specific to Nextflow has been missing. To provide this crucial component to the community, we developed the testing framework nf-test. It introduces a modular approach that enables pipeline developers to test individual process blocks, workflow patterns, and entire pipelines in isolation. nf-test is based on a syntax similar to Nextflow DSL 2 and provides unique features such as snapshot testing and smart testing, which saves resources by testing only changed modules. Using different pipelines, we show that these improvements minimize development time, reduce test execution time by up to 80%, and enhance software quality by identifying bugs and issues early. Already adopted by dozens of pipelines, nf-test improves the robustness and reliability of pipeline development.
This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf130), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
Reviewer 2: Katalin Ferenc
- General assessment of the work.
It is a very nice addition to the scientific community and an important step towards standardizing the development and maintenance of software for bioinformatics pipelines. It is not a trivial task to adapt unit-testing concepts to pipelines. nf-test has already been used by the community and has been in a feedback loop with its users. Thus, its usability has been constantly improving, both through the efforts of the developers and through additional plugins from the user base, highlighting the ease of contributing to the nf-test code base. The text is well written and easy to follow. However, some concepts could be better described and discussed for the readers.
- Specific comments for revision:
a) Major comments:
- The authors should refer to pytest-workflow in the introduction, along with NFTest, as both are used for comparison.
- Test coverage is helpful to identify which lines are vulnerable to changes. For the calculation of the test coverage in nf-test, indirect tests are considered. Does it mean that if a single integration test is written, then all called modules are considered covered? Please clarify or argue why this is a good strategy.
- An interesting idea in nf-test is to use snapshot testing for modules, workflows, and pipelines. As the authors mention, this has been used in web development. According to the cited reference, it is especially used for frontend code and has been noted as a quick but fragile way of testing. This is because snapshot testing does not provide insight into the correctness of the code, but only asserts that there was no change. It is beneficial that this test checks for unexpected changes that unit tests might miss. In the "Code reduction through snapshot testing" section, the authors highlight cases when snapshot testing results in failed tests: 1) when there is a change in the code due to a bug, and 2) when default parameters are modified. We understand that snapshot testing in the context of pipeline development is useful in two scenarios:
- when the pipeline itself is being refactored, the output of each module should stay the same. In this case, snapshot testing is used to fix the output of the tools, and a failing test highlights that the Nextflow code wrapping the tools is incorrectly integrated (i.e., connected to each other).
- pipeline / module versioning requires knowledge about changes in the underlying tools. In this case, snapshot testing helps because any failure in the tests flags a change. As there is no oracle, one would not know whether a bug was introduced or fixed. However, from the pipeline development perspective, the only thing that matters is that there should be a new version.

According to our understanding, in any other case a more traditional approach should be preferred, where an oracle knows about expected file formats, content, or errors. Otherwise, there is a risk of adding many tests that fail unnecessarily, increasing development time. Please add an explicit discussion of these scenarios, or others based on your insights, highlighting when snapshot testing is applicable/appropriate during pipeline development. Please also add a summary of the other types of tests (e.g., assertions about file or channel content, verification of tool execution given input data, and error handling checks) that can be run within the nf-test framework; a sketch contrasting the two testing styles follows below.
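As a concrete illustration of the distinction drawn above, here is a minimal nf-test sketch that tests the same hypothetical module in both styles: a snapshot test that merely fixes current behavior, and an oracle-style test whose assertions encode expected properties. The module name SAMTOOLS_SORT, the script path, and the test data are assumptions made for illustration, not taken from the manuscript.

```groovy
nextflow_process {

    name "Test SAMTOOLS_SORT"                 // hypothetical module
    script "modules/samtools/sort/main.nf"    // assumed path
    process "SAMTOOLS_SORT"

    // Snapshot style: records whatever the module currently produces.
    // A later failure only signals that something changed, not what is correct.
    test("output matches the recorded snapshot") {
        when {
            process {
                """
                input[0] = [ [ id:'test' ], file("tests/data/test.bam") ]
                """
            }
        }
        then {
            assert process.success
            assert snapshot(process.out).match()
        }
    }

    // Oracle style: explicit assertions encode expected properties,
    // so a failure points to a concrete violated expectation.
    test("emits exactly one sorted BAM per sample") {
        when {
            process {
                """
                input[0] = [ [ id:'test' ], file("tests/data/test.bam") ]
                """
            }
        }
        then {
            assert process.success
            assert process.out.bam.size() == 1                  // one output tuple
            assert process.out.bam.get(0).get(1) ==~ /.*\.bam/  // name check, no checksum
        }
    }
}
```

The snapshot file is generated on the first run and compared on subsequent runs, whereas the assertion-style test keeps passing across tool updates as long as the stated properties hold.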
- In the "evaluation and validation" section, the authors describe that they ran tests in nf-core/modules between github versions. Please clarify that these modules were already covered by tests.
- Table 4 is referenced in the Discussion section. It would be better to move the comparison between tools to the Results section.
- On page 16, typo: "queuing system"
- Figure 2 title typo: "nf-tet"
- Figure 2: please add comments about the time cost of adding tests during the development, as it is highlighted on the figure.
- Page 22 typo: "savings areis calculated"
- Abstract: "Build on…" should be "Built on…"
- Shouldn't TM2 linked to M3 be TM3 in Figure 1?
Reviewer 1: Jose Espinosa-Carrasco
The article presents nf-test, a new modular and automated testing framework designed specifically for Nextflow workflows, a widely used workflow management system in bioinformatics. nf-test aims to help developers improve the reliability and maintainability of complex Nextflow pipelines. The framework includes very useful features such as snapshot testing, which assesses the computational repeatability of the results produced by the execution of a pipeline or its components, and smart testing, which optimises computational resources by only executing tests on the parts of the pipeline that were modified, reducing overall run time. Notably, nf-test can be integrated into CI workflows and has already been adopted by the nf-core community, demonstrating its utility and maturity in real-world scenarios.
General comments:
The manuscript could benefit from reordering some sections to follow a more consistent structure and from removing redundant explanations. I think it would be nice to include one limitation of nf-test: the fact that reproducing previous results does not necessarily imply biological correctness. This point is not entirely clear in the current version of the manuscript (see my comment below). Another aspect that could improve the manuscript is the inclusion of at least one reference or explanation of how nf-test can be applied outside nf-core pipelines, as all the provided examples are currently restricted to nf-core.
Specific comments:
On page 3, the sentence "Thus, maintenance requires substantial time and effort to manually verify that the pipeline continues to produce scientifically valid results" could be more precise. I would argue that identical results across versions do not guarantee scientific validity; they merely confirm consistency with previous outputs. True scientific validity requires comparison against a known ground truth or standard.
On page 4, in the sentence "It is freely available, and extensive documentation is provided on the website", I think it would be nice to include the link to the documentation.
In the "Evaluation and Validation" section (page 8), it would be helpful to briefly state the goal of each evaluated test, as is done with the nf-gwas example. ou could include something similar for the nf-core/fetchngs and modules examples (e.g. to assess resource optimization through smart testing). Also, the paragraph references the "--related-tests" option, which could benefit from a short explanation of what it does. Lastly, the order in which the pipelines are presented in this section differs from the order in the Results, which makes the structure a bit confusing.
The sections titled "Unit testing in nf-test", "Test case execution", "Smart testing and parallelization", "Snapshot testing", and "Extensions for bioinformatics" seem more appropriate for the Materials and Methods section, as they describe the design and functionality of nf-test rather than reporting actual results. Please ignore this comment if the current structure follows specific journal formatting requirements that I may not be aware of.
The Snapshot testing discussion in the Results section feels somewhat repetitive with its earlier explanation. Consider combining both discussions or restructuring the content to reduce duplication.
On page 11, the sentence "In these cases, MD5 sums cannot be used and validating the dynamic output content can be time-intensive" is not entirely clear to me: does it mean that it is time-consuming to implement the test for this kind of file, or that the validation of the files is time-consuming?
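To illustrate my reading of that sentence: when an output file embeds run-specific content such as timestamps or tool versions, its MD5 sum differs between runs, so a test instead has to assert on the stable parts of the content, which takes more effort to implement. A minimal hedged sketch in nf-test syntax; the module name REPORTING, the output channel report, and the asserted line are hypothetical:

```groovy
nextflow_process {

    name "Test REPORTING"                      // hypothetical module
    script "modules/local/reporting/main.nf"   // assumed path
    process "REPORTING"

    test("report contains the stable summary line") {
        when {
            process {
                """
                input[0] = [ [ id:'test' ], file("tests/data/test.vcf") ]
                """
            }
        }
        then {
            assert process.success
            // The report embeds a timestamp, so its MD5 sum changes on every
            // run; assert on stable lines instead of comparing a checksum.
            def lines = path(process.out.report.get(0)).readLines()
            assert lines.any { it.startsWith("Total samples:") }   // hypothetical content
        }
    }
}
```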
On page 12, the sentence "Second, we analyzed the last 500 commits..." is confusing because this is actually the third point in the "Evaluation and Validation" section, as mentioned before. Reordering would improve clarity.
On page 14, the authors state "However, changes (b) and (c) lead to incorrect output results without breaking the pipeline. Thus, these are the worst-case scenarios for a pipeline developer." While this is mostly true, I would also add that a change in parameters may produce different, but not necessarily incorrect, results; some may even be more biologically meaningful. I suggest acknowledging this.
Typos:
In the abstract: "Build on a similar syntax as Nextflow DSL2" should be corrected to "Built on a similar syntax as Nextflow DSL2".
In the legend of Figure 2 (page 19): "nf-tet" should be "nf-test".
In the legend of Table 2: "Time savings areis calculated..." should be "Time savings are calculated..."
Recommendation:
Given the relevance and technical contributions of the manuscript, I recommend its publication after addressing the minor revisions summarized above.
