Cleanifier: Contamination removal from microbial sequences using spaced seeds of a human pangenome index

Abstract

Motivation

The first step when working with DNA sequence data from human-derived microbiomes is usually to remove human contamination, for two reasons. First, many countries have strict privacy and data protection guidelines for human sequence data, so microbiome data that contains human sequences cannot easily be processed further or published. Second, human contamination may cause problems in downstream analysis steps, such as genome assembly and binning. Fast and accurate removal of human contamination is therefore critical for large-scale metagenomics projects.

Results

We introduce Cleanifier, a fast and memory-frugal alignment-free tool for detecting and removing human contamination based on gapped k-mers, or spaced seeds. Cleanifier uses a pangenome index of all human gapped k-mers, but alternative references can also be created and used. Reads are filtered based on the gapped k-mers present in the index. Cleanifier supports two filtering modes: one that queries all gapped k-mers of a read and one that queries only a sample of them. A comparison of Cleanifier with other state-of-the-art tools shows that the sampling mode makes Cleanifier the fastest method while retaining comparable accuracy. Because the gapped k-mers are stored in a probabilistic Cuckoo filter, Cleanifier has memory requirements similar to those of methods that use a minimizer index. At the same time, Cleanifier is more flexible, because it can apply different sampling methods to the same index.
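The idea of filtering reads by gapped k-mers can be illustrated with a minimal sketch. It is not Cleanifier's implementation: the mask `##_#_##`, the function names, the 0.5 threshold, and the every-`step`-th sampling scheme are all hypothetical, a plain Python `set` stands in for the probabilistic Cuckoo filter, and reverse complements are ignored for brevity.

```python
from typing import Iterator, Set

# Hypothetical spaced-seed mask: '#' marks a position that is kept,
# '_' marks a position that is ignored (a "gap").
MASK = "##_#_##"


def gapped_kmers(seq: str, mask: str = MASK) -> Iterator[str]:
    """Yield the gapped k-mer of every window of len(mask) in seq."""
    keep = [i for i, c in enumerate(mask) if c == "#"]
    for start in range(len(seq) - len(mask) + 1):
        window = seq[start:start + len(mask)]
        if "N" in window:
            continue  # skip windows containing ambiguous bases
        yield "".join(window[i] for i in keep)


def build_index(reference: str, mask: str = MASK) -> Set[str]:
    """Collect all gapped k-mers of a reference.

    An exact set is used here; the abstract describes a probabilistic
    Cuckoo filter, which trades a small false-positive rate for memory.
    """
    return set(gapped_kmers(reference, mask))


def human_fraction(read: str, index: Set[str],
                   mask: str = MASK, step: int = 1) -> float:
    """Fraction of queried gapped k-mers of the read found in the index.

    step=1 queries every gapped k-mer; step>1 queries only every
    step-th one, as a simple stand-in for a sampling mode.
    """
    kmers = list(gapped_kmers(read, mask))[::step]
    if not kmers:
        return 0.0
    return sum(k in index for k in kmers) / len(kmers)


def is_contaminated(read: str, index: Set[str],
                    threshold: float = 0.5, step: int = 1) -> bool:
    """Flag a read when enough of its gapped k-mers hit the index."""
    return human_fraction(read, index, step=step) >= threshold


if __name__ == "__main__":
    index = build_index("ACGTACGTTGCA")  # toy "human" reference
    for read in ("CGTACGTT", "GGGGGGGGGG"):
        print(read, is_contaminated(read, index, step=2))
```

Because sampling only changes which gapped k-mers are looked up, not what is stored, the same index serves both modes in this sketch, which mirrors the flexibility the abstract attributes to Cleanifier.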

Availability and Implementation

The Cleanifier tool is available via GitLab (https://gitlab.com/rahmannlab/cleanifier), PyPI (https://pypi.org/project/cleanifier/) and Bioconda (https://anaconda.org/bioconda/cleanifier). The pre-computed human pangenome index is available for download at https://doi.org/10.5281/zenodo.15639519.

Contact

rahmann@cs.uni-saarland.de
