CAGEcleaner: reducing genomic redundancy in gene cluster mining

Lucas De Vrieze
Miguel Biltjes
Sofya Lukashevich
Kodai Tsurumi
Joleen Masschelein

Abstract

Summary

Mining homologous biosynthetic gene clusters (BGCs) typically involves searching colocalised genes against large genomic databases. However, the high degree of genomic redundancy in these databases often propagates into the resulting hit sets, complicating downstream analyses and visualization. To address this challenge, we present CAGEcleaner, a Python-based pipeline with auxiliary bash scripts designed to reduce redundancy in gene cluster hit sets by dereplicating the genomes that host these hits. CAGEcleaner integrates seamlessly with widely used gene cluster mining tools, such as cblaster and CAGECAT, enabling efficient filtering and streamlining BGC discovery workflows.
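The idea of dereplicating hits through their host genomes can be illustrated with a minimal sketch. This is not CAGEcleaner's actual implementation: it assumes that some external genome dereplication step has already assigned each genome to a group of near-identical genomes, and the function name `dereplicate_hits`, the `genome` and `hit_id` keys, and the `genome_to_group` mapping are all hypothetical names chosen for illustration. The first genome seen in each group is arbitrarily treated as its representative.

```python
def dereplicate_hits(hits, genome_to_group):
    """Keep only hits hosted by one representative genome per group.

    hits: list of dicts, each with a "genome" key (hypothetical layout).
    genome_to_group: dict mapping genome ID -> group ID, assumed to come
        from an external genome dereplication step.
    """
    representative = {}  # group ID -> genome chosen as representative
    kept = []
    for hit in hits:
        genome = hit["genome"]
        # Genomes without a group assignment form their own singleton group.
        group = genome_to_group.get(genome, genome)
        # The first genome encountered in a group becomes its representative.
        rep = representative.setdefault(group, genome)
        if genome == rep:
            kept.append(hit)
    return kept


# Illustrative data only: G1 and G2 are assumed to be near-identical genomes.
example_hits = [
    {"genome": "G1", "hit_id": "h1"},
    {"genome": "G2", "hit_id": "h2"},
    {"genome": "G1", "hit_id": "h3"},
    {"genome": "G3", "hit_id": "h4"},
]
example_groups = {"G1": "groupA", "G2": "groupA"}
example_kept = dereplicate_hits(example_hits, example_groups)
```

In this sketch, all hits from a representative genome are retained, so a genome carrying several gene cluster hits keeps every one of them, while hits from redundant genomes in the same group are dropped.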

Availability and implementation

Source code and documentation are hosted on GitHub (https://github.com/LucoDevro/CAGEcleaner) and Zenodo (https://doi.org/10.5281/zenodo.14726119) under an MIT license. For accessibility, CAGEcleaner is installable from Bioconda (https://anaconda.org/bioconda/cagecleaner) and PyPI (https://pypi.org/project/cagecleaner/), and is also available as a Docker image from Docker Hub (https://hub.docker.com/r/lucodevro/cagecleaner).

Version published to 10.1093/bioinformatics/btaf373 (Jun 25, 2025)
Version published to 10.1101/2025.02.19.639057 on bioRxiv (Feb 20, 2025)
