CAGEcleaner: reducing genomic redundancy in gene cluster mining
Abstract
Summary
Mining homologous biosynthetic gene clusters (BGCs) typically involves searching colocalised genes against large genomic databases. However, the high degree of genomic redundancy in these databases often propagates into the resulting hit sets, complicating downstream analyses and visualisation. To address this challenge, we present CAGEcleaner, a Python-based pipeline with auxiliary bash scripts designed to reduce redundancy in gene cluster hit sets by dereplicating the genomes that host these hits. CAGEcleaner integrates seamlessly with widely used gene cluster mining tools, such as cblaster and CAGECAT, enabling efficient filtering and streamlining BGC discovery workflows.
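The idea of filtering a hit set through genome dereplication can be illustrated with a minimal Python sketch. This is not CAGEcleaner's actual implementation: the function name, the `(genome_id, hit_id)` tuple layout, and the genome-to-representative mapping are all hypothetical, and the sketch assumes that some external dereplication step has already grouped redundant genomes and chosen one representative per group.

```python
def dereplicate_hits(hits, genome_to_representative):
    """Keep only gene cluster hits hosted by representative genomes.

    hits: list of (genome_id, hit_id) tuples (hypothetical layout).
    genome_to_representative: dict mapping each genome ID to the ID of the
        representative genome of its dereplication group (hypothetical).
    """
    kept = []
    for genome_id, hit_id in hits:
        # Genomes absent from the mapping are treated as their own representative.
        representative = genome_to_representative.get(genome_id, genome_id)
        if representative == genome_id:
            kept.append((genome_id, hit_id))
    return kept


# Illustrative data only: genome_B is redundant with genome_A.
example_hits = [
    ("genome_A", "hit_1"),
    ("genome_B", "hit_2"),
    ("genome_C", "hit_3"),
]
example_mapping = {
    "genome_A": "genome_A",
    "genome_B": "genome_A",
    "genome_C": "genome_C",
}
filtered_hits = dereplicate_hits(example_hits, example_mapping)
```

In this toy example the hit hosted by `genome_B` is dropped because that genome is represented by `genome_A`, while the hits hosted by the two representative genomes are retained in their original order.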
Availability and implementation
Source code and documentation are hosted on GitHub (https://github.com/LucoDevro/CAGEcleaner) and Zenodo (https://doi.org/10.5281/zenodo.14726119) under an MIT license. For accessibility, CAGEcleaner is installable from Bioconda (https://anaconda.org/bioconda/cagecleaner) and PyPI (https://pypi.org/project/cagecleaner/), and is also available as a Docker image from Docker Hub (https://hub.docker.com/r/lucodevro/cagecleaner).