CAGEcleaner: reducing genomic redundancy in gene cluster mining

Abstract

Summary

Mining homologous biosynthetic gene clusters (BGCs) typically involves searching colocalised genes against large genomic databases. However, the high degree of genomic redundancy in these databases often propagates into the resulting hit sets, complicating downstream analyses and visualisation. To address this challenge, we present CAGEcleaner, a Python-based pipeline with auxiliary bash scripts designed to reduce redundancy in gene cluster hit sets by dereplicating the genomes that host these hits. CAGEcleaner integrates seamlessly with widely used gene cluster mining tools, such as cblaster and CAGECAT, enabling efficient filtering and streamlining BGC discovery workflows.
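
The idea of filtering a hit set by dereplicating the host genomes can be illustrated with a minimal sketch. This is not CAGEcleaner's actual code: the function name `dereplicate_hits`, the hit record fields, and the genome-to-representative mapping are all hypothetical, and the sketch assumes that some genome dereplication step has already assigned each genome to a representative.

```python
def dereplicate_hits(hits, representative_of):
    """Keep only the hits hosted by a representative genome.

    hits: list of dicts with a "genome" key (hypothetical record layout).
    representative_of: dict mapping each genome to the representative of
        its dereplication cluster (assumed to be computed beforehand).
    """
    kept = []
    for hit in hits:
        genome = hit["genome"]
        # A genome missing from the mapping is treated as its own representative.
        if representative_of.get(genome, genome) == genome:
            kept.append(hit)
    return kept


# Toy example: genome_B is redundant with genome_A, so its hit is dropped.
hits = [
    {"hit_id": "hit1", "genome": "genome_A"},
    {"hit_id": "hit2", "genome": "genome_B"},
    {"hit_id": "hit3", "genome": "genome_C"},
]
representative_of = {
    "genome_A": "genome_A",
    "genome_B": "genome_A",
    "genome_C": "genome_C",
}
kept = dereplicate_hits(hits, representative_of)
print([hit["hit_id"] for hit in kept])  # ['hit1', 'hit3']
```

In this sketch the redundancy is removed at the genome level, so any hit on a non-representative genome is discarded regardless of the gene cluster it contains.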

Availability and implementation

Source code and documentation are hosted on GitHub (https://github.com/LucoDevro/CAGEcleaner) and Zenodo (https://doi.org/10.5281/zenodo.14726119) under an MIT license. For accessibility, CAGEcleaner is installable from Bioconda (https://anaconda.org/bioconda/cagecleaner) and PyPI (https://pypi.org/project/cagecleaner/), and is also available as a Docker image from DockerHub (https://hub.docker.com/r/lucodevro/cagecleaner).
