Hidden Contaminants in Sponge Genomes: Large-Scale Decontamination of 30 Public Assemblies

Abstract

Sponges (phylum Porifera) are early-diverging metazoans that play central ecological roles and serve as models for understanding animal evolution. However, their associations with diverse microbial communities increase the risk of contamination in publicly available datasets, potentially compromising downstream biological inference. Despite growing genomic resources, systematic assessments of contamination in sponge genome assemblies have been lacking.

Here, we present a comprehensive contamination analysis of 30 publicly available sponge genome assemblies and introduce a reproducible and easily adoptable decontamination pipeline tailored to non-model organisms. Using this framework, we provide decontaminated versions of the analysed assemblies. The pipeline integrates three complementary lines of evidence: compositional outlier detection based on k-mer profiles and GC content, protein-level taxonomic classification using DIAMOND, and nucleotide-level classification with Kraken2. Scaffolds are designated as contaminants when supported by at least two independent signals. Pipeline performance was validated using a realistic spike-in dataset composed of bona fide sponge sequences and representative contaminant genomes.
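The consensus rule described above (a scaffold is flagged only when at least two of the three lines of evidence agree) can be illustrated with a minimal sketch. This is not the authors' implementation: the class name, field names, and the way each tool's output is reduced to a single boolean are assumptions made purely for illustration.

```python
from dataclasses import dataclass

# Hypothetical per-scaffold record; the field names are illustrative
# assumptions and do not come from the published pipeline.
@dataclass
class ScaffoldEvidence:
    name: str
    compositional_outlier: bool  # k-mer profile / GC content outlier
    diamond_non_host: bool       # protein-level (DIAMOND) hit to a non-sponge taxon
    kraken2_non_host: bool       # nucleotide-level (Kraken2) hit to a non-sponge taxon

MIN_SIGNALS = 2  # "at least two independent signals"


def count_signals(ev: ScaffoldEvidence) -> int:
    """Number of evidence lines that point to contamination."""
    return sum((ev.compositional_outlier, ev.diamond_non_host, ev.kraken2_non_host))


def is_contaminant(ev: ScaffoldEvidence, min_signals: int = MIN_SIGNALS) -> bool:
    """Apply the consensus rule to one scaffold."""
    return count_signals(ev) >= min_signals


def split_scaffolds(evidence):
    """Return (kept, removed) scaffold names under the consensus rule."""
    kept, removed = [], []
    for ev in evidence:
        (removed if is_contaminant(ev) else kept).append(ev.name)
    return kept, removed


if __name__ == "__main__":
    # Toy records, not data from the study.
    toy = [
        ScaffoldEvidence("scaffold_1", False, False, False),
        ScaffoldEvidence("scaffold_2", True, False, False),
        ScaffoldEvidence("scaffold_3", True, True, False),
        ScaffoldEvidence("scaffold_4", True, True, True),
    ]
    kept, removed = split_scaffolds(toy)
    print("kept:", kept)
    print("removed:", removed)
```

Requiring agreement between signals means that a single unusual feature, such as atypical GC content alone, is not enough to discard a scaffold; this trades some sensitivity for a lower risk of removing genuine host sequence.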

The decontamination pipeline achieved 96.8% accuracy, 99.6% precision, and 90.8% recall, maintaining consistently strong recall across the vast majority of analysed taxa. In addition, taxonomic assignments were accurately resolved to the genus level for 96.3% of identified contaminants. Application to public assemblies revealed variable levels of contamination. On average, 14.5% of scaffolds per assembly were classified as contaminants, although they represented a small fraction of the total genome length, indicating that contamination is concentrated in relatively short scaffolds. Detected contaminants were dominated by bacterial phyla commonly associated with sponge microbiomes, including Pseudomonadota, Chloroflexota, and Poribacteria, with additional archaeal, protozoan, algal, and fungal sequences. Importantly, the number of complete BUSCO orthologs remained virtually unchanged following contaminant removal, indicating minimal loss of genuine host scaffolds.
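For readers less familiar with the validation metrics quoted above, the sketch below shows the standard definitions of accuracy, precision, and recall for a spike-in test. It assumes that contaminant scaffolds are the positive class; the function name and the counts in the usage example are hypothetical toy values, not figures from the study.

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Standard binary-classification metrics.

    Assumed convention (not stated in the abstract): "positive" means
    a scaffold called as a contaminant.
      tp: contaminant scaffolds correctly flagged
      fp: host scaffolds wrongly flagged
      tn: host scaffolds correctly kept
      fn: contaminant scaffolds missed
    """
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


if __name__ == "__main__":
    # Toy counts for illustration only.
    for metric, value in classification_metrics(tp=8, fp=2, tn=88, fn=2).items():
        print(f"{metric}: {value:.3f}")
```

Under this convention, high precision with somewhat lower recall describes a conservative classifier: almost everything it removes is a true contaminant, while a minority of contaminants go undetected.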

Taken together, our study provides 30 curated sponge genome assemblies and a consensus-based decontamination framework tailored to non-model organisms, improving the reliability of genomic resources for evolutionary, ecological, and functional analyses.
