Identifying eukaryotes in drinking water metagenomes and factors influencing their biogeography

Abstract

The biogeography of eukaryotes in drinking water systems is poorly understood relative to that of prokaryotes or viruses. A common challenge in studying complex eukaryotic communities from natural and engineered systems is that metagenomic analysis workflows for eukaryotes are not as mature as those that focus on prokaryotes or even viruses. In this study, we benchmarked different strategies to recover eukaryotic sequences and genomes from metagenomic data and applied the best-performing workflow to explore eukaryotic communities present in drinking water distribution systems (DWDSs). We developed an ensemble approach that exploits k-mer and reference-based strategies to improve eukaryotic sequence identification from metagenomes and identified MetaBAT2 as the best-performing binning approach for clustering eukaryotic sequences. Applying this workflow to the DWDS metagenomes showed that eukaryotic sequences typically constituted a small proportion (i.e., <1%) of the overall metagenomic data. Eukaryotic sequences showed higher relative abundances in surface water-fed and chlorine-disinfected systems. Further, the alpha- and beta-diversity of eukaryotes were correlated with those of prokaryotic and viral communities. Finally, a co-occurrence analysis highlighted clusters of eukaryotes whose presence and abundance in DWDSs are affected by disinfection strategies, climate conditions, and source water types.

Synopsis

After benchmarking tools and developing a dedicated consensus workflow for detecting eukaryotic sequences in metagenomes, we investigated the experimental, environmental, and engineering factors affecting the biogeography of eukaryotes in drinking water distribution systems.

Article activity feed

  1. 3.2 Factors affecting eukaryotic abundance in DWDS metagenomes

    I'm not sure if this is helpful, but especially if you end up with specific genomes that you want to look for, you could try using sourmash branchwater: https://www.biorxiv.org/content/10.1101/2022.11.02.514947v1. If you have a eukaryotic genome you're interested in, you could sketch it (sourmash sketch) and then use the branchwater tool to search most metagenomes in the SRA to see which ones have high containment with the genome you searched. You could then use the SRA metadata tables to filter to wastewater samples and then dig further into the biogeography of those.

  2. The majority of the sequenced data in metagenomic assemblies from complex environmental samples are typically contained in short contigs (e.g., < 5 kbp), especially in case of complex communities with low abundance organisms [17,75,76]

    This would be really helpful context to have in the introduction, since it would inform why you chose to structure the methods (short kb contigs) the way you did.

  3. k-mer signature differences

    Would you be willing to briefly state the k-mer size used for this? I could imagine very different results for a k-mer size of 4 (tetranucleotide abundances) vs. 21 or 31 (which are generally genus- or species-specific).

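
Two of the comments above turn on k-mers: containment of a genome in a metagenome, and how strongly the choice of k-mer size changes the result. The following is a minimal illustrative sketch of both ideas, not the method used in the article and not how sourmash or branchwater compute containment (those tools work on subsampled sketches rather than exact k-mer sets, and handle reverse complements). The sequences are random toy data, and the names `kmers`, `containment`, and `random_dna` are hypothetical helpers introduced only for this example.

```python
import random


def kmers(seq, k):
    """Return the set of distinct k-mers in a DNA sequence (forward strand only)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def containment(query, target, k):
    """Fraction of the query's distinct k-mers that also occur in the target."""
    query_kmers = kmers(query, k)
    if not query_kmers:
        return 0.0
    return len(query_kmers & kmers(target, k)) / len(query_kmers)


def random_dna(n, rng):
    """Generate a random DNA string of length n (toy data, assumption)."""
    return "".join(rng.choice("ACGT") for _ in range(n))


rng = random.Random(0)
genome = random_dna(2_000, rng)       # toy "genome of interest"
unrelated = random_dna(20_000, rng)   # toy metagenome that lacks the genome
# Toy metagenome with the genome embedded in the middle.
with_genome = unrelated[:10_000] + genome + unrelated[10_000:]

for k in (4, 21):
    print(
        f"k={k}: "
        f"unrelated={containment(genome, unrelated, k):.3f}, "
        f"with_genome={containment(genome, with_genome, k):.3f}"
    )
```

With only 256 possible 4-mers, nearly every tetranucleotide occurs in any sufficiently long sequence, so containment at k=4 is close to 1 even for the unrelated metagenome. At k=21, shared k-mers are essentially absent unless the genome itself is present. This is why k=4 is typically used as a compositional signal (frequencies rather than presence or absence), while k=21 or 31 is used for containment-style searches.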