aMeta: an accurate and memory-efficient ancient metagenomic profiling workflow

Abstract

Analysis of microbial data from archaeological samples is a growing field with great potential for understanding ancient environments, lifestyles, and diseases. However, high error rates have been a challenge in ancient metagenomics, and the availability of computational frameworks that meet the demands of the field is limited. Here, we propose aMeta, an accurate metagenomic profiling workflow for ancient DNA designed to minimize the number of false discoveries and computer memory requirements. Using simulated data, we benchmark aMeta against a current state-of-the-art workflow and demonstrate its superiority in microbial detection and authentication, as well as substantially lower usage of computer memory.

Article activity feed

  1. To validate the results from the KrakenUniq pre-screening step and further eliminate potential false-positive microbial identifications, aMeta performs an alignment with the Lowest Common Ancestor (LCA) algorithm implemented in MALT [20]. Alternatively, aMeta users can select Bowtie2 for a faster and more memory-efficient analysis, though without LCA alignments; see Supplementary Information S2. While more suitable than Bowtie2 for metagenomic profiling, MALT is very resource-demanding. In practice, only reference databases of limited size are affordable when performing analysis with MALT, which can compromise the accuracy of microbial detection; for more details see Supplementary Information S3. Consequently, we aim to combine the unique capacity of KrakenUniq to work with large databases with the advantages of MALT for validating results via an LCA alignment. For this purpose, aMeta automatically builds a project-specific MALT database based on a filtered list of the microbial species identified by KrakenUniq. In other words, the combined set of microbes across samples that remain after depth- and breadth-of-coverage filtering of the KrakenUniq outputs is used to build a MALT database, which allows LCA-based MALT alignments to be run with realistic computational resources. We found that this design uses two to six times less computer memory (RAM) than traditional ways of building and using MALT databases; see Supplementary Figure 6. (A minimal sketch of this filtering step follows the comments below.)

    The genome-grist tool does something similar but is built around sourmash gather and BWA for mapping: https://github.com/dib-lab/genome-grist

    I'm curious what the advantage of mapping with an LCA algorithm is here. We've found LCA methods to lead to higher false positives for genome identification (see here: https://www.biorxiv.org/content/10.1101/2022.01.11.475838v2). It would be really helpful to have this explained a bit more.
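
    To make the project-specific database step in the excerpt above concrete, here is a minimal sketch of the depth/breadth filtering it describes. It is hypothetical rather than aMeta's actual code: it assumes the standard KrakenUniq report columns (%, reads, taxReads, kmers, dup, cov, taxID, rank, taxName), and the thresholds and file names are made up.

    ```python
    """Hypothetical sketch: depth/breadth filtering of KrakenUniq reports.

    Not aMeta's actual code; paths and thresholds are illustrative
    assumptions, and aMeta's real filtering rules may differ.
    """
    import csv

    MIN_READS = 200    # depth proxy: reads assigned to the taxon (illustrative)
    MIN_KMERS = 1000   # breadth proxy: unique k-mers observed (illustrative)

    def passing_species(report_path):
        """Yield taxIDs of species-level hits that pass both filters."""
        with open(report_path) as handle:
            # Skip the comment lines KrakenUniq writes before the header row.
            rows = (line for line in handle if not line.startswith("#"))
            for row in csv.DictReader(rows, delimiter="\t"):
                if (row["rank"] == "species"
                        and int(row["reads"]) >= MIN_READS
                        and int(row["kmers"]) >= MIN_KMERS):
                    yield row["taxID"]

    # Union across all samples: one shared species list means the MALT
    # database is built once, from a small fraction of the full reference set.
    tax_ids = set()
    for report in ["sample1.report", "sample2.report"]:  # hypothetical paths
        tax_ids.update(passing_species(report))

    with open("project_species.taxids", "w") as out:
        out.writelines(t + "\n" for t in sorted(tax_ids))
    ```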

  2. Figure 2 schematically demonstrates why detecting microbial organisms solely based on depth of coverage (or simply coverage), which is largely equivalent to the number of mapped reads, might lead to false-positive identifications. Suppose we have a toy reference genome of length 4L and four reads of length L mapping to that genome. When a microbe is truly detected, the reads should map evenly across the reference genome, see Figure 2B. In contrast, in the case of misaligned reads, i.e. when reads originating from species A map to the reference genome of species B, it is common to observe “piles” of reads aligned to a few conserved regions of the reference genome, as in Figure 2A (see also Supplementary Figure 1 for a real-data example, where reads from unknown microbial organisms are forced to map to the Yersinia pestis reference genome alone). (A breadth-of-coverage sketch follows the comment below.)

    This is a really clear explanation!
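
    The depth-versus-evenness distinction in the excerpt above is easy to quantify as breadth of coverage. The sketch below is illustrative only (it assumes pysam is installed; the BAM path and contig accession are made-up examples): four reads piled on one conserved region can give the same mean depth as four evenly spread reads, but a much lower breadth.

    ```python
    """Hypothetical sketch: depth vs. breadth of coverage from an indexed BAM."""
    import pysam

    def depth_and_breadth(bam_path, contig):
        """Return (mean depth, fraction of positions covered) for one contig."""
        with pysam.AlignmentFile(bam_path, "rb") as bam:
            length = bam.get_reference_length(contig)
            # count_coverage returns four per-base arrays, one per nucleotide;
            # summing across them gives the total depth at each position.
            per_base = [sum(col) for col in zip(*bam.count_coverage(contig))]
        mean_depth = sum(per_base) / length
        breadth = sum(1 for d in per_base if d > 0) / length
        return mean_depth, breadth

    depth, breadth = depth_and_breadth("sample.sorted.bam", "NC_003143.1")
    print(f"mean depth {depth:.2f}x, breadth {breadth:.1%}")
    # A genuine hit should show breadth consistent with its depth; "piles"
    # on conserved regions show up as decent depth but low breadth.
    ```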

  3. When comparing different KrakenUniq databases, we found that database size played an important role for robust microbial identification, see Supplementary Information S1. Specifically, small databases tended to have higher false-positive and false-negative rates, for two reasons. First, microbes present in a sample whose reference genomes were not included in the KrakenUniq database could obviously not be identified, hence the high false-negative rate of smaller databases. Second, microbes in the database that were genetically similar to the ones in a sample appeared to be erroneously identified more often, which contributed to the high false-positive rate of smaller databases. For more details see Supplementary Information S1.

    This matches intuition, but is super useful to have it written out so clearly.

  4. Another popular general-purpose aDNA pipeline, nf-core/eager [32], implements HOPS as an ancient-microbiome profiling module within the pipeline; therefore we do not specifically compare our workflow with nf-core/eager but concentrate on differences between aMeta and HOPS.

    Would it be possible to contribute parts of your pipeline back to the nf-core/eager pipeline? I think it could give it more visibility and be more reproducible than your Snakemake pipeline, given the use of Docker containers etc. Your approach seems very nifty and it would be great to make it more available! Parts of this might be simplified by work being done on the nf-core/taxprofiler workflow, which I think is building modules for KrakenUniq.

  5. The workflow is publicly available at https://github.com/NBISweden/aMeta.

    I took a look at your repository and it was very easy to navigate and follow! I have a couple of suggestions, though. Would you be able to pin the versions of the software tools recorded in your envs/*.yaml files? This would make your pipeline more reproducible.

    It would also be helpful if you could parameterize your output directory. A lot of clusters separate write directories from run directories, so this would help with portability.

    Lastly, I'm curious if k-mer trimming would improve your results. khmer's trim-low-abund.py can trim samples with variable coverage using the -V flag if you're interested in exploring that further!
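
    The last two suggestions above are straightforward to sketch in Snakemake. The hypothetical fragment below (the rule, paths, and config key are made up, not taken from aMeta) shows an output root read from the config, plus a trimming rule using khmer's trim-low-abund.py with the -V flag mentioned in the comment.

    ```python
    # Hypothetical Snakefile fragment, not aMeta's actual rules.
    configfile: "config/config.yaml"

    # Run e.g. `snakemake --config outdir=/scratch/run1` on clusters that
    # separate write directories from run directories.
    OUTDIR = config.get("outdir", "results")

    rule trim_low_abundance_kmers:
        input:
            "data/{sample}.fastq.gz",
        output:
            f"{OUTDIR}/trimmed/{{sample}}.abundtrim.fastq",
        shell:
            # -V enables variable-coverage trimming for metagenomic samples;
            # -o writes the trimmed reads to the requested path.
            "trim-low-abund.py -V -o {output} {input}"
    ```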
