aMeta: an accurate and memory-efficient ancient metagenomic profiling workflow

Abstract

Analysis of microbial data from archaeological samples is a growing field with great potential for understanding ancient environments, lifestyles, and diseases. However, high error rates have been a challenge in ancient metagenomics, and the availability of computational frameworks that meet the demands of the field is limited. Here, we propose aMeta, an accurate metagenomic profiling workflow for ancient DNA designed to minimize the number of false discoveries and computer memory requirements. Using simulated data, we benchmark aMeta against a current state-of-the-art workflow and demonstrate its superiority in microbial detection and authentication, as well as substantially lower usage of computer memory.

Article activity feed

  1. To validate the results from the KrakenUniq pre-screening step and further eliminate potential false-positive microbial identifications, aMeta performs an alignment with the Lowest Common Ancestor (LCA) algorithm implemented in MALT [20]. Alternatively, aMeta users can select Bowtie2 for a faster and more memory-efficient analysis, though without LCA alignments; see Supplementary Information S2. While more suitable than Bowtie2 for metagenomic profiling, MALT is very resource-demanding. In practice, only reference databases of limited size are affordable when performing analysis with MALT, which can compromise the accuracy of microbial detection; for more details see Supplementary Information S3. Consequently, we aim to combine the unique capacity of KrakenUniq to work with large databases with the advantages of MALT for validating results via an LCA alignment. For this purpose, aMeta automatically builds a project-specific MALT database based on a filtered list of the microbial species identified by KrakenUniq. In other words, the combined set of microbes across samples that remain after depth- and breadth-of-coverage filtering of the KrakenUniq outputs is used to build a MALT database, which allows LCA-based MALT alignments to be run with realistic computational resources. We found that this design uses two to six times less computer memory (RAM) than traditional ways of building and using MALT databases; see Supplementary Figure 6. (A minimal sketch of this filtering step follows the comments below.)

    The genome-grist tool does something similar but is built around sourmash gather and BWA for mapping: https://github.com/dib-lab/genome-grist

    I'm curious what the advantage of mapping with an LCA algorithm is here. We've found LCA methods to lead to higher false positives for genome identification (see here: https://www.biorxiv.org/content/10.1101/2022.01.11.475838v2). It would be really helpful to have this explained a bit more.
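
    To make the project-specific database step in the excerpt above concrete, here is a minimal sketch of the depth/breadth filtering it describes. It is hypothetical rather than aMeta's actual code: it assumes the standard KrakenUniq report columns (%, reads, taxReads, kmers, dup, cov, taxID, rank, taxName), and the thresholds and file names are made up.

    ```python
    """Hypothetical sketch: depth/breadth filtering of KrakenUniq reports.

    Not aMeta's actual code; paths and thresholds are illustrative
    assumptions, and aMeta's real filtering rules may differ.
    """
    import csv

    MIN_READS = 200    # depth proxy: reads assigned to the taxon (illustrative)
    MIN_KMERS = 1000   # breadth proxy: unique k-mers observed (illustrative)

    def passing_species(report_path):
        """Yield taxIDs of species-level hits that pass both filters."""
        with open(report_path) as handle:
            # Skip the comment lines KrakenUniq writes before the header row.
            rows = (line for line in handle if not line.startswith("#"))
            for row in csv.DictReader(rows, delimiter="\t"):
                if (row["rank"] == "species"
                        and int(row["reads"]) >= MIN_READS
                        and int(row["kmers"]) >= MIN_KMERS):
                    yield row["taxID"]

    # Union across all samples: one shared species list means the MALT
    # database is built once, from a small fraction of the full reference set.
    tax_ids = set()
    for report in ["sample1.report", "sample2.report"]:  # hypothetical paths
        tax_ids.update(passing_species(report))

    with open("project_species.taxids", "w") as out:
        out.writelines(t + "\n" for t in sorted(tax_ids))
    ```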

  2. Figure 2 schematically demonstrates why detecting microbial organisms solely based on depth of coverage (or simply coverage), which is largely equivalent to the number of mapped reads, might lead to false-positive identifications. Suppose we have a toy reference genome of length 4L and four reads of length L mapping to that genome. When a microbe is truly detected, the reads should map evenly across the reference genome, see Figure 2B. In contrast, in the case of misaligned reads, i.e. when reads originating from species A map to the reference genome of species B, it is common to observe “piles” of reads aligned to a few conserved regions of the reference genome, as in Figure 2A (see also Supplementary Figure 1 for a real-data example, where reads from unknown microbial organisms are forced to map to the Yersinia pestis reference genome alone). (A breadth-of-coverage sketch follows the comment below.)

    This is a really clear explanation!
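
    The depth-versus-evenness distinction in the excerpt above is easy to quantify as breadth of coverage. The sketch below is illustrative only (it assumes pysam is installed; the BAM path and contig accession are made-up examples): four reads piled on one conserved region can give the same mean depth as four evenly spread reads, but a much lower breadth.

    ```python
    """Hypothetical sketch: depth vs. breadth of coverage from an indexed BAM."""
    import pysam

    def depth_and_breadth(bam_path, contig):
        """Return (mean depth, fraction of positions covered) for one contig."""
        with pysam.AlignmentFile(bam_path, "rb") as bam:
            length = bam.get_reference_length(contig)
            # count_coverage returns four per-base arrays, one per nucleotide;
            # summing across them gives the total depth at each position.
            per_base = [sum(col) for col in zip(*bam.count_coverage(contig))]
        mean_depth = sum(per_base) / length
        breadth = sum(1 for d in per_base if d > 0) / length
        return mean_depth, breadth

    depth, breadth = depth_and_breadth("sample.sorted.bam", "NC_003143.1")
    print(f"mean depth {depth:.2f}x, breadth {breadth:.1%}")
    # A genuine hit should show breadth consistent with its depth; "piles"
    # on conserved regions show up as decent depth but low breadth.
    ```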

  3. When comparing different KrakenUniq databases, we found that database size played an important role for robust microbial identification, see Supplementary Information S1. Specifically, small databases tended to have higher false-positive and false-negative rates, for two reasons. First, microbes present in a sample whose reference genomes were not included in the KrakenUniq database could obviously not be identified, hence the high false-negative rate of smaller databases. Second, microbes in the database that were genetically similar to the ones in a sample appeared to be erroneously identified more often, which contributed to the high false-positive rate of smaller databases. For more details see Supplementary Information S1.

    This matches intuition, but is super useful to have it written out so clearly.

  4. Another popular general-purpose aDNA pipeline, nf-core/eager [32], implements HOPS as an ancient-microbiome profiling module within the pipeline; therefore we do not specifically compare our workflow with nf-core/eager but concentrate on differences between aMeta and HOPS.

    Would it be possible to contribute parts of your pipeline back to the nf-core/eager pipeline? I think it could give it more visibility and be more reproducible than your Snakemake pipeline, given the use of Docker containers etc. Your approach seems very nifty and it would be great to make it more available! Parts of this might be simplified by work being done on the nf-core/taxprofiler workflow, which I think is building modules for KrakenUniq.

  5. The workflow is publicly available at https://github.com/NBISweden/aMeta.

    I took a look at your repository and it was very easy to navigate and follow! I have a couple of suggestions, though. Would you be able to pin the versions of the software tools recorded in your envs/*.yaml files? This would make your pipeline more reproducible.

    It would also be helpful if you could parameterize your output directory. A lot of clusters separate write directories from run directories, so this would help with portability.

    Lastly, I'm curious if k-mer trimming would improve your results. khmer's trim-low-abund.py can trim samples with variable coverage using the -V flag if you're interested in exploring that further!
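
    The last two suggestions above are straightforward to sketch in Snakemake. The hypothetical fragment below (the rule, paths, and config key are made up, not taken from aMeta) shows an output root read from the config, plus a trimming rule using khmer's trim-low-abund.py with the -V flag mentioned in the comment.

    ```python
    # Hypothetical Snakefile fragment, not aMeta's actual rules.
    configfile: "config/config.yaml"

    # Run e.g. `snakemake --config outdir=/scratch/run1` on clusters that
    # separate write directories from run directories.
    OUTDIR = config.get("outdir", "results")

    rule trim_low_abundance_kmers:
        input:
            "data/{sample}.fastq.gz",
        output:
            f"{OUTDIR}/trimmed/{{sample}}.abundtrim.fastq",
        shell:
            # -V enables variable-coverage trimming for metagenomic samples;
            # -o writes the trimmed reads to the requested path.
            "trim-low-abund.py -V -o {output} {input}"
    ```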
