The impact of ambient contamination on demultiplexing methods for single-nucleus multiome experiments
Curation statements for this article:
Curated by eLife
eLife Assessment
This study introduces ambisim, a rigorously validated and well-documented simulation framework that enables the generation of synthetic, genotype-aware single-cell RNA and ATAC sequencing datasets under realistic conditions. The authors provide solid evidence of its utility by benchmarking multiple demultiplexing methods and proposing a new variant consistency metric. While the tool is valuable for guiding method selection, the interpretation of the new metric requires further clarification.
Abstract
Sample multiplexing has become an increasingly common design choice in droplet-based single-nucleus multi-omic sequencing experiments to reduce costs and remove technical variation. Genotype-based demultiplexing is a popular class of methods that was originally developed for single-cell RNA-seq but has not been rigorously benchmarked in other assays, such as snATAC-seq and joint snRNA/snATAC assays, especially in the context of variable ambient RNA/DNA contamination. To address this, we develop ambisim, a genotype-aware read-level simulator that can flexibly control ambient molecule proportions and generate realistic joint snRNA/snATAC data. We use ambisim to evaluate demultiplexing methods across several important parameters: doublet rate, number of multiplexed donors, and coverage level. Our simulations reveal that methods are variably impacted by ambient contamination in both modalities. We then apply the demultiplexing methods to two joint snRNA/snATAC datasets and find highly variable concordance between methods in both modalities. Finally, we develop a new metric, variant consistency, which we show is correlated with cell-level ambient molecule fractions in singlets. Applying our metric to two multiplexed joint snRNA/snATAC datasets reveals variable ambient contamination across experiments and modalities. We conclude that improved modelling of ambient material in demultiplexing algorithms will increase both sensitivity and specificity.
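To make the setting concrete, the following is a toy sketch of how ambient contamination can erode genotype-based donor assignment. It is not ambisim and not any of the benchmarked methods: the function names, the parameter values, the uniform ambient pool, and the simple maximum-likelihood assignment that ignores ambient reads are all illustrative assumptions.

```python
import math
import random


def simulate_genotypes(n_donors, n_variants, rng, alt_freq=0.3):
    """Draw alt-allele dosages (0, 1, or 2) per donor and variant."""
    return [
        [sum(rng.random() < alt_freq for _ in range(2)) for _ in range(n_variants)]
        for _ in range(n_donors)
    ]


def simulate_cell(genotypes, donor, ambient_frac, n_reads, rng):
    """Return (variant, is_alt) reads for one singlet droplet."""
    n_donors = len(genotypes)
    n_variants = len(genotypes[0])
    reads = []
    for _ in range(n_reads):
        v = rng.randrange(n_variants)
        if rng.random() < ambient_frac:
            # Ambient read: assumed to come from a uniformly chosen pooled donor.
            source = rng.randrange(n_donors)
        else:
            source = donor
        p_alt = genotypes[source][v] / 2
        reads.append((v, rng.random() < p_alt))
    return reads


def assign_donor(genotypes, reads, error=0.01):
    """Pick the donor maximizing the read log-likelihood (no ambient term)."""
    best_donor, best_ll = None, -math.inf
    for k, geno in enumerate(genotypes):
        ll = 0.0
        for v, is_alt in reads:
            # Clamp so a single mismatching read cannot give log(0).
            p_alt = min(max(geno[v] / 2, error), 1 - error)
            ll += math.log(p_alt if is_alt else 1 - p_alt)
        if ll > best_ll:
            best_donor, best_ll = k, ll
    return best_donor


def accuracy(ambient_frac, n_cells=200, n_donors=4, n_variants=500, n_reads=30, seed=0):
    """Fraction of simulated singlets assigned to their true donor."""
    rng = random.Random(seed)
    genotypes = simulate_genotypes(n_donors, n_variants, rng)
    correct = 0
    for i in range(n_cells):
        donor = i % n_donors
        reads = simulate_cell(genotypes, donor, ambient_frac, n_reads, rng)
        correct += assign_donor(genotypes, reads) == donor
    return correct / n_cells


if __name__ == "__main__":
    for frac in (0.0, 0.3, 0.6, 0.9):
        print(f"ambient fraction {frac:.1f}: accuracy {accuracy(frac):.2f}")
```

Because the assignment step has no ambient term, reads from other donors count as evidence against the true donor. This is the kind of mismatch that explicit modelling of ambient material would aim to address.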
Article activity feed
Author response:
Reviewer #1 (Public review):
The usefulness of the proposed "variant consistency" metric, and how it can guide users in selecting demultiplexing methods, is somewhat unclear. It correlates with the level of ambient RNA/DNA contamination, which makes it look like a data-quality metric. However, it also depends on the demultiplexing method used, and it is not clear how it connects to the "accuracy" of each method, which is the property users care about most. Since the simulated data include ground-truth donor identities, I would suggest using them to show whether "variant consistency" directly indicates the accuracy of each method, especially the accuracy within those "C2" reads.
I also think the tool and analyses presented in this paper need some further clarification and documentation on the details, such as how the cell-type gene and peak probabilities are determined in the simulation, and how doublets from different cell types are handled in the simulation and analysis. A few analyses and figures also need a more detailed description of the exact methods used.
We thank the reviewer for their suggestions. We plan to revise the manuscript accordingly: we will use the simulations to clarify the variant consistency metric and its relationship to demultiplexing accuracy, and we will add detail on how ambisim generates multiplexed snRNA/snATAC data.
Reviewer #2 (Public review):
(1) Throughout the manuscript, the figure legends are difficult to understand, and this makes it difficult to interpret the graphs.
(2) Since this is both a new tool and a benchmark, it would be worthwhile in the Discussion to comment on which demultiplexing tools one may want to choose for their dataset, especially given the warning against ensemble methods. From this extensive benchmarking, one may want to choose a tool based on the number of donors one has pooled, the modalities present, and perhaps even the ambient RNA (if it has been estimated previously).
(3) What are the minimal computational requirements for running ambisim? What is the time cost?
We thank the reviewer for their suggestions. We plan to update the manuscript to clarify the figure legends. We will also outline concrete recommendations in the Discussion for different multiplexed experimental designs. Finally, we will include additional computational benchmarks for ambisim.
-
Reviewer #1 (Public review):
Summary:
The authors developed a tool for simulating multiplexed single-cell RNA-seq and ATAC-seq data with various adjustable settings, such as ambient RNA/DNA rate and sequencing depth. They used the simulated data with different settings to evaluate the performance of many demultiplexing methods. They also proposed a new metric at the single-cell level that correlates with the RNA/DNA contamination level.
Strengths:
The simulation tool has a straightforward design and provides adjustability in multiple parameters that have practical relevance, such as sequencing depth and ambient contamination rate. With the growing use of multiplexing in single-cell RNA-seq and ATAC-seq experiments, the tools and results in this paper can guide experimental design and tool selection for many researchers. The simulation tool also provides a platform for benchmarking newly developed demultiplexing tools.
Weaknesses:
The usefulness of the proposed "variant consistency" metric, and how it can guide users in selecting demultiplexing methods, is somewhat unclear. It correlates with the level of ambient RNA/DNA contamination, which makes it look like a data-quality metric. However, it also depends on the demultiplexing method used, and it is not clear how it connects to the "accuracy" of each method, which is the property users care about most. Since the simulated data include ground-truth donor identities, I would suggest using them to show whether "variant consistency" directly indicates the accuracy of each method, especially the accuracy within those "C2" reads.
I also think the tool and analyses presented in this paper need some further clarification and documentation on the details, such as how the cell-type gene and peak probabilities are determined in the simulation, and how doublets from different cell types are handled in the simulation and analysis. A few analyses and figures also need a more detailed description of the exact methods used.
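The check suggested above, relating a per-cell metric to accuracy on simulated data with known donors, could be sketched roughly as follows. This is only an illustration: the definition of variant consistency is not given here, so `score` is a hypothetical per-cell value in [0, 1] standing in for it, and the function name and binning scheme are assumptions.

```python
from collections import defaultdict


def accuracy_by_score_bin(records, n_bins=4):
    """Group cells into equal-width score bins on [0, 1] and return accuracy per bin.

    `records` is an iterable of (true_donor, assigned_donor, score) tuples, where
    `score` is a hypothetical per-cell metric in [0, 1].
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for true_donor, assigned_donor, score in records:
        # Scores of exactly 1.0 fall into the last bin.
        b = min(int(score * n_bins), n_bins - 1)
        totals[b] += 1
        hits[b] += true_donor == assigned_donor
    return {b: hits[b] / totals[b] for b in sorted(totals)}


if __name__ == "__main__":
    # Made-up records for illustration only, not results from the paper.
    example = [("A", "A", 0.95), ("A", "B", 0.10), ("B", "B", 0.80), ("B", "B", 0.20)]
    print(accuracy_by_score_bin(example, n_bins=2))
```

If accuracy rose steadily across bins for a given method, the metric would be informative about that method's assignments; a flat profile would suggest it mainly reflects data quality.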
-
Reviewer #2 (Public review):
Li et al. describe ambisim, a tool for creating realistic synthetic single-nucleus RNA/ATAC sequencing datasets. It has become standard to pool multiple genetically distinct donors in single-cell sequencing and then apply genotype-based demultiplexing (i.e., using donor single-nucleotide variants to identify the donor of origin). A plethora of tools exist to accomplish this demultiplexing, but advanced tools to create synthetic datasets, and therefore provide definitive benchmarking, are lacking. Ambisim is a well-thought-out simulator that improves upon previously available tools by modeling variable ambient contamination proportions in a genotype-aware fashion. This yields more realistic synthetic datasets that pose challenging scenarios for future demultiplexing tools. The authors use ambisim to benchmark a large number of available and commonly used genotype-free and genotype-dependent demultiplexing tools, and they identify the strengths and weaknesses of these tools. They also define a new metric, variant consistency, to further assess demultiplexing performance across tools. Overall, this manuscript provides a useful framework for evaluating future demultiplexing tools more thoroughly and a rationale for tool selection depending on a user's experimental conditions.
The authors provide measured conclusions that are supported by their findings. Some aspects are unclear:
(1) Throughout the manuscript, the figure legends are difficult to understand, and this makes it difficult to interpret the graphs.
(2) Since this is both a new tool and a benchmark, it would be worthwhile in the Discussion to comment on which demultiplexing tools one may want to choose for their dataset, especially given the warning against ensemble methods. From this extensive benchmarking, one may want to choose a tool based on the number of donors one has pooled, the modalities present, and perhaps even the ambient RNA (if it has been estimated previously).
(3) What are the minimal computational requirements for running ambisim? What is the time cost?