Evaluation of genetic demultiplexing of single-cell sequencing data from model species
This article has been Reviewed by the following groups
- Evaluated articles (Review Commons)
Abstract
Single-cell sequencing (sc-seq) provides a species agnostic tool to study cellular processes. However, these technologies are expensive and require sufficient cell quantities and biological replicates to avoid artifactual results. An option to address these problems is pooling cells from multiple individuals into one sc-seq library. In humans, genotype-based computational separation (i.e., demultiplexing) of pooled sc-seq samples is common. This approach would be instrumental for studying non-isogenic model organisms. We set out to determine whether genotype-based demultiplexing could be more broadly applied among species ranging from zebrafish to non-human primates. Using such non-isogenic species, we benchmark genotype-based demultiplexing of pooled sc-seq datasets against various ground truths. We demonstrate that genotype-based demultiplexing of pooled sc-seq samples can be used with confidence in several non-isogenic model organisms and uncover limitations of this method. Importantly, the only genomic resource required for this approach is sc-seq data and a de novo transcriptome. The incorporation of pooling into sc-seq study designs will decrease cost while simultaneously increasing the reproducibility and experimental options in non-isogenic model organisms.
Article activity feed
Note: This rebuttal was posted by the corresponding author to Review Commons. Content has not been altered except for formatting.
Learn more at Review Commons
Reply to the reviewers
Official Revision Plan Document:
Manuscript number: #RC-2022-01681
Corresponding author(s): Nicholas, Leigh
1. General Statements
We sincerely appreciate these positive and helpful reviews. We are grateful for the constructive comments and we outline our responses below. Addressing these comments will further broaden the impact of the work and increase the power, reliability, and application of single cell approaches while decreasing the cost and labor intensive collection steps.
As single cell sequencing approaches have entered the mainstream, we are still finding flaws and artifacts from these methods. A major limitation of widely used collection approaches is a difficulty in obtaining biological replicates, which are required to generate robust sequencing datasets. In general, a lack of biological replicates has been a major oversight in the vast majority of single cell studies, and any technique that can facilitate biological replicate collection should be widely applied. The elegance of SNP-based demultiplexing lies in the fact that it can be applied regardless of any external label, applied to previously collected data, and the data are already collected for every sample sequenced. We were pleased to have the reviewers agree and identify the many conceptual advances in this manuscript, with one major critique being noted by one reviewer as a lack of novelty.
Regarding the lack of novelty, we appreciate that SNP-based demultiplexing was not developed as a method within this manuscript, but we disagree that a broad benchmarking and validation study that opens the door to the use of SNP-based demuxing in any species with sufficient between-animal genetic heterogeneity lacks novelty. To address this concern, we will now further emphasize the drawbacks and artifacts that can arise in the currently common practice of pooling samples and choosing not to demultiplex, while improving our explanation of our discoveries in this manuscript. The lack of biological replicates in single cell sequencing studies is rampant and needs to be addressed with approaches such as those demonstrated here. We also want to emphasize the importance of validating and benchmarking bioinformatic approaches with orthogonal, previously established approaches (e.g., wet-lab based methods), which had not previously been done for SNP-based demultiplexing outside of human samples. The inbred nature of common lab animals and the broad range in quality and availability of genomic resources make this a major step forward in bringing SNP-demultiplexing to all labs. We believe that our paper broadly extends, benchmarks and, most importantly, validates the advantages and limitations of SNP-based demuxing across various species.
2. Description of the planned revisions
Reviewer #1 (Evidence, reproducibility and clarity (Required)):
“Cardiello et al tested if souporcell (https://pubmed.ncbi.nlm.nih.gov/32366989/) can be used to demultiplex samples for some model organisms, based on identified SNPs. For this, they used synthetic multiplexed data, publicly available datasets and some new datasets, spanning samples from five model organisms. Their analysis indicates that souporcell could be used to demultiplex scRNA-seq experiments for multiple species, which offers a cost-beneficent approach.
The manuscript reads well and shows this approach can work for different model organisms. However, unfortunately, I am confused about the amount of novelty in this manuscript. The method, souporcell, is already published. The authors indicate souporcell is not validated in non-human samples, but the original paper states that their method works with malaria parasite data (Fig 3b, FigS4). Adapting and using an available tool for different model organisms is good and groups working on different model organisms may find this manuscript useful, but the same could be said for the original article. Due to these reasons, I am not sure whether this manuscript has novelty sufficient for publication.”
__Our response:__ We appreciate this constructive criticism that helped us realize that our novelty was not clearly stated in the first version of the manuscript. We need to improve our Introduction and our verbiage as to what has been previously performed and how this current manuscript provides novel insight into multiple previously unanswered questions which broadly extend the utility of SNP-based demultiplexing. To address this comment, we will revamp our Introduction, Results, and Discussion to more clearly highlight the novelty of this work.
__Planned revisions:__
Defining “validation”. We define validation as establishing the accuracy or validity of a method. Therefore, validation of SNP-based demultiplexing for use in non-human species requires comparison to an already proven, orthogonal method, such as a wet-lab based demultiplexing approach. The souporcell paper does not validate (i.e., confirm with an orthogonal wet-lab method) the results from souporcell in any species but humans. This lack of validation for SNP-based demultiplexing in samples from non-human species made it unclear how and if these approaches would work in other species. Human samples are expected to perform exceptionally well in this approach due to their extremely high genetic diversity and wealth of available genomic resources. Thus, while it was exciting that the original souporcell authors chose to try applying their algorithm to a non-human (e.g., malarial parasite) dataset, the paper left many unanswered questions about potential uses and accuracy. In addition to validating the accuracy of souporcell results in many species, we demonstrated that souporcell shows a relatively poor ability to call doublets in many non-human vertebrates. In addition to highlighting a novel drawback of the method, this demonstrates the need to validate the accuracy of different aspects of tools like souporcell when applied to new systems rather than use souporcell or other SNP-demuxers prior to validation. 
Highlighting other novel findings in this work. These include our assessment of which genomic resources are required for using SNP-based demultiplexing in different species; whether the approach can be applied to lab animals likely to be inbred to various degrees (and, to address other reviewers' comments, the level of inbreeding permitted); the accuracy of SNP-demultiplexing in species with alignment references of varying quality (including only a de novo transcriptome) and genomes of varying sizes (up to 30 Gb, 10 times larger than that of human, which can be extremely computationally intensive); and the exploration of pooling and demultiplexing of multiple species in a single library. We will also make clear how we made the necessary adjustments to the original souporcell pipeline to successfully apply it to datasets with the various resources available in these species.
(Reviewer #1): I also wrote down two minor points below:
“1- Doublets assigned by souporcell compared to the fluor-based assignment look random. In Fig 2 doublet recovery rate looks smaller, and in fig 3 doublet rate prediction looks more random. This is a bit confusing. Is there any explanation for this?”
__Our response:__ We agree, and we noted in the manuscript that the detection of doublets in these datasets by souporcell is not very reliable.
Planned revisions:
We will expand our Discussion to include brief hypotheses for factors that likely contributed to poor doublet detection by souporcell in these analyses. In the Discussion we will clearly suggest complementary approaches for improving the detection and removal of doublets in pooled scRNA-seq experiments by applying external gene expression-based doublet detection programs. We will also attempt to use these programs on at least one of our datasets to see how well independent doublet detection methods complement souporcell on pooled datasets. A full benchmarking of these doublet detection methods already exists and will be referenced in our Discussion.
Reviewer #1: “2- The authors discussed the immune system cells might show some variability in their discussion (referring to fig 3), but this is not clearly shown in the figures as data. Having a percentage bar graph could make it clearer for the readers.”
__Our response:__ This is a valid point that we plan to address with the addition of a new figure as well as some clarifications in the text.
Planned revisions:
We will make a supplemental figure for Figure 3 in which we clearly demonstrate animal-to-animal variability (a bar plot of the absolute number of cells from each individual animal in each cell cluster, as requested). In the new supplemental figure we will also include a new UMAP plot of fluorescently assigned cell identities belonging to only one of the three animals, which makes it easier to visualize the difference in the number of cells from each animal in each cell cluster. We will also cite papers that have already demonstrated the phenomenon of animal-to-animal variability in scRNA-seq datasets. We will further emphasize that, even in the absence of animal-to-animal variability in co-clustering, demultiplexing pooled datasets is important because differential expression analysis is greatly enhanced with biological replicates.
__Reviewer #2__ (Evidence, reproducibility and clarity (Required)):
Major comments:
“1. SNP-based demultiplexing performed well on some species, such as zebrafish and Africa green monkey, from which over 90% of the cells analyzed were correctly identified. However, this accuracy decreases in Pleurodeles samples when a common SNPs VCF file is absent (Fig.3). It showed that cell identity can be more precisely defined with the increase of average read depth (Fig 3B). So, I am wondering whether the mis-defined cells shown in Fig. 3E, actually are cells with lower reads. It is better if the authors can test such a correlation between the cell identity and the depth of reads using the data from Fig. 3E.”
__Our response:__ We are thankful to reviewer #2 for raising such a great point. We do see the accuracy of the benchmarking results for this experiment increase with increasing sequence depth/cell quality. However, the reasons for this are potentially more complex than just higher accuracy of souporcell in higher quality cells: the fluorescent-based demultiplexing that is being used as “ground truth” in benchmarking souporcell for this figure is itself more accurate in cells with higher read depth, because more fluorescent gene reads are likely to be captured. Therefore, analyzing the accuracy of souporcell relative to fluorescent-based demultiplexing over varying read depths can be confusing, because it is possible that both methods improve in accuracy with higher read depth. Figure 3B attempts to illustrate this concept, and to demonstrate why we chose to benchmark only the cells with sufficient read depth (read depth between 5K and 40K, and >1 fluorescent gene read per cell). We plan to complement our manuscript with additional figures and text that will make this clearer.
Planned revisions:
We will produce a plot similar to Figure 3B, but with a Y axis that is the percent agreement between the two methods. For Figure 4 we will also make a plot showing percent agreement between demux methods versus read depth. This plot will be a useful comparison to investigate whether scRNA read depth is directly affecting the quality of souporcell’s SNP-based demux results. Plotting this comparison for a dataset in which Cellplex/Cell hashing is the benchmarking demux method is a fairer test of the effect of sequencing depth on the souporcell demux results, because Cellplex results rely on reads from the Cellplex library, which is a sequencing library independent of the scRNA reads. We will also investigate whether the use of a common VCF file, or lack thereof, affects souporcell accuracy. To test this, we will repeat the souporcell demux of one dataset with and without a common VCF file as input to see whether including the VCF file affects the accuracy of souporcell results.
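The planned comparison of percent agreement against read depth could be computed along the following lines. This is a minimal sketch and not the authors' analysis code: the function name `agreement_by_depth`, the input layout (one tuple per cell barcode), and the bin edges are assumptions made for illustration. It also presumes that souporcell's cluster labels have already been mapped to the individual names used by the benchmarking method.

```python
from collections import defaultdict


def agreement_by_depth(cells, bin_edges):
    """Percent agreement between two demultiplexing methods per read-depth bin.

    cells: iterable of (read_depth, benchmark_label, souporcell_label) tuples
           (hypothetical layout, one tuple per cell barcode).
    bin_edges: ascending depth boundaries; bins are half-open [low, high).
    Returns {(low, high): percent_agreement} for bins holding at least one cell.
    """
    bins = list(zip(bin_edges[:-1], bin_edges[1:]))
    counts = defaultdict(lambda: [0, 0])  # bin -> [agreeing cells, total cells]
    for depth, benchmark, called in cells:
        for low, high in bins:
            if low <= depth < high:
                counts[(low, high)][1] += 1
                if benchmark == called:
                    counts[(low, high)][0] += 1
                break  # each cell falls in at most one bin
    return {b: 100.0 * agree / total for b, (agree, total) in counts.items()}


# Toy example: the cell at depth 45000 lies outside every bin and is ignored.
example_cells = [
    (6000, "animal_A", "animal_A"),
    (7000, "animal_B", "animal_A"),
    (15000, "animal_A", "animal_A"),
    (30000, "animal_B", "animal_B"),
    (45000, "animal_A", "animal_B"),
]
result = agreement_by_depth(example_cells, [5000, 10000, 20000, 40000])
print(result)
# {(5000, 10000): 50.0, (10000, 20000): 100.0, (20000, 40000): 100.0}
```

Because the benchmark labels themselves may become more reliable at higher depth, a rising curve from this kind of calculation does not by itself show that souporcell improves with depth.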
Reviewer #2:
“2. Please discuss limitations of this approach in the manuscript. (1) To which extent, when SNPs are roughly present in the individuals of same species, SNP-based demultiplexing can be applied, e.g., individuals from an inbred strain (c57bl6 mice) would not work.(2) The authors experimentally tested two newt species using SNP-based demultiplexing. When multiple species are experimentally applied, may the cell/nuclei size variation cause problem?”
__Our response:__ We agree with Reviewer 2 that this paper brings up many technical questions about the limits to which SNP-based demultiplexing will succeed. These limitations should be addressed more thoroughly in our Discussion section.
Planned revisions:
We will expand our Discussion to more fully address the predicted limits of SNP-based demuxing for separating pooled cells from genetically similar individuals. We referenced the single previously published paper which reported that Freemuxlet, a similar approach to souporcell, did not succeed when applied to cells pooled from multiple animals within an inbred mouse strain, but did succeed across mouse strains (though without any validation of results). We will expand this Discussion to address the expected effects of genetic diversity on the success of SNP-based demultiplexing methods. We will also note in this expanded Discussion that SNP-based demuxing worked in this paper on siblings (some of the Xenopus, some of the zebrafish), and that other SNP-based demuxers have been used successfully for demuxing cells from closely related individuals, including human siblings (scSplit) and human maternal/fetal pairs (souporcell). We will expand our Discussion to address the potential drawbacks of pooling cells from different species or tissue types, including the possibility of a bias in scRNA-seq sample preparation methods. We expect that variations in cell or nuclei sizes between species could cause biases in cell capture depending on the scRNA-seq library preparation method, especially with microfluidic-based preparation methods. We will search for a dataset that would allow for synthetic pooling of inbred mouse data and, if one is available, put it through our synthetic pooling and demuxing pipeline. While other papers have reported that this does not work with other SNP demux tools, and comments on the souporcell GitHub (https://github.com/wheaton5/souporcell/issues/154) suggest that it does not work there either, we feel this would be a nice test and reference for showing the limitations of SNP-based demuxing in highly genetically similar individuals.
(Reviewer #2)
“3. What is the upper limit number of samples when using this model. Please make some estimation or discussion about it.”
__Our response:__ We think this is a pressing question for the future of SNP-based demuxing and deserves further discussion in this manuscript. This is directly addressed by the authors of souporcell in a GitHub thread with regard to human samples (worked on 21 human samples, may work in up to 40). At this point, we have no reason to believe that the limit on sample numbers should be different in other species.
Planned revisions:
We will include discussion about potential limits for the maximum number of samples that can be pooled and demuxed using this approach. As discussed below in response to reviewer 3, we will quantify the genetic differences in pooled datasets in this manuscript in order to give readers an improved prediction of how well SNP-based demuxers are likely to work on their animals of interest. We will look for a previously published pooled dataset from zebrafish that includes multiple dozens of samples and attempt to SNP-demultiplex this pool. While we will be unable to validate the accuracy, given how well SNP-based demuxing has performed we can at least determine whether cell origins are assigned.
Reviewer #2: Minor comments:
“1. Please add an algorithm principle of this model.”
__Our response:__ Thanks for the suggestion; we will do so.
Planned revision:
We will direct readers to the algorithm principle of souporcell in the original paper and include a flowchart of our workflow for running souporcell piece by piece as we have done in the manuscript. As mentioned above, we will make clear how we made the necessary adjustments to the original souporcell pipeline to successfully apply it to datasets with various resources available in these species.
Reviewer #2:
“2. Give a clear definition of doublets including the ground truth and Souporcell result.”
__Our response:__ We appreciate this recommendation. For the purposes of this paper our definition of a ‘doublet’ is a dataset represented by a single cell barcode that actually contains more than one cell. However, true doublets can be identified with absolute certainty only in our synthetically pooled datasets, because no demultiplexing approach used for benchmarking is 100% accurate. Therefore, ‘true doublet’ will refer to known doublets based on synthetically pooled dataset ground truths. Further, for our experimental datasets we will also use ‘confirmed doublet’ to refer to cells that were called doublets by both the ground truth and souporcell. And we will use ‘contested doublet’ to refer to cells in which the experimentally derived ground truth and souporcell result disagree about a potential doublet.
Planned revision:
We will insert a clear definition of doublets used in this paper as described above, including the complexity in identifying which doublets are real given the relationship between ground truth and the souporcell results for each experiment.
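The three doublet terms defined in the response above can be written as a small decision rule. This is only an illustrative sketch: the function name `doublet_category` and its boolean inputs are hypothetical, and the two fallback labels ("not a true doublet" and "singlet") are placeholders that do not come from the response itself.

```python
def doublet_category(benchmark_doublet, souporcell_doublet, synthetic):
    """Label one cell barcode using the doublet terms from the response.

    benchmark_doublet: True if the ground truth / benchmarking method marks a doublet.
    souporcell_doublet: True if souporcell calls the barcode a doublet.
    synthetic: True for synthetically pooled data, where the ground truth is exact.
    """
    if synthetic:
        # Only synthetic pools give certainty, so 'true doublet' is reserved for them.
        return "true doublet" if benchmark_doublet else "not a true doublet"
    if benchmark_doublet and souporcell_doublet:
        return "confirmed doublet"  # both methods agree on a doublet
    if benchmark_doublet != souporcell_doublet:
        return "contested doublet"  # the two methods disagree
    return "singlet"  # placeholder label: neither method calls a doublet


print(doublet_category(True, False, synthetic=True))    # true doublet
print(doublet_category(True, True, synthetic=False))    # confirmed doublet
print(doublet_category(False, True, synthetic=False))   # contested doublet
```

Keeping the synthetic and experimental cases separate mirrors the point that an experimentally derived ground truth is itself imperfect, so disagreement there cannot be attributed to either method.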
Reviewer #2:
“3. Authors should indicate the time cost of running one round of such analysis, the minimal computational requirements?”
__Our response:__ This is an important point and will be helpful to readers.
__Planned revision:__
We will add to the manuscript information on the required time, RAM consumption, and computational requirements for running various setups for souporcell.
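One generic way to collect wall-clock time and peak memory for a command-line step is sketched below. It is not taken from the manuscript: the helper name `profile_command` is hypothetical, the profiled command is a stand-in Python one-liner rather than a real souporcell invocation, and the `resource` module is available only on Unix-like systems.

```python
import resource
import subprocess
import sys
import time


def profile_command(cmd):
    """Run cmd and return (wall_seconds, peak_child_rss, return_code).

    peak_child_rss is ru_maxrss for waited-for child processes: the largest
    resident set size of any child so far, in KiB on Linux (bytes on macOS).
    """
    start = time.perf_counter()
    completed = subprocess.run(cmd, check=False)
    wall_seconds = time.perf_counter() - start
    peak_child_rss = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    return wall_seconds, peak_child_rss, completed.returncode


# Stand-in workload; substitute the actual pipeline command being measured.
placeholder_cmd = [sys.executable, "-c", "x = list(range(100000))"]
wall, peak, code = profile_command(placeholder_cmd)
print(f"wall time: {wall:.2f} s, peak child RSS: {peak}, exit code: {code}")
```

Because the peak value is a running maximum over all child processes, each pipeline step is best profiled in a fresh Python process if per-step numbers are needed.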
__Reviewer #3:__ Major comments:
“The manuscript makes a convincing case for the ability of a preexisting SNP-based demultiplexing tool, called souporcell, to demultiplex pooled samples. The study uses three methods for validation: 1. In silico data pooling; 2. Pooling of transgenic lines; 3. Pooling of cells tagged with CMOs (10x genomics). The results are consistent across experiments.
The authors propose that souporcell is a solution for demultiplexing pooled samples whenever sample tagging methods are not feasible. Although the authors test this approach in several species and conditions, the validation does not cover all possible cases and situations, obviously. Indeed, the authors recommend potential users to run pilot validation experiments with a secondary demultiplexing methods.
However, the manuscript would become more useful if the following points are addressed:
First, what is the genetic relatedness of the individuals pooled in the experiments? What is the SNP frequency in the samples analyzed, and how does that compare to SNP frequency in mouse strains? (The number of SNPs in the VCF is reported in a supplementary table but not discussed in the main text). This point is extremely important: as the authors mention, it is not possible to demultiplex samples from the same mouse strain. Inbreeding is relatively common in laboratory species, even unconventional ones; therefore, information on genetic relatedness and SNP rate would help readers assess whether SNP-based demultiplexing has a good chance to work in their systems. Addressing this point does not require any additional experiments, and computing from the single-cell reads how many SNPs distinguish the individuals pooled here should be straightforward.”
__Our response:__ We appreciate the comments raised by reviewer #3. These are valuable critiques and will greatly improve the manuscript.
__Planned revisions:__
We will expand our Discussion with a paragraph on the limits for genetic differences required for SNP-based demuxing to work, as mentioned in response to Reviewer 2. This will include references to Table 1 values on SNP numbers utilized in each analysis, and hypotheses on the absolute limits for genetic relatedness. We will expand Table 1B to include green monkey. As mentioned in response to Reviewer 2, if previously published data are available, we will also try applying souporcell to data from an inbred mouse line to test an extreme case of applying SNP-based demuxing to data from very inbred animals. We will more clearly annotate the known relationship between individuals in our experiments, and will address this in our Discussion. We will contact the zebrafish and axolotl authors and ask if these animals were siblings. We will identify and apply a method for quantifying the genetic relationship between individuals in each scRNA-seq experiment in this study, to enable us to provide readers with a quantitative measure of the genetic diversity present in each experiment. This analysis should shed some light on the requirements for genetic variability in order for SNP-based demultiplexers to succeed.
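As a rough illustration of what counting the SNPs that distinguish pooled individuals can look like, the sketch below compares per-individual genotype calls pairwise. The response does not name the method that will be used, so this is only one simple proxy under stated assumptions: the function name `distinguishing_snps`, the nested-dictionary input, and the genotype strings are hypothetical, and genotype calls per individual are assumed to be available already.

```python
from itertools import combinations


def distinguishing_snps(genotypes):
    """Count SNP sites at which each pair of individuals differs.

    genotypes: {individual: {site: genotype}} with genotype strings such as
               "0/0", "0/1" or "1/1" (hypothetical layout).
    Only sites genotyped in both individuals of a pair are compared.
    Returns {(individual_a, individual_b): number_of_differing_sites}.
    """
    differences = {}
    for a, b in combinations(sorted(genotypes), 2):
        shared_sites = genotypes[a].keys() & genotypes[b].keys()
        differences[(a, b)] = sum(
            1 for site in shared_sites if genotypes[a][site] != genotypes[b][site]
        )
    return differences


# Toy genotypes for three individuals; site names are made up.
toy_genotypes = {
    "animal_1": {"chr1:100": "0/0", "chr1:250": "0/1", "chr2:40": "1/1"},
    "animal_2": {"chr1:100": "0/1", "chr1:250": "0/1", "chr2:40": "0/0"},
    "animal_3": {"chr1:100": "0/0", "chr1:250": "0/1"},
}
pairwise = distinguishing_snps(toy_genotypes)
print(pairwise)
# {('animal_1', 'animal_2'): 2, ('animal_1', 'animal_3'): 0, ('animal_2', 'animal_3'): 1}
```

In this toy case animal_1 and animal_3 share identical calls at every jointly genotyped site, which is the situation in which SNP-based separation would be expected to struggle.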
Reviewer #3:
“Moreover, the relatively limited number of samples pooled does not validate the use of souporcell with a larger number of samples. For example: in developmental studies, often dozens of embryos are collected and pooled. What are the potential caveats of using souporcell for demultiplexing larger number of samples? The Discussion would be a good place to warn potential users of the limitations of the approach.”
__Our response:__ We agree this could still be a limitation, and for developmental studies with multiple dozens of samples, further exploration of optimal demultiplexing methods or the combination of computational and wet-lab based demux methods may be required.
Planned revision:
We will expand our Discussion on predicted limits for SNP-based demuxing of high sample pools, as discussed in response to Reviewer 2. We agree that developmental projects often involve pooling large numbers of samples, so it is worth clearly outlining the benefits and risks of planning to use SNP-based demultiplexing on such high sample pools, and to outline the limits as discussed by the developer of souporcell. As stated above, we will work to identify a previously published pooled zebrafish dataset with multiple dozens of samples and run souporcell on it. While this will not provide any validation it will at the least determine if we are able to assign cell origins, which have thus far been very reliable when assignments have been made.
Reviewer #3: Minor comments:
“- is the accuracy of doublet detection rate a function of number of samples? This can be tested by repeating the monkey in silico experiment with three individuals.”
__Our response:__ This is a good question. We do not think that the number of samples substantially affects the accuracy of doublet detection by souporcell, but we will test this.
__Planned revision:__
As suggested, we will repeat the monkey analysis with 3 samples to see how this changes doublet detection. Overall, due to the low quality of doublet detection by souporcell found in this manuscript, we will expand our Discussion of doublet detection to propose some potentially useful recommendations for making conservative doublet calls with souporcell and external programs (addressed above in response to Reviewer 1). We expect that the more substantial filtering of the monkey datasets relative to the zebrafish dataset prior to pooling contributed to the differences in doublet detection behind this question. To make these differences more obvious, we will more deliberately emphasize the differences in dataset filtering for each experiment.
3. Description of the revisions that have already been incorporated in the transferred manuscript
4. Description of analyses that authors prefer not to carry out
__From Reviewer 1:__
“More generally, showing more direct evidence for the variability of different cell types (not just the immune system) could be informative for scRNA-seq users.”
__Our response:__ We do not plan to conduct extensive analyses of other published single cell datasets to provide a further reason for why it is important to have biological replicates for single cell experiments. When building this manuscript, we chose not to pursue the option of publishing an analysis of published single cell datasets in which we could identify artifactual results and animal to animal variability, because we worried that this would be harmful to future open science efforts, and therefore, counterproductive. Further, past papers have already demonstrated the issue of batch effects and animal to animal variability in scRNA-seq datasets, and the requirement for biological replicates to facilitate differential expression analysis. As mentioned above, we will do a better job citing the papers that address these points.
Note: This preprint has been reviewed by subject experts for Review Commons. Content has not been altered except for formatting.
Learn more at Review Commons
Referee #3
Evidence, reproducibility and clarity
Summary:
In this manuscript, Cardiello and colleagues address the problem of demultiplexing pooled samples in single-cell RNA sequencing (scRNAseq) experiments. The manuscript benchmarks the use of a preexisting SNP-based demultiplexing tool, called souporcell, in pooled samples from non-conventional laboratory species. The validation includes computational pooling of published data from different individuals (zebrafish, green monkey), and generation of new pooled data with independent ground-truth information available (with frogs and three salamander species). The authors conclude that souporcell is suitable for demultiplexing scRNAseq data collected as pools from different individuals. The authors propose that SNP-based demultiplexing can be used to monitor and correct for batch effects, whenever data need to be collected as pools (for example: small sample sizes, developmental datasets etc).
Major comments:
The manuscript makes a convincing case for the ability of a preexisting SNP-based demultiplexing tool, called souporcell, to demultiplex pooled samples. The study uses three methods for validation: 1. In silico data pooling; 2. Pooling of transgenic lines; 3. Pooling of cells tagged with CMOs (10x genomics). The results are consistent across experiments.
The authors propose that souporcell is a solution for demultiplexing pooled samples whenever sample tagging methods are not feasible. Although the authors test this approach in several species and conditions, the validation does not cover all possible cases and situations, obviously. Indeed, the authors recommend potential users to run pilot validation experiments with a secondary demultiplexing methods.
However, the manuscript would become more useful if the following points are addressed:
First, what is the genetic relatedness of the individuals pooled in the experiments? What is the SNP frequency in the samples analyzed, and how does that compare to SNP frequency in mouse strains? (The number of SNPs in the VCF is reported in a supplementary table but not discussed in the main text). This point is extremely important: as the authors mention, it is not possible to demultiplex samples from the same mouse strain. Inbreeding is relatively common in laboratory species, even unconventional ones; therefore, information on genetic relatedness and SNP rate would help readers assess whether SNP-based demultiplexing has a good chance to work in their systems. Addressing this point does not require any additional experiments, and computing from the single-cell reads how many SNPs distinguish the individuals pooled here should be straightforward.
Moreover, the relatively limited number of samples pooled does not validate the use of souporcell with a larger number of samples. For example: in developmental studies, often dozens of embryos are collected and pooled. What are the potential caveats of using souporcell for demultiplexing larger number of samples? The Discussion would be a good place to warn potential users of the limitations of the approach.
Minor comments:
- is the accuracy of doublet detection rate a function of number of samples? This can be tested by repeating the monkey in silico experiment with three individuals.
Significance
The manuscript presents a technical advance, by validating the use of souporcell for demultiplexing scRNAseq data collected from non-conventional animal species.
The audience potentially interested in this paper is relatively broad. Potential readers include biologists that collect and analyze scRNAseq data from pooled samples, for instance scientists working in the fields of embryonic development and evolutionary developmental biology, but also clinical researchers. The manuscript will be particularly interesting for scientists working on amphibians, because souporcell is validated experimentally in three amphibian species.
-
Note: This preprint has been reviewed by subject experts for Review Commons. Content has not been altered except for formatting.
Learn more at Review Commons
Referee #2
Evidence, reproducibility and clarity
Summary:
Provide a short summary of the findings and key conclusions (including methodology and model system(s) where appropriate). Please place your comments about significance in section 2.
This study provided a SNP-based demultiplexer to facilitate effective experimental design of scRNA-seq. This model used differences in SNPs across species or individuals to trace back the source of cells in scRNA-seq experiments. To benchmark demultiplexing performance, this study analyzed in silico or experimentally pooled scRNA-seq data from species including zebrafish, African green monkeys, Xenopus laevis, axolotl, Pleurodeles waltl, and Notophthalmus viridescens. It demonstrated that highly accurate demultiplexing can be achieved regardless of whether a genome or a common SNP set exists. Overall, this study provided an economical, powerful, and less-biased method for analyzing pooled scRNA-seq data that depends minimally on the availability of genomic resources.
Major comments:
- SNP-based demultiplexing performed well on some species, such as zebrafish and African green monkey, for which over 90% of the cells analyzed were correctly identified. However, this accuracy decreases in Pleurodeles samples when a common SNP VCF file is absent (Fig. 3). The manuscript showed that cell identity can be defined more precisely as average read depth increases (Fig. 3B). I therefore wonder whether the misassigned cells shown in Fig. 3E are actually cells with fewer reads. It would be better if the authors tested for a correlation between cell identity and read depth using the data from Fig. 3E.
- Please discuss the limitations of this approach in the manuscript. (1) To what extent can SNP-based demultiplexing be applied when SNPs are sparse among individuals of the same species? For example, individuals from an inbred strain (C57BL/6 mice) would not work. (2) The authors experimentally tested two newt species using SNP-based demultiplexing. When multiple species are pooled experimentally, could variation in cell or nucleus size cause problems?
- What is the upper limit on the number of samples when using this model? Please provide an estimate or discuss it.
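The read-depth check requested in the first major comment could be approached along the following lines. This is a minimal sketch under stated assumptions: each cell is represented as a tuple of read depth, ground-truth label, and demultiplexing call. The tuple layout, the function name, and the example values are invented for illustration and are not data from the manuscript.

```python
from statistics import median


def depth_by_assignment(cells):
    """Summarize read depth for correctly and incorrectly assigned cells.

    cells: iterable of (read_depth, ground_truth_label, assigned_label).
    """
    cells = list(cells)  # allow generators to be traversed twice
    correct = [depth for depth, truth, call in cells if truth == call]
    wrong = [depth for depth, truth, call in cells if truth != call]
    return {
        "n_correct": len(correct),
        "n_wrong": len(wrong),
        "median_depth_correct": median(correct) if correct else None,
        "median_depth_wrong": median(wrong) if wrong else None,
    }


# Invented toy values, not results from the study.
toy_cells = [
    (12000, "A", "A"),
    (9000, "B", "B"),
    (800, "A", "B"),
    (15000, "B", "B"),
    (1200, "B", "A"),
    (7000, "A", "A"),
]
summary = depth_by_assignment(toy_cells)
print(summary)
```

A clearly lower median depth among misassigned cells would support the reviewer's hypothesis; a formal rank-based test on the two groups would be a natural next step.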
Minor comments:
- Please add a description of the algorithmic principle of this model.
- Give a clear definition of doublets for both the ground truth and the souporcell results.
- The authors should indicate the time cost of running one round of such an analysis and the minimal computational requirements.
Significance
- Accurate demultiplexing of pooled data can reduce batch effects between datasets and lower experimental costs.
- This model will achieve good results in analyzing cell evolution between different species, or between individuals of the same species that carry sufficient SNPs.
- A de novo transcriptome alone is sufficient to run this analysis, which opens the possibility of using pooled scRNA-seq analysis on less-investigated species.
-
Note: This preprint has been reviewed by subject experts for Review Commons. Content has not been altered except for formatting.
Learn more at Review Commons
Referee #1
Evidence, reproducibility and clarity
Cardiello et al. tested whether souporcell (https://pubmed.ncbi.nlm.nih.gov/32366989/) can be used to demultiplex samples from several model organisms based on identified SNPs. For this, they used synthetic multiplexed data, publicly available datasets, and some new datasets, spanning samples from five model organisms. Their analysis indicates that souporcell could be used to demultiplex scRNA-seq experiments for multiple species, which offers a cost-effective approach.
The manuscript reads well and shows that this approach can work for different model organisms. Unfortunately, however, I am unsure about the amount of novelty in this manuscript. The method, souporcell, is already published. The authors indicate that souporcell has not been validated in non-human samples, but the original paper states that the method works with malaria parasite data (Fig 3b, FigS4). Adapting and using an available tool for different model organisms is good, and groups working on different model organisms may find this manuscript useful, but the same could be said for the original article. For these reasons, I am not sure whether this manuscript has sufficient novelty for publication. I also note two minor points below:
- Doublets assigned by souporcell look random when compared to the fluor-based assignment. In Fig 2 the doublet recovery rate looks smaller, and in Fig 3 the doublet prediction looks more random. This is a bit confusing. Is there any explanation for this?
- In their discussion, the authors state that immune system cells might show some variability (referring to Fig 3), but this is not clearly shown as data in the figures. A percentage bar graph could make it clearer for readers. More generally, showing more direct evidence for the variability of different cell types (not just the immune system) could be informative for scRNA-seq users.
Significance
scRNA-Seq is becoming a routine approach to assay gene expression profiling. However, it remains costly. There are new approaches to multiplex and demultiplex samples to decrease the cost. Thus, it is good to see that one available tool works for five different model organisms.
Although it is good to see that an available tool works for five different species, I am not sure about the novelty presented in this manuscript. The technical advances are not clear to this reviewer, as the method is already published. Moreover, this is a technical report manuscript and there is no biological conceptual advance. I am a developmental biologist using single-cell mRNA sequencing; someone more directly from the single-cell field may have further comments on novelty and recommendations for references, and could comment on computational aspects in more detail.
-
