Multicellular factor analysis of single-cell data for a tissue-centric understanding of disease

Abstract

Biomedical single-cell atlases describe disease at the cellular level. However, analysis of these data commonly focuses on cell-type-centric pairwise cross-condition comparisons, disregarding the multicellular nature of disease processes. Here, we propose multicellular factor analysis for the unsupervised analysis of samples from cross-condition single-cell atlases and the identification of multicellular programs associated with disease. Our strategy, which repurposes group factor analysis as implemented in multi-omics factor analysis, incorporates the variation of patient samples across cell-types or other tissue-centric features, such as cell compositions or spatial relationships, and enables the joint analysis of multiple patient cohorts, facilitating the integration of atlases. We applied our framework to a collection of acute and chronic human heart failure atlases and described multicellular processes of cardiac remodeling, independent of cellular compositions and their local organization, that were conserved in independent spatial and bulk transcriptomics datasets. In sum, our framework serves as an exploratory tool for unsupervised analysis of cross-condition single-cell atlases and allows for the integration of the measurements of patient cohorts across distinct data modalities.

Article activity feed

  1. Note: This rebuttal was posted by the corresponding author to Review Commons. Content has not been altered except for formatting.


    Reply to the reviewers

    General Statements

    Thank you for providing an initial assessment of our manuscript. We went through all the raised comments and suggestions aiming to improve our manuscript. Our manuscript will benefit from addressing them.

    Our main impression is that the concerns regarding the novelty of our work by Reviewers #1 and #3 come from the fact that we apply a known flexible statistical framework (group factor analysis) to novel applications in single-cell data analysis, namely the estimation of multicellular programs and sample-level unsupervised analysis. The core methodology of our work is indeed based on the popular tool Multi-omics factor analysis (MOFA). We see the novelty of our study in the formulation of these relatively new applications within this framework, and in the demonstration of the added value that this formulation provides by building on MOFA’s strengths, in particular by expanding the possibilities of downstream analysis of single-cell data, including the meta-analysis of distinct single-cell patient cohorts and their integration with complementary bulk and spatial data modalities.

    The simultaneous estimation of multicellular programs together with sample-level unsupervised analysis is currently possible with only one available tool, scITD, which is limited by its modeling strategy based on tensor decomposition: with tensor decomposition, multicellular programs cannot be estimated from distinct feature sets across cell-types, which makes the method less flexible and more sensitive to technical effects such as background expression. We compared our proposed methodology with scITD and showed the benefits of using group factor analysis as implemented in MOFA for this task. Moreover, as of now, no other methodology is able to estimate multicellular programs and perform sample-level unsupervised analysis simultaneously in multiple independent single-cell atlases. We also showed that multicellular programs are traceable in bulk transcriptomics data and that they are better suited to classify heart failure patients than classic cell-type deconvolution approaches.

    Altogether, we believe that our current manuscript complements existing literature and puts forward an approach with distinct features to analyze single-cell atlases. We will edit the text to make the novelty and advantages of our proposed methodology more explicit, and we will emphasize that our work is not meant to propose a new method, but rather to demonstrate how group factor analysis can be used for novel sample-level analysis of single-cell data. We plan to incorporate the suggestions by Reviewer #1 regarding the inclusion of additional datasets, model validations, and novel applications involving a direct modeling of cell-compositions and spatial organization of cells. Moreover, we plan to discuss perspectives on how cell communications can be incorporated in the analysis of multicellular programs as suggested by Reviewer #2. Additionally, we will correct all the figure and text typos identified by the reviewers. Finally, we will provide an R package (https://github.com/saezlab/MOFAcellulaR) and Python implementations (https://liana-py.readthedocs.io/en/latest/notebooks/mofacellular.html) that facilitate the use of our approach.

    Please find below the point-by-point response to the reviewers in blue, numbered for convenience.

    __Reviewer #1 (Evidence, reproducibility and clarity (Required)):__

    Remark to authors

    Flores et al. present a pipeline in which they leverage MOFA framework, a matrix factorization algorithm to infer multi-cellular programs (MCPs). Learning and using MCP has already been proposed by others. Yet, authors pursue a similar goal by using MOFA, providing a cell*sample matrix for different cell types as different views (instead of multiple modalities/views) as the input. They later apply MOFA using this data format on a series of applications to analyze acute and chronic human heart failure single-cell datasets using MCPs. Authors further try to expand their analysis by incorporating other modalities.

    Major points:

    1.1 As briefly outlined in the remarks, the current manuscript needs novel findings and methodology to grant a research article which I can't see here. The underlying matrix factorization is the original MOFA (literally imported in the code) with no modification to further optimize the method toward the task. While I appreciate and acknowledge the author's efforts resulting in a detailed analysis of heart samples, I think all of these could have been part of MOFA's existing tutorials.

    Response 1.1 As the reviewer correctly states, we used the framework and code of MOFA. The novelty lies in its application to the unsupervised analysis of samples from cross-condition single-cell data and the inference of MCPs. MOFA is a statistical framework that implements a generalization of group factor analysis with fast inference. Its current version fits the task of MCP inference and unsupervised analysis of samples across cell-types, and it provides a more flexible modeling alternative than currently available methods (as presented in Table 1 of the manuscript). Current work on MCP inference is based on the premise of multi-view factorization with distinct statistical modeling alternatives. As mentioned in the discussion of our manuscript, three main points distinguish our methodology from present alternatives and provide evidence of its relevance and uniqueness over available tools:

    Simultaneous unsupervised analysis of samples across cell-types and inference of MCPs, together with comprehensive interpretable descriptions of the reconstruction of the original multi-view dataset. This is currently only possible with scITD (Mitchel et al, 2022), which we compare against in the manuscript. DIALOGUE (Jerby-Arnon & Regev, 2022) is limited to the generation of MCPs, and Tensor-cell2cell (Armingol et al, 2021) focuses only on cell communications, with limited interpretability.

    Flexible non-overlapping feature sets that better handle technical effects such as background expression, as discussed in section “2.2 Multicellular factor analysis for an unsupervised analysis of samples in single-cell cohorts”. Moreover, as mentioned by the reviewer in a later point (Reviewer comment 1.2), this enables joint modeling of distinct aspects of the tissue, such as cell compositions, cell communications (preliminary work: https://liana-py.readthedocs.io/en/latest/notebooks/mofatalk.htm) and spatial organization.

    Joint modeling of independent atlases, which enables meta-analysis at the sample level of cross-condition single-cell data. No currently available methodology is capable of performing similar modeling.

    For these reasons, we believe that our work is worth being discussed and presented to the community as a research article. We will modify the discussion to put more emphasis on the added value of group factor analysis as implemented in MOFA.

    Moreover, we now provide an R package (https://github.com/saezlab/MOFAcellulaR) and Python implementations within our analysis framework LIANA (https://liana-py.readthedocs.io/en/latest/notebooks/mofacellular.html) that facilitate the use of our proposed methodology. The R and Python implementations are compatible with current Bioconductor and scverse pipelines, respectively.

    Application of our methodology to heart failure datasets also revealed novel knowledge about heart disease processes:

    In myocardial infarction, we found that our MCPs associated with cardiac remodeling capture cell-state-independent gene expression changes. This provides a novel understanding of the effect of disease contexts on the expression profiles of specialized cells. This finding was not reported in the original atlas publication.

    In chronic heart failure, we identified a conserved MCP of cardiac remodeling across patient cohorts and etiologies, suggesting a common chronic phase between distinct initial causes of heart failure.

    Moreover, we showed that deconvoluted chronic heart failure MCPs from bulk transcriptomics better classify patients in comparison to classic cell-type composition deconvolution of bulk data. To our knowledge, this finding was not presented in any of the manuscripts of other methodologies focused on MCPs.

    Altogether, our current work shows a novel application of group factor analysis for the simultaneous estimation of MCPs and the sample-level unsupervised analysis of cross-condition single-cell data. We showed its unique features compared to currently available tools. Distinct post-hoc analyses in combination with other data modalities show the biological relevance of our proposed methodology to complement the tissue-centric knowledge of disease.

    1.2 How can you explain that the results in donor-level analyses are not due to technical artifacts (batch variation)? Can this be used to infer a new patient similarity map? For example, I would test this by leaving out a few patients from training, projecting them, and seeing where they would end up in the manifold or classifying disease conditions for new patients and explaining the classification by MCPs responsible for that condition.

    Response 1.2 When knowledge of the technical batches is available, it is possible to test for association between these labels and the factors encoding MCPs, as shown in Figure 2.

    In our current applications, we additionally showed the biological relevance of our estimated MCPs by mapping them to spatial and bulk datasets, which is a direct way of testing how generalizable our findings are:

    In the application of MOFA to human myocardial infarction data, we mapped the gene loadings that make up the MCP associated with cardiac remodeling to paired spatial transcriptomics datasets. We showed that, in general, the cell-type-specific expression of the MCP of cardiac remodeling encompassed larger areas in ischemic and fibrotic samples than in myogenic (control) samples.

    In the application of MOFA to chronic human end-stage heart failure data, we mapped the gene loadings that make up the MCP associated with cardiac remodeling to 16 independent bulk transcriptomics datasets of heart failure. There we showed that the cell-type-specific expression of the MCP of cardiac remodeling separates heart failure patients from control individuals.

    Regarding the generation of new patient similarity maps, it is possible to estimate the positions of new samples in the manifold formed by the factors representing the MCPs. As suggested by the reviewer, we will show this by classifying heart failure single-cell samples using MCPs of two independent patient cohorts (presented in section 2.7).
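
    As an illustration only, placing a new sample in factor space can be sketched as a least-squares projection onto the gene loadings of one factor. The sketch assumes a linear model with a single factor and already-centered expression values; the gene names, loadings, and the helper `project_sample` are hypothetical and are not part of MOFA or MOFAcellulaR. With several correlated factors, the scores would have to be solved for jointly.

```python
def project_sample(expression, loadings):
    """Least-squares score of one sample on one factor.

    Only genes that have a loading and were measured in the sample are used.
    """
    shared = [gene for gene in loadings if gene in expression]
    numerator = sum(loadings[gene] * expression[gene] for gene in shared)
    denominator = sum(loadings[gene] ** 2 for gene in shared)
    return numerator / denominator


# Hypothetical loadings of one factor for a single cell-type view.
loadings = {"NPPA": 0.5, "MYH7": -0.5, "POSTN": 1.0}

# Hypothetical centered pseudobulk profile of a new sample.
new_sample = {"NPPA": 1.0, "MYH7": -1.0, "POSTN": 2.0, "ACTB": 0.3}

score = project_sample(new_sample, loadings)
print(score)
```

    Genes without a loading (here `ACTB`) are ignored, so the new sample does not need to share the full feature set of the training data.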

    1.3 The bulk and spatial analysis are used posthoc after running MOFA, I think since MOFA can use non-overlapping features set, it would be interesting to see if deconvoluted bulk or ST data can be encoded as another view (one view from scRNAseq data for each cell-type and another view from bulk RNA-seq or ST, you can get normalized expression per spot (for ST) or per sample (for bulk) and use them as input.

    Response 1.3 Thanks for the suggestion. We agree that the possibility of using non-overlapping features opens up options for complex models that include the cell-type compositional and organizational aspects of tissues. However, these features must be quantified in the same sample, so the approach is limited to samples profiled simultaneously at different scales.

    We will present the results of a sample-level joint model of multicellular programs together with cell proportions and spatial dependencies using the myocardial infarction dataset presented in section 2.2. For this dataset, based on our previous work, we have the compositions of major cell-types and their spatial relationships from spatially contextualized models (Kuppe et al, 2022). We will run a MOFA model and show how it can be used to find factors associated with structural and molecular features of tissues.

    __Minor:__

    1.4 Some figure references are not correct (e.g., "the single-cell data into a multi-view data representation by estimating pseudo bulk gene expression profiles for each cell-type across samples (Figure 1b)." should be figure 2b)

    Response 1.4 Thanks for pointing this out. We apologize for these mistakes and we will adjust all labels correctly.

    1.5 The paper is well written, but there could be some more clarifications about what authors consider as cell-type and cell-state, condition, MCPs which I think is critical to current analysis (see here https://linkinghub.elsevier.com/retrieve/pii/S0092867423001599) for the reader not familiar with those concepts.

    Response 1.5 We agree with the reviewer that it is important to introduce these concepts in more detail to avoid confusion. We will adapt the current manuscript to incorporate these definitions in the introduction.

    __Reviewer #1 (Significance (Required)):__

    1.6 While I find the concept of MCPs interesting, the current work seems like a series of vignettes and tutorials by simply applying MOFA on different datasets (The authors rightfully state this). However, It needs to be clarified what the novelty is since there is no algorithmic improvement to current MCP methods (because there is no new method) nor novel biological findings. Additionally, even in the current form, the applications are limited to the heart, and the generalization of this proposed analysis pipeline to other tissues and datasets is not explored. Overall, the paper lacks focus and novelty, which is required to grant a publication at this level.

    Response 1.6 As mentioned in response 1.1, we show that group factor analysis as implemented in MOFA has advantages given its flexibility of the feature space, the joint-modeling of independent datasets, and the interpretability of the model. We will make these advantages clearer in the discussion, and we will explicitly mention the disadvantages and lack of functionalities of available methods.

    The applications were mainly done in heart data for consistency, although they comprise four distinct single-cell datasets, one spatial transcriptomics dataset, and 16 independent bulk transcriptomics datasets. For completeness, as suggested by the reviewer, we will show the application of our methodology to peripheral blood mononuclear cell data of lupus samples (preliminary results: https://liana-py.readthedocs.io/en/latest/notebooks/mofacellular.html).

    __expertise: Computational biology, single-cell genomics, machine learning__

    __Reviewer #2 (Evidence, reproducibility and clarity (Required)):__

    Summary:

    The authors use MOFA, an unsupervised method to analyze multi-omics data, to create multicellular programs of cross-condition multi-sample studies. First, for each cell-type, a pseudobulk expression matrix per sample is created. The cell-type now functions as the separate view, typically reserved for the different omics layers in MOFA. This then results in a latent space with a certain number of factors across samples. The factors, representing coordinated gene expression changes across cell-types, can then be checked for associations with covariates of interest across the samples.

    MOFA is well-suited for this task, as it can handle missing data and it is a linear model facilitating the interpretation of the factors. Users should be aware that MOFA can estimate the number of factors, but the pseudobulk profiles require a rigorous selection of cell-type specific marker genes. The result will be most suited for downstream analysis if there is a clear association with one factor and a clinical covariate of interest. In a final step, a positive or negative gene signature can be created by setting a cut-off on the gene weights for that specific factor.

    The method is applied on 3 separate data sets of heart disease, each time demonstrating that at least one of the factors is associated with a disease covariate of interest. The authors also compare the method to a competitor tool, scITD, and explore to what extent a factor mainly captures variance associated with (i) a general condition covariate or rather (ii) specific cell states.

    The multicellular programs are also mapped to spatial data with spot resolution. Though this analysis does not bring any novel biological insight in the use case, it does support the claim that the programs are associated with the covariate of interest.

    The most interesting applications of MOFA are in my opinion the potential for meta-analysis of single-cell studies and validation of cell-type specific gene signatures with publicly available bulkRNAseq data sets.

    The authors provide various data sets and data types to support their claims and the paper is well written. The relevant code and data has been made available.

    We thank the reviewer for the positive comments on our work.

    __Major comments__

    2.1 What is the added value of the gene signatures obtained from MOFA compared to e.g. a naive univariate approach? In theory, a similar collection of genes or gene signature could be obtained by running a differential gene expression analysis across the samples for each cell-type (e.g. myogenic vs ischemic ) and applying a set of relevant cut-offs or filters on the results. In other words, does MOFA detect genes that would otherwise be missed?

    Response 2.1 Thank you for the relevant comment. The original motivation of our work is the unsupervised analysis of samples based on a manifold formed by a collection of multicellular molecular programs. We envisioned that this unsupervised analysis would be relevant in situations where a clear histological or clinical classification of samples cannot be made reliably. As mentioned by Reviewer #1 in comment 1.2, one advantage of these approaches is that they create patient similarity maps, which have been shown to be useful to stratify patients in a recent analogous work in multiple sclerosis (Macnair et al, 2022). The cell-type signatures obtained from relevant factors explaining the patient stratification reduce the risk of “double dipping”, because they remove the need for a direct differential expression analysis between newly formed groups.

    In our applications, the generation of cell-type signatures (here called multicellular programs) associated with a specific clinical covariate (e.g. control vs perturbation) is a post-hoc analysis of the generated manifold. As the reviewer correctly points out, these signatures should be similar to the results of a direct differential expression analysis between those patient conditions. In the related work on scITD (Mitchel et al, 2022), the authors showed high concordance between the cell-type signatures and the results of differential expression analysis. For completeness, we will similarly quantify the degree of overlap between the genes of our generated signatures and the ones coming from differential expression analysis.
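
    One possible way to quantify such an overlap, shown here only as a sketch, is to derive a signature by a cut-off on the factor weights and compare it with a differential expression gene set using the Jaccard index. The choice of the Jaccard index, the cut-off value, and all gene names and weights below are illustrative assumptions, not the procedure used in the manuscript.

```python
def signature_from_weights(weights, cutoff):
    """Split genes into positive and negative signatures by a cut-off on factor weights."""
    positive = {gene for gene, w in weights.items() if w >= cutoff}
    negative = {gene for gene, w in weights.items() if w <= -cutoff}
    return positive, negative


def jaccard(a, b):
    """Jaccard index of two gene sets (0.0 when both sets are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical weights of one factor for a single cell-type view.
weights = {"NPPA": 0.9, "POSTN": 0.6, "MYH6": -0.7, "ACTB": 0.05}

# Hypothetical set of genes up-regulated in a differential expression analysis.
de_up = {"NPPA", "POSTN", "COL1A1"}

positive, negative = signature_from_weights(weights, cutoff=0.5)
overlap = jaccard(positive, de_up)
print(sorted(positive), sorted(negative), round(overlap, 3))
```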

    It is relevant to mention that in complex experimental designs with multiple conditions, our approach facilitates patient ordering, which allows the understanding of one condition in the context of all the others, avoiding the need of multiple testing and the definition of multiple contrasts, as mentioned in the text.

    We will incorporate these points in the discussion section of the manuscript.

    2.2 Could scITD also be used for meta-analysis or could the obtained gene signatures of that method also be mapped to bulkRNAseq data? If so, it would be interesting to show the relative performance with MOFA. If not, this specific advantage should be highlighted.

    Response 2.2 Thank you for pointing this out. scITD does not provide a group-based model to perform meta-analysis, and this feature is one of the main advantages of group factor analysis as currently implemented in MOFA. We will highlight this feature in Table 1 and in the discussion.

    Although scITD signatures of a single study could be mapped to bulk transcriptomics data, the stringent tensor representation leads to the generation of signatures that may be influenced by technical effects as shown in the manuscript section 2.2. Thus we believe that the flexibility of the feature space in MOFA is an advantage for this task. We will add this observation to the discussion.

    2.3 Users need to specify gene set signatures based on the weights for a factor of interest. This might suggest a limitation to categorical covariates of interest. If the authors see potential for a continuous covariate of interest, this should at least be highlighted in the text and if possible demonstrated on a use case.

    Response 2.3 In our applications we limited ourselves to categorical variables, however, it is possible to associate factors to continuous variables. An implementation of the association with continuous variables is already available in our newly created R package “MOFAcellulaR”: https://github.com/saezlab/MOFAcellulaR/blob/main/R/get_associations.R.

    The datasets we analyzed have no continuous clinical covariates to showcase this functionality, but as suggested by the reviewer we will highlight this feature in the text.
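
    To illustrate what an association with a continuous covariate could look like, the sketch below computes a Pearson correlation between the factor scores of a set of samples and a continuous clinical variable. It is a stand-in under stated assumptions and does not reproduce the `get_associations` function of MOFAcellulaR; the factor scores and the covariate (`ejection_fraction`) are invented for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


# Hypothetical scores of five samples on one factor.
factor_scores = [-1.2, -0.4, 0.1, 0.9, 1.6]

# Hypothetical continuous covariate measured on the same five samples.
ejection_fraction = [62.0, 55.0, 50.0, 38.0, 30.0]

r = pearson(factor_scores, ejection_fraction)
print(r)
```

    In practice a significance test and a correction for the number of factors tested would accompany the coefficient.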

    __Minor comments__

    2.4 In Figure 2c the association between factor 2 and the technical factor shows a very strong outlier. Please verify that the association is still significant after applying a more robust statistical test (e.g. non-parametric test as Wilcoxon).

    Response 2.4 Thanks for the observation, we will test these differences with a non-parametric test.

    2.5 For mapping the cell-type specific factor signatures to bulk transcriptomics, the exact performed comparison or model is unclear. There are seven cell-type signatures for each sample in every study. Was there a t-test run for each cell-type or was a summary measure taken across the cell-types? The thresholding is also rather lenient (adj. p-val 0.1).

    Response 2.5 We are sorry for not being clear about our procedure. After identifying the multicellular program associated with heart failure, estimated from the meta-analysis of the two single-cell studies, we calculated the weighted mean expression of the seven cell-type signatures independently for every sample of the 16 bulk studies. In other words, each sample within each bulk study is represented by a vector of seven values representing the relative expression of each cell-type-specific signature (Figure 6D-left). For each bulk transcriptomics study, we first centered the gene expression data before calculating the weighted mean.
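
    The scoring step described above can be sketched as follows, purely as an illustration. The sketch centers each gene across the samples of one bulk study and then computes one weighted mean per cell-type signature and sample. Normalizing by the sum of absolute weights is an assumption, as are all gene names, weights, and expression values; the toy data use two signatures instead of seven.

```python
from statistics import mean

# Toy bulk study: sample -> {gene: expression}. Values are illustrative.
bulk = {
    "hf_1": {"NPPA": 9.0, "MYH7": 7.0, "COL1A1": 8.0},
    "hf_2": {"NPPA": 8.0, "MYH7": 6.0, "COL1A1": 9.0},
    "ctrl_1": {"NPPA": 4.0, "MYH7": 5.0, "COL1A1": 5.0},
    "ctrl_2": {"NPPA": 3.0, "MYH7": 6.0, "COL1A1": 6.0},
}

# Hypothetical cell-type-specific gene weights of one factor.
signatures = {
    "cardiomyocyte": {"NPPA": 0.8, "MYH7": 0.2},
    "fibroblast": {"COL1A1": 1.0},
}


def center_genes(study):
    """Subtract the per-gene mean across all samples of one study."""
    genes = sorted({gene for expr in study.values() for gene in expr})
    gene_means = {g: mean(expr[g] for expr in study.values()) for g in genes}
    return {s: {g: expr[g] - gene_means[g] for g in genes} for s, expr in study.items()}


def signature_scores(study, signatures):
    """Weighted mean of centered expression per sample and cell-type signature."""
    centered = center_genes(study)
    scores = {}
    for sample, expr in centered.items():
        scores[sample] = {}
        for cell_type, weights in signatures.items():
            # Assumes every signature shares at least one gene with the study.
            shared = [g for g in weights if g in expr]
            norm = sum(abs(weights[g]) for g in shared)
            scores[sample][cell_type] = sum(weights[g] * expr[g] for g in shared) / norm
    return scores


scores = signature_scores(bulk, signatures)
for sample, values in scores.items():
    print(sample, values)
```

    Each sample ends up with one score per cell-type signature, which is the vector that a per-study test between heart failure and control samples would then operate on.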

    In supplementary figure 4-e we show the results of performing a t-test of the cell-type scores between heart failure and control samples within each study. Given the relatively low sample size of most of the studies (which affects the power of the test), we chose a less stringent adjusted p-value threshold. For completeness, we will show the results of a more classical threshold (adj. p-value

    2.6 typo in abstract: In sum, our framework serves as an exploratory tool for unsupervised analysis of cross-condition single-cell ***atlas*** and allows for the integration of the measurements of patient cohorts across distinct data modalities

    Response 2.6 Thanks for pointing out this typo. We will modify the text.

    2.7 In Figure 4a it is not clear to me why on the one hand we see marker enrichment vs loading enrichment with healthy and disease.

    Response 2.7 We apologize, this is a typo after editing the labels. Both should contain the marker enrichment label. We will fix this.

    2.8 In Figure 4b it would help if the same color scheme would be maintained throughout the paper (here now black and white) and if for the cell states the boxplots would be connected per condition, emphasizing the (absence) of change across cell states within a condition.

    Response 2.8 We thank the reviewer for the suggestion. We will reorganize the panels showing the gene expression per condition and fix the color scheme.

    __Reviewer #2 (Significance (Required)):__

    __General assessment:__

    2.9 MOFA is well-suited for detecting multicellular programs because it can handle missing data and allows for easy interpretation of the factors as a linear method. It might have particular potential for meta-analysis across multiple studies and reevaluating bulkRNAseq data sets, but in the current manuscript it is unclear to what extent this is a specific advantage of MOFA or could also be done with competitors. The authors show how the obtained results and associations with clinical covariates can be validated across multiple data types. How the resulting multicellular programs can provide additional biological insight or form the starting point for additional downstream analysis (e.g. cell communication) is not covered in the paper.

    Response 2.9 We thank the reviewer for highlighting the methodological advantages of group factor analysis for the estimation of multicellular programs and the unsupervised analysis of samples from cross-condition single-cell atlases. As mentioned in responses 1.1 and 2.2, the added value of our methodology is the flexibility of feature views (which goes beyond gene expression) and the simultaneous modeling of independent single-cell datasets, a feature that is not present in any of the currently available methods and that facilitates the meta-analysis of datasets across modalities.

    While we interpret the presented multicellular programs in the context of cellular functions and the division of labor of cell states, it is true that we did not attempt to provide mechanistic hypotheses, for example, via cell-cell communication, on how this coordination across cell-types emerges.

    Previous work of the related tool Tensor-cell2cell (Armingol et al, 2021) has presented the idea of the estimation of multicellular programs from cell-cell communications and group factor analysis can also be used for this task (preliminary work: https://liana-py.readthedocs.io/en/latest/notebooks/mofatalk.html). We will discuss in the text perspectives on how the estimation of multicellular programs can be linked to the inference of cell communications from single-cell data together with analysis alternatives previously proposed by scITD and Tensor-cell2cell. However, we believe that this question requires further work and it is out of scope of our current manuscript.

    __Audience: This paper will be mainly of interest to a specialized public interested in unsupervised methods for large scale multi-sample and multi-condition studies.__

    __Reviewer: main background in the analysis of scRNAseq data.__

    __Reviewer #3 (Evidence, reproducibility and clarity (Required)):__

    This manuscript by Saez-Rodriguez and colleagues proposes to repurpose Multi-Omics Factor Analysis for the use of single cell data. The initial open problem stated by the paper is the need for a framework to map multicellular programs (such as derived from factor analysis) to other modalities such as spatial or bulk data. The authors propose to repurpose MOFA for use in single cell data. Case studies involve human heart failure datasets (and focuses on spatial and bulk comparisons).

    There are particular issues with clarity regarding the key methodological contribution (and assessment of it), discussed under significance.

    __Reviewer #3 (Significance (Required)):__

    3.1 I am very puzzled by the repeated claims the manuscript makes that their central methodological contribution and innovation is to use MOFA for single cell data. One of their citations for MOFA is to MOFA+, which is precisely that (in a relatively popular manuscript published by the original authors of MOFA and not overlapping with the present authors). I am left to wonder what I missed.

    Response 3.1 We apologize for the misunderstanding. As mentioned in response 1.1 and explained in Reviewer #2’s summary, the main objective of our work is to use the statistical framework of group factor analysis for the inference of multicellular programs and the sample-level unsupervised analysis of cross-condition single-cell data, which is a distinct task from multimodal integration (Argelaguet et al, 2021).

    While it is true that MOFA+ introduced expansions to the model for single-cell data, namely fast inference and group-based modeling, the main focus of its applications is the multimodal integration of data, where each cell is represented by a collection of distinct features (e.g. chromatin accessibility and gene expression). Unlike multimodal integration, here we propose a different approach to analyze single-cell data at the sample level instead of the cell level, without modifying the underlying statistical model (see section 2.1 of the manuscript).

    In detail, what we assume is that samples of single-cell transcriptomics data (e.g. tissue from a patient) can be represented by a collection of independent vectors collecting the gene expression information of cell types composing the tissue analyzed. Decomposition of these multiple views with group factor analysis produces a manifold that captures multicellular programs (coordinated expression processes across cell-types), or shared variability across cell-types simultaneously. Altogether, this represents a novel usage of group factor analysis in an application for the inference of multicellular programs, where the main focus is not at the cell-level but at the patient level.
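
    The multi-view representation described above can be sketched in a few lines, as an illustration only. The sketch sums counts per cell type and sample (pseudobulk), producing one sample-by-gene view per cell type; each view may keep its own gene set. The cell records, gene names, and the helper `build_views` are hypothetical, and real pipelines would add filtering and normalization steps that are omitted here.

```python
from collections import defaultdict

# Toy single-cell data: each record is (sample, cell_type, {gene: count}).
# All names and values are illustrative, not taken from the manuscript.
cells = [
    ("patient_1", "cardiomyocyte", {"NPPA": 5, "MYH7": 3}),
    ("patient_1", "cardiomyocyte", {"NPPA": 1, "MYH7": 2}),
    ("patient_1", "fibroblast", {"COL1A1": 4, "POSTN": 0}),
    ("patient_2", "cardiomyocyte", {"NPPA": 0, "MYH7": 6}),
    ("patient_2", "fibroblast", {"COL1A1": 7, "POSTN": 2}),
    ("patient_2", "fibroblast", {"COL1A1": 1, "POSTN": 1}),
]


def build_views(cells):
    """Sum counts per (cell type, sample): one sample-by-gene view per cell type."""
    views = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
    for sample, cell_type, counts in cells:
        for gene, count in counts.items():
            views[cell_type][sample][gene] += count
    # Convert the nested defaultdicts to plain dicts for readability.
    return {
        cell_type: {sample: dict(genes) for sample, genes in samples.items()}
        for cell_type, samples in views.items()
    }


views = build_views(cells)
for cell_type, samples in views.items():
    print(cell_type, samples)
```

    The resulting views share samples but not features, which is the structure that a group factor analysis model would take as input.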

    As a side note, Britta Velten, one of the main developers of MOFA and coauthor of both the MOFA and MOFA+ papers, is a contributor and coauthor of this manuscript, and Ricard Argelaguet, who also led both versions of MOFA, gave us helpful feedback and is acknowledged as such in this work.

    3.2 Multimodal integration methods are fairly numerous and even if they're not all exactly factor analyses, it's strange to argue that MOFA fills some unique conceptual gap. I agree it fills something of an interesting gap (except for MOFA+ already filling it), but it's not like the quite popular spatial to single-cell integration approaches aren't doing similar things. If this is a methods paper (as it is presented) then there would have to be very substantially more comparative evaluation to these other approaches.

    Response 3.2 As presented in the previous response (3.1) our current work is not focused on multimodal integration, but rather the inference of multicellular programs and the sample-level unsupervised analysis of single-cell data. Given this, in the current manuscript we compared our proposed methodology with the only three other available methods that address at least partially the inference of multicellular programs (see Table 1 in our manuscript). In response 1.1 and 3.2 we discussed the advantages of our proposed methodology compared to available methods. In the manuscript section 2.2 we compared group factor analysis with tensor decomposition and showed that the former better deals with technical artifacts and better identifies known patient groups.

    We will explicitly distinguish our work from multimodal integration in the introduction and in section 2.1 of the manuscript to avoid confusion.

    3.3 The biological use cases are comparatively interesting and dominate the manuscript (but are still presented principally as use cases rather than a compelling biological narrative of their own).

    Response 3.3 The focus of our manuscript was the reintroduction of group factor analysis for the novel applications of the inference of multicellular programs and the sample-level unsupervised analysis from single-cell data. Given the distinct possibilities of post-hoc analyses, we mainly used acute and chronic heart failure data to showcase the utility of MOFA to connect spatial and bulk modalities with single-cell data.

    That said, as discussed in response 1.1, our analyses allowed us to generate novel hypotheses from these datasets:

    In myocardial infarction, we found that our estimated multicellular programs associated with cardiac remodeling capture cell-state-independent gene expression changes. This provides a novel understanding of the effect of disease contexts in the expression profiles of specialized cells. In other words, we found that cell-states, regardless of their specialized function, share a common response in the tissue context.

    In chronic heart failure, we identified a conserved multicellular program of cardiac remodeling across patient cohorts and etiologies, suggesting a common chronic phase between distinct initial causes of heart failure, which again may be linked to the dominating response to the tissue context that is shared across etiologies.

    These two results support the observation that deconvoluted chronic heart failure multicellular programs from bulk transcriptomics better classify patients in comparison to classic cell-type composition deconvolution of bulk data. To our knowledge, this finding was not presented in any of the manuscripts of other methodologies focused on MCPs. We summarize these results in the third paragraph of the discussion in the manuscript:

    “In an application to a collection of public single-cell atlases of acute and chronic heart failure, we found evidence of dominant cell-state independent transcriptional deregulation of cell-types upon myocardial infarction. This may suggest that while certain functional states within a cell-type are more favored in a disease context, most of the cells of a specific type have a shared transcriptional profile in disease tissues. If part of this shared transcriptional profile is interpreted as a signature of the tissue microenvironment that drives cells in tissues towards specific functions, this result may also indicate that a major source of variability across tissues, besides cellular composition, is the degree in which the homeostatic transcriptional balance of the tissue is disturbed. By combining the results of multicellular factor analysis with spatial transcriptomics datasets, we explored this hypothesis and identified larger areas of cell-type-specific transcriptional alterations in diseased tissues. Given these observations on global alterations upon myocardial infarction, we meta-analyzed single-cell samples from two additional studies of healthy and heart failure patients with multiple cardiomyopathies. Here, we found a conserved transcriptional response across cell-types in failing hearts, despite technical and clinical variability between patients. Further, we could find traces of these cell-type alterations in independent bulk data sets. These observations suggest that our approach can estimate cell-type-specific transcriptional changes from bulk data that, together with changes in cell-type compositions, describe tissue pathophysiology. Altogether, these results highlight how MOFA can be used to integrate the measurements of independent single-cell, spatial, and bulk datasets to measure cell-type alterations in disease.”

    To fully assess the relevance of these observations, they should be investigated in more datasets and analyses, where shared functional cell-states across distinct heart failure etiologies are identified and then compared at their compositional and molecular level. This, in our opinion, represents an independent study on its own.

    3.4 Altogether, I found the framing of this manuscript very puzzling. It is possible the result would be more clearly presented if the use case was the major focus rather than the more conceptual point about factor analysis.

    Response 3.4 Thanks for the suggestion. The major aim of this manuscript is to highlight the versatility of the generalization of group factor analysis as implemented in MOFA for novel applications in single-cell data analysis, beyond multimodal integration of single cells. The definition of multicellular programs from single-cell data and its sample-level unsupervised analysis are relatively new analyses in the field, and thus we believe that it is timely to show how a known statistical framework can be used for these applications.

    We believe that a detailed analysis of single-cell datasets of heart failure deserves its own focus and is outside the scope of our current objective with this manuscript. We apologize for the apparent misunderstanding of the objective of our methodology. We will add these distinctions in the introduction of the manuscript.

    References

    Argelaguet R, Cuomo ASE, Stegle O & Marioni JC (2021) Computational principles and challenges in single-cell data integration. Nat Biotechnol 39: 1202–1215

    Armingol E, Baghdassarian H, Martino C, Perez-Lopez A, Knight R & Lewis NE (2021) Context-aware deconvolution of cell-cell communication with Tensor-cell2cell. BioRxiv

    Jerby-Arnon L & Regev A (2022) DIALOGUE maps multicellular programs in tissue from single-cell or spatial transcriptomics data. Nat Biotechnol 40: 1467–1477

    Kuppe C, Ramirez Flores RO, Li Z, Hayat S, Levinson RT, Liao X, Hannani MT, Tanevski J, Wünnemann F, Nagai JS, et al (2022) Spatial multi-omic map of human myocardial infarction. Nature 608: 766–777

    Macnair W, Calini D, Agirre E, Bryois J, Jaekel S, Kukanja P, Stokar-Regenscheit N, Ott V, Foo LC, Collin L, et al (2022) Single nuclei RNAseq stratifies multiple sclerosis patients into three distinct white matter glia responses. BioRxiv

    Mitchel J, Gordon MG, Perez RK, Biederstedt E, Bueno R, Ye CJ & Kharchenko P (2022) Tensor decomposition reveals coordinated multicellular patterns of transcriptional variation that distinguish and stratify disease individuals. BioRxiv

  2. Note: This preprint has been reviewed by subject experts for Review Commons. Content has not been altered except for formatting.


    Referee #3

    Evidence, reproducibility and clarity

    This manuscript by Saez-Rodriguez and colleagues proposes to repurpose Multi-Omics Factor Analysis (MOFA) for use with single-cell data. The initial open problem stated by the paper is the need for a framework to map multicellular programs (such as those derived from factor analysis) to other modalities such as spatial or bulk data. Case studies involve human heart failure datasets (and focus on spatial and bulk comparisons).

    There are particular issues with clarity regarding the key methodological contribution (and assessment of it), discussed under significance.

    Significance

    1. I am very puzzled by the repeated claims the manuscript makes that their central methodological contribution and innovation is to use MOFA for single cell data. One of their citations for MOFA is to MOFA+, which is precisely that (in a relatively popular manuscript published by the original authors of MOFA and not overlapping with the present authors). I am left to wonder what I missed.
    2. Multimodal integration methods are fairly numerous and even if they're not all exactly factor analyses, it's strange to argue that MOFA fills some unique conceptual gap. I agree it fills something of an interesting gap (except for MOFA+ already filling it), but it's not like the quite popular spatial to single-cell integration approaches aren't doing similar things. If this is a methods paper (as it is presented) then there would have to be very substantially more comparative evaluation to these other approaches.
    3. The biological use cases are comparatively interesting and dominate the manuscript (but are still presented principally as use cases rather than a compelling biological narrative of their own).

    Altogether, I found the framing of this manuscript very puzzling. It is possible the result would be more clearly presented if the use case was the major focus rather than the more conceptual point about factor analysis.

  3. Note: This preprint has been reviewed by subject experts for Review Commons. Content has not been altered except for formatting.


    Referee #2

    Evidence, reproducibility and clarity

    Summary:

    The authors use MOFA, an unsupervised method to analyze multi-omics data, to create multicellular programs of cross-condition multi-sample studies. First, for each cell-type, a pseudobulk expression matrix per sample is created. The cell-type now functions as the separate view, typically reserved for the different omics layers in MOFA. This then results in a latent space with a certain number of factors across samples. The factors, representing coordinated gene expression changes across cell-types, can then be checked for associations with covariates of interest across the samples. MOFA is well-suited for this task, as it can handle missing data and it is a linear model facilitating the interpretation of the factors. Users should be aware that MOFA can estimate the number of factors, but the pseudobulk profiles require a rigorous selection of cell-type specific marker genes. The result will be most suited for downstream analysis if there is a clear association with one factor and a clinical covariate of interest. In a final step, a positive or negative gene signature can be created by setting a cut-off on the gene weights for that specific factor. The method is applied on 3 separate data sets of heart disease, each time demonstrating that at least one of the factors is associated with a disease covariate of interest. The authors also compare the method to a competitor tool, scITD, and explore to what extent a factor mainly captures variance associated with (i) a general condition covariate or rather (ii) specific cell states. The multicellular programs are also mapped to spatial data with spot resolution. Though this analysis does not bring any novel biological insight in the use case, it does support the claim that the programs are associated with the covariate of interest. 
    The most interesting applications of MOFA are, in my opinion, the potential for meta-analysis of single-cell studies and the validation of cell-type specific gene signatures with publicly available bulkRNAseq data sets. The authors provide various data sets and data types to support their claims and the paper is well written. The relevant code and data have been made available.
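
    The final step described in this summary, deriving a positive and a negative gene signature from the weights of one factor, can be sketched as below. This is only an illustration: the function name `factor_signature`, the cutoff value, and the toy weights are assumptions and are not taken from the manuscript.

```python
def factor_signature(weights, cutoff):
    """Split genes into positive and negative signatures by an absolute weight cutoff."""
    positive = sorted(gene for gene, w in weights.items() if w >= cutoff)
    negative = sorted(gene for gene, w in weights.items() if w <= -cutoff)
    return positive, negative


# Hypothetical gene weights of one factor in one cell-type view.
weights = {"NPPA": 0.8, "MYH7": 0.3, "MYH6": -0.6, "ACTB": 0.02}

positive, negative = factor_signature(weights, cutoff=0.25)
print(positive)  # genes with weight >= cutoff
print(negative)  # genes with weight <= -cutoff
```

    Genes whose absolute weight falls below the cutoff (here ACTB) belong to neither signature, so the choice of cutoff directly controls signature size.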

    Major comments

    • What is the added value of the gene signatures obtained from MOFA compared to e.g. a naive univariate approach? In theory, a similar collection of genes or gene signature could be obtained by running a differential gene expression analysis across the samples for each cell-type (e.g. myogenic vs ischemic) and applying a set of relevant cut-offs or filters on the results. In other words, does MOFA detect genes that would otherwise be missed?
    • Could scITD also be used for meta-analysis or could the obtained gene signatures of that method also be mapped to bulkRNAseq data? If so, it would be interesting to show the relative performance with MOFA. If not, this specific advantage should be highlighted.
    • Users need to specify gene set signatures based on the weights for a factor of interest. This might suggest a limitation to categorical covariates of interest. If the authors see potential for a continuous covariate of interest, this should at least be highlighted in the text and if possible demonstrated on a use case.

    Minor comments

    • In Figure 2c the association between factor 2 and the technical factor shows a very strong outlier. Please verify that the association is still significant after applying a more robust statistical test (e.g. non-parametric test as Wilcoxon).
    • For mapping the cell-type specific factor signatures to bulk transcriptomics, the exact comparison or model performed is unclear. There are seven cell-type signatures for each sample in every study. Was a t-test run for each cell-type or was a summary measure taken across the cell-types? The thresholding is also rather lenient (adj. p-val 0.1).
    • typo in abstract: In sum, our framework serves as an exploratory tool for unsupervised analysis of cross-condition single-cell atlas and allows for the integration of the measurements of patient cohorts across distinct data modalities
    • In Figure 4a it is not clear to me why on the one hand we see marker enrichment vs loading enrichment with healthy and disease.
    • In Figure 4b it would help if the same color scheme were maintained throughout the paper (here now black and white) and if, for the cell states, the boxplots were connected per condition, emphasizing the (absence of) change across cell states within a condition.
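
    To illustrate why a rank-based test is suggested for the outlier in Figure 2c, the sketch below computes a Wilcoxon rank-sum (Mann-Whitney U) statistic with a plain normal approximation, without tie or continuity correction. It assumes the technical factor is a two-group label; the function name `rank_sum_test` and the toy factor scores are hypothetical, and with so few observations the p-value is only a rough guide.

```python
import math


def rank_sum_test(x, y):
    """Mann-Whitney U of x versus y with a normal-approximation two-sided p-value."""
    # U counts the pairs in which the x value exceeds the y value (ties count 0.5).
    u = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)
    n1, n2 = len(x), len(y)
    mean = n1 * n2 / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean) / sd
    p = math.erfc(abs(z) / math.sqrt(2))
    return u, p


# Hypothetical factor scores in two technical groups.
group_a = [0.1, 0.2, 0.3]
u_outlier, p_outlier = rank_sum_test(group_a, [0.4, 0.5, 9.0])  # 9.0 is an extreme outlier
u_mild, p_mild = rank_sum_test(group_a, [0.4, 0.5, 0.6])        # same ordering, no outlier

print(u_outlier == u_mild, p_outlier == p_mild)
```

    Because the statistic depends only on the ordering of the values, the extreme observation has no more influence than any other value that ranks above the first group, which is what makes the test robust to a single strong outlier.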

    Significance

    General assessment:

    MOFA is well-suited for detecting multicellular programs because it can handle missing data and allows for easy interpretation of the factors as a linear method. It might have particular potential for meta-analysis across multiple studies and reevaluating bulkRNAseq data sets, but in the current manuscript it is unclear to what extent this is a specific advantage of MOFA or could also be done with competitors. The authors show how the obtained results and associations with clinical covariates can be validated across multiple data types. How the resulting multicellular programs can provide additional biological insight or form the starting point for additional downstream analysis (e.g. cell communication) is not covered in the paper.

    Audience: This paper will be mainly of interest to a specialized public interested in unsupervised methods for large scale multi-sample and multi-condition studies.

    Reviewer: main background in the analysis of scRNAseq data.

  4. Note: This preprint has been reviewed by subject experts for Review Commons. Content has not been altered except for formatting.


    Referee #1

    Evidence, reproducibility and clarity

    Remark to authors

    Flores et al. present a pipeline in which they leverage the MOFA framework, a matrix factorization algorithm, to infer multi-cellular programs (MCPs). Learning and using MCPs has already been proposed by others. Yet, the authors pursue a similar goal by using MOFA, providing a cell*sample matrix for different cell types as different views (instead of multiple modalities/views) as the input. They later apply MOFA using this data format in a series of applications to analyze acute and chronic human heart failure single-cell datasets using MCPs. The authors further try to expand their analysis by incorporating other modalities.

    Major points:

    As briefly outlined in the remarks, the current manuscript needs novel findings and methodology to warrant a research article, which I can't see here. The underlying matrix factorization is the original MOFA (literally imported in the code) with no modification to further optimize the method toward the task. While I appreciate and acknowledge the authors' efforts resulting in a detailed analysis of heart samples, I think all of these could have been part of MOFA's existing tutorials.

    How can you explain that the results in donor-level analyses are not due to technical artifacts (batch variation)? Can this be used to infer a new patient similarity map? For example, I would test this by leaving out a few patients from training, projecting them, and seeing where they would end up in the manifold or classifying disease conditions for new patients and explaining the classification by MCPs responsible for that condition.
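
    The leave-out test suggested here could look roughly like the sketch below, which scores one held-out sample on a single fixed factor by least squares after centering with the training feature means. This is a simplification under stated assumptions: the function name `project_on_factor` and all values are hypothetical, only one factor and one view are considered, and it is not claimed to be how MOFA itself would project new samples.

```python
def project_on_factor(x, w, feature_means):
    """Least-squares score of one held-out sample on a single fixed factor."""
    centered = [xi - mi for xi, mi in zip(x, feature_means)]
    numerator = sum(ci * wi for ci, wi in zip(centered, w))
    denominator = sum(wi * wi for wi in w)
    return numerator / denominator


# Hypothetical gene weights and feature means learned on the training patients.
w = [1.0, -1.0, 0.5]
feature_means = [2.0, 2.0, 1.0]

# Held-out pseudobulk profile, constructed as feature_means + 2 * w.
held_out = [4.0, 0.0, 2.0]

score = project_on_factor(held_out, w, feature_means)
print(score)
```

    Comparing such scores for held-out patients with the factor scores of the training patients would show where new patients land in the latent space, and a classifier trained on the training scores could then be evaluated on them.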

    The bulk and spatial analyses are used post hoc after running MOFA. Since MOFA can use non-overlapping feature sets, it would be interesting to see if deconvoluted bulk or ST data can be encoded as another view (one view from scRNAseq data for each cell-type and another view from bulk RNA-seq or ST; you can get normalized expression per spot (for ST) or per sample (for bulk) and use them as input).

    Minor:

    Some figure references are not correct (e.g., "the single-cell data into a multi-view data representation by estimating pseudo bulk gene expression profiles for each cell-type across samples (Figure 1b)." should be figure 2b)

    The paper is well written, but there could be some more clarifications about what authors consider as cell-type and cell-state, condition, MCPs which I think is critical to current analysis (see here https://linkinghub.elsevier.com/retrieve/pii/S0092867423001599) for the reader not familiar with those concepts.

    Significance

    While I find the concept of MCPs interesting, the current work seems like a series of vignettes and tutorials simply applying MOFA to different datasets (the authors rightfully state this). However, it needs to be clarified what the novelty is, since there is neither an algorithmic improvement over current MCP methods (because there is no new method) nor novel biological findings. Additionally, even in the current form, the applications are limited to the heart, and the generalization of this proposed analysis pipeline to other tissues and datasets is not explored. Overall, the paper lacks focus and novelty, which are required to warrant publication at this level.

    expertise: Computational biology, single-cell genomics, machine learning