Improved integration of single cell transcriptome data demonstrated on heart failure in mice and men

Abstract

Biomedical research frequently uses murine models to study disease mechanisms. However, translating these findings to human disease remains a significant challenge. To improve the comparability of mouse and human data, we present a cross-species integration pipeline for single-cell transcriptomic assays. The pipeline merges expression matrices and assigns clear orthologous relationships. Starting from Ensembl ortholog assignments, we allocated 82% of mouse genes to unique orthologs by using additional publicly available resources such as the UniProt and NCBI databases. For genes with multiple matches, we employed Needleman-Wunsch global alignment based on either amino acid or nucleotide sequence to identify the ortholog with the highest degree of similarity. The workflow was tested for its functionality and efficiency by integrating scRNA-seq datasets from heart failure patients with the corresponding mouse model. We were able to assign unique human orthologs to up to 80% of the mouse genes, utilizing the known 17,492 orthologous pairs. Curiously, the integration process enabled the identification of both common and unique regulatory pathways between species in heart failure. In conclusion, our pipeline streamlines the integration process, enhances gene nomenclature alignment and simplifies the translation of mouse models to human disease. We have made the OrthoIntegrate R package accessible on GitHub (https://github.com/MarianoRuzJurado/OrthoIntegrate), which includes the assignment of ortholog definitions for human and mouse, as well as the pipeline for integrating single cells.
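As an illustration of the alignment step described in the abstract, the following Python sketch scores candidate orthologs with a basic Needleman-Wunsch global alignment and keeps the best-scoring one. It is not taken from the OrthoIntegrate package: the function names, the toy sequences, and the scoring scheme (match +1, mismatch -1, linear gap -1) are assumptions chosen for brevity, whereas real pipelines typically use substitution matrices and affine gap penalties.

```python
from typing import Dict


def needleman_wunsch_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -1) -> int:
    """Return the optimal global alignment score of sequences a and b.

    Assumed toy scoring: +1 match, -1 mismatch, -1 per gap position.
    """
    # prev[j] is the best score for aligning the processed prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Best of: align both characters, gap in b, gap in a.
            curr[j] = max(diag, prev[j] + gap, curr[j - 1] + gap)
        prev = curr
    return prev[-1]


def best_ortholog(query_seq: str, candidates: Dict[str, str]) -> str:
    """Pick the candidate whose sequence has the highest global alignment score."""
    return max(candidates,
               key=lambda name: needleman_wunsch_score(query_seq, candidates[name]))


# Hypothetical mouse protein fragment with two hypothetical human candidates.
mouse_seq = "MKTAYIAK"
human_candidates = {"GENE_A": "MKTAYIAK", "GENE_B": "MKTWWIAKQQ"}
print(best_ortholog(mouse_seq, human_candidates))
```

Because the alignment is global, a candidate is penalized for every unaligned residue along its full length, so large length differences lower the score even when a short region matches well.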

Article activity feed

  1. Keypoints

    1. Novel integration workflow for scRNA-seq data from different species in an easy-to-use R package ("OrthoIntegrate").
    2. Improved one-to-one ortholog assignment via sequence similarity scores and string similarity calculations.
    3. Validation of "OrthoIntegrate" results with a case study of snRNA-seq from human heart failure with reduced ejection fraction and its corresponding mouse model.

    Reviewer 2: Yinqi Bai

    Comments to Author: Jurado et al. reported a pipeline designed to optimize the detection of orthologous genes and utilized it to enhance the integration of cross-species single-cell RNA sequencing (scRNA-seq) data. They demonstrated the effectiveness of this pipeline by comparing shared and distinct regulatory pathways between human HFrEF (Heart Failure with Reduced Ejection Fraction) patients and the corresponding mouse model. The work provided reliable results that emphasize the importance of exercising caution when using mouse models to study disease mechanisms. However, several important factors should be critically examined and benchmarked. Here are a few major issues:

    1. Ortholog identification has long been a critical and essential step for many comparative, evolutionary, and functional genomic analyses. To evaluate the performance of an orthology inference method, there are some gold standards available for benchmark testing, such as the Quest for Orthologs benchmark service (https://orthology.benchmarkservice.org). Whether OrthoIntegrate outperforms other methods should be comprehensively benchmarked on diverse datasets and metrics, rather than relying solely on the silhouette coefficient score from a heart single-cell RNA sequencing (scRNA-seq) dataset.
    2. According to the authors' integration pipeline, both human and mouse scRNA-seq data are individually clustered to assign cell type labels and are then further integrated with orthologous genes for clustering to assign new labels. How do the labels for each cell and each cell type change before and after the integration approach? Does cell type assignment become more reasonable after the integration? The authors should demonstrate that the selection of orthologous genes for clustering improves the accuracy of cell type assignment. The silhouette coefficient score is not a direct metric for assessing accuracy, as it can be influenced by biological factors. For example, in Supplementary Table 3, the silhouette scores of mouse-HFrEF samples generated by Paranoid and OMA are consistently higher than those by OrthoIntegrate, which is opposite to the control groups and human-HFrEF samples.
    3. The data analysis needs to be expanded further if there are findings with potential biological significance. For example, the authors mentioned, 'In cluster 25, we observe a group of genes showing increased expression in human FBs, and we also identify a set of genes that are negatively regulated in cluster 28 in human ECs.' However, there is no functional analysis, such as GO or KEGG pathway enrichment analysis, conducted to interpret the data and validate these findings.
    4. The discussion section is confusing. The authors should clarify whether the paper is primarily focused on research methods or data analysis. If it is a data analysis paper, the authors should conduct additional investigations to include further data analysis. If it is a research method paper, the authors should extend the discussion to relate to the algorithm itself.
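To make the concern about the silhouette coefficient in points 1 and 2 concrete, here is a minimal Python sketch of the standard per-point definition, s = (b - a) / max(a, b), where a is the mean distance to the other members of a point's own cluster and b is the lowest mean distance to the members of any other cluster. The coordinates and labels are invented toy data, not values from the manuscript, and Euclidean distance is an assumption.

```python
import math
from collections import defaultdict


def silhouette_scores(points, labels):
    """Per-point silhouette values; requires at least two distinct labels."""
    clusters = defaultdict(list)
    for idx, lab in enumerate(labels):
        clusters[lab].append(idx)

    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            # Convention: a singleton cluster gets a silhouette of 0.
            scores.append(0.0)
            continue
        # a: mean distance to the other points in the same cluster.
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the points of any other cluster.
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items() if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores


# Toy 2D embedding with two well-separated groups (invented data).
embedding = [(0, 0), (0, 1), (10, 0), (10, 1)]
matching = silhouette_scores(embedding, [0, 0, 1, 1])
scrambled = silhouette_scores(embedding, [0, 1, 0, 1])
print(sum(matching) / len(matching), sum(scrambled) / len(scrambled))
```

The score is computed only from distances in the embedding and the cluster labels, without any reference annotation, which is why it describes cluster separation rather than the correctness of cell type assignment.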

    Minor comments:

    1. The cell number for each sample and each clustered cell type is critical for assessing the reliability of the results; however, this information is not provided in the paper.
    2. As the mouse model is generated through chronic infarction, it raises the question of why very few T/B cell markers are found in immune cells in Figure 1F. Is it possible that these cell types are not adequately captured in the mouse samples? In data integration analysis, the audience may be more interested in understanding how species-specific cell types perform, particularly when, for instance, only macrophages are the dominant immune cells found in human samples.
    3. On page 5, clarify "latter ones" in the sentence "Most of the latter ones were long non-coding RNAs with identical gene names."
    4. On page 5, correct the reference so that it points to Supplementary Figure 4A rather than Supplementary Figure 3A and Supplementary Table 3.
    5. On page 16, replace "regulated genes" with "differentially expressed genes (DEGs)" to accurately represent what the authors are referring to.

    Re-review:

    The authors' additional analysis is commendable. With the inclusion of new evaluation metrics, the benchmark section now appears relatively comprehensive, and the explanations provided for the reduced NMI score are reasonable. In the results section, the supplementary information on functional enrichment further elucidates the biological functions of fibroblast cluster 25 and the endothelial cell cluster.
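For readers unfamiliar with the NMI (normalized mutual information) score mentioned in the re-review, the sketch below shows one common way to compute it between two labelings of the same cells, for example cell type labels before and after integration. The arithmetic-mean normalization and the toy labels are assumptions; the manuscript may use a different variant.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (natural log) of a labeling."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(u, v):
    """Mutual information between two labelings of the same items."""
    n = len(u)
    cu, cv, cuv = Counter(u), Counter(v), Counter(zip(u, v))
    return sum(
        (c / n) * math.log(c * n / (cu[a] * cv[b]))
        for (a, b), c in cuv.items()
    )


def nmi(u, v):
    """NMI with arithmetic-mean normalization (an assumed convention)."""
    hu, hv = entropy(u), entropy(v)
    if hu == 0 and hv == 0:
        # Both labelings put every item in a single group: treat as identical.
        return 1.0
    return mutual_information(u, v) / ((hu + hv) / 2)


# Invented example: four cells labeled before and after integration.
before = ["CM", "CM", "EC", "EC"]
after_same = ["a", "a", "b", "b"]    # same grouping, different names
after_mixed = ["a", "b", "a", "b"]   # grouping unrelated to the original
print(nmi(before, after_same), nmi(before, after_mixed))
```

NMI depends only on how cells are grouped, not on the label names, so it can compare clusterings produced with different ortholog lists against a fixed reference annotation.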

    There are still some minor suggestions for improvement:

    1. The presentation of the biological findings in the discussion section could be more succinct to improve clarity.
    2. There is a lack of discussion on the impact of the numerous lncRNAs generated by OrthoIntegrate. This topic requires further exploration and elaboration.
    3. Reorganize the paragraphs for "Single cell pre-processing" and "Study samples" to clarify the source of the data used in the article. Emphasize the data generated by the authors (E-MTAB-13264) and provide details on the single-cell sequencing process (not only the raw data pre-processing).
  2. This work has been published in the journal GigaScience under a CC-BY 4.0 license (https://doi.org/10.1093/gigascience/giae011), and the journal has published the reviews under the same license. These are as follows.

    Reviewer 1: Ruoyan Li

    Comments to Author: In the manuscript entitled 'Improved integration of single cell transcriptome data demonstrates common and unique signatures of heart failure in mice and humans', the authors developed a pipeline (OrthoIntegrate) to assign gene orthologs across species and integrate cross-species single-cell RNA-seq data based on Seurat workflows. The authors further compared OrthoIntegrate to other orthologue databases and tools and highlighted the better performance of their method. To illustrate the potential applications of OrthoIntegrate, the authors used the pipeline to integrate single-cell/single-nuclei RNA-seq data from cardiac tissue of heart failure patients with reduced ejection fraction (HFrEF) and a mouse model mimicking HFrEF. This revealed commonly regulated genes in the disease condition between species (e.g., genes related to cardiomyocyte energy metabolism) and species-specifically regulated genes (e.g., angiogenesis-related genes in humans). Overall, this is a well-designed study with the development of a useful cross-species single-cell data integration pipeline whose applications have been showcased in the context of heart failure (to me it is more like an improved orthologue assignment method).

    A few points need to be addressed before publishing:

    1. The authors utilized the Needleman-Wunsch algorithm to generate one-to-one orthologs between human genes and mouse genes. What is the advantage of using this algorithm compared to other algorithms, e.g., BLAST as used by SAMap?
    2. The authors have shown the application of OrthoIntegrate in the context of heart failure between mice and humans. Could the authors include at least one more example of using OrthoIntegrate in other disease conditions or between other species to show the versatility of OrthoIntegrate?
    3. To assess the quality of clustering after integration, the authors calculated silhouette coefficients/scores and found that integration by OrthoIntegrate resulted in an improved clustering performance. Could the authors include more benchmarking metrics to assess the performance of OrthoIntegrate compared to other methods? The authors could consider metrics like the species mixing score used by BENGAL (Song et al., 2022, bioRxiv; https://github.com/Functional-Genomics/BENGAL).
    4. Miscalling of figures: silhouette coefficients are shown in Supp_Fig_4 rather than Suppl_Fig_3.
    5. Some information on the datasets used in the manuscript is shown in Supplementary Table 1, but it is still a bit confusing, for example, where the mouse and human HFrEF datasets come from. I am not exactly sure, but I presume the HFrEF datasets are from E-MTAB-13264? This information should be described more explicitly in the Methods section.