Genes in Humans and Mice: Insights from Deep Learning of 777K Bulk Transcriptomes

Abstract

Mice are widely used as animal models in biomedical research, favored for their small size, ease of breeding, and anatomical and physiological similarities to humans 1,2. However, discrepancies between mouse gene experiment results and the actual behavior of human genes are not uncommon, despite their shared DNA sequence similarity 3-8. This suggests that DNA sequence similarity does not always reliably predict functional similarity. On the other hand, RNA expression of genes could offer additional information about gene function 9,10. However, comprehensive characterization of genes through their expression can be challenging with traditional methods, due to the dynamic nature of gene expression and its high variability in different biological contexts. In this study, we undertook characterization and inter-species comparison of human and mouse genes by applying innovative deep learning methodologies to a large dataset of 410K human and 366K mouse bulk RNA-seq samples. This was achieved by using gene representations from our Transformer-based GeneRAIN model 11,12. These gene representations, aggregating information from large gene expression datasets, provided insights beyond DNA sequence similarity, helping to elucidate differences in disease and phenotype associations between human and mouse genes. We propose that this approach will support future decision-making around whether the mouse will be an appropriate model for studying specific human genes, and whether the results of specific mouse gene studies are likely to be recapitulated in humans. Our methodological innovations offer valuable lessons for future deep learning applications in cross-species omics data. The interspecies gene relationship findings from our study can contribute valuable insights to enhance our understanding of the biology and evolution of the two species.

Article activity feed

  1. Aligning gene embeddings from different species into the same space opens up the possibility of using advanced deep learning technology for inter-species comparisons.

    Are your methods versatile enough to allow this approach for new species pairs if there is enough RNA-seq data? What is the minimum number of RNA-seq samples needed?

  2. These genes have low RNA similarities in mice, suggesting that the correlation of these genes with other genes and possibly their functions have diverged between mice and humans. This divergence could contribute to the discrepancies observed between the human and mouse studies.

    Is there a way to look at this more systematically? This is hugely valuable.

  3. This suggests that our approach, although relying solely on transcriptome data, can achieve superior performance compared to models that incorporate multi-omics data.

    Do you think this is because of the expression levels of genes or some other signal? I think unpacking this more would add a lot of value for understanding this model.

  4. To mitigate bias from using a single phenotype annotation dataset, we analyzed another dataset, specifically the ‘mouse models of human diseases’ annotations from the MGI database. This dataset includes information about human diseases, their mouse models, and the associated genes. Given the low number of shared associated diseases among the homologous genes (Extended Data Fig. 8b), we focused on comparing the proportion of each gene group with shared disease association(s). Results indicated that homologs with high RNA similarity have the largest proportion of shared association(s). In contrast, homologs with low RNA similarity showed a smaller proportion, even if their DNA similarity is high (Fig. 3b and Supplementary Table 2).

    This is very cool. It would be interesting to correlate this with the number of failed clinical trials for therapies developed in mice and applied to humans. It might also be interesting to see if there are other DNA signals that could be used to improve DNA performance. Promoter sequences come to mind, but there might be others.

  5. The experiment indicated that it was not due to methodology limitations that fewer than 5,000 of 16,983 mouse genes were the nearest embedding neighbors of their human orthologs.

    Then what was it due to?

  6. For 6,007 human genes, their mouse orthologs were among the ten nearest embedding neighbors (Fig. 1g).

    What were the other, nearer embedding neighbors? Copy number variants of the same genes? Genes in the same pathway?

  7. Thus, we forced the 5,000 randomly selected one-to-one orthologous gene pairs to have the same gene embeddings (Fig. 1b).

    Do you have confirmation that all information captured in the embeddings is synonymous between these 1:1 orthologs? For example, do the orthologs still share the same promoters and transcription factors? I think both of these would be fairly straightforward to check at scale.

  8. This includes gene-associated diseases/phenotypes, protein interactions, transcription factors, biological pathways, gene ontology, associated cell types and more.

    Thank you for including these details; this is very helpful.

  9. Although multiple mouse gene expression studies have been conducted, they have often relied on techniques such as dimension reduction, phylogenetic clustering, co-expression analysis, and differential expression analysis16,18-33. These traditional methods face challenges in achieving a comprehensive comparison of genes at the RNA level, as they can be susceptible to batch effects, biased by small sample sizes, and constrained by the limited availability of samples from matched biological conditions33,34. Given the dynamic and complex nature of gene expression, which varies across genders, ages, tissues, and conditions, a thorough characterization at the RNA level necessitates integrating data from diverse biological contexts and a large collection of samples.

    I don't think this paragraph does justice to the work already undertaken, and I don't think it highlights why there was a gap for this work to fill. I think the goals of the previous compendia that you included were very different from the goal of this paper, and I think that's ok! Unpacking that a bit more in this intro would be useful. Otherwise, it makes it sound like previous researchers made mistakes and that's why we need this paper, which I don't think is true.
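
Comments 5 and 6 both ask where a mouse ortholog falls among the embedding neighbors of its human gene. A minimal sketch of how that rank could be computed is shown here. It is not the authors' code: it assumes the embeddings are already aligned in one space, uses cosine similarity as the neighbor metric, ranks against mouse genes only, and runs on random toy data. The names `ortholog_ranks`, `human_emb`, `mouse_emb`, and `ortholog_of` are hypothetical.

```python
import numpy as np


def ortholog_ranks(human_emb, mouse_emb, ortholog_of):
    """Return, for each human gene, the 1-based rank of its mouse ortholog
    among all mouse genes ordered by cosine similarity (1 = nearest).

    Assumption: both matrices live in the same aligned embedding space.
    ortholog_of[i] is the row index in mouse_emb of human gene i's ortholog.
    """
    # Normalize rows so that a dot product equals cosine similarity.
    h = human_emb / np.linalg.norm(human_emb, axis=1, keepdims=True)
    m = mouse_emb / np.linalg.norm(mouse_emb, axis=1, keepdims=True)
    sim = h @ m.T  # shape: (n_human, n_mouse)

    # Similarity of each human gene to its own ortholog.
    ortho_sim = sim[np.arange(h.shape[0]), ortholog_of]

    # Rank = 1 + number of mouse genes strictly more similar than the ortholog.
    return 1 + (sim > ortho_sim[:, None]).sum(axis=1)


# Toy data (random, for illustration only; not taken from the paper).
rng = np.random.default_rng(0)
mouse_emb = rng.normal(size=(50, 16))
human_emb = mouse_emb + 0.1 * rng.normal(size=(50, 16))  # near-identical orthologs
human_emb[0] = mouse_emb[1]  # one deliberately "diverged" gene
ortholog_of = np.arange(50)

ranks = ortholog_ranks(human_emb, mouse_emb, ortholog_of)
n_top1 = int((ranks == 1).sum())    # orthologs that are the nearest neighbor
n_top10 = int((ranks <= 10).sum())  # orthologs within the ten nearest neighbors
print(f"top-1: {n_top1}, top-10: {n_top10}, rank of diverged gene: {ranks[0]}")
```

Listing the mouse genes that outrank the ortholog for each human gene (the columns where `sim > ortho_sim`) would be one way to check whether the nearer neighbors are paralogs or pathway members, as comment 6 asks.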