Protein large language model assisted one-to-one gene homology mapping in cross-species single-cell transcriptome integration

Abstract

Cross-species integration of single-cell transcriptomes requires establishing gene correspondences so that expression profiles can be compared across organisms. Current approaches predominantly rely on Ensembl homology tables, whose default many-to-many mappings often amplify gene-family effects and introduce artifactual micro-clusters that lack clear cell-type identity, complicating biological interpretation. Restricting mappings to a one-to-one scheme suppresses such artifacts, but it reduces the number of homologous gene pairs by approximately 8% (∼900 pairs). To address this limitation, we developed a protein large language model (pLLM)-based gene homology mapping strategy that increases the number of homologous gene pairs. By integrating pLLM-derived representations with sequence similarity, we constructed a fused mapping approach, which achieved the top performance in a comprehensive benchmark based on a curated cross-species atlas spanning nine datasets, 11 species, and over 3.2 million cells. Our method further identifies previously unannotated cell-type marker pairs, facilitating the discovery of novel cross-species markers. These results establish a robust framework for gene homology mapping in cross-species transcriptome integration, improving both accuracy and biological interpretability.
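
To make the idea of a fused one-to-one mapping concrete, the following is a minimal illustrative sketch, not the authors' implementation. The abstract does not state how pLLM representations and sequence similarity are combined, so the weighted sum, the `weight` parameter, the reciprocal-best-hit selection, and all gene names and values are assumptions made purely for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def fused_scores(emb_a, emb_b, seq_identity, weight=0.5):
    """Score every cross-species gene pair.

    ASSUMPTION: the fusion rule is a simple weighted sum of embedding
    cosine similarity and sequence identity (both hypothetical inputs).
    Pairs missing from seq_identity are treated as identity 0.0.
    """
    scores = {}
    for gene_a, vec_a in emb_a.items():
        for gene_b, vec_b in emb_b.items():
            identity = seq_identity.get((gene_a, gene_b), 0.0)
            scores[(gene_a, gene_b)] = (
                weight * cosine(vec_a, vec_b) + (1 - weight) * identity
            )
    return scores


def reciprocal_best_pairs(scores):
    """Keep only pairs where each gene is the other's top-scoring partner.

    ASSUMPTION: reciprocal best hit is used here as one common way to
    enforce a one-to-one mapping; the source does not name its criterion.
    """
    best_for_a, best_for_b = {}, {}
    for (gene_a, gene_b), score in scores.items():
        if gene_a not in best_for_a or score > scores[(gene_a, best_for_a[gene_a])]:
            best_for_a[gene_a] = gene_b
        if gene_b not in best_for_b or score > scores[(best_for_b[gene_b], gene_b)]:
            best_for_b[gene_b] = gene_a
    return sorted(
        (gene_a, gene_b)
        for gene_a, gene_b in best_for_a.items()
        if best_for_b[gene_b] == gene_a
    )


# Toy inputs (entirely made up): two genes in species A, three in species B.
emb_a = {"A1": [1.0, 0.0], "A2": [0.0, 1.0]}
emb_b = {"B1": [0.9, 0.1], "B2": [0.1, 0.9], "B3": [0.8, 0.2]}
seq_identity = {("A1", "B1"): 0.9, ("A1", "B3"): 0.7, ("A2", "B2"): 0.8}

pairs = reciprocal_best_pairs(fused_scores(emb_a, emb_b, seq_identity))
print(pairs)
```

In this toy example the paralog-like gene `B3` resembles `A1` but scores below `B1`, so it is left unpaired, which mirrors how a one-to-one scheme avoids the gene-family duplication that many-to-many tables introduce.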
