geneSync: Gene Symbol Harmonization for Large-scale RNA-seq Data Integration
Discuss this preprint
Start a discussion What are Sciety discussions?Listed in
This article is not in any list yet, why not save it to one of your lists.Abstract
Cross-cohort integration of transcriptomic data is a routine strategy for boosting statistical power and enhancing generalizability. However, gene nomenclature inconsistencies across datasets—arising from annotation version updates, historical renaming, and synonym reassignment—introduce silent mismatches during feature alignment, causing genes to be falsely classified as absent or split into duplicate features. Here, we present geneSync, an R package that performs gene symbol harmonization as a quality-control (QC) step prior to data integration. geneSync uses a hierarchical matching strategy, prioritizing exact matches to authoritative gene symbols, then exact matches to National Center for Biotechnology Information (NCBI) gene symbols, and finally synonym-based fallback. It includes built-in offline databases for human, mouse, and rat, and supports auditable conflict resolution, cross-species ortholog mapping, and native integration with Seurat and SingleCellExperiment objects. Benchmarking across six mouse hippocampus scRNA-seq datasets spanning 2020–2025 and five CellRanger versions shows that 1.41%–6.22% of features require synonym resolution, and harmonization improves pairwise gene overlap by up to 13.14 percentage points, rescuing 707–1,098 genes per dataset pair. Notably, CellRanger annotation version—rather than data collection year—was identified as the primary driver of nomenclature discrepancy. geneSync is freely available at https://github.com/xiaoqqjun/geneSync .