geneSync: Gene Symbol Harmonization for Large-scale RNA-seq Data Integration

Zhijun Feng
Ting Li

Abstract

Cross-cohort integration of transcriptomic data is a routine strategy for boosting statistical power and enhancing generalizability. However, gene nomenclature inconsistencies across datasets (arising from annotation version updates, historical renaming, and synonym reassignment) introduce silent mismatches during feature alignment, causing genes to be falsely classified as absent or split into duplicate features. Here, we present geneSync, an R package that performs gene symbol harmonization as a quality-control (QC) step prior to data integration. geneSync uses a hierarchical matching strategy, prioritizing exact matches to authoritative gene symbols, then exact matches to National Center for Biotechnology Information (NCBI) gene symbols, and finally synonym-based fallback. It includes built-in offline databases for human, mouse, and rat, and supports auditable conflict resolution, cross-species ortholog mapping, and native integration with Seurat and SingleCellExperiment objects. Benchmarking across six mouse hippocampus scRNA-seq datasets spanning 2020–2025 and five CellRanger versions shows that 1.41%–6.22% of features require synonym resolution, and harmonization improves pairwise gene overlap by up to 13.14 percentage points, rescuing 707–1,098 genes per dataset pair. Notably, CellRanger annotation version, rather than data collection year, was identified as the primary driver of nomenclature discrepancy. geneSync is freely available at https://github.com/xiaoqqjun/geneSync.
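The three-tier matching order described in the abstract can be illustrated with a minimal sketch. This is not geneSync's API or its implementation (the package itself is written in R). The lookup tables, the function names `harmonize` and `overlap_pct`, and the placeholder symbols are all hypothetical, and the overlap metric (shared features divided by the union) is an assumption, since the abstract does not define how pairwise overlap is computed.

```python
from typing import Dict, Optional, Set, Tuple

# Toy lookup tables with placeholder symbols (hypothetical, not geneSync's database).
AUTHORITATIVE: Set[str] = {"GeneA", "GeneB"}             # current approved symbols
NCBI_TO_AUTH: Dict[str, str] = {"GeneB_ncbi": "GeneB"}   # NCBI symbol -> approved symbol
SYNONYM_TO_AUTH: Dict[str, str] = {"OldA": "GeneA"}      # historical alias -> approved symbol


def harmonize(symbol: str) -> Tuple[Optional[str], str]:
    """Return (harmonized symbol, tier that matched); unresolved symbols give None."""
    if symbol in AUTHORITATIVE:        # tier 1: exact authoritative match
        return symbol, "authoritative"
    if symbol in NCBI_TO_AUTH:         # tier 2: exact NCBI symbol match
        return NCBI_TO_AUTH[symbol], "ncbi"
    if symbol in SYNONYM_TO_AUTH:      # tier 3: synonym-based fallback
        return SYNONYM_TO_AUTH[symbol], "synonym"
    return None, "unmatched"


def overlap_pct(a: Set[str], b: Set[str]) -> float:
    """Shared features as a percentage of the union (assumed metric)."""
    union = a | b
    return 100.0 * len(a & b) / len(union) if union else 0.0


# Two toy feature sets that name the same genes differently.
dataset_1 = {"GeneA", "GeneB"}
dataset_2 = {"OldA", "GeneB_ncbi"}

# Keep the original symbol when no tier resolves it.
harmonized_2 = {harmonize(s)[0] or s for s in dataset_2}

before = overlap_pct(dataset_1, dataset_2)
after = overlap_pct(dataset_1, harmonized_2)
```

Returning the matching tier alongside the symbol keeps each mapping traceable, which is one simple way to make synonym-based decisions reviewable. In this toy case the two datasets share no feature names before harmonization and share both afterwards.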

Version published to 10.64898/2026.05.04.722831 on bioRxiv
May 7, 2026

Related articles

geneslator: an R package for comprehensive gene identifier conversion and annotation

This article has 6 authors:
1. Giulia Cavallaro
2. Giovanni Micale
3. Grete Francesca Privitera
4. Alfredo Pulvirenti
5. Stefano Forte
6. Salvatore Alaimo
This article has no evaluations. Latest version: Apr 1, 2026

MAJEC: unified gene, isoform, and locus-level transposable element quantification from RNA-seq

This article has 2 authors:
1. Tian-Yeh Lim
2. Ari J. Firestone
This article has no evaluations. Latest version: Apr 14, 2026

Disease-guided functional gene mapping across species reveals translational correspondences beyond sequence orthology

This article has 2 authors:
1. Jinyun Yan
2. Ze Cao
This article has no evaluations. Latest version: May 13, 2026
