geneSync: Gene Symbol Harmonization for Large-scale RNA-seq Data Integration

doi:10.64898/2026.05.04.722831

2026 · doi:10.64898/2026.05.04.722831

preprint OA: closed

Abstract

Cross-cohort integration of transcriptomic data is a routine strategy for boosting statistical power and enhancing generalizability. However, gene nomenclature inconsistencies across datasets (arising from annotation version updates, historical renaming, and synonym reassignment) introduce silent mismatches during feature alignment, causing genes to be falsely classified as absent or split into duplicate features. Here, we present geneSync, an R package that performs gene symbol harmonization as a quality-control (QC) step prior to data integration. geneSync uses a hierarchical matching strategy, prioritizing exact matches to authoritative gene symbols, then exact matches to National Center for Biotechnology Information (NCBI) gene symbols, and finally synonym-based fallback. It includes built-in offline databases for human, mouse, and rat, and supports auditable conflict resolution, cross-species ortholog mapping, and native integration with Seurat and SingleCellExperiment objects. Benchmarking across six mouse hippocampus scRNA-seq datasets spanning 2020–2025 and five CellRanger versions shows that 1.41%–6.22% of features require synonym resolution, and harmonization improves pairwise gene overlap by up to 13.14 percentage points, rescuing 707–1,098 genes per dataset pair. Notably, CellRanger annotation version, rather than data collection year, was identified as the primary driver of nomenclature discrepancy. geneSync is freely available at https://github.com/xiaoqqjun/geneSync.

Competing Interest Statement

The authors have declared no competing interest.
Footnotes

Abbreviations

QC: quality control
NCBI: National Center for Biotechnology Information
scRNA-seq: single-cell RNA sequencing
snRNA-seq: single-nucleus RNA sequencing
GTF: Gene Transfer Format
FASTQ: FASTQ format
HGNC: HUGO Gene Nomenclature Committee
MGI: Mouse Genome Informatics
RGD: Rat Genome Database
ID: identifier
RIKEN: RIKEN (The Institute of Physical and Chemical Research, Japan)
GEO: Gene Expression Omnibus
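
The hierarchical matching strategy described in the abstract (authoritative symbol first, then NCBI symbol, then synonym fallback) can be illustrated with a minimal Python sketch. This is not the geneSync API, which is an R package whose functions are not shown in this text. The lookup tables, the function names (`harmonize_symbol`, `harmonize_features`, `overlap_percent`), and all gene symbols are hypothetical placeholders. Two further assumptions are made for illustration only: unmatched symbols are kept unchanged, and pairwise overlap is measured as shared features divided by the union. The abstract does not specify either choice, and conflict resolution (one synonym pointing to several genes) is left out.

```python
from typing import Dict, Iterable, Optional, Set, Tuple

# Toy lookup tables. Every symbol here is a made-up placeholder, not a real gene.
AUTHORITATIVE: Set[str] = {"GeneA", "GeneB", "GeneC"}
NCBI_TO_AUTHORITATIVE: Dict[str, str] = {"GeneB_ncbi": "GeneB"}
SYNONYM_TO_AUTHORITATIVE: Dict[str, str] = {"OldC": "GeneC"}


def harmonize_symbol(symbol: str) -> Tuple[Optional[str], str]:
    """Return (harmonized symbol or None, name of the tier that matched)."""
    # Tier 1: exact match to an authoritative symbol.
    if symbol in AUTHORITATIVE:
        return symbol, "authoritative"
    # Tier 2: exact match to an NCBI symbol.
    if symbol in NCBI_TO_AUTHORITATIVE:
        return NCBI_TO_AUTHORITATIVE[symbol], "ncbi"
    # Tier 3: synonym-based fallback.
    if symbol in SYNONYM_TO_AUTHORITATIVE:
        return SYNONYM_TO_AUTHORITATIVE[symbol], "synonym"
    return None, "unmatched"


def harmonize_features(symbols: Iterable[str]) -> Set[str]:
    """Harmonize a feature list; unmatched symbols are kept unchanged (assumption)."""
    result: Set[str] = set()
    for s in symbols:
        mapped, _tier = harmonize_symbol(s)
        result.add(mapped if mapped is not None else s)
    return result


def overlap_percent(a: Set[str], b: Set[str]) -> float:
    """Shared features as a percentage of the union (assumed definition)."""
    union = a | b
    return 100.0 * len(a & b) / len(union) if union else 0.0


# Two toy datasets that name the same three genes differently.
dataset_1 = {"GeneA", "GeneB_ncbi", "OldC"}
dataset_2 = {"GeneA", "GeneB", "GeneC"}

before = overlap_percent(dataset_1, dataset_2)
h1, h2 = harmonize_features(dataset_1), harmonize_features(dataset_2)
after = overlap_percent(h1, h2)
rescued = len(h1 & h2) - len(dataset_1 & dataset_2)

print(f"overlap before: {before:.1f}%, after: {after:.1f}%, rescued genes: {rescued}")
# prints: overlap before: 20.0%, after: 100.0%, rescued genes: 2
```

Returning the matching tier alongside the harmonized symbol keeps each decision traceable, which is one simple way to make the mapping auditable. The toy numbers are exaggerated by design and say nothing about the gains reported in the benchmark.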

Source provenance

europepmc: last seen: 2026-05-20T01:45:00.602351+00:00