Abstract
Cross-cohort integration of transcriptomic data is a routine strategy for boosting statistical power and enhancing generalizability. However, gene nomenclature inconsistencies across datasets—arising from annotation version updates, historical renaming, and synonym reassignment—introduce silent mismatches during feature alignment, causing genes to be falsely classified as absent or split into duplicate features. Here, we present geneSync, an R package that performs gene symbol harmonization as a quality-control (QC) step prior to data integration. geneSync uses a hierarchical matching strategy, prioritizing exact matches to authoritative gene symbols, then exact matches to National Center for Biotechnology Information (NCBI) gene symbols, and finally synonym-based fallback. It includes built-in offline databases for human, mouse, and rat, and supports auditable conflict resolution, cross-species ortholog mapping, and native integration with Seurat and SingleCellExperiment objects. Benchmarking across six mouse hippocampus scRNA-seq datasets spanning 2020–2025 and five CellRanger versions shows that 1.41%–6.22% of features require synonym resolution, and harmonization improves pairwise gene overlap by up to 13.14 percentage points, rescuing 707–1,098 genes per dataset pair. Notably, CellRanger annotation version—rather than data collection year—was identified as the primary driver of nomenclature discrepancy. geneSync is freely available at https://github.com/xiaoqqjun/geneSync.
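The three-tier matching order described above can be illustrated with a minimal sketch. This is not geneSync's API or implementation (geneSync is an R package). The function name `harmonize_symbol`, the tier labels, and the toy reference tables are all hypothetical, and the choice to leave ambiguous synonyms unresolved for later review is an assumption about what "auditable conflict resolution" might look like.

```python
from typing import Dict, List, Optional, Set, Tuple

# Toy reference tables, invented for illustration only. A real run would
# load a species-specific database instead of these hand-written sets.
AUTHORITATIVE: Set[str] = {"Septin7", "Gapdh"}
NCBI: Set[str] = {"Septin7", "Gapdh", "Actb"}
SYNONYMS: Dict[str, List[str]] = {
    "Sept7": ["Septin7"],        # one candidate: safe to resolve
    "Xyz1": ["GeneA", "GeneB"],  # hypothetical alias with two candidates
}


def harmonize_symbol(
    symbol: str,
    authoritative: Set[str],
    ncbi: Set[str],
    synonyms: Dict[str, List[str]],
) -> Tuple[Optional[str], str]:
    """Return (resolved_symbol, tier); resolved_symbol is None if unresolved."""
    # Tier 1: exact match to an authoritative symbol.
    if symbol in authoritative:
        return symbol, "authoritative"
    # Tier 2: exact match to an NCBI gene symbol.
    if symbol in ncbi:
        return symbol, "ncbi"
    # Tier 3: synonym fallback, accepted only when it is unambiguous.
    candidates = synonyms.get(symbol, [])
    if len(candidates) == 1:
        return candidates[0], "synonym"
    if len(candidates) > 1:
        return None, "ambiguous"  # assumption: flag for review, do not guess
    return None, "unmatched"


def harmonize_features(features: List[str]) -> Set[str]:
    """Map each feature to its resolved symbol, keeping the original if unresolved."""
    resolved = set()
    for symbol in features:
        target, _tier = harmonize_symbol(symbol, AUTHORITATIVE, NCBI, SYNONYMS)
        resolved.add(target if target is not None else symbol)
    return resolved


# Two toy cohorts whose feature lists differ only in nomenclature.
cohort_a = ["Sept7", "Gapdh", "Actb", "Xyz1"]
cohort_b = ["Septin7", "Gapdh", "Actb"]

shared_before = set(cohort_a) & set(cohort_b)
shared_after = harmonize_features(cohort_a) & harmonize_features(cohort_b)

print(len(shared_before), len(shared_after))
```

In this toy case the shared feature set grows from two symbols to three once `Sept7` is resolved to `Septin7`, which is the kind of rescue the benchmark quantifies. Returning the tier alongside each resolved symbol keeps a record of why a symbol was mapped.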
Competing Interest Statement
The authors have declared no competing interest.
Footnotes
Abbreviations
- QC: quality control
- NCBI: National Center for Biotechnology Information
- scRNA-seq: single-cell RNA sequencing
- snRNA-seq: single-nucleus RNA sequencing
- GTF: Gene Transfer Format
- FASTQ: FASTQ format
- HGNC: HUGO Gene Nomenclature Committee
- MGI: Mouse Genome Informatics
- RGD: Rat Genome Database
- ID: identifier
- RIKEN: RIKEN (The Institute of Physical and Chemical Research, Japan)
- GEO: Gene Expression Omnibus