Batch-Harmonized Machine Learning Framework for Cross-Cohort RNA Biomarker Discovery in Pancreatic Adenocarcinoma

2025 · doi:10.1101/2025.11.14.688421

preprint OA: closed


Abstract

Background

Pancreatic ductal adenocarcinoma (PDAC) lacks reliable prognostic biomarkers. RNA-based signatures suffer from poor reproducibility due to batch effects and platform heterogeneity between microarray and RNA-seq data, limiting machine learning applications.

Methods

We developed a computational pipeline that harmonizes RNA-seq data from multiple repositories using ComBat batch correction, followed by Random Forest and XGBoost classification. Restricting the analysis to RNA-seq platforms yielded 14,137 genes common to TCGA-PAAD (n=178) and the validation cohort GSE71729 (n=357). We quantified batch correction efficacy with silhouette coefficients and trained models on survival outcomes.
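The gene-matching step described above can be illustrated with a minimal sketch. The paper's pipeline is written in R and its code is not reproduced here; the Python below is only an assumption about how such an intersection might look. The function names `common_genes` and `restrict`, the placeholder genes `GENE_A` and `GENE_B`, and all expression values are hypothetical.

```python
def common_genes(*cohorts):
    """Return the sorted gene symbols present in every cohort.

    Each cohort is a dict mapping gene symbol -> list of expression values
    (a hypothetical layout chosen for this sketch).
    """
    shared = set(cohorts[0])
    for cohort in cohorts[1:]:
        shared &= set(cohort)
    return sorted(shared)


def restrict(cohort, genes):
    """Keep only the listed genes so all cohorts share one feature space."""
    return {gene: cohort[gene] for gene in genes}


# Toy data: two samples in the first cohort, three in the second.
# Values are made up; GENE_A and GENE_B are placeholders.
cohort_a = {"LAMC2": [5.1, 6.3], "DKK1": [2.0, 2.4], "GENE_A": [1.0, 1.1]}
cohort_b = {"LAMC2": [7.9, 8.2, 8.0], "DKK1": [4.1, 3.9, 4.4], "GENE_B": [0.5, 0.7, 0.6]}

shared = common_genes(cohort_a, cohort_b)
cohort_a_matched = restrict(cohort_a, shared)
cohort_b_matched = restrict(cohort_b, shared)
```

Only genes measured in both cohorts survive, which is why platform-matched datasets can retain more features than mixed-platform ones.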

Results

ComBat correction eliminated dataset-specific clustering (silhouette coefficient: 0.866 → -0.012). Random Forest achieved 64% training accuracy and identified five prognostic biomarkers: LAMC2, DKK1, ITGB6, GPRC5A, and MAL2. These genes showed consistent importance across models and biological relevance to invasion, epithelial-mesenchymal transition, and tumor suppression. Models successfully generalized to independent validation data.
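To make the silhouette figures concrete, here is a minimal sketch of how a batch-label silhouette coefficient can be computed. For each sample, s = (b - a) / max(a, b), where a is the mean distance to samples in the same batch and b is the smallest mean distance to samples in any other batch. Values near 1 mean samples cluster by batch; values near or below 0 mean batches are mixed. The abstract does not state which distance or feature space was used, so Euclidean distance on raw toy coordinates is an assumption. The helper `center_per_batch` is a deliberately simplified stand-in (per-batch mean centering), not ComBat, which additionally uses empirical Bayes shrinkage of location and scale parameters.

```python
import math


def silhouette(points, labels):
    """Mean silhouette coefficient with Euclidean distance (assumed metric)."""
    def dist(p, q):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))

    groups = set(labels)
    scores = []
    for i, p in enumerate(points):
        # Mean distance from sample i to each group, excluding itself.
        mean_d = {}
        for g in groups:
            ds = [dist(p, q) for j, q in enumerate(points)
                  if labels[j] == g and j != i]
            if ds:
                mean_d[g] = sum(ds) / len(ds)
        own = labels[i]
        if own not in mean_d:
            # Singleton group: the score is conventionally 0.
            scores.append(0.0)
            continue
        a = mean_d[own]
        b = min(v for g, v in mean_d.items() if g != own)
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


def center_per_batch(points, labels):
    """Subtract each batch's per-feature mean. Simplified; NOT ComBat."""
    dim = len(points[0])
    out = [None] * len(points)
    for g in set(labels):
        idx = [i for i, lab in enumerate(labels) if lab == g]
        means = [sum(points[i][d] for i in idx) / len(idx) for d in range(dim)]
        for i in idx:
            out[i] = [points[i][d] - means[d] for d in range(dim)]
    return out


# Toy data (made up): two batches separated by a constant offset.
points = [[0.0, 0.1], [0.2, -0.1], [-0.2, 0.0],
          [5.0, 5.1], [5.2, 4.9], [4.8, 5.0]]
labels = ["A", "A", "A", "B", "B", "B"]

before = silhouette(points, labels)
after = silhouette(center_per_batch(points, labels), labels)
print(f"silhouette before: {before:.3f}, after: {after:.3f}")
```

On this toy data the score is high before centering and falls to zero or below afterwards, which mirrors the direction of the reported change without reproducing the paper's numbers.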

Conclusions

We present the first open-source R pipeline optimized for RNA-seq-based, cross-cohort biomarker discovery in pancreatic cancer. Platform-matched datasets yielded superior gene coverage versus multi-platform approaches, enabling robust machine learning classification. Our framework identifies five novel prognostic genes and provides a reproducible method for multi-center RNA biomarker studies, available through an interactive Shiny application.

Availability

All code, processed data, and the interactive Shiny application are available at https://github.com/MarkBarsoumMarkarian/rna-harmonization-ai

Competing Interest Statement

The authors have declared no competing interest.
