Batch-Harmonized Machine Learning Framework for Cross-Cohort RNA Biomarker Discovery in Pancreatic Adenocarcinoma

preprint OA: closed

Abstract

Background

Pancreatic ductal adenocarcinoma (PDAC) lacks reliable prognostic biomarkers. RNA-based signatures suffer from poor reproducibility due to batch effects and platform heterogeneity between microarray and RNA-seq data, limiting machine learning applications.

Methods

We developed a computational pipeline that harmonizes RNA-seq data from multiple repositories using ComBat batch correction, followed by Random Forest and XGBoost classification. By restricting the analysis to RNA-seq platforms, we retained 14,137 genes common to TCGA-PAAD (n=178) and the validation cohort GSE71729 (n=357). We quantified batch correction efficacy via silhouette coefficients and trained models on survival outcomes.
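
The harmonization step described above can be illustrated with a minimal sketch. It is written in Python for illustration only and is not the authors' R code. It intersects the gene sets of two cohorts and then standardizes each gene within each cohort. This per-batch standardization is a simplified stand-in for ComBat, which additionally applies empirical Bayes shrinkage and can model covariates. The function names, the dictionary layout, and the toy values are all assumptions.

```python
from statistics import mean, pstdev


def common_genes(expr_a, expr_b):
    """Return the sorted gene IDs present in both cohorts."""
    return sorted(set(expr_a) & set(expr_b))


def center_scale_per_batch(expr, genes):
    """Standardize each gene within one cohort (mean 0, SD 1).

    Simplified location/scale adjustment, NOT ComBat: there is no
    empirical Bayes shrinkage and no covariate model.
    """
    adjusted = {}
    for gene in genes:
        values = expr[gene]
        mu = mean(values)
        sd = pstdev(values) or 1.0  # constant genes: avoid division by zero
        adjusted[gene] = [(v - mu) / sd for v in values]
    return adjusted


# Hypothetical toy data: gene -> expression values across samples.
cohort_a = {
    "LAMC2": [5.0, 7.0, 9.0],
    "DKK1": [1.0, 2.0, 3.0],
    "GENE_A_ONLY": [0.0, 0.0, 0.0],
}
cohort_b = {
    "LAMC2": [15.0, 17.0, 19.0],
    "DKK1": [11.0, 12.0, 13.0],
    "GENE_B_ONLY": [4.0, 4.0, 4.0],
}

shared = common_genes(cohort_a, cohort_b)          # genes usable in both cohorts
adj_a = center_scale_per_batch(cohort_a, shared)   # cohort-specific offset removed
adj_b = center_scale_per_batch(cohort_b, shared)
```

In this toy case the second cohort is shifted by a constant offset, so after adjustment both cohorts have identical values for the shared genes. Real batch effects are rarely this clean, which is one reason a method such as ComBat is used instead.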

Results

ComBat correction eliminated dataset-specific clustering (silhouette coefficient: 0.866 → -0.012). Random Forest achieved 64% training accuracy and identified five prognostic biomarkers: LAMC2, DKK1, ITGB6, GPRC5A, and MAL2. These genes showed consistent importance across models and biological relevance to invasion, epithelial-mesenchymal transition, and tumor suppression. Models successfully generalized to independent validation data.
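
To make the silhouette figures concrete, the following is a minimal Python sketch of a silhouette coefficient computed with dataset labels as the cluster assignment. A score near 1 means samples group by dataset, while a score near or below 0 means the datasets overlap. The Euclidean distance, the two-dimensional toy points, and the label names are assumptions; the abstract does not state which feature space or distance the authors used.

```python
from math import dist
from statistics import mean


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all samples (Euclidean distance)."""
    scores = []
    for i, (p, lab) in enumerate(zip(points, labels)):
        # a: mean distance to the other samples sharing this label
        same = [
            dist(p, q)
            for j, (q, other) in enumerate(zip(points, labels))
            if other == lab and j != i
        ]
        if not same:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        a = mean(same)
        # b: smallest mean distance to the samples of any other label
        b = min(
            mean([dist(p, q) for q, l in zip(points, labels) if l == other])
            for other in set(labels)
            if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return mean(scores)


# Hypothetical toy samples, labeled by the dataset they came from.
labels = ["TCGA", "TCGA", "GSE", "GSE"]

# Before correction: samples sit in two dataset-specific groups.
separated = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
# After correction: each dataset is spread across both groups.
mixed = [(0.0, 0.0), (10.0, 0.0), (0.0, 1.0), (10.0, 1.0)]

score_separated = silhouette_score(separated, labels)
score_mixed = silhouette_score(mixed, labels)
```

Here `score_separated` is high because each sample is far closer to its own dataset than to the other, and `score_mixed` is much lower because dataset membership no longer predicts position.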

Conclusions

We present the first open-source R pipeline optimized for RNA-seq-based, cross-cohort biomarker discovery in pancreatic cancer. Platform-matched datasets yielded superior gene coverage versus multi-platform approaches, enabling robust machine learning classification. Our framework identifies five novel prognostic genes and provides a reproducible method for multi-center RNA biomarker studies, available through an interactive Shiny application.

Availability

All code, processed data, and the interactive Shiny application are available at: https://github.com/MarkBarsoumMarkarian/rna-harmonization-ai

Competing Interest Statement

The authors have declared no competing interest.

Citation neighborhood (no data yet)

We don't have any in-corpus citations linked to this paper yet. This is a recent paper (2025) — citers typically take a year or two to land, and the OpenAlex reference graph may still be filling in.

Source provenance

europepmc
last seen: 2026-05-20T01:45:00.602351+00:00