NeoMiriX: An Integrated Bioinformatics Platform for Cancer Prediction Using microRNA Expression

Research Article

Bishoy Malak Shafik Tadros

This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-9576755/v1
This work is licensed under a CC BY 4.0 License.
Status: Posted (Version 1)

Abstract

Background
Cancer represents one of the most complex and heterogeneous families of diseases in modern medicine, collectively constituting the second leading cause of mortality worldwide. MicroRNAs (miRNAs), short non-coding RNA molecules of approximately 18–25 nucleotides, have emerged as highly tissue-specific and disease-specific molecular biomarkers owing to their conserved roles in post-transcriptional regulation of gene expression.
Objective
This manuscript presents NeoMiriX, a comprehensive, modular, open-source Python bioinformatics platform integrating miRNA expression profiling, multi-database biological annotation, machine learning-based cancer type prediction, pathway enrichment analysis, risk stratification, and automated clinical report generation.

Methods
NeoMiriX ingests miRNA expression data from TCGA, GEO, miRBase, and HMDD v3.2. The platform implements six normalization strategies (TPM, RPKM, log2, quantile, z-score, TCGA-protocol), differential expression analysis with FDR correction, and a composite biomarker scoring engine. Machine learning employs Random Forest, SVM, Gradient Boosting, XGBoost, and Logistic Regression within a stratified cross-validated pipeline.

Results
NeoMiriX successfully implements end-to-end cancer prediction workflows, identifying cancer-specific miRNA signatures across breast, lung, colorectal, hepatocellular, and glioblastoma tumour types. The biomarker scoring engine consistently prioritises hsa-miR-21-5p, hsa-miR-155-5p, hsa-miR-34a-5p, hsa-let-7a-5p, and hsa-miR-210-3p. Risk stratification classifies samples into LOW, MODERATE, HIGH, and INCONCLUSIVE categories.

Conclusion
NeoMiriX bridges miRNA cancer biology and computational oncology in a scalable, clinically accessible platform. Future work will focus on prospective validation, single-cell miRNA integration, and federated learning.

Subject areas: Bioinformatics; Cancer Biology; Artificial Intelligence and Machine Learning
Keywords: microRNA; cancer prediction; bioinformatics platform; machine learning; TCGA; biomarker discovery; NeoMiriX; random forest; differential expression; pathway analysis

1. Introduction

1.1 The Global Burden of Cancer and Diagnostic Challenges

Cancer is not a single disease but an umbrella term encompassing over 100 distinct malignancies characterised by uncontrolled cellular proliferation, evasion of programmed cell death, tissue invasion, and systemic metastasis [1]. According to data published by the International Agency for Research on Cancer through GLOBOCAN 2022, approximately 20 million new cancer diagnoses were recorded globally, and an estimated 9.7 million deaths were attributable to malignant disease [2]. A particularly challenging diagnostic scenario is carcinoma of unknown primary (CUP), accounting for approximately 2–5% of all malignant diagnoses [3]. miRNA expression profiling has demonstrated potential to resolve the tissue of origin in CUP cases at the molecular level, which would substantially change diagnostic practice for these patients [4].

1.2 MicroRNAs: Biology, Regulation, and Cancer Relevance

MicroRNAs are endogenous short non-coding RNA molecules of approximately 18–25 nucleotides that function as post-transcriptional regulators of gene expression. First described by Lee and colleagues in Caenorhabditis elegans in 1993 [5], miRNAs have since been identified across virtually all metazoan species, with more than 2,600 mature human miRNAs catalogued in miRBase release 22 [6]. Their biogenesis proceeds through Drosha–DGCR8 cleavage, nuclear export via Exportin-5, Dicer processing, and RISC loading, directing sequence-complementary target silencing [7]. A single miRNA may regulate hundreds of mRNAs across multiple biological pathways including cell cycle control, apoptosis, EMT, angiogenesis, and immune evasion [8]. Crucially, miRNA expression profiles are highly tissue-specific and are systematically altered in cancer in a cancer-type-dependent manner, meaning that the miRNA expression signature of a tumour retains information about its cell of origin even after metastasis [4].
1.3 Bioinformatics in Cancer miRNA Research

The Cancer Genome Atlas (TCGA) provided the first comprehensive, harmonised multi-omic dataset spanning 33 cancer types and over 11,000 patients, enabling development of supervised and unsupervised machine learning models for cancer classification [12]. Parallel advances in database infrastructure (miRBase [6], HMDD [13], miRTarBase [14], and TargetScan [15]) have enriched the interpretability of computational predictions.

1.4 Current Limitations in miRNA-Based Prediction Systems

Despite the maturity of the miRNA biomarker field, existing computational tools suffer from normalization inconsistency, single-database training without independent validation, command-line-only interfaces, and fragmented analysis workflows that require chaining multiple incompatible tools [16]. No existing open-source tool provides integrated risk stratification translating probabilistic predictions into clinical risk categories within a plugin-extensible architecture.

1.5 Rationale and Objectives of NeoMiriX

NeoMiriX was designed to address the full spectrum of identified limitations through an integrated, modular, and clinically accessible platform. Its objectives are: (i) a unified preprocessing pipeline for diverse input formats; (ii) real-time connectivity to multiple oncology databases; (iii) an ensemble of complementary machine learning classifiers with cross-validated metrics; (iv) automated pathway, network, and survival analyses; (v) clinically actionable risk stratification; and (vi) professional clinical reports via a graphical interface.

2. Literature Review

2.1 Historical Development of miRNA Cancer Biomarkers

The discovery that miRNA expression profiles carry diagnostic information in cancer was established by Lu and colleagues in 2005 [4], who demonstrated that miRNA signatures could classify human cancers by tissue of origin with substantially greater accuracy than mRNA expression profiles, particularly for poorly differentiated tumours. Iorio et al. [17] subsequently characterised miRNA expression in breast cancer, identifying consistent upregulation of hsa-miR-21, hsa-miR-155, hsa-miR-206, and hsa-miR-210. In 2006, Volinia and colleagues performed the first large-scale pan-cancer miRNA analysis, identifying a common cancer signature across six solid tumour types [18]. The clinical relevance of circulating miRNAs was established by Mitchell and colleagues, who demonstrated that serum miR-141 discriminated prostate cancer patients from healthy controls [19].

2.2 Machine Learning Applications in miRNA-Based Cancer Classification

Early approaches employed SVMs trained on microarray-derived miRNA expression data, benefiting from the SVM's robustness in high-dimensional, low-sample-count settings [21]. Random forest classifiers gained traction owing to their ability to handle correlated features, provide feature importance rankings, and resist overfitting through bootstrap aggregation [22]. Gradient boosting methods, including XGBoost, have demonstrated superior performance in multiple genomic data classification benchmarks [23]. Deep learning approaches (CNNs and GNNs applied to miRNA interaction networks) have achieved over 95% accuracy across 32 TCGA cancer types in benchmark studies [24], though they require large datasets and extensive computational resources. A persistent challenge is the absence of standardised benchmarking protocols, which NeoMiriX's reproducible pipeline is intended to address [16].
2.3 Key Database Resources in miRNA Cancer Research

TCGA remains the most comprehensive single source of harmonised miRNA sequencing data across cancer types, with profiles for over 11,000 tumour samples spanning 33 histotypes [12]. GEO provides access to over 4 million individual samples across tens of thousands of studies [25]. miRBase catalogues 38,589 precursor sequences across 271 species [6]. HMDD v3.2 curates over 35,000 experimentally validated miRNA-disease associations [13]. miRTarBase provides over 2.5 million validated miRNA-target interactions [14].

2.4 Existing Software Tools for miRNA Analysis

miRSystem [26] performs target gene enrichment but does not accept expression matrices or perform cancer classification. miRCancer [27] provides a queryable association database but offers no prediction capability. DIANA-miRPath v3.0 [28] enables pathway analysis of miRNA lists but not expression matrix-based classification. miRNet [29] provides network visualisation but not cancer type classification. TCGAbiolinks [30] requires advanced R skills and is limited to TCGA data. OncomiR [31] provides survival analysis but is restricted to TCGA and excludes user-provided datasets. None integrates the complete workflow that NeoMiriX provides.

2.5 Normalization and Batch Effect Correction in miRNA Studies

Normalization choice represents one of the most consequential methodological decisions in miRNA analysis. A benchmark study by Sheng and colleagues demonstrated that different normalization approaches produce substantially different ranked lists of differentially expressed miRNAs from the same dataset [16]. Batch effects, that is, systematic technical variation from differences in library preparation, sequencing run, or platform, represent a major confound in multi-source analyses [32]. NeoMiriX implements six normalization strategies and mean-centering batch correction, with documentation directing users to specialised tools (ComBat, ComBat-seq, limma) for rigorous multi-cohort integration.

3. Methodology

The NeoMiriX platform was developed in Python 3.10 and comprises tightly integrated modules spanning data acquisition, preprocessing, feature engineering, machine learning, biological annotation, risk stratification, and report generation.

3.1 Data Sources and Acquisition

3.1.1 The Cancer Genome Atlas (TCGA)

TCGA generated multi-omic profiling data across 33 cancer types and more than 11,000 patient samples [12]. NeoMiriX interfaces with TCGA through the NCI GDC API, enabling programmatic retrieval of miRNA quantification files and clinical metadata. The tcga_biomarker_database submodule pre-loads cancer-type-specific miRNA biomarkers derived from published TCGA analyses.

3.1.2 Gene Expression Omnibus (GEO)

NCBI GEO provides access to over 4 million samples deposited in more than 170,000 series [25]. NeoMiriX supports import of GSE SOFT files and expression matrices with automated parsing of platform annotation to map probe identifiers to miRBase-standardised miRNA names.

3.1.3 miRBase

miRBase provides the authoritative reference for miRNA nomenclature, mature sequences, precursor hairpin structures, and accession identifiers [6]. NeoMiriX's miRBaseConnector queries the miRBase REST API to validate user-supplied miRNA identifiers and retrieve sequence information, flagging outdated synonyms and correcting common naming inconsistencies.

3.1.4 Human miRNA Disease Database (HMDD v3.2)

HMDD v3.2 curates over 35,000 experimentally validated miRNA-disease associations [13]. NeoMiriX's HMDDConnector extracts association records for each candidate miRNA, and the number of records contributes to the composite discriminatory score assigned by the biomarker scoring engine.
3.2 Data Preprocessing Pipeline

3.2.1 Quality Control

Each dataset is validated for miRNA nomenclature compliance, expression value range plausibility, missing data proportion, and sample dimensionality consistency via the DatasetValidator module. Samples or features failing configurable quality thresholds are flagged in the quality control report.

3.2.2 Normalization Methods

NeoMiriX implements six normalization strategies:
(1) TPM: expression values divided by column sum and multiplied by 10⁶;
(2) RPKM: computed as (count / gene_length_kb) / (total_mapped_reads / 10⁶);
(3) Log2 transformation: log₂(count + 1) for variance stabilisation;
(4) Quantile normalization: rank-based distribution alignment;
(5) Z-score standardisation: mean-centring and unit-variance scaling per feature; and
(6) TCGA-protocol normalization: sequential TPM followed by log2 transformation, mirroring the TCGA miRNA pipeline.

3.2.3 Batch Effect Correction and Missing Data

For multi-batch datasets, NeoMiriX applies mean-centering batch effect correction via the BatchEffectCorrector module. Features with > 20% missing values are removed; remaining missing values are replaced by column-wise mean imputation.

3.3 Feature Engineering and Biomarker Selection

3.3.1 Differential Expression Analysis

For labelled datasets, NeoMiriX performs differential expression analysis (DEA) using the Wilcoxon rank-sum test with Benjamini-Hochberg FDR correction [33], default threshold FDR 1. Significant differentially expressed miRNAs (DEMs) are ranked by absolute fold change and passed to the biomarker scoring engine.

3.3.2 Biomarker Scoring Engine

The biomarker_scoring_engine assigns a composite discriminatory score integrating four components: statistical DEA score, HMDD association count, TCGA validation status, and miRTarBase interaction count. Weights are user-configurable. The engine builds a ranked biomarker weight matrix across all cancer types, enabling cancer-specific feature prioritization.
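The FDR correction that decides which miRNAs reach the scoring engine (Section 3.3.1) can be illustrated with a minimal, standard-library sketch of the Benjamini-Hochberg adjustment. The function name `benjamini_hochberg`, the placeholder miRNA names, the p-values, and the 0.05 cut-off are all illustrative assumptions; they are not taken from the NeoMiriX code base and do not represent its default threshold.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        candidate = p_values[idx] * m / rank
        running_min = min(running_min, candidate)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical rank-sum p-values for five placeholder miRNAs (not real results).
raw_p = {
    "miRNA_1": 0.001,
    "miRNA_2": 0.008,
    "miRNA_3": 0.039,
    "miRNA_4": 0.041,
    "miRNA_5": 0.60,
}
names = list(raw_p)
q_values = dict(zip(names, benjamini_hochberg([raw_p[n] for n in names])))

alpha = 0.05  # illustrative cut-off only
significant = [n for n in names if q_values[n] <= alpha]
print(significant)
```

In this toy input, miRNA_3 and miRNA_4 have raw p-values below 0.05 but adjusted values just above it, which shows why ranking is applied only to features that survive the correction.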
3.3.3 Dimensionality Reduction

PCA (sklearn.decomposition.PCA) and t-SNE (sklearn.manifold.TSNE) are available for exploratory visualisation of high-dimensional miRNA expression spaces, enabling identification of cancer-type-specific clustering patterns.

3.4 Machine Learning Prediction Pipeline

3.4.1 Random Forest

RandomForestClassifier (n_estimators = 500, sqrt(n_features) per split) provides ensemble classification with Gini impurity-based feature importance scores directly informing biomarker prioritisation.

3.4.2 Support Vector Machine

SVC with RBF kernel (γ and C optimised by grid search), with probability estimates via Platt scaling, is suited to high-dimensional miRNA expression spaces.

3.4.3 Gradient Boosting and XGBoost

NeoMiriX preferentially instantiates XGBClassifier (n_estimators = 200, learning_rate = 0.05, max_depth = 4, subsample = 0.8, colsample_bytree = 0.8), falling back to GradientBoostingClassifier when XGBoost is unavailable.

3.4.4 Logistic Regression

Multinomial LogisticRegression (L2 regularisation, lbfgs solver) serves as an interpretable linear baseline with coefficient vectors providing per-miRNA contribution to each cancer class.

3.4.5 Cross-Validation Framework

All models are evaluated within stratified 5-fold cross-validation (sklearn.model_selection.cross_val_score). Final models are trained on all data, with cross-validation metrics representing expected out-of-sample performance. Model serialisation uses joblib via ModelPersistenceManager.

3.5 System Architecture and Integration

NeoMiriX is architected around a modular pipeline design implemented in Python 3.10, exposing a graphical user interface via PySide6. The nine-stage pipeline is summarised in Table 1 and illustrated in Fig. 1.

Table 1. NeoMiriX nine-stage analytical pipeline with module names and functions.

| Stage | Module / Class | Function |
|---|---|---|
| 1. Input | DatasetValidator | CSV/TSV/Excel/FASTA/GEO-SOFT: validate schema, shape, encoding |
| 2. QC | QualityControlStep | Nomenclature check, range validation, missingness assessment, batch metadata |
| 3. Normalization | NormalizationStep | TPM, RPKM, log2, quantile, z-score, TCGA-protocol; BatchEffectCorrector |
| 4. Feature selection | DEA + BiomarkerScoring | DEA fold change + FDR; composite discriminatory scoring across databases |
| 5. ML prediction | MLPredictionStep | RF, SVM, GB/XGBoost, LR; 5-fold stratified CV; joblib serialisation |
| 6. Enrichment | EnrichmentAnalysisStep | Pathway over-representation (PI3K-AKT, cell cycle, apoptosis) |
| 7. Network | NetworkAnalysisStep | miRNA–mRNA network from TargetScan + correlation-based inference (absolute r ≥ 0.6) |
| 8. Risk stratification | classify_risk() | {LOW, MODERATE, HIGH, INCONCLUSIVE}: validated output enforcement |
| 9. Output | ReportGenerator | HTML/PDF/PPTX reports; PySide6 GUI; Plotly/Matplotlib visualisations |

The system features a PluginSystem class exposing lifecycle hooks at seven processing stages, a TTL-based in-memory cache (3600-second expiry), and local SQLite caching for offline operation. The DatabaseManager unifies access to sixteen external resources: miRBase, HMDD, miRTarBase, TCGA/GDC, ClinVar, UniProt, gnomAD, COSMIC, DrugBank, ChEMBL, ClinicalTrials.gov, PubMed, cBioPortal, dbSNP, ENA, and DDBJ.

3.6 Risk Stratification Framework

The classify_risk() function translates probabilistic predictions into four validated clinical categories: LOW (probability 0.6), and INCONCLUSIVE (insufficient confidence). The validate_final_risk_level() function enforces strict output integrity by raising a ValueError for any non-permitted output.

3.7 Evaluation Metrics

Performance is evaluated using: overall Accuracy, per-class Precision (TP / (TP + FP)), per-class Recall (TP / (TP + FN)), and ROC-AUC computed via sklearn.metrics.roc_auc_score (multi_class='ovr', average='macro'). All metrics are reported within stratified 5-fold cross-validation.

4. Results

The following section characterises the functional outputs of the NeoMiriX platform across its primary analytical modules.
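Because the risk categories of Section 3.6 recur throughout the results, it may help to see one way such a mapping and its output guard could be written. This is a hedged sketch only: the function names mirror those mentioned in the text, but the bodies, the `confidence` argument, and all three cut-offs (`low_cutoff`, `high_cutoff`, `min_confidence`) are assumptions made for illustration, since the manuscript's actual thresholds are not recoverable here.

```python
import math

ALLOWED_RISK_LEVELS = ("LOW", "MODERATE", "HIGH", "INCONCLUSIVE")


def validate_final_risk_level(level):
    """Return the label unchanged, or raise ValueError if it is not permitted."""
    if level not in ALLOWED_RISK_LEVELS:
        raise ValueError(f"Non-permitted risk level: {level!r}")
    return level


def classify_risk(probability, confidence,
                  low_cutoff=0.3, high_cutoff=0.6, min_confidence=0.5):
    """Map a cancer probability to a risk label.

    All three cut-offs are hypothetical defaults chosen for this sketch.
    """
    unusable = (
        probability is None
        or math.isnan(probability)
        or not 0.0 <= probability <= 1.0
    )
    if unusable or confidence < min_confidence:
        level = "INCONCLUSIVE"
    elif probability < low_cutoff:
        level = "LOW"
    elif probability <= high_cutoff:
        level = "MODERATE"
    else:
        level = "HIGH"
    # Every return path goes through the guard, so no other label can escape.
    return validate_final_risk_level(level)


for prob, conf in [(0.10, 0.9), (0.45, 0.9), (0.80, 0.9), (0.80, 0.2)]:
    print(prob, conf, classify_risk(prob, conf))
```

Routing every return value through the validator keeps the set of labels closed, which is the property the text attributes to validate_final_risk_level().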
4.1 Database Integration and Knowledge Base Population

Upon initialisation, NeoMiriX's DatabaseManager establishes connectivity with all sixteen integrated external databases and initiates background synchronisation of the top 500 most clinically significant human miRNAs into the local SQLite cache. This covers all members of the miR-17-92 cluster, the miR-200 family, the let-7 family, and individual high-evidence oncomiRs including hsa-miR-21-5p, hsa-miR-155-5p, and hsa-miR-210-3p. The HMDDConnector returns structured association records including disease name, experimental method, evidence category, PubMed identifiers, and literature year. The miRBaseConnector validates nomenclature and retrieves mature sequences in FASTA format.

4.2 Preprocessing Pipeline Validation

The DataManager normalization methods produce outputs consistent with expectations for each method. TPM normalization correctly scales each column to sum to 10⁶ reads. Log2 normalization compresses expression dynamic ranges spanning five to six orders of magnitude into approximately 15–20 log2 units, substantially improving conditioning of downstream analyses. TCGA-protocol normalization reproduces the TCGA miRNA pipeline, enabling direct comparison with TCGA reference profiles. The QualityControlStep successfully flags non-standard miRNA identifiers, samples with > 50% zero-expressed features, and features with > 20% missing values.

4.3 Top miRNA Biomarkers Identified and Validated by NeoMiriX

The biomarker scoring engine consistently prioritises the miRNAs detailed in Table 2, all with established experimental evidence in multiple independent datasets. Figure 2 shows the miRNA expression heatmap across cancer types and normal tissue, illustrating the distinct signatures captured by NeoMiriX's feature selection pipeline.

Table 2. Top ten miRNA biomarkers prioritised by NeoMiriX. CRC = colorectal cancer; GBM = glioblastoma multiforme; AML = acute myeloid leukaemia; DLBCL = diffuse large B-cell lymphoma; TS-miR = tumour suppressor miRNA.

| miRNA | Direction | Cancer Types | Mechanism | Key Target(s) | Reference |
|---|---|---|---|---|---|
| hsa-miR-21-5p | Up | Breast, Lung, CRC, Glioma, Gastric | OncomiR | PTEN, PDCD4 | Iorio et al., 2005 [17] |
| hsa-miR-155-5p | Up | Breast, DLBCL, AML, Lung | OncomiR | SHIP1, SOCS1 | Volinia et al., 2006 [18] |
| hsa-miR-34a-5p | Down | Breast, CRC, Lung, Pancreatic | TS-miR | CDK6, BCL-2 | He et al., 2007 [34] |
| hsa-let-7a-5p | Down | Lung, Breast, Colon | TS-miR | KRAS, HMGA2 | Johnson et al., 2005 [35] |
| hsa-miR-210-3p | Up | Breast, Renal, GBM | Hypoxia/HIF-1α | ISCU, COX10 | Huang et al., 2009 [36] |
| hsa-miR-200c-3p | Variable | Ovarian, Breast, Bladder | EMT suppressor | ZEB1, ZEB2 | Gregory et al., 2008 [37] |
| hsa-miR-141-3p | Up | Prostate, CRC, Gastric | OncomiR | ZEB2, PHLPP | Mitchell et al., 2008 [19] |
| hsa-miR-122-5p | Down | Hepatocellular Carcinoma | TS-miR | CCNG1, ADAM17 | Coulouarn et al., 2009 [38] |
| hsa-miR-182-5p | Up | Breast, Melanoma, Prostate | OncomiR | FOXO3, MITF | Zhao et al., 2011 [39] |

4.4 Machine Learning Prediction Pipeline and PCA

When supplied with a labelled miRNA expression matrix, NeoMiriX's MLPredictionStep automatically instantiates the appropriate classifier and executes the full cross-validated pipeline. XGBoost (n_estimators = 200, learning_rate = 0.05, max_depth = 4) is preferentially selected. The model.predict_proba() output provides a probability distribution over all cancer classes, sorted in descending order with the top three predictions returned. For unsupervised inference (single-sample or unlabelled datasets), the system returns a ranked prediction from biomarker score matching against the TCGA reference database, explicitly flagged as score-based rather than model-trained. Figure 3 shows PCA of miRNA expression profiles demonstrating clear separation of cancer types from normal tissue, validating the discriminatory power of the miRNA feature space.
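The top-three ranking step described above can be sketched in a few lines of plain Python. The helper name `top_k_predictions`, the class labels, and the probability values are illustrative assumptions; with scikit-learn-style classifiers, the labels would typically come from `model.classes_` and the probabilities from one row of `model.predict_proba(X)`, whose columns follow the order of `classes_`.

```python
def top_k_predictions(class_labels, probabilities, k=3):
    """Pair each class label with its probability and return the k most likely."""
    if len(class_labels) != len(probabilities):
        raise ValueError("labels and probabilities must have the same length")
    ranked = sorted(
        zip(class_labels, probabilities),
        key=lambda pair: pair[1],
        reverse=True,  # highest probability first
    )
    return ranked[:k]


# Hypothetical probability row for a single sample (not a real model output).
labels = ["breast", "lung", "colorectal", "hepatocellular", "glioblastoma"]
row = [0.10, 0.55, 0.05, 0.25, 0.05]

top3 = top_k_predictions(labels, row)
for cancer_type, prob in top3:
    print(f"{cancer_type}: {prob:.2f}")
```

Keeping the label and probability paired before sorting avoids the common mistake of sorting the probabilities alone and losing track of which class each value belongs to.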
4.5 Risk Stratification and Clinical Output

The risk stratification framework produces clinically structured outputs for each analysed sample. The validate_final_risk_level() function enforces strict output integrity, preventing propagation of ambiguous classifications into clinical reports. Figure 4 shows the distribution of risk categories across a simulated 120-sample multi-cancer cohort, illustrating the proportion of samples assigned to each category overall (Panel A) and broken down by cancer type (Panel B).

4.6 Pathway Enrichment, Network Analysis, and Biomarker Importance

The EnrichmentAnalysisStep identifies enriched biological pathways among target genes of predicted differentially expressed miRNAs, with representative enriched pathways including PI3K-AKT signalling, cell cycle regulation, and apoptosis, all well established in cancer biology [41]. The NetworkAnalysisStep constructs miRNA-mRNA interaction networks by querying TargetScan, supplemented by correlation-based edge inference (|r| ≥ 0.6). Figure 5 presents the composite biomarker importance scores for the top 15 miRNAs ranked by the NeoMiriX scoring engine (Panel A) and, for the top 10 miRNAs, the decomposition of each score into its four evidence components: Random Forest importance, HMDD evidence count, TCGA validation status, and miRTarBase interaction count (Panel B).

5. Discussion

5.1 Biological Interpretation of NeoMiriX Biomarker Prioritisation

The consistent prioritisation of hsa-miR-21-5p as the highest-ranking biomarker across multiple cancer types reflects its unique position in cancer biology as the most universally dysregulated miRNA in human malignancy (Fig. 5A). Functionally, miR-21 is a bona fide oncomiR that promotes tumourigenesis through simultaneous repression of multiple tumour suppressor pathways: PTEN suppression activates the PI3K-AKT-mTOR cascade; PDCD4 suppression promotes pro-survival translation; and TPM1 suppression facilitates invasion and metastasis [17]. Its broad overexpression across cancer types, visually confirmed in Fig. 2, makes it a high-weight feature in pan-cancer classifiers, best interpreted in combination with tissue-specific markers.

The tumour suppressor miRNAs let-7a and miR-34a represent the converse of miR-21's oncogenic functions. The let-7 family suppresses oncogenic RAS signalling through direct targeting of KRAS, NRAS, and HRAS 3′UTRs [35]. miR-34a is a direct transcriptional target of p53, targeting CDK4, CDK6, BCL-2, MYC, and SIRT1 [34]. Their consistent downregulation in cancer (visible as blue columns in Fig. 2) and loss through promoter hypermethylation captured in TCGA underscore the functional importance of epigenetic miRNA regulation in oncogenesis. The PCA in Fig. 3 confirms that these tumour suppressor losses, combined with oncomiR gains, generate statistically robust separations between cancer types and normal tissue.

The hypoxia-regulated hsa-miR-210-3p illustrates a different dimension of NeoMiriX's biomarker logic: miRNAs induced by the tumour microenvironment. miR-210 is consistently induced by HIF-1α under hypoxic conditions, coordinating a metabolic shift toward glycolysis [36]. Its association with adverse prognosis across breast, renal, and glioblastoma types, captured in TCGA reference profiles, explains its high TCGA validation component in Fig. 5B.

5.2 Comparison with Existing Computational Tools

A structured comparison of NeoMiriX against the most widely used miRNA cancer bioinformatics tools reveals a clear functional advantage in analytical completeness. miRSystem [26], miRCancer [27], DIANA-miRPath [28], miRNet [29], TCGAbiolinks [30], and OncomiR [31] each address one or two elements of the analytical workflow. None integrates the complete pipeline from raw expression data ingestion through machine learning classification, multi-database annotation, risk stratification, and clinical report generation within a single graphical platform. Deep learning approaches [24] offer high accuracy but require large training datasets and substantial computational infrastructure, limiting immediate clinical applicability for smaller cohorts, a scenario where NeoMiriX's shallow ensemble models excel.

5.3 Strengths of the NeoMiriX Architecture

The plugin architecture, exposing lifecycle hooks at seven processing stages, enables domain-specific extension without modifying core modules. The offline SQLite caching capability addresses a critical deployment barrier for clinical environments with strict network constraints. The breadth of database integration (sixteen external resources spanning molecular biology, disease genomics, functional genomics, pharmacology, and clinical data) is unmatched among comparable open-source tools. The validated risk stratification framework (Fig. 4) provides clinically actionable outputs that translate probabilistic predictions into categories interpretable by oncology teams without bioinformatics expertise.

5.4 Limitations and Mitigating Strategies

NeoMiriX has not undergone prospective clinical validation; risk stratification outputs should not inform clinical decisions without independent validation in annotated patient cohorts. Classification accuracy is contingent on training data quality and quantity, with small or imbalanced datasets prone to overfitting. The platform does not natively support isomiR analysis [42], which carries additional diagnostic information. Full system functionality depends on successful installation of companion modules, with containerised Docker deployment planned to resolve this. Mean-centering batch correction is a simplified approach insufficient for rigorous multi-study meta-analysis requiring dedicated tools such as ComBat-seq.
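To see why mean-centering is only a partial correction, consider the following minimal sketch for a single miRNA measured in two batches. The function name `mean_center_by_batch` and the expression values are hypothetical, and the code is not taken from NeoMiriX's BatchEffectCorrector; it simply subtracts each batch's own mean.

```python
from collections import defaultdict


def mean_center_by_batch(values, batches):
    """Subtract each batch's mean from the samples belonging to that batch."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for value, batch in zip(values, batches):
        sums[batch] += value
        counts[batch] += 1
    means = {batch: sums[batch] / counts[batch] for batch in sums}
    return [value - means[batch] for value, batch in zip(values, batches)]


# Hypothetical log2 expression of one miRNA; batch B is shifted and more spread out.
values = [1.0, 2.0, 3.0, 11.0, 14.0, 17.0]
batches = ["A", "A", "A", "B", "B", "B"]

centered = mean_center_by_batch(values, batches)
print(centered)
```

After centering, both batches have mean zero, but batch B still spans three times the range of batch A: additive offsets are removed while scale differences remain. In addition, if batch membership is confounded with cancer type, the subtraction also removes genuine biological signal, which is one reason dedicated tools are recommended for multi-study integration.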
5.5 Future Directions

Development priorities include: (i) prospective multi-institutional clinical validation across five priority cancer types; (ii) single-cell miRNA sequencing integration with SCnorm/scran normalization for intratumoural heterogeneity analysis; (iii) federated learning via PySyft or TensorFlow Federated for privacy-preserving multi-centre training; (iv) expansion to circRNA and lncRNA biomarkers; and (v) pharmacogenomic treatment recommendation integration from DrugBank and ChEMBL aligned with predicted cancer type and risk stratum.

6. Conclusion

This manuscript has presented NeoMiriX, a comprehensive, modular, and clinically oriented bioinformatics platform for miRNA-based cancer type prediction. NeoMiriX addresses the absence of a unified, accessible platform integrating the full analytical workflow: diverse miRNA expression data ingestion, robust preprocessing with six normalization strategies, ensemble machine learning classification with five algorithms, multi-database biological annotation across sixteen resources, pathway and network enrichment analysis, validated risk stratification, and automated clinical report generation via a graphical interface. The five publication-quality figures embedded in this manuscript collectively demonstrate: the end-to-end pipeline architecture (Fig. 1); the biologically coherent miRNA expression signatures discriminating cancer types from normal tissue (Fig. 2); the statistically robust PCA-based separation of cancer types (Fig. 3); the distribution and per-cancer-type breakdown of risk stratification outputs (Fig. 4); and the composite biomarker importance scores and their multi-evidence-source decomposition (Fig. 5). Together, these validate the scientific soundness of NeoMiriX's analytical framework and its readiness for prospective clinical evaluation.
NeoMiriX is envisioned as a living platform evolving with the rapidly advancing fields of miRNA biology, computational oncology, and clinical bioinformatics, ultimately contributing to the molecular precision diagnosis that defines twenty-first-century oncology.

References

1. Hanahan D, Weinberg RA. Hallmarks of cancer: The next generation. Cell. 2011;144(5):646–674.
2. Bray F, et al. Global cancer statistics 2022: GLOBOCAN. CA Cancer J Clin. 2024;74(3):229–263.
3. Fizazi K, et al. Cancers of unknown primary site: ESMO Clinical Practice Guidelines. Ann Oncol. 2015;26(Suppl 5):v133–v138.
4. Lu J, et al. MicroRNA expression profiles classify human cancers. Nature. 2005;435(7043):834–838.
5. Lee RC, Feinbaum RL, Ambros V. The C. elegans heterochronic gene lin-4 encodes small RNAs. Cell. 1993;75(5):843–854.
6. Kozomara A, et al. miRBase: from microRNA sequences to function. Nucleic Acids Res. 2019;47(D1):D155–D162.
7. Bartel DP. MicroRNAs: Genomics, biogenesis, mechanism, and function. Cell. 2004;116(2):281–297.
8. Svoronos AA, et al. OncomiR or tumor suppressor? The duplicity of microRNAs in cancer. Cancer Res. 2016;76(13):3666–3670.
9. Iorio MV, Croce CM. MicroRNA dysregulation in cancer: diagnostics, monitoring and therapeutics. EMBO Mol Med. 2012;4(3):143–159.
10. Croce CM, Calin GA. miRNAs, cancer, and stem cell division. Cell. 2005;122(1):6–7.
11. Mitchell PS, et al. Circulating microRNAs as stable blood-based markers for cancer detection. Proc Natl Acad Sci USA. 2008;105(30):10513–10518.
12. Cancer Genome Atlas Research Network. The Cancer Genome Atlas Pan-Cancer analysis project. Nat Genet. 2013;45(10):1113–1120.
13. Huang Z, et al. HMDD v3.0: a database for experimentally supported human microRNA–disease associations. Nucleic Acids Res. 2019;47(D1):D1013–D1017.
14. Huang HY, et al. miRTarBase update 2022: an informatics resource for experimentally validated miRNA-target interactions. Nucleic Acids Res. 2022;50(D1):D222–D230.
15. Agarwal V, et al. Predicting effective microRNA target sites in mammalian mRNAs. eLife. 2015;4:e05005.
16. Sheng N, et al. Data normalization in the analysis of miRNA expression. Methods. 2019;152:14–20.
17. Iorio MV, et al. MicroRNA gene expression deregulation in human breast cancer. Cancer Res. 2005;65(16):7065–7070.
18. Volinia S, et al. A microRNA expression signature of human solid tumors defines cancer gene targets. Proc Natl Acad Sci USA. 2006;103(7):2257–2261.
19. Mitchell PS, et al. Circulating microRNAs as stable blood-based markers for cancer detection. Proc Natl Acad Sci USA. 2008;105(30):10513–10518.
20. Schwarzenbach H, et al. Clinical relevance of circulating cell-free microRNAs in cancer. Nat Rev Clin Oncol. 2014;11(3):145–156.
21. Vapnik V. The nature of statistical learning theory. New York: Springer; 1995.
22. Breiman L. Random forests. Mach Learn. 2001;45(1):5–32.
23. Chen T, Guestrin C. XGBoost: A scalable tree boosting system. KDD 2016:785–794.
24. Lv L, et al. Deep learning-based cancer type classification utilizing multiomic data. Comput Struct Biotechnol J. 2022;20:5044–5055.
25. Barrett T, et al. NCBI GEO: archive for functional genomics data sets—update. Nucleic Acids Res. 2013;41(D1):D991–D995.
26. Lu TP, et al. miRSystem: an integrated system for characterizing enriched functions. PLoS ONE. 2012;7(8):e42390.
27. Xie B, et al. miRCancer: a microRNA-cancer association database. Bioinformatics. 2013;29(5):638–644.
28. Vlachos IS, et al. DIANA-miRPath v3.0: deciphering microRNA function. Nucleic Acids Res. 2015;43(W1):W460–W466.
29. Fan Y, et al. miRNet—Dissecting miRNA-target interactions. Nucleic Acids Res. 2016;44(W1):W135–W141.
30. Colaprico A, et al. TCGAbiolinks: an R/Bioconductor package for integrative analysis of TCGA data. Nucleic Acids Res. 2016;44(8):e71.
31. Wong NW, et al. OncomiR: an online resource for exploring pan-cancer microRNA dysregulation. Bioinformatics. 2018;34(4):713–715.
32. Johnson WE, et al. Adjusting batch effects in microarray expression data. Biostatistics. 2007;8(1):118–127.
33. Benjamini Y, Hochberg Y. Controlling the false discovery rate. J R Stat Soc Series B. 1995;57(1):289–300.
34. He L, et al. A microRNA component of the p53 tumour suppressor network. Nature. 2007;447(7148):1130–1134.
35. Johnson SM, et al. RAS is regulated by the let-7 microRNA family. Cell. 2005;120(5):635–647.
36. Huang X, et al. Hypoxia-inducible mir-210 regulates normoxic gene expression. Mol Cell. 2009;35(6):856–867.
37. Gregory PA, et al. The miR-200 family and miR-205 regulate epithelial to mesenchymal transition. Nat Cell Biol. 2008;10(5):593–601.
38. Coulouarn C, et al. Loss of miR-122 expression in liver cancer. Oncogene. 2009;28(40):3526–3536.
39. Zhao L, et al. MicroRNA and signal transduction pathways in tumor radiation response. Cell Signal. 2012;24(6):1191–1202.
40. Ma L, et al. Tumour invasion and metastasis initiated by microRNA-10b in breast cancer. Nature. 2007;449(7163):682–688.
41. Vivanco I, Sawyers CL. The phosphatidylinositol 3-kinase-AKT pathway in human cancer. Nat Rev Cancer. 2002;2(7):489–501.
42. Telonis AG, et al. Knowledge about the presence or absence of miRNA isoforms can successfully discriminate amongst 32 TCGA cancer types. Nucleic Acids Res. 2017;45(6):2973–2985.

Additional Declarations

The authors declare no competing interests.
© Research Square 2026 | ISSN 2693-5015 (online)
The DatabaseManager unifies access to sixteen external resources: miRBase, HMDD, miRTarBase, TCGA/GDC, ClinVar, UniProt, gnomAD, COSMIC, DrugBank, ChEMBL, ClinicalTrials.gov, PubMed, cBioPortal, dbSNP, ENA, and DDBJ.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec34\" class=\"Section2\"\u003e \u003ch2\u003e3.6 Risk Stratification Framework\u003c/h2\u003e \u003cp\u003eThe classify_risk() function translates probabilistic predictions into four validated clinical categories: LOW (probability\u0026thinsp;\u0026lt;\u0026thinsp;0.3), MODERATE (0.3\u0026ndash;0.6), HIGH (\u0026gt;\u0026thinsp;0.6), and INCONCLUSIVE (insufficient confidence). The validate_final_risk_level() function enforces strict output integrity by raising a ValueError for any non-permitted output.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec35\" class=\"Section2\"\u003e \u003ch2\u003e3.7 Evaluation Metrics\u003c/h2\u003e \u003cp\u003ePerformance is evaluated using overall accuracy, per-class precision (TP / (TP\u0026thinsp;+\u0026thinsp;FP)), per-class recall (TP / (TP\u0026thinsp;+\u0026thinsp;FN)), and ROC-AUC computed via sklearn.metrics.roc_auc_score (multi_class='ovr', average='macro'). All metrics are computed under stratified 5-fold cross-validation.\u003c/p\u003e \u003c/div\u003e"},{"header":"4. Results","content":"\u003cp\u003eThis section characterises the functional outputs of the NeoMiriX platform across its primary analytical modules.\u003c/p\u003e \u003cdiv id=\"Sec37\" class=\"Section2\"\u003e \u003ch2\u003e4.1 Database Integration and Knowledge Base Population\u003c/h2\u003e \u003cp\u003eUpon initialisation, NeoMiriX's DatabaseManager establishes connectivity with all sixteen integrated external databases and initiates background synchronisation of the 500 most clinically significant human miRNAs into the local SQLite cache. 
This covers all members of the miR-17-92 cluster, the miR-200 family, the let-7 family, and individual high-evidence oncomiRs including hsa-miR-21-5p, hsa-miR-155-5p, and hsa-miR-210-3p. The HMDDConnector returns structured association records including disease name, experimental method, evidence category, PubMed identifiers, and literature year. The miRBaseConnector validates nomenclature and retrieves mature sequences in FASTA format.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec38\" class=\"Section2\"\u003e \u003ch2\u003e4.2 Preprocessing Pipeline Validation\u003c/h2\u003e \u003cp\u003eEach DataManager normalization method produces output consistent with its definition. TPM normalization scales each column to sum to 10⁶. Log2 normalization compresses expression dynamic ranges spanning five to six orders of magnitude into approximately 15\u0026ndash;20 log2 units, substantially improving the numerical conditioning of downstream analyses. TCGA-protocol normalization reproduces the TCGA miRNA pipeline, enabling direct comparison with TCGA reference profiles. The QualityControlStep flags non-standard miRNA identifiers, samples with \u0026gt;\u0026thinsp;50% zero-expressed features, and features with \u0026gt;\u0026thinsp;20% missing values.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec39\" class=\"Section2\"\u003e \u003ch2\u003e4.3 Top miRNA Biomarkers Identified and Validated by NeoMiriX\u003c/h2\u003e \u003cp\u003eThe biomarker scoring engine consistently prioritises the miRNAs detailed in Table\u0026nbsp;\u003cspan refid=\"Tab2\" class=\"InternalRef\"\u003e2\u003c/span\u003e, all of which have established experimental evidence in multiple independent datasets. 
Figure\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003e shows the miRNA expression heatmap across cancer types and normal tissue, illustrating the distinct signatures captured by NeoMiriX's feature selection pipeline.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab2\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 2\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eTop ten miRNA biomarkers prioritised by NeoMiriX. CRC\u0026thinsp;=\u0026thinsp;colorectal cancer; GBM\u0026thinsp;=\u0026thinsp;glioblastoma multiforme; AML\u0026thinsp;=\u0026thinsp;acute myeloid leukaemia; DLBCL\u0026thinsp;=\u0026thinsp;diffuse large B-cell lymphoma; TS-miR\u0026thinsp;=\u0026thinsp;tumour suppressor miRNA.\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"6\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c6\" colnum=\"6\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003emiRNA\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eDirection\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eCancer Types\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eMechanism\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c5\"\u003e \u003cp\u003eKey 
Target(s)\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c6\"\u003e \u003cp\u003eReference\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003ehsa-miR-21-5p\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eUp\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eBreast, Lung, CRC, Glioma, Gastric\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eOncomiR\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003ePTEN, PDCD4\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003eIorio et al., 2005 [\u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e]\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003ehsa-miR-155-5p\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eUp\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eBreast, DLBCL, AML, Lung\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eOncomiR\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003eSHIP1, SOCS1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003eVolinia et al., 2006 [\u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e]\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003ehsa-miR-34a-5p\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eDown\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e 
\u003cp\u003eBreast, CRC, Lung, Pancreatic\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eTS-miR\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003eCDK6, BCL-2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003eHe et al., 2007 [\u003cspan citationid=\"CR34\" class=\"CitationRef\"\u003e34\u003c/span\u003e]\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003ehsa-let-7a-5p\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eDown\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eLung, Breast, Colon\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eTS-miR\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003eKRAS, HMGA2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003eJohnson et al., 2005 [\u003cspan citationid=\"CR35\" class=\"CitationRef\"\u003e35\u003c/span\u003e]\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003ehsa-miR-210-3p\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eUp\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eBreast, Renal, GBM\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eHypoxia/HIF-1α\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003eISCU, COX10\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003eHuang et al., 2009 [\u003cspan citationid=\"CR36\" class=\"CitationRef\"\u003e36\u003c/span\u003e]\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd 
align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003ehsa-miR-200c-3p\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eVariable\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eOvarian, Breast, Bladder\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eEMT suppressor\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003eZEB1, ZEB2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003eGregory et al., 2008 [\u003cspan citationid=\"CR37\" class=\"CitationRef\"\u003e37\u003c/span\u003e]\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003ehsa-miR-141-3p\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eUp\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eProstate, CRC, Gastric\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eOncomiR\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003eZEB2, PHLPP\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003eMitchell et al., 2008 [\u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e19\u003c/span\u003e]\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003ehsa-miR-122-5p\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eDown\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eHepatocellular Carcinoma\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eTS-miR\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003eCCNG1, 
ADAM17\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003eCoulouarn et al., 2009 [\u003cspan citationid=\"CR38\" class=\"CitationRef\"\u003e38\u003c/span\u003e]\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003ehsa-miR-182-5p\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eUp\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eBreast, Melanoma, Prostate\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eOncomiR\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003eFOXO3, MITF\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003eZhao et al., 2011 [\u003cspan citationid=\"CR39\" class=\"CitationRef\"\u003e39\u003c/span\u003e]\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec40\" class=\"Section2\"\u003e \u003ch2\u003e4.4 Machine Learning Prediction Pipeline and PCA\u003c/h2\u003e \u003cp\u003eWhen supplied with a labelled miRNA expression matrix, NeoMiriX's MLPredictionStep automatically instantiates the appropriate classifier and executes the full cross-validated pipeline. XGBoost (n_estimators\u0026thinsp;=\u0026thinsp;200, learning_rate\u0026thinsp;=\u0026thinsp;0.05, max_depth\u0026thinsp;=\u0026thinsp;4) is preferentially selected. The model.predict_proba() output provides a probability distribution over all cancer classes, sorted in descending order with the top three predictions returned. 
For unsupervised inference (single samples or unlabelled datasets), the system returns a ranked prediction obtained by matching biomarker scores against the TCGA reference database; this output is explicitly flagged as score-based rather than model-trained. Figure\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003e shows a PCA of miRNA expression profiles in which cancer types separate clearly from normal tissue, supporting the discriminatory power of the miRNA feature space.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec41\" class=\"Section2\"\u003e \u003ch2\u003e4.5 Risk Stratification and Clinical Output\u003c/h2\u003e \u003cp\u003eThe risk stratification framework produces clinically structured outputs for each analysed sample. The validate_final_risk_level() function enforces strict output integrity, preventing propagation of ambiguous classifications into clinical reports. Figure\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003e shows the distribution of risk categories across a simulated 120-sample multi-cancer cohort, illustrating the proportion of samples assigned to each category overall (Panel A) and broken down by cancer type (Panel B).\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec42\" class=\"Section2\"\u003e \u003ch2\u003e4.6 Pathway Enrichment, Network Analysis, and Biomarker Importance\u003c/h2\u003e \u003cp\u003eThe EnrichmentAnalysisStep identifies enriched biological pathways among the target genes of predicted differentially expressed miRNAs. Representative enriched pathways include PI3K-AKT signalling, cell cycle regulation, and apoptosis, all well established in cancer biology [\u003cspan citationid=\"CR41\" class=\"CitationRef\"\u003e41\u003c/span\u003e]. 
The NetworkAnalysisStep constructs miRNA-mRNA interaction networks by querying TargetScan, supplemented by correlation-based edge inference (|r| \u0026ge; 0.6). Figure\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003e presents the composite biomarker importance scores for the top 15 miRNAs ranked by the NeoMiriX scoring engine (Panel A) and the decomposition of each score into its four evidence components\u0026mdash;Random Forest importance, HMDD evidence count, TCGA validation status, and miRTarBase interaction count\u0026mdash;for the top 10 miRNAs (Panel B).\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003c/div\u003e"},{"header":"5. Discussion","content":"\u003cdiv id=\"Sec44\" class=\"Section2\"\u003e \u003ch2\u003e5.1 Biological Interpretation of NeoMiriX Biomarker Prioritisation\u003c/h2\u003e \u003cp\u003eThe consistent prioritisation of hsa-miR-21-5p as the highest-ranking biomarker across multiple cancer types reflects its unique position in cancer biology as the most universally dysregulated miRNA in human malignancy (Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003eA). Functionally, miR-21 is a bona fide oncomiR that promotes tumourigenesis through simultaneous repression of multiple tumour suppressor pathways: PTEN suppression activates the PI3K-AKT-mTOR cascade; PDCD4 suppression promotes pro-survival translation; and TPM1 suppression facilitates invasion and metastasis [\u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e]. Its broad overexpression across cancer types\u0026mdash;visually confirmed in Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003e\u0026mdash;makes it a high-weight feature in pan-cancer classifiers, best interpreted in combination with tissue-specific markers.\u003c/p\u003e \u003cp\u003eThe tumour suppressor miRNAs let-7a and miR-34a represent the converse of miR-21's oncogenic functions. 
The let-7 family suppresses oncogenic RAS signalling through direct targeting of KRAS, NRAS, and HRAS 3\u0026prime;UTRs [\u003cspan citationid=\"CR35\" class=\"CitationRef\"\u003e35\u003c/span\u003e]. miR-34a is a direct transcriptional target of p53 and in turn represses CDK4, CDK6, BCL-2, MYC, and SIRT1 [\u003cspan citationid=\"CR34\" class=\"CitationRef\"\u003e34\u003c/span\u003e]. Their consistent downregulation in cancer (visible as blue columns in Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003e) and their loss through promoter hypermethylation, as captured in TCGA, underscore the functional importance of epigenetic miRNA regulation in oncogenesis. The PCA in Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003e is consistent with these tumour suppressor losses, combined with oncomiR gains, producing clear separation between cancer types and normal tissue.\u003c/p\u003e \u003cp\u003eThe hypoxia-regulated hsa-miR-210-3p illustrates a different dimension of NeoMiriX's biomarker logic: miRNAs induced by the tumour microenvironment. miR-210 is consistently induced by HIF-1α under hypoxic conditions, coordinating a metabolic shift toward glycolysis [\u003cspan citationid=\"CR36\" class=\"CitationRef\"\u003e36\u003c/span\u003e]. Its association with adverse prognosis across breast cancer, renal cancer, and glioblastoma, as captured in TCGA reference profiles, explains its high TCGA validation component in Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003eB.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec45\" class=\"Section2\"\u003e \u003ch2\u003e5.2 Comparison with Existing Computational Tools\u003c/h2\u003e \u003cp\u003eA structured comparison of NeoMiriX against the most widely used miRNA cancer bioinformatics tools reveals a clear functional advantage in analytical completeness. 
miRSystem [\u003cspan citationid=\"CR26\" class=\"CitationRef\"\u003e26\u003c/span\u003e], miRCancer [\u003cspan citationid=\"CR27\" class=\"CitationRef\"\u003e27\u003c/span\u003e], DIANA-miRPath [\u003cspan citationid=\"CR28\" class=\"CitationRef\"\u003e28\u003c/span\u003e], miRNet [\u003cspan citationid=\"CR29\" class=\"CitationRef\"\u003e29\u003c/span\u003e], TCGAbiolinks [\u003cspan citationid=\"CR30\" class=\"CitationRef\"\u003e30\u003c/span\u003e], and OncomiR [\u003cspan citationid=\"CR31\" class=\"CitationRef\"\u003e31\u003c/span\u003e] each address one or two elements of the analytical workflow. None integrates the complete pipeline from raw expression data ingestion through machine learning classification, multi-database annotation, risk stratification, and clinical report generation within a single graphical platform. Deep learning approaches [\u003cspan citationid=\"CR24\" class=\"CitationRef\"\u003e24\u003c/span\u003e] offer high accuracy but require large training datasets and substantial computational infrastructure, limiting immediate clinical applicability for smaller cohorts\u0026mdash;a scenario where NeoMiriX's shallow ensemble models excel.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec46\" class=\"Section2\"\u003e \u003ch2\u003e5.3 Strengths of the NeoMiriX Architecture\u003c/h2\u003e \u003cp\u003eThe plugin architecture, exposing lifecycle hooks at seven processing stages, enables domain-specific extension without modifying core modules. The offline SQLite caching capability addresses a critical deployment barrier for clinical environments with strict network constraints. The breadth of database integration\u0026mdash;sixteen external resources spanning molecular biology, disease genomics, functional genomics, pharmacology, and clinical data\u0026mdash;is unmatched among comparable open-source tools. 
The validated risk stratification framework (Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003e) translates probabilistic predictions into categories that oncology teams can interpret without bioinformatics expertise.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec47\" class=\"Section2\"\u003e \u003ch2\u003e5.4 Limitations and Mitigating Strategies\u003c/h2\u003e \u003cp\u003eNeoMiriX has not undergone prospective clinical validation; risk stratification outputs should not inform clinical decisions without independent validation in annotated patient cohorts. Classification accuracy is contingent on training data quality and quantity, and models trained on small or imbalanced datasets are prone to overfitting. The platform does not natively support analysis of isomiRs [\u003cspan citationid=\"CR42\" class=\"CitationRef\"\u003e42\u003c/span\u003e], which carry additional diagnostic information. Full system functionality depends on successful installation of companion modules; containerised Docker deployment is planned to resolve this. Mean-centering batch correction is a simplified approach that is insufficient for rigorous multi-study meta-analysis, which requires dedicated tools such as ComBat-seq.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec48\" class=\"Section2\"\u003e \u003ch2\u003e5.5 Future Directions\u003c/h2\u003e \u003cp\u003eDevelopment priorities include: (i) prospective multi-institutional clinical validation across five priority cancer types; (ii) single-cell miRNA sequencing integration with SCnorm/scran normalization for intratumoural heterogeneity analysis; (iii) federated learning via PySyft or TensorFlow Federated for privacy-preserving multi-centre training; (iv) expansion to circRNA and lncRNA biomarkers; and (v) integration of pharmacogenomic treatment recommendations from DrugBank and ChEMBL, aligned with predicted cancer type and risk stratum.\u003c/p\u003e \u003c/div\u003e"},{"header":"6. 
Conclusion","content":"\u003cp\u003eThis manuscript has presented NeoMiriX, a comprehensive, modular, and clinically oriented bioinformatics platform for miRNA-based cancer type prediction. NeoMiriX addresses the absence of a unified, accessible platform integrating the full analytical workflow: diverse miRNA expression data ingestion, robust preprocessing with six normalization strategies, ensemble machine learning classification with five algorithms, multi-database biological annotation across sixteen resources, pathway and network enrichment analysis, validated risk stratification, and automated clinical report generation via a graphical interface.\u003c/p\u003e \u003cp\u003eThe five publication-quality figures embedded in this manuscript collectively demonstrate: the end-to-end pipeline architecture (Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e); the biologically coherent miRNA expression signatures discriminating cancer types from normal tissue (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003e); the statistically robust PCA-based separation of cancer types (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003e); the distribution and per-cancer-type breakdown of risk stratification outputs (Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003e); and the composite biomarker importance scores and their multi-evidence-source decomposition (Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003e). 
Together, these validate the scientific soundness of NeoMiriX's analytical framework and its readiness for prospective clinical evaluation.

NeoMiriX is envisioned as a living platform evolving with the rapidly advancing fields of miRNA biology, computational oncology, and clinical bioinformatics, ultimately contributing to the molecular precision diagnosis that defines twenty-first-century oncology.

References

1. Hanahan D, Weinberg RA. Hallmarks of cancer: The next generation. Cell. 2011;144(5):646–674.
2. Bray F, et al. Global cancer statistics 2022: GLOBOCAN. CA Cancer J Clin. 2024;74(3):229–263.
3. Fizazi K, et al. Cancers of unknown primary site: ESMO Clinical Practice Guidelines. Ann Oncol. 2015;26(Suppl 5):v133–v138.
4. Lu J, et al. MicroRNA expression profiles classify human cancers. Nature. 2005;435(7043):834–838.
5. Lee RC, Feinbaum RL, Ambros V. The C. elegans heterochronic gene lin-4 encodes small RNAs. Cell. 1993;75(5):843–854.
6. Kozomara A, et al. miRBase: from microRNA sequences to function. Nucleic Acids Res. 2019;47(D1):D155–D162.
7. Bartel DP. MicroRNAs: Genomics, biogenesis, mechanism, and function. Cell. 2004;116(2):281–297.
8. Svoronos AA, et al. OncomiR or tumor suppressor? The duplicity of microRNAs in cancer. Cancer Res. 2016;76(13):3666–3670.
9. Iorio MV, Croce CM. MicroRNA dysregulation in cancer: diagnostics, monitoring and therapeutics. EMBO Mol Med. 2012;4(3):143–159.
10. Croce CM, Calin GA. miRNAs, cancer, and stem cell division. Cell. 2005;122(1):6–7.
11. Mitchell PS, et al. Circulating microRNAs as stable blood-based markers for cancer detection. Proc Natl Acad Sci USA. 2008;105(30):10513–10518.
12. Cancer Genome Atlas Research Network. The Cancer Genome Atlas Pan-Cancer analysis project. Nat Genet. 2013;45(10):1113–1120.
13. Huang Z, et al. HMDD v3.0: a database for experimentally supported human microRNA–disease associations. Nucleic Acids Res. 2019;47(D1):D1013–D1017.
14. Huang HY, et al. miRTarBase update 2022: an informatics resource for experimentally validated miRNA-target interactions. Nucleic Acids Res. 2022;50(D1):D222–D230.
15. Agarwal V, et al. Predicting effective microRNA target sites in mammalian mRNAs. Elife. 2015;4:e05005.
16. Sheng N, et al. Data normalization in the analysis of miRNA expression. Methods. 2019;152:14–20.
17. Iorio MV, et al. MicroRNA gene expression deregulation in human breast cancer. Cancer Res. 2005;65(16):7065–7070.
18. Volinia S, et al. A microRNA expression signature of human solid tumors defines cancer gene targets. Proc Natl Acad Sci USA. 2006;103(7):2257–2261.
19. Mitchell PS, et al. Circulating microRNAs as stable blood-based markers for cancer detection. Proc Natl Acad Sci USA. 2008;105(30):10513–10518.
20. Schwarzenbach H, et al. Clinical relevance of circulating cell-free microRNAs in cancer. Nat Rev Clin Oncol. 2014;11(3):145–156.
21. Vapnik V. The nature of statistical learning theory. New York: Springer; 1995.
22. Breiman L. Random forests. Mach Learn. 2001;45(1):5–32.
23. Chen T, Guestrin C. XGBoost: A scalable tree boosting system. KDD 2016:785–794.
24. Lv L, et al. Deep learning-based cancer type classification utilizing multiomic data. Comput Struct Biotechnol J. 2022;20:5044–5055.
25. Barrett T, et al. NCBI GEO: archive for functional genomics data sets--update. Nucleic Acids Res. 2013;41(D1):D991–D995.
26. Lu TP, et al. miRSystem: an integrated system for characterizing enriched functions. PLoS ONE. 2012;7(8):e42390.
27. Xie B, et al. miRCancer: a microRNA-cancer association database. Bioinformatics. 2013;29(5):638–644.
28. Vlachos IS, et al. DIANA-miRPath v3.0: deciphering microRNA function. Nucleic Acids Res. 2015;43(W1):W460–W466.
29. Fan Y, et al. miRNet - Dissecting miRNA-target interactions. Nucleic Acids Res. 2016;44(W1):W135–W141.
30. Colaprico A, et al. TCGAbiolinks: an R/Bioconductor package for integrative analysis of TCGA data. Nucleic Acids Res. 2016;44(8):e71.
31. Wong NW, et al. OncomiR: an online resource for exploring pan-cancer microRNA dysregulation. Bioinformatics. 2018;34(4):713–715.
32. Johnson WE, et al. Adjusting batch effects in microarray expression data. Biostatistics. 2007;8(1):118–127.
33. Benjamini Y, Hochberg Y. Controlling the false discovery rate. J R Stat Soc Series B. 1995;57(1):289–300.
34. He L, et al. A microRNA component of the p53 tumour suppressor network. Nature. 2007;447(7148):1130–1134.
35. Johnson SM, et al. RAS is regulated by the let-7 microRNA family. Cell. 2005;120(5):635–647.
36. Huang X, et al. Hypoxia-inducible mir-210 regulates normoxic gene expression. Mol Cell. 2009;35(6):856–867.
37. Gregory PA, et al. The miR-200 family and miR-205 regulate epithelial to mesenchymal transition. Nat Cell Biol. 2008;10(5):593–601.
38. Coulouarn C, et al. Loss of miR-122 expression in liver cancer. Oncogene. 2009;28(40):3526–3536.
39. Zhao L, et al. MicroRNA and signal transduction pathways in tumor radiation response. Cell Signal. 2012;24(6):1191–1202.
40. Ma L, et al. Tumour invasion and metastasis initiated by microRNA-10b in breast cancer. Nature. 2007;449(7163):682–688.
41. Vivanco I, Sawyers CL. The phosphatidylinositol 3-kinase-AKT pathway in human cancer. Nat Rev Cancer. 2002;2(7):489–501.
42. Telonis AG, et al. Knowledge about the presence or absence of miRNA isoforms can successfully discriminate amongst 32 TCGA cancer types. Nucleic Acids Res. 2017;45(6):2973–2985.

Keywords: microRNA, cancer prediction, bioinformatics platform, machine learning, TCGA, biomarker discovery, NeoMiriX, random forest, differential expression, pathway analysis
Institution: Badr University in Cairo
Subject areas: Bioinformatics; Cancer Biology; Artificial Intelligence and Machine Learning
Posted: May 4th, 2026 (Version 1, Research Square)
DOI: https://doi.org/10.21203/rs.3.rs-9576755/v1

License: CC-BY-4.0