COMBINE: A Comprehensive Multi-Omics Approach for Improving Breast Cancer Prognosis Classification in African American Women | Research Square

Xin Feng, Weiming Xie, Lin Dong, Yongxian Xin, Ruihao Xin

This is a preprint; it has not been peer reviewed by a journal. https://doi.org/10.21203/rs.3.rs-3852479/v1 This work is licensed under a CC BY 4.0 License. Status: Posted. Version 1 posted.

Abstract

Breast cancer disproportionately affects African American women under the age of 50, leading to higher incidence rates, more aggressive cancer subtypes, and increased mortality compared to other racial and ethnic groups. To enhance the prediction of onset risk and enable timely intervention and treatment, it is crucial to investigate the genetic and molecular factors associated with these disparities.
This study introduces COMBINE, an innovative ensemble learning model that combines three types of omics data to improve the accuracy of breast cancer prognosis classification and reduce the model's time complexity. A comparative analysis of the fusion effects for African American and White women reveals a significant improvement in the fusion effect for African American women. Additionally, gene enrichment analysis highlights the importance of considering race when selecting relevant biomarkers. To address the challenges of cancer prognosis classification, a combination of qualitative and quantitative methods, along with ensemble learning, is employed. This comprehensive approach facilitates the exploration of new concepts for the application of multi-omics data, potentially leading to more personalized and effective treatment strategies. The study highlights the potential of ensemble learning as a fusion technique for multi-omics data in cancer prognosis classification. It emphasizes the importance of refining our understanding of the genetic and molecular factors contributing to disparities in breast cancer incidence and outcomes. Ultimately, this research has the potential to improve healthcare outcomes for African American women and alleviate the burden of this formidable disease.

Subject areas: Biological sciences/Cancer/Breast cancer; Biological sciences/Computational biology and bioinformatics/Classification and taxonomy; Biological sciences/Computational biology and bioinformatics/Data mining

Keywords: Multi-omics, Breast Cancer, COMBINE, Race, Data Integration

Introduction

Multi-omics data, encompassing genomics, transcriptomics, proteomics, metabolomics, and epigenomics, provides an exhaustive perspective on a patient's molecular landscape 1.
The integration of this multi-omics information with conventional clinical detection indicators allows researchers to devise more accurate and tailored prediction models. This fusion facilitates the discovery of new biomarkers, the identification of distinct molecular signatures, and the clarification of complex molecular interplay that influences disease pathogenesis and progression 2 . The employment of big data mining technology for multi-omics data analysis has expedited the creation of sophisticated algorithms and computational models 3 . Consequently, researchers can now unravel intricate relationships among diverse molecular strata. This advancement has fostered a more profound comprehension of the fundamental mechanisms propelling disease onset and progression, as well as the recognition of potential therapeutic targets 4,5 . In summary, multi-omics data offers a comprehensive insight into the patient's molecular composition, and its integration with traditional clinical detection indicators promotes the development of precise and personalized prediction models. The application of big data mining technology in multi-omics data analysis 6 supports the advancement of cutting-edge algorithms and computational models, which in turn deepens the understanding of disease mechanisms and facilitates the discovery of potential therapeutic targets. Breast cancer has emerged as one of the most lethal malignant tumors globally, marked by a substantial number of new cases 7,8 . Consequently, the investigation of breast cancer mechanisms has garnered significant attention 9,10 . In recent years, the accelerated development of high-throughput gene sequencing technology has yielded copious amounts of biomics data, which contains genomic information intimately linked to cancer initiation and progression 11,12 . 
Therefore, the examination and analysis of biomics data can enhance our understanding of cancer mechanisms, ultimately bolstering the diagnosis, treatment, and prevention of the disease 13 . The five-year survival 14 rate serves as an evaluative measure of tumor patient survival and represents a standard criterion for assessing treatment efficacy, facilitating easy comparison. Even a minor improvement in survival can be deemed evidence of a clinically meaningful benefit 12 . As such, the systematic analysis of biomics data and the development of effective predictive models can play a crucial role in advancing our understanding of breast cancer mechanisms, which can ultimately lead to improved patient outcomes and more targeted therapeutic interventions. This study aims to examine the influence of racial information and natural factors on the incidence and progression of cancer by utilizing a multi-omics data fusion model for predicting breast cancer survival cycles. The main goal of this research is to improve the accuracy of breast cancer survival cycle prediction by developing an ensemble learning-based multi-omics fusion prediction model. This model integrates clinical, transcriptomic, and methylomic data from The Cancer Genome Atlas (TCGA) datasets. The experimental results demonstrate that the fusion of the three omics approaches (with an accuracy rate of 97.43%) outperforms single-omics experiments and other multi-omics and single-omics experiments based on race in the context of the three-omics experiments, considering racial disparities. This research provides technical support for predicting the survival cycle of breast cancer patients and introduces new concepts for studying breast cancer survival prognostics. It also offers valuable insights that can guide future research in the field of breast cancer survival prognosis. 
Materials and methods

In this study, a deep neural network (DNN)-based multi-omics fusion method is employed to investigate the classification efficacy of three omics data types. First, the standardized dataset undergoes preliminary screening: features with more than 20% missing values and features that do not vary across samples are removed 15. Only samples present in all three omics data types are retained. All male samples are then removed, and three sample sets are kept: the combined African American and White set, the African American set, and the White set. The KNN Imputer algorithm is applied to impute the remaining missing values, which would otherwise reduce classification accuracy 16. As this study focuses on survival prediction, clinical features related to survival time are removed, along with features whose missing values exceed 20%. Because the number of features in the transcriptomic and methylomic data far exceeds the sample size, the model is prone to overfitting, the so-called curse of dimensionality 17,18. Feature selection for the methylomic and transcriptomic data is therefore necessary. First, variance selection filters out features with minimal variation 19. Next, the optimal feature subset is selected according to the weights assigned by LinearSVC 20. The SMOTE oversampling technique is then used to balance the training set 21. Finally, distinct DNN models are trained with three-fold cross-validation in the three groups, keeping the same proportion of positive and negative samples in every fold. Ensemble learning is used to fuse the results of the three models. The experimental workflow is illustrated in Fig. 1.
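The preliminary screening step described above (dropping features with more than 20% missing values and features that never change) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function name `screen_features`, the row-list data layout, and the toy matrix are all hypothetical.

```python
import math


def screen_features(rows, max_missing=0.20):
    """Return the column indices that survive preliminary screening.

    rows: list of equal-length lists; a missing value is None or NaN.
    A column is dropped when its missing fraction exceeds max_missing,
    or when all of its observed values are identical (unchanged feature).
    """
    def is_missing(value):
        return value is None or (isinstance(value, float) and math.isnan(value))

    n_rows = len(rows)
    n_cols = len(rows[0]) if rows else 0
    kept = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if not is_missing(row[j])]
        missing_fraction = (n_rows - len(observed)) / n_rows
        if missing_fraction > max_missing:
            continue  # too many missing values
        if len(set(observed)) <= 1:
            continue  # constant feature carries no information
        kept.append(j)
    return kept


# Hypothetical toy matrix: 5 samples x 4 features.
toy = [
    [0.1, None, 5.0, 1.0],
    [0.4, None, 5.0, 2.0],
    [0.3, 2.0, 5.0, None],
    [0.9, None, 5.0, 4.0],
    [0.5, 1.0, 5.0, 3.0],
]
kept_columns = screen_features(toy)
```

In this toy matrix the second column is dropped for missingness and the third for being constant; the fourth column sits exactly at the 20% threshold and is kept because the rule removes only features that exceed it.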
Data source and preprocessing This study obtained data from The Cancer Genome Atlas (TCGA), a repository of cancer genomic maps that includes transcriptomics, DNA methylomics, and clinical information. Patient samples were intersected for each of the three omics data types, resulting in a final sample size of 567 White samples and 156 African American samples. Figure 2 displays the survival days for both groups: the blue section represents African American survival time density, while the red section represents White survival density. The abscissa and ordinate represent survival days and density, respectively. Patients were further classified as long- or short-term survivors based on the five-year survival criterion. To handle missing values in the sample, we removed features with more than 20% missing values. Then, we used the KNN Imputer method to fill in the remaining missing values while preserving the differences between features as much as possible. To address the issue of extensive data scale, which negatively impacts the algorithm's time complexity, we performed data standardization. Consequently, the feature data obtained encompassed 39,953 RNA-seq signatures, 395,007 methylation signatures, and 16 clinical information features. Feature Selection In this study, we employed two distinct feature selection algorithms, Variance estimation selection and SelectFromModel 20 , to conduct a two-step feature screening process for omics data with a larger number of features than the sample size. Variance refers to the dispersion of a single measure's distribution, indicating the squared average distance of the distribution. 
The mathematical representation of variance is shown below: $$\sigma^{2}=\frac{1}{n}\sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)^{2} \tag{1}$$ Here, \(\sigma^{2}\) represents the variance of a feature, n denotes the number of samples, \(x_{i}\) is the value of the feature in the ith sample, and \(\bar{x}\) is the mean of the feature across all samples. Filtering on variance provides an initial reduction of the high-dimensional feature space, thereby streamlining subsequent experiments. SelectFromModel (SFM) is a feature selection method that operates on the basis of feature importance weights. SFM can be given the maximum number of features to retain. In this study, we combined the Linear Support Vector Classification (LSVC) classifier, which supports multi-class classification and exhibits robust performance, with SFM to select features by evaluating their respective weights 22. LSVC is a widely utilized method in medical big data applications for feature selection, offering a reliable and effective approach to the analysis of high-dimensional data.

Sample Balance

The Synthetic Minority Oversampling Technique (SMOTE) is employed to address imbalanced datasets 23. In this study, we utilized the SMOTE algorithm from the Imbalanced-learn Python package 24. SMOTE mitigates the overfitting commonly encountered with random oversampling, which merely duplicates minority samples. It achieves this by analyzing minority samples and synthesizing new samples for the dataset based on them. The algorithm proceeds as follows:

1. For each sample x in the minority class, compute the Euclidean distance between it and all other samples in the minority class, and obtain its k nearest neighbors.
2. Establish a sampling ratio according to the sample imbalance ratio to determine the sampling magnification N.
3. For each minority sample x, randomly select several samples from its k nearest neighbors (as many as the sampling magnification N requires), with each chosen neighbor denoted as \(x_{n}\).
4. For each randomly selected neighbor \(x_{n}\), construct a new sample from the original sample using the following equation: $$x_{new}=x+\lambda \left(x_{n}-x\right) \tag{2}$$

Here, \(x_{new}\) represents the newly constructed sample, λ is a random number between 0 and 1, and \(x_{n}\) denotes the randomly selected neighbor, so each synthetic sample lies on the line segment between x and \(x_{n}\). By applying the SMOTE algorithm, this study balances the dataset and addresses the challenges posed by imbalanced data in classification and prediction tasks.

DNN Neural Network

Deep learning is founded upon neural network models that contain multiple hidden layers, as opposed to simpler, shallow neural network models 25. The basic framework of a Deep Neural Network (DNN) is depicted in Fig. 3, where the first layer is the input layer, which receives the original feature data. The layers between the first and last layers are referred to as hidden layers and serve as the primary units for data processing and feature learning. Each hidden layer first applies a linear model to the output of the preceding layer, then applies a nonlinear transformation to that result through an activation function, and passes the transformed result to the next layer. The final layer, known as the output layer, delivers the model's final result. Through this layer-by-layer abstraction, each layer within the DNN can extract more complex feature information, enabling deeper data characterization and pattern learning from large-scale training samples. Common activation functions in DNNs include Sigmoid, Tanh, and others, which primarily serve to give the network non-linear mapping capabilities 26.
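The layer-by-layer computation described above (a linear step followed by an activation, repeated per hidden layer) can be sketched in a few lines. This is an illustrative sketch only: the weights are made-up numbers, the function names are hypothetical, Tanh is used for the hidden layer, and the softmax output layer is an assumption, since the source does not state which output activation its models use.

```python
import math


def dense(inputs, weights, biases):
    """Linear step of one layer; weights holds one row per output unit."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def forward(features, hidden_params, output_params):
    """Pass one sample through tanh hidden layers and a softmax output layer."""
    activation = features
    for weights, biases in hidden_params:
        activation = [math.tanh(z) for z in dense(activation, weights, biases)]
    out_weights, out_biases = output_params
    return softmax(dense(activation, out_weights, out_biases))


# Hypothetical parameters: 3 input features -> 2 hidden units -> 2 classes.
hidden = [([[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]], [0.0, 0.1])]
output = ([[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0])
probs = forward([0.2, 0.4, 0.6], hidden, output)
```

The two entries of `probs` can be read as the predicted probabilities of the two survival classes for one patient; in a trained network the weights would be learned rather than fixed by hand.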
In the absence of activation functions, feedforward neural networks can only implement linear mappings, and multilayer neural networks become equivalent to single-layer neural networks. The addition of activation functions gives deep neural networks the ability to learn hierarchical non-linear mappings. The Tanh activation function addresses the issue of non-zero-centered Sigmoid outputs, which can result in slower convergence. Tanh is built from exponential functions, is often described as resembling the response of biological neurons, and maps input data to values between -1 and 1. Its mathematical expression is given as follows, where x represents the input value: $$\tanh\left(x\right)=\frac{e^{x}-e^{-x}}{e^{x}+e^{-x}} \tag{3}$$ The DNN training process employs a loss function to characterize the discrepancy between a patient's true label and the network's predicted output. By minimizing the loss function, the trainable parameters within the network are continuously updated and optimized to enhance survival prediction performance 27. This model employs categorical cross-entropy as the loss to classify breast cancer patients as long-term or short-term survivors. Its mathematical expression is as follows, where m is the number of samples, the predicted label of the ith sample is \(\widehat{y}^{i}=(\widehat{y}_{0}^{i},\widehat{y}_{1}^{i})\), and the true label of the ith sample is \(y^{i}=(y_{0}^{i},y_{1}^{i})\): $$\text{categorical crossentropy}\left(Y,\widehat{Y}\right)=-\frac{1}{m}\sum_{i=1}^{m}\sum_{j=0}^{1}y_{j}^{i}\log\left(\widehat{y}_{j}^{i}\right) \tag{4}$$

Ensemble Learning

Ensemble learning is not an independent machine learning algorithm; rather, it involves constructing and combining multiple machine learning models to complete learning tasks 28. Integrating multiple classifiers can yield better results than any single classifier.
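One way to combine classifiers, and the strategy this study applies to its three single-omics models, is soft voting, in which the class probabilities from each model are summed and the class with the largest total is chosen. The sketch below contrasts it with hard (majority) voting. The function names and the probability values are hypothetical and serve only to illustrate the mechanism.

```python
def soft_vote(probabilities_per_model):
    """Sum class probabilities across models; return (winning label, sums)."""
    n_classes = len(probabilities_per_model[0])
    summed = [sum(p[c] for p in probabilities_per_model)
              for c in range(n_classes)]
    label = max(range(n_classes), key=lambda c: summed[c])
    return label, summed


def hard_vote(probabilities_per_model):
    """Each model votes for its most probable class; the majority wins."""
    votes = [max(range(len(p)), key=lambda c: p[c])
             for p in probabilities_per_model]
    return max(set(votes), key=votes.count)


# Hypothetical outputs for one patient from three single-omics models
# (for example clinical, transcriptomic, methylation): [P(class 0), P(class 1)].
patient = [[0.55, 0.45], [0.60, 0.40], [0.10, 0.90]]

soft_label, class_sums = soft_vote(patient)
hard_label = hard_vote(patient)
```

In this example two models lean weakly toward class 0 while one is highly confident in class 1, so hard voting returns class 0 and soft voting returns class 1. The difference shows why soft voting can make use of each model's confidence rather than only its top choice.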
The voting mechanism is a combination strategy for classification problems in ensemble learning; its fundamental concept is to fuse the predictions of multiple models to reduce error. For classification models, the hard voting method predicts the result as the most frequently occurring category among multiple models' predictions 29. Soft voting, on the other hand, sums the predicted probabilities of each class across the models and selects the class label with the largest sum of probabilities. In multi-omics fusion, soft voting is used to combine the outputs of the single-omics models, which reduces the computational complexity of the model and effectively enhances its accuracy.

Performance Measurements

In machine learning and deep learning, evaluating the performance indicators of a model is essential for fully reflecting its recognition capabilities. A model's prediction results are typically summarized by the confusion matrix 30, which includes True Positive (TP), False Positive (FP), True Negative (TN), and False Negative (FN) counts. More advanced classification indicators can be derived from the confusion matrix 31, as shown in the following formulas for accuracy (Acc), sensitivity (SN), specificity (SP), precision (Pre), and the F1 score: $$\begin{array}{c}Acc=\frac{TP+TN}{TP+FP+TN+FN}\\ SN=\frac{TP}{TP+FN}\\ SP=\frac{TN}{TN+FP}\\ Pre=\frac{TP}{TP+FP}\\ F1=\frac{2\times Pre\times SN}{Pre+SN}\end{array}$$

Results

The results of data preprocessing

In this study, breast cancer patient omics data were sourced from clinical information, methylation, and transcriptomics data in the TCGA database. To facilitate multi-omics model fusion, data preprocessing was divided into two parts. First, sample-level processing involved: (1) excluding 8 male samples to reduce noise, and (2) retaining the samples shared by the three omics data types and focusing on White and African American samples, which yielded 723 samples that satisfy the requirements of the model fusion architecture.
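For reference, the confusion-matrix indicators defined in the Performance Measurements section can be computed as in the sketch below. It uses the standard definitions (specificity as TN/(TN+FP), and F1 as the harmonic mean of precision and sensitivity). The function name and the counts are hypothetical and are not results from this study.

```python
def classification_metrics(tp, fp, tn, fn):
    """Derive Acc, SN, SP, Pre and F1 from confusion-matrix counts."""
    acc = (tp + tn) / (tp + fp + tn + fn)   # overall accuracy
    sn = tp / (tp + fn)                     # sensitivity (recall)
    sp = tn / (tn + fp)                     # specificity
    pre = tp / (tp + fp)                    # precision
    f1 = 2 * pre * sn / (pre + sn)          # harmonic mean of Pre and SN
    return {"Acc": acc, "SN": sn, "SP": sp, "Pre": pre, "F1": f1}


# Hypothetical counts for a 100-sample evaluation fold.
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
```

Reporting all five values together is useful on imbalanced data, where accuracy alone can look high even when one class is rarely identified correctly.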
Secondly, data feature preprocessing involved: (1) removing features with over 20% missing values in each omics data type, followed by data imputation, and (2) excluding clinical features related to survival time, as well as strongly correlated features. Consequently, 395,007 methylation features, 39,953 transcriptome features, and 16 clinical information features were retained. Supplementary Tables 1 and 2 provide further details regarding these characteristics.

The result of feature number optimizing

Following preliminary data processing, each transcriptome sample consisted of 39,953 features, while each methylation sample comprised 395,007 features. Excessive features can result in overfitting, reducing model generalizability. Therefore, this study aimed to reduce data dimensionality through feature selection. Initially, low-variance features were removed to eliminate those with minimal contribution to the model. Subsequently, LinearSVC was employed to weight the transcriptome and methylation features, and the most discriminative features were screened based on their weight values 32. The number of selected features was varied from 5 to 150 in steps of 5. Finally, a DNN model with three-fold cross-validation was used to identify the optimal feature subset. The best results were obtained with 135 features, with methylation accuracy at 78.56% and transcriptome accuracy at 79.66%. The overall trend indicated superior classification efficacy for the transcriptome data. Figure 4 presents the classification performance for the transcriptome and methylation groups. The horizontal axis represents the number of features, the vertical axis indicates accuracy, the green line denotes methylomics, and the orange line denotes transcriptomics.

The results of transcription data on different races

The transcription data were divided into three groups: White, African American, and a non-racial group in which race is not considered.
Figure 5 shows the experimental results, with the purple line representing African Americans, the orange line indicating White, and the green line signifying the non-racial group. The African American group exhibited significantly better performance than the White and non-racial groups, with the latter two achieving similar accuracies. The accuracy of the White group was 80.59% with 140 features, while the African American model achieved 97.43% accuracy with only 90 features. By selecting the best model for each dataset as the final result, the accuracy of the African American group improved by 19.98% compared to a non-race-based approach. The best models for the non-grouped and White groups required 140 features to achieve 80.59% accuracy, whereas the African American group needed only 5 features to obtain better accuracy.

The results of methylation data on different races

Figure 6 shows the results for the methylation data. The purple line represents the African American feature selection curve, the orange line represents the White group, and the green line represents the scenario without considering race. The graph shows that both the African American and White groups had improved accuracy compared to not considering race. The African American group had the best survival prediction performance, with a classification accuracy of 89.74% at 95 features. The White group achieved a classification accuracy of 85.01% at 135 features, 4.73 percentage points below the African American group. The non-grouped approach attained an accuracy of 78.56%, so the African American group's accuracy was 11.18 percentage points higher than that of the race-neutral classification. The African American subgroup required only 65 features, while the White subgroup needed 90 features.
Both outperformed the race-neutral approach, which achieved 78.56% accuracy with 135 features. Prediction results of multiomics models For the final classification, DNN models were built for each of the three single-omics data sets, and their outputs were combined using the soft voting method of ensemble learning. The resulting groupings of individuals are depicted in Fig. 7 A, with non-racial individuals in purple, White individuals in orange, and African Americans in red. The analysis of multiple omics fusion showed that for African American patients, the accuracy of high-risk patient identification was 3.21% higher than the best transcriptome data in mono-omics and 16.02% higher than the worst clinical data in mono-omics. For white patients, the accuracy of multiple omics fusion was 7.94% higher than the best methylation data in mono-omics and 11.19% higher than the worst transcriptome data in mono-omics. The study found that ungrouped fusion was 8.16% more accurate than the best transcriptome data in mono-omics and 13.69% more accurate than the worst methylation data in mono-omics. These results suggest that the COMBINE approach is necessary for accurately identifying high-risk patients. The study selected a limited number of features for comparison due to the small number of African American samples. Figures 7 B, 7 C, and 7 D display the three mono-omics and fusion classification effect indicators of the three groups when 95 features were selected for gene expression and methylation, respectively. The COMBINE model demonstrated that combining multi-omics data produced superior results compared to the DNN strategy of single omics. Additionally, it yielded improved outcomes for different races, particularly among African American groups. Gene Function Analysis This study examined three biomarkers involved in the migration of breast cancer cells, including Hes1, CCDC71L, and KAP3A. 
Low expression of Hes1 protein inhibits proliferation and migration of triple-negative breast cancer cells (TNBC) and promotes apoptosis 33 , while overexpression of CCDC71L activates LINC00514, promoting TNBC cell proliferation, migration, and invasion, inhibiting apoptosis, and contributing to carcinogenesis 34 . KAP3A acts as a physiological substrate for breast tumor kinase (BRK), implicated in breast cancer cell migration 35 . Two additional biomarkers, AFG1L and LAGE3, serve as therapeutic targets, with AFG1L potentially being a valuable target for breast cancer treatment or prognosis 36 , and LAGE3 assisting in identifying early-stage TNBC patients at high risk of recurrence 37 . Furthermore, enrichment analysis on gene expression profiles of African American breast cancer patients was conducted, and SangerBox 38 was used to generate an enrichment analysis plot 39 , as shown in Fig. 8 . The endocrine resistance pathway was identified, where MED1 40 , a gene associated with estrogen receptor (ER) signaling, was found. The presence of the endocrine resistance pathway may explain the insensitivity of African American breast cancer patients to endocrine therapy. This finding holds significant implications for devising more effective treatment strategies, suggesting the need for targeted therapeutic interventions addressing the endocrine resistance pathway in African American patients. Nevertheless, further validation and exploration are necessary to determine this pathway's role and mechanism in African American patients. Statistical significance test To validate the effectiveness of the multi-omics fusion model for breast cancer, we conducted statistical significance tests. We performed t-tests and analysis of variance (ANOVA) on the fused model and the best-performing individual omics model. We utilized the TCGA breast cancer dataset and conducted 20 experiments, recording the performance metrics Acc and AUC. 
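A comparison of this kind between the fused model and the best single-omics model can be sketched with the Python standard library alone. The accuracy lists are hypothetical placeholders, not results from this study, and `pooled_t` and `one_way_anova_f` are illustrative names. With exactly two groups, the one-way ANOVA F statistic equals the square of the pooled-variance t statistic, which appears consistent with the values reported for this analysis (5.953 squared is about 35.44, and 5.086 squared is about 25.87).

```python
import math
from statistics import mean, variance


def pooled_t(a, b):
    """Two-sample t statistic assuming equal variances (pooled variance)."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var * (1 / na + 1 / nb))


def one_way_anova_f(*groups):
    """One-way ANOVA F statistic: between-group over within-group mean square."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Hypothetical accuracies from repeated runs (not taken from the paper).
fusion_acc = [0.90, 0.88, 0.91, 0.89, 0.92]
single_acc = [0.84, 0.86, 0.83, 0.85, 0.82]

t_stat = pooled_t(fusion_acc, single_acc)
f_stat = one_way_anova_f(fusion_acc, single_acc)
```

Because the two tests are equivalent when only two groups are compared, they provide one line of evidence rather than two independent confirmations; p-values would come from the t distribution with n_a + n_b - 2 degrees of freedom.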
Subsequently, we used the scipy library to conduct t-tests and ANOVA on the recorded results. The analysis produced the following outcomes: the t-test values for the AUC metric were 5.953, and for the Acc metric were 5.086, with both corresponding p-values being 0.0. The ANOVA results showed an F-test value of 35.443 for the AUC metric, and 25.863 for the Acc metric, with the respective p-values also being 0.0. Based on both the t-values and p-values, it can be concluded that there is a significant difference in AUC and Acc metrics between the fusion of multi-omics data and the other groups. These statistical findings further support the argument for the advantages of integrating multi-omics data. Conclusion This study proposes a novel ensemble learning method, COMBINE, which integrates multiple omics data to predict the survival outcomes of breast cancer patients. The approach utilizes three distinct omics data types and integrates three deep neural networks, significantly improving the accuracy of survival prediction compared to using a single omics data type. The accuracy achieved is up to 10.10% higher. To address imbalanced positive and negative samples, we utilized the SMOTE method. Additionally, we modeled multiple omics integration separately for two ethnic groups, White and African American. This revealed a 12.83% improvement in accuracy for the African American group, emphasizing the significance of considering race as a critical factor when selecting biomarkers for personalized breast cancer treatment. KEGG pathway enrichment analysis was conducted on identified genes in African American patients, revealing a higher prevalence of the endocrine resistance pathway. This may contribute to their poorer survival outcomes, indicating that different treatment strategies may be required to overcome endocrine resistance and improve survival rates. However, further research is necessary to validate these conclusions. 
Although our study shows the benefits of using COMBINE for integrating multiple omics data, we only fused three types of structured single omics data. In the future, we plan to incorporate various types of omics data to develop a more dependable model for predicting breast cancer survival. Declarations Ethical Approval Not applicable Consent for publication Written informed consent for publication was obtained from all participants. Competing interests The authors declare no competing interests. Funding This work is supported by the National Natural Science Foundation of China Mathematics Tianyuan Fund Project (12326377), the Science and Technology Project of Science and Technology Department of Jilin Province (222621JC010397545,222621JC010694762). Author Contribution X.F. conceptualized the study, developed the methodology, provided the software, conducted the formal analysis, and wrote the original draft. W.X. contributed to the software and validated the results. L.D. curated the data. Y.X. also contributed to the software. R.X. administered the project, provided resources, supervised the study, acquired funding, and reviewed and edited the manuscript. Acknowledgments Not applicable Availability of data and materials The datasets used and/or analyzed during the current study are available from the corresponding author upon reasonable request. References Liu, Y. et al. Metagenomics next-generation sequencing provides insights into the causative pathogens from critically ill patients with pneumonia and improves treatment strategies. Frontiers in Cellular and Infection Microbiology 12 (2023). Kalafi, E. Y., Nor, N., Taib, N. A., Ganggayah, M. & Dhillon, S. K. Machine Learning and Deep Learning Approaches in Breast Cancer Survival Prediction Using Clinical Data. Folia biologica 65, 212–220 (2019). Zhu, T. et al. Variations in genotype–phenotype correlations in phenylalanine hydroxylase deficiency in Chinese Han population. Gene (2013). Li, D.-m. & Feng, Y.-m. 
Signaling mechanism of cell adhesion molecules in breast cancer metastasis: potential therapeutic targets. Breast Cancer Research and Treatment 128, 7–21 (2011). Fan, Y., Xu, B.-h., Liao, Y., Yao, S. & Sun, Y. A retrospective study of metachronous and synchronous ipsilateral supraclavicular lymph node metastases in breast cancer patients. Breast 19 5, 365–369 (2010). Reel, P. S., Reel, S., Pearson, E. R., Trucco, E. & Jefferson, E. R. Using machine learning approaches for multi-omics data analysis: A review. Biotechnology advances, 107739 (2021). Fatima, N., Li, L., Hong, S. & Ahmed, H. Prediction of Breast Cancer, Comparative Review of Machine Learning Techniques and their Analysis. IEEE Access PP, 1–1 (2020). Wolff, A. C. et al. Randomized phase III placebo-controlled trial of letrozole plus oral temsirolimus as first-line endocrine therapy in postmenopausal women with locally advanced or metastatic breast cancer. Journal of clinical oncology: official journal of the American Society of Clinical Oncology 31 2, 195–202 (2013). Ulgen, A., Gürkut, Ö. & Li, W. Potential Predictive Factors for Breast Cancer Subtypes from a North Cyprus Cohort Analysis. Cyprus Journal of Medical Sciences (2019). Monzavi–Karbassi, B., Siegel, E. R., Medarametla, S., Makhoul, I. & Kieber–Emmons, T. Breast cancer survival disparity between African American and Caucasian women in Arkansas: A race-by-grade analysis. Oncol Lett 12, 1337–1342, doi: 10.3892/ol.2016.4804 (2016). Yu, H. J., Jing, C., Xiao, N., Zang, X. M. & Tan, Q. W. Structural difference analysis of adult's intestinal flora basing on the 16S rDNA gene sequencing technology. (2020). Karvinen, K. H., Raedeke, T. D., Arastu, H. H. & Allison, R. R. Exercise programming and counseling preferences of breast cancer survivors during or after radiation therapy. Oncology nursing forum 38 5, E326-334 (2011). Antoine, W. & Miernyk, J. A. A Multidimensional Scaling-Based Model for Analysis of Time-Index Biomics Data. (2009). Ellison, L. 
F., Bryant, H., Lockwood, G. & Shack, L. Conditional survival analyses across cancer sites. Health Reports 22, 21–25 (2011). Xin, F. et al. Detection and Comparative Analysis of Methylomic Biomarkers of Rheumatoid Arthritis. Frontiers in genetics 11 (2020). Afaq, J. et al. Water Quality Prediction Using KNN Imputer and Multilayer Perceptron. Water 14 (2022). Wang, H. et al. LaCOme: learning the latent convolutional patterns among transcriptomic features to improve classifications. Gene, 147246 (2023). Xin, R. et al. Computational Characterization of Undifferentially Expressed Genes with Altered Transcription Regulation in Lung Cancer. Genes 14 (2023). Fan, J., Guo, S. & Hao, N. Variance estimation using refitted cross-validation in ultrahigh dimensional regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology) (2012). Li, S. Identifying Optimal Wavelengths as Disease Signatures Using Hyperspectral Sensor and Machine Learning. Remote Sensing 13 (2021). Feng, S., Keung, J. W., Yu, X., Xiao, Y. & Zhang, M. Investigation on the stability of SMOTE-based oversampling techniques in software defect prediction. Inf. Softw. Technol. 139, 106662 (2021). Feng, X. et al. MSFC: a new feature construction method for accurate diagnosis of mass spectrometry data. Scientific Reports 13, 15694, doi: 10.1038/s41598-023-42395-5 (2023). Fernandez, A., Garcia, S., Chawla, N. V. & Herrera, F. SMOTE for Learning from Imbalanced Data: Progress and Challenges, Marking the 15-year Anniversary. Journal of Artificial Intelligence Research 61, 863–905 (2018). Guillaume, L., Fernando, N. & K., A. C. Imbalanced-learn: A Python Toolbox to Tackle the Curse of Imbalanced Datasets in Machine Learning. JOURNAL OF MACHINE LEARNING RESEARCH 18 (2017). Ginanjar, S., Suhartono, Wibowo, A. & Sarwoko, E. A. The best architecture selection with deep neural network (DNN) method for breast cancer classification using MicroRNA data. 
Journal of Physics: Conference Series 1524 (2020). Tian, Y.-q., Lai, Y. A. & Yang, C. Research of Consumption Behavior Prediction Based on Improved DNN. Scientific Programming (2022). Mahmoud, A. Sathurthi, S. & Saruladha, K. An analysis of parallel ensemble diabetes decision support system based on voting classifier for classification problem. Electron. Gov. an Int. J. 16, 25–38 (2020). Li, J. et al. MuscNet, a Weighted Voting Model of Multi-Source Connectivity Networks to Predict Mild Cognitive Impairment Using Resting-State Functional MRI. IEEE access: practical innovations, open solutions 8, 174023–174031 (2020). Zhiqin, W., Ruiqing, L., Minghui, W. & Ao, L. GPDBN: deep bilinear network integrating both genomic data and pathological images for breast cancer prognosis prediction. Bioinformatics (Oxford, England) 37 (2021). Tharwat, A. Classification assessment methods. Applied Computing and Informatics (2018). Haohui, L. & Shahadat, U. Explainable Stacking-Based Model for Predicting Hospital Readmission for Diabetic Patients. Information 13 (2022). Yao, L. & Tian, F. GRWD1 affects the proliferation, apoptosis, invasion and migration of triple negative breast cancer through the Notch signaling pathway. Exp Ther Med 24, 473, doi: 10.3892/etm.2022.11400 (2022). Luo, X. & Wang, H. LINC00514 upregulates CCDC71L to promote cell proliferation, migration and invasion in triple-negative breast cancer by sponging miR-6504-5p and miR-3139. Cancer Cell Int 21, 180, doi: 10.1186/s12935-021-01875-2 (2021). Lukong, K. E. & Richard, S. Breast tumor kinase BRK requires kinesin-2 subunit KAP3A in modulation of cell migration. Cell Signal 20, 432–442, doi: 10.1016/j.cellsig.2007.11.003 (2008). Luo, W. et al. Breast Cancer Prognosis Prediction and Immune Pathway Molecular Analysis Based on Mitochondria-Related Genes. Genet Res (Camb) 2022, 2249909, doi: 10.1155/2022/2249909 (2022). Yang, Y. S. et al. 
The early-stage triple-negative breast cancer landscape derives a novel prognostic signature and therapeutic target. Breast Cancer Res Treat 193, 319–330, doi: 10.1007/s10549-022-06537-z (2022). Shen, W. et al. Sangerbox: A comprehensive, interaction-friendly clinical bioinformatics analysis platform. iMeta 1, e36, doi: 10.1002/imt2.36 (2022). Kim, J. In silico analysis of differentially expressed genesets in metastatic breast cancer identifies potential prognostic biomarkers. World Journal of Surgical Oncology 19, 188, doi: 10.1186/s12957-021-02301-7 (2021). Wang, Y. et al. A Novel Multimodal MRI Analysis for Alzheimer's Disease Based on Convolutional Neural Network. Annual International Conference of the IEEE Engineering in Medicine and Biology Society. IEEE Engineering in Medicine and Biology Society. Annual International Conference 2018, 754–757, doi: 10.1109/embc.2018.8512372 (2018). Additional Declarations No competing interests reported. Supplementary Files COMBINESupplementaryFiles.xls
© Research Square 2026 | ISSN 2693-5015 (online) {"props":{"pageProps":{"initialData":{"identity":"rs-3852479","acceptedTermsAndConditions":true,"allowDirectSubmit":true,"archivedVersions":[],"articleType":"Article","associatedPublications":[],"authors":[{"id":268534749,"identity":"3b6558f0-112a-467b-83d4-81712aa52bba","order_by":0,"name":"Xin Feng","email":"","orcid":"","institution":"Jilin Institute of Chemical Technology","correspondingAuthor":false,"prefix":"","firstName":"Xin","middleName":"","lastName":"Feng","suffix":""},{"id":268534750,"identity":"c507be9d-6bd4-46c1-957f-7c52096d125a","order_by":1,"name":"Weiming Xie","email":"","orcid":"","institution":"Jilin Institute of Chemical Technology","correspondingAuthor":false,"prefix":"","firstName":"Weiming","middleName":"","lastName":"Xie","suffix":""},{"id":268534751,"identity":"937fc15d-3d28-47d5-b28f-19251fcc4d73","order_by":2,"name":"Lin Dong","email":"","orcid":"","institution":"Jilin University","correspondingAuthor":false,"prefix":"","firstName":"Lin","middleName":"","lastName":"Dong","suffix":""},{"id":268534752,"identity":"0ccf3b1c-c088-49da-aa20-caefef2b237f","order_by":3,"name":"Yongxian Xin","email":"","orcid":"","institution":"Australian National University","correspondingAuthor":false,"prefix":"","firstName":"Yongxian","middleName":"","lastName":"Xin","suffix":""},{"id":268534753,"identity":"cbaafc92-f970-486a-8172-903c6cd28b9b","order_by":4,"name":"Ruihao 
Xin","email":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAZAAAAAyAQMAAABI0h/eAAAABlBMVEX///8AAABVwtN+AAAACXBIWXMAAA7EAAAOxAGVKw4bAAAA6UlEQVRIiWNgGAWjYDACZhBhcICBgb0h8UACWCiBWC08BxKAWgyI0AIBQC0SCSCSCC0Gx3mPfeYpuCPPL/ngwYEHNX8Y+NlzDBh+7sCtRbKZL3k2j8Ezw5mzE4AOO2bAINnzxoCx9wxuLfzMPMbMPAaHGTfcBmpJbDBgMLiRY8DM2IZbCxtUi/2GmwcgWuwJaYHZkrjhBgPUFgkCWiSbeYwZ5xgcTp7ZA/aLMY/EmWcFB3vxaDE4f8aY4c2fw7b97GcSH/6okZPjb0/e+OAnHi1IgCcBTIKIA0RpAKYYYhWOglEwCkbBSAMATsFQS1M+CJQAAAAASUVORK5CYII=","orcid":"","institution":"Jilin Institute of Chemical Technology","correspondingAuthor":true,"prefix":"","firstName":"Ruihao","middleName":"","lastName":"Xin","suffix":""}],"badges":[],"createdAt":"2024-01-11 05:59:15","currentVersionCode":1,"declarations":"","doi":"10.21203/rs.3.rs-3852479/v1","doiUrl":"https://doi.org/10.21203/rs.3.rs-3852479/v1","draftVersion":[],"editorialEvents":[],"editorialNote":"","failedWorkflow":false,"files":[{"id":50011678,"identity":"5d5d5082-38be-45b1-b1e9-f8107af6e576","added_by":"auto","created_at":"2024-01-23 05:27:35","extension":"png","order_by":1,"title":"Figure 1","display":"","copyAsset":false,"role":"figure","size":135677,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eExperimental flowchart\u003c/strong\u003e\u003c/p\u003e","description":"","filename":"image1.png","url":"https://assets-eu.researchsquare.com/files/rs-3852479/v1/56a6d39091a1859c91d7ce9f.png"},{"id":50011542,"identity":"93e7a2d9-57df-489e-8c8f-8a9f6d60f44c","added_by":"auto","created_at":"2024-01-23 05:19:35","extension":"png","order_by":2,"title":"Figure 2","display":"","copyAsset":false,"role":"figure","size":83802,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eillustrates the survival time density maps for different races among breast cancer patients. The graph clearly demonstrates that African American patients generally experience shorter survival times, whereas Caucasian patients tend to have longer survival times. 
This observation highlights the significant disparities in outcomes based on race in breast cancer treatment and emphasizes the need for targeted interventions to address these disparities.\u003c/strong\u003e\u003c/p\u003e","description":"","filename":"image2.png","url":"https://assets-eu.researchsquare.com/files/rs-3852479/v1/116d524715441e05aec56bc6.png"},{"id":50011547,"identity":"8ca9dc21-1880-4465-b2e1-4478acae2f4d","added_by":"auto","created_at":"2024-01-23 05:19:36","extension":"png","order_by":3,"title":"Figure 3","display":"","copyAsset":false,"role":"figure","size":182845,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eBasic framework of a deep neural network\u003c/strong\u003e\u003c/p\u003e","description":"","filename":"image3.png","url":"https://assets-eu.researchsquare.com/files/rs-3852479/v1/4bc047e48be6546eaba0dfb7.png"},{"id":50011543,"identity":"25e39862-28bc-4b56-9f80-fc3545301f08","added_by":"auto","created_at":"2024-01-23 05:19:35","extension":"png","order_by":4,"title":"Figure 4","display":"","copyAsset":false,"role":"figure","size":31113,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eEffect of race-independent feature selection on DNA methylation and transcriptome data\u003c/strong\u003e\u003c/p\u003e","description":"","filename":"image4.png","url":"https://assets-eu.researchsquare.com/files/rs-3852479/v1/faa320b274f9bcdece70d9d9.png"},{"id":50011679,"identity":"16d53137-aff3-42bb-aee6-ff3f9780b7d1","added_by":"auto","created_at":"2024-01-23 05:27:36","extension":"png","order_by":5,"title":"Figure 5","display":"","copyAsset":false,"role":"figure","size":33976,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eThe classification performance of transcriptomes in different populations is influenced by the number of features 
utilized.\u003c/strong\u003e\u003c/p\u003e","description":"","filename":"image5.png","url":"https://assets-eu.researchsquare.com/files/rs-3852479/v1/bc992f0f119d0b8b73a67894.png"},{"id":50011546,"identity":"e97c4f87-4822-4540-8282-44194b1a048f","added_by":"auto","created_at":"2024-01-23 05:19:36","extension":"png","order_by":6,"title":"Figure 6","display":"","copyAsset":false,"role":"figure","size":38066,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eThe classification performance of methylation in different populations is influenced by the number of features utilized.\u003c/strong\u003e\u003c/p\u003e","description":"","filename":"image6.png","url":"https://assets-eu.researchsquare.com/files/rs-3852479/v1/40d53bcc2862c7b2b348ff43.png"},{"id":50011549,"identity":"fc8df0a0-0375-4afc-b0f7-225b9650a449","added_by":"auto","created_at":"2024-01-23 05:19:36","extension":"png","order_by":7,"title":"Figure 7","display":"","copyAsset":false,"role":"figure","size":167399,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eComparison of single-omics and multiomics models in different racial groups \u003c/strong\u003e(A. Comparison of accuracy between different racial groups in multiomics models; B. Comparison of multiple evaluation metrics between multiomics and single-omics models under no racial distinctions; C. Comparison of multiple evaluation metrics between multiomics and single-omics models within the White race group; D. 
Comparison of multiple evaluation metrics between multiomics and single-omics models within the Black race group.)\u003c/p\u003e","description":"","filename":"image7.png","url":"https://assets-eu.researchsquare.com/files/rs-3852479/v1/02d02e7097035673608ae0e9.png"},{"id":50011550,"identity":"93a4414b-a3ab-48dd-a62b-2bbd91627767","added_by":"auto","created_at":"2024-01-23 05:19:36","extension":"png","order_by":8,"title":"Figure 8","display":"","copyAsset":false,"role":"figure","size":228475,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eKEGG Enrichment analysis of gene screening in African Americans\u003c/strong\u003e\u003c/p\u003e","description":"","filename":"image8.png","url":"https://assets-eu.researchsquare.com/files/rs-3852479/v1/2f729b563691165e78e9f2b8.png"},{"id":63244882,"identity":"5755ff6a-dfc4-45cf-852c-2e01dc689481","added_by":"auto","created_at":"2024-08-26 05:33:19","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":1474846,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-3852479/v1/75489d49-fa4d-4bc0-b0d3-7cfd4617ec54.pdf"},{"id":50011544,"identity":"250c987d-d320-4dbf-98d0-b30884ef3b67","added_by":"auto","created_at":"2024-01-23 05:19:35","extension":"xls","order_by":2,"title":"","display":"","copyAsset":false,"role":"supplement","size":19968,"visible":true,"origin":"","legend":"","description":"","filename":"COMBINESupplementaryFiles.xls","url":"https://assets-eu.researchsquare.com/files/rs-3852479/v1/0509fb4265c4acbbe2a8c73e.xls"}],"financialInterests":"No competing interests reported.","formattedTitle":"COMBINE: A Comprehensive Multi-Omics Approach for Improving Breast Cancer Prognosis Classification in African American Women","fulltext":[{"header":"Introduction","content":"\u003cp\u003eMulti-omics data, encompassing genomics, transcriptomics, proteomics, metabolomics, and epigenomics, 
provides an exhaustive perspective on a patient's molecular landscape \u003csup\u003e1\u003c/sup\u003e. The integration of this multi-omics information with conventional clinical detection indicators allows researchers to devise more accurate and tailored prediction models. This fusion facilitates the discovery of new biomarkers, the identification of distinct molecular signatures, and the clarification of complex molecular interplay that influences disease pathogenesis and progression \u003csup\u003e2\u003c/sup\u003e. The employment of big data mining technology for multi-omics data analysis has expedited the creation of sophisticated algorithms and computational models \u003csup\u003e3\u003c/sup\u003e. Consequently, researchers can now unravel intricate relationships among diverse molecular strata. This advancement has fostered a more profound comprehension of the fundamental mechanisms propelling disease onset and progression, as well as the recognition of potential therapeutic targets \u003csup\u003e4,5\u003c/sup\u003e. In summary, multi-omics data offers a comprehensive insight into the patient's molecular composition, and its integration with traditional clinical detection indicators promotes the development of precise and personalized prediction models. The application of big data mining technology in multi-omics data analysis \u003csup\u003e6\u003c/sup\u003e supports the advancement of cutting-edge algorithms and computational models, which in turn deepens the understanding of disease mechanisms and facilitates the discovery of potential therapeutic targets.\u003c/p\u003e \u003cp\u003eBreast cancer has emerged as one of the most lethal malignant tumors globally, marked by a substantial number of new cases \u003csup\u003e7,8\u003c/sup\u003e. Consequently, the investigation of breast cancer mechanisms has garnered significant attention \u003csup\u003e9,10\u003c/sup\u003e. 
In recent years, the accelerated development of high-throughput gene sequencing technology has yielded copious amounts of biomics data, which contains genomic information intimately linked to cancer initiation and progression \u003csup\u003e11,12\u003c/sup\u003e. Therefore, the examination and analysis of biomics data can enhance our understanding of cancer mechanisms, ultimately bolstering the diagnosis, treatment, and prevention of the disease \u003csup\u003e13\u003c/sup\u003e. The five-year survival \u003csup\u003e14\u003c/sup\u003e rate serves as an evaluative measure of tumor patient survival and represents a standard criterion for assessing treatment efficacy, facilitating easy comparison. Even a minor improvement in survival can be deemed evidence of a clinically meaningful benefit \u003csup\u003e12\u003c/sup\u003e. As such, the systematic analysis of biomics data and the development of effective predictive models can play a crucial role in advancing our understanding of breast cancer mechanisms, which can ultimately lead to improved patient outcomes and more targeted therapeutic interventions.\u003c/p\u003e \u003cp\u003eThis study aims to examine the influence of racial information and natural factors on the incidence and progression of cancer by utilizing a multi-omics data fusion model for predicting breast cancer survival cycles. The main goal of this research is to improve the accuracy of breast cancer survival cycle prediction by developing an ensemble learning-based multi-omics fusion prediction model. This model integrates clinical, transcriptomic, and methylomic data from The Cancer Genome Atlas (TCGA) datasets. The experimental results demonstrate that the fusion of the three omics approaches (with an accuracy rate of 97.43%) outperforms single-omics experiments and other multi-omics and single-omics experiments based on race in the context of the three-omics experiments, considering racial disparities. 
This research provides technical support for predicting the survival cycle of breast cancer patients and introduces new concepts for studying breast cancer survival prognostics. It also offers valuable insights that can guide future research in the field of breast cancer survival prognosis.\u003c/p\u003e"},{"header":"Materials and methods","content":"\u003cp\u003eIn this study, a deep neural network (DNN)-based multi-omics fusion method is employed to investigate the classification efficacy of three omics data types. Initially, the standardized dataset undergoes preliminary screening, with features exhibiting missing values greater than 20% and unchanged features being removed \u003csup\u003e15\u003c/sup\u003e. The sample intersection of the three omics data types is retained. Subsequently, all male samples are deleted, and the African American and White sample sets, the African American sample collection, and the White sample collection are preserved. The KNN Imputer algorithm is applied to impute missing values in the samples, addressing the issue of reduced classification accuracy due to missing values \u003csup\u003e16\u003c/sup\u003e. As this study focuses on survival prediction, features related to survival cycles in clinical data are removed, along with some features with missing values exceeding 20%.\u003c/p\u003e \u003cp\u003eGiven that the number of features in transcriptomics and methylomics substantially surpasses the sample size, overfitting of the model is a potential concern, resulting in the so-called curse of dimensionality \u003csup\u003e17,18\u003c/sup\u003e. Consequently, feature selection for methylomics and transcriptomics data is necessary. First, variance selection is employed to filter out features with minimal changes \u003csup\u003e19\u003c/sup\u003e. Subsequently, the optimal feature subset is selected according to the weights assigned by LinearSVC \u003csup\u003e20\u003c/sup\u003e. 
The SMOTE oversampling technique is then utilized to balance the training set \u003csup\u003e21\u003c/sup\u003e. Finally, distinct DNN models are trained with three-fold cross-validation in the three groups, maintaining the same proportion of positive and negative samples in each fold. The ensemble learning concept is implemented to fuse the results of the three models. The experimental workflow is illustrated in Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cdiv id=\"Sec3\" class=\"Section2\"\u003e \u003ch2\u003eData source and preprocessing\u003c/h2\u003e \u003cp\u003eThis study obtained data from The Cancer Genome Atlas (TCGA), a repository of cancer genomic maps that includes transcriptomics, DNA methylomics, and clinical information. Only patients with samples in all three omics data types were retained, resulting in a final sample size of 567 White samples and 156 African American samples. Figure\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003e displays the survival days for both groups: the blue section represents the African American survival time density, while the red section represents the White survival time density. The abscissa and ordinate represent survival days and density, respectively.\u003c/p\u003e \u003cp\u003ePatients were further classified as long- or short-term survivors based on the five-year survival criterion. To handle missing values in the sample, we removed features with more than 20% missing values. Then, we used the KNN Imputer method to fill in the remaining missing values while preserving the differences between features as much as possible. To address the widely differing scales of the features, which negatively affect the algorithm's running time, we standardized the data. 
Consequently, the feature data obtained encompassed 39,953 RNA-seq signatures, 395,007 methylation signatures, and 16 clinical information features.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec4\" class=\"Section2\"\u003e \u003ch2\u003eFeature Selection\u003c/h2\u003e \u003cp\u003eIn this study, we employed two feature selection algorithms, variance-based selection and SelectFromModel \u003csup\u003e20\u003c/sup\u003e, in a two-step screening process for the omics data whose feature count exceeds the sample size. Variance measures the dispersion of a single feature's distribution as the mean squared deviation of its values from their mean. The mathematical representation of variance is shown below:\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\({{\\sigma }}^{2}=\\frac{1}{\\text{n}}{\\sum }_{\\text{i}=1}^{\\text{n}}{\\left({\\text{x}}_{\\text{i} }-\\stackrel{-}{\\text{x}}\\right)}^{2} \\left(1\\right)\\)\u003c/span\u003e\u003c/span\u003e\u003c/p\u003e \u003cp\u003eHere, σ\u003csup\u003e2\u003c/sup\u003e represents the variance, n denotes the number of samples, x\u003csub\u003ei\u003c/sub\u003e is the value of the feature in the ith sample, and x̅ is the mean of the feature across all samples. Removing features with low variance provides an initial reduction of the high-dimensional feature space, thereby streamlining subsequent experiments.\u003c/p\u003e \u003cp\u003eSelectFromModel (SFM) is a feature selection method that operates on the basis of feature importance weights. In this study, SFM is configured with the number of features to retain. We combined the Linear Support Vector Classification (LSVC) classifier, which supports multi-class classification and exhibits robust performance, with SFM to select features by evaluating their respective weights \u003csup\u003e22\u003c/sup\u003e. 
LSVC is a widely utilized method in medical big data applications for feature selection, offering a reliable and effective approach to the analysis of high-dimensional data.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec5\" class=\"Section2\"\u003e \u003ch2\u003eSample Balance\u003c/h2\u003e \u003cp\u003eThe Synthetic Minority Oversampling Technique (SMOTE) is employed to address imbalanced datasets \u003csup\u003e23\u003c/sup\u003e. In this study, we utilized the SMOTE algorithm from the Imbalanced-learn Python package \u003csup\u003e24\u003c/sup\u003e. The SMOTE algorithm mitigates the issue of overfitting commonly encountered in random sampling algorithms. It achieves this by analyzing minority samples and synthesizing new samples for the dataset based on these minority samples. The algorithm process proceeds as follows:\u003c/p\u003e \u003cp\u003e \u003col\u003e \u003cspan\u003e \u003cli\u003e \u003cp\u003eFor each sample x in the minority class, compute the Euclidean distance between it and all samples in the minority sample set, subsequently obtaining its k nearest neighbor samples.\u003c/p\u003e \u003c/li\u003e \u003c/span\u003e \u003cspan\u003e \u003cli\u003e \u003cp\u003eEstablish a sampling ratio according to the sample imbalance ratio to determine the sampling magnification N. 
For each minority sample x, randomly select several samples from its k neighbors, with the chosen neighbors denoted as \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\({x}_{n}\\)\u003c/span\u003e\u003c/span\u003e.\u003c/p\u003e \u003c/li\u003e \u003c/span\u003e \u003cspan\u003e \u003cli\u003e \u003cp\u003eFor each randomly selected neighbor \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\({x}_{n}\\)\u003c/span\u003e\u003c/span\u003e, construct a new sample from the original sample by interpolating between the two, using the following equation:\u003c/p\u003e \u003c/li\u003e \u003c/span\u003e \u003c/ol\u003e \u003cdiv id=\"Equa\" class=\"Equation\"\u003e \u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equa\" name=\"EquationSource\"\u003e\n$${x}_{new}=x+{\\lambda }\\left({x}_{n}-x\\right) \\left(2\\right)$$\u003c/div\u003e \u003c/div\u003e \u003c/p\u003e \u003cp\u003eHere, \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\({x}_{new}\\)\u003c/span\u003e\u003c/span\u003e represents the newly constructed sample, λ is a random number between 0 and 1, and \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\({x}_{n}\\)\u003c/span\u003e\u003c/span\u003e denotes the randomly selected neighbor, so the new sample lies on the line segment between x and its neighbor.\u003c/p\u003e \u003cp\u003eBy implementing the SMOTE algorithm, this study balances the dataset and addresses the challenges posed by imbalanced data in classification and prediction tasks.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec6\" class=\"Section2\"\u003e \u003ch2\u003eDeep Neural Network\u003c/h2\u003e \u003cp\u003eDeep learning builds on neural network models and studies networks with multiple hidden layers, as opposed to simpler, shallower neural network models \u003csup\u003e25\u003c/sup\u003e. 
The basic framework of a Deep Neural Network (DNN) is depicted in Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003e, where the first layer is designated as the input layer, receiving the original data feature input. The layers between the first and last layers are referred to as hidden layers, serving as the primary units for data processing and feature learning. Each hidden layer first fits the input from the preceding layer with a linear model, then applies a nonlinear transformation to the result through an activation function, and passes the transformed result to the next layer for processing. The final layer, known as the output layer, delivers the model's final result. Through this layer-by-layer abstraction process, each layer within the DNN can extract more complex feature information, facilitating deeper data characterization and pattern learning for large-scale training samples.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eCommon activation functions in DNNs include Sigmoid, Tanh, and others, which primarily serve to confer non-linear mapping capabilities to the network \u003csup\u003e26\u003c/sup\u003e. In the absence of activation functions, feedforward neural networks can only implement linear mappings, and multilayer neural networks become equivalent to single-layer neural networks. The addition of activation functions imparts hierarchical non-linear mapping learning abilities to deep neural networks. The Tanh activation function addresses the issue of non-zero-centered Sigmoid outputs, which can result in slower convergence. Built from exponential functions, the Tanh activation function closely resembles biological neurons in a physical sense and maps input data to values between −1 and 1. 
Its mathematical expression is given as follows, where x represents the input value:\u003cdiv id=\"Equb\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equb\" name=\"EquationSource\"\u003e\n$$\\text{tanh}\\left(x\\right)=\\frac{{e}^{x}-{e}^{-x}}{{e}^{x}+{e}^{-x}} \\left(3\\right)$$\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e \u003cp\u003eThe DNN training process employs a loss function to characterize the discrepancy between a patient's true label value and the network's predicted output value. By minimizing the loss function, the trainable parameters within the network are continuously updated and optimized to enhance survival prediction performance \u003csup\u003e27\u003c/sup\u003e. This model employs the categorical cross-entropy loss to classify breast cancer patients as long-term or short-term survivors. The mathematical expression for this is as follows, where m is the number of samples, the predicted label of the ith sample is \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\({\\widehat{y}}^{i}=({\\widehat{y}}_{0}^{i},{\\widehat{y}}_{1}^{i})\\)\u003c/span\u003e\u003c/span\u003e, and the true label of the ith sample is \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\({y}^{i}=({y}_{0}^{i},{y}_{1}^{i})\\)\u003c/span\u003e\u003c/span\u003e.\u003cdiv id=\"Equc\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equc\" name=\"EquationSource\"\u003e\n$$\\text{categorical crossentropy}\\left(Y,\\widehat{Y}\\right)=-\\frac{1}{m}{\\sum }_{i=1}^{m}{\\sum }_{j=0}^{1}{y}_{j}^{i}\\,\\text{log}\\left({\\widehat{y}}_{j}^{i}\\right) \\left(4\\right)$$\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec7\" class=\"Section2\"\u003e \u003ch2\u003eEnsemble Learning\u003c/h2\u003e \u003cp\u003eEnsemble learning is not an independent machine learning algorithm; rather, it involves constructing and combining multiple machine learning models to complete 
learning tasks \u003csup\u003e28\u003c/sup\u003e. The integration of multiple classifiers leads to improved results. The voting mechanism (voting) is a combination strategy for classification problems in ensemble learning, with the fundamental concept being the fusion of multiple data sources to reduce error. For classification models, the hard voting method predicts the result as the most frequently occurring category among multiple models' predictions \u003csup\u003e29\u003c/sup\u003e. Soft voting, on the other hand, aggregates the probabilities of each type of prediction result and ultimately selects the class label with the largest sum of probabilities. In multi-omics fusion, the concept of soft voting is employed to obtain the results of multi-omics model fusion, which reduces the computational complexity of the model and effectively enhances its accuracy.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec8\" class=\"Section2\"\u003e \u003ch2\u003ePerformance Measurements\u003c/h2\u003e \u003cp\u003eIn machine learning and deep learning, evaluating the performance indicators of a model is essential for fully reflecting its recognition capabilities. The model's prediction results typically rely on the confusion matrix \u003csup\u003e30\u003c/sup\u003e, which includes True Positive (TP), False Positive (FP), True Negative (TN), and False Negative (FN). 
More advanced categorical indicators can be derived from the confusion matrix \u003csup\u003e31\u003c/sup\u003e, as demonstrated in the following formulas:\u003cdiv id=\"Equd\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equd\" name=\"EquationSource\"\u003e\n$$\\begin{array}{c}Acc=\\frac{TP+TN}{TP+FP+TN+FN}\\\\ SN=\\frac{TP}{TP+FN}\\\\ SP=\\frac{TN}{TN+FP}\\\\ Pre=\\frac{TP}{TP+FP}\\\\ F1=\\frac{2\\times Pre\\times SN}{SN+Pre}\\end{array}$$\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e \u003c/div\u003e"},{"header":"Results","content":"\u003cdiv id=\"Sec10\" class=\"Section2\"\u003e \u003ch2\u003eThe results of data preprocessing\u003c/h2\u003e \u003cp\u003eIn this study, breast cancer patient omics data were sourced from clinical information, methylation, and transcriptomics data in the TCGA database. To facilitate multi-omics model fusion, data preprocessing was divided into two parts.\u003c/p\u003e \u003cp\u003eInitially, original data sample processing involved: (\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e) excluding 8 male samples to reduce noise, (\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e) retaining intersection samples of the three omics data and focusing on White and African American samples, ultimately obtaining 723 samples to satisfy the model fusion architecture requirements.\u003c/p\u003e \u003cp\u003eSecondly, data feature preprocessing involved: (\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e) removing features with over 20% missing values in each omics, followed by data imputation, (\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e) excluding features relevant to survival prediction in clinical information and strongly correlated features. Consequently, 395,007 methylation features, 39,953 transcriptome features, and 16 clinical information features were selected. 
Supplementary Tables\u0026nbsp;1 and 2 provide further details regarding these characteristics.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec11\" class=\"Section2\"\u003e \u003ch2\u003eThe result of feature number optimizing\u003c/h2\u003e \u003cp\u003eFollowing preliminary data processing, each transcriptome sample consisted of 39,953 features, while each methylation sample comprised 395,007 features. Excessive features can result in overfitting, reducing model generalizability. Therefore, this study aimed to reduce data dimensionality through feature selection. Initially, low-variance features were removed to eliminate those with minimal contribution to the model. Subsequently, LinearSVC was employed to weight transcriptome and methylation data, and significant differentiating features were screened based on weight values \u003csup\u003e32\u003c/sup\u003e. The feature start number was set at 5, end number at 150, and step interval at 5. Finally, a DNN model integrated with three-fold cross-validation optimized the feature subset. The optimal results comprised 135 features, with methylation accuracy at 78.56% and transcriptome accuracy at 79.66%. The overall trend indicated superior classification efficacy in transcriptome. Figure\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003e presents the classification effect, dividing patients into transcriptome and methylation groups. 
The horizontal axis represents the number of features, the vertical axis indicates accuracy, the green line denotes methylation data, and the orange line denotes transcriptome data.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec12\" class=\"Section2\"\u003e \u003ch2\u003eThe results of transcription data on different races\u003c/h2\u003e \u003cp\u003eThe transcriptome data were analyzed in three groups: White patients, African American patients, and all patients without regard to race (the non-racial group). Figure\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003e shows the experimental results, with the purple line representing the African American group, the orange line the White group, and the green line the non-racial group. The African American group exhibited significantly better performance than the White and non-racial groups, with the latter two achieving similar accuracies. The accuracy of the White group was 80.59% with 140 features, while the African American model achieved 97.43% accuracy with only 90 features. By selecting the best model for each dataset as the final result, the accuracy of the African American group improved by 19.98% compared to a non-race-based approach. The best models for the non-grouped and White grouping required 140 features to achieve 80.59% accuracy, whereas the African American group needed only 5 features to obtain better accuracy.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec13\" class=\"Section2\"\u003e \u003ch2\u003eThe results of methylation data on different races\u003c/h2\u003e \u003cp\u003eFigure \u003cspan refid=\"Fig6\" class=\"InternalRef\"\u003e6\u003c/span\u003e shows the results of the study. The purple line represents the African American feature selection curve, the orange line represents the White grouping, and the green line represents the scenario without considering race. 
The graph shows that both the African American and White groups had improved accuracy compared to not considering race. The African American group had the best survival prediction performance, with a classification accuracy rate of 89.74% at 95 features. The White group achieved a classification accuracy of 85.01% at 135 features, 4.73 percentage points lower than the African American group. The race-neutral (non-grouped) approach attained an accuracy of 78.56%, 11.18 percentage points lower than the African American group. The subgroup of African Americans required only 65 features, while the subgroup of White individuals needed 90 features. Both outperformed the race-neutral approach, which achieved 78.56% accuracy with 135 features.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec14\" class=\"Section2\"\u003e \u003ch2\u003ePrediction results of multiomics models\u003c/h2\u003e \u003cp\u003eFor the final classification, DNN models were built for each of the three single-omics data sets, and their outputs were combined using the soft voting method of ensemble learning. The resulting groupings of individuals are depicted in Fig.\u0026nbsp;\u003cspan refid=\"Fig7\" class=\"InternalRef\"\u003e7\u003c/span\u003eA, with non-racial individuals in purple, White individuals in orange, and African Americans in red. The analysis of multiple omics fusion showed that for African American patients, the accuracy of high-risk patient identification was 3.21% higher than the best transcriptome data in mono-omics and 16.02% higher than the worst clinical data in mono-omics. For White patients, the accuracy of multiple omics fusion was 7.94% higher than the best methylation data in mono-omics and 11.19% higher than the worst transcriptome data in mono-omics. 
The study found that ungrouped fusion was 8.16% more accurate than the best transcriptome data in mono-omics and 13.69% more accurate than the worst methylation data in mono-omics. These results suggest that the COMBINE approach is necessary for accurately identifying high-risk patients.\u003c/p\u003e \u003cp\u003eThe study selected a limited number of features for comparison due to the small number of African American samples. Figures\u0026nbsp;\u003cspan refid=\"Fig7\" class=\"InternalRef\"\u003e7\u003c/span\u003eB, \u003cspan refid=\"Fig7\" class=\"InternalRef\"\u003e7\u003c/span\u003eC, and \u003cspan refid=\"Fig7\" class=\"InternalRef\"\u003e7\u003c/span\u003eD display the three mono-omics and fusion classification effect indicators of the three groups when 95 features were selected for gene expression and methylation, respectively. The COMBINE model demonstrated that combining multi-omics data produced superior results compared to the DNN strategy of single omics. Additionally, it yielded improved outcomes for different races, particularly among African American groups.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec15\" class=\"Section2\"\u003e \u003ch2\u003eGene Function Analysis\u003c/h2\u003e \u003cp\u003eThis study examined three biomarkers involved in the migration of breast cancer cells, including Hes1, CCDC71L, and KAP3A. Low expression of Hes1 protein inhibits proliferation and migration of triple-negative breast cancer cells (TNBC) and promotes apoptosis \u003csup\u003e33\u003c/sup\u003e, while overexpression of CCDC71L activates LINC00514, promoting TNBC cell proliferation, migration, and invasion, inhibiting apoptosis, and contributing to carcinogenesis \u003csup\u003e34\u003c/sup\u003e. KAP3A acts as a physiological substrate for breast tumor kinase (BRK), implicated in breast cancer cell migration \u003csup\u003e35\u003c/sup\u003e. 
Two additional biomarkers, AFG1L and LAGE3, serve as therapeutic targets, with AFG1L potentially being a valuable target for breast cancer treatment or prognosis\u003csup\u003e36\u003c/sup\u003e, and LAGE3 assisting in identifying early-stage TNBC patients at high risk of recurrence \u003csup\u003e37\u003c/sup\u003e.\u003c/p\u003e \u003cp\u003eFurthermore, enrichment analysis on gene expression profiles of African American breast cancer patients was conducted, and SangerBox\u003csup\u003e38\u003c/sup\u003e was used to generate an enrichment analysis plot \u003csup\u003e39\u003c/sup\u003e, as shown in Fig.\u0026nbsp;\u003cspan refid=\"Fig8\" class=\"InternalRef\"\u003e8\u003c/span\u003e. The endocrine resistance pathway was identified, where MED1 \u003csup\u003e40\u003c/sup\u003e, a gene associated with estrogen receptor (ER) signaling, was found. The presence of the endocrine resistance pathway may explain the insensitivity of African American breast cancer patients to endocrine therapy. This finding holds significant implications for devising more effective treatment strategies, suggesting the need for targeted therapeutic interventions addressing the endocrine resistance pathway in African American patients. Nevertheless, further validation and exploration are necessary to determine this pathway's role and mechanism in African American patients.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec16\" class=\"Section2\"\u003e \u003ch2\u003eStatistical significance test\u003c/h2\u003e \u003cp\u003eTo validate the effectiveness of the multi-omics fusion model for breast cancer, we conducted statistical significance tests. We performed t-tests and analysis of variance (ANOVA) on the fused model and the best-performing individual omics model. We utilized the TCGA breast cancer dataset and conducted 20 experiments, recording the performance metrics Acc and AUC. 
Subsequently, we used the scipy library to conduct t-tests and ANOVA on the recorded results.\u003c/p\u003e \u003cp\u003eThe analysis produced the following outcomes: the t-test values for the AUC metric were 5.953, and for the Acc metric were 5.086, with both corresponding p-values being 0.0. The ANOVA results showed an F-test value of 35.443 for the AUC metric, and 25.863 for the Acc metric, with the respective p-values also being 0.0. Based on both the t-values and p-values, it can be concluded that there is a significant difference in AUC and Acc metrics between the fusion of multi-omics data and the other groups. These statistical findings further support the argument for the advantages of integrating multi-omics data.\u003c/p\u003e \u003c/div\u003e"},{"header":"Conclusion","content":"\u003cp\u003eThis study proposes a novel ensemble learning method, COMBINE, which integrates multiple omics data to predict the survival outcomes of breast cancer patients. The approach utilizes three distinct omics data types and integrates three deep neural networks, significantly improving the accuracy of survival prediction compared to using a single omics data type. The accuracy achieved is up to 10.10% higher. To address imbalanced positive and negative samples, we utilized the SMOTE method. Additionally, we modeled multiple omics integration separately for two ethnic groups, White and African American. This revealed a 12.83% improvement in accuracy for the African American group, emphasizing the significance of considering race as a critical factor when selecting biomarkers for personalized breast cancer treatment. KEGG pathway enrichment analysis was conducted on identified genes in African American patients, revealing a higher prevalence of the endocrine resistance pathway. This may contribute to their poorer survival outcomes, indicating that different treatment strategies may be required to overcome endocrine resistance and improve survival rates. 
However, further research is necessary to validate these conclusions. Although our study shows the benefits of using COMBINE for integrating multiple omics data, we only fused three types of structured single omics data. In the future, we plan to incorporate various types of omics data to develop a more dependable model for predicting breast cancer survival.\u003c/p\u003e"},{"header":"Declarations","content":" \u003cp\u003e \u003cstrong\u003eEthical Approval\u003c/strong\u003e \u003cp\u003eNot applicable\u003c/p\u003e \u003c/p\u003e \u003cp\u003e \u003cstrong\u003eConsent for publication\u003c/strong\u003e \u003cp\u003e Written informed consent for publication was obtained from all participants.\u003c/p\u003e \u003c/p\u003e \u003cp\u003e \u003cstrong\u003eCompeting interests\u003c/strong\u003e \u003cp\u003eThe authors declare no competing interests.\u003c/p\u003e \u003c/p\u003e\u003ch2\u003eFunding\u003c/h2\u003e \u003cp\u003eThis work is supported by the National Natural Science Foundation of China Mathematics Tianyuan Fund Project (12326377), the Science and Technology Project of Science and Technology Department of Jilin Province (222621JC010397545,222621JC010694762).\u003c/p\u003e\u003ch2\u003eAuthor Contribution\u003c/h2\u003e\u003cp\u003eX.F. conceptualized the study, developed the methodology, provided the software, conducted the formal analysis, and wrote the original draft. W.X. contributed to the software and validated the results. L.D. curated the data. Y.X. also contributed to the software. R.X. 
administered the project, provided resources, supervised the study, acquired funding, and reviewed and edited the manuscript.\u003c/p\u003e\u003ch2\u003eAcknowledgments\u003c/h2\u003e \u003cp\u003eNot applicable\u003c/p\u003e\u003ch2\u003eAvailability of data and materials\u003c/h2\u003e \u003cp\u003eThe datasets used and/or analyzed during the current study are available from the corresponding author upon reasonable request.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\u003cli\u003e\u003cspan\u003eLiu, Y. \u003cem\u003eet al.\u003c/em\u003e Metagenomics next-generation sequencing provides insights into the causative pathogens from critically ill patients with pneumonia and improves treatment strategies. Frontiers in Cellular and Infection Microbiology 12 (2023).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKalafi, E. Y., Nor, N., Taib, N. A., Ganggayah, M. \u0026amp; Dhillon, S. K. Machine Learning and Deep Learning Approaches in Breast Cancer Survival Prediction Using Clinical Data. Folia biologica 65, 212\u0026ndash;220 (2019).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eZhu, T. \u003cem\u003eet al.\u003c/em\u003e Variations in genotype\u0026ndash;phenotype correlations in phenylalanine hydroxylase deficiency in Chinese Han population. Gene (2013).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLi, D.-m. \u0026amp; Feng, Y.-m. Signaling mechanism of cell adhesion molecules in breast cancer metastasis: potential therapeutic targets. Breast Cancer Research and Treatment 128, 7\u0026ndash;21 (2011).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eFan, Y., Xu, B.-h., Liao, Y., Yao, S. \u0026amp; Sun, Y. A retrospective study of metachronous and synchronous ipsilateral supraclavicular lymph node metastases in breast cancer patients. Breast 19 5, 365\u0026ndash;369 (2010).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eReel, P. S., Reel, S., Pearson, E. R., Trucco, E. 
\u0026amp; Jefferson, E. R. Using machine learning approaches for multi-omics data analysis: A review. Biotechnology advances, 107739 (2021).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eFatima, N., Li, L., Hong, S. \u0026amp; Ahmed, H. Prediction of Breast Cancer, Comparative Review of Machine Learning Techniques and their Analysis. IEEE Access PP, 1\u0026ndash;1 (2020).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWolff, A. C. \u003cem\u003eet al.\u003c/em\u003e Randomized phase III placebo-controlled trial of letrozole plus oral temsirolimus as first-line endocrine therapy in postmenopausal women with locally advanced or metastatic breast cancer. Journal of clinical oncology: official journal of the American Society of Clinical Oncology 31 2, 195\u0026ndash;202 (2013).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eUlgen, A., G\u0026uuml;rkut, \u0026Ouml;. \u0026amp; Li, W. Potential Predictive Factors for Breast Cancer Subtypes from a North Cyprus Cohort Analysis. Cyprus Journal of Medical Sciences (2019).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMonzavi\u0026ndash;Karbassi, B., Siegel, E. R., Medarametla, S., Makhoul, I. \u0026amp; Kieber\u0026ndash;Emmons, T. Breast cancer survival disparity between African American and Caucasian women in Arkansas: A race-by-grade analysis. Oncol Lett 12, 1337\u0026ndash;1342, doi:\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.3892/ol.2016.4804\u003c/span\u003e\u003cspan address=\"10.3892/ol.2016.4804\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e (2016).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eYu, H. J., Jing, C., Xiao, N., Zang, X. M. \u0026amp; Tan, Q. W. Structural difference analysis of adult's intestinal flora basing on the 16S rDNA gene sequencing technology. (2020).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKarvinen, K. H., Raedeke, T. 
D., Arastu, H. H. \u0026amp; Allison, R. R. Exercise programming and counseling preferences of breast cancer survivors during or after radiation therapy. Oncology nursing forum 38 5, E326-334 (2011).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eAntoine, W. \u0026amp; Miernyk, J. A. A Multidimensional Scaling-Based Model for Analysis of Time-Index Biomics Data. (2009).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eEllison, L. F., Bryant, H., Lockwood, G. \u0026amp; Shack, L. Conditional survival analyses across cancer sites. Health Reports 22, 21\u0026ndash;25 (2011).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eXin, F. \u003cem\u003eet al.\u003c/em\u003e Detection and Comparative Analysis of Methylomic Biomarkers of Rheumatoid Arthritis. Frontiers in genetics 11 (2020).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eAfaq, J. \u003cem\u003eet al.\u003c/em\u003e Water Quality Prediction Using KNN Imputer and Multilayer Perceptron. Water 14 (2022).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWang, H. \u003cem\u003eet al.\u003c/em\u003e LaCOme: learning the latent convolutional patterns among transcriptomic features to improve classifications. Gene, 147246 (2023).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eXin, R. \u003cem\u003eet al.\u003c/em\u003e Computational Characterization of Undifferentially Expressed Genes with Altered Transcription Regulation in Lung Cancer. Genes 14 (2023).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eFan, J., Guo, S. \u0026amp; Hao, N. Variance estimation using refitted cross-validation in ultrahigh dimensional regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology) (2012).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLi, S. Identifying Optimal Wavelengths as Disease Signatures Using Hyperspectral Sensor and Machine Learning. 
Remote Sensing 13 (2021).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eFeng, S., Keung, J. W., Yu, X., Xiao, Y. \u0026amp; Zhang, M. Investigation on the stability of SMOTE-based oversampling techniques in software defect prediction. Inf. Softw. Technol. 139, 106662 (2021).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eFeng, X. \u003cem\u003eet al.\u003c/em\u003e MSFC: a new feature construction method for accurate diagnosis of mass spectrometry data. Scientific Reports 13, 15694, doi:\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1038/s41598-023-42395-5\u003c/span\u003e\u003cspan address=\"10.1038/s41598-023-42395-5\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e (2023).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eFernandez, A., Garcia, S., Chawla, N. V. \u0026amp; Herrera, F. SMOTE for Learning from Imbalanced Data: Progress and Challenges, Marking the 15-year Anniversary. Journal of Artificial Intelligence Research 61, 863\u0026ndash;905 (2018).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGuillaume, L., Fernando, N. \u0026amp; K., A. C. Imbalanced-learn: A Python Toolbox to Tackle the Curse of Imbalanced Datasets in Machine Learning. JOURNAL OF MACHINE LEARNING RESEARCH 18 (2017).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGinanjar, S., Suhartono, Wibowo, A. \u0026amp; Sarwoko, E. A. The best architecture selection with deep neural network (DNN) method for breast cancer classification using MicroRNA data. \u003cem\u003eJournal of Physics: Conference Series\u003c/em\u003e 1524 (2020).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eTian, Y.-q., Lai, Y. A. \u0026amp; Yang, C. Research of Consumption Behavior Prediction Based on Improved DNN. 
\u003cem\u003eScientific Programming\u003c/em\u003e (2022).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMahmoud, A.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSathurthi, S. \u0026amp; Saruladha, K. An analysis of parallel ensemble diabetes decision support system based on voting classifier for classification problem. Electron. Gov. an Int. J. 16, 25\u0026ndash;38 (2020).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLi, J. \u003cem\u003eet al.\u003c/em\u003e MuscNet, a Weighted Voting Model of Multi-Source Connectivity Networks to Predict Mild Cognitive Impairment Using Resting-State Functional MRI. IEEE access: practical innovations, open solutions 8, 174023\u0026ndash;174031 (2020).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eZhiqin, W., Ruiqing, L., Minghui, W. \u0026amp; Ao, L. GPDBN: deep bilinear network integrating both genomic data and pathological images for breast cancer prognosis prediction. Bioinformatics (Oxford, England) 37 (2021).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eTharwat, A. Classification assessment methods. Applied Computing and Informatics (2018).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHaohui, L. \u0026amp; Shahadat, U. Explainable Stacking-Based Model for Predicting Hospital Readmission for Diabetic Patients. Information 13 (2022).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eYao, L. \u0026amp; Tian, F. GRWD1 affects the proliferation, apoptosis, invasion and migration of triple negative breast cancer through the Notch signaling pathway. Exp Ther Med 24, 473, doi:\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.3892/etm.2022.11400\u003c/span\u003e\u003cspan address=\"10.3892/etm.2022.11400\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e (2022).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLuo, X. \u0026amp; Wang, H. 
LINC00514 upregulates CCDC71L to promote cell proliferation, migration and invasion in triple-negative breast cancer by sponging miR-6504-5p and miR-3139. Cancer Cell Int 21, 180, doi:\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1186/s12935-021-01875-2\u003c/span\u003e\u003cspan address=\"10.1186/s12935-021-01875-2\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e (2021).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLukong, K. E. \u0026amp; Richard, S. Breast tumor kinase BRK requires kinesin-2 subunit KAP3A in modulation of cell migration. Cell Signal 20, 432\u0026ndash;442, doi:\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1016/j.cellsig.2007.11.003\u003c/span\u003e\u003cspan address=\"10.1016/j.cellsig.2007.11.003\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e (2008).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLuo, W. \u003cem\u003eet al.\u003c/em\u003e Breast Cancer Prognosis Prediction and Immune Pathway Molecular Analysis Based on Mitochondria-Related Genes. \u003cem\u003eGenet Res (Camb)\u003c/em\u003e 2022, 2249909, doi:\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1155/2022/2249909\u003c/span\u003e\u003cspan address=\"10.1155/2022/2249909\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e (2022).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eYang, Y. S. \u003cem\u003eet al.\u003c/em\u003e The early-stage triple-negative breast cancer landscape derives a novel prognostic signature and therapeutic target. 
Breast Cancer Res Treat 193, 319\u0026ndash;330, doi:\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1007/s10549-022-06537-z\u003c/span\u003e\u003cspan address=\"10.1007/s10549-022-06537-z\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e (2022).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eShen, W. \u003cem\u003eet al.\u003c/em\u003e Sangerbox: A comprehensive, interaction-friendly clinical bioinformatics analysis platform. \u003cem\u003eiMeta\u003c/em\u003e 1, e36, doi:\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1002/imt2.36\u003c/span\u003e\u003cspan address=\"10.1002/imt2.36\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e (2022).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKim, J. In silico analysis of differentially expressed genesets in metastatic breast cancer identifies potential prognostic biomarkers. World Journal of Surgical Oncology 19, 188, doi:\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1186/s12957-021-02301-7\u003c/span\u003e\u003cspan address=\"10.1186/s12957-021-02301-7\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e (2021).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWang, Y. \u003cem\u003eet al.\u003c/em\u003e A Novel Multimodal MRI Analysis for Alzheimer's Disease Based on Convolutional Neural Network. \u003cem\u003eAnnual International Conference of the IEEE Engineering in Medicine and Biology Society. IEEE Engineering in Medicine and Biology Society. 
Annual International Conference\u003c/em\u003e 2018, 754\u0026ndash;757, doi:\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1109/embc.2018.8512372\u003c/span\u003e\u003cspan address=\"10.1109/embc.2018.8512372\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e (2018).\u003c/span\u003e\u003c/li\u003e\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":false,"hideJournal":true,"highlight":"","institution":"","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"
[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true},"keywords":"Multi-omics, Breast Cancer, COMBINE, Race, Data Integration","lastPublishedDoi":"10.21203/rs.3.rs-3852479/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-3852479/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003eBreast cancer disproportionately affects African American women under the age of 50, leading to higher incidence rates, more aggressive cancer subtypes, and increased mortality compared to other racial and ethnic groups. To enhance the prediction of onset risk and enable timely intervention and treatment, it is crucial to investigate the genetic and molecular factors associated with these disparities. This study introduces COMBINE, an innovative ensemble learning model that combines three types of omics data to improve the accuracy of breast cancer prognosis classification and reduce the model's time complexity. A comparative analysis of the fusion effects for African American and White women reveals a significant improvement in the fusion effect for African American women. Additionally, gene enrichment analysis highlights the importance of considering race when selecting relevant biomarkers. To address the challenges of cancer prognosis classification, a combination of qualitative and quantitative methods, along with ensemble learning, is employed. This comprehensive approach facilitates the exploration of new concepts for the application of multi-omics data, potentially leading to more personalized and effective treatment strategies. 
The study highlights the potential of ensemble learning as a fusion technique for multi-omics data in cancer prognosis classification. It emphasizes the importance of refining our understanding of the genetic and molecular factors contributing to disparities in breast cancer incidence and outcomes. Ultimately, this research has the potential to improve healthcare outcomes for African American women and alleviate the burden of this formidable disease.\u003c/p\u003e","manuscriptTitle":"COMBINE: A Comprehensive Multi-Omics Approach for Improving Breast Cancer Prognosis Classification in African American Women","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2024-01-23 05:19:31","doi":"10.21203/rs.3.rs-3852479/v1","editorialEvents":[{"type":"communityComments","content":0}],"status":"published","journal":{"display":true,"email":"
[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"7d5336e3-b3bf-4e16-9504-76eb067ddd5c","owner":[],"postedDate":"January 23rd, 2024","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"posted","subjectAreas":[{"id":28284167,"name":"Biological sciences/Cancer/Breast cancer"},{"id":28284168,"name":"Biological sciences/Computational biology and bioinformatics/Classification and taxonomy"},{"id":28284169,"name":"Biological sciences/Computational biology and bioinformatics/Data mining"}],"tags":[],"updatedAt":"2024-08-26T05:17:10+00:00","versionOfRecord":[],"versionCreatedAt":"2024-01-23 05:19:31","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-3852479","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-3852479","identity":"rs-3852479","version":["v1"]},"buildId":"qtupq5eGEP_6zYnWcrvyt","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}