Personalized Prediction of Survival Rate with Combination of Penalized Cox Models and Machine Learning in Patients with Colorectal Cancer

doi:10.21203/rs.3.rs-4024382/v1


2024 · doi:10.21203/rs.3.rs-4024382/v1

preprint OA: closed


Research Article (Research Square preprint)

Seon Hwa Lee, Jae Myung Cha, Seung Jun Shin

This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-4024382/v1
This work is licensed under a CC BY 4.0 License.
Status: Posted. Version 1 posted.

Abstract

Background
Individual survival rates within a patient population have typically been investigated with the Cox proportional hazards model, whereas black box models have not been employed.

Aims
We aimed to evaluate the performance of machine learning algorithms in predicting survival beyond 5 years for individual patients with colorectal cancer.
Methods
A total of 475 patients with colorectal cancer (CRC) and complete data who had undergone surgery were analyzed to estimate each individual's probability of surviving more than 5 years, using machine learning based on penalized Cox regression. These individual survival probabilities were then used for performance evaluation. Receiver operating characteristic (ROC) curves were analyzed for the least absolute shrinkage and selection operator (LASSO) penalized model, the smoothly clipped absolute deviation (SCAD) penalized model, the unpenalized model, and the random survival forest (RSF) model.

Results
The LASSO penalized model had a mean AUC of 0.67 ± 0.06, the SCAD penalized model a mean AUC of 0.65 ± 0.07, and the unpenalized model a mean AUC of 0.64 ± 0.09. The RSF model outperformed the others, with a mean AUC of 0.71 ± 0.05.

Conclusions
As a prognosis prediction model for CRC, the penalized Cox model is more efficient and leads to a more generalized model selection than the unpenalized Cox model. The random survival forest model, a black box model, outperformed the penalized Cox model.

Keywords: Colorectal cancer, Machine learning, Penalized Cox model, Personalized prediction, Survival rate

Introduction

Colorectal cancer (CRC) is a significant global health issue: it is the third most commonly diagnosed cancer and the second leading cause of cancer death worldwide [1]. It also carries a significant economic burden [2]. Recently, there have been significant advances in cancer treatment and prevention, and early detection can greatly improve outcomes of CRC [3]. Therefore, prediction of prognosis plays a crucial role in determining the optimal treatment options for patients with CRC [4].
To develop a prognostic prediction model for CRC, it is important to measure and analyze biological markers of CRC. Reducing the number of variables can be particularly beneficial in prediction model development, because it simplifies the model, helps prevent overfitting, improves predictive power, and mitigates multicollinearity between variables. Although many studies have identified prognostic factors for CRC [5, 6], most of them have focused on average survival rates for the population with CRC as a whole, rather than examining individual survival rates [4].

The primary objective of this study is to assess the predictive performance of machine learning (ML) algorithms for individual survival beyond 5 years in patients with CRC. The evaluation uses penalized Cox regression models alongside black box models, with the aim of improving the precision of survival rate predictions.

Materials and Methods

Data Sources and Study Design

We used penalized and unpenalized Cox models to assess individuals' probability of surviving more than 5 years, based on a single-hospital database of patients with CRC that was published in 2023 [7]. To enhance the depth and reliability of our performance evaluation, we developed and evaluated 100 individual ML models, while also focusing on interpretability in order to identify significant variables. For performance evaluation we computed each individual's probability of surviving more than 5 years, and we sought ways to address the interpretability challenges that stem from the black box nature of ML. The database comprised the medical records of 1378 patients with CRC at a single university hospital between June 2006 and March 2020 [7]. Detailed information about the generation of the clinical data and covariates, including demographics, blood tests, pathological findings, and immunochemical tests, was described in our previous study [7].
This study followed the key questions for determining the appropriate model type in ML in medicine [8]. It was approved by the Institutional Review Board of the hospital (KHNMC IRB 2020-10-004); informed consent was waived because the study was a reanalysis of previously published data [7].

Preprocessing

We applied marginal screening methods based on both the log-rank test and Cox proportional hazards regression models to rapidly identify a candidate set of informative variables that may affect the survival probability of CRC patients [9, 10]. These methods evaluate the individual contribution of each predictor to death, allowing researchers to identify and select the most informative predictors [9, 10]. In this study, observations with missing values were removed rather than imputed, because imputation can distort data and potentially compromise the reliability and interpretability of results (Fig. 1).

Penalized Cox model

A Cox regression model was used to relate the explanatory variables \(x_i\) to the hazard of an event occurring at time \(t\) for the i-th of n patients. The hazard function for the i-th individual was modeled as \(h_i(t)=\exp\left(\beta^{\prime}x_i\right)h_0(t)\) [11], where \(h_0(t)\) is the baseline hazard and \(\beta\) is the vector of regression coefficients. This model allowed us to estimate the effects of the explanatory variables on the hazard of the event, which essentially determines the survival probability [11]. To identify informative variables, we considered two penalty functions: the 'least absolute shrinkage and selection operator (LASSO)' penalty [12] and the 'smoothly clipped absolute deviation (SCAD)' penalty [13, 14]. Both penalties yield sparse estimates by forcing the regression coefficients of uninformative (or weakly informative) variables to be zero. LASSO, a regression method that performs both variable selection and regularization to enhance the prediction accuracy and interpretability of the resulting statistical model, is the simplest and most popular choice among ML methods, but its estimates can be biased.
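As an illustration of how the two penalties treat a single regression coefficient, the following minimal sketch evaluates each penalty function. It is written in Python for illustration only (the analyses in this study were run in R). The function names are hypothetical, and the default a = 3.7 is the value suggested by Fan and Li [13], not a setting reported for this study.

```python
def lasso_penalty(beta, lam):
    """LASSO penalty for one coefficient: lam * |beta| (grows without bound)."""
    return lam * abs(beta)


def scad_penalty(beta, lam, a=3.7):
    """SCAD penalty for one coefficient (Fan and Li), defined for a > 2.

    It matches the LASSO penalty near zero, bends in a quadratic middle
    segment, and is constant for |beta| > a * lam, so large coefficients
    are not penalized further.
    """
    b = abs(beta)
    if b <= lam:
        return lam * b
    if b <= a * lam:
        return (2 * a * lam * b - b * b - lam * lam) / (2 * (a - 1))
    return lam * lam * (a + 1) / 2


if __name__ == "__main__":
    # Illustrative coefficient values only; lam = 1.0 is an arbitrary choice.
    for coef in (0.5, 2.0, 10.0):
        print(coef, lasso_penalty(coef, 1.0), scad_penalty(coef, 1.0))
```

Because the SCAD penalty flattens out for large coefficients, strong predictors are shrunk less than under LASSO, which is the usual explanation for its lower bias.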
The SCAD penalty is a refinement that yields a nearly unbiased solution, at the price of increased computational cost due to its nonconvexity [14]. We performed 10-fold cross-validation (CV) to select an optimal tuning parameter for the penalized Cox proportional hazards models. Tuning is essential in penalized methods to improve the generalization performance of the model. All analyses were implemented in R. We used two R packages [15, 16] for the penalized Cox models: glmnet for the LASSO-penalized Cox model and ncvreg for the SCAD-penalized Cox model.

Random Survival Forest model

Cox regression models typically assume a linear relationship between the covariates and the log hazard, and may struggle to handle nonlinear patterns effectively. In contrast, 'random survival forests (RSF)' adapt more readily to complex data because they are highly flexible in dealing with nonlinear relationships. Through bootstrap sampling, RSF train diverse trees, which enhances generalization performance and reduces overfitting. Cox regression models do not use this resampling by default; such an approach may be especially beneficial in small datasets [17]. The analysis was conducted using the randomForestSRC package in R.

Statistical Analyses

Considering the trade-off between computational time, complexity, and statistical performance in our framework [18], we determined that generating 100 models was sufficient to achieve meaningful results. To estimate the survival probability for an individual in the test set, we used the Cox regression models selected by the penalized Cox method. We established a conditional survival probability ML algorithm to validate the predictive accuracy of the survival analysis model.
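A minimal sketch of how an individual survival probability follows from a fitted Cox model: under the model, S(t | x) = exp(-H0(t) exp(β'x)), where H0(t) is the baseline cumulative hazard. The Python code below is illustrative only (the study used R). It estimates H0 with the Breslow estimator, which is an assumption here because the text does not state which baseline estimator was used. All function names, the coefficient value, and the toy data are hypothetical.

```python
import math


def breslow_cumulative_hazard(times, events, risk_scores, t):
    """Breslow estimate of the baseline cumulative hazard H0(t).

    times: follow-up times of the training patients
    events: 1 = death observed, 0 = censored
    risk_scores: exp(beta' x_i) for each training patient
    """
    event_times = sorted({tj for tj, d in zip(times, events) if d == 1 and tj <= t})
    h0 = 0.0
    for t_j in event_times:
        deaths = sum(1 for ti, d in zip(times, events) if d == 1 and ti == t_j)
        # Sum of risk scores over patients still under observation at t_j.
        at_risk = sum(r for ti, r in zip(times, risk_scores) if ti >= t_j)
        h0 += deaths / at_risk
    return h0


def survival_probability(beta, x_new, h0_t):
    """S(t | x) = exp(-H0(t) * exp(beta' x)) for one new patient."""
    linear_predictor = sum(b * x for b, x in zip(beta, x_new))
    return math.exp(-h0_t * math.exp(linear_predictor))


# Toy training data (hypothetical): one binary covariate, times in years.
beta = [0.5]
train_x = [[0.0], [1.0], [0.0], [1.0]]
train_time = [2.0, 3.0, 6.0, 4.0]
train_event = [1, 1, 0, 0]

risk = [math.exp(sum(b * x for b, x in zip(beta, row))) for row in train_x]
h0_5 = breslow_cumulative_hazard(train_time, train_event, risk, 5.0)

# Predicted probability of surviving beyond 5 years for two test patients.
surv_prob = [survival_probability(beta, row, h0_5) for row in ([0.0], [1.0])]
print(surv_prob)
```

With a positive coefficient, the patient with the covariate present receives the lower predicted survival probability, as expected from the hazard model.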
In this study, one hundred penalized Cox models were fitted on 100 training sets, and each was used to compute the conditional survival probability \(P(\text{time} > 5\ \text{years} \mid X_{ts})\) for the i-th observation in the corresponding test set, where \(X_{ts}\) denotes the test-set covariates. We obtained the index corresponding to the time point t = 5 years, and the resulting survival probabilities were stored in a vector named surv.prob. We then evaluated the performance of the one hundred penalized Cox models using surv.prob and the survival status in the test set, computing the ROC curve 100 times and averaging the AUC values. This sequence of steps can therefore be used to evaluate the goodness of fit of a model. The same procedure was applied to the unpenalized Cox models and the RSF models.

For internal validation, we created a matrix of survival probabilities and survival status for each individual. Here \(\delta_i\) denotes the death status: \(\delta_i = 1\) if the individual is deceased and \(\delta_i = 0\) if the individual survived. This was done for the test set in each of the 100 models. The prediction function takes these data as input and generates an object for evaluating the performance of the predictive model. We used the resulting pred.roc object to calculate the receiver operating characteristic (ROC) curve for each survival analysis model: from it we calculated the true positive rate at a given false positive rate for a set of survival probabilities and survival status data [19]. We generated the ROC curve for each model and calculated the mean ROC curve across all models [20].

Results

Figure 1 illustrates the overall process of the study. According to the original study, laboratory data were missing for 0.9%-34.8% of patients for each variable and pathologic data for 1.0%-12.5%. Of the 1378 patients with CRC in the previous study [7], a total of 478 were used for descriptive analysis. Finally, 475 patients were included in the model after excluding three cases with unknown values.
Clinical characteristics of CRC patients

Table 1 shows the 20 variables related to the clinical characteristics of the subset of 478 patients after the removal of missing values. In the study population, the median duration of follow-up was 1439.1 days, and 5.4% of patients died during follow-up. The mean age at diagnosis of CRC was 66.6 years. Detailed laboratory and pathological data are presented in Table 1.

Table 1. Clinical characteristics of colorectal cancer patients

Clinical and pathological characteristics     Original data            Current study data
Duration of follow-up (days), median [IQR]    1592.1 [732.8–2167.0]    1439.1 [734–1828]
Death during follow-up                        99 (7.2)                 26 (5.4)
Age at diagnosis (years)                      63.7 ± 12.0              66.6 ± 11.4
Laboratory data *
  Number of total patients                    1378                     478
  Body mass index (kg/m²)                     23.3 ± 3.6               23.31 ± 3.6
  Hemoglobin (g/dL)                           12.4 ± 2.3               12.0 ± 2.5
  Platelet count (×10³/μL)                    261.2 ± 97.2             285.4 ± 95.4
  Lymphocyte (%)                              25.8 ± 10.5              23.9 ± 10.5
  Neutrophil (%)                              63.9 ± 12.5              66.2 ± 12.7
  Albumin (mg/dL)                             4.0 ± 0.5                4.0 ± 0.5
  Carcinoembryonic antigen (ng/mL)            27.2 ± 235.9             26.7 ± 258.0
  Carbohydrate antigen 19-9 (ng/mL)           76.5 ± 571.7             71.1 ± 603.2
  C-reactive protein (mg/dL)                  2.6 ± 5.3                3.2 ± 6.0
  Neutrophil-to-lymphocyte ratio              3.7 ± 4.8                4.3 ± 4.9
  Protein-albumin ratio                       1.8 ± 0.2                1.8 ± 0.2
  Lymphocyte/C-reactive protein ratio         12324.5 ± 21866.2        9205.7 ± 17986.9
Pathologic data **
  Number of total patients                    910                      478
  T stage
    T0-1                                      143 (16.0)               62 (13.1)
    T2                                        98 (11.0)                46 (9.7)
    T3                                        512 (57.1)               288 (60.3)
    T4                                        131 (14.6)               79 (16.8)
    Unknown                                   12 (1.3)                 3 (0.1)
  N stage
    N0                                        519 (57.9)               269 (56.3)
    N1                                        217 (24.2)               127 (26.5)
    N2                                        154 (17.1)               82 (17.2)
    Unknown                                   7 (0.8)                  0 (0.0)
  pTNM stage
    Stage I                                   207 (23.0)               95 (19.9)
    Stage II                                  294 (32.6)               166 (34.8)
    Stage III                                 329 (36.5)               179 (37.5)
    Stage IV                                  60 (6.7)                 37 (7.7)
    Unknown                                   11 (1.2)                 1 (0.1)
  Number of lymph node involvement            2.0 ± 4.2                2.0 ± 4.2
  Lymphatic invasion                          261 (30.2)               159 (33.3)
  Vascular invasion                           50 (5.8)                 36 (7.5)
  Neural invasion                             80 (10.1)                53 (11.1)

Values are expressed as mean ± SD or N (%). IQR, interquartile range.
* Laboratory data were missing for 0.9%-34.8% of patients for each variable.
** Pathologic data were missing for 1.0%-12.5% of patients for each variable.
After excluding "unknown" cases, a total of 475 patients with colorectal cancer were used for the model.

LASSO and SCAD variable selection

Figure 2 shows that the number of variables selected by LASSO ranged from 1 to 14, with a mean of 7.0 ± 3.0. Across the 100 models, the most common outcome was the selection of 8 variables (19 models). The number of variables selected by SCAD (Fig. 2) ranged from 1 to 9, with a mean of 4.3 ± 2.1. Across the 100 models, the most common outcome was the selection of five variables (23 models). Figure 3 shows how often each of the 20 candidate variables was selected across the 100 models using the penalized methods; the SCAD penalty selected the same top five variables as LASSO. The five most frequently selected variables were neural invasion, number of lymph node (LN) involvement, cancer antigen 19-9 (CA19-9), lymphatic invasion, and hemoglobin (Hb).

ROC analysis

Figure 4 shows the ROC curves of the LASSO penalized model, the SCAD penalized model, the unpenalized model, and the RSF model. The LASSO penalized model had a mean AUC of 0.67 with a standard deviation (SD) of 0.06, and the SCAD penalized model had a mean AUC of 0.65 with an SD of 0.07. These results support the suitability of the LASSO and SCAD penalized Cox models for clinical variable selection. In comparison, the unpenalized model had a mean AUC of 0.64 with an SD of 0.09, the largest spread among the models. Although the unpenalized Cox model reached a level of performance similar to the LASSO and SCAD penalty methods, it is less efficient because it requires more variables.
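The mean AUC ± SD values reported in this section summarize one AUC per train/test split. A minimal sketch of that summary step is shown below, in Python for illustration only (the study used R). It makes a simplifying assumption: death status is treated as a binary label and 1 minus the predicted 5-year survival probability as the risk score, ignoring censoring before 5 years. The function names and all numbers in the usage example are hypothetical.

```python
import statistics


def auc(risk_scores, died):
    """Rank-based AUC: probability that a deceased patient has a higher
    risk score than a surviving patient (ties count as one half)."""
    pos = [r for r, d in zip(risk_scores, died) if d == 1]
    neg = [r for r, d in zip(risk_scores, died) if d == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one death and one survivor")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def summarize_aucs(auc_values):
    """Mean and sample standard deviation of per-split AUC values."""
    return statistics.mean(auc_values), statistics.stdev(auc_values)


# One hypothetical test split: predicted 5-year survival and death status.
surv_prob = [0.95, 0.60, 0.90, 0.40, 0.85]
died = [0, 1, 0, 1, 0]
split_auc = auc([1.0 - s for s in surv_prob], died)

# Hypothetical AUCs from repeated splits, summarized as mean and SD.
mean_auc, sd_auc = summarize_aucs([0.60, 0.70, 0.80])
print(split_auc, mean_auc, sd_auc)
```

Reporting the SD alongside the mean shows how much the discrimination estimate varies from one random split to another, which matters when the number of deaths in each test set is small.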
The RSF model showed the best performance, with a mean AUC of 0.71 and an SD of 0.05.

Discussion

We demonstrated that, as a prognosis prediction model for patients with CRC, the penalized Cox model is more efficient and leads to a more generalized model selection than the unpenalized Cox model, which uses all 20 variables. For variable selection, the LASSO penalty model selected 7.0 variables on average, while the SCAD penalty model selected 4.3 variables on average (Fig. 2). Among the 100 models, the LASSO method most often selected 8 variables (19 models), whereas the SCAD method most often selected 5 variables (23 models). SCAD selected fewer variables than LASSO owing to its nonconvex penalty function [13], while still maintaining comparable performance in terms of model selection and prediction accuracy. In our study, the SCAD penalty method was found to be more suitable for selecting critical biological markers for the prediction of prognosis in CRC patients. Because its penalty function is nonconvex, however, SCAD involves solving a more complex optimization problem, which may lead to higher computational costs than LASSO [13, 21]. It may therefore be less suitable for large datasets and more appropriate for relatively small ones. This study also showed that neural invasion, number of LN involvement, CA19-9, lymphatic invasion, and Hb level are the most important predictors for CRC patients, as they were the most frequently selected in the 100 model fittings with both the LASSO and SCAD methods (Fig. 3). The five prognostic variables identified in our study were also reported as important prognostic factors in previous studies [7, 22–25].
Therefore, surgical pathology findings such as neural invasion, number of LN involvement, and lymphatic invasion, together with CA19-9 and hemoglobin levels, need to be monitored more carefully for better management of CRC. Figure 4 shows that the RSF model achieved the highest performance.

One of the key strengths of our study is the rigorous model evaluation process: we ran each model 100 times to assess its performance. This approach allowed us to assess the reliability of our results and to provide a more stable estimate of the models' predictive power. Our study is also notable in that interpretability was built into the ML models, making it easier to explain their operation and outcomes. This gives a clearer understanding of how a model works and allows its results to be explained, which can lead to a higher level of trust in its predictions. Our study also differs from previous studies in that we evaluated personalized 5-year survival rates based on ML [26] rather than average survival rates [4]. Whereas other studies evaluated survival within 5 years, our research assessed survival beyond 5 years [26]. We found that the mean AUC was 0.67 for the LASSO Cox model, 0.65 for the SCAD Cox model, and 0.64 for the unpenalized Cox model. The RSF model had the highest mean AUC, at 0.71. These findings suggest that all four models, which were trained on randomly assigned subsets of patients from a specific dataset, may reasonably be applied to new datasets. Additionally, the models with fewer variables, LASSO and SCAD, performed at least as well as the unpenalized model. In the performance evaluation, the RSF model emerged as the top-performing model.

Our research has several limitations. First, we removed observations with missing values from the data, because imputation may introduce bias.
Although the sample size decreased relative to the original dataset, this gave us a complete dataset without any missing values. Second, our model was not validated on external data. However, we ran our model 100 times to assess its performance and used a rigorous model evaluation process to minimize potential bias. Even so, our model warrants validation in a large, complete external dataset. Third, our study focused only on Asians, specifically Koreans, and thus did not represent a diverse range of ethnicities. Finally, survival rate predictions need to be evaluated using a wider range of ML methods [27].

Conclusion

The RSF model was found to be the most suitable approach, as shown by a rigorous performance evaluation repeated 100 times.

Declarations

Author's contribution
Seon Hwa Lee and Jae Myung Cha conceived and designed the study. Seon Hwa Lee and Seung Jun Shin performed the statistical analysis of the data. Seon Hwa Lee and Jae Myung Cha wrote the manuscript. All authors have read and approved the manuscript.

Conflict of interest
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Disclosure of financial arrangements
Not applicable.

Ethical approval
This study was approved by the Institutional Review Board of the hospital (Permission number KHNMC IRB 2020-10-004).

References

1. Sung H, Ferlay J, Siegel RL, et al. Global cancer statistics 2020: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J Clin 2021;71:209-249.
2. Fitzmaurice C, Allen C, Barber RM, et al. Global, regional, and national cancer incidence, mortality, years of life lost, years lived with disability, and disability-adjusted life-years for 32 cancer groups, 1990 to 2015: a systematic analysis for the global burden of disease study. JAMA Oncol 2017;3:524-548.
3. Bray F, Laversanne M, Weiderpass E, Soerjomataram I.
The ever-increasing importance of cancer as a leading cause of premature death worldwide. Cancer 2021;127:3029-3030.
4. Rawla P, Sunkara T, Barsouk A. Epidemiology of colorectal cancer: incidence, mortality, survival, and risk factors. Prz Gastroenterol 2019;14:89-103.
5. Sawicki T, Ruszkowska M, Danielewicz A, Niedźwiedzka E, Arłukowicz T, Przybyłowicz KE. A review of colorectal cancer in terms of epidemiology, risk factors, development, symptoms and diagnosis. Cancers (Basel) 2021;13:2025.
6. Marley AR, Nan H. Epidemiology of colorectal cancer. Int J Mol Epidemiol Genet 2016;7:105.
7. Cho Y, Park SB, Yoon JY, Kwak MS, Cha JM. Neutrophil to lymphocyte ratio can predict overall survival in patients with stage II to III colorectal cancer. Medicine 2023;102:e33279.
8. Rajkomar A, Dean J, Kohane I. Machine learning in medicine. N Engl J Med 2019;380:1347-1358.
9. Guyon I, Elisseeff A. An introduction to variable and feature selection. J Mach Learn Res 2003;3:1157-1182.
10. Saeys Y, Inza I, Larranaga P. A review of feature selection techniques in bioinformatics. Bioinformatics 2007;23:2507-2517.
11. Cox DR. Regression Models and Life-Tables. J R Stat Soc Series B Stat Methodol 1972;34:187-202.
12. Tibshirani R. Regression Shrinkage and Selection Via the Lasso. J R Stat Soc Series B Stat Methodol 1996;58:267-288.
13. Fan J, Li R. Variable Selection via Nonconcave Penalized Likelihood and its Oracle Properties. J Amer Statistical Assoc 2001;96:1348-1360.
14. Breheny P (2016) Adaptive lasso, MCP, and SCAD. Available at: https://myweb.uiowa.edu/pbreheny/7240/s21/notes/3-03.pdf; Accessed Jan. 6, 2024.
15. Friedman J, Hastie T, Tibshirani R. Regularization Paths for Generalized Linear Models via Coordinate Descent. J Stat Softw 2010;33(1):1-22.
16. Breheny P, Huang J. Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection. Ann Appl Stat 2011;5(1):232-253.
17. Ishwaran H, Kogalur UB, Blackstone EH, Lauer MS. Random survival forests.
Ann Appl Stat 2008;2:841-860.
18. Arlot S, Celisse A. A survey of cross-validation procedures for model selection. Stat Surv 2010;4:40-79.
19. Fawcett T. An introduction to ROC analysis. Pattern Recognit Lett 2006;27:861-874.
20. Bradley AP. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognit 1997;30:1145-1159.
21. Fan J, Lv J. Sure independence screening for ultrahigh dimensional feature space. J R Stat Soc Series B Stat Methodol 2008;70:849-911.
22. Krasna MJ, Flancbaum L, Cody RP, Shneibaum S, Ben Ari G. Vascular and neural invasion in colorectal carcinoma. Incidence and prognostic significance. Cancer 1988;61:1018-1023.
23. Lei P, Ruan Y, Liu J, Zhang Q, Tang X, Wu J. Prognostic Impact of the Number of Examined Lymph Nodes in Stage II Colorectal Adenocarcinoma: A Retrospective Study. Gastroenterol Res Pract 2020;2020:8065972.
24. Liang J, Wei Y, Zhao C, Hong C. [Metastatic lymph node ratio and outcome of surgical patients with stage III colorectal cancer]. Nan Fang Yi Ke Da Xue Xue Bao 2012;32:1663-1666. [Article in Chinese]
25. Allison JE, Fraser CG, Halloran SP, Young GP. Population screening for colorectal cancer means getting FIT: the past, present, and future of colorectal cancer screening using the fecal immunochemical test for hemoglobin (FIT). Gut Liver 2014;8:117-130.
26. Susič D, Syed-Abdul S, Dovgan E, Jonnagaddala J, Gradišek A. Artificial intelligence based personalized predictive survival among colorectal cancer patients. Comput Methods Programs Biomed 2023;231:107435.
27. Wang P, Li Y, Reddy CK. Machine Learning for Survival Analysis: A Survey. ACM Comput Surv 2019;51:Article 110.

Additional Declarations
No competing interests reported.
Author affiliations
Seon Hwa Lee: Korea University
Jae Myung Cha (corresponding author): Kyung Hee University
Seung Jun Shin: Korea University

Figure legends
Figure 1. A flow diagram for the study. A total of 475 colorectal cancer patients were used for the model after excluding "unknown" cases from the baseline 1378 patients with CRC. For this model, 20 variables were analyzed out of a total of 40 variables. CRC = colorectal cancer, IBD = inflammatory bowel disease.
Figure 2. Variables in LASSO and SCAD penalty models.
Figure 3. Total number of variables selected with penalized methods. The most frequently selected variables are neural invasion, number of LN involvement, CA19-9, lymphatic invasion, and hemoglobin.
Figure 4. Mean ROC curves of the penalized (LASSO, SCAD), unpenalized, and RSF models. The bold line represents the average curve across the 100 models.
Recently, there have been significant advances in cancer treatment and prevention, and early detection can greatly improve outcomes of CRC [\u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e]. Therefore, prediction of prognosis for CRC plays a crucial role in determining the optimal treatment options for patients with CRC [\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e]. To develop prediction model of prognosis of CRC, measuring and analyzing biological markers of CRC is an important. To simplify the model, prevent overfitting, improve predictive power, and resolve issues of multicollinearity between variables, reducing the number of variables in a model can be particularly beneficial in prediction model development. Although many studies have been identified prognostic factors for CRC [\u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e, \u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e6\u003c/span\u003e], most of them have focused on average survival rates for the population with CRC as a whole, rather than examining individual survival rates [\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eThe primary objective of this study is to assess the predictive performance of machine learning (ML) algorithms for individual survival rates more than 5 years in patients with CRC. 
The evaluation uses penalized Cox regression models alongside black box models, with the aim of enhancing the precision of survival rate predictions.\u003c/p\u003e"},{"header":"Materials and Methods","content":"\u003cdiv id=\"Sec3\" class=\"Section2\"\u003e \u003ch2\u003eData Sources and Study Design\u003c/h2\u003e \u003cp\u003eWe used penalized and unpenalized Cox models to assess individuals' survival rates beyond 5 years, based on a single-hospital database of patients with CRC that was published in 2023 [\u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e]. To enhance the depth and reliability of our performance evaluation, we developed and evaluated 100 individual ML models, while also focusing on improving interpretability to identify significant variables. We calculated each individual's probability of surviving more than 5 years for performance evaluation, and sought to address the interpretability challenges stemming from the black box nature of ML. The database comprised the medical records of 1378 patients with CRC at a single university hospital between June 2006 and March 2020 [\u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e]. Detailed information about the generation of clinical data and covariates, including demographics, blood tests, pathological findings, and immunochemical tests, is described in our previous study [\u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e]. This study followed the key questions for determining the appropriate model type in ML in medicine [\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e]. 
This study was approved by the Institutional Review Board of the hospital (KHNMC IRB 2020-10-004), and informed consent was waived because the study was a reanalysis of previously published data [\u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e].\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec4\" class=\"Section2\"\u003e \u003ch2\u003ePreprocessing\u003c/h2\u003e \u003cp\u003eWe applied marginal screening methods based on both the log-rank test and Cox proportional hazards regression models to rapidly identify a candidate set of informative variables that may affect the survival probability of CRC patients [\u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e, \u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e10\u003c/span\u003e]. These methods evaluate the individual contribution of each predictor to death, allowing researchers to identify and select the most informative predictors [\u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e, \u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e10\u003c/span\u003e]. In this study, observations with missing values were removed, because imputation can distort data and potentially compromise the reliability and interpretability of results (Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e).\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec5\" class=\"Section2\"\u003e \u003ch2\u003ePenalized Cox model\u003c/h2\u003e \u003cp\u003eA Cox regression model was used to relate the explanatory variables to the hazard of an event occurring at time t for the i-th of n patients. 
The hazard function for the i-th individual was modeled as \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\({h}_{i}\\left(t\\right)=\\text{exp}\\left({\\beta }^{{\\prime }}{x}_{i}\\right){h}_{0}\\left(t\\right)\\)\u003c/span\u003e\u003c/span\u003e [\u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e]. This model allowed us to estimate the effects of the explanatory variables on the hazard of the event, which in turn determines the survival probability [\u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e]. To identify informative variables, we considered two penalty functions: the \u0026lsquo;least absolute shrinkage and selection operator (LASSO)\u0026rsquo; penalty [\u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e12\u003c/span\u003e] and the \u0026lsquo;smoothly clipped absolute deviation (SCAD)\u0026rsquo; penalty [\u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e, \u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e]. Both penalties yield sparse estimates by forcing the regression coefficients of uninformative (or weakly informative) variables to be zero. LASSO, a regression method that performs both variable selection and regularization to enhance the prediction accuracy and interpretability of the resulting statistical model, is the simplest and most popular choice among ML methods, but its estimates can be biased. The SCAD penalty is a refinement that yields a nearly unbiased solution at the price of increased computational cost due to its nonconvexity [\u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eIn this study, we performed 10-fold cross-validation (CV) to select an optimal tuning parameter for the penalized Cox proportional hazards models. Tuning is essential in penalized methods to improve the generalization performance of the model. All analyses were implemented in R. 
We employed R packages [\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e, \u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e] for the two versions of the penalized Cox model: the glmnet package for the LASSO-penalized Cox model and the ncvreg package for the SCAD-penalized Cox model.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec6\" class=\"Section2\"\u003e \u003ch2\u003eRandom Survival Forest model\u003c/h2\u003e \u003cp\u003eCox regression models typically assume a linear relationship between the covariates and the log hazard and may struggle to handle nonlinear patterns effectively. In contrast, \u0026lsquo;random survival forests (RSF)\u0026rsquo; adapt better to complex data by offering high flexibility in dealing with nonlinear relationships. Through bootstrap sampling, RSF enhance generalization performance and reduce overfitting by training diverse trees. Cox regression models lack this resampling step by default, so RSF may be particularly useful in small datasets, where such an approach can be especially beneficial [\u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e]. The analysis was conducted using the randomForestSRC package in R.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec7\" class=\"Section2\"\u003e \u003ch2\u003eStatistical Analyses\u003c/h2\u003e \u003cp\u003eConsidering the trade-off between computational time, complexity, and statistical performance given the specifics of the framework [\u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e], we determined that generating 100 models was sufficient to achieve meaningful results. To estimate the survival probability for an individual in the test set, we used the Cox regression models selected by the penalized Cox procedure. We established a conditional survival probability ML algorithm to validate the predictive accuracy of the survival analysis model. 
In this study, one hundred penalized Cox models were fitted on 100 training sets, and for each model the survival probability P(time\u0026thinsp;\u0026gt;\u0026thinsp;5 years | X\u003csub\u003ets\u003c/sub\u003e) was computed for the \u003cem\u003ei\u003c/em\u003e-th observation in the test set. We obtained the index corresponding to the time point t\u0026thinsp;=\u0026thinsp;5 years. The resulting survival probabilities were stored in a vector named surv.prob. We then evaluated the performance of the one hundred penalized Cox models using surv.prob and the survival status in the test set, computing the ROC curve 100 times and averaging the AUC values. This series of steps can therefore be used to evaluate the predictive performance of a model. The same procedure was applied to the unpenalized Cox models and the RSF models.\u003c/p\u003e \u003cp\u003eFor internal validation, we created a matrix of survival probabilities and survival status for each individual. In this model, the deltas represent the death status: if an individual is deceased, \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\({\\delta }_{i}\\)\u003c/span\u003e\u003c/span\u003e is 1, and if they survived, \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\({\\delta }_{i}\\)\u003c/span\u003e\u003c/span\u003e is 0. This was done for the test set in each of the 100 models. The prediction function takes these data as input and generates an object for evaluating the performance of the predictive model. We used the resulting pred.roc object to calculate the receiver operating characteristic (ROC) curve for a survival analysis model. Leveraging the pred.roc object, we calculated the true positive rate at a given false positive rate for a set of survival probabilities and survival status data [\u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e19\u003c/span\u003e]. 
We generated the ROC curve for each model and calculated the mean ROC curve across all models [\u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e20\u003c/span\u003e].\u003c/p\u003e \u003c/div\u003e"},{"header":"Results","content":"\u003cp\u003eFigure \u003cspan class=\"InternalRef\"\u003e1\u003c/span\u003e illustrates the overall process of the study. According to the original study, laboratory data were missing for 0.9%-34.8% of patients, depending on the variable, and pathologic data were missing for 1.0%-12.5%. Of the 1378 patients with CRC in the previous study [\u003cspan class=\"CitationRef\"\u003e7\u003c/span\u003e], 478 were used for descriptive analysis. Finally, 475 patients were included in the model after excluding three cases recorded as unknown.\u003c/p\u003e\n\u003cdiv id=\"Sec9\" class=\"Section2\"\u003e\n \u003ch2\u003eClinical characteristics of CRC patients\u003c/h2\u003e\n \u003cp\u003eTable\u0026nbsp;\u003cspan class=\"InternalRef\"\u003e1\u003c/span\u003e shows the 20 variables related to the clinical characteristics of the subset of 478 patients after the removal of missing values. For the study population, the median duration of follow-up was 1439.1 days, and 5.4% of patients died during follow-up. The mean age at diagnosis of CRC was 66.6 years. 
Detailed laboratory and pathological data are described in Table\u0026nbsp;\u003cspan class=\"InternalRef\"\u003e1\u003c/span\u003e.\u003c/p\u003e\n \u003cdiv class=\"gridtable\"\u003e\n \u003ctable id=\"Tab1\" border=\"1\"\u003e\n \u003ccaption\u003e\n \u003cdiv class=\"CaptionNumber\"\u003eTable 1\u003c/div\u003e\n \u003cdiv class=\"CaptionContent\"\u003e\n \u003cp\u003eClinical characteristics of colorectal cancer patients\u003c/p\u003e\n \u003c/div\u003e\n \u003c/caption\u003e\n \u003cthead\u003e\n \u003ctr\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eClinical and pathological characteristics\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eOriginal data\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eCurrent study data\u003c/p\u003e\n \u003c/th\u003e\n \u003c/tr\u003e\n \u003c/thead\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eDuration of follow-up (days), median [IQR]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e1592.1, [732.8\u0026ndash;2167.0]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e1439.1, [734\u0026ndash;1828]\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eDeath during follow-up\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e99 (7.2)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e26 (5.4)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eAge at diagnosis (years)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e63.7\u0026thinsp;\u0026plusmn;\u0026thinsp;12.0\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e66.6\u0026thinsp;\u0026plusmn;\u0026thinsp;11.4\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n 
\u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eLaboratory data\u003csup\u003e*\u003c/sup\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\u0026nbsp;\u003c/td\u003e\n \u003ctd align=\"left\"\u003e\u0026nbsp;\u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eNumber of total patients\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e1378\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e478\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eBody mass index (kg/ m\u003csup\u003e2\u003c/sup\u003e)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e23.3\u0026thinsp;\u0026plusmn;\u0026thinsp;3.6\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e23.31\u0026thinsp;\u0026plusmn;\u0026thinsp;3.6\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eHemoglobin (g/dL)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e12.4\u0026thinsp;\u0026plusmn;\u0026thinsp;2.3\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e12.0\u0026thinsp;\u0026plusmn;\u0026thinsp;2.5\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003ePlatelet count (\u0026times;10\u0026sup3;/\u0026mu;\u003cem\u003el\u003c/em\u003e)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e261.2\u0026thinsp;\u0026plusmn;\u0026thinsp;97.2\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e285.4\u0026thinsp;\u0026plusmn;\u0026thinsp;95.4\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eLymphocyte (%)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n 
\u003cp\u003e25.8\u0026thinsp;\u0026plusmn;\u0026thinsp;10.5\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e23.9\u0026thinsp;\u0026plusmn;\u0026thinsp;10.5\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eNeutrophil (%)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e63.9\u0026thinsp;\u0026plusmn;\u0026thinsp;12.5\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e66.2\u0026thinsp;\u0026plusmn;\u0026thinsp;12.7\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eAlbumin (g/dL)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e4.0\u0026thinsp;\u0026plusmn;\u0026thinsp;0.5\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e4.0\u0026thinsp;\u0026plusmn;\u0026thinsp;0.5\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eCarcinoembryonic antigen (ng/mL)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e27.2\u0026thinsp;\u0026plusmn;\u0026thinsp;235.9\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e26.7\u0026thinsp;\u0026plusmn;\u0026thinsp;258.0\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eCarbohydrate antigen 19-9 (ng/mL)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e76.5\u0026thinsp;\u0026plusmn;\u0026thinsp;571.7\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e71.1\u0026thinsp;\u0026plusmn;\u0026thinsp;603.2\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eC-reactive protein (mg/dL)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd 
align=\"left\"\u003e\n \u003cp\u003e2.6\u0026thinsp;\u0026plusmn;\u0026thinsp;5.3\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e3.2\u0026thinsp;\u0026plusmn;\u0026thinsp;6.0\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eNeutrophil-to-lymphocyte ratio\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e3.7\u0026thinsp;\u0026plusmn;\u0026thinsp;4.8\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e4.3\u0026thinsp;\u0026plusmn;\u0026thinsp;4.9\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eProtein-albumin ratio\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e1.8\u0026thinsp;\u0026plusmn;\u0026thinsp;0.2\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e1.8\u0026thinsp;\u0026plusmn;\u0026thinsp;0.2\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eLymphocyte/C-reactive protein ratio\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e12324.5\u0026thinsp;\u0026plusmn;\u0026thinsp;21866.2\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e9205.7\u0026thinsp;\u0026plusmn;\u0026thinsp;17986.9\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003ePathologic data\u003csup\u003e**\u003c/sup\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\u0026nbsp;\u003c/td\u003e\n \u003ctd align=\"left\"\u003e\u0026nbsp;\u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eNumber of total patients\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e910\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n 
\u003cp\u003e478\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eT stage\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\u0026nbsp;\u003c/td\u003e\n \u003ctd align=\"left\"\u003e\u0026nbsp;\u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eT0-1\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e143 (16.0)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e62 (13.1)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eT2\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e98 (11.0)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e46 (9.7)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eT3\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e512 (57.1)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e288 (60.3)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eT4\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e131 (14.6)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e79 (16.8)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eUnknown\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e12 (1.3)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e3 (0.1)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eN stage\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\u0026nbsp;\u003c/td\u003e\n \u003ctd 
align=\"left\"\u003e\u0026nbsp;\u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eN0\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e519 (57.9)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e269 (56.3)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eN1\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e217 (24.2)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e127 (26.5)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eN2\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e154 (17.1)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e82 (17.2)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eUnknown\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e7 (0.8)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0 (0.0)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003epTNM stage\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\u0026nbsp;\u003c/td\u003e\n \u003ctd align=\"left\"\u003e\u0026nbsp;\u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eStage I\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e207 (23.0)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e95 (19.9)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eStage II\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e294 (32.6)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd 
align=\"left\"\u003e\n \u003cp\u003e166 (34.8)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eStage III\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e329 (36.5)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e179 (37.5)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eStage IV\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e60 (6.7)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e37 (7.7)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eUnknown\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e11 (1.2)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e1 (0.1)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eNumber of lymph node involvement\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e2.0\u0026thinsp;\u0026plusmn;\u0026thinsp;4.2\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e2.0\u0026thinsp;\u0026plusmn;\u0026thinsp;4.2\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eLymphatic invasion\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e261 (30.2)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e159 (33.3)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eVascular invasion\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e50 (5.8)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e36 (7.5)\u003c/p\u003e\n \u003c/td\u003e\n 
\u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eNeural invasion\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e80 (10.1)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e53 (11.1)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n \u003ctfoot\u003e\n \u003ctr\u003e\n \u003ctd colspan=\"3\"\u003eValues are expressed as mean\u0026thinsp;\u0026plusmn;\u0026thinsp;SD, median [IQR], or N (%)\u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd colspan=\"3\"\u003eIQR, interquartile range\u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd colspan=\"3\"\u003eLaboratory data\u003csup\u003e*\u003c/sup\u003e were missing for 0.9%-34.8% of patients, depending on the variable, and pathologic data\u003csup\u003e**\u003c/sup\u003e were missing for 1.0%-12.5%. After excluding \u0026quot;unknown\u0026quot; cases, a total of 475 patients with colorectal cancer were used for the model.\u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tfoot\u003e\n \u003c/table\u003e\n \u003c/div\u003e\n\u003c/div\u003e\n\u003cdiv id=\"Sec10\" class=\"Section2\"\u003e\n \u003ch2\u003eLASSO and SCAD variable selection\u003c/h2\u003e\n \u003cp\u003eFigure \u003cspan class=\"InternalRef\"\u003e2\u003c/span\u003e shows that the number of variables selected by LASSO ranged from 1 to 14, with a mean of 7.0\u0026thinsp;\u0026plusmn;\u0026thinsp;3.0. The most common model size was 8 variables, which occurred in 19 of the 100 models. The number of variables selected by SCAD (Fig.\u0026nbsp;\u003cspan class=\"InternalRef\"\u003e2\u003c/span\u003e) ranged from 1 to 9, with a mean of 4.3\u0026thinsp;\u0026plusmn;\u0026thinsp;2.1. The most common model size was five variables, which occurred in 23 of the 100 models. 
Figure\u0026nbsp;\u003cspan class=\"InternalRef\"\u003e3\u003c/span\u003e shows how often each of the 20 candidate variables was selected across the 100 models by the penalized methods; the SCAD penalty selected the same top five variables as LASSO. The five most frequently selected variables were neural invasion, number of lymph node (LN) involvement, carbohydrate antigen 19-9 (CA19-9), lymphatic invasion, and hemoglobin (Hb).\u003c/p\u003e\n\u003c/div\u003e\n\u003cdiv id=\"Sec11\" class=\"Section2\"\u003e\n \u003ch2\u003eROC analysis\u003c/h2\u003e\n \u003cp\u003eFigure \u003cspan class=\"InternalRef\"\u003e4\u003c/span\u003e shows the ROC curves of the LASSO penalized model, the SCAD penalized model, the unpenalized model and the RSF model. The LASSO penalized model exhibited a mean AUC of 0.67 with a relatively small standard deviation (SD) of 0.06. Similarly, the SCAD penalized model showed a mean AUC of 0.65 with an SD of 0.07. These results suggest that the LASSO and SCAD penalized Cox models perform favorably and are suitable for clinical variable selection. In comparison, the unpenalized model demonstrated a mean AUC of 0.64 with an SD of 0.09. Although the unpenalized Cox model achieved performance similar to that of the LASSO and SCAD penalty methods, it is less efficient because it requires more variables than they do. 
The RSF model showed the best performance, with a mean AUC of 0.71 and a small SD of 0.05.\u003c/p\u003e\n\u003c/div\u003e"},{"header":"Discussion","content":"\u003cp\u003eWe demonstrated that, as a prognosis prediction model for patients with CRC, the penalized Cox model is more efficient and leads to more generalizable model selection than the unpenalized Cox model, which uses all 20 variables. For variable selection, the LASSO penalty model selected 7.0 variables on average, while the SCAD penalty model selected 4.3 variables on average (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003e). Furthermore, among the 100 models, the most common outcome for LASSO was the selection of 8 variables (19 models), whereas for SCAD it was the selection of 5 variables (23 models). It is worth noting that SCAD selected fewer variables than LASSO due to its use of a nonconvex penalty function [\u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e], while still maintaining comparable performance in terms of model selection and prediction accuracy. In our study, the SCAD penalty method was found to be more suitable for selecting critical biological markers for the prediction of prognosis in CRC patients. Because of its nonconvex penalty function, however, SCAD involves solving a more computationally complex optimization problem, which may lead to higher computational costs compared to LASSO [\u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e, \u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e]. Moreover, it may be less suitable for large datasets due to the high computational cost. Therefore, SCAD may be more appropriate for relatively small datasets. 
This study also showed that neural invasion, number of LN involvement, CA19-9, lymphatic invasion, and Hb level are the most important predictors for CRC patients, as they were the most frequently selected variables in the 100 model fittings with both the LASSO and SCAD methods (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003e). The five prognostic variables identified in our study were also reported as important prognostic factors in previous studies [\u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e, \u003cspan additionalcitationids=\"CR23 CR24\" citationid=\"CR22\" class=\"CitationRef\"\u003e22\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR25\" class=\"CitationRef\"\u003e25\u003c/span\u003e]. Therefore, surgical pathology findings, such as neural invasion, number of LN involvement, and lymphatic invasion, together with CA19-9 and hemoglobin levels, need to be monitored more carefully for better management of CRC.\u003c/p\u003e \u003cp\u003eFigure \u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003e shows that the RSF model exhibited the highest performance. One of the key strengths of our study is the rigorous model evaluation process, as we ran our model 100 times to assess its performance. This approach allowed us to establish the reliability and validity of our results and provide a more accurate estimate of the model's predictive power. Another strength is that our study addressed the interpretability of ML models, making it easier to explain their operation and outcomes. This enables a clearer understanding of how the model operates and allows its results to be explained, leading to a higher level of trust in the model's predictions. 
Our study also differs from previous studies in that we evaluated personalized 5-year survival rates based on ML [\u003cspan citationid=\"CR26\" class=\"CitationRef\"\u003e26\u003c/span\u003e] rather than average survival rates [\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e]. Whereas other studies evaluated survival within 5 years, our research assessed survival beyond 5 years [\u003cspan citationid=\"CR26\" class=\"CitationRef\"\u003e26\u003c/span\u003e]. We found that the mean AUC was 0.67 for the LASSO Cox model, 0.65 for the SCAD Cox model, and 0.64 for the unpenalized Cox model. Furthermore, the RSF model exhibited the highest mean AUC, with a value of 0.71. These findings suggest that all four models, which were originally trained on a specific dataset of randomly assigned patients, can be reasonably applied to new datasets. Additionally, models with fewer variables, such as LASSO and SCAD, demonstrated noteworthy results compared to the unpenalized model. In the performance evaluation, the RSF model emerged as the top-performing model.\u003c/p\u003e \u003cp\u003eOur research has several limitations. First, we removed observations with missing values rather than imputing them, because imputation may introduce bias. Although the sample size decreased relative to the original dataset, we obtained a complete dataset without any missing values. Second, our model was not validated on external data. However, we ran our model 100 times to assess its performance and followed a rigorous model evaluation process to minimize potential bias. Even so, our model warrants validation in a large, complete external dataset. Third, our study focused only on Asians, specifically Koreans, and thus did not represent a diverse range of ethnicities. 
Finally, survival rate predictions should be evaluated with a wider range of ML methods [\u003cspan citationid=\"CR27\" class=\"CitationRef\"\u003e27\u003c/span\u003e].\u003c/p\u003e"},{"header":"Conclusion","content":"\u003cp\u003eThe RSF model was identified as the most suitable approach, as supported by a rigorous performance evaluation repeated 100 times.\u003c/p\u003e "},{"header":"Declarations","content":"\u003cp\u003e\u003cstrong\u003eAuthors\u0026rsquo; contributions\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eSeon Hwa Lee and Jae Myung Cha conceived and designed the study. Seon Hwa Lee and Seung Jun Shin performed the statistical analysis of the data; Seon Hwa Lee and Jae Myung Cha wrote the manuscript. All authors have read and approved the manuscript.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eConflict of interest\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eDisclosure of financial arrangements\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eNot applicable.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eEthical approval\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThis study was approved by the Institutional Review Board of the hospital (Permission number KHNMC IRB 2020-10-004).\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\n\u003cli\u003eSung H, Ferlay J, Siegel RL, et al. Global cancer statistics 2020: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. \u003cem\u003eCA Cancer J Clin\u003c/em\u003e 2021;71:209-249.\u003c/li\u003e\n\u003cli\u003eFitzmaurice C, Allen C, Barber RM, et al. 
Global, regional, and national cancer incidence, mortality, years of life lost, years lived with disability, and disability-adjusted life-years for 32 cancer groups, 1990 to 2015: a systematic analysis for the global burden of disease study. \u003cem\u003eJAMA Oncol\u003c/em\u003e 2017;3:524-548.\u003c/li\u003e\n\u003cli\u003eBray F, Laversanne M, Weiderpass E, Soerjomataram I. The ever‐increasing importance of cancer as a leading cause of premature death worldwide. \u003cem\u003eCancer\u003c/em\u003e 2021;127:3029-3030.\u003c/li\u003e\n\u003cli\u003eRawla P, Sunkara T, Barsouk A. Epidemiology of colorectal cancer: incidence, mortality, survival, and risk factors. \u003cem\u003ePrz Gastroenterol\u003c/em\u003e 2019;14:89-103.\u003c/li\u003e\n\u003cli\u003eSawicki T, Ruszkowska M, Danielewicz A, Niedźwiedzka E, Arłukowicz T, Przybyłowicz KE. A review of colorectal cancer in terms of epidemiology, risk factors, development, symptoms and diagnosis. \u003cem\u003eCancers (Basel)\u003c/em\u003e 2021;13:2025.\u003c/li\u003e\n\u003cli\u003eMarley AR, Nan H. Epidemiology of colorectal cancer. \u003cem\u003eInt J Mol Epidemiol Genet\u003c/em\u003e 2016;7:105.\u003c/li\u003e\n\u003cli\u003eCho Y, Park SB, Yoon JY, Kwak MS, Cha JM. Neutrophil to lymphocyte ratio can predict overall survival in patients with stage II to III colorectal cancer. \u003cem\u003eMedicine\u003c/em\u003e 2023;102:e33279.\u003c/li\u003e\n\u003cli\u003eRajkomar A, Dean J, Kohane I. Machine learning in medicine. \u003cem\u003eN Engl J Med\u003c/em\u003e 2019;380:1347-1358.\u003c/li\u003e\n\u003cli\u003eGuyon I, Elisseeff A. An introduction to variable and feature selection. \u003cem\u003eJ Mach Learn Res\u003c/em\u003e 2003;3:1157-1182.\u003c/li\u003e\n\u003cli\u003eSaeys Y, Inza I, Larranaga P. A review of feature selection techniques in bioinformatics. \u003cem\u003eBioinformatics\u003c/em\u003e 2007;23:2507-2517.\u003c/li\u003e\n\u003cli\u003eCox DR. 
Regression Models and Life-Tables. \u003cem\u003eJ R Stat Soc Series B Stat Methodol\u003c/em\u003e 1972;34:187-202.\u003c/li\u003e\n\u003cli\u003eTibshirani R. Regression Shrinkage and Selection Via the Lasso. \u003cem\u003eJ R Stat Soc Series B Stat Methodol\u003c/em\u003e 1996;58:267-288.\u003c/li\u003e\n\u003cli\u003eFan J, Li R. Variable Selection via Nonconcave Penalized Likelihood and its Oracle Properties. \u003cem\u003eJ Am Stat Assoc\u003c/em\u003e 2001;96:1348-1360.\u003c/li\u003e\n\u003cli\u003eBreheny P (2016) Adaptive lasso, MCP, and SCAD. Available at: https://myweb.uiowa.edu/pbreheny/7240/s21/notes/3-03.pdf; Accessed Jan. 6, 2024.\u003c/li\u003e\n\u003cli\u003eFriedman J, Hastie T, Tibshirani R. Regularization Paths for Generalized Linear Models via Coordinate Descent. \u003cem\u003eJ Stat Softw\u003c/em\u003e 2010;33:1-22.\u003c/li\u003e\n\u003cli\u003eBreheny P, Huang J. Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection. \u003cem\u003eAnn Appl Stat\u003c/em\u003e 2011;5:232-253.\u003c/li\u003e\n\u003cli\u003eIshwaran H, Kogalur UB, Blackstone EH, Lauer MS. Random survival forests. \u003cem\u003eAnn Appl Stat\u003c/em\u003e 2008;2:841-860.\u003c/li\u003e\n\u003cli\u003eArlot S, Celisse A. A survey of cross-validation procedures for model selection. \u003cem\u003eStat Surv\u003c/em\u003e 2010;4:40-79.\u003c/li\u003e\n\u003cli\u003eFawcett T. An introduction to ROC analysis. \u003cem\u003ePattern Recognit Lett\u003c/em\u003e 2006;27:861-874.\u003c/li\u003e\n\u003cli\u003eBradley AP. The use of the area under the ROC curve in the evaluation of machine learning algorithms. \u003cem\u003ePattern Recognit\u003c/em\u003e 1997;30:1145-1159.\u003c/li\u003e\n\u003cli\u003eFan J, Lv J. Sure independence screening for ultrahigh dimensional feature space. 
\u003cem\u003eJ R Stat Soc Series B Stat Methodol\u003c/em\u003e 2008;70:849-911.\u003c/li\u003e\n\u003cli\u003eKrasna MJ, Flancbaum L, Cody RP, Shneibaum S, Ben Ari G. Vascular and neural invasion in colorectal carcinoma. Incidence and prognostic significance. \u003cem\u003eCancer\u003c/em\u003e 1988;61:1018-1023.\u003c/li\u003e\n\u003cli\u003eLei P, Ruan Y, Liu J, Zhang Q, Tang X, Wu J. Prognostic Impact of the Number of Examined Lymph Nodes in Stage II Colorectal Adenocarcinoma: A Retrospective Study. \u003cem\u003eGastroenterol Res Pract\u003c/em\u003e 2020;2020:8065972.\u003c/li\u003e\n\u003cli\u003eLiang J, Wei Y, Zhao C, Hong C. [Metastatic lymph node ratio and outcome of surgical patients with stage III colorectal cancer]. \u003cem\u003eNan Fang Yi Ke Da Xue Xue Bao \u003c/em\u003e2012;32:1663-1666. [Article in Chinese]\u003c/li\u003e\n\u003cli\u003eAllison JE, Fraser CG, Halloran SP, Young GP. Population screening for colorectal cancer means getting FIT: the past, present, and future of colorectal cancer screening using the fecal immunochemical test for hemoglobin (FIT). \u003cem\u003eGut Liver\u003c/em\u003e 2014;8:117-130.\u003c/li\u003e\n\u003cli\u003eSusič D, Syed-Abdul S, Dovgan E, Jonnagaddala J, Gradi\u0026scaron;ek A. Artificial intelligence based personalized predictive survival among colorectal cancer patients. \u003cem\u003eComput Methods Programs Biomed\u003c/em\u003e 2023;231:107435.\u003c/li\u003e\n\u003cli\u003eWang P, Li Y, Reddy CK. Machine Learning for Survival Analysis: A Survey. 
\u003cem\u003eACM Comput Surv\u003c/em\u003e 2019;51:Article 110.\u003c/li\u003e\n\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":false,"hideJournal":true,"highlight":"","institution":"","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true},"keywords":"Colorectal cancer, Machine Learning, Penalized cox model, Personalized prediction, Survival rate","lastPublishedDoi":"10.21203/rs.3.rs-4024382/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-4024382/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003ch2\u003eBackground\u003c/h2\u003e \u003cp\u003eThe investigation into individual survival rates within the patient population was typically conducted using the Cox proportional hazards model, with geometric black box models not being employed.\u003c/p\u003e\u003ch2\u003eAims\u003c/h2\u003e \u003cp\u003eWe aimed to evaluate the performance of machine learning algorithms in predicting survival beyond 5 years for individual patients with colorectal cancer.\u003c/p\u003e\u003ch2\u003eMethods\u003c/h2\u003e \u003cp\u003eA total of 475 patients with complete data who had undergone surgery for colorectal cancer (CRC) were analyzed to estimate each individual's survival rate beyond 5 years using 
a machine learning approach based on penalized Cox regression. For performance evaluation, we calculated each individual's survival rate beyond 5 years. The receiver operating characteristic (ROC) curves for the LASSO penalized model, the SCAD penalized model, the unpenalized model, and the RSF model were analyzed.\u003c/p\u003e\u003ch2\u003eResults\u003c/h2\u003e \u003cp\u003eThe least absolute shrinkage and selection operator (LASSO) penalized model had a mean AUC of 0.67\u0026thinsp;\u0026plusmn;\u0026thinsp;0.06, the smoothly clipped absolute deviation (SCAD) penalized model had a mean AUC of 0.65\u0026thinsp;\u0026plusmn;\u0026thinsp;0.07, and the unpenalized model had a mean AUC of 0.64\u0026thinsp;\u0026plusmn;\u0026thinsp;0.09. Notably, the random survival forest (RSF) model outperformed the others, with the highest mean AUC of 0.71\u0026thinsp;\u0026plusmn;\u0026thinsp;0.05.\u003c/p\u003e\u003ch2\u003eConclusions\u003c/h2\u003e \u003cp\u003eThe penalized Cox model is more efficient and leads to more generalizable model selection than the unpenalized Cox model as a prognosis prediction model for CRC. 
The results indicated that the random forest model, a black box model, outperformed the penalized Cox model in terms of performance.\u003c/p\u003e","manuscriptTitle":"Personalized Prediction of Survival Rate with Combination of Penalized Cox Models and Machine Learning in Patients with Colorectal Cancer","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2024-03-12 20:02:35","doi":"10.21203/rs.3.rs-4024382/v1","editorialEvents":[{"type":"communityComments","content":0}],"status":"published","journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"82ad349c-8ad5-48ae-8ffa-3a927eaad3b2","owner":[],"postedDate":"March 12th, 2024","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"posted","subjectAreas":[],"tags":[],"updatedAt":"2024-03-18T17:44:39+00:00","versionOfRecord":[],"versionCreatedAt":"2024-03-12 20:02:35","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-4024382","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-4024382","identity":"rs-4024382","version":["v1"]},"buildId":"qtupq5eGEP_6zYnWcrvyt","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}
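
The repeated evaluation reported above (a mean AUC with a ± spread over 100 runs) can be illustrated with a minimal Python sketch. This is not the authors' code. It assumes a binary outcome (1 = event within 5 years, 0 = survived beyond 5 years, ignoring censoring), a risk score that is already computed rather than refit on each split, and that the ± values are sample standard deviations, which the text does not state. The names `roc_auc`, `repeated_holdout_auc`, `test_fraction`, and `seed` are hypothetical.

```python
import random
import statistics


def roc_auc(labels, scores):
    """AUC as the probability that a random positive case has a higher
    score than a random negative case (ties count as 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def repeated_holdout_auc(labels, scores, n_repeats=100, test_fraction=0.3, seed=0):
    """Mean and sample SD of the AUC over repeated random test splits.

    Assumption: the risk score is fixed. A full analysis would refit the
    model on the complementary training split in every repeat.
    """
    if len(set(labels)) < 2:
        raise ValueError("labels must contain both classes")
    rng = random.Random(seed)
    indices = list(range(len(labels)))
    n_test = max(2, int(len(indices) * test_fraction))
    aucs = []
    while len(aucs) < n_repeats:
        test_idx = rng.sample(indices, n_test)
        y = [labels[i] for i in test_idx]
        if len(set(y)) < 2:
            continue  # skip splits that contain only one class
        s = [scores[i] for i in test_idx]
        aucs.append(roc_auc(y, s))
    return statistics.mean(aucs), statistics.stdev(aucs)


if __name__ == "__main__":
    # Synthetic data for illustration only; not taken from the study.
    rng = random.Random(1)
    labels = [rng.randint(0, 1) for _ in range(200)]
    scores = [y + rng.gauss(0, 1.5) for y in labels]
    mean_auc, sd_auc = repeated_holdout_auc(labels, scores)
    print(f"mean AUC = {mean_auc:.2f}, SD = {sd_auc:.2f}")
```

Reporting the spread alongside the mean, as the abstract does, shows how much a single random split could differ from the average.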


Source provenance

europepmc: last seen: 2026-05-20T01:45:00.602351+00:00