Breast cancer prediction modeling based on SHAP interpretability analysis and XGBoost algorithm

doi:10.21203/rs.3.rs-6124339/v1

2025 · doi:10.21203/rs.3.rs-6124339/v1

preprint OA: closed

Research Article · Research Square

Xiuliang Guan, Jiaxue Cui, Lan Bai, Xiaodan Bi, Yixuan Liu, Chong Ren, Zitong Wang, Shen Li

This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-6124339/v1
This work is licensed under a CC BY 4.0 License.
Status: Under Revision · Version 1 posted 15

Abstract

Purpose
To compare the predictive performance and risk-factor screening of the extreme gradient boosting (XGBoost) model with those of four commonly used machine learning models for breast cancer diagnosis, and to interpret the model results with SHAP interpretability analysis.
Materials and methods
Breast tumor data from the UCI public database were used. Features were screened with a heat map of the correlation coefficient matrix, and five machine learning algorithms (XGBoost, Random Forest, K-Nearest Neighbors, Decision Tree, and Support Vector Machine) were compared by precision, recall, F1 score, and accuracy. ROC curves were plotted for the five models, and confusion matrices were used to classify the prediction results; XGBoost was the best-performing model. The XGBoost, decision tree, and random forest models were then used to rank feature importance, and a SHAP interpretability analysis was performed to identify the features that most influence the prediction of breast cancer.

Results
The ROC analysis showed a test-set accuracy of 97.4% for the XGBoost model, 91.2% for the decision tree model, 95.6% for the random forest model, 94.7% for the K-nearest neighbors model, and 92.1% for the support vector machine model. The confusion matrices gave an accuracy of 97.3% for the XGBoost model, 89.5% for the decision tree model, 95.6% for the random forest model, 94.7% for the K-nearest neighbors model, and 92.1% for the support vector machine model. In the feature importance scores of the three tree-based models, the most important feature was radius-worst. The SHAP results showed that the main drivers for high-risk patients were radius-worst, concave points-worst, and concavity-worst, and that radius-worst interacted with concave points-worst.

Conclusions
The XGBoost model is more accurate than the traditional machine learning models it was compared with. Radius-worst is an important factor in breast cancer occurrence, and it interacts with concave points-worst.
Keywords: breast cancer, XGBoost, prediction, SHAP, machine learning

Introduction

Breast cancer is the most common malignancy in women worldwide and the second leading cause of cancer death after lung cancer [1]. The American Cancer Society estimated that 313,510 people in the United States would be diagnosed with breast cancer and 42,780 would die of it in 2024 [2–3]. Studies have shown that the incidence of breast cancer in China is growing rapidly [4–5]. Breast cancer accounts for one in four cancer cases and one in six cancer deaths in women, and it ranks first in incidence and mortality in most countries worldwide. An estimated 2.26 million new cases and about 685,000 deaths occurred worldwide in 2020 [5]. Given its increasing prevalence, especially in developing countries such as China, there is an urgent need to improve diagnostic, therapeutic, and prognostic strategies for breast cancer [6]. In response to the growing burden of breast cancer, especially in countries in transition, the World Health Organization (WHO) launched the Global Breast Cancer Initiative (GBCI) in 2021. The initiative aims to reduce breast cancer mortality by improving the accuracy of early diagnosis and raising the proportion of individuals diagnosed at stage I or II to more than 60% [7]. Tissue biopsy and immunohistochemistry are currently the gold standard for breast cancer diagnosis, but they are invasive and do not assess tumor molecular heterogeneity [8]. Typical early symptoms of breast cancer are mainly local, such as breast lumps, breast skin abnormalities, nipple and areola abnormalities, and nipple discharge [9]. Most breast cancer patients have no obvious early symptoms, so the disease is easily overlooked.
Therefore, early diagnosis of breast cancer is of great significance for the effectiveness of subsequent treatment. In the current era of big data, the healthcare industry holds a huge amount of data, and traditional healthcare is gradually transforming into more efficient smart healthcare. At the same time, machine learning, data mining, and statistical methods are flourishing [10–12]. Extreme Gradient Boosting (XGBoost) is an ensemble learning algorithm based on classification trees [13] that iteratively combines a set of less accurate classifiers into a more accurate one. It is characterized by fast running speed, accurate training results, and relaxed data requirements [14], and its advantages are stronger generalization ability, higher scalability, and faster computation [15].

In this study, we compare the XGBoost model with four commonly used machine learning models (decision tree, random forest, K-nearest neighbors, and support vector machine) in terms of predictive performance for breast cancer diagnosis and screening of breast tumor risk factors.

As machine learning methods are applied to medicine [16], the black-box nature of the algorithms limits the trust of patients and clinicians and hinders their use in clinical settings [17–18]. SHAP (SHapley Additive exPlanations), a game-theoretic method for interpreting individual predictions based on Shapley values [19–20], has been widely used to interpret various algorithms. It opens the black box and provides interpretable, visualizable clinical predictions from ML models, and it has led to notable advances in breast cancer survival analysis [21].
In this paper, we use SHAP to explain the specific contribution of each feature to the prediction, which helps us understand feature importance in depth and build more accurate and reliable prediction models for breast cancer diagnosis. In smart healthcare, such models can assist medical diagnosis and speed up the early diagnosis of breast cancer.

Methods

1.1 Data source
The dataset used in this study is the Breast Cancer Wisconsin (Diagnostic) dataset, which is publicly available from the University of California, Irvine (UCI) Machine Learning Repository [22]. It contains characteristics of cell nuclei extracted from breast masses sampled by fine needle aspiration (FNA), a common diagnostic modality in oncology. Clinical samples for this dataset were collected from January 1989 through November 1991. The data were collected from a variety of clinical sites, including the United States and the United Kingdom. Relevant features in the digitized images of the FNA samples were extracted by the method described in references [23–25]. There were 569 cases and 10 features: radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry, and fractal dimension. For each feature, the mean value and the worst value were used, giving 20 feature factors in total.

1.2 Data screening
Feature correlation was analyzed visually with a heat map of the correlation coefficient matrix. From each group of features with pairwise correlation above 0.9, one representative was kept and the rest were removed to protect the generalization ability of the subsequent models. In total, 5 features were removed and 15 feature factors were retained.
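The screening rule in section 1.2 can be sketched as follows. This is a minimal pure-Python illustration rather than the authors' code: the function names `pearson` and `drop_correlated`, the greedy rule of keeping the first member of each correlated group, and the toy values are all assumptions made for the example. The paper states only that one representative was kept from each group of features with correlation above 0.9.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)  # undefined for constant columns


def drop_correlated(columns, threshold=0.9):
    """Greedily keep a feature only if it is not highly correlated
    with any feature that has already been kept.

    columns: dict mapping feature name -> list of values; insertion order
    decides which member of a correlated group survives (an assumption).
    """
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical toy values, not taken from the dataset.
toy = {
    "radius_mean": [10.0, 12.0, 14.0, 16.0, 18.0],
    "perimeter_mean": [63.0, 75.5, 88.0, 100.4, 113.1],  # nearly 2*pi*radius
    "texture_mean": [20.0, 15.0, 22.0, 14.0, 19.0],
}
selected = drop_correlated(toy)
print(selected)  # ['radius_mean', 'texture_mean']
```

Because perimeter is almost a linear function of radius, it is dropped, while texture is only weakly correlated with radius and is kept. Which member of a group is retained depends on column order in this sketch; a real analysis might instead keep the member that is easiest to interpret clinically.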
1.3 XGBoost model
XGBoost is an improved boosted tree model and one of the boosting algorithms. Boosting is an ensemble learning approach in which a series of dependent weak classifiers are combined, with different weights, into a strong classifier. The core idea of the algorithm is to combine many tree models into a very strong classifier. XGBoost has been widely used in fields such as risk factor screening, statistical learning, and artificial intelligence [26–28]. During training, XGBoost repeatedly adds tree models, each fitted to the residuals of the previous prediction. The trained model consists of many trees; each sample is routed to one leaf per tree according to its feature values, and the leaf scores are summed to give the prediction. In this paper, we build an XGBoost prediction model on top of scikit-learn, a machine learning framework in Python, and use GridSearchCV to tune the hyperparameters and obtain the best-performing prediction model.

1.4 Comparison of XGBoost model and commonly used machine learning models
We compared the precision, recall, and F1 score of the models and used receiver operating characteristic (ROC) curves to assess how well XGBoost and the commonly used machine learning models distinguish benign from malignant breast tumors. The area under the curve (AUC) was used to quantify predictive ability: the larger the AUC, the stronger the predictive ability.
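The meaning of AUC in section 1.4 can be made concrete with a short sketch. It relies on the standard interpretation of AUC as the probability that a randomly chosen positive (malignant) case receives a higher score than a randomly chosen negative (benign) case, with ties counted as one half. The function name `roc_auc` and the labels and scores below are hypothetical and are not results from this study.

```python
def roc_auc(labels, scores):
    """AUC computed as the fraction of (positive, negative) pairs in which
    the positive case is scored higher; ties contribute one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical predicted probabilities of malignancy (1 = malignant).
labels = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.35, 0.4, 0.3, 0.2, 0.1, 0.7]
auc = roc_auc(labels, scores)
print(auc)  # 0.9375
```

Of the 16 positive-negative pairs, 15 are ordered correctly, so the AUC is 15/16. An AUC of 1.0 means every malignant case is ranked above every benign case, and 0.5 corresponds to random ranking. The pairwise loop is quadratic and is meant only to show the definition; library implementations use sorting instead.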
The decision boundaries were then visualized, the confusion-matrix accuracies were compared, and the three models with the best predictive ability were selected. Their feature importance rankings were plotted and compared to derive the top five breast tumor risk features, and the XGBoost decision tree was drawn. The technical architecture is shown in Fig. 1.

1.5 Interpretability of the model
SHapley Additive exPlanations (SHAP) was used to investigate the key factors that influence the prediction process. Using SHAP visualizations to explain the specific contribution of each feature to the prediction helps us understand the importance of the features.

1.6 Statistical analysis
We used receiver operating characteristic (ROC) curve analysis and the area under the curve (AUC), together with precision, recall, and F1 score, to comprehensively assess the discriminative ability of the models. All statistical analyses were performed in Python (v 3.7.1).

Results

2.1 Basic situation
A total of 569 cases were included in the study: 212 patients with malignant tumors (37.3%) and 357 patients with benign tumors (62.7%). The data were split 80% for training and 20% for validation, giving 455 cases in the training set and 114 cases in the validation set.

2.2 Feature selection
A correlation coefficient matrix heat map visualizes the correlation coefficients between features. We used it to show the correlation between features; the larger the coefficient in the map, the more strongly two features are correlated.
After visualizing the correlation coefficients, the feature set was reduced with reference to the heat map: five features with correlation coefficients greater than 0.9 were removed, and the final 15 feature factors included in the study were radius mean, texture mean, smoothness mean, compactness mean, concavity mean, symmetry mean, fractal dimension mean, radius worst, texture worst, smoothness worst, compactness worst, concavity worst, concave points worst, symmetry worst, and fractal dimension worst (Fig. 2A).

2.3 Comparison of predictive efficacy of XGBoost model and commonly used machine learning models
With benign versus malignant breast tumor as the dependent variable and the 15 feature factors above as the independent variables, the training and test data were fed to XGBoost and the four commonly used machine learning models, and the ROC curves of the five models were plotted (Fig. 2B). Test-set accuracy was 97.4% for the XGBoost model, 91.2% for the decision tree model, 95.6% for the random forest model, 94.7% for the K-nearest neighbors model, and 92.1% for the support vector machine model. The XGBoost model had the highest AUC and the strongest ability to predict whether a tumor was benign or malignant, followed by the random forest and K-nearest neighbors models (Table 1).
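The accuracy percentages above follow directly from counts of correct predictions on the 114 test cases. The sketch below shows how accuracy, precision, recall, and F1 are derived from the four cells of a binary confusion matrix. The function name `classification_metrics` and the individual cell counts are hypothetical (they are not read from the paper's confusion matrices); only the arithmetic fact that 111 correct predictions out of 114 gives 97.368% is tied to the reported figure.

```python
def classification_metrics(tp, fp, fn, tn):
    """Standard binary metrics, treating 'malignant' as the positive class."""
    precision = tp / (tp + fp)          # share of predicted malignant that are malignant
    recall = tp / (tp + fn)             # share of malignant cases that are detected
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Hypothetical cell counts that sum to 114 with 111 correct predictions.
m = classification_metrics(tp=41, fp=1, fn=2, tn=70)
print(round(m["accuracy"] * 100, 3))  # 97.368

# Accuracy values reachable with 114 test cases are multiples of 1/114.
for correct in (111, 109, 108, 105, 104, 102):
    print(correct, round(correct / 114 * 100, 3))
```

The loop prints 97.368, 95.614, 94.737, 92.105, 91.228, and 89.474 percent. Whether the per-model precision and recall in Table 1 are per-class, macro-averaged, or weighted is not stated in the text, so this sketch makes no claim about how those columns were computed.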
Table 1 Comparison of five models in predicting benign and malignant breast tumors

            Training set                  Test set
Algorithm   Precision  Recall  F1         Precision  Recall  F1     Accuracy
SVM         0.93       0.93    0.93       0.91       0.90    0.91   92.105%
KNN         0.95       0.95    0.95       0.94       0.94    0.94   94.736%
DT          1.00       1.00    1.00       0.89       0.93    0.90   91.228%
RF          1.00       1.00    1.00       0.94       0.96    0.95   95.614%
XGB         1.00       1.00    1.00       0.97       0.97    0.97   97.368%

(SVM: Support Vector Machine; KNN: K-Nearest Neighbors; DT: Decision Tree; RF: Random Forest; XGB: XGBoost)

2.4 Evaluation of model prediction performance
To compare the performance of the XGBoost model with the commonly used machine learning models, parameters were optimized before model testing, and the predictions were classified with a confusion matrix whose four cells correspond to true malignant, false malignant, true benign, and false benign. The distributions for the different models are shown in Fig. 2C. Accuracy read from the confusion matrices was 97.3% for the XGBoost model, 89.5% for the decision tree model, 95.6% for the random forest model, 94.7% for the K-nearest neighbors model, and 92.1% for the support vector machine model. Decision boundary visualization was used to show the class assigned to each data point after the feature space was reduced in dimension; the decision boundaries of the five models are shown in Fig. 2D.

2.5 Feature importance scoring
For the 15 feature factors included, the three models with the best predictive performance (XGBoost, decision tree, and random forest) were used to score feature importance, and a ranking of the 15 features was obtained for each. The top-ranked features of the XGBoost model were radius worst, radius mean, and concave points worst.
The top five features of the decision tree model were radius worst, concave points worst, texture worst, texture mean, and fractal dimension worst. The top five features of the random forest model were radius worst, concave points worst, concavity mean, radius mean, and texture worst. In all three models the most important feature was radius worst, which is consistent with clinical diagnostic experience and indicates good model accuracy.

2.6 Model interpretability analysis
SHAP analysis was used to explain the XGBoost model by quantifying the contribution of individual features. Features were ranked by their mean SHAP value. Notably, the mean and worst values of concave points, along with area, texture, and concave worst, were the top five key determinants. To show the cumulative effect of each feature, we constructed a summary plot of the SHAP values, which gives a comprehensive view of how each feature contributes to individual patient predictions. Elevated levels of these five features were associated with an elevated risk of breast cancer. In the plot, red represents a positive correlation and blue a negative correlation (Fig. 4A). The waterfall plot shows the contribution of each feature to the predicted outcome, explaining how each feature drives the prediction of benign versus malignant tumors in the dataset (Fig. 4B). The force plot function in the SHAP package was used to determine the overall SHAP value for each patient.
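The quantity that the waterfall and force plots display, a base value plus one additive contribution per feature, can be illustrated with an exact Shapley computation on a tiny model. This is a simplified sketch under stated assumptions, not the SHAP package's tree algorithm: the three-feature `model`, the instance `x`, the all-zero `baseline` used to represent an absent feature, and the function names are all hypothetical, and exhaustive enumeration is feasible only for a handful of features.

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, n_features):
    """Exact Shapley values by enumerating every coalition of features."""
    players = range(n_features)
    phi = [0.0] * n_features
    for i in players:
        others = [j for j in players if j != i]
        for size in range(len(others) + 1):
            # Weight of a coalition of this size in the Shapley average.
            weight = (factorial(size) * factorial(n_features - size - 1)
                      / factorial(n_features))
            for subset in combinations(others, size):
                s = frozenset(subset)
                phi[i] += weight * (value_fn(s | {i}) - value_fn(s))
    return phi


# Hypothetical toy model with an interaction between features 0 and 1.
def model(z):
    return 1.0 + 2.0 * z[0] + 0.5 * z[2] + 1.5 * z[0] * z[1]


x = [2.0, 1.0, 3.0]         # instance to explain
baseline = [0.0, 0.0, 0.0]  # stand-in values for "feature absent" (assumption)


def value_fn(coalition):
    """Model output when only the features in `coalition` take their real values."""
    z = [x[j] if j in coalition else baseline[j] for j in range(len(x))]
    return model(z)


phi = shapley_values(value_fn, len(x))
base_value = value_fn(frozenset())
print(base_value, phi, model(x))
```

The contributions are 5.5, 1.5, and 1.5, and together with the base value of 1.0 they add up exactly to the model output of 9.5, which is the additivity property a force plot visualizes. The interaction term of 3.0 is split equally between features 0 and 1, which is the intuition behind reporting main effects and pairwise interaction effects separately.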
For each patient, the promoting contributions (red, raising the odds of breast cancer) and the suppressing contributions (blue, lowering the odds of breast cancer) combine into an overall SHAP value, which is the model output for breast cancer. For the patient shown, the overall model output was 5.42; area, concave points, texture, and compactness were strong drivers of the predicted probability, with concave points playing the dominant role (Fig. 4C). The force plot was then rotated 90° counterclockwise and the process was repeated for the 569 patients in the dataset, giving a global plot in which patients with similar combinations of risk factors are clustered together (Fig. 4D). The figure shows the common characteristics of the subgroups of patients with high (red) or low (blue) predicted probabilities. As the figure shows, the main drivers for high-risk patients were radius-worst, concave points-worst, and concavity-worst.

Tree-based models allow fast and exact computation of pairwise interaction effects: for each prediction a matrix is returned in which the main effects lie on the diagonal and the interaction effects lie off the diagonal. The results show that radius-worst interacts with concave points-worst (Fig. 5).

Discussion

In this study, machine learning models were used to screen 15 feature factors related to whether a breast tumor is benign or malignant and to predict their relationship with breast cancer occurrence. The feature most strongly associated with breast cancer occurrence was the worst value of tumor radius. This is consistent with clinical experience that the typical early symptoms of breast cancer are mainly local, such as breast lumps, breast skin abnormalities, nipple and areola abnormalities, and nipple discharge [29].
The size of a breast lump is closely related to the occurrence of breast cancer, and anyone who finds a breast lump should seek medical advice as early as possible. Comparing the models, we found that the best-performing model was XGBoost, with an accuracy of 97.368%. Its advantages are that the algorithm is highly extensible, the base classifiers can be replaced, it tolerates outliers well, and its generalization ability is improved. In summary, the XGBoost model is suitable for predictive analysis of breast cancer, and its predictive ability was better than that of the four commonly used machine learning models in both the training and validation groups.

Previous studies have found that machine learning models have higher diagnostic power than traditional Cox regression models [30–31]. However, the results of machine learning applications are often unexplained or unobservable, which casts doubt on the generalizability of such algorithms for predicting disease occurrence in a clinical setting. We therefore computed SHAP values to interpret and visualize the predictions of the machine learning model. Similarly, Moncada-Torres and coworkers proposed using SHAP values to interpret ML models that predict breast cancer survival [21]. These studies provide theoretical references and a practical basis for related research. The SHAP results show that radius-worst is the most important factor in breast cancer occurrence and is positively associated with it. Its interaction with concave points-worst deserves further research.

Because the data in this study were obtained from a public database, and because of the specificity of the indicators, it was not possible to match hospital-collected data to the public data. The model therefore lacks external validation.
This study has some limitations. The sample comprises only 569 cases, so all five algorithms ran quickly and their running times could not be meaningfully compared. Further validation with multicenter and large-scale data is needed. This study raises new possibilities for replacing invasive testing methods with imaging in the future. We will also consider laboratory studies and prospective experimental studies.

Conclusion

In summary, we built a well-performing machine learning model for predicting the occurrence of breast cancer, identified the worst value of tumor radius as a high-risk factor for breast cancer, and built an interpretable model.

Declarations

Clinical trial number
Not applicable.

Funding
Dalian Medical Science Research Program (2311001)

Availability of data and materials
In this manuscript we use de-identified data from a public repository. The data are included on the BMC Med Res Method website. As such, ethical approval was not required.

Acknowledgments
We acknowledge and thank the investigators, scientists, and developers who have contributed to the scientific community by making their data, code, and software freely available.

Ethics approval and consent to participate
In this manuscript we use de-identified data from a public repository. The data are included on the BMC Med Res Method website. As such, ethical approval was not required.

Competing interests
The authors report no competing interests relating to this work.

References

1. Gradishar WJ, et al. Breast Cancer, Version 3.2024, NCCN Clinical Practice Guidelines in Oncology. Journal of the National Comprehensive Cancer Network. 2024;22(5):331-357. doi:10.6004/jnccn.2024.0035
2. Siegel RL, et al. Cancer statistics, 2022. CA: A Cancer Journal for Clinicians. 2022;72(1):7-33. doi:10.3322/caac.21708
3. Siegel RL, et al. Cancer statistics, 2024. CA: A Cancer Journal for Clinicians. 2024;74(1):12-49. doi:10.3322/caac.21820
4. Jiang YZ, et al. Integrated multiomic profiling of breast cancer in the Chinese population reveals patient stratification and therapeutic vulnerabilities. Nature Cancer. 2024;5(4):673-690. doi:10.1038/s43018-024-00725-0
5. Sung H, et al. Global Cancer Statistics 2020: GLOBOCAN Estimates of Incidence and Mortality Worldwide for 36 Cancers in 185 Countries. CA: A Cancer Journal for Clinicians. 2021;71(3):209-249. doi:10.3322/caac.21660
6. Li X, Li X, Yang B, Sun S, Wang S, Yu F, Wang T. Enhancing breast cancer outcomes with machine learning-driven glutamine metabolic reprogramming signature. Front Immunol. 2024;15:1369289. doi:10.3389/fimmu.2024.1369289. PMID: 38756785; PMCID: PMC11097668.
7. Anderson BO, Ilbawi AM, Fidarova E, et al. The Global Breast Cancer Initiative: a strategic collaboration to strengthen health care for non-communicable diseases. Lancet Oncol. 2021;22(5):578-581. doi:10.1016/S1470-2045(21)00071-1
8. Freitas AJA, et al. Liquid Biopsy as a Tool for the Diagnosis, Treatment, and Monitoring of Breast Cancer. International Journal of Molecular Sciences. 2022;23(17):9952. doi:10.3390/ijms23179952
9. Gradishar WJ, et al. Breast Cancer, Version 3.2024, NCCN Clinical Practice Guidelines in Oncology. Journal of the National Comprehensive Cancer Network. 2024;22(5):331-357. doi:10.6004/jnccn.2024.0035
10. Lai J, et al. A radiogenomic multimodal and whole-transcriptome sequencing for preoperative prediction of axillary lymph node metastasis and drug therapeutic response in breast cancer: a retrospective, machine learning and international multicohort study. International Journal of Surgery (London, England). 2024;110(4):2162-2177. doi:10.1097/JS9.0000000000001082
11. Han X, et al. Development of a machine learning-based radiomics signature for estimating breast cancer TME phenotypes and predicting anti-PD-1/PD-L1 immunotherapy response. Breast Cancer Research. 2024;26(1):18. doi:10.1186/s13058-024-01776-y
12. Zhou S, et al. Breast Cancer Prediction Based on Multiple Machine Learning Algorithms. Technology in Cancer Research & Treatment. 2024;23:15330338241234791. doi:10.1177/15330338241234791
13. Chen TQ, Guestrin C. XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. San Francisco, California, USA. New York, NY, USA: ACM; 2016:785-794.
14. Guan X, Zhang B, Fu M, et al. Clinical and inflammatory features based machine learning model for fatal risk prediction of hospitalized COVID-19 patients: results from a retrospective cohort study. Ann Med. 2021;53(1):257-266.
15. Christopher T, Brody JP. Evaluation of a genetic risk score for severity of COVID-19 using human chromosomal-scale length variation. Hum Genomics. 2020;14(1). https://pubmed.ncbi.nlm.nih.gov/33036646/. Accessed 2021-06-04. doi:10.1186/s40246-020-00288-y
16. Nensa F, Demircioglu A, Rischpler C. Artificial intelligence in nuclear medicine. J Nucl Med. 2019;60(Suppl 2):29S-37S. doi:10.2967/jnumed.118.220590
17. The Lancet Respiratory Medicine. Opening the black box of machine learning. Lancet Respir Med. 2018;6(11):801. doi:10.1016/S2213-2600(18)30425-9
18. Petch J, Di S, Nelson W. Opening the black box: the promise and limitations of explainable machine learning in cardiology. Can J Cardiol. 2022;38(2):204-213. doi:10.1016/j.cjca.2021.09.004
19. Lundberg SM, Erion G, Chen H, DeGrave A, Prutkin JM, Nair B, et al. From local explanations to global understanding with explainable AI for trees. Nat Mach Intell. 2020;2(1):56-67. doi:10.1038/s42256-019-0138-9
20. Lu W, et al. Explainable and visualizable machine learning models to predict biochemical recurrence of prostate cancer. Clinical & Translational Oncology. 2024;26(9):2369-2379. doi:10.1007/s12094-024-03480-x
21. Moncada-Torres A, van Maaren MC, Hendriks MP, Siesling S, Geleijnse G. Explainable machine learning can outperform Cox regression predictions and provide insights in breast cancer survival. Sci Rep. 2021;11(1):6968. doi:10.1038/s41598-021-86327-7
22. Lichman M. UCI Machine Learning Repository: Breast Cancer Wisconsin (Diagnostic) Data Set. 2014. http://archive.ics.uci.edu/ml. Accessed 8 Aug
23. Mangasarian OL, Street WN, Wolberg WH. Breast Cancer Diagnosis and Prognosis via Linear Programming. AAAI; 1994:83-86.
24. Wolberg WH, Mangasarian OL. Multisurface method of pattern separation for medical diagnosis applied to breast cytology. Proc Natl Acad Sci USA. 1990;87:9193-9196.
25. Bennett KP. Decision tree construction via linear programming. University of Wisconsin-Madison Department of Computer Sciences; 1992:97-101.
26. Wu P, et al. Pan-cancer characterization of cell-free immune-related miRNA identified as a robust biomarker for cancer diagnosis. Molecular Cancer. 2024;23(1):31. doi:10.1186/s12943-023-01915-7
27. Jiang Y, et al. Predicting anti-cancer drug sensitivity through WRE-XGBoost algorithm with weighted feature selection. Genes & Diseases. 2024;12(2):101275. doi:10.1016/j.gendis.2024.101275
28. Guan X, et al. Construction of the XGBoost model for early lung cancer prediction based on metabolic indices. BMC Medical Informatics and Decision Making. 2023;23(1):107. doi:10.1186/s12911-023-02171-x
29. Benitez Fuentes JD, et al. Global Stage Distribution of Breast Cancer at Diagnosis: A Systematic Review and Meta-Analysis. JAMA Oncology. 2024;10(1):71-78. doi:10.1001/jamaoncol.2023.4837
30. Kim DW, Lee S, Kwon S, Nam W, Cha IH, Kim HJ. Deep learning-based survival prediction of oral cancer patients. Sci Rep. 2019;9(1):6994. doi:10.1038/s41598-019-43372-7
31. Nicolo C, Perier C, Prague M, Bellera C, MacGrogan G, Saut O, et al. Machine learning and mechanistic modeling for prediction of metastatic relapse in early-stage breast cancer. JCO Clin Cancer Inform. 2020;4:259-274. doi:10.1200/CCI.19.00133

Additional Declarations
No competing interests reported.

Editorial history (Status: Under Revision)
Editorial decision: Revision requested, 22 Apr, 2025
Reviews received at journal, 21 Apr, 2025
Reviews received at journal, 09 Apr, 2025
Reviews received at journal, 08 Apr, 2025
Reviewers agreed at journal, 05 Apr, 2025
Reviewers agreed at journal, 30 Mar, 2025
Reviewers agreed at journal, 30 Mar, 2025
Reviewers agreed at journal, 30 Mar, 2025
Reviewers agreed at journal, 28 Mar, 2025
Reviewers agreed at journal, 28 Mar, 2025
Reviewers invited by journal, 27 Mar, 2025
Editor invited by journal, 17 Mar, 2025
Editor assigned by journal, 14 Mar, 2025
Submission checks completed at journal, 14 Mar, 2025
First submitted to journal, 27 Feb, 2025
© Research Square 2026 | ISSN 2693-5015 (online)
{"props":{"pageProps":{"initialData":{"identity":"rs-6124339","acceptedTermsAndConditions":true,"allowDirectSubmit":false,"archivedVersions":[],"articleType":"Research Article","associatedPublications":[],"authors":[{"id":440772391,"identity":"71e6443f-05a2-4f59-9184-d6da964313bb","order_by":0,"name":"Xiuliang Guan","email":"","orcid":"","institution":"Central Hospital of Dalian University of Technology","correspondingAuthor":false,"prefix":"","firstName":"Xiuliang","middleName":"","lastName":"Guan","suffix":""},{"id":440772392,"identity":"c714631c-9276-4296-bc6c-66ea41500c04","order_by":1,"name":"Jiaxue Cui","email":"","orcid":"","institution":"Dalian Medical University","correspondingAuthor":false,"prefix":"","firstName":"Jiaxue","middleName":"","lastName":"Cui","suffix":""},{"id":440772393,"identity":"9f84d8a0-fcfc-48d7-b47d-56ecaf97c094","order_by":2,"name":"Lan Bai","email":"","orcid":"","institution":"Yidu Cloud (Beijing) Technology","correspondingAuthor":false,"prefix":"","firstName":"Lan","middleName":"","lastName":"Bai","suffix":""},{"id":440772394,"identity":"9c15319a-3c12-4dc9-b3b5-17b775ace9db","order_by":3,"name":"Xiaodan Bi","email":"","orcid":"","institution":"Central Hospital of Dalian University of Technology","correspondingAuthor":false,"prefix":"","firstName":"Xiaodan","middleName":"","lastName":"Bi","suffix":""},{"id":440772395,"identity":"91a87822-ea74-4898-9a01-c3ccc3213dab","order_by":4,"name":"Yixuan Liu","email":"","orcid":"","institution":"Central Hospital of Dalian University of 
Technology","correspondingAuthor":false,"prefix":"","firstName":"Yixuan","middleName":"","lastName":"Liu","suffix":""},{"id":440772396,"identity":"63bfc38b-642b-4f2c-aa4d-8c1c3dec9a15","order_by":5,"name":"Chong Ren","email":"","orcid":"","institution":"Central Hospital of Dalian University of Technology","correspondingAuthor":false,"prefix":"","firstName":"Chong","middleName":"","lastName":"Ren","suffix":""},{"id":440772397,"identity":"5c753ae8-0ea1-423a-964e-e7a488da74b3","order_by":6,"name":"Zitong Wang","email":"","orcid":"","institution":"Central Hospital of Dalian University of Technology","correspondingAuthor":false,"prefix":"","firstName":"Zitong","middleName":"","lastName":"Wang","suffix":""},{"id":440772398,"identity":"cbc90aeb-3e49-424a-92bb-638afb84331b","order_by":7,"name":"Shen Li","email":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAZAAAAAyAQMAAABI0h/eAAAABlBMVEX///8AAABVwtN+AAAACXBIWXMAAA7EAAAOxAGVKw4bAAAA60lEQVRIiWNgGAWjYBACPmYGNiBlw8PGzHxA4gFYLAG/FjaIljQ5Pva2BIkEBgMitIARw2FjOZ4zBkRqYWd/9uDjDubENomcjzcS2/4w8LPnGDD83IHPYTzmhjPPsAG15G62SGwzYJDseWPA2HsGrxY2ad42HpCWbRIgLQY3cgyYGdvwaWF/Jv23TQLksGdgLfaEtTCYSTO2GRiz8Zxhg9giQVALj5lkb1uCHBt7m7FFwjljHokzzwoO9uLRws9//JnEz7b/PPLNzA9vfCiTk+NvT9744CceLRiAB0QcIEHDKBgFo2AUjAIsAACeyETteCxmswAAAABJRU5ErkJggg==","orcid":"","institution":"Central Hospital of Dalian University of Technology","correspondingAuthor":true,"prefix":"","firstName":"Shen","middleName":"","lastName":"Li","suffix":""}],"badges":[],"createdAt":"2025-02-28 01:08:27","currentVersionCode":1,"declarations":"","doi":"10.21203/rs.3.rs-6124339/v1","doiUrl":"https://doi.org/10.21203/rs.3.rs-6124339/v1","draftVersion":[],"editorialEvents":[],"editorialNote":"","failedWorkflow":false,"files":[{"id":80812201,"identity":"1d3c49ef-1854-454d-acc6-b3785cd938a7","added_by":"auto","created_at":"2025-04-17 10:36:27","extension":"png","order_by":1,"title":"Figure 
1","display":"","copyAsset":false,"role":"figure","size":210378,"visible":true,"origin":"","legend":"\u003cp\u003eTechnical Architecture Diagram\u003c/p\u003e","description":"","filename":"floatimage1.png","url":"https://assets-eu.researchsquare.com/files/rs-6124339/v1/9589ef4221c0f48bd812a556.png"},{"id":80812206,"identity":"35c3e9c3-b025-4447-a8f1-f6044ffcef17","added_by":"auto","created_at":"2025-04-17 10:36:27","extension":"png","order_by":2,"title":"Figure 2","display":"","copyAsset":false,"role":"figure","size":199019,"visible":true,"origin":"","legend":"\u003cp\u003e(A) Feature correlation visualization; (B) Model test set accuracy ROC; (C) Confusion matrix; and (D) Decision boundary visualization\u003c/p\u003e","description":"","filename":"floatimage2.png","url":"https://assets-eu.researchsquare.com/files/rs-6124339/v1/4374b6fa100ead3cdcadc60e.png"},{"id":80812203,"identity":"34dbd69e-caca-4641-96a9-01d57e9cef9d","added_by":"auto","created_at":"2025-04-17 10:36:27","extension":"png","order_by":3,"title":"Figure 3","display":"","copyAsset":false,"role":"figure","size":93573,"visible":true,"origin":"","legend":"\u003cp\u003eFeature importance ranking diagram. (A) XGBoost model; (B) Decision Tree model; (C) Random Forest model\u003c/p\u003e","description":"","filename":"floatimage3.png","url":"https://assets-eu.researchsquare.com/files/rs-6124339/v1/5fe93471c801e69c26f2fecf.png"},{"id":80812207,"identity":"56bbd0d4-c080-40c9-8c68-1daf2d574778","added_by":"auto","created_at":"2025-04-17 10:36:27","extension":"png","order_by":4,"title":"Figure 4","display":"","copyAsset":false,"role":"figure","size":206748,"visible":true,"origin":"","legend":"\u003cp\u003eModel visualization with SHAP values. (A) SHAP shape summary diagram, with radius-worst providing the most significant contribution to the model. (B) SHAP waterfall diagram. In the waterfall above, the x-axis has the value of the target (causal) variable, i.e., the probability. 
X represents the selected observations, and f (x) = 4.799 is the predicted probability of the model. Given inputs x and E [f (x) ] = -0.679 are the expected values of the target variable. (C) SHAP force diagram. (D) The force diagram is displayed centrally after being rotated 90 ° counterclockwise.\u003c/p\u003e","description":"","filename":"floatimage4.png","url":"https://assets-eu.researchsquare.com/files/rs-6124339/v1/49fc68845cd2abdcf1debfd0.png"},{"id":80812205,"identity":"f6c4dfed-c988-45ed-a0a5-52563cdd5466","added_by":"auto","created_at":"2025-04-17 10:36:27","extension":"png","order_by":5,"title":"Figure 5","display":"","copyAsset":false,"role":"figure","size":81766,"visible":true,"origin":"","legend":"\u003cp\u003eInteraction prediction. (A) Interaction prediction tree model. (B) Interaction effect scatterplot.\u003c/p\u003e","description":"","filename":"floatimage5.png","url":"https://assets-eu.researchsquare.com/files/rs-6124339/v1/90cdd4fd264c7888cdc7dc3f.png"},{"id":80815448,"identity":"04eb29a6-1c1b-4d27-94c0-cf5d24c5584e","added_by":"auto","created_at":"2025-04-17 11:00:32","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":1468639,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-6124339/v1/98f527bc-d9a0-4533-8c8f-bf3a76cb68e0.pdf"}],"financialInterests":"No competing interests reported.","formattedTitle":"Breast cancer prediction modeling based on SHAP interpretability analysis and XGBoost algorithm","fulltext":[{"header":"Introduction","content":"\u003cp\u003eBreast cancer is the most common malignancy in women worldwide and is the second leading cause of cancer death after lung cancer \u003csup\u003e[\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e]\u003c/sup\u003e. 
The American Cancer Society estimates that 313,510 Americans will be diagnosed with breast cancer and 42,780 will die from breast cancer in the United States in 2024 \u003csup\u003e[\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e]\u003c/sup\u003e. Research studies have shown that the incidence of breast cancer in China is growing rapidly \u003csup\u003e[\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e]\u003c/sup\u003e. Breast cancer accounts for 1/4 of cancer cases and 1/6 of cancer deaths in women, and the disease ranks first in most countries worldwide in terms of morbidity and mortality. More than 20,000 new cases of breast cancer are reported annually, resulting in 266.85\u0026nbsp;million deaths worldwide. Given its increasing prevalence, especially in developing countries such as China, there is an urgent need to improve diagnostic, therapeutic and prognostic strategies for breast cancer \u003csup\u003e[\u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e6\u003c/span\u003e]\u003c/sup\u003e. In response to the increasing burden of breast cancer, especially in countries in transition, the World Health Organization (WHO) launched the Global Breast Cancer Initiative (GBCI) in 2021, which aims to reduce breast cancer mortality by improving the accuracy of early diagnosis and increasing the proportion of individuals diagnosed with stage I or II breast cancer to more than 60% \u003csup\u003e[\u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e]\u003c/sup\u003e. 
Tissue biopsy and immunohistochemistry are currently the gold standard for breast cancer diagnosis, but they are invasive diagnostic operations, and these techniques do not assess tumor molecular heterogeneity \u003csup\u003e[\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e]\u003c/sup\u003e. Typical early symptoms of breast cancer are mainly localized, such as breast lumps, breast skin abnormalities, nipple areola abnormalities, and nipple discharge \u003csup\u003e[\u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e]\u003c/sup\u003e. Most breast cancer patients do not have obvious early symptoms, which can be easily overlooked. Therefore, early diagnosis of breast cancer patients is of great significance for the effectiveness of subsequent treatment.\u003c/p\u003e \u003cp\u003eCurrently in the era of big data, the healthcare industry possesses a huge amount of data, and traditional healthcare is gradually transforming into more efficient smart healthcare. At the same time, machine learning methods, data mining methods, and statistical methods are flourishing \u003csup\u003e[\u003cspan additionalcitationids=\"CR11\" citationid=\"CR10\" class=\"CitationRef\"\u003e10\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e12\u003c/span\u003e]\u003c/sup\u003e. Extreme Gradient Boosting (XGBoost) is an integrated learning algorithm, which is based on a classification tree model \u003csup\u003e[\u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e]\u003c/sup\u003e, and combines a set of less accurate classifiers into a more accurate classifier through an iterative computational method. Fast running speed, accurate training results and relaxed data requirements are its characteristics \u003csup\u003e[14]\u003c/sup\u003e. 
Its advantages are stronger generalization ability, higher scalability and faster computation \u003csup\u003e[15]\u003c/sup\u003e. In this study, we compare the XGBoost model with four commonly used machine learning models, namely the decision tree, random forest, k-nearest neighbors, and support vector machine models, in terms of their predictive effectiveness for breast cancer diagnosis and for breast tumor risk factor screening. As machine learning methods are applied to medicine \u003csup\u003e[\u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e16\u003c/span\u003e]\u003c/sup\u003e, the black-box nature of the algorithms limits the trust of patients and clinicians and hinders their application in clinical settings \u003csup\u003e[\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e17\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e18\u003c/span\u003e]\u003c/sup\u003e. SHAP (SHapley Additive exPlanations), a game-theoretic method for interpreting individual predictions based on Shapley values \u003csup\u003e[19\u0026ndash;20]\u003c/sup\u003e, has been widely used to interpret various algorithms, i.e., to open the black box and provide interpretable and visualizable clinical predictions from effective ML models, with significant advances in survival analysis in breast cancer \u003csup\u003e[\u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e21\u003c/span\u003e]\u003c/sup\u003e. In this paper, we use SHAP to explain the specific contribution of each feature to the prediction, which helps us understand the importance of the features and construct more accurate and reliable prediction models for breast cancer diagnosis. 
In the field of smart healthcare, this approach can assist medical diagnosis and accelerate the early diagnosis of breast cancer.\u003c/p\u003e"},{"header":"Methods","content":"\u003cp\u003e \u003cb\u003e1.1 Data Source.\u003c/b\u003e \u003c/p\u003e \u003cp\u003eThe dataset used in this study is the Breast Cancer Wisconsin Diagnostic Dataset. This dataset is publicly available from the University of California, Irvine (UCI) Machine Learning Repository \u003csup\u003e[\u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e22\u003c/span\u003e]\u003c/sup\u003e. It includes features of cell nuclei extracted from breast masses that were sampled by fine needle aspiration (FNA), a common diagnostic modality in oncology. Clinical samples for this dataset were collected from January 1989 through November 1991. The data were collected from a variety of clinical sites, including sites in the United States and the United Kingdom. Relevant features in the digitized images of the FNA samples were extracted by the method described in references \u003csup\u003e[\u003cspan additionalcitationids=\"CR24\" citationid=\"CR21\" class=\"CitationRef\"\u003e23\u003c/span\u003e–\u003cspan citationid=\"CR23\" class=\"CitationRef\"\u003e25\u003c/span\u003e]\u003c/sup\u003e. There were a total of 569 cases and 10 features, namely: radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry, and fractal dimension. Each feature is represented by its mean value and its worst value, giving 20 feature factors in total.\u003c/p\u003e \u003cdiv id=\"Sec3\" class=\"Section2\"\u003e \u003ch2\u003e1.2 Data screening\u003c/h2\u003e \u003cp\u003eFeature correlation was analyzed visually with a heat map of the correlation coefficient matrix. For each group of features with a correlation higher than 0.9, one representative was kept and the rest were eliminated to preserve the generalization ability of the subsequent model. 
Finally, 5 features were eliminated and 15 feature factors were tested.\u003c/p\u003e \u003c/div\u003e\n\u003ch3\u003e1.3 XGBoost model\u003c/h3\u003e\n\u003cp\u003eXGBoost is an improved boosted tree model and one of the Boosting algorithms. Boosting is an ensemble learning approach in which a series of dependent weak classifiers are combined, with different weights, into a strong classifier. The core idea of XGBoost is to combine many tree models into a very strong classifier. Currently, XGBoost is widely used in fields such as risk factor screening, statistical learning and artificial intelligence \u003csup\u003e[\u003cspan additionalcitationids=\"CR27\" citationid=\"CR24\" class=\"CitationRef\"\u003e26\u003c/span\u003e–\u003cspan citationid=\"CR26\" class=\"CitationRef\"\u003e28\u003c/span\u003e]\u003c/sup\u003e. XGBoost is trained by continuously adding tree models, each of which fits the residuals of the previous prediction. 
The model at the end of training contains a number of tree models whose leaf nodes correspond to sample features and generate scores.\u003c/p\u003e \u003cp\u003eIn this paper, we build an XGBoost prediction model based on scikit-learn, a machine learning framework in Python, and use GridSearchCV to tune the hyperparameters of the model, finally obtaining the prediction model with the best performance.\u003c/p\u003e\n\u003ch3\u003e1.4 Comparison of XGBoost model and commonly used machine learning models\u003c/h3\u003e\n\u003cp\u003eWe compared the precision, recall, and F1 score of the models, and used receiver operating characteristic (ROC) curves to analyze how well the XGBoost model and the commonly used machine learning models distinguish benign from malignant breast tumors. The area under the curve (AUC) was used to measure predictive ability: the larger the AUC value, the stronger the predictive ability. Subsequently, the decision boundaries were visualized, the accuracy confusion matrices were compared, the three models with the best predictive ability were selected, their feature importance rankings were plotted and compared, the top 5 risk-factor features for breast tumors were derived, and the XGBoost decision tree was drawn. The technical architecture is shown in Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e\n\u003ch3\u003e1.5 Interpretability of the model\u003c/h3\u003e\n\u003cp\u003eSHapley Additive exPlanations (SHAP) was used to investigate the key factors that influence the prediction process. 
Using SHAP visualizations to explain the specific contribution of each feature to the prediction helps us gain insight into the importance of the features.\u003c/p\u003e\n\u003ch3\u003e1.6 Statistical analysis\u003c/h3\u003e\n\u003cp\u003eWe used receiver operating characteristic (ROC) curve analysis and the area under the curve (AUC), as well as precision, recall, and F1 score, to comprehensively assess the discriminative ability of the model. All statistical analyses were performed using Python (v 3.7.1).\u003c/p\u003e"},{"header":"Results","content":"\u003ch2\u003e2.1 Basic situation\u003c/h2\u003e\n\u003cp\u003eA total of 569 cases were included in the study: 212 patients with malignant tumors (37.3%) and 357 patients with benign tumors (62.7%). The data were split into a training set (80%) and a validation set (20%), giving 455 cases in the training set and 114 cases in the validation set.\u003c/p\u003e\n\u003ch3\u003e2.2 Feature selection\u003c/h3\u003e\n\u003cp\u003eA correlation coefficient matrix heat map visualizes the correlation coefficients between features; the larger the coefficient in the map, the more strongly the features are correlated. After visualizing the correlation coefficients, we reduced the feature set with reference to the heat map: five features with correlation coefficients greater than 0.9 were deleted, and the final 15 feature factors included in the study were radius mean, texture mean, smoothness mean, compactness mean, concavity mean, symmetry mean, fractal dimension mean, radius worst, texture worst, smoothness worst, compactness worst, concavity worst, concave points worst, symmetry worst, and fractal dimension worst. 
See Fig.\u0026nbsp;\u003cspan class=\"InternalRef\"\u003e2\u003c/span\u003eA.\u003c/p\u003e\n\u003ch2\u003e2.3 Comparison of predictive efficacy of XGBoost model and commonly used machine learning models\u003c/h2\u003e\n\u003cp\u003eTaking benign versus malignant breast tumor as the dependent variable and the 15 feature factors above as the independent variables, the training and test data were fed into XGBoost and the four commonly used machine learning models, and the ROC curves of the five models were plotted, see Fig. \u003cspan class=\"InternalRef\"\u003e2\u003c/span\u003eB. On the test set, the accuracy was 97.4% for the XGBoost model, 91.2% for the decision tree model, 95.6% for the random forest model, 94.7% for the K-Nearest Neighbors model, and 92.1% for the support vector machine model. The XGBoost model predicted benign versus malignant tumors with the highest AUC value and the strongest predictive ability, followed by the random forest model and the K-Nearest Neighbors model. 
See Table \u003cspan class=\"InternalRef\"\u003e1\u003c/span\u003e.\u003c/p\u003e\n\u003ctable id=\"Tab1\" border=\"1\"\u003e\n \u003ccaption language=\"En\"\u003e\n \u003cdiv class=\"CaptionNumber\"\u003eTable 1\u003c/div\u003e\n \u003cdiv class=\"CaptionContent\"\u003e\n \u003cp\u003eComparison of five models in predicting benign and malignant breast tumors\u003c/p\u003e\n \u003c/div\u003e\n \u003c/caption\u003e\n \u003cthead\u003e\n \u003ctr\u003e\n \u003cth align=\"left\"\u003e\u0026nbsp;\u003c/th\u003e\n \u003cth align=\"left\" colspan=\"3\"\u003e\n \u003cp\u003eTraining set\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\" colspan=\"4\"\u003e\n \u003cp\u003eTest set\u003c/p\u003e\n \u003c/th\u003e\n \u003c/tr\u003e\n \u003c/thead\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eAlgorithm\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003ePrecision\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eRecall\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eF1\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003ePrecision\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eRecall\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eF1\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eAccuracy\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eSVM\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.93\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.93\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.93\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.91\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.90\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd 
align=\"left\"\u003e\n \u003cp\u003e0.91\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e92.105%\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eKNN\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.95\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.95\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.95\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.94\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.94\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.94\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e94.736%\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eDT\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e1.00\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e1.00\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e1.00\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.89\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.93\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.90\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e91.228%\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eRF\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e1.00\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e1.00\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e1.00\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n 
\u003cp\u003e0.94\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.96\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.95\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e95.614%\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eXGB\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e1.00\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e1.00\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e1.00\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.97\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.97\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.97\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e97.368%\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n\u003c/table\u003e\n\u003cp\u003e(Support Vector Machine: SVM; K-Nearest Neighbors: KNN; Decision Tree: DT; Random Forest: RF; XGBoost: XGB)\u003c/p\u003e\n\u003ch2\u003e2.4 Evaluation of model prediction performance\u003c/h2\u003e\n\u003cp\u003eTo compare the performance of the XGBoost model and the commonly used machine learning models, parameter optimization was performed before model testing, and the prediction results were classified using the confusion matrix, which gives the distribution of the data over four cases: true malignant, false malignant, true benign, and false benign. The data distribution for different models is shown in Fig.\u0026nbsp;\u003cspan class=\"InternalRef\"\u003e2\u003c/span\u003eC. 
The experimental results were evaluated in terms of accuracy, which can also be read from the confusion matrix plot: 97.3% for the XGBoost model, 89.5% for the decision tree model, 95.6% for the random forest model, 94.7% for the K-Nearest Neighbors model, and 92.1% for the support vector machine model. Decision boundary visualization was used to show the class assigned to each data point in a dimensionality-reduced feature space; the decision boundaries of the five models are shown in Fig.\u0026nbsp;\u003cspan class=\"InternalRef\"\u003e2\u003c/span\u003eD.\u003c/p\u003e\n\u003ch2\u003e2.5 Feature importance scoring\u003c/h2\u003e\n\u003cp\u003eFor the 15 feature factors included, the XGBoost, decision tree, and random forest models, which had the top three predictive performance, were used to score feature importance, yielding a ranking of the 15 features for each model. The top-ranked features of the XGBoost model are radius worst, radius mean, and concave points worst. The top five features of the decision tree model are radius worst, concave points worst, texture worst, texture mean, and fractal dimension worst. The top five features of the random forest model are radius worst, concave points worst, concavity mean, radius mean, and texture worst. 
This is consistent with the clinical diagnosis experience, which indicates that the accuracy of the model is good.\u003c/p\u003e\n\u003ch2\u003e2.6 Model interpretability analysis\u003c/h2\u003e\n\u003cp\u003eSHAP analysis was used to explain the XGBoost model by quantifying the contribution of individual features in the model. The method achieves the ranking of feature significance by calculating the SHAP mean. Notably, the mean and worst values of concave points, area, texture, and concave worst values became the top five key determinants. To visually represent the cumulative effect of each feature, we constructed a summary plot containing the SHAP values. This graphical representation provides a comprehensive understanding of how each feature contributes to individual patient prediction. Notably, elevated levels of these five traits were associated with elevated risk of breast cancer in patients. Where red represents a positive correlation and blue represents a negative correlation. (Fig.\u0026nbsp;\u003cspan class=\"InternalRef\"\u003e4\u003c/span\u003eA)\u003c/p\u003e\n\u003cp\u003eThe waterfall plot shows the probability of prediction of an outcome by a single feature, explaining the underlying contribution of each feature to the prediction of tumor benign and malignant outcomes in the dataset. (Fig.\u0026nbsp;\u003cspan class=\"InternalRef\"\u003e4\u003c/span\u003eB)\u003c/p\u003e\n\u003cp\u003eThe \u0026ldquo;force map\u0026rdquo; feature in the SHAP package was used to determine the overall SHAP value for each patient. Combining the overall SHAP values for promotion (red, leading to higher odds of breast cancer) and suppression (blue, leading to lower odds of breast cancer), the overall SHAP value for that patient was obtained as a functional outcome for breast cancer. The overall functional outcome of the model is 5.42, AREA, CONCAVE POINTS,TEXTURE, and COMPACTNESS are strong drivers of probability prediction, with CONCAVE POINTS playing a dominant role. 
(Fig.\u0026nbsp;\u003cspan class=\"InternalRef\"\u003e4\u003c/span\u003eC) The force diagram was then rotated 90\u0026deg; counterclockwise and the process was repeated for a sample of 569 patients in the test dataset, providing a global plot of the probability of clustering similar combinations of risk factors in the test dataset (Fig.\u0026nbsp;\u003cspan class=\"InternalRef\"\u003e4\u003c/span\u003eD). The figure shows the common characteristics of the subgroups of patients with high (red) or low (blue) predictive probabilities. As can be seen from the figure, the main drivers for high-risk patients are radius-worst, concave points-worst, and concavity-worst. The tree model for interaction prediction enables fast and accurate pairwise interaction calculations, returning a matrix for each predicted value in which the main effects are on the diagonal and the interaction effects are off the diagonal. The results show that radius-worst interacts with concave points-worst. See Fig.\u0026nbsp;\u003cspan class=\"InternalRef\"\u003e5\u003c/span\u003e.\u003c/p\u003e"},{"header":"Discussion","content":"\u003cp\u003eIn this study, a machine learning model was used to screen 15 feature factors affecting whether breast tumors are benign or malignant and to predict their relationship with breast cancer occurrence. The feature factor most strongly associated with breast cancer occurrence was the worst value of tumor radius. This is consistent with clinical experience that typical early symptoms of breast cancer are dominated by local symptoms such as breast lumps, breast skin abnormalities, nipple areola abnormalities, and nipple discharge \u003csup\u003e[\u003cspan citationid=\"CR27\" class=\"CitationRef\"\u003e29\u003c/span\u003e]\u003c/sup\u003e. The size of the breast lump is closely related to the occurrence of breast cancer, and patients who discover a breast lump should seek medical consultation as early as possible. 
Comparing the models, we found that the best-performing model was XGBoost, with an accuracy of 97.368%. Its advantages are that the XGBoost algorithm is highly extensible, its base classifiers can be replaced, it tolerates outliers in the data well, and its generalization ability is improved. In summary, the XGBoost model is suitable for predictive analysis of breast cancer, and its predictive ability is better than that of the four commonly used machine learning models in both the training and validation groups.\u003c/p\u003e\u003cp\u003eIn previous studies, machine learning models have shown higher diagnostic power than traditional Cox regression models \u003csup\u003e[\u003cspan citationid=\"CR28\" class=\"CitationRef\"\u003e30\u003c/span\u003e–\u003cspan citationid=\"CR29\" class=\"CitationRef\"\u003e31\u003c/span\u003e]\u003c/sup\u003e. However, the outputs of machine learning models are often difficult to explain or inspect, which casts doubt on the generalizability of such algorithms for predicting disease occurrence in a clinical setting. We therefore computed SHAP values for the machine learning models to interpret and visualize their predictions. Similarly, Moncada-Torres and coworkers have proposed the use of SHAP values in interpreting ML models to predict breast cancer survival prognosis \u003csup\u003e[\u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e21\u003c/span\u003e]\u003c/sup\u003e. These studies provide theoretical references and a practical basis for related research. The SHAP results show that radius-worst is the most important factor in breast cancer occurrence and is positively correlated with it. Its interaction with concave points-worst merits further research. Because the data in this study were obtained from public databases, and the indicators they contain are highly specific, it was not possible to match data collected from hospitals with the public databases. 
Therefore, there was a lack of external validation during model testing. This study also has other limitations. The sample comprises only 569 cases, which is small; as a result, the running time of all five algorithms was short, and the algorithms could not be compared in terms of running time. Further validation is needed with results from multicenter and big data studies. This study raises new possibilities for replacing invasive testing methods with imaging in the future. We will also consider laboratory studies and prospective experimental studies.\u003c/p\u003e"},{"header":"Conclusion","content":"\u003cp\u003eIn summary, we have built an excellent, interpretable machine learning model for predicting the occurrence of breast cancer and identified the worst value of tumor radius as the high-risk influencing factor for breast cancer.\u003c/p\u003e"},{"header":"Declarations","content":"\u003cp\u003e\u003cstrong\u003eClinical trial number\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eNot applicable.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eFunding\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eDalian Medical Science Research Program (2311001)\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAvailability of data and materials\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eIn this manuscript we use de-identified data from a public repository. The data are included on the BMC Med Res Method website. 
As such, ethical approval was not required.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAcknowledgments\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eWe acknowledge and thank the investigators, scientists, and developers who have contributed to the scientific community by making their data, code, and software freely available.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eEthics approval and consent to participate\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eIn this manuscript we use de-identified data from a public repository. The data are included on the BMC Med Res Method website. As such, ethical approval was not required.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eCompeting interests\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe authors report no competing interests relating to this work.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\n\u003cli\u003eGradishar, William J et al. \u0026ldquo;Breast Cancer, Version 3.2024, NCCN Clinical Practice Guidelines in Oncology.\u0026rdquo; \u003cem\u003eJournal of the National Comprehensive Cancer Network : JNCCN\u003c/em\u003e vol. 22,5 (2024): 331-357. doi:10.6004/jnccn.2024.0035\u003c/li\u003e\n\u003cli\u003eSiegel, Rebecca L et al. \u0026ldquo;Cancer statistics, 2022.\u0026rdquo; \u003cem\u003eCA: a cancer journal for clinicians\u003c/em\u003e vol. 72,1 (2022): 7-33. doi:10.3322/caac.21708\u003c/li\u003e\n\u003cli\u003eSiegel, Rebecca L et al. \u0026ldquo;Cancer statistics, 2024.\u0026rdquo; \u003cem\u003eCA: a cancer journal for clinicians\u003c/em\u003e vol. 74,1 (2024): 12-49. doi:10.3322/caac.21820\u003c/li\u003e\n\u003cli\u003eJiang, Yi-Zhou et al. \u0026ldquo;Integrated multiomic profiling of breast cancer in the Chinese population reveals patient stratification and therapeutic vulnerabilities.\u0026rdquo; \u003cem\u003eNature cancer\u003c/em\u003e vol. 5,4 (2024): 673-690. 
doi:10.1038/s43018-024-00725-0\u003c/li\u003e\n\u003cli\u003eSung, Hyuna et al. \u0026ldquo;Global Cancer Statistics 2020: GLOBOCAN Estimates of Incidence and Mortality Worldwide for 36 Cancers in 185 Countries.\u0026rdquo; \u003cem\u003eCA: a cancer journal for clinicians\u003c/em\u003e vol. 71,3 (2021): 209-249. doi:10.3322/caac.21660\u003c/li\u003e\n\u003cli\u003eLi X, Li X, Yang B, Sun S, Wang S, Yu F, Wang T. Enhancing breast cancer outcomes with machine learning-driven glutamine metabolic reprogramming signature. Front Immunol. 2024 May 1;15:1369289. doi: 10.3389/fimmu.2024.1369289. PMID: 38756785; PMCID: PMC11097668.\u003c/li\u003e\n\u003cli\u003eAnderson BO, Ilbawi AM, Fidarova E, et al. The Global Breast Cancer Initiative: a strategic collaboration to strengthen health care for non-communicable diseases. Lancet Oncol. 2021;22(5):578-581. doi: 10.1016/S1470-2045(21)00071-1\u003c/li\u003e\n\u003cli\u003eFreitas, Ana Julia Aguiar de et al. \u0026ldquo;Liquid Biopsy as a Tool for the Diagnosis, Treatment, and Monitoring of Breast Cancer.\u0026rdquo; International journal of molecular sciences vol. 23,17 9952. 1 Sep. 2022, doi:10.3390/ijms23179952\u003c/li\u003e\n\u003cli\u003eGradishar, William J et al. \u0026ldquo;Breast Cancer, Version 3.2024, NCCN Clinical Practice Guidelines in Oncology.\u0026rdquo; \u003cem\u003eJournal of the National Comprehensive Cancer Network : JNCCN\u003c/em\u003e vol. 22,5 (2024): 331-357. doi:10.6004/jnccn.2024.0035\u003c/li\u003e\n\u003cli\u003eLai, Jianguo et al. \u0026ldquo;A radiogenomic multimodal and whole-transcriptome sequencing for preoperative prediction of axillary lymph node metastasis and drug therapeutic response in breast cancer: a retrospective, machine learning and international multicohort study.\u0026rdquo; \u003cem\u003eInternational journal of surgery (London, England)\u003c/em\u003e vol. 110,4 2162-2177. 1 Apr. 2024, doi:10.1097/JS9.0000000000001082\u003c/li\u003e\n\u003cli\u003eHan, Xiaorui et al. 
\u0026ldquo;Development of a machine learning-based radiomics signature for estimating breast cancer TME phenotypes and predicting anti-PD-1/PD-L1 immunotherapy response.\u0026rdquo; \u003cem\u003eBreast cancer research : BCR\u003c/em\u003e vol. 26,1 18. 29 Jan. 2024, doi:10.1186/s13058-024-01776-y\u003c/li\u003e\n\u003cli\u003eZhou, Sheng et al. \u0026ldquo;Breast Cancer Prediction Based on Multiple Machine Learning Algorithms.\u0026rdquo; \u003cem\u003eTechnology in cancer research \u0026amp; treatment\u003c/em\u003e vol. 23 (2024): 15330338241234791. doi:10.1177/15330338241234791\u003c/li\u003e\n\u003cli\u003eChen TQ, Guestrin C. XGBoost: a scalable tree boosting system [C] // Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. San Francisco California USA. New York, NY, USA: ACM, 2016: 785-94.\u003c/li\u003e\n\u003cli\u003eGUAN X, ZHANG B, FU M, et al. Clinical and inflammatory features based machine learning model for fatal risk prediction of hospitalized COVID-19 patients: results from a retrospective cohort study [J]. Ann Med, 2021, 53(1): 257-266.\u003c/li\u003e\n\u003cli\u003eCHRISTOPHER T, BRODY J P. Evaluation of a genetic risk score for severity of COVID-19 using human chromosomal-scale length variation [J/OL]. Hum Genomics, 2020, 14(1) [2021-06-04]. https://pubmed.ncbi.nlm.nih.gov/33036646/. DOI: 10.1186/s40246-020-00288-y.\u003c/li\u003e\n\u003cli\u003eNensa F, Demircioglu A, Rischpler C. Artificial intelligence in nuclear medicine. J Nucl Med. 2019;60(Suppl 2):29S-37S. https://doi.org/10.2967/jnumed.118.220590.\u003c/li\u003e\n\u003cli\u003eThe Lancet Respiratory M. Opening the black box of machine learning. Lancet Respir Med. 2018;6(11):801. https://doi.org/10.1016/S2213-2600(18)30425-9.\u003c/li\u003e\n\u003cli\u003ePetch J, Di S, Nelson W. Opening the black box: the promise and limitations of explainable machine learning in cardiology. Can J Cardiol. 2022;38(2):204\u0026ndash;13. 
https://doi.org/10.1016/j.cjca.2021.09.004.\u003c/li\u003e\n\u003cli\u003eLundberg SM, Erion G, Chen H, DeGrave A, Prutkin JM, Nair B, et al. From local explanations to global understanding with explainable AI for trees. Nat Mach Intell. 2020;2(1):56\u0026ndash;67. https://doi.org/10.1038/s42256-019-0138-9.\u003c/li\u003e\n\u003cli\u003eLu, Wenhao et al. \u0026ldquo;Explainable and visualizable machine learning models to predict biochemical recurrence of prostate cancer.\u0026rdquo; \u003cem\u003eClinical \u0026amp; translational oncology : official publication of the Federation of Spanish Oncology Societies and of the National Cancer Institute of Mexico\u003c/em\u003e vol. 26,9 (2024): 2369-2379. doi:10.1007/s12094-024-03480-x\u003c/li\u003e\n\u003cli\u003eMoncada-Torres A, van Maaren MC, Hendriks MP, Siesling S, Geleijnse G. Explainable machine learning can outperform Cox regression predictions and provide insights in breast cancer survival. Sci Rep. 2021;11(1):6968. https://doi.org/10.1038/s41598-021-86327-7.\u003c/li\u003e\n\u003cli\u003eLichman M. UCI Machine Learning Repository: Breast Cancer Wisconsin (Diagnostic) Data Set. 2014. http://archive.ics.uci.edu/ml. Accessed 8 Aug\u003c/li\u003e\n\u003cli\u003eMangasarian OL, Street WN, Wolberg WH. Breast Cancer Diagnosis and Prognosis via Linear Programming: AAAI; 1994, pp. 83-86.\u003c/li\u003e\n\u003cli\u003eWolberg WH, Mangasarian OL. Multisurface method of pattern separation for medical diagnosis applied to breast cytology. Proc Natl Acad Sci USA. 1990;87:9193\u0026ndash;6.\u003c/li\u003e\n\u003cli\u003eBennett KP. Decision tree construction via linear programming: University of Wisconsin-Madison Department of Computer Sciences; 1992, pp. 97\u0026ndash;101.\u003c/li\u003e\n\u003cli\u003eWu, Peng et al. \u0026ldquo;Pan-cancer characterization of cell-free immune-related miRNA identified as a robust biomarker for cancer diagnosis.\u0026rdquo; \u003cem\u003eMolecular cancer\u003c/em\u003e vol. 23,1 31. 
12 Feb. 2024, doi:10.1186/s12943-023-01915-7\u003c/li\u003e\n\u003cli\u003eJiang, Yiyao et al. \u0026ldquo;Predicting anti-cancer drug sensitivity through WRE-XGBoost algorithm with weighted feature selection.\u0026rdquo; \u003cem\u003eGenes \u0026amp; diseases\u003c/em\u003e vol. 12,2 101275. 22 Mar. 2024, doi:10.1016/j.gendis.2024.101275\u003c/li\u003e\n\u003cli\u003eGuan, Xiuliang et al. \u0026ldquo;Construction of the XGBoost model for early lung cancer prediction based on metabolic indices.\u0026rdquo; \u003cem\u003eBMC medical informatics and decision making\u003c/em\u003e vol. 23,1 107. 13 Jun. 2023, doi:10.1186/s12911-023-02171-x\u003c/li\u003e\n\u003cli\u003eBenitez Fuentes, Javier David et al. \u0026ldquo;Global Stage Distribution of Breast Cancer at Diagnosis: A Systematic Review and Meta-Analysis.\u0026rdquo; JAMA oncology vol. 10,1 (2024): 71-78. doi:10.1001/jamaoncol.2023.4837\u003c/li\u003e\n\u003cli\u003eKim DW, Lee S, Kwon S, Nam W, Cha IH, Kim HJ. Deep learning-based survival prediction of oral cancer patients. Sci Rep. 2019;9(1):6994. https://doi.org/10.1038/s41598-019-43372-7.\u003c/li\u003e\n\u003cli\u003eNicolo C, Perier C, Prague M, Bellera C, MacGrogan G, Saut O, et al. Machine learning and mechanistic modeling for prediction of metastatic relapse in early-stage breast cancer. JCO Clin Cancer Inform. 2020;4:259\u0026ndash;74. 
https://doi.org/10.1200/CCI.19.00133.\u003c/li\u003e\n\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":false,"hideJournal":false,"highlight":"","institution":"","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"[email protected]","identity":"bmc-medical-research-methodology","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"bmrm","sideBox":"Learn more about [BMC Medical Research Methodology](http://bmcmedresmethodol.biomedcentral.com/)","snPcode":"","submissionUrl":"https://www.editorialmanager.com/bmrm/default.aspx","title":"BMC Medical Research Methodology","twitterHandle":"BMC_series","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"em","reportingPortfolio":"BMC Series","inReviewEnabled":true,"inReviewRevisionsEnabled":true},"keywords":"breast cancer, XGBoost, prediction, SHAP, machine learning","lastPublishedDoi":"10.21203/rs.3.rs-6124339/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-6124339/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003ch2\u003ePurpose\u003c/h2\u003e \u003cp\u003eTo compare the predictive effectiveness and risk factor screening of extreme gradient ascent (XGBoost) model and four commonly used machine learning models for breast cancer diagnosis, and to interpret the model results by SHAP interpretability analysis.\u003c/p\u003e\u003ch2\u003eMaterials and methods\u003c/h2\u003e \u003cp\u003eBreast tumor data from the UCI public database were used to screen the characteristic factors using the heat map of the correlation coefficient matrix, and five machine learning algorithms, XGBoost, Random 
Forest, K-Nearest Neighbors, Decision Tree, and Support Vector Machines, were compared by precision, recall, F1 value, and accuracy. The ROC curves of the five models were plotted, and the confusion matrix was used to classify the prediction results, identifying XGBoost as the best-performing model. The XGBoost model, the decision tree model, and the random forest model were used to derive the order of importance of the feature factors, and an interpretability analysis was performed with SHAP to derive the important feature factors affecting the occurrence of breast cancer.\u003c/p\u003e\u003ch2\u003eResults\u003c/h2\u003e \u003cp\u003eThe ROC curve results showed that the accuracy on the test set was 97.4% for the XGBoost model, 91.2% for the decision tree model, 95.6% for the random forest model, 94.7% for the K-Nearest Neighbors model, and 92.1% for the support vector machine model. The confusion matrix plot also gives 97.3% accuracy for the XGBoost model, 89.5% for the decision tree model, 95.6% for the random forest model, 94.7% for the K-Nearest Neighbors model, and 92.1% for the support vector machine model. In the feature importance scores of the three models, the most important feature is radius-worst. 
The SHAP interpretable model results showed that the main drivers for high risk patients were radius-worst,concave points-worst,concavity-worst.Also radius-worst interacted with concave points-worst.\u003c/p\u003e\u003ch2\u003eConclusions\u003c/h2\u003e \u003cp\u003eXGBoost algorithm model is more accurate compared with traditional machine learning model, radius-worst is an important factor affecting breast cancer occurrence, and its interaction with concave points-worst exists.\u003c/p\u003e","manuscriptTitle":"Breast cancer prediction modeling based on SHAP interpretability analysis and XGBoost algorithm","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2025-04-17 10:36:23","doi":"10.21203/rs.3.rs-6124339/v1","editorialEvents":[{"type":"communityComments","content":0},{"type":"decision","content":"Revision requested","date":"2025-04-22T06:28:28+00:00","index":"","fulltext":""},{"type":"editorInvitedReview","content":"","date":"2025-04-21T18:16:27+00:00","index":"hide","fulltext":""},{"type":"editorInvitedReview","content":"","date":"2025-04-09T17:35:05+00:00","index":"hide","fulltext":""},{"type":"editorInvitedReview","content":"","date":"2025-04-08T15:21:20+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"200275416351589211573709805066097359484","date":"2025-04-05T13:20:42+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"88452104536201638430942946553975330192","date":"2025-03-31T00:24:17+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"223017803365079518416671518383718573661","date":"2025-03-30T15:00:06+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"232886943584045897911299881334056877064","date":"2025-03-30T08:19:26+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"282429236510685268886560595161063794849","date":"2025-03-28T05:47:34+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"28125092334704930606577366
1401138712838","date":"2025-03-28T04:15:49+00:00","index":"hide","fulltext":""},{"type":"reviewersInvited","content":"","date":"2025-03-27T19:37:29+00:00","index":"","fulltext":""},{"type":"editorInvited","content":"","date":"2025-03-17T11:27:42+00:00","index":"","fulltext":""},{"type":"editorAssigned","content":"","date":"2025-03-14T13:35:46+00:00","index":"","fulltext":""},{"type":"checksComplete","content":"","date":"2025-03-14T13:34:05+00:00","index":"","fulltext":""},{"type":"submitted","content":"BMC Medical Research Methodology","date":"2025-02-28T01:04:54+00:00","index":"","fulltext":""}],"status":"published","journal":{"display":true,"email":"[email protected]","identity":"bmc-medical-research-methodology","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"bmrm","sideBox":"Learn more about [BMC Medical Research Methodology](http://bmcmedresmethodol.biomedcentral.com/)","snPcode":"","submissionUrl":"https://www.editorialmanager.com/bmrm/default.aspx","title":"BMC Medical Research Methodology","twitterHandle":"BMC_series","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"em","reportingPortfolio":"BMC Series","inReviewEnabled":true,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"5929bdff-2643-4853-b5a0-77d42ce94685","owner":[],"postedDate":"April 17th, 2025","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"in-revision","subjectAreas":[],"tags":[],"updatedAt":"2026-05-21T01:54:10+00:00","versionOfRecord":[],"versionCreatedAt":"2025-04-17 10:36:23","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-6124339","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-6124339","identity":"rs-6124339","version":["v1"]},"buildId":"8U1c8b4HqxoKbykW_rLl7","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}
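The additive decomposition behind the waterfall and force plots in Section 2.6, and the interaction matrix with main effects on the diagonal, can be illustrated with a minimal sketch. This is not the authors' code: `toy_model`, its coefficients, the input values, and the all-zero baseline are hypothetical, and the feature names are only borrowed from the paper. The sketch enumerates feature subsets exactly against a single baseline, whereas the SHAP package's tree explainer computes the same kind of quantities efficiently from the trees of a fitted model.

```python
from itertools import combinations
from math import factorial

# Feature names borrowed from the paper; the model below is a hypothetical toy.
FEATURES = ["radius_worst", "concave_points_worst", "texture_worst"]


def toy_model(x):
    """Hypothetical risk score with one interaction term (radius x concave points)."""
    r, c, t = x
    return 0.5 * r + 0.2 * t + 1.0 * r * c


def value(model, x, baseline, subset):
    """Model output when only the features in `subset` take the patient's values."""
    mixed = [x[i] if i in subset else baseline[i] for i in range(len(x))]
    return model(mixed)


def subsets(items):
    """Yield every subset of `items` as a set."""
    for k in range(len(items) + 1):
        for combo in combinations(items, k):
            yield set(combo)


def shapley_values(model, x, baseline):
    """Exact Shapley value of each feature by enumerating all coalitions."""
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for s in subsets(others):
            weight = factorial(len(s)) * factorial(n - len(s) - 1) / factorial(n)
            gain = value(model, x, baseline, s | {i}) - value(model, x, baseline, s)
            phi[i] += weight * gain
    return phi


def interaction_matrix(model, x, baseline):
    """Shapley interaction values: main effects on the diagonal, pairs off it."""
    n = len(x)
    phi = shapley_values(model, x, baseline)
    m = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            others = [k for k in range(n) if k not in (i, j)]
            for s in subsets(others):
                weight = (factorial(len(s)) * factorial(n - len(s) - 2)
                          / (2 * factorial(n - 1)))
                delta = (value(model, x, baseline, s | {i, j})
                         - value(model, x, baseline, s | {i})
                         - value(model, x, baseline, s | {j})
                         + value(model, x, baseline, s))
                m[i][j] += weight * delta
    for i in range(n):
        # Main effect = total attribution minus the feature's share of interactions.
        m[i][i] = phi[i] - sum(m[i][j] for j in range(n) if j != i)
    return m


patient = [2.0, 1.0, 3.0]    # hypothetical feature values
baseline = [0.0, 0.0, 0.0]   # hypothetical reference point
phi = shapley_values(toy_model, patient, baseline)
matrix = interaction_matrix(toy_model, patient, baseline)
for name, contribution in zip(FEATURES, phi):
    print(f"{name}: {contribution:.3f}")
```

In this toy case the contributions sum to the difference between the patient's score and the baseline score, which is the quantity a waterfall or force plot stacks up, and the only non-zero off-diagonal entries of `matrix` are those for the radius and concave points pair, mirroring the kind of interaction reported in Fig. 5.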

