Predicting Endometriosis Onset Using Machine Learning Algorithms

preprint OA: gold CC0 ⤵ 9 in-corpus citations
AI-generated summary by claude@2026-06, 2026-06-07

This study employed Logistic Regression and XGBoost machine learning algorithms to identify key diagnostic and procedural codes that predict the onset of endometriosis from patient medical history data.

One-sentence paraphrase of the abstract; not a substitute for reading it. No clinical advice.

AI-generated deep summary by claude@2026-06, 2026-06-07 · read from full text

This paper uses US healthcare claims patient-level data (2019 cohorts with 36 months of prior history) to build machine learning models that predict the likelihood of endometriosis onset, training logistic regression and extreme gradient boosting algorithms on features derived from diagnosis, procedure, and treatment codes. The target cohort consisted of 314,101 confirmed endometriosis patients identified via ICD-10 codes, and controls were selected from a larger random sample of female patients (age ≥18) using propensity score matching by age and medical history. The authors report identifying directly and indirectly related medical events as important features for accurate prediction. This paper is centrally about endometriosis — using logistic regression and XGBoost on claims-based medical history to predict endometriosis onset.

Read from the paper's body, not the abstract. Not a substitute for reading the paper. No clinical advice.

Abstract

Background
Endometriosis is a common, progressive female health disorder in which tissue similar to the lining of the uterus grows on other parts of the body, such as the ovaries, fallopian tubes, bowel, and other reproductive organs. In women, it is one of the most common causes of pelvic pain and infertility. In the US, one in every ten women of reproductive age has endometriosis. The actual cause of endometriosis is still unknown, and the condition is difficult to diagnose. There are several theories regarding the cause; however, none has been scientifically proven.

Methods
In this paper, we try to identify the drivers of endometriosis diagnoses by leveraging advanced Machine Learning (ML) algorithms. The primary risks of infertility and other health complications can be minimized to a great extent if the likelihood of endometriosis can be predicted well in advance, so that proper medical care and treatment can be given to the impacted patients. To demonstrate feasibility, Logistic Regression (LR) and eXtreme Gradient Boosting (XGB) algorithms were trained on 36 months of medical history data.

Results
The machine learning models were used to predict the likelihood of disease for qualified patients from the healthcare claims patient level database. Several directly and indirectly related features were identified as important for accurate prediction of the condition onset, including selected diagnosis and procedure codes.

Conclusions
Leveraging machine learning approaches can aid early prediction of the disease and offer an opportunity for patients to receive the needed medical treatment earlier in the patient journey. Creating a typing tool that can be integrated into Electronic Health Records (EHR) systems and easily accessed by healthcare providers could further improve diagnosis activities and inform the diagnostic processes, resulting in timely and precise diagnosis and ultimately improving patient care and quality of life.
Research Square | Research article

Ewa J Kleczyk, Aparna Peri, Tarachand Yadav, Ramachandra Komera, and 4 more

This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-135736/v1
This work is licensed under a CC BY 4.0 License
Status: Under Review
Version 1 posted 9

Internal Medicine, Preventive Medicine, Endometriosis, infertility, likelihood, Logistic Regression, Machine Learning, eXtreme Gradient Boosting

1. Background

Recent advancement in Artificial Intelligence (AI) and Machine Learning (ML) has provided the opportunity for AI and ML application in the healthcare area, while also slowly improving on the performance benchmark set by the classical statistical techniques [1].
In recent years, healthcare service providers have also shown interest in data science and machine learning for disease diagnosis. Disease prediction using data mining and machine learning algorithms applied to patient medical history (diagnoses, medical and surgical procedures, therapeutics, treatments, etc.) has been slowly introduced to aid decision making processes [2, 3, 4]. Many statistical and machine learning techniques have been applied to either pathological or clinical data to study a disease in detail and to predict its likelihood of occurrence. Deep learning algorithms such as the Convolutional Neural Network (CNN) have been found to predict disease onset and progression with greater precision compared to analyzing just medical image data [5]. Since healthcare is one of the leading industries with a large amount of structured and unstructured data, it is imperative to use the known advanced techniques to extract the hidden data patterns. Machine Learning algorithms, with the help of big data technology, have made it easier to mine the vast amount of unstructured data and have aided in making important decisions related to patients’ health [6]. Because of their high precision and robustness in comparison to conventional statistical methods, these models have attracted many medical scientists who seek to understand the key drivers of disease onset and to predict disease progression. Artificial Intelligence, Machine Learning, and big data have been playing a pivotal role in improving healthcare infrastructure, patient care, disease diagnosis, prediction and forecasting, drug discovery, etc., thereby reducing medical costs, shortening the time to diagnosis and treatment, and enhancing patients’ quality of life and access to healthcare [7].

With this motivation in mind, we selected endometriosis as the condition to study in this article.
Endometriosis is one of the most common disorders seen in women of menstruating age, in which tissue like the endometrial lining grows on the outer part of the uterus and on other organs of the pelvic region. The signs and symptoms vary from patient to patient: some patients have mild symptoms, while others display a moderate to severe level of the condition. The most common symptoms of endometriosis are pelvic pain, dysmenorrhea, and infertility. There is no guaranteed treatment for endometriosis at this time; however, with an early diagnosis and the available medical and surgical options, healthcare providers can reduce the risks of potential complications and improve the quality of life for their patients. If we can identify or predict the probability of endometriosis onset by analyzing the medical history of diagnosed patients, the results might benefit both the healthcare providers’ diagnosis process and patients’ well-being and quality of life. In this study, the Logistic Regression (LR) and eXtreme Gradient Boosting (XGB) algorithms were used to predict endometriosis occurrence from the medical history of diagnosed patients.

The remainder of the article is organized as follows: in Sect. 2, we briefly review the project objectives; in Sect. 3, we describe the methods used in data preparation, feature engineering, feature selection, and model training and validation; in Sect. 4, we present the model outputs and results; and in Sect. 5, we conclude the study with a summary of our findings.

2. Objectives

The following objectives will be addressed in this article:
Train machine learning algorithms to predict the likelihood of endometriosis.
Identify the most significant medical events in the patient journey that lead to the diagnosis of endometriosis.
Score the entire database using the best performing trained models.
Profile patients using the predicted scores.

3. Methods

Overview

The data source for this project is the healthcare claims patient level database, with the study time period from January 31, 2019 to December 31, 2019. Two patient cohorts, study target and control, were established using endometriosis ICD 10 diagnosis codes. As endometriosis is a female only condition, female patients 18 and older were part of the study target cohort. A control cohort is often used to create a patient sample to compare with the study target cohort and is selected using cohort matching algorithms. For both the study target and control cohorts, 36 months of patient medical history prior to the first disease event in 2019 were extracted. The healthcare claims patient level data includes diagnosis codes, medical and surgical codes, and therapeutics and treatments prescribed, all at the transactional level. A number of analytical methods were leveraged for the analysis, from rules-based patient qualification criteria to Machine Learning algorithms that derive the probability of endometriosis onset. The following sub-sections present a detailed explanation of each of the selected methods. The healthcare claims patient level dataset considered in the analysis is specific to the US healthcare market.

3.1. Healthcare claims patient level database

The healthcare claims patient level database is an anonymous longitudinal patient data set that can be used by organizations that are directly or indirectly associated with healthcare [9, 41]. There has been an increasing interest in patient-level data, as researchers, healthcare providers, and pharmaceutical companies are realizing the potential of creating better comparisons of effective treatment outcomes by analyzing longitudinal data that represent individual patient-based experiences and interactions with the US healthcare system [42].
The healthcare claims patient level database leveraged for this study consists of medical, hospital, and prescription claims across all payment types [10, 44]. The database covers more than 317 million patients in the US, spans more than 17 years of medical health history, and includes more than 1.9 million healthcare providers [43]. Figure 1 presents the summary of information in the database.

3.2. Cohort selection

For this study, we identified 314,101 confirmed endometriosis patients in 2019 in the healthcare claims patient level database, using predefined ICD 10 diagnosis codes (Table 1). Female patients age 18 and above were selected for the study target cohort. For the control cohort, a random sample of 3 million female patients with the same age criterion was extracted from the database.

Table 1 ICD 10 diagnosis codes of endometriosis

| Diagnosis Code | Diagnosis Long Description |
|---|---|
| N80.0 | Endometriosis of uterus |
| N80.1 | Endometriosis of ovary |
| N80.2 | Endometriosis of fallopian tube |
| N80.3 | Endometriosis of pelvic peritoneum |
| N80.4 | Endometriosis of rectovaginal septum and vagina |
| N80.5 | Endometriosis of intestine |
| N80.6 | Endometriosis in cutaneous scar |
| N80.8 | Other endometriosis |
| N80.9 | Endometriosis, unspecified |

To select, out of the 3 million sampled patients, a control cohort equal in size to the study target group, a technique known as ‘propensity score matching’ was used [18]. The propensity matching algorithm [19], a statistical technique, selects the control cohort based on similar characteristics, or covariates, observed in the study target cohort. The covariates considered for selection were patient age and medical history [20]. Table 2 presents the distribution comparison between the study target and control cohorts by age and Census geographies. The patient age variable was created by grouping ages into ranges, and US states were grouped into regions.
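The matching step can be illustrated with a minimal sketch. The paper does not state which matching variant was used, so everything below is an assumption for illustration: propensity scores are taken as already estimated (for example from a model on age and medical history), matching is greedy 1:1 nearest-neighbour without replacement, and the `caliper` parameter, the function name `match_controls`, and the patient IDs are hypothetical.

```python
def match_controls(target_scores, control_scores, caliper=0.05):
    """Greedy 1:1 nearest-neighbour matching on propensity scores.

    target_scores / control_scores map patient ID -> propensity score.
    Each control is used at most once (matching without replacement).
    Returns a dict mapping target ID -> matched control ID.
    """
    available = dict(control_scores)  # controls not yet matched
    matches = {}
    # Iterate in sorted ID order so the result is reproducible.
    for t_id in sorted(target_scores):
        if not available:
            break
        t_score = target_scores[t_id]
        # Closest remaining control; ties are broken by control ID.
        best_id = min(available, key=lambda c: (abs(available[c] - t_score), c))
        # Accept the match only if it lies within the caliper.
        if abs(available[best_id] - t_score) <= caliper:
            matches[t_id] = best_id
            del available[best_id]
    return matches


# Hypothetical scores for two target and four candidate control patients.
targets = {"T1": 0.62, "T2": 0.35}
controls = {"C1": 0.10, "C2": 0.60, "C3": 0.36, "C4": 0.90}
matched = match_controls(targets, controls)
print(matched)
```

This quadratic-time loop is only meant to show the idea; matching 314,101 targets against 3 million candidates would need an indexed or sorted search.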
Table 2 Comparison between target and control cohort by age and region respectively

| Age Group | Target | Control | Region | Target | Control |
|---|---|---|---|---|---|
| 18–24 | 6.45% | 6.55% | South | 39.90% | 39.90% |
| 25–34 | 25.01% | 25.24% | Midwest | 22.78% | 22.76% |
| 35–44 | 37.57% | 37.08% | Northeast | 18.82% | 18.84% |
| 45–54 | 23.13% | 23.18% | West | 17.02% | 17.02% |
| 55–64 | 6.22% | 6.31% | Other | 1.48% | 1.48% |
| 65+ | 1.62% | 1.64% | | | |

3.3. Data extraction

The next step in the analysis process was to extract the entire medical history of the patients from the available information in the healthcare claims patient level database. To ensure that only healthcare history prior to the first condition event was extracted, the event date for the target cohort was established for each patient. In the case of the control cohort, the first activity in 2019 was considered the event date. Using the event date of each patient, 36 months of medical history data was extracted. The historical data presented all the medical events in the patient history, including diagnoses for comorbid conditions, medical and surgical procedures, therapeutics, and treatments prescribed to patients. Only the top 1,000 diagnosis codes, top 800 medical and surgical procedures, and top 500 prescribed drugs were considered for further analysis, as these top codes constituted more than 80% of the total data. A pivot table was created in which the transaction-level data was aggregated by anonymized patient ID. After the historical medical claims data was preprocessed for each cohort independently, the two datasets were integrated into a single data frame with more than 2,600 features. The dataset was then standardized and split into a training set and a test set using a 70:30 ratio [21]. The training dataset is used to identify the key features of endometriosis onset, while the test set is used to validate whether these features predict the condition onset accurately on held-out patients [22].
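The pivot and split steps described above can be sketched in a few lines. This is a simplified illustration using only the standard library, not the authors' pipeline: the function names, the tiny claims list, and the use of per-patient code counts as feature values are assumptions, and the standardization step is omitted.

```python
import random
from collections import Counter, defaultdict


def build_patient_features(claims, top_k):
    """Pivot transaction-level claims into one feature row per patient.

    claims: iterable of (patient_id, code) pairs.
    Only the top_k most frequent codes are kept as feature columns.
    Returns (kept_codes, {patient_id: [count for each kept code]}).
    """
    claims = list(claims)
    code_counts = Counter(code for _, code in claims)
    kept = [code for code, _ in code_counts.most_common(top_k)]
    kept_set = set(kept)
    per_patient = defaultdict(Counter)
    for pid, code in claims:
        counts = per_patient[pid]  # creates the row even if no code is kept
        if code in kept_set:
            counts[code] += 1
    features = {pid: [cnt[c] for c in kept] for pid, cnt in per_patient.items()}
    return kept, features


def train_test_split_ids(patient_ids, test_fraction=0.30, seed=0):
    """Shuffle patient IDs with a fixed seed and split them train/test."""
    ids = sorted(patient_ids)
    random.Random(seed).shuffle(ids)
    n_test = round(len(ids) * test_fraction)
    return ids[n_test:], ids[:n_test]


# Hypothetical claims: (anonymized patient ID, prefixed code).
claims = [
    ("p1", "D_N94_6"), ("p1", "D_N94_6"), ("p1", "P_76830"),
    ("p2", "D_R10_2"), ("p2", "D_N94_6"),
    ("p3", "P_76830"),
]
codes, features = build_patient_features(claims, top_k=2)
train_ids, test_ids = train_test_split_ids([f"p{i}" for i in range(10)])
```

Splitting on patient IDs rather than on individual claims keeps all of a patient's history on one side of the split.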
Splitting the data into train and test sets helps to assess the model performance and its ability to generalize to unseen data [23].

3.4. Machine Learning algorithms’ overview

Machine Learning algorithms can be grouped into two categories: supervised and unsupervised learning.

3.4.a. Supervised learning algorithms

Supervised learning is the process of training machine learning algorithms to learn a mapping from an input space (X) to an output space (Y), i.e. Y = f(X) [25]. The major objective is to approximate the mapping function (f) so that when a new data point (x) is added, its outcome (y) can be predicted [26]. Supervised learning algorithms are mainly used for classification and prediction problems [32]. The most popular supervised algorithms include logistic regression, decision trees (DTs), random forest (RF), extreme gradient boosting, support vector machines (SVMs), Naïve Bayes, adaptive boosting (AdaBoost), artificial neural networks (ANNs), etc. [31].

3.4.b. Unsupervised learning algorithms

Unsupervised learning algorithms, on the other hand, try to learn the hidden patterns within the input dataset (X) [28]. These models are called unsupervised because, unlike in supervised learning, there is no supervision to guide them [29]. The algorithms are left to their own abilities to learn, discover, and showcase the patterns in the input data (X). They are highly popular for discovering natural clusters, dimension reduction, anomaly detection, etc. k-Means clustering, principal component analysis (PCA), factor analysis (FA), singular value decomposition (SVD), and the apriori algorithm (association rules) are some popular examples of unsupervised learning algorithms [31].

Depending on the study objectives and the available data, algorithms are explored, tested for performance and data type fit, and selected accordingly.
We framed endometriosis onset prediction as a supervised classification problem and selected the Logistic Regression and XGB models to develop a highly predictive algorithm of the disease onset. SVM, RF, AdaBoost, ANN, etc. are other options that have been explored in disease prediction; however, Logistic Regression and XGB were selected to predict the condition onset. Logistic Regression allows study of the odds of endometriosis occurrence for a given medical event [15], while XGB offers more flexibility in fine tuning the hyper-parameters than other tree based algorithms [11].

Logistic Regression

Logistic Regression is a statistical model that in its basic form uses a logistic function to model a binary dependent variable, although many more complex extensions exist [14, 15]. Mathematically, a binary logistic model has a dependent variable with two possible values, labeled "0" and "1" [33]. Outputs with more than two values are modeled by multinomial logistic regression. Logistic Regression is used in various fields, including healthcare and the social sciences [34].

eXtreme Gradient Boosting

Gradient boosting is a machine learning algorithm that builds an ensemble of weak prediction models, mostly decision trees [11]. An individual tree is a simple, often unreliable, model, but when multiple trees are grouped together, they can create a robust algorithm [12]. XGB starts by creating a first simple tree [35] and then progresses sequentially, building upon the weaker learners, with each iteration revising the previous trees until a stopping point is reached, such as the specified number of trees (estimators) [36].

Chi-Square Test

The Chi-square test is one of the most widely used non-parametric tests [37]. It is often utilized to test the independence of one or more attributes in a contingency table by comparing observed and expected frequencies, and is also popularly known as a ‘goodness of fit’ test [38].
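To make the Chi-square test concrete, the sketch below computes the Pearson statistic for a single binary feature (code present or absent in a patient's history) against cohort membership. It is an illustration under stated assumptions rather than the authors' implementation: the function name and the counts are hypothetical, no continuity correction is applied, and 3.841 is the standard critical value for one degree of freedom at the 0.05 level.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for a 2x2 contingency table.

                 feature present   feature absent
    target            a                 b
    control           c                 d
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column total must be positive")
    # Closed form of sum((observed - expected)^2 / expected) for a 2x2 table.
    return n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)


# Critical value of the chi-square distribution, 1 degree of freedom, alpha = 0.05.
CRITICAL_95_DF1 = 3.841

# Hypothetical counts: the feature appears in 60 of 100 target patients
# and in 30 of 100 control patients.
stat = chi_square_2x2(60, 40, 30, 70)
is_significant = stat > CRITICAL_95_DF1
print(round(stat, 2), is_significant)
```

A larger statistic means the feature's frequency differs more between the two cohorts, which is why the statistic can be used to rank candidate features.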
In this work, the Chi-square test is used to identify the top significant features given the dependent variable (Y) [40].

Logistic Regression, being the simplest of the machine learning algorithms, was selected as the base model for the analysis and used as the benchmark for the other models’ performance. Both the Logistic Regression and XGB models were trained, and the top 1,000 features from each algorithm were selected out of the more than 2,600 features used in the model runs. To decrease the number of data elements and to select only the variables most important for predicting the condition onset, we also used a Chi-square test to identify the top 1,000 features. As a next step, the unique features from each model were utilized to train the final machine learning model to predict the endometriosis occurrence probability. Algorithms were trained in Python 3.5 using the ‘scikit-learn’ and ‘xgboost’ libraries.

4. Results

4.1. Significant features selection

Table 3 presents the machine learning model performance metrics, which indicate that both the Logistic Regression and XGB models performed relatively well in predicting the condition onset. The models’ accuracy ranged from 88% to 96%.

Table 3. Classification metrics of train and test sets for LR and XGB model

| Algorithm | Statistic | Train Set | Test Set |
|---|---|---|---|
| LR | Accuracy | 96% | 96% |
| LR | Sensitivity/TPR/Recall | 95% | 95% |
| LR | Specificity/TNR | 98% | 97% |
| LR | Precision/PPV | 98% | 97% |
| LR | f1-Score | 0.96 | 0.96 |
| LR | AUC | 0.96 | 0.96 |
| XGB | Accuracy | 90% | 88% |
| XGB | Sensitivity/TPR/Recall | 86% | 84% |
| XGB | Specificity/TNR | 95% | 93% |
| XGB | Precision/PPV | 95% | 92% |
| XGB | f1-Score | 0.9 | 0.88 |
| XGB | AUC | 0.9 | 0.88 |

Figure 2 presents the Receiver Operating Characteristic (ROC) curves on the test set for the Logistic Regression and XGB models. The Area under the ROC Curve (AUC) had values between 0.88 and 0.96. The Chi-square test was also applied to the data before standardization. The top 1,000 features were selected from each of the Logistic Regression, XGB, and Chi-square algorithms to train the final machine learning model.
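For readers less familiar with the statistics in Table 3, the following sketch shows how accuracy, sensitivity, specificity, precision, and the f1-score are derived from a confusion matrix. The counts are hypothetical, chosen only so that the resulting rates resemble the LR test-set row of Table 3; the paper does not report the underlying confusion matrices.

```python
def classification_metrics(tp, fp, tn, fn):
    """Derive common binary classification metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)          # true positive rate / recall
    specificity = tn / (tn + fp)          # true negative rate
    precision = tp / (tp + fp)            # positive predictive value
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }


# Hypothetical balanced test set: 1,000 target and 1,000 control patients.
metrics = classification_metrics(tp=950, fp=30, tn=970, fn=50)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

With these counts the rounded values are 0.96 accuracy, 0.95 sensitivity, 0.97 specificity, 0.97 precision, and 0.96 f1, which shows that the reported LR test-set figures are mutually consistent for a balanced cohort.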
Most of the top features identified by the selected models were related to medical and surgical procedures as well as diagnosis codes. Patients diagnosed with endometriosis underwent a series of medical and surgical procedures and had various diagnosed symptoms and comorbid conditions. The Chi-square significance test was run at the 95% confidence level to aid in identification of the most significant features.

Table 4 Most significant features from LR, XGB and Chi-Square test

| Feature | Feature Description |
|---|---|
| D_N85_8 | Other specified non inflammatory disorder of uterus |
| D_N94_6 | Dysmenorrhea, unspecified |
| D_N94_9 | Unspecified condition associated with female genital organs and menstrual cycle |
| D_R10_2 | Pelvic and Perineal Pain |
| D_Z01_419 | Encounter for gynecological examination (general) (routine) without abnormal findings |
| P_00840 | Anesthesia Intraperitoneal Lower Abd W/Laps Nos |
| P_00944 | Anesthesia vaginal hysterectomy incl biopsy |
| P_52000 | Cystourethroscopy |
| P_58571 | Laps total hysterect 250 GM/ 250 g w/tube/ovar |
| P_58662 | Laps Fulg/Exc Ovary Viscera/ Peritoneal Surface |
| P_76830 | Us Transvaginal |
| P_J1950 | Injection. Leuprolide acetate (for depot suspens) |
| R_Norethindrone_Acetate | Norethindrone Acetate |
| SPCLT_EM | Emergency Medicine |
| SPCLT_FM | Family medicine |
| SPCLT_HO | Hematology/Oncology |
| SPCLT_OBG | Obstetrics and gynecology |

Section 1 of this work describes endometriosis and its associated signs and symptoms, such as ‘painful periods’, ‘lower abdominal and pelvic pain’, ‘heavy bleeding during periods’, ‘pain during urination and bowel movement’, ‘constipation and diarrhea’, ‘infertility’, ‘painful sexual intercourse’, etc. [16, 17]. The objective of this work is for the models to identify these prominent medical events in patients’ medical history. Hence, it is desirable to validate the model performance by analyzing whether the top features would help predict the onset of endometriosis.
Table 4 presents the top features identified by the machine learning models, which are directly or indirectly associated with endometriosis. Features such as ‘non inflammatory disorder of uterus (D_N85_8)’ and ‘pelvic and perineal pain (D_R10_2)’ are diagnosis codes associated with the risks and symptoms of endometriosis [45]. Procedure codes such as ‘anesthesia of lower abdomen for laparoscopy (P_00840)’ and ‘vaginal hysterectomy including biopsy (P_00944)’ are the top procedures often associated with the diagnosis as well as the treatment of endometriosis [45]. Furthermore, the machine learning models suggest that patients often consult with specialists including ‘emergency medicine (SPCLT_EM)’, ‘family medicine (SPCLT_FM)’, and ‘obstetrics and gynecology (SPCLT_OBG)’ when experiencing related symptoms and gynecological issues. Overall, the machine learning models selected top features closely related to the onset of endometriosis, which implies that tracking these features could allow the condition onset to be diagnosed sooner.

4.2. Feature selection for market definition

The top features from all three algorithms that were specific to the target cohort were identified. These features proved to be important in diagnosing endometriosis and were selected as the patient scoring criteria. The therapeutics and the medical and surgical procedure codes specific to endometriosis treatment, such as Orilissa, Marilissa, and Lupron Depot, were excluded. Around 9.5 million female patients age 18 and above qualified for scoring.

4.3. Propensity model training and validation

Using the top features selected, the Logistic Regression and XGB models were re-trained. As the number of features was reduced, we initially observed a drop in model performance.
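The selection logic described in Sects. 3.4 and 4.2 (take the top-ranked features from each method, keep the unique ones, and drop treatment-specific codes) can be sketched as follows. This is only an illustration: ranking by the absolute value of each score is an assumption, and the score dictionaries, the feature name `R_Orilissa`, and the function names are hypothetical.

```python
def top_k(scores, k):
    """Return the k feature names with the largest absolute score.

    Ties are broken alphabetically so the result is deterministic.
    """
    ranked = sorted(scores.items(), key=lambda kv: (-abs(kv[1]), kv[0]))
    return [name for name, _ in ranked[:k]]


def combine_top_features(rankings, k, excluded=()):
    """Union of the top-k features of several rankings, minus excluded ones.

    The order of first appearance is preserved.
    """
    seen = set(excluded)
    selected = []
    for scores in rankings:
        for name in top_k(scores, k):
            if name not in seen:
                seen.add(name)
                selected.append(name)
    return selected


# Hypothetical scores: LR coefficients, XGB importances, chi-square statistics.
lr_coef = {"D_N94_6": 1.8, "D_R10_2": -1.2, "R_Orilissa": 2.5, "P_76830": 0.4}
xgb_importance = {"P_58662": 0.30, "D_N94_6": 0.25, "P_76830": 0.05}
chi2_stat = {"D_R10_2": 40.0, "SPCLT_OBG": 35.0, "P_76830": 2.0}

final_features = combine_top_features(
    [lr_coef, xgb_importance, chi2_stat],
    k=2,
    excluded={"R_Orilissa"},  # treatment-specific feature removed before scoring
)
print(final_features)
```

Excluding treatment-specific codes matters because a drug prescribed for endometriosis would otherwise act as a proxy for the diagnosis itself rather than as an early signal.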
After several iterations and hyper-parameter tuning, the predictive power of XGB improved significantly compared to the previous iterations; however, we did not see any improvement in the Logistic Regression model results. Interestingly, both models were able to identify additional new features aligned with endometriosis.

Table 5 List of top features identified by re-trained models

| Feature | Feature Description |
|---|---|
| D_D25_0 | Submucous leiomyoma of uterus |
| D_F43_0 | Acute stress reaction |
| D_N83_291 | Other ovarian cyst, right side |
| D_N85_2 | Hypertrophy of uterus |
| D_N92_4 | Excessive bleeding in the premenopausal period |
| D_N94_12 | Deep dyspareunia |
| D_N94_3 | Premenstrual tension syndrome |
| D_N94_5 | Secondary dysmenorrhea |
| D_N97_0 | Female infertility associated with anovulation |
| D_Z79_890 | Hormone replacement therapy |
| D_Z80_41 | Family history of malignant neoplasm of ovary |
| P_58661 | Laparoscopy w/rmvl adnexal structures |
| R_ACETAMINOPHEN | Acetaminophen |
| R_MEGESTROL_ACETATE | Megestrol acetate |
| R_LIDOCAINE_HCL | Lidocaine hcl |

The re-trained machine learning models identified all the top features discussed in Sect. 4.1. In Table 5, we present the additional features recognized by the XGB and Logistic Regression models, which are highly significant in predicting the likelihood of endometriosis. The models suggest that features like ‘submucous leiomyoma of uterus (D_D25_0)’, ‘ovarian cyst (D_N83_291)’, ‘deep dyspareunia (D_N94_12)’, and ‘female infertility associated with anovulation (D_N97_0)’ are important in predicting the likelihood of endometriosis. The models have also flagged the drugs Acetaminophen (R_ACETAMINOPHEN), Megestrol acetate (R_MEGESTROL_ACETATE), and Lidocaine hcl (R_LIDOCAINE_HCL) as strong predictors of endometriosis. Table 6 shows that the XGB model performed better than the Logistic Regression model. Figure 3 shows the Receiver Operating Characteristic (ROC) curves on the test sets for both the re-trained Logistic Regression and XGB models.
The Area under the ROC Curve (AUC) values of the LR and XGB models on the test set were 0.87 and 0.96 respectively. Figure 4 suggests that the XGB model was able to differentiate target from control more accurately than the LR model. Hence, we used the XGB model to score the qualified patients.

Table 6. Classification metrics of LR and XGB model on train and test set

| Algorithm | Statistic | Train Set | Test Set |
|---|---|---|---|
| LR | Accuracy | 87% | 87% |
| LR | Sensitivity/TPR/Recall | 75% | 75% |
| LR | Specificity/TNR | 98% | 98% |
| LR | Precision/PPV | 98% | 98% |
| LR | f1-Score | 0.85 | 0.85 |
| LR | AUC | 0.87 | 0.87 |
| XGB | Accuracy | 96% | 94% |
| XGB | Sensitivity/TPR/Recall | 93% | 90% |
| XGB | Specificity/TNR | 99% | 98% |
| XGB | Precision/PPV | 99% | 97% |
| XGB | f1-Score | 0.96 | 0.93 |
| XGB | AUC | 0.96 | 0.94 |

4.4. Scoring qualified patients

The last step of the model evaluation is to score the qualified patients to assess the model’s ability to predict condition onset. A complete 36-month medical history of the 9.5 million qualified patients was extracted, which included diagnosis codes, medical and surgical procedure codes, medications and treatments prescribed, as well as practitioners’ therapy expertise and Board-Certified Specialty. After data pre-processing, the likelihood of endometriosis was predicted using the trained XGB model. The probability distribution of the 9.5 million scored patients is shown in Fig. 5. We observed that most of the predicted probability values are concentrated towards either 0 or 1. Taking 0.5 as the threshold, the XGB model suggests that around 36% of the scored patients are likely to be diagnosed with endometriosis sometime in the future. If the significant variables can be leveraged in diagnosing the condition onset, practitioners can give these patients special medical care and advice in time, thereby reducing the risks of endometriosis and its related complications.

5. Discussion

Overall, the machine learning models have identified top features that can explain endometriosis onset in advance. As noted in Tables 4 and 5 of Sect. 4 (Results), these features include diagnosis codes, medical and surgical procedure codes, as well as physician specialties that often support patients through their healthcare journey. For the preliminary Logistic Regression, XGB, and Chi-square runs summarized in Table 4, the following top variables were identified as important in predicting the condition onset: 1) diagnosis codes such as ‘non inflammatory disorder of uterus (D_N85_8)’, ‘dysmenorrhea (D_N94_6)’, ‘pelvic and perineal pain (D_R10_2)’, and ‘unspecified condition associated with female genital organs and menstrual cycle (D_N94_9)’, which clearly show an association with the risks and symptoms of endometriosis [45]; 2) medical and surgical procedure codes such as ‘anesthesia of lower abdomen for laparoscopy (P_00840)’, ‘vaginal hysterectomy including biopsy (P_00944)’, ‘cystourethroscopy (P_52000)’, and ‘laparoscopy, surgical with fulguration or excision of lesions of the ovary, peritoneal surface (P_58662)’, which are associated with the diagnosis as well as the treatment of endometriosis [45]. From the patient medical journey and healthcare access side, the machine learning models suggest that patients often consult with specialists, including ‘emergency medicine (SPCLT_EM)’, ‘family medicine (SPCLT_FM)’, and ‘obstetrics and gynecology (SPCLT_OBG)’, when experiencing endometriosis related symptoms and gynecological issues. Patients with a history of endometriosis or with untreated endometriosis are at a higher risk of developing either ovarian cancer or ‘endometriosis associated adenocarcinoma’, which can also serve as an indicator of potential occurrence of the condition [52, 53, 54]. The machine learning models selected ‘hematology/oncology (SPCLT_HO)’ as one of the top healthcare provider specialties. This finding suggests that if a patient has any of the signs and symptoms noted above, a consultation with an oncologist is recommended [55, 56].
Overall, the machine learning models selected top features directly related to the onset of endometriosis, which implies that tracking these features could allow the condition onset to be diagnosed sooner.

As noted in Table 5 above, the Logistic Regression and XGB models identified additional features that are important in predicting the likelihood of endometriosis. The models suggest that features like ‘submucous leiomyoma of uterus (D_D25_0)’, ‘ovarian cyst (D_N83_291)’, ‘hypertrophy of uterus (D_N85_2)’, ‘excessive bleeding in the premenopausal period (D_N92_4)’, ‘deep dyspareunia (D_N94_5)’, ‘female infertility associated with anovulation (D_N97_0)’, ‘premenstrual tension syndrome (D_N94_3)’, ‘hormone replacement therapy (D_Z79_890)’, and ‘family history of malignant neoplasm of ovary’ are highly significant in predicting the likelihood of endometriosis. Several articles also support the models’ findings that fibroids, ovarian cysts, infertility, menstrual period complications, family history of neoplasm of ovary, hormone therapy, etc. have a strong association with endometriosis [48]. Recent clinical research also supports that women of reproductive age with ‘chronic stress’ are at a higher risk of developing endometriosis [47].

The machine learning models also identified the drugs Acetaminophen (R_ACETAMINOPHEN), Megestrol acetate (R_MEGESTROL_ACETATE), and Lidocaine HCl (R_LIDOCAINE_HCL) as strong predictors of endometriosis; these drugs are often prescribed, respectively, as an analgesic, for birth control and the treatment of endometrial cancer, and to numb the skin or muscles. Furthermore, features such as ‘submucous leiomyoma of uterus (D_D25_0)’ and ‘hypertrophy of uterus (D_N85_2)’ are significant predictors of the disease onset [49, 50]; however, more clinical research is needed to support this statement, as these conditions have symptoms similar to those of endometriosis, but patients with them are less likely to develop endometriosis [51].
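The classification metrics reported in Table 6 can be cross-checked against one another. The sketch below recomputes the F1-score from the reported precision and recall, and recomputes accuracy as the mean of sensitivity and specificity. The second formula holds only when the target and control groups are the same size; that is an assumption here, motivated by the equal-sized matched control cohort, and may hold only approximately within each train/test split. The inputs are the rounded percentages from Table 6, so agreement is expected only to within rounding.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def balanced_accuracy(sensitivity: float, specificity: float) -> float:
    """Equals plain accuracy when both classes are the same size (assumed here)."""
    return (sensitivity + specificity) / 2


# Rounded values copied from Table 6: (sensitivity, specificity, precision).
TABLE_6 = {
    ("LR", "train"): (0.75, 0.98, 0.98),
    ("LR", "test"): (0.75, 0.98, 0.98),
    ("XGB", "train"): (0.93, 0.99, 0.99),
    ("XGB", "test"): (0.90, 0.98, 0.97),
}

derived = {}
for (model, split), (sens, spec, prec) in TABLE_6.items():
    f1 = f1_score(prec, sens)
    acc = balanced_accuracy(sens, spec)
    derived[(model, split)] = (f1, acc)
    print(f"{model:>3} {split:<5} F1 = {f1:.3f}  accuracy (1:1 cohorts) = {acc:.3f}")
```

With these inputs, the derived F1-scores round to the reported 0.85, 0.96, and 0.93, and the derived accuracies are within rounding of the reported 87%, 96%, and 94%, so the table is internally consistent under the equal-cohort assumption.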
Overall, the top data elements represent the key features that should be considered when diagnosing endometriosis in adult women in order to decrease the time to diagnosis. As noted in Section 4.4 of the article, when these variables are used in the diagnostic process, we can predict the condition onset with high accuracy and differentiate accurately between patients with and without the disease.

6. Conclusions

In this article, we validated the crucial role of AI and ML in disease diagnosis, prediction, and forecasting. We analyzed the medical history of patients with endometriosis using machine learning algorithms and re-trained the XGB model on selected important features; the re-trained model was then applied to predict the likelihood of endometriosis occurrence in the adult female population. Early prediction of the disease can offer an opportunity for patients to receive needed medical treatment earlier in the patient journey. Creating a typing tool that can be integrated into Electronic Health Records (EHR) systems and easily accessed by healthcare providers could further improve diagnostic activities and inform diagnostic processes, resulting in timely and precise diagnosis and ultimately improving patient care and quality of life. In our future work, we plan to explore advanced deep learning algorithms to further enhance model performance and increase the accuracy of the machine learning models in predicting the likelihood of the disease onset.

Abbreviations

Artificial Intelligence (AI)
Machine Learning (ML)
Logistic Regression (LR)
eXtreme Gradient Boosting (XGB)
Principal Component Analysis (PCA)
Factor Analysis (FA)
Singular Value Decomposition (SVD)
Receiver Operating Characteristic Curve (ROC)
Area under the ROC Curve (AUC)
Electronic Health Records (EHR)

Declarations

Ethics approval and consent to participate: Symphony Health, PRA Health Sciences Privacy Risk Review Group reviewed the article.
The de-identified dataset is used for analytics only. No direct patient identifiers are noted. The analysis presents only a negligible risk of re-identification of an individual, which is consistent with HIPAA Privacy Rules. No additional administrative permissions or ethics approvals were required to access and use the medical claims data described in this study.

Consent for publication: Not applicable.

Availability of data and materials: The data that support the findings of this study are available from Symphony Health, PRA Health Sciences, but restrictions apply to the availability of these data, which were used under license for the current study, and so are not publicly available.

Competing Interest: The authors declare that they have no competing interests.

Funding: Authors work for Symphony Health, PRA Health Sciences. The data used in the article is the property of Symphony Health, PRA Health Sciences. Authors used the data for publication of this article.

Authors' contributions: All authors have read and approved the manuscript.
EJK: Research Principal, Data Scientist, corresponding author, responsible for the overall study design, analytics plan, and documentation
AP: Project Manager, responsible for day-to-day activities of the research
TY: Lead Data Scientist, leading and conducting the analytics for the research project
RK, M, VG, S, and MH: Data Scientists on the research project, performing the analysis

Acknowledgements: Authors would like to recognize Heather Valera and Koichi Iwata for their review of document drafts, and their valuable feedback in improving the article content.

References

Doupe P, Faghmous J, Basu S. Machine Learning for Health Services Researchers. Value Health. 22(7): 808-815, 2019.
William H. Crown, PhD. Potential application of machine learning in health outcomes research and some statistical cautions. International Society for Pharmacoeconomics and Outcomes Research (ISPOR), 2015. DOI: https://doi.org/10.1016/j.jval.2014.12.005
Marzyeh Ghassemi, Tristan Naumann, Peter Schulam, Andrew L. Beam, Irene Y. Chen, Rajesh Ranganath. A review of challenges and opportunities in machine learning for health. arXiv, 2019, v4. https://arxiv.org/abs/1806.00388
Varun H Buch, Irfan Ahmed, Mahiben Maruthappu. Artificial intelligence in medicine: current trends and future possibilities. British Journal of General Practice 2018; 68 (668): 143-144. DOI: https://doi.org/10.3399/bjgp18X695213
Alvin Rajkomar, Sneha Lingam, Andrew G. Taylor, Michael Blum, John Mongan. High-throughput classification of radiographs using deep convolutional neural networks. Journal of Digital Imaging 30, 95–101 (2016). DOI: https://doi.org/10.1007/s10278-016-9914-9
Min Chen, Yixue Hao, Kai Hwang, Lu Wang, Lin Wang. Disease prediction by machine learning over big data from healthcare communities. IEEE, 2169-3536 (2017). DOI: https://doi.org/10.1109/ACCESS.2017.2694446
Adriana Gabriela Alexandru, Irina-Miruna Radu, Madalina-Lavinia Bizon. Big data in healthcare - opportunities and challenges. Informatica Economică vol. 22, no. 2/2018. DOI: https://doi.org/10.12948/issn14531305/22.2.2018.05
Iroju Olaronke, Ojerinde Oluwaseun. Big data in healthcare: Prospects, challenges and resolutions. IEEE, 16602629, 2016. DOI: https://doi.org/10.1109/FTC.2016.7821747
Getting the Most Out of Longitudinal Patient Data. Anonymous patient-level data (APLD). [Online] https://www.rxdatascience.com/blog/getting-most-out-of-longitudinal-patient-data
Integrated Dataverse (IDV®). [Online] https://symphonyhealth.prahs.com/what-we-do/view-health-data
Jerome H. Friedman. Greedy function approximation: A gradient boosting machine. The Annals of Statistics, Volume 29 (2001), 1189-1232. DOI: https://doi.org/10.1214/aos/1013203451
Extreme Gradient Boosting. [Online] https://xgboost.readthedocs.io/en/latest/tutorials/model.html, https://info.cambridgespark.com/latest/getting-started-with-xgboost
S. Cramer. The origins of logistic regression. Tinbergen Institute discussion paper, TI 2002-119/4.
Logistic Regression. [Online] https://en.wikipedia.org/wiki/Logistic_regression
Endometriosis signs and symptoms. [Online] https://www.hopkinsmedicine.org/health/conditions-and-diseases/endometriosis
Endometriosis signs and symptoms. [Online] https://www.health.qld.gov.au/news-events/news/signs-symptoms-endometriosis
M Sanni Ali, Daniel Prieto-Alhambra, Luciane Cruz Lopes, Dandara Ramos, Nivea Bispo, Maria Y. Ichihara, Julia M. Pescarini, Elizabeth Williamson, Rosemeire L. Fiaccone, Mauricio L. Barreto, and Liam Smeeth. Propensity score methods in health technology assessment: principles, extended applications, and recent advances. Front Pharmacol 10: 973 (2019). DOI: https://dx.doi.org/10.3389/fphar.2019.00973
Rosenbaum P. R., Rubin D. B. The central role of the propensity score in observational studies for causal effects. Biometrika 70, 41–55 (1983). DOI: https://doi.org/10.1093/biomet/70.1.41
Rosenbaum P. R., Rubin D. B. Constructing a control group using multivariate matched sampling methods that incorporate the propensity score. The American Statistician 39:1, 33–38 (1985). DOI: https://doi.org/10.1080/00031305.1985.10479383
Yun Xu, Royston Goodacre. On splitting training and validation set: a comparative study of cross-validation, bootstrap and systematic sampling for estimating the generalization performance of supervised learning. Journal of Analysis and Testing 2(3) (2017). DOI: https://doi.org/10.1007/s41664-018-0068-2
Rachel Lea Ballantyne Draelos. Best Use of Train/Val/Test Splits, with Tips for Medical Data. Glass Box Machine Learning and Medicine. [Online] https://glassboxmedicine.com/2019/09/15/best-use-of-train-val-test-splits-with-tips-for-medical-data/
Kevin Dobbin, Richard Simon. Optimally splitting cases for training and testing high dimensional classifiers. BMC Medical Genomics, 4:31 (2011). DOI: https://doi.org/10.1186/1755-8794-4-31
Andrius Vabalas, Emma Gowen, Ellen Poliakoff, Alexander J Casson. Machine learning algorithm validation with a limited sample size. Plos One (2019). DOI: https://doi.org/10.1371/journal.pone.0224365
T. Hastie, R. Tibshirani, and J. Friedman. “Overview of supervised learning,” The Elements of Statistical Learning. Springer, 2009, pp. 9–39.
Alpaydın, E. (2014). Introduction to machine learning. Cambridge, MA: MIT Press.
Kotsiantis, S. B. (2007). Supervised machine learning: A review of classification techniques. Informatica, 31, 249–268.
T. Hastie, R. Tibshirani, and J. Friedman. “Unsupervised learning,” The Elements of Statistical Learning. Springer, 2009, pp. 485–585.
Agnieszka Wosiak, Agata Zamecznik, Katarzyna Niewiadomska-Jarosik. Supervised and unsupervised machine learning for improved identification of intrauterine growth restriction types. Federated Conference on Computer Science and Information Systems (FedCSIS). IEEE (2016).
Hinton, Geoffrey; Sejnowski, Terrence. Unsupervised Learning: Foundations of Neural Computation. MIT Press (1999). ISBN 978-0262581684.
Mohamed Alloghani, Dhiya Al-Jumeily, Jamila Mustafina, Ahmed J. Aljaaf, Abir Hussain. A systematic review on supervised and unsupervised machine learning algorithms for data science. Supervised and Unsupervised Learning for Data Science (pp. 3-21). DOI: https://doi.org/10.1007/978-3-030-22475-2_1
Osvaldo Simeone. A very brief introduction to machine learning with applications to communication systems. arXiv preprint arXiv:1808.02342v4 (2018).
Hosmer, David W.; Lemeshow, Stanley (2013). Applied Logistic Regression. New York: Wiley. ISBN 978-0-470-58247-3.
Alan Agresti (2012). Categorical Data Analysis. Hoboken: John Wiley and Sons. ISBN 978-0-470-46363-5.
Chen, Tianqi; Guestrin, Carlos. "XGBoost: A Scalable Tree Boosting System". In Krishnapuram, Balaji; Shah, Mohak; Smola, Alexander J.; Aggarwal, Charu C.; Shen, Dou; Rastogi, Rajeev (eds.). Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, 2016. pp. 785–794. arXiv:1603.02754. DOI: https://doi.org/10.1145/2939672.2939785
Hastie, T., Tibshirani, R., Friedman, J. H. "10. Boosting and Additive Trees". The Elements of Statistical Learning (2nd ed.). New York: Springer. pp. 337–384 (2009).
Cochran, William G. (1952). The Chi-square test of goodness of fit. The Annals of Mathematical Statistics. 23 (3): 315–345. DOI: https://doi.org/10.1214/aoms/1177729380
On the interpretation of χ2 from contingency tables, and the calculation of p. Journal of the Royal Statistical Society. Vol. 85, No. 1 (1922), pp. 87-94. DOI: https://doi.org/10.2307/2340521
Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze. Introduction to Information Retrieval. Feature selection, Chi-Square feature selection. Cambridge University Press, 2008.
Chi-Square feature selection. “Scikit-learn” python library. [Online] https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.chi2.html
Marketing, patient data, and privacy concerns. https://www.reutersevents.com/pharma/commercial/marketing-patient-data-and-privacy-concerns
Data insights. https://prahs.com/healthcare-intelligence/data-insights
Symphony Health Solutions. https://symphonyhealth.prahs.com/
Symphony Health Solutions, What we do. https://symphonyhealth.prahs.com/what-we-do
OBG Manag. Endometriosis and infertility: Expert answers to 6 questions to help pinpoint the best route to pregnancy. Mdedge ObGyn 27(6):30-35 (2015). https://www.mdedge.com/obgyn/article/99912/surgery/endometriosis-and-infertility-expert-answers-6-questions-help-pinpoint/
Jon K. Hathaway, MD, PhD, FACS. Decoding Coding. What is the Best Way to Code for Endometriosis? NewsScope, volume 33, issue 2 (2019). https://newsscope.aagl.org/volume-33-issue-2/decoding-coding-what-is-the-best-way-to-code-for-endometriosis/
Fernando M. Reis, Larissa M. Coutinho, Silvia Vannuccini, Stefano Luisi & Felice Petraglia. Is Stress a Cause or a Consequence of Endometriosis? Reproductive Sciences volume 27, pages 39–45 (2020). DOI: https://doi.org/10.1007/s43032-019-00053-0
Endometriosis – Risks, Signs, Symptoms, Diagnosis and Treatment. https://www.mayoclinic.org/diseases-conditions/endometriosis/symptoms-causes/syc-20354656 https://www.webmd.com/women/endometriosis/endometriosis-causes-symptoms-treatment
Bo Liang, Yang-Gui Xie, Xiao Ping Xu, and Chun-Hong Hu. Diagnosis and treatment of submucous myoma of the uterus with interventional ultrasound. NCBI, PMC Oncol Lett (2018). DOI: https://doi.org/10.3892/ol.2018.8122
Endometriosis vs. Adenomyosis: Similarities and Differences. https://www.healthline.com/health/womens-health/adenomyosis-vs-endometriosis
Endometrial Hyperplasia. https://my.clevelandclinic.org/health/diseases/16569-atypical-endometrial-hyperplasia
Marina Kvaskoff, Andrew W Horne, Stacey A Missmer. Informing women with endometriosis about ovarian cancer risk. The Lancet Journal, volume 390, issue 10111, P2433-2434 (2017). DOI: https://doi.org/10.1016/S0140-6736(17)33049-0
Aline Veras Morais Brilhante, Kathiane Lustosa Augusto, Manuela Cavalcante Portela, Luiz Carlos Gabriele Sucupira, Luiz Adriano Freitas Oliveira, Ana Juariana Magalhães Veríssimo Pouchaim, Lívia Rocha Mesquita Nóbrega, Thaís Fontes de Magalhães, and Leonardo Robson Pinheiro Sobreira. Endometriosis and Ovarian Cancer: an Integrative Review (Endometriosis and Ovarian Cancer). Asian Pac J Cancer Prev. (2017) 18(1): 11–16. DOI: https://doi.org/10.22034/APJCP.2017.18.1.11
John P. Cunha, DO, FACOEP. What Will Happen if Endometriosis Is not Treated? emedicinehealth (2019). [Online] https://www.emedicinehealth.com/ask_what_will_happen_if_endometriosis_not_treated/article_em.htm#doctor%E2%80%99s_response
A. Michael Coppa, MD. What Happens if Endometriosis is Left Untreated? https://www.drcoppaobgyn.com/blog/what-happens-if-endometriosis-is-left-untreated
Endometriosis and ovarian cancer risk. [Online] https://ovarian.org.uk/news-and-blog/blog/endometriosis-and-ovarian-cancer-risk/

Status: Under Review
Version 1 posted
Review #2 received at journal: 12 Feb, 2021
Editorial decision: Minor revision, 12 Feb, 2021
Reviewer #2 agreed at journal: 20 Jan, 2021
Review #1 received at journal: 04 Jan, 2021
Editor assigned by journal: 24 Dec, 2020
Reviewers invited by journal: 24 Dec, 2020
Reviewer #1 agreed at journal: 24 Dec, 2020
Submission checks completed at journal: 24 Dec, 2020
Editor invited by journal: 16 Dec, 2020
Also discoverable on Platform About Our Team In Review Editorial Policies Advisory Board Help Center Resources Author Services Accessibility API Access RSS feed Manage Cookie Preferences © Research Square 2026 | ISSN 2693-5015 (online) Privacy Policy Terms of Service Do Not Sell My Personal Information {"props":{"pageProps":{"initialData":{"identity":"rs-135736","acceptedTermsAndConditions":true,"allowDirectSubmit":false,"archivedVersions":[],"articleType":"Research article","associatedPublications":[],"authors":[{"id":7081320,"identity":"6f110fe2-a17f-4d43-8659-376b76f4336b","order_by":0,"name":"Ewa J Kleczyk","email":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAZAAAAAyAQMAAABI0h/eAAAABlBMVEX///8AAABVwtN+AAAACXBIWXMAAA7EAAAOxAGVKw4bAAABNklEQVRIie3RwUrDMBjA8YZId0mZx4wO+wpfERRRtlfpKGSXMioDmShYEeql3uNbzMvOHYP2MnYe5LB68eShvTkQNS0yhW5jR8H+IcklP5IQRamq+qMhz80XFSX5AnJgOZoK3UagIBh+E0J2ICrdidSpnSAOZ4bBGbtc+qPece0h0t1Bi7R1D6WZUyINzgANgZm3nEVC80X/JJgxnU9tQpohbjyOSgSEBSiBiYVp1xfIF53h3DnSNR/Li1kq1sqkLbqpJJ+WKsn5ckU+bjYS0J38YqFFKIsUbUW8yUZCF6/umINtcvJi62Qm+jCdHp6SKCZk3rlb95Z6ED89B4OWYdwzM3u7ED2IA1OQ66uDGrfHaVYmeWEx71vyh1RFzsre948gb+3+n/Ny+V4QnG7fWlVVVfXP+gJDE2saJqHNRwAAAABJRU5ErkJggg==","orcid":"https://orcid.org/0000-0001-8902-8349","institution":"Symphony Health","correspondingAuthor":true,"prefix":"","firstName":"Ewa","middleName":"J","lastName":"Kleczyk","suffix":""},{"id":7081321,"identity":"7dee9e05-30a0-4b25-b3e8-4ef1aa093360","order_by":1,"name":"Aparna Peri","email":"","orcid":"","institution":"Symphony Health","correspondingAuthor":false,"prefix":"","firstName":"Aparna","middleName":"","lastName":"Peri","suffix":""},{"id":7081322,"identity":"289d8ba4-caad-4168-9b5e-06cc25d0bb06","order_by":2,"name":"Tarachand Yadav","email":"","orcid":"","institution":"Symphony 
Health","correspondingAuthor":false,"prefix":"","firstName":"Tarachand","middleName":"","lastName":"Yadav","suffix":""},{"id":7081323,"identity":"88281527-4e93-4812-93cb-f2aa4b4a0dab","order_by":3,"name":"Ramachandra Komera","email":"","orcid":"","institution":"Symphony Health Solutions: Symphony Health","correspondingAuthor":false,"prefix":"","firstName":"Ramachandra","middleName":"","lastName":"Komera","suffix":""},{"id":7081324,"identity":"1321aab7-a24c-4e1e-9b84-bc46a74c0b79","order_by":4,"name":"Maruthi Peri","email":"","orcid":"","institution":"Symphony Health Solutions: Symphony Health","correspondingAuthor":false,"prefix":"","firstName":"Maruthi","middleName":"","lastName":"Peri","suffix":""},{"id":7081325,"identity":"e7600c7a-5092-4cfb-b48f-73e7bb6a330d","order_by":5,"name":"Vara Guduru","email":"","orcid":"","institution":"Symphony Health Solutions: Symphony Health","correspondingAuthor":false,"prefix":"","firstName":"Vara","middleName":"","lastName":"Guduru","suffix":""},{"id":7081326,"identity":"26e2a5f2-3189-41da-a552-d666010ce26b","order_by":6,"name":"Stalin Amirtharaj","email":"","orcid":"","institution":"Symphony Health Solutions: Symphony Health","correspondingAuthor":false,"prefix":"","firstName":"Stalin","middleName":"","lastName":"Amirtharaj","suffix":""},{"id":7081327,"identity":"d16206cf-aa50-4e97-847f-15652b0d1e59","order_by":7,"name":"Ming Huang","email":"","orcid":"","institution":"Symphony Health Solutions: Symphony Health","correspondingAuthor":false,"prefix":"","firstName":"Ming","middleName":"","lastName":"Huang","suffix":""}],"badges":[],"createdAt":"2020-12-24 18:14:18","currentVersionCode":1,"declarations":"","doi":"10.21203/rs.3.rs-135736/v1","doiUrl":"https://doi.org/10.21203/rs.3.rs-135736/v1","draftVersion":[],"editorialEvents":[],"editorialNote":"","failedWorkflow":false,"files":[{"id":4631919,"identity":"8054b58f-e415-4348-8185-9613a21e86b2","added_by":"auto","created_at":"2020-12-31 
15:58:14","extension":"png","order_by":1,"title":"Figure 1","display":"","copyAsset":false,"role":"figure","size":232361,"visible":true,"origin":"","legend":"Healthcare Claims Patient Level Database Summary","description":"","filename":"Fig1.png","url":"https://assets-eu.researchsquare.com/files/rs-135736/v1/f5cf5ab0cad2c66ee5e54336.png"},{"id":4631355,"identity":"4279b843-c972-43c1-8b87-2b30b5574733","added_by":"auto","created_at":"2020-12-31 15:52:14","extension":"png","order_by":2,"title":"Figure 2","display":"","copyAsset":false,"role":"figure","size":103518,"visible":true,"origin":"","legend":"presents the Receiver Operating Characteristic (ROC) curves on the test set for Logistic Regression and XGB models. The Area under the ROC Curve (AUC) had values between 0.88 -0.96. Chi-square test was also applied on data before standardization. ","description":"","filename":"Fig2.png","url":"https://assets-eu.researchsquare.com/files/rs-135736/v1/6a24f50e93cf05b23f72d458.png"},{"id":4631677,"identity":"34b1a98b-04b4-4347-ae2f-63b6ba774fc6","added_by":"auto","created_at":"2020-12-31 15:55:14","extension":"png","order_by":3,"title":"Figure 3","display":"","copyAsset":false,"role":"figure","size":131910,"visible":true,"origin":"","legend":"ROC curves of LR and XG models on test set","description":"","filename":"Fig3.png","url":"https://assets-eu.researchsquare.com/files/rs-135736/v1/c65ff2b03339f233616ef4fa.png"},{"id":4631920,"identity":"0a0f1453-8f79-4fd8-838b-6a6956a560f9","added_by":"auto","created_at":"2020-12-31 15:58:14","extension":"png","order_by":4,"title":"Figure 4","display":"","copyAsset":false,"role":"figure","size":72986,"visible":true,"origin":"","legend":"Distribution of probability on test data set for both LR and XGB model. 
Figure on right hand is of XGB and most of values are grouped at extreme values.","description":"","filename":"Fig4.png","url":"https://assets-eu.researchsquare.com/files/rs-135736/v1/634d358bf5383c311b0753e2.png"},{"id":4631678,"identity":"e90e2809-2ecb-4a84-a9c3-f6844307f66b","added_by":"auto","created_at":"2020-12-31 15:55:14","extension":"png","order_by":5,"title":"Figure 5","display":"","copyAsset":false,"role":"figure","size":97299,"visible":true,"origin":"","legend":"Distribution of patients by predicted probability score","description":"","filename":"Fig5.png","url":"https://assets-eu.researchsquare.com/files/rs-135736/v1/5e570ddc23d534650f2b481c.png"},{"id":13642577,"identity":"a3f43072-3634-4961-9211-227c734e9d73","added_by":"auto","created_at":"2021-09-17 09:07:47","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":971095,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-135736/v1/bcabe9f3-883d-48e8-963f-caba4b8cfdba.pdf"}],"financialInterests":"","formattedTitle":"Predicting Endometriosis Onset Using Machine Learning Algorithms","fulltext":[{"header":"1. Background","content":" \u003cp\u003eRecent advancement in Artificial Intelligence (AI) and Machine Learning (ML) has provided the opportunity for AI and ML application in the healthcare area, while also slowly improving on the performance benchmark set by the classical statistical techniques [\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e]. In recent years, healthcare service providers have also shown interest towards data science and machine learning in disease diagnosing. 
Disease prediction using data mining and machine learning algorithms with patient medical history such as diagnosis of disease, medical and surgical procedures, therapeutics, and treatments, etc., has been slowly introduced to aid decision making processes [\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e, \u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e, \u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e]. Many statistical and machine learning techniques have been applied to either pathological or clinical data to study the disease in detail and also predict its likelihood of occurrence. Deep learning algorithms such as Convolutional Neural Network (CNN) have been found to predict disease onset and progression with a greater precision compared to analyzing just medical image data [\u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eSince healthcare is one of the leading industries with a large amount of structured and unstructured data, it is imperative to use the known advanced techniques to extract the hidden data patterns. Machine Learning algorithms with the help of big data technology has made it easier to mine the vast amount of unstructured data and aided in making important decisions related to patients\u0026rsquo; health [\u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e6\u003c/span\u003e]. Due to its high precision and robustness in comparison to conventional statistical methods, most medical scientists have been attracted towards these models to understand the key drivers of disease onset and progression prediction. 
Artificial Intelligence, Machine Learning, and big data have been playing a pivotal role in improving healthcare infrastructure, patient care, as well as disease diagnosing, prediction and forecasting, drug discovery, etc., and thereby, reducing medical costs, shortening the time to diagnoses and treatment, as well as enhancing patients\u0026rsquo; quality of life and access to healthcare [\u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eWith this motivation in mind, we selected endometriosis as the condition to study in this article. Endometriosis is one of the most common disorders seen in women of a menstruating age in which tissues like the endometrium lining grow on the outer part of the uterus and other organs of the pelvic region. The signs and symptoms vary from patient to patient with some patients having mild symptoms, while others display a moderate to severe level of condition occurrence. The most common symptoms of endometriosis are pelvic pain, dysmenorrhea, and infertility. There is no guaranteed treatment for endometriosis at this time; however, with an early diagnosis and available medical and surgical options, healthcare providers can reduce the risks of potential complications and improve the quality of life for their patients. If we can identify or predict the probability of endometriosis onset by analyzing the medical history of diagnosed patients, the results might help benefit both the healthcare providers\u0026rsquo; diagnosis process and patients\u0026rsquo; well-being and quality of life. 
In this study, the Logistic Regression (LR) and eXtreme Gradient Boosting (XGB) algorithms were used to predict endometriosis occurrence when leveraging medical history of the diagnosed patients.\u003c/p\u003e \u003cp\u003eThe remainder of the article is organized as follows: in Sect.\u0026nbsp;\u003cspan refid=\"Sec2\" class=\"InternalRef\"\u003e2\u003c/span\u003e, we briefly review the project objective; in Sect.\u0026nbsp;\u003cspan refid=\"Sec3\" class=\"InternalRef\"\u003e3\u003c/span\u003e, we describe different methods used in data preparation, feature engineering, feature selection and model training and validation; in Sect.\u0026nbsp;\u003cspan refid=\"Sec8\" class=\"InternalRef\"\u003e4\u003c/span\u003e, we present the model outputs and results; and in Sect.\u0026nbsp;\u003cspan refid=\"Sec13\" class=\"InternalRef\"\u003e5\u003c/span\u003e, we conclude the study with a summary of our findings.\u003c/p\u003e "},{"header":"2. Objectives","content":"\u003cp\u003eThe following objectives will be addressed in this article:\u003c/p\u003e\n\u003col style=\"list-style-type: lower-roman;\"\u003e\n\u003cli\u003e\n\u003cp\u003eTrain machine learning algorithms to predict the likelihood of endometriosis.\u003c/p\u003e\n\u003c/li\u003e\n\u003cli\u003e\n\u003cp\u003eIdentify the most significant medical events in the patient journey that lead to the diagnosis of endometriosis.\u003c/p\u003e\n\u003c/li\u003e\n\u003cli\u003e\n\u003cp\u003eScore entire database using the best performing trained models.\u003c/p\u003e\n\u003c/li\u003e\n\u003cli\u003e\n\u003cp\u003eProfile patients using the predicted scores.\u003c/p\u003e\n\u003c/li\u003e\n\u003c/ol\u003e"},{"header":"3. Methods Overview","content":" \u003cp\u003eThe data source for this project is the healthcare claims patient level database with the study time period from January 31, 2019 to December 31, 2019. Patient cohorts: study target and control were established using endometriosis ICD 10 diagnosis codes. 
As endometriosis is a female only condition, female patients 18 and older were part of the study target cohort. A control cohort is often used to create a patient sample to compare with the study target cohort and is selected using cohort matching algorithms. 36 months of patient medical history prior to the first disease event in 2019 were extracted for both the study target and control cohorts. The healthcare claims patient level data includes diagnosis codes, medical and surgical codes, therapeutics and treatments prescribed at the transactional level.\u003c/p\u003e \u003cp\u003eA number of analytical methods was leveraged for the analysis from the rules-based patient qualification criteria to Machine Learning algorithms to derive probability of endometriosis onset. The following sub-sections of the article present a detailed explanation for each of the selected methods. The healthcare claims patient level dataset considered in the analysis is specific to the US healthcare market.\u003c/p\u003e \u003cdiv id=\"Sec4\" class=\"Section2\"\u003e \u003ch2\u003e3.1. Healthcare claims patient level database\u003c/h2\u003e \u003cp\u003eThe healthcare claims patient level database is an anonymous longitudinal patient data set that can be used by organizations that are directly or indirectly associated to healthcare [\u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e, \u003cspan citationid=\"CR41\" class=\"CitationRef\"\u003e41\u003c/span\u003e]. 
There has been an increasing interest in patient-level data, as researchers, healthcare providers, and pharmaceutical companies are realizing the potential of creating better comparisons of effective treatment outcomes by analyzing longitudinal data that represent individual patient-based experiences and interactions with the US healthcare system [\u003cspan citationid=\"CR42\" class=\"CitationRef\"\u003e42\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eThe healthcare claims patient level database leveraged for this study consists of medical, hospital, and prescriptions claims across all payment types [\u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e10\u003c/span\u003e, \u003cspan citationid=\"CR44\" class=\"CitationRef\"\u003e44\u003c/span\u003e]. The database covers more than 317\u0026nbsp;million patients in the US, spans over more than 17\u0026nbsp;years of medical health history, and includes more than 1.9\u0026nbsp;million healthcare providers [\u003cspan citationid=\"CR43\" class=\"CitationRef\"\u003e43\u003c/span\u003e]. Figure\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e presents the summary of information in the database.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec5\" class=\"Section2\"\u003e \u003ch2\u003e3.2. Cohort selection\u003c/h2\u003e \u003cp\u003eFor this study, we identified 314,101 confirmed endometriosis patients in 2019 in the healthcare claims patient level database, using predefined ICD 10 diagnosis codes (Table\u0026nbsp;\u003cspan refid=\"Tab1\" class=\"InternalRef\"\u003e1\u003c/span\u003e). Female patients age 18 and above were selected to the study target cohort. 
For the control cohort, a random sample of 3\u0026nbsp;million female patients with the same age criterion was extracted from the database.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab1\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 1\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eICD 10 diagnosis codes of endometriosis\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"2\"\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eDiagnosis Codes\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eDiagnosis Long Description\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eN80.0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eEndometriosis of uterus\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eN80.1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eEndometriosis of ovary\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eN80.2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eEndometriosis of fallopian tube\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eN80.3\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eEndometriosis of pelvic peritoneum\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eN80.4\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eEndometriosis of 
rectovaginal septum and vagina\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eN80.5\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eEndometriosis of intestine\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eN80.6\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eEndometriosis in cutaneous scar\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eN80.8\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eOther endometriosis\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eN80.9\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eEndometriosis, unspecified\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003eTo select a control cohort equal in size to the study target cohort from the 3\u0026nbsp;million sampled patients, a technique known as \u0026lsquo;propensity score matching\u0026rsquo; was used [\u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e]. Propensity score matching [\u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e19\u003c/span\u003e] is a statistical technique that selects control patients whose observed characteristics, or covariates, resemble those of the study target cohort. The covariates considered for selection were patient age and medical history [\u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e20\u003c/span\u003e]. 
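\u003c/p\u003e \u003cp\u003eThe matching step described above can be illustrated with a minimal sketch. It assumes a logistic regression propensity model and greedy 1:1 nearest-neighbor matching without replacement; the text does not specify the exact matching procedure, and the function name, the toy covariates (age and a prior-visit count), and the sample sizes are hypothetical.\u003c/p\u003e

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def propensity_match(X_target, X_pool):
    """Return one distinct pool index per target row (greedy 1:1 matching).

    Illustrative assumption: the propensity score is the probability of
    belonging to the target cohort, estimated by logistic regression.
    """
    X = np.vstack([X_target, X_pool])
    y = np.r_[np.ones(len(X_target)), np.zeros(len(X_pool))]
    model = LogisticRegression(max_iter=1000).fit(X, y)

    ps_target = model.predict_proba(X_target)[:, 1]
    ps_pool = model.predict_proba(X_pool)[:, 1]

    available = np.ones(len(X_pool), dtype=bool)
    matched = []
    for score in ps_target:
        # Distance in propensity score to every control still available.
        dist = np.abs(ps_pool - score)
        dist[~available] = np.inf
        j = int(np.argmin(dist))
        available[j] = False  # match without replacement
        matched.append(j)
    return np.array(matched)


# Toy data: columns are (age, number of prior visits); values are made up.
rng = np.random.default_rng(0)
X_target = rng.normal(loc=[40, 3], scale=[8, 1], size=(50, 2))
X_pool = rng.normal(loc=[45, 2], scale=[12, 1], size=(500, 2))

matched_idx = propensity_match(X_target, X_pool)
X_control = X_pool[matched_idx]  # control cohort, same size as the target
```

\u003cp\u003eGreedy matching is quadratic in cohort size, so at the scale reported here (hundreds of thousands of target patients against millions of candidates) an indexed nearest-neighbor search or matching within strata would likely be needed.\u003c/p\u003e \u003cp\u003e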
Table\u0026nbsp;\u003cspan refid=\"Tab2\" class=\"InternalRef\"\u003e2\u003c/span\u003e presents the distribution comparison between the study target and control cohorts by age and Census geographies. The patient age variable was created via grouping age ranges and US states were grouped into regions.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab2\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 2\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eComparison between target and control cohort by age and region respectively\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"7\"\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eAge Group\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eTarget\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eControl\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e\u0026nbsp;\u003c/th\u003e \u003cth align=\"left\" colname=\"c5\"\u003e \u003cp\u003eRegion\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c6\"\u003e \u003cp\u003eTarget\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c7\"\u003e \u003cp\u003eControl\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e18\u0026ndash;24\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e6.45%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e6.55%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003eSouth\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" 
colname=\"c6\"\u003e \u003cp\u003e39.90%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e39.90%\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e25\u0026ndash;34\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e25.01%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e25.24%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003eMidwest\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e22.78%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e22.76%\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e35\u0026ndash;44\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e37.57%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e37.08%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003eNortheast\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e18.82%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e18.84%\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e45\u0026ndash;54\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e23.13%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e23.18%\u003c/p\u003e \u003c/td\u003e \u003ctd 
align=\"left\" colname=\"c4\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003eWest\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e17.02%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e17.02%\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e55\u0026ndash;64\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e6.22%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e6.31%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003eOther\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e1.48%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e1.48%\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e65+\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e1.62%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e1.64%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e\u0026nbsp;\u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec6\" class=\"Section2\"\u003e \u003ch2\u003e3.3. 
Data extraction\u003c/h2\u003e \u003cp\u003eThe next step in the analysis process was to extract the medical history of the patients from the available information in the healthcare claims patient level database. To ensure that only healthcare history data prior to the first condition event were extracted, an event date was established for each patient in the target cohort. For the control cohort, the first activity in 2019 was taken as the event date.\u003c/p\u003e \u003cp\u003eUsing each patient\u0026rsquo;s event date, 36 months of medical history data were extracted. The historical data covered all medical events in the patient history, including diagnoses for comorbid conditions, medical and surgical procedures, therapeutics, and treatments prescribed to patients. Only the top 1,000 diagnosis codes, top 800 medical and surgical procedures, and top 500 prescribed drugs were considered for further analysis, as these top codes constituted more than 80% of the total data. A pivot table was created in which transaction-level data were aggregated by the anonymized patient ID. After the historical medical claims data were preprocessed for each cohort independently, the two datasets were integrated into a single data frame. The integrated data frame had more than 2,600 features. The dataset was then standardized and split into a training set and a test set, using a 70:30 ratio [\u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e]. The training set is used to identify the key features of endometriosis onset, while the test set is used to validate whether these features predict the condition onset accurately on patients not seen during training [\u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e22\u003c/span\u003e]. 
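\u003c/p\u003e \u003cp\u003eThe aggregation and split described above can be sketched as follows. The toy claims table, its column names, and the labels are hypothetical, and the feature codes are borrowed from the tables in this paper purely as examples. One deliberate difference from the text: the scaler here is fit on the training rows only, a common precaution against leaking test-set statistics, whereas the text standardizes before splitting.\u003c/p\u003e

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical transaction-level claims: one row per (patient, code) event.
claims = pd.DataFrame({
    "patient_id": [1, 1, 2, 3, 3, 4, 5, 6, 7, 8, 9, 10, 10],
    "code": ["D_R10_2", "P_76830", "D_N94_6", "D_R10_2", "D_R10_2",
             "D_Z01_419", "D_Z01_419", "P_76830", "D_N94_6",
             "D_Z01_419", "D_R10_2", "D_Z01_419", "P_76830"],
})
# Hypothetical cohort labels: 1 = target cohort, 0 = control cohort.
labels = pd.Series({1: 1, 2: 1, 3: 1, 4: 0, 5: 0,
                    6: 0, 7: 1, 8: 0, 9: 1, 10: 0})

# Pivot: one row per patient, one column per code, cell = event count.
features = pd.crosstab(claims["patient_id"], claims["code"])
y = labels.reindex(features.index)

# 70:30 split, stratified so both sets keep the cohort balance.
X_train, X_test, y_train, y_test = train_test_split(
    features, y, test_size=0.30, stratify=y, random_state=0
)

# Standardize using statistics from the training rows only.
scaler = StandardScaler().fit(X_train)
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)
```

\u003cp\u003e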
Splitting the data into train and test sets helps to assess the model performance and its ability to generalize to unseen data [\u003cspan citationid=\"CR23\" class=\"CitationRef\"\u003e23\u003c/span\u003e].\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec7\" class=\"Section2\"\u003e \u003ch2\u003e3.4. Machine Learning algorithms\u0026rsquo; overview\u003c/h2\u003e \u003cp\u003eMachine Learning algorithms can be grouped into two categories: supervised and unsupervised learning.\u003c/p\u003e \u003cp\u003e \u003cem\u003e3.4.a. Supervised learning algorithms\u003c/em\u003e \u003c/p\u003e \u003cp\u003eSupervised learning is the process of training machine learning algorithms to map from an input space (X) to an output space (Y), i.e. Y\u0026thinsp;=\u0026thinsp;f(X) [\u003cspan citationid=\"CR25\" class=\"CitationRef\"\u003e25\u003c/span\u003e]. The major objective is to approximate the mapping function (f) so that when a new data point (x) is added, its outcome (y) can be predicted [\u003cspan citationid=\"CR26\" class=\"CitationRef\"\u003e26\u003c/span\u003e]. Supervised learning algorithms are mainly used for classification and prediction problems [\u003cspan citationid=\"CR32\" class=\"CitationRef\"\u003e32\u003c/span\u003e]. The following are among the most popular supervised algorithms: \u003cem\u003elogistic regression, decision trees (DTs), random forest (RF), extreme gradient boosting, support vector machines (SVMs), Na\u0026iuml;ve Bayes\u003c/em\u003e, \u003cem\u003eadaptive boosting (AdaBoost), artificial neural network (ANN) etc.\u003c/em\u003e [\u003cspan citationid=\"CR31\" class=\"CitationRef\"\u003e31\u003c/span\u003e].\u003c/p\u003e \u003cp\u003e \u003cem\u003e3.4.b. 
Unsupervised learning algorithms\u003c/em\u003e \u003c/p\u003e \u003cp\u003eUnsupervised learning algorithms, on the other hand, try to learn the hidden patterns within the input dataset (X) [\u003cspan citationid=\"CR28\" class=\"CitationRef\"\u003e28\u003c/span\u003e]. These models are called unsupervised because, unlike in supervised learning, there are no labeled outcomes to guide them [\u003cspan citationid=\"CR29\" class=\"CitationRef\"\u003e29\u003c/span\u003e]. The algorithms are left to learn, discover, and showcase the patterns in the input data (X) on their own. They are widely used for tasks such as discovering natural clusters, dimensionality reduction, and anomaly detection. \u003cem\u003ek-Means clustering, principal component analysis (PCA), factor analysis (FA), singular value decomposition (SVD), apriori algorithm (association rule)\u003c/em\u003e are some popular examples of unsupervised learning algorithms [\u003cspan citationid=\"CR31\" class=\"CitationRef\"\u003e31\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eDepending on the study objectives and the available data, algorithms are explored, tested for performance and data type fit, and selected accordingly. We framed endometriosis onset prediction as a supervised classification problem and selected Logistic Regression and XGB models to develop a highly predictive algorithm of the disease onset. SVM, RF, AdaBoost, ANN, etc. are other options that were explored in disease prediction; however, Logistic Regression and XGB were selected to predict the condition onset. 
Logistic Regression allows study of the odds of endometriosis occurrence for a given medical event [\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e], while XGB offers more flexibility in fine-tuning hyper-parameters than other tree-based algorithms [\u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e].\u003c/p\u003e \u003cp\u003e \u003cem\u003eLogistic Regression\u003c/em\u003e \u003c/p\u003e \u003cp\u003eLogistic Regression is a statistical model that in its basic form uses a logistic function to model a binary dependent variable, although many more complex extensions exist [\u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e, \u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e]. Mathematically, a binary logistic model has a dependent variable with two possible values, labeled \"0\" and \"1\" [\u003cspan citationid=\"CR33\" class=\"CitationRef\"\u003e33\u003c/span\u003e]. Outputs with more than two values are modeled by multinomial logistic regression. Logistic Regression is used in various fields, including healthcare and social sciences [\u003cspan citationid=\"CR34\" class=\"CitationRef\"\u003e34\u003c/span\u003e].\u003c/p\u003e \u003cp\u003e \u003cem\u003eeXtreme Gradient Boosting\u003c/em\u003e \u003c/p\u003e \u003cp\u003eGradient boosting is a machine learning technique that builds an ensemble of weak prediction models, typically decision trees [\u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e]. An individual tree is a simple, often unreliable, model, but when multiple trees are grouped together they can form a robust algorithm [\u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e12\u003c/span\u003e]. 
XGB starts by creating a first simple tree [\u003cspan citationid=\"CR35\" class=\"CitationRef\"\u003e35\u003c/span\u003e] and then proceeds sequentially, with each new tree correcting the errors of the trees built before it, until a stopping point is reached, such as a set number of trees (estimators) [\u003cspan citationid=\"CR36\" class=\"CitationRef\"\u003e36\u003c/span\u003e].\u003c/p\u003e \u003cp\u003e \u003cem\u003eChi-Square Test\u003c/em\u003e \u003c/p\u003e \u003cp\u003eThe Chi-square test is one of the most widely used non-parametric tests [\u003cspan citationid=\"CR37\" class=\"CitationRef\"\u003e37\u003c/span\u003e]. It compares the observed and expected frequencies of one or more attributes in a contingency table, either to test the independence of attributes or as a \u0026lsquo;goodness of fit\u0026rsquo; test [\u003cspan citationid=\"CR38\" class=\"CitationRef\"\u003e38\u003c/span\u003e]. In this work, the Chi-square test is used to identify the most significant features given the dependent variable (Y) [\u003cspan citationid=\"CR40\" class=\"CitationRef\"\u003e40\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eLogistic Regression, being the simplest of the machine learning algorithms, was selected as the baseline model against which other models\u0026rsquo; performance was compared. Both Logistic Regression and XGB models were trained, and the top 1,000 features from each algorithm were selected out of the more than 2,600 features used in the model runs. To decrease the number of data elements and to select only the variables most important for predicting the condition onset, we also used a Chi-square test to identify the top 1,000 features. As a next step, the unique features from each method were used to train the final machine learning model to predict the endometriosis occurrence probability. 
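\u003c/p\u003e \u003cp\u003eThe three-way feature selection described above can be sketched on synthetic data. This is an illustration under stated assumptions, not the study pipeline: scikit-learn\u0026rsquo;s GradientBoostingClassifier stands in for XGB to keep the example dependency-light, TOP_K is 5 rather than 1,000, the count data are simulated, and taking the union of the three top-K sets is one reading of \u0026lsquo;unique features from each model\u0026rsquo;.\u003c/p\u003e

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import chi2
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

TOP_K = 5  # illustrative; the text reports keeping the top 1,000

# Simulated event counts: 20 features, of which the first 3 carry signal.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=400)
X_counts = rng.poisson(1.0, size=(400, 20))
X_counts[:, :3] += rng.poisson(2.0 * y[:, None], size=(400, 3))


def top_k(scores, k=TOP_K):
    """Indices of the k largest scores, as a set."""
    return set(np.argsort(scores)[::-1][:k].tolist())


# 1) Logistic Regression on standardized data: rank by |coefficient|.
X_std = StandardScaler().fit_transform(X_counts)
lr = LogisticRegression(max_iter=1000).fit(X_std, y)
lr_top = top_k(np.abs(lr.coef_[0]))

# 2) Gradient-boosted trees (stand-in for XGB): rank by importance.
gb = GradientBoostingClassifier(random_state=0).fit(X_counts, y)
gb_top = top_k(gb.feature_importances_)

# 3) Chi-square on the raw, non-negative counts (before standardization).
chi_scores, p_values = chi2(X_counts, y)
chi_top = top_k(chi_scores)

# Union of the features flagged by any of the three methods.
selected = sorted(lr_top | gb_top | chi_top)
```

\u003cp\u003eThe Chi-square scorer requires non-negative inputs, which is why it is applied to the raw counts rather than to the standardized matrix.\u003c/p\u003e \u003cp\u003e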
Algorithms were trained in Python 3.5 using the \u0026lsquo;\u003cem\u003escikit-learn\u003c/em\u003e\u0026rsquo; and \u0026lsquo;\u003cem\u003exgboost\u003c/em\u003e\u0026rsquo; libraries.\u003c/p\u003e \u003c/div\u003e "},{"header":"4. Results","content":" \u003cdiv id=\"Sec9\" class=\"Section2\"\u003e \u003ch2\u003e4.1. Significant features selection\u003c/h2\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab3\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 3\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003epresents the machine learning model performance metrics, which indicate that both the Logistic Regression and XGB models performed relatively well in predicting the condition onset. The models\u0026rsquo; accuracy ranged between 88% and 96%.\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"4\"\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eAlgorithms\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eStatistic\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eTrain Set\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eTest Set\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\" morerows=\"5\" rowspan=\"6\"\u003e \u003cp\u003eLR\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eAccuracy\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e96%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e96%\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eSensitivity/TPR/Recall\u003c/p\u003e 
\u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e95%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e95%\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eSpecificity/TNR\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e98%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e97%\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003ePrecision/PPV\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e98%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e97%\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003ef1-Score\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e0.96\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e0.96\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eAUC\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e0.96\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e0.96\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\" morerows=\"5\" rowspan=\"6\"\u003e \u003cp\u003eXGB\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eAccuracy\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e90%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e88%\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e 
\u003cp\u003eSensitivity/TPR/Recall\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e86%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e84%\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eSpecificity/TNR\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e95%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e93%\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003ePrecision/PPV\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e95%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e92%\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003ef1-Score\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e0.9\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e0.88\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eAUC\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e0.9\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e0.88\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003eTable\u0026nbsp;\u003cspan refid=\"Tab3\" class=\"InternalRef\"\u003e3\u003c/span\u003e. 
Classification metrics of train and test sets for the LR and XGB models\u003c/p\u003e \u003cp\u003eFigure \u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003e presents the Receiver Operating Characteristic (ROC) curves on the test set for the Logistic Regression and XGB models. The Area under the ROC Curve (AUC) ranged from 0.88 to 0.96. The Chi-square test was also applied to the data before standardization.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eThe top 1,000 features were selected from the Logistic Regression, XGB and Chi-square algorithms to train the final machine learning model. Most of the top features identified by the selected models were related to medical and surgical procedures as well as diagnosis codes. Patients diagnosed with endometriosis underwent a series of medical and surgical procedures and had various diagnostic symptoms and comorbid conditions. The Chi-square significance test was run at the 95% confidence level to aid in identifying the most significant features.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab4\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 4\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eMost significant features from LR, XGB and Chi-Square test\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"2\"\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eFeature\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eFeature Description\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eD_N85_8\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eOther specified non inflammatory disorder of 
uterus\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eD_N94_6\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eDysmenorrhea, unspecified\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eD_N94_9\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eUnspecified condition associated with female genital organs and menstrual cycle\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eD_R10_2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003ePelvic and Perineal Pain\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eD_Z01_419\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eEncounter for gynecological examination (general) (routine) without abnormal findings\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eP_00840\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eAnesthesia Intraperitoneal Lower Abd W/Laps Nos\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eP_00944\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eAnesthesia vaginal hysterectomy incl biopsy\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eP_52000\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eCystourethroscopy\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eP_58571\u003c/p\u003e \u003c/td\u003e 
\u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eLaps total hysterect 250 GM/\u0026lt; w/rmvl tube/ovary\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eP_58573\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eLaparoscopy tot hysterectomy\u0026thinsp;\u0026gt;\u0026thinsp;250\u0026nbsp;g w/tube/ovar\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eP_58662\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eLaps Fulg/Exc Ovary Viscera/ Peritoneal Surface\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eP_76830\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eUs Transvaginal\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eP_J1950\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eInjection. 
Leuprolide acetate (for depot suspens)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eR_Norethindrone_Acetate\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eNorethindrone Acetate\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eSPCLT_EM\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eEmergency Medicine\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eSPCLT_FM\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eFamily medicine\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eSPCLT_HO\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eHematology/Oncology\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eSPCLT_OBG\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eObstetrics and gynecology\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003eSection \u003cspan refid=\"Sec1\" class=\"InternalRef\"\u003e1\u003c/span\u003e of this work describes endometriosis and its associated signs and symptoms such as \u0026lsquo;\u003cem\u003epainful periods\u0026rsquo;, \u0026lsquo;lower abdominal and pelvic pain\u0026rsquo;, \u0026lsquo;heavy bleeding during periods\u0026rsquo;, \u0026lsquo;pain during urination and bowel movement\u0026rsquo;, \u0026lsquo;constipation and diarrhea\u0026rsquo;, \u0026lsquo;infertility\u0026rsquo;, \u0026lsquo;painful sexual intercourse\u0026rsquo;, etc.\u003c/em\u003e [\u003cspan citationid=\"CR16\" 
class=\"CitationRef\"\u003e16\u003c/span\u003e, \u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e]. Identifying these prominent medical events from patients\u0026rsquo; medical history by the models is the objective of this work. Hence, it is desirable to validate the model performance by analyzing the top features, whether they would help predict endometriosis\u0026rsquo; onset.\u003c/p\u003e \u003cp\u003eTable\u0026nbsp;\u003cspan refid=\"Tab4\" class=\"InternalRef\"\u003e4\u003c/span\u003e presents the top features identified by the machine learning models, which are directly or indirectly associated with endometriosis. Features such as \u0026lsquo;\u003cem\u003enon inflammatory disorder of uterus (D_N85_8)\u0026rsquo;, \u0026lsquo;pelvic and perineal pain (D_R10_2)\u0026rsquo;\u003c/em\u003e are the diagnosis codes, presenting the association with the risks and symptoms of endometriosis [\u003cspan citationid=\"CR45\" class=\"CitationRef\"\u003e45\u003c/span\u003e]. Procedure codes such \u003cem\u003e\u0026lsquo;anesthesia of lower abdomen for laparoscopy (P_00840)\u0026rsquo;, \u0026lsquo;vaginal hysterectomy including biopsy (P_00944)\u0026rsquo;\u003c/em\u003e are the top procedures often associated with the diagnosis as well treatment of endometriosis [\u003cspan citationid=\"CR45\" class=\"CitationRef\"\u003e45\u003c/span\u003e]. Furthermore, the machine learning models suggest that patients often consult with specialists including \u003cem\u003e\u0026lsquo;emergency medicine (SPCLT_EM)\u0026rsquo;, \u0026lsquo;family medicine (SPCLT_FM)\u0026rsquo;, \u0026lsquo;obstetrics and gynecology (SPCLT_OBG)\u0026rsquo;\u003c/em\u003e when experiencing related symptoms and gynecological issues. 
Overall, the machine learning models selected top features closely related to the onset of endometriosis, which implies that tracking these features could allow the condition to be diagnosed sooner.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec10\" class=\"Section2\"\u003e \u003ch2\u003e4.2. Feature selection for market definition\u003c/h2\u003e \u003cp\u003eThe top features from all three algorithms (Logistic Regression, XGB, and Chi-Square) that were specific to the target cohort were identified. These features appeared to be important in diagnosing endometriosis and were selected as patient scoring criteria. Therapeutics and medical and surgical procedure codes specific to endometriosis treatment, such as Orilissa, Marilissa, and Lupron Depot, were excluded. Around 9.5\u0026nbsp;million female patients aged 18 and above qualified for scoring.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec11\" class=\"Section2\"\u003e \u003ch2\u003e4.3. Propensity model training and validation\u003c/h2\u003e \u003cp\u003eUsing the selected top features, the Logistic Regression and XGB models were re-trained. Because the number of features was reduced, we initially observed a drop in model performance. After several iterations of hyper-parameter tuning, the predictive power of XGB improved significantly compared to the previous iterations; however, we did not see any improvement in the Logistic Regression model results. 
Interestingly, both models were able to identify additional new features aligned with endometriosis.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab5\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 5\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eList of top features identified by re-trained models\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"2\"\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eFeatures\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eFeature Description\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eD_D25_0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eSubmucous leiomyoma of uterus\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eD_F43_0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eAcute stress reaction\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eD_N83_291\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eOther ovarian cyst, right side\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eD_N85_2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eHypertrophy of uterus\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eD_N92_4\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eExcessive bleeding in the premenopausal period\u003c/p\u003e 
\u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eD_N94_12\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eDeep dyspareunia\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eD_N94_3\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003ePremenstrual tension syndrome\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eD_N94_5\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eSecondary dysmenorrhea\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eD_N97_0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eFemale infertility associated with anovulation\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eD_Z79_890\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eHormone replacement therapy\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eD_Z80_41\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eFamily history of malignant neoplasm of ovary\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eP_58661\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eLaparoscopy w/rmvl adnexal structures\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eR_ACETAMINOPHEN\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eAcetaminophen\u003c/p\u003e \u003c/td\u003e 
\u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eR_MEGESTROL_ACETATE\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eMegestrol acetate\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eR_LIDOCAINE_HCL\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eLidocaine HCl\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003eThe re-trained machine learning models identified all the top features discussed in Sect.\u0026nbsp;\u003cspan refid=\"Sec9\" class=\"InternalRef\"\u003e4.1\u003c/span\u003e. Table\u0026nbsp;\u003cspan refid=\"Tab5\" class=\"InternalRef\"\u003e5\u003c/span\u003e presents the additional features recognized by the XGB and Logistic Regression models, which are highly significant in predicting the likelihood of endometriosis. The models suggest that features such as \u003cem\u003e\u0026lsquo;submucous leiomyoma of uterus (D_D25_0)\u0026rsquo;, \u0026lsquo;ovarian cyst (D_N83_291)\u0026rsquo;, \u0026lsquo;deep dyspareunia (D_N94_12)\u0026rsquo;\u003c/em\u003e, and \u003cem\u003e\u0026lsquo;female infertility associated with anovulation (D_N97_0)\u0026rsquo;\u003c/em\u003e are important in predicting the likelihood of endometriosis. The models also flagged the drugs \u003cem\u003eAcetaminophen (R_ACETAMINOPHEN), Megestrol acetate (R_MEGESTROL_ACETATE), and Lidocaine HCl (R_LIDOCAINE_HCL)\u003c/em\u003e as strong predictors of endometriosis.\u003c/p\u003e \u003cp\u003eTable\u0026nbsp;\u003cspan refid=\"Tab6\" class=\"InternalRef\"\u003e6\u003c/span\u003e shows that the XGB model performed better than the Logistic Regression model. 
Figure\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003e shows the Receiver Operating Characteristic (ROC) curves on the test sets for both the re-trained Logistic Regression and XGB models. The Area under the ROC Curve (AUC) values of the LR and XGB models on the test set were 0.87 and 0.96 respectively. Figure\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003e suggests that the XGB model differentiated target from control more accurately than the LR model. Hence, we used the XGB model to score the qualified patients.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab6\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 6\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eClassification metrics of the LR and XGB models on the train and test sets\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"4\"\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eAlgorithms\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eStatistic\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eTrain Set\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eTest Set\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\" morerows=\"5\" rowspan=\"6\"\u003e \u003cp\u003eLR\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eAccuracy\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e87%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e87%\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eSensitivity/TPR/Recall\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e75%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e75%\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eSpecificity/TNR\u003c/p\u003e 
\u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e98%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e98%\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003ePrecision/PPV\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e98%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e98%\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003ef1-Score\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e0.85\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e0.85\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eAUC\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e0.87\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e0.87\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\" morerows=\"5\" rowspan=\"6\"\u003e \u003cp\u003eXGB\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eAccuracy\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e96%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e94%\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eSensitivity/TPR/Recall\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e93%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e90%\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e 
\u003cp\u003eSpecificity/TNR\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e99%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e98%\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003ePrecision/PPV\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e99%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e97%\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003ef1-Score\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e0.96\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e0.93\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eAUC\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e0.96\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e0.94\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec12\" class=\"Section2\"\u003e \u003ch2\u003e4.4. Scoring qualified patients\u003c/h2\u003e \u003cp\u003eThe last step of the model evaluation is to score the qualified patients to assess the model\u0026rsquo;s ability to predict condition onset. 
A complete 36-month medical history was extracted for the 9.5\u0026nbsp;million qualified patients, including diagnosis codes, medical and surgical procedure codes, prescribed medications and treatments, and practitioners\u0026rsquo; therapy expertise and board-certified specialty. After data pre-processing, the likelihood of endometriosis was predicted using the trained XGB model.\u003c/p\u003e \u003cp\u003eThe probability distribution of the 9.5\u0026nbsp;million scored patients is shown in Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003e. We observed that most of the predicted probability values are concentrated near either 0 or 1. Using 0.5 as the threshold, the XGB model suggests that around 36% of the scored patients are likely to be diagnosed with endometriosis at some point in the future. If practitioners can leverage the significant variables when diagnosing the condition onset, they can give these patients special medical care and advice in time, thereby reducing the risks of endometriosis and its related complications.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003c/div\u003e "},{"header":"5. Discussion","content":" \u003cp\u003eOverall, the machine learning models identified top features that can help predict endometriosis onset in advance. As noted in Tables\u0026nbsp;\u003cspan refid=\"Tab4\" class=\"InternalRef\"\u003e4\u003c/span\u003e and \u003cspan refid=\"Tab5\" class=\"InternalRef\"\u003e5\u003c/span\u003e of the 
\u003cspan refid=\"Sec8\" class=\"InternalRef\"\u003eResults\u003c/span\u003e section (Sect.\u0026nbsp;4), these features include diagnosis codes, medical and surgical procedure codes, as well as physician specialties that often support patients through their healthcare journey.\u003c/p\u003e \u003cp\u003eFor the preliminary Logistic Regression, XGB, and Chi-Square runs noted in Table\u0026nbsp;\u003cspan refid=\"Tab4\" class=\"InternalRef\"\u003e4\u003c/span\u003e, the following top variables were identified as important in predicting the condition onset: 1) diagnosis codes such as \u003cem\u003e\u0026lsquo;non inflammatory disorder of uterus (D_N85_8)\u0026rsquo;, \u0026lsquo;dysmenorrhea (D_N94_6)\u0026rsquo;, \u0026lsquo;pelvic and perineal pain (D_R10_2)\u0026rsquo;, \u0026lsquo;unspecified condition associated with female genital organs and menstrual cycle (D_N94_9)\u0026rsquo;\u003c/em\u003e, which clearly show an association with the risks and symptoms of endometriosis [\u003cspan citationid=\"CR45\" class=\"CitationRef\"\u003e45\u003c/span\u003e]; and 2) medical and surgical procedure codes such as \u003cem\u003e\u0026lsquo;anesthesia of lower abdomen for laparoscopy (P_00840)\u0026rsquo;, \u0026lsquo;vaginal hysterectomy including biopsy (P_00944)\u0026rsquo;, \u0026lsquo;cystourethroscopy (P_52000)\u0026rsquo;, \u0026lsquo;laparoscopy, surgical with fulguration or excision of lesions of the ovary, peritoneal surface (P_58662)\u0026rsquo;\u003c/em\u003e, which are associated with the diagnosis as well as the treatment of endometriosis [\u003cspan citationid=\"CR45\" class=\"CitationRef\"\u003e45\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eFrom the perspective of the patient journey and healthcare access, the machine learning models suggest that patients often consult specialists, including \u003cem\u003e\u0026lsquo;emergency medicine (SPCLT_EM)\u0026rsquo;, \u0026lsquo;family medicine (SPCLT_FM)\u0026rsquo;, \u0026lsquo;obstetrics and gynecology (SPCLT_OBG)\u0026rsquo;\u003c/em\u003e, when experiencing endometriosis 
related symptoms and gynecological issues. Patients with a history of endometriosis or with untreated endometriosis are at a higher risk of developing either ovarian cancer or \u0026lsquo;\u003cem\u003eendometriosis associated adenocarcinoma\u003c/em\u003e,\u0026rsquo; which can also serve as an indicator of the potential occurrence of the condition [\u003cspan citationid=\"CR52\" class=\"CitationRef\"\u003e52\u003c/span\u003e, \u003cspan citationid=\"CR53\" class=\"CitationRef\"\u003e53\u003c/span\u003e, \u003cspan citationid=\"CR54\" class=\"CitationRef\"\u003e54\u003c/span\u003e]. The machine learning models selected \u003cem\u003e\u0026lsquo;hematology/oncology (SPCLT_HO)\u0026rsquo;\u003c/em\u003e as one of the top healthcare provider specialties. This finding suggests that if a patient has any of the signs and symptoms noted above, a consultation with an oncologist is recommended [\u003cspan citationid=\"CR55\" class=\"CitationRef\"\u003e55\u003c/span\u003e, \u003cspan citationid=\"CR56\" class=\"CitationRef\"\u003e56\u003c/span\u003e]. Overall, the machine learning models selected top features directly related to the onset of endometriosis, which implies that tracking these features could allow the condition to be diagnosed sooner.\u003c/p\u003e \u003cp\u003eAs noted in Table\u0026nbsp;\u003cspan refid=\"Tab5\" class=\"InternalRef\"\u003e5\u003c/span\u003e above, the Logistic Regression and XGB models identified additional features that are important in predicting the likelihood of endometriosis. 
The models suggest that features such as \u003cem\u003e\u0026lsquo;submucous leiomyoma of uterus (D_D25_0)\u0026rsquo;, \u0026lsquo;ovarian cyst (D_N83_291)\u0026rsquo;, \u0026lsquo;hypertrophy of uterus (D_N85_2)\u0026rsquo;, \u0026lsquo;excessive bleeding in the premenopausal period (D_N92_4)\u0026rsquo;, \u0026lsquo;deep dyspareunia (D_N94_12)\u0026rsquo;, \u0026lsquo;female infertility associated with anovulation (D_N97_0)\u0026rsquo;, \u0026lsquo;premenstrual tension syndrome (D_N94_3)\u0026rsquo;, \u0026lsquo;hormone replacement therapy (D_Z79_890)\u0026rsquo;, \u0026lsquo;family history of malignant neoplasm of ovary (D_Z80_41)\u0026rsquo;\u003c/em\u003e are highly significant in predicting the likelihood of endometriosis. Several articles also support the models\u0026rsquo; findings that \u003cem\u003efibroids, ovarian cysts, infertility, menstrual period complications, family history of neoplasm of ovary, hormone therapy, etc.\u003c/em\u003e have a strong association with endometriosis [\u003cspan citationid=\"CR48\" class=\"CitationRef\"\u003e48\u003c/span\u003e]. Recent clinical research also indicates that women of reproductive age with \u0026lsquo;\u003cem\u003echronic stress\u003c/em\u003e\u0026rsquo; are at a higher risk of developing endometriosis [\u003cspan citationid=\"CR47\" class=\"CitationRef\"\u003e47\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eThe machine learning models also identified the drugs \u003cem\u003eAcetaminophen (R_ACETAMINOPHEN), Megestrol acetate (R_MEGESTROL_ACETATE), and Lidocaine HCl (R_LIDOCAINE_HCL)\u003c/em\u003e as strong predictors of endometriosis, as these drugs are often prescribed, respectively, as an analgesic, for birth control and the treatment of endometrial cancer, and to numb the skin or muscles. 
Furthermore, features such as \u003cem\u003e\u0026lsquo;submucous leiomyoma of uterus (D_D25_0)\u0026rsquo;\u003c/em\u003e and \u003cem\u003e\u0026lsquo;hypertrophy of uterus (D_N85_2)\u0026rsquo;\u003c/em\u003e are significant predictors [\u003cspan citationid=\"CR49\" class=\"CitationRef\"\u003e49\u003c/span\u003e, \u003cspan citationid=\"CR50\" class=\"CitationRef\"\u003e50\u003c/span\u003e] of disease onset; however, more clinical research is needed to support this statement, because these conditions have symptoms similar to those of endometriosis, yet patients with them are less likely to develop it [\u003cspan citationid=\"CR51\" class=\"CitationRef\"\u003e51\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eOverall, the top data elements represent the key features that should be considered when diagnosing endometriosis in adult women in order to decrease the time to diagnosis. As noted in Sect.\u0026nbsp;4.4, when these variables are used in the diagnostic process, the condition onset can be predicted with high accuracy, and patients with and without the disease can be accurately differentiated.\u003c/p\u003e "},{"header":"6. Conclusions","content":" \u003cp\u003eIn this article, we validated the crucial role of AI and ML in disease diagnosis, prediction, and forecasting. We analyzed the medical history of patients with endometriosis using machine learning algorithms and re-trained an XGB model on selected important features, which was then applied to predict the likelihood of endometriosis occurrence in the adult female population. Early prediction of the disease can offer an opportunity for patients to receive needed medical treatment earlier in the patient journey. 
Creating a typing tool that can be integrated into Electronic Health Records (EHR) systems and easily accessed by healthcare providers could further support timely and precise diagnosis, ultimately improving patient care and quality of life. In future work, we plan to explore advanced deep learning algorithms to further improve model performance and the accuracy with which the likelihood of disease onset is predicted.\u003c/p\u003e "},{"header":"Abbreviations","content":" \u003cp\u003e \u003cul\u003e \u003cli\u003e \u003cp\u003eArtificial Intelligence (AI)\u003c/p\u003e \u003c/li\u003e \u003cli\u003e \u003cp\u003eMachine Learning (ML)\u003c/p\u003e \u003c/li\u003e \u003cli\u003e \u003cp\u003eLogistic Regression (LR)\u003c/p\u003e \u003c/li\u003e \u003cli\u003e \u003cp\u003eeXtreme Gradient Boosting (XGB)\u003c/p\u003e \u003c/li\u003e \u003cli\u003e \u003cp\u003ePrincipal Component Analysis (PCA)\u003c/p\u003e \u003c/li\u003e \u003cli\u003e \u003cp\u003eFactor Analysis (FA)\u003c/p\u003e \u003c/li\u003e \u003cli\u003e \u003cp\u003eSingular Value Decomposition (SVD)\u003c/p\u003e \u003c/li\u003e \u003cli\u003e \u003cp\u003eReceiver Operating Characteristic Curve (ROC)\u003c/p\u003e \u003c/li\u003e \u003cli\u003e \u003cp\u003eArea under the ROC Curve (AUC)\u003c/p\u003e \u003c/li\u003e \u003cli\u003e \u003cp\u003eElectronic Health Records (EHR)\u003c/p\u003e \u003c/li\u003e \u003c/ul\u003e \u003c/p\u003e "},{"header":"Declarations","content":"\u003cul\u003e\n\u003cli\u003eEthics approval and consent to participate: Symphony Health, PRA Health Sciences Privacy Risk Review Group reviewed the article. The de-identified dataset is used for analytics only. No direct patient identifiers are noted. The analysis presents only a negligible risk of re-identification of an individual, which is consistent with HIPAA Privacy Rules. 
No additional administrative permissions or ethics approvals were required to access and use the medical claims data described in this study.\u003c/li\u003e\n\u003c/ul\u003e\n\u003cul\u003e\n\u003cli\u003eConsent for publication: Not applicable.\u003c/li\u003e\n\u003c/ul\u003e\n\u003cul\u003e\n\u003cli\u003eAvailability of data and materials: The data that support the findings of this study are available from Symphony Health, PRA Health Sciences, but restrictions apply to the availability of these data, which were used under license for the current study, and so are not publicly available.\u003c/li\u003e\n\u003c/ul\u003e\n\u003cul\u003e\n\u003cli\u003eCompeting Interest: The authors declare that they have no competing interests.\u003c/li\u003e\n\u003c/ul\u003e\n\u003cul\u003e\n\u003cli\u003eFunding: Authors work for Symphony Health, PRA Health Sciences. The data used in the article is the property of Symphony Health, PRA Health Sciences. Authors used the data for publication of this article.\u003c/li\u003e\n\u003cli\u003eAuthors' contributions: all authors have read and approved the manuscript.\n\u003cul\u003e\n\u003cli\u003eEJK: Research Principal, Data Scientist \u0026ndash; corresponding author, responsible for the overall study design, analytics plan, and documentation\u003c/li\u003e\n\u003cli\u003eAP: Project Manager - responsible for day to day activities of the research\u003c/li\u003e\n\u003cli\u003eTY: Lead Data Scientist: leading and conducting the analytics for the research project\u003c/li\u003e\n\u003cli\u003eRK, M, VG, S, and MH: Data Scientists on the research project, performing the analytical analysis\u003c/li\u003e\n\u003c/ul\u003e\n\u003c/li\u003e\n\u003cli\u003eAcknowledgements: Authors would like to recognize Heather Valera and Koichi Iwata for their review of document drafts, and their valuable feedback in improving the article content.\u003c/li\u003e\n\u003c/ul\u003e"},{"header":"References","content":"\u003col\u003e\n\u003cli\u003eDoupe P, 
Faghmous J, Basu S. Machine Learning for Health Services Researchers. Value Health. 22(7): 808-815, 2019.\u003c/li\u003e\n\u003cli\u003eWilliam H. Crown, PhD. Potential application of machine learning in health outcomes research and some statistical cautions. \u003cem\u003eInternational Society for Pharmacoeconomics and Outcomes Research (ISPOR)\u003c/em\u003e, 2015. DOI: https://doi.org/10.1016/j.jval.2014.12.005\u003c/li\u003e\n\u003cli\u003eMarzyeh Ghassemi, Tristan Naumann, Peter Schulam, Andrew L. Beam, Irene Y. Chen, Rajesh Ranganath. A review of challenges and opportunities in machine learning for health. \u003cem\u003earXivLabs\u003c/em\u003e. 2019 v4, https://arxiv.org/abs/1806.00388\u003c/li\u003e\n\u003cli\u003eVarun H Buch, Irfan Ahmed, Mahiben Maruthappu. Artificial intelligence in medicine: current trends and future possibilities. \u003cem\u003eBritish Journal of General Practice\u003c/em\u003e 2018; 68 (668): 143-144. DOI: https://doi.org/10.3399/bjgp18X695213\u003c/li\u003e\n\u003cli\u003eAlvin Rajkomar, Sneha Lingam, Andrew G. Taylor, Michael Blum, John Mongan. High-throughput classification of radiographs using deep convolutional neural networks. \u003cem\u003eJournal of Digital Imaging\u003c/em\u003e 30, 95\u0026ndash;101 (2016). DOI: https://doi.org/10.1007/s10278-016-9914-9\u003c/li\u003e\n\u003cli\u003eMin Chen, Yixue Hao, Kai Hwang, Lu Wang, Lin Wang. Disease prediction by machine learning over big data from healthcare communities. \u003cem\u003eIEEE, \u003c/em\u003e2169-3536 (2017), DOI: https://doi.org/10.1109/ACCESS.2017.2694446\u003c/li\u003e\n\u003cli\u003eAdriana Gabriela Alexandru, Irina-Miruna Radu, Madalina-Lavinia Bizon. Big data in healthcare - opportunities and challenges. \u003cem\u003eInformatica Economică vol.22, no. 2/2018\u003c/em\u003e. DOI: https://doi.org/10.12948/issn14531305/22.2.2018.05\u003c/li\u003e\n\u003cli\u003eIroju Olaronke, Ojerinde Oluwaseun. 
Big data in healthcare: Prospects, challenges and resolutions. \u003cem\u003eIEEE\u003c/em\u003e, 16602629, 2016. DOI: https://doi.org/10.1109/FTC.2016.7821747\u003c/li\u003e\n\u003cli\u003eGetting the Most Out of Longitudinal Patient Data. \u003cem\u003eAnonymous patient-level data (APLD)\u003c/em\u003e [Online] https://www.rxdatascience.com/blog/getting-most-out-of-longitudinal-patient-data\u003c/li\u003e\n\u003cli\u003eIntegrated Dataverse (IDV\u0026reg;). [Online] https://symphonyhealth.prahs.com/what-we-do/view-health-data\u003c/li\u003e\n\u003cli\u003eJerome H. Friedman. Greedy function approximation : A gradient boosting machine. \u003cem\u003eThe Annals of Statistics Volume 29\u003c/em\u003e, (2001), 1189-1232 DOI: https://doi.org/10.1214/aos/1013203451\u003c/li\u003e\n\u003cli\u003eExtreme Gradient Boosting. [Online] https://xgboost.readthedocs.io/en/latest/tutorials/model.html,\u003c/li\u003e\n\u003cli\u003ehttps://info.cambridgespark.com/latest/getting-started-with-xgboost\u003c/li\u003e\n\u003cli\u003eS. Cramer. The origins of logistic regression. \u003cem\u003eTinbergen Institute discussion paper, TI 2002-119/4\u003c/em\u003e\u003c/li\u003e\n\u003cli\u003eLogistic Regression. [Online] https://en.wikipedia.org/wiki/Logistic_regression\u003c/li\u003e\n\u003cli\u003eEndometriosis signs and symptoms. [Online] https://www.hopkinsmedicine.org/health/conditions-and-diseases/endometriosis\u003c/li\u003e\n\u003cli\u003eEndometriosis signs and symptoms. [Online] https://www.health.qld.gov.au/news-events/news/signs-symptoms-endometriosis\u003c/li\u003e\n\u003cli\u003eM Sanni Ali, Daniel Prieto-Alhambra, Luciane Cruz Lopes, Dandara Ramos, Nivea Bispo, Maria Y. Ichihara, Julia M. Pescarini, Elizabeth Williamson, Rosemeire L. Fiaccone, Mauricio L. Barreto, and Liam Smeeth. Propensity score methods in health technology assessment: principles, extended applications, and recent advances. \u003cem\u003eFront Pharmacol 10: 973 (2019)\u003c/em\u003e. 
DOI: https://dx.doi.org/10.3389/fphar.2019.00973
Rosenbaum P. R., Rubin D. B. The central role of the propensity score in observational studies for causal effects. Biometrika 70, 41–55 (1983). DOI: https://doi.org/10.1093/biomet/70.1.41
Rosenbaum P. R., Rubin D. B. Constructing a control group using multivariate matched sampling methods that incorporate the propensity score. The American Statistician 39:1, 33–38 (1985). DOI: https://doi.org/10.1080/00031305.1985.10479383
Yun Xu, Royston Goodacre. On splitting training and validation set: a comparative study of cross-validation, bootstrap and systematic sampling for estimating the generalization performance of supervised learning. Journal of Analysis and Testing 2(3) (2017). DOI: https://doi.org/10.1007/s41664-018-0068-2
Rachel Lea Ballantyne Draelos. Best Use of Train/Val/Test Splits, with Tips for Medical Data. Glass Box Machine Learning and Medicine. [Online] https://glassboxmedicine.com/2019/09/15/best-use-of-train-val-test-splits-with-tips-for-medical-data/
Kevin Dobbin, Richard Simon. Optimally splitting cases for training and testing high dimensional classifiers. BMC Medical Genomics, 4:31 (2011). DOI: https://doi.org/10.1186/1755-8794-4-31
Andrius Vabalas, Emma Gowen, Ellen Poliakoff, Alexander J Casson. Machine learning algorithm validation with a limited sample size. PLoS ONE (2019). DOI: https://doi.org/10.1371/journal.pone.0224365
T. Hastie, R. Tibshirani, and J. Friedman, "Overview of supervised learning," The elements of statistical learning. Springer, 2009, pp. 9–39.
Alpaydın, E. (2014). Introduction to machine learning. Cambridge, MA: MIT Press.
Kotsiantis, S. B. (2007). Supervised machine learning: A review of classification techniques. Informatica, 31, 249–268.
T. Hastie, R. Tibshirani, and J. Friedman, "Unsupervised learning," The elements of statistical learning. Springer, 2009, pp. 485–585.
Agnieszka Wosiak, Agata Zamecznik, Katarzyna Niewiadomska-Jarosik. Supervised and unsupervised machine learning for improved identification of intrauterine growth restriction types. Federated Conference on Computer Science and Information Systems (FedCSIS). IEEE (2016)
Hinton, Geoffrey; Sejnowski, Terrence. Unsupervised Learning: Foundations of Neural Computation. MIT Press (1999). ISBN 978-0262581684.
Mohamed Alloghani, Dhiya Al-Jumeily, Jamila Mustafina, Ahmed J. Aljaaf, Abir Hussain. A systematic review on supervised and unsupervised machine learning algorithms for data science. Supervised and Unsupervised Learning for Data Science (pp. 3–21). DOI: https://doi.org/10.1007/978-3-030-22475-2_1
Osvaldo Simeone. A very brief introduction to machine learning with applications to communication systems. arXiv preprint arXiv:1808.02342v4 (2018)
Hosmer, David W.; Lemeshow, Stanley (2013). Applied Logistic Regression. New York: Wiley. ISBN 978-0-470-58247-3.
Alan Agresti (2012). Categorical Data Analysis. Hoboken: John Wiley and Sons. ISBN 978-0-470-46363-5.
Chen, Tianqi; Guestrin, Carlos, "XGBoost: A Scalable Tree Boosting System". In Krishnapuram, Balaji; Shah, Mohak; Smola, Alexander J.; Aggarwal, Charu C.; Shen, Dou; Rastogi, Rajeev (eds.). Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, 2016. pp. 785–794. arXiv:1603.02754. DOI: https://doi.org/10.1145/2939672.2939785
Hastie, T., Tibshirani, R., Friedman, J. H., "10. Boosting and Additive Trees". The Elements of Statistical Learning (2nd ed.). New York: Springer. pp. 337–384 (2009)
Cochran, William G. (1952). The Chi-square test of goodness of fit. The Annals of Mathematical Statistics. 23 (3): 315–345. DOI: https://doi.org/10.1214/aoms/1177729380
On the interpretation of χ2 from contingency tables, and the calculation of p. Journal of the Royal Statistical Society. Vol. 85, No. 1 (1922), pp. 87–94. DOI: https://doi.org/10.2307/2340521
Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze. Introduction to Information Retrieval. Feature selection, Chi-Square feature selection. Cambridge University Press. 2008
Chi-Square feature selection. "Scikit-learn" python library. [Online] https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.chi2.html
Marketing, patient data, and privacy concerns. https://www.reutersevents.com/pharma/commercial/marketing-patient-data-and-privacy-concerns
Data insights. https://prahs.com/healthcare-intelligence/data-insights
Symphony Health Solutions. https://symphonyhealth.prahs.com/
Symphony Health Solutions, What we do. https://symphonyhealth.prahs.com/what-we-do
OBG Manag. Endometriosis and infertility: Expert answers to 6 questions to help pinpoint the best route to pregnancy. Mdedge ObGyn 27(6):30–35 (2015). https://www.mdedge.com/obgyn/article/99912/surgery/endometriosis-and-infertility-expert-answers-6-questions-help-pinpoint/
Jon K. Hathaway, MD, PhD, FACS. Decoding Coding. What is the Best Way to Code for Endometriosis? NewsScope, volume 33, issue 2 (2019). https://newsscope.aagl.org/volume-33-issue-2/decoding-coding-what-is-the-best-way-to-code-for-endometriosis/
Fernando M. Reis, Larissa M. Coutinho, Silvia Vannuccini, Stefano Luisi & Felice Petraglia. Is Stress a Cause or a Consequence of Endometriosis? Reproductive Sciences volume 27, pages 39–45 (2020). DOI: https://doi.org/10.1007/s43032-019-00053-0
Endometriosis – Risks, Signs, Symptoms, Diagnosis and Treatment. https://www.mayoclinic.org/diseases-conditions/endometriosis/symptoms-causes/syc-20354656 ; https://www.webmd.com/women/endometriosis/endometriosis-causes-symptoms-treatment
Bo Liang, Yang-Gui Xie, Xiao Ping Xu, and Chun-Hong Hu. Diagnosis and treatment of submucous myoma of the uterus with interventional ultrasound. NCBI, PMC Oncol Lett (2018). DOI: https://doi.org/10.3892/ol.2018.8122
Endometriosis vs. Adenomyosis: Similarities and Differences. https://www.healthline.com/health/womens-health/adenomyosis-vs-endometriosis
Endometrial Hyperplasia. https://my.clevelandclinic.org/health/diseases/16569-atypical-endometrial-hyperplasia
Marina Kvaskoff, Andrew W Horne, Stacey A Missmer. Informing women with endometriosis about ovarian cancer risk. The Lancet Journal, volume 390, issue 10111, P2433-2434 (2017). DOI: https://doi.org/10.1016/S0140-6736(17)33049-0
Aline Veras Morais Brilhante, Kathiane Lustosa Augusto, Manuela Cavalcante Portela, Luiz Carlos Gabriele Sucupira, Luiz Adriano Freitas Oliveira, Ana Juariana Magalhães Veríssimo Pouchaim, Lívia Rocha Mesquita Nóbrega, Thaís Fontes de Magalhães, and Leonardo Robson Pinheiro Sobreira. Endometriosis and Ovarian Cancer: an Integrative Review (Endometriosis and Ovarian Cancer). Asian Pac J Cancer Prev. (2017) 18(1): 11–16. DOI: https://doi.org/10.22034/APJCP.2017.18.1.11
John P. Cunha, DO, FACOEP. What Will Happen if Endometriosis Is not Treated? emedicinehealth (2019) [online] https://www.emedicinehealth.com/ask_what_will_happen_if_endometriosis_not_treated/article_em.htm#doctor%E2%80%99s_response
A. Michael Coppa, MD. What Happens if Endometriosis is Left Untreated? https://www.drcoppaobgyn.com/blog/what-happens-if-endometriosis-is-left-untreated
Endometriosis and ovarian cancer risk. [Online] https://ovarian.org.uk/news-and-blog/blog/endometriosis-and-ovarian-cancer-risk/

Preprint record

Title: Predicting Endometriosis Onset Using Machine Learning Algorithms
Preprint ID: rs-135736, version v1 (Research Square)
Posted: December 31st, 2020
DOI: https://doi.org/10.21203/rs.3.rs-135736/v1
License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
Journal: BMC Women's Health (http://bmcwomenshealth.biomedcentral.com/)
Status: under review
Subject areas: Internal Medicine, Preventive Medicine
Keywords: Endometriosis, infertility, likelihood, Logistic Regression, Machine Learning, eXtreme Gradient Boosting

Abstract, Conclusions (as recorded in the preprint metadata): Leveraging the machine learning approaches can aid early prediction of the disease and offer an opportunity for patients to receive the needed medical treatment earlier in the patient journey. Creating a typing tool that can be integrated into the Electronic Health Records (EHR) systems and easily accessed by healthcare providers could further aid the objective of improving the diagnosis activities and inform the diagnostic processes that would result in timely and precise diagnosis, ultimately increasing patient care and quality of life.

Peer review history (version 1)

2020-12-17: Editor invited
2020-12-24: Checks complete
2020-12-25: Editor assigned; reviewers invited; reviewer 1 agreed
2021-01-05: Reviewer 1 report received
2021-01-21: Reviewer 2 agreed
2021-02-13: Reviewer 2 report received; decision: Minor revision
Community comments: 0

Reviewer 1 report (2021-01-05)
Recommendation: Accept after minor essential revisions
Comments to Author: Thanks for offering this opportunity to review this interesting manuscript, titled as "Predicting Endometriosis Onset Using Machine Learning Algorithms". I have following comments:
1. Could you please provide the cutoff for significance chosen variables into the model?
2. How to deal with missing data, could you please provide strategies or the cutoff of values as missingness chosen?
3. Is there a way to show visually, other than a table that could show these hypothetical patient risk scores? It should include a nomogram figure in the manuscript
Declaration of competing interests: I declare that I have no competing interests
Publication consent: I agree to the terms of the CC-BY 4.0 license; please do not publish my name with my report. (default)
Form responses: study design appropriate and conclusions supported by the evidence: Yes; methods sufficiently described to allow the study to be repeated: Yes; use of statistics and treatment of uncertainties appropriate: Yes; presentation clear: Yes; images free from apparent manipulation: Yes

Reviewer 2 report (2021-02-13)
Recommendation: Accept without revision
Comments to Author: Thank you for this good work. The idea is innovative and encouraging. Well designed research with promoting outcome.
Declaration of competing interests: I declare that I have no competing interests
Publication consent: I agree to the terms of the CC-BY 4.0 license; please publish my name with my report.
Form responses: study design appropriate and conclusions supported by the evidence: Yes; methods sufficiently described to allow the study to be repeated: Yes; use of statistics and treatment of uncertainties appropriate: Yes; presentation clear: Yes; images free from apparent manipulation: Yes

Condition tags

endometriosis, infertility

References (39)

Cited by (9)

Source provenance

europepmc
last seen: 2026-06-14T06:08:20.186862+00:00
openalex
last seen: 2026-06-04T00:00:01.174412+00:00
License: CC0 · commercial use OK