{"paper_id":"c8e7f804-d638-4fbd-aaab-34a03934c1b3","body_text":"Exploratory Report on Data Synchronising Methods to Develop Machine Learning-Based Prediction Models for Multimorbidity\nThis work is licensed under Creative Commons Attribution 4.0 License. AJBSR.MS.ID.002999.\nAmerican Journal of Biomedical Science & Research\nwww.biomedgrid.com\nGayathri Delanerolle1, Heitor Cavalini1#, Kingshuk Majumder9#, Yassine Bouchareb10, Jian Qing Shi1,6,7#, Om Kurmi8#, Peter Phiri1,2#*, Ashish Shetty3,4, Dharani Hapangama5#\n1Southern Health NHS Foundation Trust, United Kingdom\n2Psychology Department, University of Southampton, United Kingdom\n3University College London, United Kingdom\n4University College London Foundation Trust, United Kingdom\n5University of Liverpool, United Kingdom\n6Southern University of Science and Technology, Shenzhen, China\n7National Center for Applied Mathematics Shenzhen, China\n8University of Coventry\n9University of Manchester Foundation Hospitals\n10Sultan Qaboos University, Oman\n*Corresponding author: Peter Phiri, Research & Innovation Department, Southern Health NHS Foundation Trust, Clinical Trials Facility, Tom Rudd Unit Moorgreen Hospital, University of Southampton, United Kingdom.\nTo Cite This Article: Gayathri Delanerolle, Heitor Cavalini, Kingshuk Majumder, Yassine Bouchareb, Jian Qing Shi, Om Kurmi, Peter Phiri*, Ashish Shetty, Dharani Hapangama. Exploratory Report on Data Synchronising Methods to Develop Machine Learning-Based Prediction Models for Multimorbidity. Am J Biomed Sci & Res. 2024 22(5) AJBSR.MS.ID.002999, DOI: 10.34297/AJBSR.2024.22.002999\nReceived: May 17, 2024; Published: May 28, 2024\nResearch Article\nCopyright© Peter Phiri\nISSN: 2642-1747\nAbstract\nEndometriosis is a complex chronic condition characterised by chronic pelvic pain, dysmenorrhea, anxiety and fatigue. It can often lead to multimorbidity, which is defined by the presence of two or more long-term conditions. Delayed diagnosis of endometriosis is a crucial issue that leads to poor quality of life and poor clinical management. Endometriosis research faces a variety of limitations, including a lack of dedicated funding. Additionally, accessing existing electronic healthcare records can be challenging due to governance and regulatory restrictions. Missing data is another concern that has been commonly identified among real-world studies.\nConsidering these challenges, data science techniques could provide a solution by way of synthetic datasets, generated using known characteristics of endometriosis, to explore the possibility of predicting multimorbidity. This study aimed to develop an exploratory machine learning model that can predict multimorbidity among women with endometriosis using real-world and synthetic data. A sample size of 1012 was used from two endometriosis specialist centres in the UK. In addition, 1000 synthetic data records per centre were generated using the widely used Synthetic Data Vault’s Gaussian Copula model, based on the characteristics of patients’ records.\nFour standard classification models, Logistic Regression (LR), Support Vector Machine (SVM), Random Forest (RF) and Gradient Boosting (GB), were used for classification. 
The average accuracies for all four models (LR, SVM, RF and GB), given as “model accuracy-centre1:accuracy-centre2”, were found to be: LR 90.32%:100.00%, SVM 77.87%:100.00%, RF 90.91%:10.00% and GB 90.15%:100.00% on real-world data, and LR 79.85%:97.41%, SVM 79.21%:97.72%, RF 78.43%:96.67% and GB 90.68%:99.75% on synthetic data, respectively.\nThe findings of this report show machine learning models trained on synthetic data performed better than models trained on real-world data. Our findings suggest synthetic data holds great promise for conducting clinical epidemiology studies and clinical trials that could devise better precision treatments and possibly reduce the burden of multimorbidity.\nBackground\nData science is a rapidly evolving research field that influences analytics, research methods, clinical practice and policies. Access to comprehensive real-world data and gathering life-course research data are primary challenges observed in many disease areas. Existing real-world data can be a rich source of the information required to better characterise diseases, generate cohort specifications and understand clinical practice gaps, enabling more precise research that is value-based for healthcare systems. A common challenge linked to real-world and research data is a high rate of missingness. Historically, statistical methods were used to address missing data where possible, but advances in artificial intelligence techniques have provided improved and quicker methods. These methods could also be used for predicting disease outcomes and improving diagnostic accuracy and treatment suitability. 
\nThese methods can be particularly useful for women’s health conditions, where complex physical and mental health symptoms can give rise to an insufficient understanding of the disease pathophysiology and phenotype characteristics that play a vital role in diagnosis, treatment adherence and prevention of secondary or tertiary conditions. One such condition is endometriosis. Endometriosis is complex, with an array of physical and psychological symptomatologies, often leading to multimorbidity [1]. Multimorbidity is defined by the presence of two or more conditions in any given individual and therefore could be prevented if the initial conditions are managed more effectively. The incidence of multimorbidity has increased with an ageing population and a rising burden of non-communicable diseases in general and of mental ill health, which is particularly important for women [2]. Another important aspect of multimorbidity is disease sequelae, where a physical manifestation could correlate with a mental health impact, and vice versa. The precise causation is complex to assess due to limitations in the current understanding of disease sequelae pathophysiology [3]. As such, multimorbidity could be deemed highly heterogeneous. Multimorbidity impacts people of all ages. Current evidence suggests it is more common among women than men, although it was previously thought to be more common in older adults with a high frailty index score [4]. Hence, multimorbidity is challenging to treat, and there remains a paucity of research on the basic science behind the complex mechanisms that could enable better diagnosis and long-term management [4].\nThis undercurrent of disease complexities linked to endometriosis, which could lead to multimorbidity, should be explored to support clinicians and healthcare organisations in future-proofing patient care [5]. 
In line with this, exploring machine learning as a technique in conjunction with synthetic data methods could demonstrate better predictions and offer a new solution to sample size challenges.\nMethods\nThe primary aim of the study was to develop an exploratory machine learning model that can predict multimorbidity among women with endometriosis using both real-world and synthetic data. In certain instances, real-world data may present confidentiality issues, particularly in medical research where data often contain personal and sensitive information. Sharing such data for analysis can expose vulnerabilities. To develop these models, existing knowledge and symptomatology, comorbidity and demographic data were used. Anonymised data from an ethically approved study were provided by the Manchester and Liverpool endometriosis specialist centres in the UK. The data records used included symptoms, diseases and conditions in women with a confirmed diagnosis of endometriosis. Data curation was completed for the entire sample using the following steps:\nData Pre-Processing: the data were cleaned and prepared by managing missing values, encoding categorical variables, and standardising or normalising continuous variables.\nSynthetic Data Generation: synthetic data records were generated for each centre using the widely used Synthetic Data Vault’s Gaussian Copula model, based on the data characteristics from patients’ records.\nModel Development: four standard classification models, Logistic Regression (LR), Support Vector Machine (SVM), Random Forest (RF) and Gradient Boosting (GB), were trained and implemented on both real-world and synthetic data. These models were used to predict multimorbidity among women with endometriosis.\nModel Evaluation: the performance of the models was assessed by comparing their average accuracies on real-world and synthetic data. 
Accuracy and the Area Under the Receiver Operating Characteristic Curve (AUC) were used as metrics to evaluate the models’ performances.\nComparison and Analysis: the results of the models trained on real-world data and on synthetic data were compared to determine if synthetic data could serve as a viable alternative to real-world data in predicting multimorbidity among women with endometriosis.\nFor all experiments, we trained models on both real-world data and synthetic data. Both types of models were tested on the same test sets, which contained only real-world data because only the real-world records are verified to follow the true distribution of the endometriosis population. The accuracies of these models can then provide better insight into whether the use of synthetic data affects the performance of machine learning models.\nEthics approval\nThe use of anonymised data in this study was approved by the North of Scotland Research Ethics Committee 2 (LREC: 17/NS/0070) for the RLS study conducted at the University of Liverpool.\nThe model used age, height, symptoms, comorbidities and weight in a mathematical formulation. Let x_i be the vector containing these recordings for the i-th person and let x = (x_1, ..., x_n) be the matrix containing the data about all n people. As part of developing methodological rigour, we considered a working example that predicts whether each person in the sample develops depression. Let y = (y_1, ..., y_n) be the vector of response variables, where:\ny_i = 1 if patient i develops depression,\ny_i = 0 if patient i does not develop depression.\nIn this example, we collect data for n = 3 people and have p = 3 recordings for each person (i.e., age, height and weight). These are represented by x_{i1}, x_{i2} and x_{i3} respectively. 
The data can be summarised in Table 1 as follows:\nTable 1: Example Dataset for Predicting Depression.\nPerson # | Age | Height (m) | Weight (kg) | Depression\n1 | 67 | 1.9 | 65 | 1\n2 | 43 | 1.2 | 75 | 0\n3 | 23 | 1.5 | 43 | 0\nWe created a function, f_β, with parameters β, that takes the age, height and weight (x_{i1}, x_{i2}, x_{i3}) of person i as input and outputs a prediction of whether they will develop depression. Let y*_i be the prediction of whether person i develops depression; then we say that\ny*_i = f_β(x_i).\nThe performance of the parameters β can be tested through a loss function, L(β), which measures the difference between the true values y and the predictions y* = (y*_1, ..., y*_n). The loss function imposes a penalty when incorrect predictions are made. Hence, to find the best parameters, β*, we solve the optimisation problem:\nβ* = argmin_β L(β, y, y*).\nThe function f_{β*} can then be used to make predictions for patients who have not been tested for depression.\nAn initial observation was that our prediction function could become over-fitted to the data. This meant that the function captured the specific relationship between x and y in the sample very well, but if the sample did not reflect the true distribution between symptoms and comorbidities, the prediction function would not generalise to other data.\nThe performance of the prediction function on unseen data can be estimated by separating the data into a training set, (x^train, y^train), and a test set, (x^test, y^test). The optimal parameters are found using the training set and then the model’s accuracy is tested on the test set. This accuracy is the proportion of correctly classified data, computed from a confusion matrix, which records the frequencies of each possible outcome. Let C be the confusion matrix defined as:\nC = [ c_00  c_01 ; c_10  c_11 ]   (1)\nwhere c_ij is the number of times y^test = i while y*^test = j. 
The \naccuracy of our model is then\n( ) 00 11\n00 01 10 11\nAccuracy % cc\ncc cc\n+= ++ +  \n(2)\nTo summarise, the approach is broken down into the following \nthree steps,\n1. Solve optimisation problem\n( )\n* train train*argmin , ,Lyyββ\nβ\n on the training set, where the set of prediction values, \n*trainy , is \nfound by\n\nAmerican Journal of Biomedical Science & Research\nAm J Biomed Sci & Res                                     Copyright© Peter Phiri\n658\n( )\ntrain trainy fx β\n∗ =\n2. Make predictions on the test set using optimal weights β*\n( )\ntest testy fx β ∗\n∗ =\n3. Construct confusion matrix, C as is defined in (1) and find the \naccuracy of the model on unseen data by equation (2).\nData Preparation-Manchester\nIn the Manchester dataset, for each patient, the presence of \nvarious symptoms and multiple diagnoses among women with En-\ndometriosis. These are summarised, with descriptions in Table 2. \nA total of \n15p =  recordings are made for each person and so we \ndefine ( )1,...i i ipx xx=  \nto be the vector containing the recordings \nfor person i  (Table 2).\nTable 2: Manchester Data Feature Variables.\nFeature Data Type Description\nAge Integer Age of the Patient\nMenorrhagia Binary\nWhether or not the patient has been diagnosed \nwith \nmenorrhagia\nDysmenorrhea Binary\nWhether or not the patient has been diagnosed \nwith \ndysmenorrhea \nNon menstrual Pelvic pain Binary Whether or not the patient experiences \nnon-menstrual pelvic pain\nDysphasia Binary Whether or not the patient experiences dyspha-\nsia\nDyspareunia Binary Whether or not the patient experiences dyspa-\nreunia\nother symptoms Binary Whether or not the patient has any other symp-\ntoms besides the ones recorded in other features\nInfertility Binary Whether or not the patient is infertile\nNo of Endo symptoms Binary Whether or not the patient has more than 1 \nsymptom \nYear of diagnosis Date The year of the patient’s diagnosis of endome-\ntriosis\nOther surgery – Not 
related to endometriosis Binary Whether or not the patient received any surger-\nies not related to endometriosis\nDischarged Binary Whether or not the patient was discharged\nfollow up Binary Follow up clinical appointments\nHormonal treatment Currently Binary Whether or not the patient is taking any hor-\nmonal treatment\nNo of hormonal treatment tried Integer The number of hormonal treatments the patient \nis taking\nTable 3: Manchester Data Response Variables.\nVariable Name Description\nMy Mental Health The presence of at least one of various mental health conditions\nIy IBS The presence of irritable bowel syndrome (IBS)\nCy Comorbidities (Other)\nThe presence of at least one other disease\n(Perhaps we have a list of these?).\nComby Combined The presence of at least one of the above conditions.\n\nAm J Biomed Sci & Res\nAmerican Journal of Biomedical Science & Research\nCopyright© Peter Phiri\n659\nAdditionally, for each individual, three response variables are \ndocumented, which are summarised, along with their descriptions, \nin Table 3. These variables are defined as follows:\n{\n1if patient develops a mental health condition\n0 if patient does not develop any mental health condition\nMi\niiy = ,\n{\n1if patient develops irritable bowelsyndrome\n0 if patient does not develop irritable bowelsy ndrome\nIi\niiy = ,\n{\n1 if patient develops at least one of various other comorbidities\n0if patient does not develops at least one of var ious other comorbidities\nci\niiy =\n(Table 3).\nWe examined three models of fit, one for each response vari-\nable. We defined a fourth response variable, “Combined” , as shown \nin the final row of Table 3, which indicates the presence of at least \none of the other three conditions. 
Formally, y^Comb is defined as:\ny_i^Comb = 1 if patient i develops at least one of the conditions,\ny_i^Comb = 0 if patient i does not develop any of the conditions.\nWe fitted a fourth model for this response variable.\nWe converted the binary variables, including our response variables, from “Yes” and “No” to 1 and 0, respectively. There was no missing data in the Manchester dataset and as such we made use of all n = 99 observations.\nIn Figure 1, we studied the balance of the data for each response variable. We can see that Mental Health and IBS, and Combined in particular, suffer quite a large imbalance. To address this, we balanced the data through over-sampling before the models were fitted (Figure 1).\nFigure 1\nData Preparation-Liverpool\nThe data from Liverpool had a sample size of 913 patients. The raw data defined 68 possible different symptoms, which were considered as feature variables. A significant rate of missing data was identified. The complete list of features, along with their percentage of missing values, can be found in Table 4.\nTo prepare the data, we first filtered by “Endometriosis = TRUE” to find only those patients who had already been diagnosed with endometriosis, leaving us with 339 patients. Next, we removed all features with more than 10% of missing values and kept the rest. The feature “Endometriosis” is a binary identifier which, after filtering, is always true, so we dropped this feature too. The final features are summarised, with descriptions, in Table 5. 
(Table 4,5).\nTable 4: Liverpool Data Percentage Missing Data.\nFeature | NaN (%) | Feature | NaN (%) | Feature | NaN (%) | Feature | NaN (%)\nSample ID | 0 | Age at diagnosis | 98.5 | Pain interferes with daily activities | 0 | Hormones | 0\nAge | 0.1 | Endometriosis symptoms | 97.8 | Dysmenorrhoea score | 97.5 | Other information | 28.6\nEthnicity | 96.7 | Endometriosis stage | 70.2 | Non-menstrual pelvic pain | 0 | Previous ablation | 0\nPostcode | 94.4 | VAS | 91.5 | Analgesia for pain | 0 | Medications | 85.9\nSample type | 2.8 | FH ENDO | 98.1 | Pain prevents daily activities | 0 | Endometrial cancer | 0\nHair colour | 96.7 | Adenomyosis | 0 | Pelvic pain score | 97.4 | Metastatic lesion | 0\nEye colour | 96.7 | Menorrhagia | 0 | Miscarriages | 44.5 | Metastatic lesion location | 100\nHeight (m) | 0.1 | Fibroids | 0 | Polycystic ovary syndrome | 0 | Type of cancer | 99.8\nWeight (kg) | 0.4 | Reason for surgery | 18.7 | Irregular cycles | 0 | Cancer comments | 98.7\nBMI | 0 | Previous history | 84.7 | Cu coil | 0 | Grade | 100\nSmoker | 0 | Gravidity | 97.3 | Menarche | 97.2 | Stage | 99.8\nPack years | 99.1 | Parity | 8.3 | LMP | 15.7 | Pathology findings | 99.8\nExercise | 97.4 | Deliveries | 96.8 | Menopause | 100 | Cancer staging | 0\nAlcohol | 0 | Infertility | 0 | Post-menopause | 0 | Dating by histology | 64.3\nDrinks per week | 98.5 | Dyspareunia | 0 | Cycle length | 17.4 | Hormonal dating | 99.8\nEndometriosis | 0 | Dysmenorrhoea | 0 | Days of bleeding | 18.4 | Agreement of date | 0\nAge first symptoms | 98.6 | Analgesia | 0 | Contraceptive/hormone treatment | 59.9 | Comments | 70.1\nTable 5: Liverpool Data Features with Less than 10% Missing Data.\nFeature | Data Type | Description\nAge | Integer | Age of patient\nHeight (m) | Real | Height of patient in metres\nWeight (kg) | Real | Weight of patient in kilograms\nBMI | Real | BMI of patient\nSmoker | Binary | Whether or not the patient smokes\nAlcohol | Binary | Whether or not the patient consumes alcohol\nAdenomyosis | Binary | Whether or not the patient has been diagnosed with Adenomyosis\nMenorrhagia | Binary | Whether or not the patient has been diagnosed with Menorrhagia\nFibroids | Binary | Whether or not the patient has been diagnosed with Fibroids\nInfertility | Binary | Whether or not the patient is infertile\nDyspareunia | Binary | Whether or not the patient has been diagnosed with Dyspareunia\nDysmenorrhoea | Binary | Whether or not the patient has been diagnosed with Dysmenorrhoea\nAnalgesia | Binary | Whether or not the patient takes analgesia\nPain interferes with daily activities | Binary | Whether or not the patient experiences pain with daily activities\nNon-menstrual pelvic pain | Binary | Whether or not the patient experiences non-menstrual pelvic pain\nAnalgesia for pain | Binary | Whether or not the patient takes analgesia to relieve pain\nPain prevents daily activities | Binary | Whether or not the patient says that pain prevents them from performing daily activities\nPCOS | Binary | Whether or not the patient has polycystic ovary syndrome\nIrregular cycles | Binary | Whether or not the patient experiences irregular menstrual cycles\nCu coil | Binary | Whether the patient has ever had a Cu coil\nPost-menopausal | Binary | Whether or not the patient has had menopause\nHormones | Binary | Whether or not the patient is taking any hormonal replacement treatments\nPrevious ablation | Binary | Whether the patient has had a previous ablation\nEndometrial cancer | Binary | Whether or not the patient has or had endometrial cancer\nMetastatic lesion | Binary | Whether or not the patient had any cancerous lesions\nCancer staging agreement with Pathology | Binary | Whether or not the patient had an existing involvement within the cancer pathway\nAgreement of staging | Binary | Whether or not the patient had a staging agreement\nSample type | Categorical |\nParity | Categorical |\nMissing values in these data were found in Age, Height, Weight, BMI, Sample Type and Parity. Some missing values of Height, Weight and BMI could be calculated from the existing data: using the formula BMI = Weight / Height², we computed missing values where possible. The remaining missing data were imputed using scikit-learn’s SimpleImputer and IterativeImputer. IterativeImputer models each feature with missing values as a function of all other features when imputing. However, this only supports numerical data. Therefore, we imputed the missing values of Age, Height, Weight and BMI using this. For the categorical features, Sample type and Parity, the simpler SimpleImputer was used, which fills missing entries by considering only the distribution of the feature that is to be imputed.\nWe selected two diseases as our response variables for prediction (Table 6). Given our ultimate objective of predicting multimorbidity in patients, we constructed a final response variable, “Combined”, as a binary variable representing the presence of at least one of the other two response variables, akin to the data from Manchester. The formal definitions of these response variables are as follows:\ny_i^A = 1 if patient i develops Adenomyosis, and 0 if patient i does not develop Adenomyosis;\ny_i^M = 1 if patient i develops Menorrhagia, and 0 if patient i does not develop Menorrhagia;\ny_i^Comb = 1 if patient i develops at least one of the conditions, and 0 if patient i does not develop any of the conditions\n(Table 6).\nTable 6: Liverpool Data – Response Variables.\nVariable | Name | Description\ny^A | Adenomyosis | Whether the patient has been diagnosed with Adenomyosis\ny^M | Menorrhagia | Whether the patient has been diagnosed with Menorrhagia\ny^Comb | Combined | The presence of at least one of the above conditions\nWe studied the balance of the data for each response variable, as shown in Figure 2. We can see a large imbalance across all response variables. 
Over-sampling was used again here to balance the datasets before modelling was applied (Figure 2).\nFigure 2\nSynthetic Data\nTo address the data access and sample size concerns raised above, we employed the Synthetic Data Vault (SDV) package in Python to create synthetic data as a substitute and assessed its similarity to the real data. By leveraging sampling techniques such as random simulation, the synthetic data generator could produce a dataset with an expanded sample size that more accurately represents the entire population.\nDuring our data preparation, we eliminated numerous observations due to missing data. The synthetic data generator we use allows for missing values and will generate missing values in the same proportion as they appear in the real-world data. These missing values are then imputed later.\nWe utilised SDV’s Gaussian Copula model, which constructs a distribution over the unit cube [0, 1]^p from a multivariate normal distribution over ℝ^p by using the probability integral transform. The Gaussian Copula characterises the joint distribution of the random variables representing each feature by analysing the dependencies between their marginal distributions. Once the model is fitted to our data, it can be used to sample additional instances of data.\nManchester Data\nWe initiated our analysis with the Manchester data, and after fitting the Gaussian Copula to our 99 samples, we generated an additional 1000 samples.\nBy employing SDV’s SDMetrics library, we were able to evaluate the similarity between the real and synthetic data. We examined how closely the synthetic data relates to the real data in order to determine whether we had adequately captured the true distribution. 
This assessment involved comparing the distribution similarities across each feature, and we adopted two approaches for this evaluation.\nFigure 3: Age distribution shape comparison.\nInitially, we measured the similarities across each feature by comparing the shapes of their frequency plots. Figure 3 illustrates this comparison for the “age” distribution in the real and synthetic data (Figure 3).\nFor numerical data, SDV calculates the Kolmogorov-Smirnov (KS) statistic, which is the maximum difference between the cumulative distribution functions. The value of this distance is between 0 and 1, and SDV converts it to a score by:\nScore = 1 − KS statistic.\nFor Boolean data, SDV calculates the Total Variation Distance (TVD) between the real and synthetic data. We determined the frequency of each category value and represented it as a probability. The TVD statistic compares the differences in probabilities, as given by:\nδ(R, S) = (1/2) Σ_{ω∈Ω} |R_ω − S_ω|,\nwhere Ω is the set of possible categories and R_ω and S_ω are the frequencies of category ω in the real and synthetic dataset respectively. The similarity score is then given by:\nScore = 1 − δ(R, S).\nThe score for each feature is summarised in Figure 4, and we obtained an average similarity score of 0.92.\nFigure 4: Feature Distribution Shape Comparison.\nFor the second measure of similarity, we constructed a heatmap to compare the distribution across all possible combinations of categorical data. This was accomplished by calculating a score for each combination of categories. To initiate this process, two normalised contingency tables were constructed: one for the real-world data and one for the synthetic data. 
Let α and β be two features; the contingency tables describe the proportion of rows that have each combination of categories in α and β, thereby illustrating the joint distributions of these categories across the two datasets (Figure 4).\nTo compare the distributions, SDV calculated the difference between the contingency tables using the Total Variation Distance. This distance is subsequently subtracted from 1, so that a higher score denotes greater similarity. Let A and B be the sets of categories in features α and β respectively; the score between features α and β is calculated as follows:\nScore = 1 − (1/2) Σ_{a∈A} Σ_{b∈B} |S_{a,b} − R_{a,b}|,   (3)\nwhere S_{a,b} and R_{a,b} represent the proportions of rows in which categories a and b occur simultaneously, as derived from the contingency tables for the synthetic and real data, respectively. It is important to note that we did not employ a measure of association between features, such as Cramér’s V, since it does not measure the direction of the bias and may consequently yield misleading results.\nA score of 1 indicates that the contingency table was identical between the two datasets, while a score of 0 indicates that the two datasets were as dissimilar as possible. These scores for all combinations of features are depicted as a heatmap (Figure 5). It is worth noting that continuous features, such as “Age”, were discretised in order to use Equation (3) to determine a score.\nThe heatmap suggests that most features exhibit a strikingly similar distribution across the two datasets, with the exception of “Year of Diagnosis”. This discrepancy could potentially be attributed to the feature’s inherent nature as a date, despite being treated as an integer in the model. 
This issue merits further investigation.\nFigure 5: Distribution Comparison Heatmap.\nBased on these metrics, we concluded that the new data closely adhered to the distribution of the original data.\nLiverpool Data\nTo generate synthetic data, we adhered to the same procedure as with the Manchester data. We produced 1000 additional samples from a Gaussian Copula fitted to the 311 real samples and combined them with the real data to create a new dataset. Using contingency tables, we developed a heatmap by applying the formula in Equation (3) to generate scores; this heatmap is displayed in Figure 6. A score of 1 implies that the contingency table was identical between the two datasets, whereas a score of 0 indicates that the two datasets were as distinct as possible. Our analysis revealed an average similarity of 0.94 (Figure 6).\nFigure 6: Real Vs Synthetic Data Distribution Heatmap (Liverpool Data).\nWe compared the shape of the distributions for each feature; for instance, the distributions for the “Height” feature are illustrated in Figure 7. We observed that these two distributions were dissimilar. To calculate similarity scores, we employed the KS statistic for numerical features and Total Variation Distance for Boolean features. These scores are summarised in Figure 8. We found that the distributions of “Height” and “Weight” were not similar; however, the distributions of the remaining features exhibited similarity. With an average similarity of 0.75, we concluded that the data distributions were, on average, similar. 
The distributions of all categorical features were accurately captured, but two of the continuous features were not (Figure 7,8).\nFigure 7: Height Distribution Shape Comparison (Liverpool).\nFigure 8: Feature Distribution Shape Comparison Between Real and Synthetic Data (Liverpool).\nModels\nWe evaluated four standard classification models to predict the response variables: Logistic Regression (LR), Support Vector Machines (SVM), Random Forest (RF) and Gradient Boosting (GB), as they employ distinct methods of data separation and provide unique insights.\nLogistic regression enables us to determine the likelihood of each class occurring. It offers straightforward interpretability of the model’s coefficients, allowing us to conduct statistical tests on these coefficients to discern which features significantly impact the response variable’s value. While logistic regression adopts a more statistical approach by maximising the conditional likelihood of the training data, SVMs take a more geometric approach, maximising the margin between the hyperplanes that separate the data. We fitted both logistic regression and SVMs to compare the performance of these approaches.\nIn contrast to SVMs and logistic regression, which attempt to separate the data using a single decision boundary, random forests employ decision trees that partition the decision space into smaller regions using multiple decision boundaries.\nThe performance of these models varies depending on the nature of the data’s separability. Consequently, we fitted all four models and compared their accuracies to assess the usability of the synthetic data.\nLogistic Regression\nLet y = (y_1, ..., y_n) be the general vector of response variables and let x_i = (x_{i1}, ..., x_{ip}) be the corresponding vector of features for patient i. 
We defined the function:\n$σ_β(x_i) = P(y_i = 1) = 1 / (1 + e^{-β^T x_i})$\nas the probability of patient i developing the condition corresponding to y, where $β = (β_1, ..., β_p)$ are some weights. The prediction function is then defined to be:\n$f_β(x_i) = 0$ if $σ_β(x_i) < 0.5$, and $f_β(x_i) = 1$ if $σ_β(x_i) ≥ 0.5$.\nWe determined the optimal weights by solving the optimisation problem:\n$min_β L(β)$\nwhere, for logistic regression, the loss function L took the form:\n$L(β) = ∑_{i=1}^{n} [ -y_i log(σ_β(x_i)) - (1 - y_i) log(1 - σ_β(x_i)) ]$.\nFinally, we incorporated a regularisation term, weighted by the parameter λ, to prevent overfitting. This helped the model capture the underlying distribution of the data without becoming overly specific to the training data, and helped mitigate potential biases.\n$L(β) = (1/λ) ‖β‖_2^2 + ∑_{i=1}^{n} [ -y_i log(σ_β(x_i)) - (1 - y_i) log(1 - σ_β(x_i)) ]$ (4)\nSVMs\nNext, we examined Support Vector Machines. We redefined our response variables from binary {0,1} to binary {-1,1}. For instance, suppose $y_i^M$ represents the binary response for a patient developing a mental health condition; then $y_i^M$ is defined as:\n$y_i^M = 1$ if patient i developed a mental health condition, and $y_i^M = -1$ if patient i did not develop any mental health condition.\nFor SVMs, the prediction function takes the form:\n$f_β(x_i) = sign(β^T x_i - b)$\nwhere $β ∈ ℝ^p$ and $b ∈ ℝ$ are some weights. We considered the hinge loss function, defined as:\n$ℓ_hinge(β, b) := max(0, 1 - y_i(β^T x_i - b))$\nThe function $ℓ_hinge$ is 0 when $y_i(β^T x_i - b) ≥ 1$, which occurs when $f_β(x_i) = y_i$ and the point lies on or outside the margin. Conversely, when $f_β(x_i) ≠ y_i$, we would incur some penalty. 
Therefore, for SVMs, the loss function L takes the form:\n$L(β, b) = (1/λ) ‖β‖^2 + (1/n) ∑_{i=1}^{n} max(0, 1 - y_i(β^T x_i - b))$ (5)\nwhere λ is a parameter controlling the impact of the regularisation term. Similar to logistic regression, this term manages a trade-off between capturing the distribution of the entire population and overfitting to the training data.\nRandom Forest\nThe next model we fitted is the random forest predictor. Random forests classify data points through an ensemble of decision trees. The decision trees operate by separating the predictor space by a series of linear boundaries. As before, we let $y = (y_1, ..., y_n)$, $y ∈ {0,1}^n$, be our set of response variables with corresponding feature vectors $x = (x_1, ..., x_n)$, where each $x_i ∈ ℝ^p$. To build our random forest we followed the procedure:\nFor $b = 1, ..., B$:\na) Sample, with replacement, $x^b ∈ ℝ^{m×p}$ and $y^b ∈ {0,1}^m$ from x and y respectively.\nb) Fit k decision trees, $f_1^b, ..., f_k^b$, to the dataset $(x^b, y^b)$.\nWhen making predictions on unseen data, the model took the majority vote across all trees.\nGradient Boosting\nFinally, we fitted Gradient Boosting models to the data. Gradient Boosting shares some similarities with Random Forest: it is also an ensemble model, producing a prediction from an ensemble of many weaker decision tree models. The difference is that Gradient Boosting trains its trees sequentially, whereas Random Forest constructs trees independently.\nFor all experiments, we ran 5-fold cross-validation to test our models. The data were split into a training set and test set before the synthetic data were generated. This allowed us to avoid data leakage, giving a fair comparison between models trained on real-world data and those trained on synthetic data. 
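The split-before-generation protocol described above can be sketched as follows. This is a minimal illustration under stated assumptions: `k_fold_indices` and `toy_generator` are hypothetical names, the 100-row dataset is invented, and the resampling generator is only a stand-in for the Gaussian copula, used to show that the generator never sees held-out rows.

```python
import random


def k_fold_indices(n_rows, k, seed=0):
    '''Shuffle row indices once and yield (train, test) index lists for each of k folds.'''
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def toy_generator(train_rows, n_samples, rng):
    # Stand-in for the Gaussian copula: resample training rows with replacement.
    # The point is only that the generator is fitted on training rows alone.
    return [rng.choice(train_rows) for _ in range(n_samples)]


rows = list(range(100))  # hypothetical dataset of 100 patient records
rng = random.Random(1)
for train_idx, test_idx in k_fold_indices(len(rows), k=5):
    train_rows = [rows[i] for i in train_idx]
    test_rows = [rows[i] for i in test_idx]
    synthetic_rows = toy_generator(train_rows, 1000, rng)
    # Models would be fitted on train_rows or synthetic_rows and scored on test_rows.
    assert not set(synthetic_rows) & set(test_rows)
```

With 100 rows and 5 folds, each fold trains on 80 rows, tests on 20 and draws 1000 synthetic rows, which mirrors the fold sizes quoted for the Manchester data.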
To further ensure a fair test, the synthetic data were generated before any imputation was done.\nAll models contain at least one hyper-parameter, and we make use of grid searches to identify the optimal value of these. The result of the best-performing model is then presented.\nWe make use of two measures of performance: the classification accuracy, recording the percentage of correctly classified instances in the test set, and the AUC score, which gives an indication of how well the model can distinguish between classes.\nManchester Data\nAt each fold, the real-world training set contained 80% of the observations (approximately 80 observations), the test set contained 20% (approximately 20 observations) and the synthetic training data contained 1000 generated samples.\nLogistic Regression\nWe used scikit-learn to fit logistic regression models of the form in Equation (4). We performed a grid search to investigate the optimal value of λ. The accuracies of the best-performing λ for each response variable can be found in Table 7. We also record the Area Under the Receiver Operating Characteristic Curve (AUC) in Table 8.\nTable 7: Logistic Regression Accuracy Comparison Across Real and Synthetic Data.\nResponse | Real λ | Real Accuracy (%) | Synthetic λ | Synthetic Accuracy (%)\nIBS | 100 | 82.12 | 0.01 | 75.45\nMental Health | 0.0001 | 79.16 | 0.1 | 79.16\nComorbidities (Other) | 1 | 100 | 1 | 73.78\nCombined | 1 | 100 | 0.01 | 91\nAverage | - | 90.32 | - | 79.85\nTable 8: Logistic Regression AUC Comparison Across Real and Synthetic Data.\nResponse | Real λ | Real AUC | Synthetic λ | Synthetic AUC\nIBS | 1000 | 0.97 | 100000 | 0.5\nMental Health | 1000 | 0.94 | 1 | 0.77\nComorbidities (Other) | 10000 | 1 | 1 | 0.55\nCombined | 1 | 1 | 1 | 0.82\nAverage | - | 0.98 | - | 0.66\nWe can see that for all response variables, in terms of accuracy, the models trained on synthetic data performed as well as or worse than those trained on real data, with the largest drop for Comorbidities (Other) (100 to 73.78). 
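The two performance measures can diverge, and a small standard-library sketch shows how. The example below is hypothetical and makes no claim about the study's fitted models: it assumes an imbalanced test fold and a classifier that gives every patient the same score, which yields a respectable accuracy alongside an AUC of 0.5.

```python
def accuracy(y_true, y_pred):
    '''Fraction of instances whose predicted label matches the true label.'''
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def auc(y_true, scores):
    '''Probability that a random positive is scored above a random negative (ties count half).'''
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical imbalanced test fold: 16 negatives, 4 positives.
y_true = [0] * 16 + [1] * 4
# A model that gives every patient the same score and always predicts the majority class.
constant_scores = [0.2] * 20
constant_pred = [0] * 20

print(accuracy(y_true, constant_pred))  # 0.8
print(auc(y_true, constant_scores))     # 0.5
```

This is one possible reading of how accuracy can stay high while AUC sits near 0.5; whether it applies to any given row of the tables would need the per-fold class balance, which is not reported here.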
In terms of AUC, we see the models trained on synthetic data perform worse. The values indicate some poor performance in distinguishing classes.\nSVM\nWe used scikit-learn’s svm.SVC to train and test SVMs of the form in Equation (5) on our data. Scikit-learn is a popular and well-tested choice for SVMs that has shown high performance on a variety of types of datasets.\nSimilarly, a grid search was performed to find the optimal λ. Table 9 shows the accuracies of the best-performing value of λ for each response. From the accuracy scores, we can see a mixture of performances across both methods. For Mental Health, we see the model trained on synthetic data perform better; however, for the other response variables, we see it perform worse (Table 9).\nTable 9: SVM Comparison with Synthetic Data.\nResponse | Real λ | Real Accuracy (%) | Synthetic λ | Synthetic Accuracy (%)\nIBS | 10000 | 78.13 | 1000 | 70.83\nMental Health | 10000 | 58.33 | 10000 | 79.17\nComorbidities (Other) | 100000 | 75 | 10000 | 72.72\nCombined | 100000 | 100 | 100000 | 94.12\nAverage | - | 77.87 | - | 79.21\nRandom Forest\nWe fitted random forest models to the data. Using a grid search, we investigated 1, 5, 10, 20, 30, ..., 500 trees. The accuracy results of the best-performing models are summarised in Table 10, with the best-performing AUC presented in Table 11. From both measures of performance, we see the models trained on synthetic data perform worse. The AUC scores in particular suggest poor performance in distinguishing classes (Tables 10 and 11).\nTable 10: Random Forest Accuracy Comparison with Synthetic Data.\nResponse | Real No. Trees | Real Accuracy (%) | Synthetic No. Trees | Synthetic Accuracy (%)\nIBS | 170 | 87.5 | 1 | 84.38\nMental Health | 1 | 80.7 | 490 | 70.83\nComorbidities (Other) | 50 | 95.45 | 130 | 72.73\nCombined | 5 | 100 | 50 | 85.71\nAverage | - | 90.91 | - | 78.43\nTable 11: Random Forest AUC Comparison with Synthetic Data.\nResponse | Real No. Trees | Real AUC | Synthetic No. Trees | Synthetic AUC\nIBS | 10 | 1 | 30 | 0.58\nMental Health | 30 | 1 | 30 | 0.65\nComorbidities (Other) | 10 | 1 | 30 | 0.73\nCombined | 5 | 1 | 410 | 0.5\nAverage | - | 1 | - | 0.62\nGradient Boosting\nFinally, we fitted Gradient Boosting models to the data. Using a grid search, we investigated the optimal combination of the number of estimators in the values 100, 200, ..., 500 and the learning rate in the values 10^-4, ..., 10^0. The results of the best-performing combinations are summarised in Table 12. In terms of classification accuracy, we see the synthetic data out-perform the real-world data in the case of predicting Mental Health and IBS. However, the corresponding AUC scores, shown in Table 13, suggest poor performance in distinguishing classes (Tables 12 and 13).\nTable 12: Gradient Boosting Accuracy Comparison.\nResponse | Real No. Estimators | Real Learning Rate | Real Accuracy (%) | Synthetic No. Estimators | Synthetic Learning Rate | Synthetic Accuracy (%)\nIBS | 400 | 0.01 | 83.33 | 100 | 0.0001 | 91.67\nMental Health | 100 | 0.0001 | 77.27 | 100 | 1 | 100\nComorbidities (Other) | 100 | 0.1 | 100 | 100 | 0.0001 | 76.92\nCombined | 100 | 0.1 | 100 | 100 | 0.01 | 94.11\nAverage | - | - | 90.15 | - | - | 90.68\nTable 13: Gradient Boosting AUC Comparison.\nResponse | Real No. Estimators | Real Learning Rate | Real AUC | Synthetic No. Estimators | Synthetic Learning Rate | Synthetic AUC\nIBS | 100 | 1 | 1 | 500 | 1 | 0.41\nMental Health | 100 | 1 | 1 | 500 | 1 | 0.58\nComorbidities (Other) | 100 | 1 | 1 | 500 | 1 | 0.64\nCombined | 100 | 0.1 | 1 | 500 | 1 | 0.58\nAverage | - | - | 1 | - | - | 0.55\nUpon examining the average accuracies of all our models in Tables 14 and 15, we can draw some conclusions about the performance of the models trained on synthetic data compared to those trained on real data. 
It is evident that models trained on real-world data performed better than those trained on synthetic data in most cases. However, the average accuracy of the models trained on synthetic data is at most about 12 percentage points lower (Random Forest, 90.91 versus 78.43) and is marginally higher for SVM and Gradient Boosting, suggesting that we do not compromise a large amount of accuracy. The AUC scores, in some places, suggest a substantial compromise in the model’s ability to distinguish classes.\nTable 14: Solver Accuracy Comparison on Manchester Data.\nData | Logistic Regression | SVM | Random Forest | Gradient Boosting\nReal | 90.32 | 77.87 | 90.91 | 90.15\nSynthetic | 79.85 | 79.21 | 78.43 | 90.68\nTable 15: Solver AUC Comparison on Manchester Data.\nData | Logistic Regression | Random Forest | Gradient Boosting\nReal | 0.98 | 1.0 | 1.0\nSynthetic | 0.66 | 0.62 | 0.55\nSolver Comparison\nIn conclusion, the use of synthetic data proves to be a promising approach to training machine learning models when real data is limited or unavailable. The models trained on synthetic data in this study were not always able to out-perform those trained on real data, but they show the ability to retain high levels of accuracy. Many experiments show a classification accuracy of 100%. This is unlikely to happen in reality and suggests that the sample size is too small to make concrete conclusions in some cases. However, some of the findings support the adoption of synthetic data generation methods as a viable alternative to real data in machine learning applications, since the loss in accuracy is limited and in some cases accuracy slightly improves (Tables 14 and 15).\nSensitivity Analysis\nTo assess our model’s sensitivity, we introduced random noise to the data and measured the impact on model accuracy. We randomly selected 1% of points in each dataset and replaced their values. 
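The perturbation step can be sketched with the standard library alone. The helper names below are hypothetical, and drawing each replacement uniformly from the values observed in the same column is an assumption made for illustration. The relative change formula is checked against the logistic regression averages quoted in this section (90.32 falling to 90.15, and 79.85 falling to 78.43).

```python
import random


def inject_noise(rows, fraction=0.01, seed=0):
    '''Return a copy of rows with a fraction of cells replaced by uniform draws.'''
    rng = random.Random(seed)
    noisy = [list(row) for row in rows]
    n_rows, n_cols = len(rows), len(rows[0])
    cells = [(r, c) for r in range(n_rows) for c in range(n_cols)]
    n_noisy = max(1, round(fraction * len(cells)))
    for r, c in rng.sample(cells, n_noisy):
        # Assumption: the replacement is drawn uniformly from values observed in that column.
        observed = [row[c] for row in rows]
        noisy[r][c] = rng.choice(observed)
    return noisy


def relative_change(new_accuracy, old_accuracy):
    # Relative percentage change between the perturbed and the original accuracy.
    return 100.0 * (new_accuracy - old_accuracy) / old_accuracy


# Hypothetical 100 x 10 table of feature values.
table = [[r * 10 + c for c in range(10)] for r in range(100)]
noisy_table = inject_noise(table, fraction=0.01)

print(round(relative_change(90.15, 90.32), 2))  # -0.19
print(round(relative_change(78.43, 79.85), 2))  # -1.78
```

A change of a fraction of a percent on a test fold of about 20 observations corresponds to less than one reclassified patient per fold, so small differences in this analysis should be read cautiously.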
Table 16 summarises the accuracy of the new models and the relative percentage change in accuracy.\nTable 16: Sensitivity Analysis for Models on Manchester Data.\nData | Logistic Regression Accuracy (%) | Logistic Regression Change (%) | SVM Accuracy (%) | SVM Change (%) | Random Forest Accuracy (%) | Random Forest Change (%) | Gradient Boosting Accuracy (%) | Gradient Boosting Change (%)\nReal | 90.15 | -0.19 | 78.43 | 0.72 | 90.91 | 0.00 | 90.15 | 0.00\nSynthetic | 78.43 | -1.78 | 79.21 | 0.00 | 79.41 | 1.25 | 90.68 | 0.58\nTable 16 reveals that the accuracy of the model was impacted in some instances. The logistic regression model trained on synthetic data was affected by more than 1.7%, while the accuracy of its real-world trained counterpart changed by only 0.19%. Neither dataset shows a consistent pattern in how the models were affected.\nLiverpool Results\nA similar 5-fold approach was taken to train models on the Liverpool dataset. At each fold, the real-world training set contained 80% of the observations (approximately 271 observations), the test set contained 20% (approximately 67 observations) and the synthetic training data contained 1000 generated samples.\nLogistic Regression\nWe used scikit-learn to fit logistic regression models of the form in Equation (4). We performed a grid search to investigate the optimal value of λ. The accuracies of the best-performing λ for each response variable can be found in Table 17. 
We also record the Area Under the Receiver Operating Characteristic Curve (AUC), as shown in Table 18.\nTable 17: Logistic Regression Accuracy Comparison.\nResponse | Real λ | Real Accuracy (%) | Synthetic λ | Synthetic Accuracy (%)\nAdenomyosis | 100000 | 100 | 0.1 | 94.7\nMenorrhagia | 1 | 100 | 0.001 | 99.07\nCombined | 0.1 | 100 | 1 | 98.46\nAverage | - | 100 | - | 97.41\nTable 18: Logistic Regression AUC Comparison.\nResponse | Real λ | Real AUC | Synthetic λ | Synthetic AUC\nAdenomyosis | 10000 | 1 | 100000 | 0.67\nMenorrhagia | 100 | 1 | 1000 | 0.71\nCombined | 1 | 1 | 10 | 0.98\nAverage | - | 1 | - | 0.79\nWe see that in all cases of real-world data, the accuracy is recorded at 100%. This is perhaps a consequence of a small sample size. Across all response variables, we see the models trained on synthetic data perform slightly worse. However, the accuracy is not largely compromised.\nSVM\nUsing the same method as for the Manchester data, we trained SVMs and compared the accuracy for various values of λ. The best-performing models are summarised in Table 19.\nTable 19: SVM Accuracy Comparison.\nResponse | Real λ | Real Accuracy (%) | Synthetic λ | Synthetic Accuracy (%)\nAdenomyosis | 100 | 100 | 10000 | 93.75\nMenorrhagia | 100 | 100 | 100 | 100\nCombined | 100 | 100 | 100 | 100\nAverage | - | 100 | - | 97.92\nWe can see from Table 19 that the models trained on synthetic data performed the same as or slightly worse than their real-world counterparts. This again supports the idea that synthetic data may be used as a substitute for real-world data without compromising much accuracy.\nRandom Forest\nSimilarly to the Manchester data, we fitted random forest models, using a grid search to investigate 1, 5, 10, 20, 30, ..., 500 trees. The results of the best-performing models are summarised in Table 20 with accuracy scores and Table 21 with AUC scores. From both measures of performance, we see the models trained on synthetic data perform worse. The AUC scores in particular suggest some poor performance in distinguishing classes, such as for predicting Adenomyosis. 
However, the results for predicting Menorrhagia support the use of synthetic data, with minimal loss in accuracy and AUC (Tables 20 and 21).\nTable 20: Random Forest Accuracy Comparison.\nResponse | Real No. Trees | Real Accuracy (%) | Synthetic No. Trees | Synthetic Accuracy (%)\nAdenomyosis | 1 | 100 | 1 | 96.43\nMenorrhagia | 5 | 100 | 10 | 98.46\nCombined | 5 | 100 | 30 | 95.38\nAverage | - | 100 | - | 96.76\nTable 21: Random Forest AUC Comparison.\nResponse | Real No. Trees | Real AUC | Synthetic No. Trees | Synthetic AUC\nAdenomyosis | 5 | 1 | 5 | 0.49\nMenorrhagia | 30 | 1 | 50 | 0.98\nCombined | 5 | 1 | 50 | 0.95\nAverage | - | 1 | - | 0.81\nGradient Boosting\nFinally, we investigated using Gradient Boosting models, again using a grid search to investigate the optimal combination of the number of estimators in the values 100, 200, ..., 500 and the learning rate in the values 10^-4, ..., 10^0.\nThe results of the best-performing combinations are summarised in Table 22 for accuracy and Table 23 for AUC. The accuracy of the synthetically trained models remains consistent with or slightly worse than that of their real-world counterparts, supporting the use of synthetic data without a large loss in accuracy. The AUC scores, however, suggest a larger compromise in distinguishing classes (Tables 22 and 23).\nTable 22: Gradient Boosting Accuracy Comparison.\nResponse | Real No. Estimators | Real Learning Rate | Real Accuracy (%) | Synthetic No. Estimators | Synthetic Learning Rate | Synthetic Accuracy (%)\nAdenomyosis | 100 | 0.1 | 100 | 100 | 0.0001 | 99.24\nMenorrhagia | 100 | 0.0001 | 100 | 100 | 0.0001 | 100\nCombined | 100 | 0.0001 | 100 | 100 | 0.0001 | 100\nAverage | - | - | 100 | - | - | 99.75\nTable 23: Gradient Boosting AUC Comparison.\nResponse | Real No. Estimators | Real Learning Rate | Real AUC | Synthetic No. Estimators | Synthetic Learning Rate | Synthetic AUC\nAdenomyosis | 100 | 1 | 1 | 500 | 1 | 0.47\nMenorrhagia | 100 | 0.1 | 1 | 500 | 1 | 0.76\nCombined | 100 | 0.1 | 1 | 500 | 1 | 0.66\nAverage | - | - | 1 | - | - | 0.63\nSolver Comparison\nTo summarise, the average accuracies of all models are presented in Table 24, along with their AUC scores in Table 25. 
Overall, the models trained on real-world data performed better. However, the accuracy measures suggest that the use of synthetic data does not substantially reduce accuracy, while the AUC scores suggest a larger impact on the ability to distinguish classes (Tables 24 and 25).\nTable 24: Solver Accuracy Comparison on Liverpool Data.\nData | Logistic Regression | SVM | Random Forest | Gradient Boosting\nReal | 100.00 | 100.00 | 100.00 | 100.00\nSynthetic | 97.41 | 97.72 | 96.76 | 99.75\nTable 25: Solver AUC Comparison on Liverpool Data.\nData | Logistic Regression | Random Forest | Gradient Boosting\nReal | 1.0 | 1.0 | 1.0\nSynthetic | 0.79 | 0.81 | 0.63\nSensitivity Analysis\nTo test the sensitivity of our models, we added random noise to the data and measured its impact on model accuracy. We selected 1% of points in each dataset uniformly at random to introduce noise. The values at these points were replaced by random samples from a uniform distribution over the feature’s possible values. Table 26 displays the accuracy of the new models and their relative percentage change in accuracy.\nTable 26: Sensitivity Analysis on Liverpool Data.\nData | Logistic Regression Accuracy (%) | Logistic Regression Change (%) | SVM Accuracy (%) | SVM Change (%) | Random Forest Accuracy (%) | Random Forest Change (%) | Gradient Boosting Accuracy (%) | Gradient Boosting Change (%)\nReal | 99.75 | -0.25 | 100 | 0.00 | 100 | 0.00 | 100 | 0.00\nSynthetic | 99.75 | 2.40 | 97.72 | 0.00 | 97.41 | -0.67 | 99.75 | 0.00\nFrom Table 26, we can observe that the performance of the SVM and Random Forest models experienced minimal change. However, the logistic regression model trained on synthetic data showed a larger change in accuracy, indicating some sensitivity to perturbations in the data. 
This suggests that for logistic regression, it is crucial for the synthetic data’s distribution to closely resemble the real data, as the models are sensitive to small variations (Table 27).\nTable 27: Comparison of all Models.\nData | Logistic Regression, Manchester | Logistic Regression, Liverpool | SVM, Manchester | SVM, Liverpool | Random Forest, Manchester | Random Forest, Liverpool | Gradient Boosting, Manchester | Gradient Boosting, Liverpool\nReal | 90.32 | 100 | 77.87 | 100 | 90.91 | 100 | 90.15 | 100\nSynthetic | 79.85 | 97.41 | 79.21 | 97.72 | 78.43 | 96.76 | 90.68 | 99.75\nTable 27 compares the model accuracies (%) across both datasets. We observed that the models trained on the Liverpool dataset consistently out-perform those trained on the Manchester dataset, for both real and synthetic data.\nThe two datasets documented different attributes of individuals and contained varying numbers of features and observations. The Liverpool dataset had a larger number of both features and observations, and our method performed well in both datasets. These results support the idea that our method can be applied to a diverse range of datasets. The experiments have also demonstrated how effective our method is with both continuous and categorical data. From the distribution analysis of the Liverpool synthetic data, we observed that our method’s performance was weakest on two continuous features.\nThroughout the experiments, we showed that models trained on synthetic data performed similarly to or slightly worse than those trained on real data. Since all models were tested on real data, this evidence supports the argument that synthetic data can be used as a replacement for real data with minimal compromise on accuracy. However, in some cases, we see a significant compromise in AUC score.\nDiscussion\nMultimorbidity is a growing concern within the global population, particularly for those with chronic conditions like endometriosis, where treatment options are limited. 
Predicting multimorbidity is challenging among endometriosis patients due to late diagnoses. Therefore, employing machine learning methods to use key features to predict the possibility of multimorbidity is valuable for healthcare services, patients and clinicians. Our findings suggest that the method could be replicated for other complex women’s health conditions such as polycystic ovary syndrome, gestational diabetes or fibroids.\nOur findings indicate that the real-world dataset contained one variable as a significant indicator for developing multimorbidity and highlighted the usefulness of synthetic data for future research, especially in cases with higher rates of missing data. Synthetic data can also provide more detailed information regarding the relationships between these variables, as they could be considered significant indicators. These indicators can be used to differentiate between samples with symptoms and those with disease sequelae that would influence the clinical decision-making process, particularly for patients requiring excision surgery. With a larger sample size and better representation of the overall population, synthetic data has the potential to provide more detailed information about the significance of each feature.\nPrevious research used methods such as pairwise comparisons to assess diseases in pairs and combined results where appropriate with similar diseases. This technique may have a higher error rate, as complex chronic diseases do not follow a one-size-fits-all approach. Whilst the co-occurrence frequencies observed with pairwise techniques may differ from the predicted frequencies, these techniques can still show a correlation, as indicated by Hidalgo and colleagues’ disease network that represented nodes and edges [6]. 
This is akin to a network meta-analysis approach. A limitation of this approach in disease prediction could be the lack of temporal data in the resulting network nodes, necessitating an additional analysis such as a correlation evaluation [6]. This also means that records with missing data points may be entirely deleted, impacting the final analysis and any subsequent conclusions. Correlation analyses would enable researchers and clinicians to understand the spread of the diseases based on the links shown within the network that can be modelled over time [6]. Jensen and colleagues demonstrated a similar temporal network approach, showing that a pairwise method can be combined with a correlation analysis over time [7]. Giannoula and colleagues used this approach to reveal disease clusters, using time warping along with a pairwise method to mine multimorbidity patterns and phenotyping with extensive data points [8]. In comparison, our combined approach of machine learning on a synchronised dataset may provide better multimorbidity prediction.\nAnother class of models used to predict multimorbidity is probabilistic methods, which focus on the relationships among diseases rather than a pairwise approach. Strauss and colleagues employed this method to model a small real-world dataset from the UK evaluating multimorbidity cluster trajectories. Individual patients were grouped in clusters based on the number of chronic conditions detected within their healthcare record over a specific period. These clusters were divided into four main categories, including the presence or absence of chronic problems in the number of comorbidities. 
However, this approach did not consider patients with undiagnosed symptoms aligned with chronic conditions, which is a common observation in real-world data.\nThe distribution of the synthetic data captures the true distribution of the real-world data but can have an arbitrarily larger sample size, indicating that synthetic data has the potential to provide valuable insight for healthcare services. To address the increasing and complex healthcare demands of a growing population, effective clinical service design is crucial for healthcare sustainability. Moreover, our results show that synthetic data accurately represents the real data and so can be used in place of the real data in cases where the real data contains sensitive or private information that cannot be shared. The accuracy measures of our models support the hypothesis that the use of synthetic data does not substantially reduce the accuracy of the prediction models used in this analysis, although the AUC scores were lower.\nLimitations\nThe model performance will need to be tested on more complex and larger datasets to ensure that a digital clinical trial can be conducted to optimise the model performance.\nConclusion\nOur study created an exploratory machine learning model that can predict multimorbidity among women with endometriosis using real-world and synthetic data. Before experimenting with the models developed using the real-world dataset, a quality assessment test was conducted by comparing the synthetic and real-world datasets. Distribution and similarity plots suggested that the synthetic data did indeed follow the same distribution as the real-world data. 
Therefore, synthetic data generation shows great promise, especially for conducting high-quality clinical epidemiology and clinical trials that could devise better precision treatments for endometriosis and possibly prevent multimorbidity.\nDeclarations\nConflicts of Interest\nPP has received a research grant from Novo Nordisk, Janssen Cilag, and other, educational from the Queen Mary University of London, other from John Wiley & Sons, outside the submitted work.\nAll other authors report no conflict of interest. The views expressed are those of the authors and not necessarily those of the NHS, the National Institute for Health Research, the Department of Health and Social Care or the Academic institutions.\nAvailability of Data and Material\nThe authors will consider sharing the dataset gathered upon receipt of reasonable requests.\nCode Availability\nThe authors will consider sharing the dataset gathered upon receipt of reasonable requests.\nAuthor Contributions\nFEINMAN is part of the ELEMI program developed and conceptualised by GD. GD and PP conceptualised and developed work package 1 of the FEINMAN project. GD devised the use of synthetic data to better assess chronic diseases. GD devised the hypothesis for using synthetic data modelled on clinical symptoms to develop optimal prediction models. GD, AZ and PP furthered the study protocol. GD developed the method and furthered this with PP, AZ, DB, JQS, HC, DKP and AS. GD, DB, PP and AZ designed and executed the analysis plan. All authors critically appraised, commented and agreed on the final manuscript. All authors approved the final manuscript.\nReferences\n1. Delanerolle G, Ramakrishnan R, Hapangama D, Zeng Y, Shetty A, et al. 
\n(2021) A systematic review and meta-analysis of the Endometriosis and \nMental-Health Sequelae; The ELEMI Project. Womens Health (Lond).\n2. Alimohammadian M, Majidi A, Yaseri M, Ahmadi B, Islami F, et al. (2017) \nMultimorbidity as an important issue among women: results of a gender \ndifference investigation in a large population-based cross-sectional \nstudy in West Asia. BMJ open 7(5): e013548. \n3. Tripp Reimer T , Williams JK, Gardner SE, Rakel B, Herr K, et al. (2020) \nAn integrated model of multimorbidity and symptom science. Nursing \noutlook 68(4): 430-439.\n4. Oni T , McGrath N, BeLue R, Roderick P , Colagiuri S, et al. (2014) Chronic \ndiseases and multi-morbidity-a conceptual modification to the WHO \nICCC model for countries in health transition. BMC public health 14(1): \n1-7.\n5. Delanerolle GK, Shetty S, Raymont V (2021) A perspective: use of \nmachine learning models to predict the risk of multimorbidity. LOJ \nMedical Sciences 5(5).\n6. Hassaine A, Salimi Khorshidi G, Canoy D, Rahimi K (2020) Untangling \nthe complexity of multimorbidity with machine learning. Mechanisms of \nageing and development 190: 111325.\n7. Jensen AB, Moseley PL, Oprea TI, Ellesøe SG, Eriksson R, et al. (2014) \nTemporal disease trajectories condensed from population-wide registry \ndata covering 6.2 million patients. Nature communications 5(1): 4022. \n8. Giannoula A, Gutierrez Sacristán A, Bravo Á, Sanz F, Furlong LI (2018) \nIdentifying temporal patterns in patient disease trajectories using \ndynamic time warping: A population-based study. Scientific reports \n8(1): 1-4.","source_license":"CC0","license_restricted":false}