Exploratory Report on Data Synchronising Methods to Develop Machine Learning-Based Prediction Models for Multimorbidity
preprint
OA: green
CC0
Abstract
Endometriosis is a complex chronic condition characterised by chronic pelvic pain, dysmenorrhea, anxiety and fatigue. It can often lead to multimorbidity, defined as the presence of two or more long-term conditions. Delayed diagnosis of endometriosis is a crucial issue that leads to poor quality of life and poor clinical management. Endometriosis research faces a variety of limitations, including a lack of dedicated funding. Additionally, accessing existing electronic healthcare records can be challenging due to governance and regulatory restrictions, and missing data are a common concern in real-world studies. Considering these challenges, data science techniques could provide a solution through synthetic datasets, generated from known characteristics of endometriosis, that can be used to explore the possibility of predicting multimorbidity. This study aimed to develop an exploratory machine learning model that can predict multimorbidity among women with endometriosis using real-world and synthetic data. A sample of 1012 records from two specialised endometriosis centres in the UK was used. In addition, 1000 synthetic data records per centre were generated with the Synthetic Data Vault's Gaussian Copula model, based on the characteristics of the patients' records. Three standard classification models were used: Logistic Regression (LR), Support Vector Machine (SVM), and Random Forest (RF). The average accuracies, given as "model: accuracy at centre 1 : accuracy at centre 2", were LR 64.26%:69.04%, SVM 67.35%:68.61%, and RF 58.67%:73.76% on real-world data, and LR 69.9%:72.29%, SVM 69.39%:70.13%, and RF 68.88%:74.62% on synthetic data. In this report, machine learning models trained on synthetic data achieved higher average accuracy than models trained on real-world data for every model and centre.
Our findings suggest synthetic data holds promise for conducting clinical epidemiology studies and clinical trials that could devise better precision treatments and possibly reduce the burden of multimorbidity.
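The comparison between real-world and synthetic results can be checked with a short calculation. The sketch below is illustrative only: it hard-codes the average accuracies reported in the abstract and computes the synthetic-minus-real difference in percentage points for each model and centre. It assumes that the two values printed without a percent sign (70.13 and 74.62) are percentages like the others; the function and variable names are hypothetical.

```python
# Average accuracies (%) as reported in the abstract, keyed by model,
# stored as (centre 1, centre 2). Assumption: 70.13 and 74.62 are percentages.
real = {"LR": (64.26, 69.04), "SVM": (67.35, 68.61), "RF": (58.67, 73.76)}
synthetic = {"LR": (69.90, 72.29), "SVM": (69.39, 70.13), "RF": (68.88, 74.62)}


def accuracy_gains(real, synthetic):
    """Return synthetic-minus-real accuracy, in percentage points, per model and centre."""
    return {
        model: tuple(round(s - r, 2) for r, s in zip(real[model], synthetic[model]))
        for model in real
    }


gains = accuracy_gains(real, synthetic)
for model, (centre1, centre2) in gains.items():
    print(f"{model}: centre 1 {centre1:+.2f} pp, centre 2 {centre2:+.2f} pp")
```

All six differences are positive, and the largest is for RF at centre 1. The abstract reports only average accuracies, so this calculation says nothing about variability or statistical significance.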
Source provenance
- europepmc
- last seen: 2026-06-14T06:08:20.186862+00:00
- openalex
- last seen: 2026-06-10T17:14:06.276822+00:00
License: CC0
· commercial use OK