{"paper_id":"41967213-ffbf-4923-88dd-2769bfff1e34","body_text":"Female reproductive disorders (FRDs) such as endometriosis, uterine fibroids,\nand ovarian cysts significantly affect physical and emotional health, disability,\nand fertility for women and those assigned female at birth [ 1 ]. FRDs fall into a category of conditions that are\noften misdiagnosed and have prolonged diagnostic timeframes and limited therapeutic\noptions [ 2 , 3 ]. Prevalence of common FRDs such as endometriosis is often\nunderestimated given the clinical difficulty of identifying the condition without\ninvasive laparoscopic surgery and the often years-long lag between symptom onset and\ndiagnosis [ 2 , 4 ]. Due to their widespread prevalence and substantial impact on daily\nlife, ways to more easily identify FRDs as well as viable therapeutic approaches for\nFRDs are highly sought after [ 5 – 7 ]. Diet and environment have been proposed as\npotential intervention opportunities for FRDs [ 8 , 9 ], but standard clinical\nrecommendations on diet and exposures are limited. Focusing on modifiable features\nsuch as diet, lifestyle factors, and environmental exposures may offer new options\nfor individuals and care providers to manage these common conditions and improve\noutcomes. We present an innovative approach for assessing survey-based data to\npredict links between nutrition, environmental exposures, comorbidity, and\nmedication and three common FRDs, namely endometriosis, uterine fibroids, and\novarian cysts.\nEndometriosis is the extrauterine growth of endometrial tissue (also\ncalled lesions) with hallmark symptoms that include pelvic pain, dysuria,\ndysmenorrhea, and sub- or infertility [ 10 ]. This FRD is estimated to occur in 10 % of women [ 11 ]. Delays in diagnosis are common with\nendometriosis, and many individuals wait years for a conclusive diagnosis [ 2 , 4 ].\nAccordingly, estimates of prevalence vary widely and are likely inaccurate. 
An\nestimated 35–50 % of individuals diagnosed with endometriosis experience\npain and/or infertility [ 5 ], but\napproximately 20–25 % of individuals with endometriosis do not experience\npelvic pain [ 5 , 12 , 13 ].\nBecause symptoms can be inconsistent, clinical diagnosis is difficult.\nEndometriosis is often diagnosed during treatment for fertility issues [ 14 , 15 ]. Endometriosis can present similarly to other gynecological\ndisorders including primary dysmenorrhea, pelvic inflammatory disease, and\npelvic adhesions presenting as chronic pelvic pain, painful menses, tubal\npregnancies, and infertility [ 2 , 3 ]. Due to its inconsistent presentation,\nsurgical visualization is needed to definitively diagnose endometriosis, which\nis a barrier to diagnosis and treatment [ 2 ].\nUterine fibroids, also called leiomyomas, are common benign tumors\nestimated to be present in 70–80 % of women by the age of menopause,\n[ 16 ] and approximately 20–25 %\nof those individuals present with clinical symptoms [ 17 ]. The fibroids are composed of smooth muscle cells\nand fibrous extracellular matrix that is overproduced and creates tumors within\nthe myometrium [ 18 ]. Many women with\nfibroids are not clinically diagnosed. Some have no symptoms, and some live with\nsignificantly burdensome symptoms without a clinical diagnosis. The high\nprevalence of undiagnosed fibroids means that prevalence may be underestimated\nwhen determined using clinical records. Common fibroid symptoms include heavy\nmenses, pelvic pain, anemia, urinary incontinence, and infertility [ 18 – 20 ]. With symptomatic fibroids, pregnancy complications (placenta\nprevia, intrauterine growth restriction, increased need for cesarean section)\ncan be more common [ 21 ]. 
Diagnosis of\nfibroids is usually accomplished with a variety of imaging techniques, including\ntransvaginal ultrasound, hysterosalpingography, saline infusion sonography,\nhysteroscopy, and magnetic resonance imaging (MRI) [ 21 – 23 ].\nOvarian cysts affect approximately one in 25 women [ 7 ]. There are multiple types of ovarian cysts, but\nfunctional cysts are the most prevalent. Functional cysts occur when a follicle\nforms in the ovary, but no ovulation ensues and the follicle does not rupture,\ncreating a cyst [ 24 ]. The most frequently\nreported symptoms of ovarian cysts are pelvic pain, abdominal pressure,\nbloating, and infertility although asymptomatic ovarian cysts can occur [ 25 , 26 ]. Asymptomatic ovarian cysts can be left untreated and may not\nrequire intervention, with some cysts disappearing naturally. However, cysts\naffecting fertility, pelvic anatomy, or quality of life in a significant way can\nbe surgically removed [ 27 ]. While\npolycystic ovary syndrome (PCOS) is a condition that includes the presence of\novarian cysts, this investigation does not include PCOS as a primary outcome of\ninterest.\nOntologies are a methodology for standardizing terminology in a\ncomputable fashion to support the creation of logical axioms between related\nterms. Prominent ontologies in the biomedical sciences include the Gene Ontology\n[ 28 ] and the Human Phenotype Ontology\n[ 29 ], with many others related to\nfoods, chemicals, and diseases [ 30 – 32 ]. Knowledge\ngraphs (KGs) are a method for representing knowledge such as ontology content\nand instance level data in a graph structure in which nodes and edges are\nexplicitly connected via semantic relationships [ 33 ]. Because of their innate high dimensionality, data inquiries can\nbe conducted using KGs. However, the dimensionality of KGs can be reduced\nthrough embedding so they can support other analytic methodologies [ 34 ]. 
In our investigation, we aligned\nheterogeneous data regarding health, environment, and internal exposures to\nontology content for ingestion into a KG, which was subsequently embedded and\nanalyzed using machine learning techniques.\n\nThe primary data for this project came from the Personalized Environment\nand Genes Study (PEGS, formerly known as the Environmental Polymorphisms\nRegistry) conducted by the National Institute of Environmental Health Sciences\n(NIEHS) [ 35 , 36 ], which includes data from three respondent\nsurveys, the Health and Exposure (self-reported diseases and phenotypes),\nInternal Exposome (foods, medications, supplements, and ingested exposures), and\nExternal Exposome (environmental exposures) surveys. Survey respondents are\nadult (aged 18 years or more) residents of North Carolina recruited for\nvoluntary participation through health providers or events such as health fairs.\nThe data included in this investigation were collected between 2012 and 2020.\nPEGS data is available by request only from NIEHS. This investigation was\napproved and deemed research with no human subjects (Category 4 exemption) by\nthe Oregon State University (IRB-2021–1207).\nAdditional publicly available data were included in this investigation.\nAgricultural Chemical Usage Program (ACUP) data from the United States\nDepartment of Agriculture (USDA) on fungicides, pesticides, and other chemicals\napplied to agricultural crops during 2016–2020 was included for all\nrelevant questions in the PEGS data sets (for instance, data on chemicals\napplied to carrots was included as PEGS inquires about consumption of carrots).\nACUP data were not included if there was no related PEGS question, and not all\nPEGS questions about diet had related ACUP data (for example, consumption of\ncombination foods such as hamburgers or foods without crop components, such as\nmeat). 
Nutrient data for Foundation Foods from the USDA FoodData Central (FDC)\nwere included when available with references to the FoodOn ontology [ 32 ]. This allowed for direct mapping to the\nselected ontology alignment (for instance, a survey question on intake of\ncottage cheese mapped to FOODON:03303720; and ‘cottage cheese\n(lowfat)’ mapped to FDC ID: 328841 and FDC nutritional content for\n‘Cheese, cottage, lowfat, 2 % milkfat’).\nCombined, the PEGS surveys comprise 1842 questions. We assessed the\nsurvey questions for ontology alignment based on existing ontology content and\ncomplexity of the survey question as well as the primary topic area. We focused\non questions related to diseases, phenotypes, dietary exposures, and\nenvironmental exposures. We then aligned feasible survey questions of interest\n(n = 341, with 135 from the External Exposure Survey, 131 from the Internal\nExposure Survey, and 75 from the Health and Exposure Survey) to ontology\nterminology. An ontology curator (author LC) manually reviewed the data to map\nthe PEGS survey questions to the corresponding ontology content. Free-response\ncomponents of the PEGS surveys and other data sets, including USDA ACUP data,\nwere mapped to ontology terms using semi-automated curation with OntoRunNER\n[ 37 ], followed by supplemental manual\nreview by the curator. The ‘survey question label’ selected for\nfree-response questions was assigned the mapped ontology term value of the\nresponse due to the list aggregation used to process data via OntoRunNER. When\nnecessary, we requested new ontology terms to support the mappings\nneeded for this data alignment. Primary requests were made to the Food Ontology\n(FoodOn) [ 32 ] and the Environmental\nConditions, Treatments, and Exposures Ontology (ECTO) [ 38 ].\nWe created the KG for this project with an extract, transform, load\n(ETL) pipeline constructed using the Knowledge Graph Hub project KG-template\n[ 39 ]. 
The KG-template offers a\nskeleton structure of data download, transformation, and merge scripts that we\ncustomized for this project. This pipeline was developed using Python (Version\n3.90.10) and Koza [ 40 ], a data\ntransformation framework constructed by the Monarch Initiative. Transformations\nincluded the alignment of self-reported data for questions of interest with the\nontology mappings generated manually or semi-automatically as described in  Fig. 1 . Code used for KG development is\navailable at our GitHub repository(41).\nWe conducted each data transformation (for instance, disease, phenotype,\nmedication, food) with a unique script that asserted the correct\n“predicate” (for example, the phenotype transform created\nassertions such as ‘Person:1234’ ‘has phenotype’\n‘uterine leiomyoma’). We followed this process for all PEGS data\nand all supplemental data on food, chemical usage, and nutrient content.  Fig. 2  provides an example of the full\nmapping and transformation process, in which reusable nodes were generated for a\nrespondent’s unique ID and their survey responses. In turn, all questions\nanswered by a respondent were mapped to the same respondent node using their ID.\nSimilarly, all respondents who answered the same question were mapped to the\nsame question response node. In addition to the transformed respondent data, the\nfull contents of relevant ontologies (Human Phenotype Ontology (HPO), Mondo\nDisease Ontology (Mondo), Medical Actions Ontology (MAxO), Gene Ontology (GO),\nEnvironment Ontology (ENVO), Chemical Entities of Biological Interest (ChEBI),\nECTO, and FoodOn) were merged to create the KG. Within the KG structure, each\nontology term or survey participant was considered a “node”, with\nall relationships between each node considered an “edge”.\nAs with many KGs, the KG for this project was a high-dimensional object\nwith a large number of nodes and edges, making it less amenable to machine\nlearning. 
Lower-dimension forms of a KG allow for improved generalization of\nknowledge, as the latent representation places dissimilar nodes farther away\nfrom one another and nodes with greater similarity closer to each other. To\nreduce the dimensionality of the KG in preparation for machine-learning\ntechniques, we embedded the KG using Graph Representation leArning, Predictions\nand Evaluation (GRAPE) [ 41 ] and its\nembedding library. We used only the largest component of the KG, which\neliminated data from 691 (7.1 %) survey respondents due to insufficient data.\nThe generated embedded representations included ontology terms, exposures,\nclinical variables, FRDs, and respondents. As such, the resulting\nrepresentations embedded the topological relationships between the different\ntypes of entities populating the KG in a vectorial space. Additional details can\nbe found in the  Supplemental\nMethods .\nFor the following machine learning methods, we generated two\nedge-embedding versions, a training embedding and a full data embedding. The\ntraining embedding included a ‘Training’ portion comprising 70 %\nof the graph and a ‘Test’ portion comprising the remaining 30 %.\nWe created the test portion by selecting and holding out edges that, when\nremoved from the full embedding, did not create a new component and thus kept\nthe primary component of the graph intact. This avoided a biased estimation of\nthe edge prediction results for the test set (see the GRAPE github repository\nfor a full description of the method [ 42 ]). Edges in the training set were not specifically selected as\n“positive” responses (for example, edges documenting an\nFRD-variable relationship), in efforts to train the model for edge prediction\nbased on the entire topology of the graph. The full embedding included all\navailable data.  Fig. 
3  summarizes the\nanalytical methods.\nRandom forests (RF) [ 43 ] are\nmachine-learning classifiers used for computing medical predictions due to their\ninherent explainability and interpretability and the availability of methods\n(although preliminary) to convert them into a checklist of rules [ 44 , 45 ].\nOur primary machine-learning task was applied to the KG we created,\ngenerating link predictions between variables (for example, food, nutrient,\nenvironmental exposure, disease, phenotype) and the FRDs of interest. We then\ntrained an RF model (501 trees, maximum depth of 15) using the embeddings of the\ntraining data (with holdouts). The standard machine-learning performance metrics\nindicated the model was trained successfully and suitable for our analysis (area\nunder the receiver operating characteristic curve (AUROC) = 0.915 for the\n‘Test’ portion of the training data). To produce actionable\nresults, we then retrained the model on the full dataset to obtain a set of\npredicted links between the FRDs and other variables. In the output, predicted\nlinks were represented by two node values, the “source”\n(independent variable) and “destination” (dependent variable)\nnodes of the link, and a “prediction” score indicating the\nstrength of the predicted link between the two nodes. Utilizing the full graph\nembedding, we selected prediction outcomes from the model that included an FRD\n(for example, endometriosis, ovarian cysts, uterine fibroids) as the\n“source” and the resulting “destination”. We\nretained pairs with a prediction score > 0.8, resulting in a list of\npredicted variables for each FRD of interest.\nFor additional comparison of our KG findings, we conducted a secondary\nanalysis using elastic nets, RFs, and logistic regression models to provide\nfeature explanations (in terms of feature importance in prediction) and\ninterpretations (in terms of the directionality of risk scores associated with\neach feature). 
We conducted this analysis in R, version 4.20.2. We cleaned the\nprimary PEGS data on health conditions and internal and external exposures to\ninclude female participants only. We then excluded participants who did not\ncomplete all three surveys to improve data quality, given the lower response\nrates to the Internal and External Exposure Surveys versus the Health and\nExposure Survey. For the regression analysis, we utilized only survey questions\nthat aligned with the KG analysis (see KG Data Preparation) to maintain\nconsistency and enable comparison. We imputed missing data using the missForest\nalgorithm, which has exhibited superior performance in previous work [ 44 , 46 ].\nTo select the features with the strongest relationships with the FRDs of\ninterest, we leveraged an explainable machine-learning technique [ 47 ], to account for the class imbalance\naffecting the FRD datasets and to produce both importance scores and their\ndirectionality concerning the risk of disease. We developed a model that applied\na first step of supervised feature selection on the training set and then\nselected features used to train an RF classifier. The model then computed\npermutation-based feature importance scores based on the RF classifier that were\nused to select the most important variables for FRD prediction. Features\nregarded as important by an RF are not characterized by directionality and\nmagnitude, which is important for a medical context [ 48 ]. To assess these characteristics, we then trained\nlogistic regression classifiers, whose learned odds ratios and\n P  values indicate the significance and directionality of\nrisk scores. We ran the model three times, each time utilizing a different FRD\nas the primary outcome. 
We adjusted the  P  values obtained in\nthe logistic regression analyses for endometriosis, ovarian cysts, and uterine\nfibroids using Bonferroni correction to control the family-wise error\nrate (FWER).\nBased on the KG and logistic regression model results, we identified the\nmost influential features for each FRD. We compared the KG and logistic\nregression outputs for exact matches for each FRD. Details of additional methods\ncan be found in the  Supplemental Methods  and our code can be found on GitHub [ 49 ].\n\nA total of 16,039 surveys were completed (External Exposome = 3579, Internal\nExposome = 3034, Health and Exposure = 9426) by 9765 unique individuals, including\n2773 individuals who completed all three surveys. In the study population, the\nreported prevalence was 7 % for endometriosis, 15 % for uterine fibroids, and 13 %\nfor ovarian cysts. Translation keys for all survey questions of interest and their\ncorresponding ontology content, including OntoRunNER-generated mappings, can be found\nin  Supplementary Table\n1A – D \n( Supp Table 1D  is also\navailable on GitHub [ 50 ]). The majority of\nsurvey respondents were female, with an average age between 49.9 and 54 years\ndepending on the survey ( Table 1 ). Further\ninformation such as race/ethnicity, pregnancy history, age at menarche, and health\ncare access level was not available in this dataset.\nThe KG created for this project has 308.60 K heterogeneous nodes and 696.68\nK edges in total. The graph contains 28.44 K connected components (of which 28.41 K\nare disconnected nodes), with the largest one containing 280.03 K nodes and the\nsmallest one containing a single node.  Fig. 
4 \nshows the resulting full graph embedding after selecting for the largest connected\ncomponent in the graph.\nWe identified a list of significant (P < 0.005) and suggestive (P\n< 0.05) variable features from the logistic regression analyses and predicted\nsignificant findings from the KG (prediction score > 0.8). All survey labels\nwere coded for a “Yes” response to the question, indicating the\npresence of an exposure or condition.  Table\n2A – C  shows the significant\n(P < 0.005) and suggestive features (P < 0.05) identified from\nlogistic regression. Significant or suggestive features from both analyses are\nindicated in bold in  Tables 2A – C .  Supplemental Tables 2A – C  provide a full list of\nvariables identified from logistic regression. A full list of variables identified\nas part of the KG link prediction methodology can be found in  Supplemental Table 3  ( Supp Table 3  is also available on\nGitHub [ 51 ]).\nTable 2A – C . Significant and suggestive features identified via\nlogistic regression. Variables that are direct matches in the KG results are\ndisplayed in bold. Unreported Mean Variance Inflation Factor (VIF) scores indicate\nthat insufficient information was available to calculate the score.\n\nOur work developing a KG with survey-based data and conducting machine\nlearning to predict variables associated with FRDs is the first of its kind. The\nlogistic regression model we developed for comparison supports our findings using\nthis novel approach. Comparing the logistic regression and KG models resulted in\nnumerous exact matches for medical conditions and procedures, environmental\nexposures, medications, and dietary exposures for the considered FRDs. Endometriosis\nand ovarian cysts had suggestive associations with other gynecological conditions\nand procedures. Positive responses to questions regarding hysterectomy, ovary\nremoval, and ovarian cysts were all suggestively associated with endometriosis. 
A\npossible explanation for the procedure associations is that ovary removal and\nhysterectomy are offered as endometriosis treatment options when other therapies\nhave been unsuccessful [ 52 , 53 ]. However, the timing of disease onset and medical\nprocedures in this dataset was unavailable. Endometriosis can present as an ovarian\nendometrioma, an endometriotic cyst in the ovary [ 54 ], which may be related to the suggestive endometriosis and ovarian\ncyst association identified. It is important to note that screening for any of these\ngynecological conditions may contribute to the identification of another\ngynecological comorbidity due to increased potential for detection.\nUse of duloxetine had a suggestive association with uterine fibroids in this\nstudy. Duloxetine is a medication primarily used for treatment of major depressive\ndisorder, generalized anxiety disorder, chronic musculoskeletal pain, and\nfibromyalgia [ 55 ]. While duloxetine does not\nhave a documented relationship with FRDs in current literature, there is a strong\nassociation between depression and mental health concerns in individuals with FRDs.\nIndividuals with uterine fibroids have been documented to experience higher rates of\ndepression and anxiety compared to controls, particularly amongst individuals who\nexperience pain symptoms or who have undergone a hysterectomy [ 56 ]. Given the increased prevalence of mental health\nconditions amongst individuals with FRDs, individuals with these conditions may be\nmore likely to take antidepressants or similar medications which may be related to\nthis finding.\nOmeprazole use was significantly associated with increased odds of uterine\nfibroids. Omeprazole is a proton pump inhibitor, used to treat gastroesophageal\nreflux disease (GERD), ulcers, and other conditions characterized by excessive\nstomach acid [ 57 ]. 
Omeprazole has no reported\nside effects related to uterine fibroid development, but bulk-related symptoms may\npresent due to uterine fibroids as the enlarged fibroids can distort the abdominal\nanatomy and cause abdominal bloating and pressure [ 58 ]. Uterine fibroids have been denoted as an associated disorder for\nindividuals with Barrett’s esophagus, a gastrointestinal complication of GERD\n[ 59 , 60 ].\nWe identified multiple potential associations between diet and FRDs. Tofu\nconsumption was suggestively associated with decreased odds of endometriosis. Tofu,\na processed soybean curd, is often studied for its health benefits related to its\nhigh isoflavone content [ 61 ]. Isoflavones are\nof interest given their known antioxidant properties [ 62 ]. It is hypothesized that excessive inflammation\nobserved with endometriosis may be mitigated through isoflavone exposure [ 62 , 63 ].\nSupporting the suggestive association of the present study, prior work has reported\nan inverse relationship between urinary isoflavone concentration and severe\nendometriosis [ 64 ]. However, a set of case\nstudies investigating excessive soy consumption found high soy intake to be related\nto dysmenorrhea, endometriosis, and uterine fibroids [ 65 ]. Because of the higher rates of soy consumption among\nAsian individuals compared to other groups [ 66 ], it is notable that prevalence of endometriosis is higher in Asian\npopulations than in other racial groups [ 67 , 68 ]. However, data on race\nwere unavailable for analysis. Notably, soy isoflavones are also phytoestrogens,\ngiven their ability to bind to estrogen receptors and contribute to estrogenic\nactivity in humans [ 62 ]. Isoflavones have\nbeen denoted as potential endocrine disruptors, however these long-term mechanistic\neffects are not fully elucidated [ 61 ]. 
While\nour results are inconclusive, further research evaluating soy consumption and\nendometriosis may be helpful for guidance on prevention and management.\nA suggestive association was also identified for carrot consumption and\ndecreased odds of endometriosis. Consumption of fruits and vegetables has been\nidentified as protective against endometriosis, potentially due to the\nanti-inflammatory properties of dietary components, including vitamins C and E\n[ 69 , 70 ]. Carrots contain high levels of antioxidant carotenoids, which may\nreduce the inflammatory responses that occur in individuals with endometriosis\n[ 71 ]. The effects of carrot consumption\nare inconsistent in the literature, with multiple investigations reporting no\nsignificant associations between carrots and endometriosis [ 72 , 73 ]. Further\nexploratory work is needed for all potential dietary relationships with FRDs,\nincluding study designs which can include food quantities, as that was a limitation\nof this study design.\nBy utilizing a novel KG methodology and comparing the results with those\nfrom a traditional logistic regression model, we generated and corroborated multiple\nhypotheses of the effects of modifiable lifestyle factors on FRDs. The KG method\npresented here is an effective hypothesis-generation strategy, but the results\nshould not be construed as causal as in other survey-based methodologies. Due to a\nlack of temporality information regarding exposures and condition onset, hypotheses\ngenerated from these associations should be investigated bidirectionally to best\ninterpret how the variables interact.\nThe logistic regression approach indicated positive or negative associations\nfor survey variables, which cannot be calculated using existing KG methods. The KG\nmodel identified an unranked list of predicted significant factors that require\nfurther assessment to identify variables of interest. 
Given the novelty of applying\nthe KG method in survey-based data, its successful application in the present work\nshowcases the potential of computational survey investigations using biomedical\nontologies. Collecting data with ontology alignment in mind or retroactively\nperforming ontology alignment for secondary data analysis will provide opportunities\nto apply KG study designs for hypothesis generation.\n\nThis work has limitations due to the nature of the PEGS dataset, namely the\nNorth Carolina-specific population and the lower percentage of individuals with FRDs\ncompared to national prevalence estimates. While this investigation was a secondary\ndata analysis and did not involve design or collection of PEGS survey data, future\ninvestigations should include a more geographically diverse sample population for\ngreater generalizability of study findings. Additionally, the dataset lacks\ninformation on temporality. PEGS participants are asked to describe their current\neating habits, past and current exposures, and whether they have been diagnosed with\nan FRD. Given the lack of context for when onset of a condition occurred, it is\ndifficult to identify the true impact of diet or environmental exposures, as they\nmay have occurred before or after symptom presentation and disease diagnosis. Use of\na survey design that includes temporality questions and collects information on\ngynecological history, demographics, and other potential confounders may improve the\ninterpretation of findings.\nOf note, our investigation used a binary variable of food consumption for\nindividuals to indicate that they either do or do not consume a particular food.\nThis approach was consistent for all food exposures, with no distinctions made\nbetween low and high consumption. Given the potentially wide range of consumption\nlevels, this binary approach reduces the ability to decipher the impacts of dietary\nfactors using the KG model. 
Binning data into “low”,\n“medium”, or “high” consumption levels (for example,\n“low” consumers eat apples 0–1 times per week) should be\nconsidered for future KG based investigations, to improve data output granularity.\nFurther, our named entity recognition approach to mapping string responses to survey\nquestions can be improved by grouping similar medications (for instance, regular\nversus extended-release formats). Additionally, machine learning approaches that\nconsider specific values for dietary intake (for example, the number of apples\nconsumed per week) when creating link predictions in a KG model would greatly\nbenefit future nutrition investigations for hypothesis generation and potential\nfuture causally predictive works.\nThe performance of our KG model resulted in a substantial list of findings,\nmany with similarly high prediction scores. While edge prediction provides\nprediction values between 0 and 1, equally ranked results make prioritization for\nhypothesis generation challenging. As such, efforts should be made to improve the\nprioritization of KG findings to enable hypothesis development.\nWhile areas for improvement exist in this study design, we identified\nmultiple predicted variables, including modifiable lifestyle factors, for FRD.\nAdditional results, including those resulting exclusively from KG analysis, may\nresult in meaningful hypotheses in future investigations of FRDs.\n\nFRDs are highly impactful conditions for women globally, and there is a need\nto identify modifiable factors associated with these disorders. Limited\ninvestigations using ontologies or KG structures for investigations of FRDs have\nbeen conducted, and most existing studies have not accounted for modifiable\nlifestyle factors such as diet and environmental exposures. Using KG and logistic\nregression approaches, we identified a variety of potential intervention points for\nFRDs that can be pursued in future work. 
Because they are based on open-source,\nbiomedical ontologies and computational resources, the novel methodologies used in\nthis study can be repurposed for additional investigations.\nComputational analysis methods for nutrition and exposure survey\ndata are limited, reducing their impact on treatments for conditions\nlike FRDs.\nAlthough previous investigations evaluate FRD mechanisms and\ninterventions, there are significant gaps in knowledge regarding\nmodifiable lifestyle risk factors.\nThis investigation harmonizes nutrition and exposure data with\nbiomedical ontologies for FRD knowledge graph (KG) creation.\nKG analysis via a graph-representation-learning (GRL) model\nidentifies variables which may significantly impact FRDs; these results\nare compared with a classic explainable AI technique, where the\nsignificance and risk of crucial variables identified via random\nforest-based, permutation-importance analysis are assessed by logistic\nregression.","source_license":"CC-BY-4.0","license_restricted":false}