Predicting nutrition and environmental factors associated with female reproductive disorders using a knowledge graph and random forests

Lauren E Chan; Elena Casiraghi; Justin Reese; Quaker E Harmon; Kevin Schaper; Harshad Hegde; Giorgio Valentini; Charles Schmitt; Alison Motsinger-Reif; Janet E Hall; Christopher J Mungall; Peter N Robinson; Melissa A Haendel

doi:10.1016/j.ijmedinf.2024.105461

International journal of medical informatics · 2024 · vol. 187, pp. 105461 · doi:10.1016/j.ijmedinf.2024.105461 · PMID:38643701 · PMC11188727

other OA: gold CC-BY-4.0

⚙ AI-generated summary by claude@2026-06, 2026-06-08 ⓘ

This study used a knowledge graph and random forest analysis on survey data to identify potential associations between female reproductive disorders and environmental and nutritional factors.

One-sentence paraphrase of the abstract; not a substitute for reading it. No clinical advice.

⚙ AI-generated deep summary by claude@2026-06, 2026-06-11 · read from full text ⓘ

The study built a knowledge graph from the Personalized Environment and Genes Study (PEGS) surveys and linked external agricultural chemical usage and USDA nutrient data, aligning diet, environmental exposures, comorbidities, and medications to ontology terms, then embedding the graph for machine-learning. Using a GRAPE-based embedding approach and random forests, the authors aimed to predict associations between nutrition/environment factors and three female reproductive disorders (endometriosis, uterine fibroids, and ovarian cysts) in adult North Carolina survey respondents, while noting that only the largest connected component of the KG was used, excluding 691 respondents (7.1%) due to insufficient data. The paper describes ontology curation, KG construction via an ETL pipeline, and graph embedding to reduce high dimensionality for modeling. Relevance to endometriosis: the paper explicitly targets endometriosis (along with uterine fibroids and ovarian cysts) as a primary predicted female reproductive disorder using the PEGS knowledge-graph and random-forest framework.

Read from the paper's body, not the abstract. Not a substitute for reading the paper. No clinical advice.

Abstract

OBJECTIVE: Female reproductive disorders (FRDs) are common health conditions that may present with significant symptoms. Diet and environment are potential areas for FRD interventions. We utilized a knowledge graph (KG) method to predict factors associated with common FRDs (for example, endometriosis, ovarian cyst, and uterine fibroids). MATERIALS AND METHODS: We harmonized survey data from the Personalized Environment and Genes Study (PEGS) on internal and external environmental exposures and health conditions with biomedical ontology content. We merged the harmonized data and ontologies with supplemental nutrient and agricultural chemical data to create a KG. We analyzed the KG by embedding edges and applying a random forest for edge prediction to identify variables potentially associated with FRDs. We also conducted logistic regression analysis for comparison. RESULTS: Across 9765 PEGS respondents, the KG analysis resulted in 8535 significant or suggestive predicted links between FRDs and chemicals, phenotypes, and diseases. Amongst these links, 32 were exact matches when compared with the logistic regression results, including comorbidities, medications, foods, and occupational exposures. DISCUSSION: Mechanistic underpinnings of predicted links documented in the literature may support some of our findings. Our KG methods are useful for predicting possible associations in large, survey-based datasets with added information on directionality and magnitude of effect from logistic regression. These results should not be construed as causal but can support hypothesis generation. CONCLUSION: This investigation enabled the generation of hypotheses on a variety of potential links between FRDs and exposures. Future investigations should prospectively evaluate the variables hypothesized to impact FRDs.

Methods

The primary data for this project came from the Personalized Environment and Genes Study (PEGS, formerly known as the Environmental Polymorphisms Registry) conducted by the National Institute of Environmental Health Sciences (NIEHS) [ 35 , 36 ], which includes data from three respondent surveys, the Health and Exposure (self-reported diseases and phenotypes), Internal Exposome (foods, medications, supplements, and ingested exposures), and External Exposome (environmental exposures) surveys. Survey respondents are adult (aged 18 years or more) residents of North Carolina recruited for voluntary participation through health providers or events such as health fairs. The data included in this investigation were collected between 2012 and 2020. PEGS data is available by request only from NIEHS. This investigation was approved and deemed research with no human subjects (Category 4 exemption) by the Oregon State University (IRB-2021–1207). Additional publicly available data were included in this investigation. Agricultural Chemical Usage Program (ACUP) data from the United States Department of Agriculture (USDA) on fungicides, pesticides, and other chemicals applied to agricultural crops during 2016–2020 was included for all relevant questions in the PEGS data sets (for instance, data on chemicals applied to carrots was included as PEGS inquires about consumption of carrots). ACUP data were not included if there was no related PEGS question, and not all PEGS questions about diet had related ACUP data (for example, consumption of combination foods such as hamburgers or foods without crop components, such as meat). Nutrient data for Foundation Foods from the USDA Food Data Central (FDC) was included when available with references to the FoodOn ontology [ 32 ]. 
This allowed for direct mapping to the selected ontology alignment (for instance, a survey question on intake of cottage cheese mapped to FOODON:03303720; and ‘cottage cheese (lowfat)’ mapped to FDC ID: 328,841 and FDC nutritional content for ‘Cheese, cottage, lowfat, 2 % milkfat’). Combined, the PEGS surveys comprise 1842 questions. We assessed the survey questions for ontology alignment based on existing ontology content and complexity of the survey question as well as the primary topic area. We focused on questions related to diseases, phenotypes, dietary exposures, and environmental exposures. We then aligned feasible survey questions of interest (n = 341, with 135 from the External Exposure Survey, 131 from the Internal Exposure Survey, and 75 from the Health and Exposure Survey) to ontology terminology. An ontology curator (author LC) manually reviewed the data to map the PEGS survey questions to the coordinating ontology content. Free-response components of the PEGS surveys and other data sets, including USDA ACUP data, were mapped to ontology terms using semi-automated curation with OntoRunNER [ 37 ], followed by supplemental manual review by the curator. The ‘survey question label’ selected for free response questions was assigned the mapped ontology term value of the response due to the list aggregation used to process data via OntoRunNER. When necessary, we requested new ontology terms in efforts to support the mappings needed for this data alignment. Primary requests were made to the Food Ontology (FoodOn) [ 32 ] and the Environmental Conditions, Treatments, and Exposures Ontology (ECTO) [ 38 ]. We created the KG for this project with an extract, transform, load (ETL) pipeline constructed using the Knowledge Graph Hub project KG-template [ 39 ]. The KG-template offers a skeleton structure of data download, transformation, and merge scripts that we customized for this project. 
This pipeline was developed using Python (Version 3.9.10) and Koza [ 40 ], a data transformation framework constructed by the Monarch Initiative. Transformations included the alignment of self-reported data for questions of interest with the ontology mappings generated manually or semi-automatically as described in Fig. 1. Code used for KG development is available at our GitHub repository (41). We conducted each data transformation (for instance, disease, phenotype, medication, food) with a unique script that asserted the correct “predicate” (for example, the phenotype transform created assertions such as ‘Person:1234’ ‘has phenotype’ ‘uterine leiomyoma’). We followed this process for all PEGS data and all supplemental data on food, chemical usage, and nutrient content. Fig. 2 provides an example of the full mapping and transformation process, in which reusable nodes were generated for a respondent’s unique ID and their survey responses. In turn, all questions answered by a respondent were mapped to the same respondent node using their ID. Similarly, all respondents who answered the same question were mapped to the same question response node. In addition to the transformed respondent data, the full contents of relevant ontologies (Human Phenotype Ontology (HPO), Mondo Disease Ontology (Mondo), Medical Actions Ontology (MAxO), Gene Ontology (GO), Environment Ontology (ENVO), Chemical Entities of Biological Interest (ChEBI), ECTO, and FoodOn) were merged to create the KG. Within the KG structure, each ontology term or survey participant was considered a “node”, and each relationship between two nodes was considered an “edge”. As with many KGs, the KG for this project was a high-dimensional object with a large number of nodes and edges, making it less amenable to machine learning.
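The per-topic transforms described above can be pictured as turning each "Yes" survey answer into a subject, predicate, object assertion attached to a reusable respondent node. The following is a minimal sketch in plain Python and is not the authors' Koza pipeline. The `QUESTION_MAP` dictionary, the question keys, and the `consumes` predicate are hypothetical placeholders; only the `Person:1234` / `has phenotype` / `uterine leiomyoma` pattern and the cottage cheese term FOODON:03303720 come from the text.

```python
from typing import NamedTuple


class Edge(NamedTuple):
    subject: str
    predicate: str
    object: str


# Hypothetical mapping from survey question keys to (predicate, ontology term).
# The keys and the "consumes" predicate are placeholders for illustration only.
QUESTION_MAP = {
    "q_uterine_fibroids": ("has phenotype", "uterine leiomyoma"),
    "q_cottage_cheese": ("consumes", "FOODON:03303720"),
}


def responses_to_edges(respondent_id, responses):
    """Emit one edge per mapped question that the respondent answered "Yes"."""
    subject = f"Person:{respondent_id}"  # one reusable node per respondent
    edges = []
    for question, answer in responses.items():
        if answer != "Yes" or question not in QUESTION_MAP:
            continue  # skip "No" answers and questions without a mapping
        predicate, term = QUESTION_MAP[question]
        edges.append(Edge(subject, predicate, term))
    return edges


edges = responses_to_edges(
    "1234",
    {"q_uterine_fibroids": "Yes", "q_cottage_cheese": "No", "q_unmapped": "Yes"},
)
```

Because every respondent who answers the same question points to the same term, shared answers become shared neighbors in the graph, which is what the later embedding step relies on.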
Lower-dimension forms of a KG allow for improved generalization of knowledge, as the latent representation places dissimilar nodes farther away from one another and nodes with greater similarity closer to each other. To reduce the dimensionality of the KG in preparation for machine-learning techniques, we embedded the KG using Graph Representation leArning, Predictions and Evaluation (GRAPE) [ 41 ] and its embedding library. We used only the largest component of the KG, which eliminated data from 691 (7.1 %) survey respondents due to insufficient data. The generated embedded representations included ontology terms, exposures, clinical variables, FRDs, and respondents. As such, the resulting representations embedded the topological relationships between the different types of entities populating the KG in a vectorial space. Additional details can be found in the Supplemental Methods . For the following machine learning methods, we generated two edge-embedding versions, a training embedding and a full data embedding. The training embedding included a ‘Training’ portion comprising 70 % of the graph and a ‘Test’ portion comprising the remaining 30 %. We created the test portion by selecting and holding out edges that, when removed from the full embedding, did not create a new component and thus kept the primary component of the graph intact. This avoided a biased estimation of the edge prediction results for the test set (see the GRAPE github repository for a full description of the method [ 42 ]). Edges in the training set were not specifically selected as “positive” responses (for example, edges documenting an FRD-variable relationship), in efforts to train the model for edge prediction based on the entire topology of the graph. The full embedding included all available data. Fig. 3 summarizes the analytical methods. 
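The holdout described above, in which test edges are chosen so that removing them never splits the main component, can be illustrated with a small sketch. This is one simple way to obtain that property and is an assumption, not GRAPE's implementation (see [ 42 ] for the actual method): edges of a random spanning forest are pinned to the training set, and test edges are drawn only from the remaining edges. The function name `connected_holdout` and the toy graph are illustrative.

```python
import random


def connected_holdout(edges, train_fraction=0.7, seed=0):
    """Split undirected edges so the training graph keeps the same components.

    Edges of a random spanning forest always stay in training; test edges come
    only from the leftover edges, so holding them out cannot disconnect anything.
    """
    rng = random.Random(seed)
    shuffled = list(edges)
    rng.shuffle(shuffled)

    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    tree_edges, spare_edges = [], []
    for u, v in shuffled:
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v  # edge joins two components: keep for training
            tree_edges.append((u, v))
        else:
            spare_edges.append((u, v))  # edge closes a cycle: safe to hold out

    # The test set may be smaller than requested if too few spare edges exist.
    n_test = min(len(spare_edges), round(len(shuffled) * (1 - train_fraction)))
    test = spare_edges[:n_test]
    train = tree_edges + spare_edges[n_test:]
    return train, test


toy_edges = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d"), ("d", "e"), ("e", "c")]
train_edges, test_edges = connected_holdout(toy_edges)
```

On a sparse graph the achievable test share can fall below the requested 30 %, because only cycle-closing edges are eligible for the test set.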
Random forests (RF) [ 43 ] are machine-learning classifiers used for medical prediction because of their inherent explainability and interpretability and the availability of methods (although preliminary) to convert them into a checklist of rules [ 44 , 45 ]. Our primary machine-learning task was applied to the KG we created, generating link predictions between variables (for example, food, nutrient, environmental exposure, disease, phenotype) and the FRDs of interest. We trained an RF model (501 trees, maximum depth of 15) using the embeddings of the training data (with holdouts). Standard machine-learning performance metrics indicated that the model was trained successfully and was suitable for our analysis (area under the receiver operating characteristic curve (AUROC) = 0.915 for the ‘Test’ portion of the training data). To produce actionable results, we then retrained the model on the full dataset to obtain a set of predicted links between the FRDs and other variables. In the output, each predicted link was represented by two node values, the “source” (independent variable) and “destination” (dependent variable) nodes of the link, and a “prediction” score indicating the strength of the predicted link between the two nodes. Utilizing the full graph embedding, we selected the predictions in which an FRD (endometriosis, ovarian cysts, or uterine fibroids) was the “source” node and recorded the corresponding “destination” node. We retained pairs with a prediction score > 0.8, resulting in a list of predicted variables for each FRD of interest. For additional comparison of our KG findings, we conducted a secondary analysis using elastic nets, RFs, and logistic regression models to provide feature explanations (in terms of feature importance in prediction) and interpretations (in terms of the directionality of risk scores associated with each feature). We conducted this analysis in R, version 4.2.2.
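The link-selection step described above (keep predictions whose source is an FRD and whose score exceeds 0.8) amounts to a simple filter. The sketch below assumes each prediction is a dictionary with `source`, `destination`, and `prediction` keys, mirroring the three output fields named in the text; the node labels and scores in the example are made up for illustration and are not study results.

```python
# FRD node labels are placeholders; the real KG would use ontology identifiers.
FRDS = {"endometriosis", "ovarian cyst", "uterine fibroid"}
THRESHOLD = 0.8  # cutoff stated in the text


def select_frd_links(predictions, frds=FRDS, threshold=THRESHOLD):
    """Group (destination, score) pairs by FRD source, keeping scores above the cutoff."""
    by_frd = {frd: [] for frd in frds}
    for row in predictions:
        if row["source"] in frds and row["prediction"] > threshold:
            by_frd[row["source"]].append((row["destination"], row["prediction"]))
    for links in by_frd.values():
        links.sort(key=lambda pair: pair[1], reverse=True)  # highest score first
    return by_frd


# Made-up predictions, for illustration only.
predictions = [
    {"source": "endometriosis", "destination": "variable A", "prediction": 0.93},
    {"source": "endometriosis", "destination": "variable B", "prediction": 0.80},
    {"source": "uterine fibroid", "destination": "variable C", "prediction": 0.85},
    {"source": "variable D", "destination": "endometriosis", "prediction": 0.99},
]
links = select_frd_links(predictions)
```

The comparison is strict, so a score of exactly 0.8 is dropped, and links in which the FRD appears only as the destination are ignored, matching the source-based selection described in the text.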
We cleaned the primary PEGS data on health conditions and internal and external exposures to include female participants only. We then excluded participants who did not complete all three surveys to improve data quality, given the lower response rates to the Internal and External Exposure Surveys versus the Health and Exposure Survey. For the regression analysis, we utilized only survey questions that aligned with the KG analysis (see KG Data Preparation) to maintain consistency and enable comparison. We imputed missing data using the missForest algorithm, which has exhibited superior performance in previous work [ 44 , 46 ]. To select the features with the strongest relationships with the FRDs of interest, we leveraged an explainable machine-learning technique [ 47 ] that accounts for the class imbalance affecting the FRD datasets and produces both importance scores and their directionality with respect to the risk of disease. We developed a model that first applied supervised feature selection to the training set and then used the selected features to train an RF classifier. The model then computed permutation-based feature importance scores from the RF classifier, which were used to select the most important variables for FRD prediction. Features regarded as important by an RF are not characterized by directionality and magnitude, which are important in a medical context [ 48 ]. To assess these characteristics, we then trained logistic regression classifiers, whose learned odds ratios and P values indicate the significance and directionality of risk scores. We ran the model three times, each time utilizing a different FRD as the primary outcome. We adjusted the P values obtained in the logistic regression analyses for endometriosis, ovarian cysts, and uterine fibroids using Bonferroni correction to control the family-wise error rate (FWER).
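The final steps above (Bonferroni adjustment, then reading significance and direction from the logistic regression output) can be sketched as follows. The 0.005 and 0.05 cutoffs are the significant and suggestive thresholds reported in the Results; applying them to the adjusted P values, and using the number of tested features as the Bonferroni multiplier, are assumptions of this sketch. The authors ran their analysis in R, so this Python version is illustrative only, and the example P values are made up.

```python
import math


def bonferroni(p_values):
    """Multiply each P value by the number of tests, capping at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def classify(p_adjusted, significant=0.005, suggestive=0.05):
    """Label an adjusted P value using the cutoffs reported in the Results."""
    if p_adjusted < significant:
        return "significant"
    if p_adjusted < suggestive:
        return "suggestive"
    return "not significant"


def odds_ratio(coefficient):
    """Convert a logistic regression coefficient (log-odds) to an odds ratio."""
    return math.exp(coefficient)


# Made-up raw P values for four hypothetical features.
raw_p = [0.0004, 0.004, 0.03, 0.2]
adjusted = bonferroni(raw_p)
labels = [classify(p) for p in adjusted]
```

An odds ratio above 1 indicates increased odds of the FRD for a "Yes" response and a value below 1 indicates decreased odds, which is the directionality that the RF importance scores alone do not provide.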
Based on the KG and logistic regression model results, we identified the most influential features for each FRD. We compared both the KG and logistic regression outputs for exact matches for each FRD. Details of additional methods can be found in the Supplemental Methods and our code can be found on GitHub [ 49 ].
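The exact-match comparison between the two analyses reduces to a per-FRD set intersection. The sketch below assumes both outputs have already been reduced to variable labels keyed by FRD; the function name and the placeholder labels are hypothetical and do not represent the study's findings.

```python
def exact_matches(kg_links, regression_features):
    """For each FRD, return variables present in both the KG and regression outputs."""
    return {
        frd: sorted(set(kg_links.get(frd, ())) & set(features))
        for frd, features in regression_features.items()
    }


# Placeholder labels, for illustration only.
kg_links = {
    "endometriosis": ["variable A", "variable B", "variable C"],
    "uterine fibroid": ["variable D"],
}
regression_features = {
    "endometriosis": ["variable B", "variable C", "variable E"],
    "uterine fibroid": ["variable F"],
    "ovarian cyst": ["variable G"],
}
matches = exact_matches(kg_links, regression_features)
```

Matching is on identical labels only, so two differently worded labels for the same exposure would not be counted as a match.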

Results

A total of 16,039 surveys were completed (External Exposome = 3579, Internal Exposome = 3034, Health and Exposure = 9426) by 9765 unique individuals, including 2773 individuals who completed all three surveys. In the study population, the reported prevalence was 7 % for endometriosis, 15 % for uterine fibroids, and 13 % for ovarian cysts. Translation keys for all survey questions of interest and their coordinating ontology content, including OntoRunNER-generated mappings, can be found in Supplementary Table 1A–D (Supp Table 1D is also available on GitHub [ 50 ]). The majority of survey respondents were female, with an average age between 49.9 and 54 years depending on the survey (Table 1). Further information such as race/ethnicity, pregnancy history, age at menarche, and health care access level was not available in this dataset. The KG created for this project has 308.60 K heterogeneous nodes and 696.68 K edges in total. The graph contains 28.44 K connected components (of which 28.41 K are disconnected nodes), with the largest one containing 280.03 K nodes and the smallest one containing a single node. Fig. 4 shows the resulting full graph embedding after selecting for the largest connected component in the graph. We identified a list of significant (P < 0.005) and suggestive (P < 0.05) features from logistic regression, together with the KG link predictions (prediction score > 0.8). All survey labels were coded for a “Yes” response to the question, indicating the presence of an exposure or condition. Table 2A–C shows the significant (P < 0.005) and suggestive (P < 0.05) features identified from logistic regression. Significant or suggestive features from both analyses are indicated in bold in Tables 2A–C. Supplemental Tables 2A–C provide a full list of variables identified from logistic regression. A full list of variables identified as part of the KG link prediction methodology can be found in Supplemental Table 3 (Supp Table 3 is also available on GitHub [ 51 ]).

Table 2A–C. Significant and suggestive features identified via logistic regression. Variables that are direct matches in the KG results are displayed in bold. Unreported mean variance inflation factor (VIF) scores indicate that insufficient information was available to calculate the score.

Conclusion

FRDs are highly impactful conditions for women globally, and there is a need to identify modifiable factors associated with these disorders. Limited investigations using ontologies or KG structures for investigations of FRDs have been conducted, and most existing studies have not accounted for modifiable lifestyle factors such as diet and environmental exposures. Using KG and logistic regression approaches, we identified a variety of potential intervention points for FRDs that can be pursued in future work. Because they are based on open-source, biomedical ontologies and computational resources, the novel methodologies used in this study can be repurposed for additional investigations. Computational analysis methods for nutrition and exposure survey data are limited, reducing their impact on treatments for conditions like FRDs. Although previous investigations evaluate FRD mechanisms and interventions, there are significant gaps in knowledge regarding modifiable lifestyle risk factors. This investigation harmonizes nutrition and exposure data with biomedical ontologies for FRD knowledge graph (KG) creation. KG analysis via a graph-representation-learning (GRL) model identifies variables which may significantly impact FRDs; these results are compared with a classic explainable AI technique, where the significance and risk of crucial variables identified via random forest-based, permutation-importance analysis are assessed by logistic regression.

Discussion

Our work developing a KG with survey-based data and conducting machine learning to predict variables associated with FRDs is the first of its kind. The logistic regression model we developed for comparison supports our findings using this novel approach. Comparing the logistic regression and KG models resulted in numerous exact matches for medical conditions and procedures, environmental exposures, medications, and dietary exposures for the considered FRDs. Endometriosis and ovarian cysts had suggestive associations with other gynecological conditions and procedures. Positive responses to questions regarding hysterectomy, ovary removal, and ovarian cysts were all suggestively associated with endometriosis. A possible explanation for the procedure associations is that ovary removal and hysterectomy are offered as endometriosis treatment options when other therapies have been unsuccessful [ 52 , 53 ]. However, the timing of disease onset and medical procedures in this dataset was unavailable. Endometriosis can present as an ovarian endometrioma, an endometriotic cyst in the ovary [ 54 ], which may be related to the suggestive endometriosis and ovarian cyst association identified. It is important to note that screening for any of these gynecological conditions may contribute to the identification of another gynecological comorbidity due to increased potential for detection. Use of duloxetine had a suggestive association with uterine fibroids in this study. Duloxetine is a medication primarily used for treatment of major depressive disorder, generalized anxiety disorder, chronic musculoskeletal pain, and fibromyalgia [ 55 ]. While duloxetine does not have a documented relationship with FRDs in current literature, there is a strong association between depression and mental health concerns in individuals with FRDs. 
Individuals with uterine fibroids have been documented to experience higher rates of depression and anxiety compared to controls, particularly amongst individuals who experience pain symptoms or who have undergone a hysterectomy [ 56 ]. Given the increased prevalence of mental health conditions amongst individuals with FRDs, individuals with these conditions may be more likely to take antidepressants or similar medications which may be related to this finding. Omeprazole use was significantly associated with increased odds of uterine fibroids. Omeprazole is a proton pump inhibitor, used to treat gastroesophageal reflux disease (GERD), ulcers, and other conditions characterized by excessive stomach acid [ 57 ]. Omeprazole has no reported side effects related to uterine fibroid development, but bulk-related symptoms may present due to uterine fibroids as the enlarged fibroids can distort the abdominal anatomy and cause abdominal bloating and pressure [ 58 ]. Uterine fibroids have been denoted as an associated disorder for individuals with Barrett’s esophagus, a gastrointestinal complication of GERD [ 59 , 60 ]. We identified multiple potential associations between diet and FRDs. Tofu consumption was suggestively associated with decreased odds of endometriosis. Tofu, a processed soybean curd, is often studied for its health benefits related to its high isoflavone content [ 61 ]. Isoflavones are of interest given their known antioxidant properties [ 62 ]. It is hypothesized that excessive inflammation observed with endometriosis may be mitigated through isoflavone exposure [ 62 , 63 ]. Supporting the suggestive association of the present study, prior work has reported an inverse relationship between urinary isoflavone concentration and severe endometriosis [ 64 ]. However, a set of case studies investigating excessive soy consumption found high soy intake to be related to dysmenorrhea, endometriosis, and uterine fibroids [ 65 ]. 
Because of the higher rates of soy consumption among Asian individuals compared to other groups [ 66 ], it is notable that prevalence of endometriosis is higher in Asian populations than in other racial groups [ 67 , 68 ]. However, data on race were unavailable for analysis. Notably, soy isoflavones are also phytoestrogens, given their ability to bind to estrogen receptors and contribute to estrogenic activity in humans [ 62 ]. Isoflavones have been denoted as potential endocrine disruptors, however these long-term mechanistic effects are not fully elucidated [ 61 ]. While our results are inconclusive, further research evaluating soy consumption and endometriosis may be helpful for guidance on prevention and management. A suggestive association was also identified for carrot consumption and decreased odds of endometriosis. Consumption of fruits and vegetables has been identified as protective against endometriosis, potentially due to the anti-inflammatory properties of dietary components, including vitamins C and E [ 69 , 70 ]. Carrots contain high levels of antioxidant carotenoids, which may reduce the inflammatory responses that occur in individuals with endometriosis [ 71 ]. The effects of carrot consumption are inconsistent in the literature, with multiple investigations reporting no significant associations between carrots and endometriosis [ 72 , 73 ]. Further exploratory work is needed for all potential dietary relationships with FRDs, including study designs which can include food quantities, as that was a limitation of this study design. By utilizing a novel KG methodology and comparing the results with those from a traditional logistic regression model, we generated and corroborated multiple hypotheses of the effects of modifiable lifestyle factors on FRDs. The KG method presented here is an effective hypothesis-generation strategy, but the results should not be construed as causal as in other survey-based methodologies. 
Due to a lack of temporality information regarding exposures and condition onset, hypotheses generated from these associations should be investigated bidirectionally to best interpret how the variables interact. The logistic regression approach indicated positive or negative associations for survey variables, which cannot be calculated using existing KG methods. The KG model identified an unranked list of predicted significant factors that require further assessment to identify variables of interest. Given the novelty of applying the KG method in survey-based data, its successful application in the present work showcases the potential of computational survey investigations using biomedical ontologies. Collecting data with ontology alignment in mind or retroactively performing ontology alignment for secondary data analysis will provide opportunities to apply KG study designs for hypothesis generation.

Limitations

This work has limitations due to the nature of the PEGS dataset, namely the North Carolina-specific population and the lower percentage of individuals with FRDs compared to national prevalence estimates. While this investigation was a secondary data analysis and did not involve design or collection of PEGS survey data, future investigations should include a more geographically diverse sample population for greater generalizability of study findings. Additionally, the dataset lacks information on temporality. PEGS participants are asked to describe their current eating habits, past and current exposures, and whether they have been diagnosed with an FRD. Given the lack of context for when onset of a condition occurred, it is difficult to identify the true impact of diet or environmental exposures, as they may have occurred before or after symptom presentation and disease diagnosis. Use of a survey design that includes temporality questions and collects information on gynecological history, demographics, and other potential confounders may improve the interpretation of findings. Of note, our investigation used a binary variable of food consumption for individuals to indicate that they either do or do not consume a particular food. This approach was consistent for all food exposures, with no distinctions made between low and high consumption. Given the potentially wide range of consumption levels, this binary approach reduces the ability to decipher the impacts of dietary factors using the KG model. Binning data into “low”, “medium”, or “high” consumption levels (for example, “low” consumers eat apples 0–1 times per week) should be considered for future KG based investigations, to improve data output granularity. Further, our named entity recognition approach to mapping string responses to survey questions can be improved by grouping similar medications (for instance, regular versus extended-release formats). 
Additionally, machine learning approaches that consider specific values for dietary intake (for example, the number of apples consumed per week) when creating link predictions in a KG model would greatly benefit future nutrition investigations for hypothesis generation and potential future causally predictive works. The performance of our KG model resulted in a substantial list of findings, many with similarly high prediction scores. While edge prediction provides prediction values between 0 and 1, equally ranked results make prioritization for hypothesis generation challenging. As such, efforts should be made to improve the prioritization of KG findings to enable hypothesis development. While areas for improvement exist in this study design, we identified multiple predicted variables, including modifiable lifestyle factors, for FRD. Additional results, including those resulting exclusively from KG analysis, may result in meaningful hypotheses in future investigations of FRDs.
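The binning suggested above for future KG-based work could look like the following sketch. Only the "low" range (0–1 times per week) comes from the text; the `medium_max` cutoff of 4 and the function name are hypothetical choices made purely for illustration.

```python
def bin_consumption(times_per_week, low_max=1, medium_max=4):
    """Map a weekly consumption frequency to a "low", "medium", or "high" bin.

    low_max=1 follows the example in the text; medium_max=4 is an assumed cutoff.
    """
    if times_per_week < 0:
        raise ValueError("consumption frequency cannot be negative")
    if times_per_week <= low_max:
        return "low"
    if times_per_week <= medium_max:
        return "medium"
    return "high"


apple_bins = [bin_consumption(n) for n in (0, 1, 3, 7)]
```

Each bin could then become its own KG node per food, so that respondents with similar intake levels share neighbors instead of all consumers collapsing into a single "Yes" node.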

Introduction

Female reproductive disorders (FRDs) such as endometriosis, uterine fibroids, and ovarian cysts significantly affect physical and emotional health, disability, and fertility for women and those assigned female at birth [1]. FRDs fall into a category of conditions that are often misdiagnosed and have prolonged diagnostic timeframes and limited therapeutic options [2, 3]. Prevalence of common FRDs such as endometriosis is often underestimated given the clinical difficulty of identifying the condition without invasive laparoscopic surgery and the often years-long lag between symptom onset and diagnosis [2, 4]. Due to their widespread prevalence and substantial impact on daily life, ways to more easily identify FRDs as well as viable therapeutic approaches for FRDs are highly sought after [5–7]. Diet and environment have been proposed as potential intervention opportunities for FRDs [8, 9], but standard clinical recommendations on diet and exposures are limited. Focusing on modifiable features such as diet, lifestyle factors, and environmental exposures may offer new options for individuals and care providers to manage these common conditions and improve outcomes. We present an innovative approach for assessing survey-based data to predict links between nutrition, environmental exposures, comorbidity, and medication and three common FRDs, namely endometriosis, uterine fibroids, and ovarian cysts.

Endometriosis is the extrauterine growth of endometrial tissue (also called lesions) with hallmark symptoms that include pelvic pain, dysuria, dysmenorrhea, and sub- or infertility [10]. This FRD is estimated to occur in 10% of women [11]. Delays in diagnosis are common with endometriosis, and many individuals wait years for a conclusive diagnosis [2, 4]. Accordingly, estimates of prevalence vary widely and are likely inaccurate. An estimated 35–50% of individuals diagnosed with endometriosis experience pain and/or infertility [5], but approximately 20–25% of individuals with endometriosis do not experience pelvic pain [5, 12, 13]. Because symptoms can be inconsistent, clinical diagnosis is difficult. Endometriosis is often diagnosed during treatment for fertility issues [14, 15]. Endometriosis can present similarly to other gynecological disorders including primary dysmenorrhea, pelvic inflammatory disease, and pelvic adhesions presenting as chronic pelvic pain, painful menses, tubal pregnancies, and infertility [2, 3]. Due to its inconsistent presentation, surgical visualization is needed to definitively diagnose endometriosis, which is a barrier to diagnosis and treatment [2].

Uterine fibroids, also called leiomyomas, are common benign tumors estimated to be present in 70–80% of women by the age of menopause [16], and approximately 20–25% of those individuals present with clinical symptoms [17]. The fibroids are composed of smooth muscle cells and fibrous extracellular matrix that is overproduced and creates tumors within the myometrium [18]. Many women with fibroids are not clinically diagnosed. Some have no symptoms, and some live with significantly burdensome symptoms without a clinical diagnosis. The high prevalence of undiagnosed fibroids means that prevalence may be underestimated when determined using clinical records. Common fibroid symptoms include heavy menses, pelvic pain, anemia, urinary incontinence, and infertility [18–20]. With symptomatic fibroids, pregnancy complications (placenta previa, intrauterine growth restriction, increased need for cesarean section) can be more common [21]. Diagnosis of fibroids is usually accomplished with a variety of imaging techniques, including transvaginal ultrasound, hysterosalpingography, saline infusion sonography, hysteroscopy, and magnetic resonance imaging (MRI) [21–23].

Ovarian cysts affect approximately one in 25 women [7]. There are multiple types of ovarian cysts, but functional cysts are the most prevalent. Functional cysts occur when a follicle forms in the ovary, but no ovulation ensues and the follicle does not rupture, creating a cyst [24]. The most frequently reported symptoms of ovarian cysts are pelvic pain, abdominal pressure, bloating, and infertility, although asymptomatic ovarian cysts can occur [25, 26]. Asymptomatic ovarian cysts can be left untreated and may not require intervention, with some cysts disappearing naturally. However, cysts affecting fertility, pelvic anatomy, or quality of life in a significant way can be surgically removed [27]. While polycystic ovary syndrome (PCOS) is a condition that includes the presence of ovarian cysts, this investigation does not include PCOS as a primary outcome of interest.

Ontologies are a methodology for standardizing terminology in a computable fashion to support the creation of logical axioms between related terms. Prominent ontologies in the biomedical sciences include the Gene Ontology [28] and the Human Phenotype Ontology [29], with many others related to foods, chemicals, and diseases [30–32]. Knowledge graphs (KGs) are a method for representing knowledge such as ontology content and instance level data in a graph structure in which nodes and edges are explicitly connected via semantic relationships [33]. Because of their innate high dimensionality, data inquiries can be conducted using KGs. However, the dimensionality of KGs can be reduced through embedding so they can support other analytic methodologies [34]. In our investigation, we aligned heterogeneous data regarding health, environment, and internal exposures to ontology content for ingestion into a KG, which was subsequently embedded and analyzed using machine learning techniques.
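
The two ways of using a KG described above, querying it directly and embedding it into a lower-dimensional space, can be illustrated with a minimal sketch. This is not the authors' pipeline: the triples, node identifiers, and function names are hypothetical, and a seeded random projection of each node's adjacency row stands in for a learned graph embedding.

```python
import random
from collections import defaultdict

# Hypothetical (subject, predicate, object) triples; the identifiers are
# illustrative only and are not taken from the survey data or a real ontology.
TRIPLES = [
    ("respondent:1", "has_condition", "disease:endometriosis"),
    ("respondent:1", "reports_intake_of", "food:leafy_greens"),
    ("respondent:2", "has_condition", "disease:uterine_fibroids"),
    ("respondent:2", "reports_exposure_to", "chemical:pesticide_x"),
    ("respondent:3", "has_condition", "disease:endometriosis"),
    ("respondent:3", "reports_exposure_to", "chemical:pesticide_x"),
    ("disease:endometriosis", "subclass_of", "disease:female_reproductive_disorder"),
    ("disease:uterine_fibroids", "subclass_of", "disease:female_reproductive_disorder"),
]


def build_graph(triples):
    """Index triples as an undirected adjacency map: node -> set of neighbours."""
    graph = defaultdict(set)
    for subject, _predicate, obj in triples:
        graph[subject].add(obj)
        graph[obj].add(subject)
    return dict(graph)


def subjects_with(triples, predicate, obj):
    """Direct graph query: which subjects link to `obj` via `predicate`?"""
    return sorted(s for s, p, o in triples if p == predicate and o == obj)


def embed(graph, dim=4, seed=0):
    """Project each node's adjacency row onto `dim` random directions.

    Assumption: random projection is used here only as a simple stand-in
    for a learned graph embedding method.
    """
    rng = random.Random(seed)
    nodes = sorted(graph)
    # One random vector per node, i.e. one row of the projection matrix.
    directions = {n: [rng.gauss(0.0, 1.0) for _ in range(dim)] for n in nodes}
    embedding = {}
    for node in nodes:
        vec = [0.0] * dim
        # Summing the neighbours' vectors equals (adjacency row) x (projection).
        for neighbour in graph[node]:
            for i in range(dim):
                vec[i] += directions[neighbour][i]
        embedding[node] = vec
    return embedding


graph = build_graph(TRIPLES)
cases = subjects_with(TRIPLES, "has_condition", "disease:endometriosis")
vectors = embed(graph)
print(cases)
print(len(graph), "nodes ->", len(vectors["respondent:1"]), "dimensions per node")
```

Each node ends up with a fixed-length vector regardless of how many nodes the graph contains, which is what lets a tabular learner such as a random forest consume graph-derived features.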

Condition tags

endometriosis

MeSH descriptors

Environmental Exposure

Source provenance

europepmc: last seen: 2026-06-14T06:08:20.186862+00:00
pubmed: last seen: 2026-06-14T06:06:22.001263+00:00
unpaywall: last seen: 2026-05-14T19:30:52.867331+00:00

License: CC-BY-4.0 · commercial use OK · attribution required
Courtesy of the U.S. National Library of Medicine