Methods
The primary data for this project came from the Personalized Environment
and Genes Study (PEGS, formerly known as the Environmental Polymorphisms
Registry) conducted by the National Institute of Environmental Health Sciences
(NIEHS) [ 35 , 36 ], which includes data from three respondent
surveys: the Health and Exposure (self-reported diseases and phenotypes),
Internal Exposome (foods, medications, supplements, and ingested exposures), and
External Exposome (environmental exposures) surveys. Survey respondents are
adult (aged 18 years or more) residents of North Carolina recruited for
voluntary participation through health providers or events such as health fairs.
The data included in this investigation were collected between 2012 and 2020.
PEGS data are available from NIEHS by request only. This investigation was
approved and deemed research not involving human subjects (Category 4 exemption) by
the Oregon State University Institutional Review Board (IRB-2021–1207).
Additional publicly available data were included in this investigation.
Agricultural Chemical Usage Program (ACUP) data from the United States
Department of Agriculture (USDA) on fungicides, pesticides, and other chemicals
applied to agricultural crops during 2016–2020 were included for all
relevant questions in the PEGS data sets (for instance, data on chemicals
applied to carrots were included because PEGS asks about carrot consumption).
ACUP data were not included if there was no related PEGS question, and not all
PEGS questions about diet had related ACUP data (for example, consumption of
combination foods such as hamburgers or foods without crop components, such as
meat). Nutrient data for Foundation Foods from USDA FoodData Central (FDC)
were included when available with references to the FoodOn ontology [ 32 ]. This allowed for direct mapping to the
selected ontology alignment (for instance, a survey question on intake of
cottage cheese mapped to FOODON:03303720, and ‘cottage cheese
(lowfat)’ mapped to FDC ID 328841 and the FDC nutritional content for
‘Cheese, cottage, lowfat, 2 % milkfat’).
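The alignment described above can be pictured as a small translation key. The sketch below is illustrative only: the `FoodMapping` record, the `MAPPINGS` table, and the `lookup` helper are hypothetical names, not part of the PEGS pipeline, and the only real identifiers are the FoodOn and FDC values quoted in the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FoodMapping:
    """One row of a hypothetical survey-to-ontology translation key."""
    survey_label: str                     # survey question topic (illustrative wording)
    foodon_id: str                        # FoodOn term aligned to the question
    fdc_id: Optional[int] = None          # FDC Foundation Foods ID, when one exists
    fdc_description: Optional[str] = None


# The single entry below uses the identifiers quoted in the text.
MAPPINGS = {
    "cottage cheese": FoodMapping(
        survey_label="cottage cheese",
        foodon_id="FOODON:03303720",
        fdc_id=328841,
        fdc_description="Cheese, cottage, lowfat, 2 % milkfat",
    ),
}


def lookup(survey_label: str) -> Optional[FoodMapping]:
    """Return the mapping for a survey label, or None if it was not aligned."""
    return MAPPINGS.get(survey_label.strip().lower())


if __name__ == "__main__":
    print(lookup("Cottage cheese"))
    print(lookup("hamburger"))  # combination foods may have no mapping
```

Returning `None` for unmapped labels mirrors the point that not every dietary question has related ACUP or FDC data.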
Combined, the PEGS surveys comprise 1842 questions. We assessed the
survey questions for ontology alignment based on existing ontology content and
complexity of the survey question as well as the primary topic area. We focused
on questions related to diseases, phenotypes, dietary exposures, and
environmental exposures. We then aligned feasible survey questions of interest
(n = 341, with 135 from the External Exposome Survey, 131 from the Internal
Exposome Survey, and 75 from the Health and Exposure Survey) to ontology
terminology. An ontology curator (author LC) manually reviewed the data to map
the PEGS survey questions to the coordinating ontology content. Free-response
components of the PEGS surveys and other data sets, including USDA ACUP data,
were mapped to ontology terms using semi-automated curation with OntoRunNER
[ 37 ], followed by supplemental manual
review by the curator. The ‘survey question label’ selected for
free response questions was assigned the mapped ontology term value of the
response due to the list aggregation used to process data via OntoRunNER. When
necessary, we requested new ontology terms in efforts to support the mappings
needed for this data alignment. Primary requests were made to the Food Ontology
(FoodOn) [ 32 ] and the Environmental
Conditions, Treatments, and Exposures Ontology (ECTO) [ 38 ].
We created the KG for this project with an extract, transform, load
(ETL) pipeline constructed using the Knowledge Graph Hub project KG-template
[ 39 ]. The KG-template offers a
skeleton structure of data download, transformation, and merge scripts that we
customized for this project. This pipeline was developed using Python (Version
3.90.10) and Koza [ 40 ], a data
transformation framework constructed by the Monarch Initiative. Transformations
included the alignment of self-reported data for questions of interest with the
ontology mappings generated manually or semi-automatically as described in Fig. 1 . Code used for KG development is
available at our GitHub repository (41).
We conducted each data transformation (for instance, disease, phenotype,
medication, food) with a unique script that asserted the correct
“predicate” (for example, the phenotype transform created
assertions such as ‘Person:1234’ ‘has phenotype’
‘uterine leiomyoma’). We followed this process for all PEGS data
and all supplemental data on food, chemical usage, and nutrient content. Fig. 2 provides an example of the full
mapping and transformation process, in which reusable nodes were generated for a
respondent’s unique ID and their survey responses. In turn, all questions
answered by a respondent were mapped to the same respondent node using their ID.
Similarly, all respondents who answered the same question were mapped to the
same question response node. In addition to the transformed respondent data, the
full contents of relevant ontologies (Human Phenotype Ontology (HPO), Mondo
Disease Ontology (Mondo), Medical Actions Ontology (MAxO), Gene Ontology (GO),
Environment Ontology (ENVO), Chemical Entities of Biological Interest (ChEBI),
ECTO, and FoodOn) were merged to create the KG. Within the KG structure, each
ontology term or survey participant was considered a “node”, with
all relationships between each node considered an “edge”.
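A minimal sketch of one such transform is shown below, assuming survey responses arrive as (respondent ID, question key) pairs for "Yes" answers. The question keys, the `transform` function, and the second phenotype label are hypothetical; the actual pipeline uses Koza, whose API is not reproduced here.

```python
from typing import Dict, Iterable, List, NamedTuple, Set, Tuple


class Edge(NamedTuple):
    subject: str
    predicate: str
    target: str


def transform(
    responses: Iterable[Tuple[int, str]],
    term_map: Dict[str, str],
    predicate: str,
) -> Tuple[Set[str], List[Edge]]:
    """Turn 'Yes' responses into nodes and edges for one data category."""
    nodes: Set[str] = set()
    edges: List[Edge] = []
    for respondent_id, question in responses:
        subject = f"Person:{respondent_id}"   # one node per respondent ID
        target = term_map[question]           # one node per mapped ontology term
        nodes.add(subject)                    # sets keep each node reusable
        nodes.add(target)
        edges.append(Edge(subject, predicate, target))
    return nodes, edges


# Hypothetical question keys mapped to ontology term labels.
TERM_MAP = {"q_fibroids": "uterine leiomyoma", "q_cysts": "ovarian cyst"}
RESPONSES = [(1234, "q_fibroids"), (5678, "q_fibroids"), (1234, "q_cysts")]

nodes, edges = transform(RESPONSES, TERM_MAP, "has phenotype")
```

Because nodes are stored in a set, two respondents answering the same question share one response node, and one respondent answering two questions keeps a single respondent node.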
As with many KGs, the KG for this project was a high-dimensional object
with a large number of nodes and edges, making it less amenable to machine
learning. Lower-dimension forms of a KG allow for improved generalization of
knowledge, as the latent representation places dissimilar nodes farther away
from one another and nodes with greater similarity closer to each other. To
reduce the dimensionality of the KG in preparation for machine-learning
techniques, we embedded the KG using Graph Representation leArning, Predictions
and Evaluation (GRAPE) [ 41 ] and its
embedding library. We used only the largest component of the KG, which
eliminated data from 691 (7.1 %) survey respondents due to insufficient data.
The generated embedded representations included ontology terms, exposures,
clinical variables, FRDs, and respondents. As such, the resulting
representations embedded the topological relationships between the different
types of entities populating the KG in a vectorial space. Additional details can
be found in the Supplemental
Methods .
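Restricting the graph to its largest connected component can be illustrated with a plain breadth-first search. This is a sketch under the assumption that the graph is given as an undirected edge list; it does not use GRAPE, and the function name is hypothetical.

```python
from collections import defaultdict, deque
from typing import Hashable, Iterable, Set, Tuple


def largest_component(edges: Iterable[Tuple[Hashable, Hashable]]) -> Set[Hashable]:
    """Return the node set of the largest connected component."""
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)

    seen: Set[Hashable] = set()
    best: Set[Hashable] = set()
    for start in adjacency:
        if start in seen:
            continue
        # Breadth-first search collects every node reachable from `start`.
        component = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbour in adjacency[node]:
                if neighbour not in component:
                    component.add(neighbour)
                    queue.append(neighbour)
        seen |= component
        if len(component) > len(best):
            best = component
    return best


example_edges = [("Person:1", "A"), ("Person:2", "A"), ("Person:3", "B")]
kept = largest_component(example_edges)
```

In this toy example, `Person:3` falls outside the largest component and would be dropped, which is analogous to how respondents with too few connections were excluded.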
For the following machine learning methods, we generated two
edge-embedding versions, a training embedding and a full data embedding. The
training embedding included a ‘Training’ portion comprising 70 %
of the graph and a ‘Test’ portion comprising the remaining 30 %.
We created the test portion by selecting and holding out edges that, when
removed from the full embedding, did not create a new component and thus kept
the primary component of the graph intact. This avoided a biased estimation of
the edge prediction results for the test set (see the GRAPE GitHub repository
for a full description of the method [ 42 ]). Edges in the training set were not specifically selected as
“positive” responses (for example, edges documenting an
FRD-variable relationship), in efforts to train the model for edge prediction
based on the entire topology of the graph. The full embedding included all
available data. Fig. 3 summarizes the
analytical methods.
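One way to build a holdout that never splits the graph is to protect a spanning tree and sample test edges only from the remaining edges. The sketch below illustrates that idea in plain Python; it is an assumption about the general technique, not a reproduction of GRAPE's implementation, and `connected_holdout` here is a hypothetical local function.

```python
import random
from typing import Hashable, List, Sequence, Tuple

EdgeT = Tuple[Hashable, Hashable]


def connected_holdout(
    edges: Sequence[EdgeT], train_size: float = 0.7, seed: int = 0
) -> Tuple[List[EdgeT], List[EdgeT]]:
    """Split edges so that removing the test edges creates no new component."""
    rng = random.Random(seed)
    shuffled = list(edges)
    rng.shuffle(shuffled)

    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    train: List[EdgeT] = []
    candidates: List[EdgeT] = []
    for u, v in shuffled:
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v
            train.append((u, v))       # spanning-tree edge: must stay in training
        else:
            candidates.append((u, v))  # redundant edge: safe to hold out

    # Hold out up to the requested share, limited by the safe candidates.
    n_test = min(len(candidates), round(len(edges) * (1 - train_size)))
    test = candidates[:n_test]
    train.extend(candidates[n_test:])
    return train, test
```

If the graph has too few redundant edges, the test portion ends up smaller than requested rather than breaking connectivity.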
Random forests (RF) [ 43 ] are
machine-learning classifiers used for computing medical predictions due to their
inherent explainability and interpretability and the availability of methods
(although preliminary) to convert them into a checklist of rules [ 44 , 45 ].
Our primary machine-learning task was applied to the KG we created,
generating link predictions between variables (for example, food, nutrient,
environmental exposure, disease, phenotype) and the FRDs of interest. We then
trained an RF model (501 trees, maximum depth of 15) using the embeddings of the
training data (with holdouts). Standard machine-learning performance metrics
indicated that the model was trained successfully and was suitable for our analysis (area
under the receiver operating characteristic curve (AUROC) = 0.915 for the
‘Test’ portion of the training data). To produce actionable
results, we then retrained the model on the full dataset to obtain a set of
predicted links between the FRDs and other variables. In the output, predicted
links were represented by two node values—the “source”
(independent variable) and “destination” (dependent variable)
nodes of the link—and a “prediction” score indicating the
strength of the predicted link between the two nodes. Using the full graph
embedding, we selected predicted links in which an FRD
(for example, endometriosis, ovarian cysts, uterine fibroids) was the
“source”, together with the corresponding “destination” node. We
retained pairs with a prediction score > 0.8, resulting in a list of
predicted variables for each FRD of interest.
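The final filtering step can be sketched as follows, assuming the model output is available as (source, destination, score) tuples. The destination names and scores in the example are invented for illustration and are not study results.

```python
from typing import Dict, Iterable, List, Tuple

FRDS = {"endometriosis", "ovarian cysts", "uterine fibroids"}


def select_links(
    predictions: Iterable[Tuple[str, str, float]],
    threshold: float = 0.8,
) -> Dict[str, List[Tuple[str, float]]]:
    """Keep links whose source is an FRD and whose score exceeds the threshold."""
    selected: Dict[str, List[Tuple[str, float]]] = {frd: [] for frd in sorted(FRDS)}
    for source, destination, score in predictions:
        if source in FRDS and score > threshold:
            selected[source].append((destination, score))
    return selected


# Illustrative values only.
example_predictions = [
    ("endometriosis", "variable_a", 0.93),
    ("endometriosis", "variable_b", 0.41),
    ("variable_c", "variable_a", 0.99),   # source is not an FRD, so it is dropped
    ("uterine fibroids", "variable_d", 0.80),  # not strictly above 0.8
]
links = select_links(example_predictions)
```

The comparison is strict, matching the "> 0.8" criterion stated in the text.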
For additional comparison of our KG findings, we conducted a secondary
analysis using elastic nets, RFs, and logistic regression models to provide
feature explanations (in terms of feature importance in prediction) and
interpretations (in terms of the directionality of risk scores associated with
each feature). We conducted this analysis in R, version 4.20.2. We cleaned the
primary PEGS data on health conditions and internal and external exposures to
include female participants only. We then excluded participants who did not
complete all three surveys to improve data quality, given the lower response
rates to the Internal and External Exposome Surveys versus the Health and
Exposure Survey. For the regression analysis, we utilized only survey questions
that aligned with the KG analysis (see KG Data Preparation) to maintain
consistency and enable comparison. We imputed missing data using the missForest
algorithm, which has exhibited superior performance in previous work [ 44 , 46 ].
To select the features with the strongest relationships with the FRDs of
interest, we leveraged an explainable machine-learning technique [ 47 ] to account for the class imbalance
affecting the FRD datasets and to produce both importance scores and their
directionality concerning the risk of disease. We developed a model that applied
a first step of supervised feature selection on the training set and then
selected features used to train an RF classifier. The model then computed
permutation-based feature importance scores based on the RF classifier that were
used to select the most important variables for FRD prediction. Features
regarded as important by an RF are not characterized by directionality and
magnitude, which are important in a medical context [ 48 ]. To assess these characteristics, we then trained
logistic regression classifiers, whose learned odds ratios and
P values indicate the significance and directionality of
risk scores. We ran the model three times, each time utilizing a different FRD
as the primary outcome. We adjusted the P values obtained in
the logistic regression analyses for endometriosis, ovarian cysts, and uterine
fibroids using Bonferroni correction to control the family-wise error rate.
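For reference, the Bonferroni adjustment multiplies each P value by the number of tests in the family and caps the result at 1. The sketch below assumes the family is the set of P values passed in; how the family was defined in the study (for example, per FRD model) is not specified here.

```python
from typing import List, Sequence


def bonferroni(p_values: Sequence[float]) -> List[float]:
    """Return Bonferroni-adjusted P values: min(1, p * m) for m tests."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Illustrative P values only.
adjusted = bonferroni([0.001, 0.02, 0.5])
```

With three tests, a raw P value of 0.02 becomes roughly 0.06 after adjustment, and 0.5 is capped at 1.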
Based on the KG and logistic regression model results, we identified the
most influential features for each FRD. We compared both the KG and logistic
regression outputs for exact matches for each FRD. Details of additional methods
can be found in the Supplemental Methods and our code can be found on GitHub [ 49 ].
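The exact-match comparison amounts to a set intersection per FRD. The following sketch assumes each method's output has been reduced to a dictionary of variable labels per FRD; the labels shown are placeholders, not study findings.

```python
from typing import Dict, Iterable, List


def exact_matches(
    kg_variables: Dict[str, Iterable[str]],
    lr_variables: Dict[str, Iterable[str]],
) -> Dict[str, List[str]]:
    """Return, for each FRD, the labels reported by both methods."""
    return {
        frd: sorted(set(kg_variables[frd]) & set(lr_variables.get(frd, [])))
        for frd in kg_variables
    }


# Placeholder labels for illustration.
kg = {"endometriosis": ["variable_a", "variable_b"], "ovarian cysts": ["variable_c"]}
lr = {"endometriosis": ["variable_b", "variable_d"]}
matches = exact_matches(kg, lr)
```

Only identical labels count as matches, so near-duplicates (such as different formulations of one medication) would need harmonizing beforehand.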
Results
A total of 16,039 surveys were completed (External Exposome = 3579, Internal
Exposome = 3034, Health and Exposure = 9426) by 9765 unique individuals, including
2773 individuals who completed all three surveys. In the study population, the
reported prevalence was 7 % for endometriosis, 15 % for uterine fibroids, and 13 %
for ovarian cysts. Translation keys for all survey questions of interest and their
coordinating ontology content, including OntoRunNER-generated mappings, can be found
in Supplementary Tables 1A–D (Supp Table 1D is also
available on GitHub [ 50 ]). The majority of
survey respondents were female, with an average age between 49.9 and 54 years
depending on the survey ( Table 1 ). Further
information such as race/ethnicity, pregnancy history, age at menarche, and health
care access level were not available in this dataset.
The KG created for this project has 308.60 K heterogeneous nodes and 696.68
K edges in total. The graph contains 28.44 K connected components (of which 28.41 K
are disconnected nodes), with the largest one containing 280.03 K nodes and the
smallest one containing a single node. Fig. 4
shows the resulting full graph embedding after selecting for the largest connected
component in the graph.
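We identified a list of significant (P < 0.005) and suggestive (P < 0.05)
features from logistic regression and a list of predicted variables from the KG model (prediction score > 0.8). All survey labels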
were coded for a “Yes” response to the question, indicating the
presence of an exposure or condition. Tables 2A–C show the significant
(P < 0.005) and suggestive (P < 0.05) features identified from
logistic regression. Significant or suggestive features from both analyses are
indicated in bold in Tables 2A–C. Supplemental Tables 2A–C provide a full list of
variables identified from logistic regression. A full list of variables identified
as part of the KG link prediction methodology can be found in Supplemental Table 3 (Supp Table 3 is also available on
GitHub [ 51 ]).
Tables 2A–C. Significant and suggestive features identified via
logistic regression. Variables that are direct matches in the KG results are
displayed in bold. Unreported Mean Variance Inflation Factor (VIF) scores indicate
inadequate information available to calculate the score.
Conclusion
FRDs are highly impactful conditions for women globally, and there is a need
to identify modifiable factors associated with these disorders. Few
investigations of FRDs have used ontologies or KG structures,
and most existing studies have not accounted for modifiable
lifestyle factors such as diet and environmental exposures. Using KG and logistic
regression approaches, we identified a variety of potential intervention points for
FRDs that can be pursued in future work. Because they are based on open-source,
biomedical ontologies and computational resources, the novel methodologies used in
this study can be repurposed for additional investigations.
Computational analysis methods for nutrition and exposure survey
data are limited, reducing their impact on treatments for conditions
like FRDs.
Although previous investigations evaluate FRD mechanisms and
interventions, there are significant gaps in knowledge regarding
modifiable lifestyle risk factors.
This investigation harmonizes nutrition and exposure data with
biomedical ontologies for FRD knowledge graph (KG) creation.
KG analysis via a graph-representation-learning (GRL) model
identifies variables which may significantly impact FRDs; these results
are compared with a classic explainable AI technique, where the
significance and risk of crucial variables identified via random
forest-based, permutation-importance analysis are assessed by logistic
regression.
Discussion
Our work developing a KG with survey-based data and conducting machine
learning to predict variables associated with FRDs is the first of its kind. The
logistic regression model we developed for comparison supports our findings using
this novel approach. Comparing the logistic regression and KG models resulted in
numerous exact matches for medical conditions and procedures, environmental
exposures, medications, and dietary exposures for the considered FRDs. Endometriosis
and ovarian cysts had suggestive associations with other gynecological conditions
and procedures. Positive responses to questions regarding hysterectomy, ovary
removal, and ovarian cysts were all suggestively associated with endometriosis. A
possible explanation for the procedure associations is that ovary removal and
hysterectomy are offered as endometriosis treatment options when other therapies
have been unsuccessful [ 52 , 53 ]. However, the timing of disease onset and medical
procedures in this dataset was unavailable. Endometriosis can present as an ovarian
endometrioma, an endometriotic cyst in the ovary [ 54 ], which may be related to the suggestive endometriosis and ovarian
cyst association identified. It is important to note that screening for any of these
gynecological conditions may contribute to the identification of another
gynecological comorbidity due to increased potential for detection.
Use of duloxetine had a suggestive association with uterine fibroids in this
study. Duloxetine is a medication primarily used for treatment of major depressive
disorder, generalized anxiety disorder, chronic musculoskeletal pain, and
fibromyalgia [ 55 ]. While duloxetine does not
have a documented relationship with FRDs in current literature, there is a strong
association between depression and mental health concerns in individuals with FRDs.
Individuals with uterine fibroids have been documented to experience higher rates of
depression and anxiety compared to controls, particularly amongst individuals who
experience pain symptoms or who have undergone a hysterectomy [ 56 ]. Given the increased prevalence of mental health
conditions amongst individuals with FRDs, individuals with these conditions may be
more likely to take antidepressants or similar medications which may be related to
this finding.
Omeprazole use was significantly associated with increased odds of uterine
fibroids. Omeprazole is a proton pump inhibitor, used to treat gastroesophageal
reflux disease (GERD), ulcers, and other conditions characterized by excessive
stomach acid [ 57 ]. Omeprazole has no reported
side effects related to uterine fibroid development, but bulk-related symptoms may
present due to uterine fibroids as the enlarged fibroids can distort the abdominal
anatomy and cause abdominal bloating and pressure [ 58 ]. Uterine fibroids have been denoted as an associated disorder for
individuals with Barrett’s esophagus, a gastrointestinal complication of GERD
[ 59 , 60 ].
We identified multiple potential associations between diet and FRDs. Tofu
consumption was suggestively associated with decreased odds of endometriosis. Tofu,
a processed soybean curd, is often studied for its health benefits related to its
high isoflavone content [ 61 ]. Isoflavones are
of interest given their known antioxidant properties [ 62 ]. It is hypothesized that excessive inflammation
observed with endometriosis may be mitigated through isoflavone exposure [ 62 , 63 ].
Supporting the suggestive association of the present study, prior work has reported
an inverse relationship between urinary isoflavone concentration and severe
endometriosis [ 64 ]. However, a set of case
studies investigating excessive soy consumption found high soy intake to be related
to dysmenorrhea, endometriosis, and uterine fibroids [ 65 ]. Because of the higher rates of soy consumption among
Asian individuals compared to other groups [ 66 ], it is notable that prevalence of endometriosis is higher in Asian
populations than in other racial groups [ 67 , 68 ]. However, data on race
were unavailable for analysis. Notably, soy isoflavones are also phytoestrogens,
given their ability to bind to estrogen receptors and contribute to estrogenic
activity in humans [ 62 ]. Isoflavones have
been described as potential endocrine disruptors; however, their long-term mechanistic
effects are not fully elucidated [ 61 ]. While
our results are inconclusive, further research evaluating soy consumption and
endometriosis may be helpful for guidance on prevention and management.
A suggestive association was also identified for carrot consumption and
decreased odds of endometriosis. Consumption of fruits and vegetables has been
identified as protective against endometriosis, potentially due to the
anti-inflammatory properties of dietary components, including vitamins C and E
[ 69 , 70 ]. Carrots contain high levels of antioxidant carotenoids, which may
reduce the inflammatory responses that occur in individuals with endometriosis
[ 71 ]. The effects of carrot consumption
are inconsistent in the literature, with multiple investigations reporting no
significant associations between carrots and endometriosis [ 72 , 73 ]. Further
exploratory work is needed for all potential dietary relationships with FRDs,
including study designs which can include food quantities, as that was a limitation
of this study design.
By utilizing a novel KG methodology and comparing the results with those
from a traditional logistic regression model, we generated and corroborated multiple
hypotheses about the effects of modifiable lifestyle factors on FRDs. The KG method
presented here is an effective hypothesis-generation strategy, but, as with other
survey-based methodologies, the results should not be construed as causal. Due to a
lack of temporality information regarding exposures and condition onset, hypotheses
generated from these associations should be investigated bidirectionally to best
interpret how the variables interact.
The logistic regression approach indicated positive or negative associations
for survey variables, which cannot be calculated using existing KG methods. The KG
model identified an unranked list of predicted significant factors that require
further assessment to identify variables of interest. Given the novelty of applying
the KG method in survey-based data, its successful application in the present work
showcases the potential of computational survey investigations using biomedical
ontologies. Collecting data with ontology alignment in mind or retroactively
performing ontology alignment for secondary data analysis will provide opportunities
to apply KG study designs for hypothesis generation.
Limitations
This work has limitations due to the nature of the PEGS dataset, namely the
North Carolina-specific population and the lower percentage of individuals with FRDs
compared to national prevalence estimates. While this investigation was a secondary
data analysis and did not involve design or collection of PEGS survey data, future
investigations should include a more geographically diverse sample population for
greater generalizability of study findings. Additionally, the dataset lacks
information on temporality. PEGS participants are asked to describe their current
eating habits, past and current exposures, and whether they have been diagnosed with
an FRD. Given the lack of context for when onset of a condition occurred, it is
difficult to identify the true impact of diet or environmental exposures, as they
may have occurred before or after symptom presentation and disease diagnosis. Use of
a survey design that includes temporality questions and collects information on
gynecological history, demographics, and other potential confounders may improve the
interpretation of findings.
Of note, our investigation used a binary variable of food consumption for
individuals to indicate that they either do or do not consume a particular food.
This approach was consistent for all food exposures, with no distinctions made
between low and high consumption. Given the potentially wide range of consumption
levels, this binary approach reduces the ability to decipher the impacts of dietary
factors using the KG model. Binning data into “low”,
“medium”, or “high” consumption levels (for example,
“low” consumers eat apples 0–1 times per week) should be
considered for future KG-based investigations to improve the granularity of the output.
Further, our named entity recognition approach to mapping string responses to survey
questions can be improved by grouping similar medications (for instance, regular
versus extended-release formats). Additionally, machine learning approaches that
consider specific values for dietary intake (for example, the number of apples
consumed per week) when creating link predictions in a KG model would greatly
benefit future nutrition investigations for hypothesis generation and potential
future causally predictive works.
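The binning proposed above could look like the sketch below. Only the "low" range (0–1 times per week) comes from the text; the `medium_max` cut-off and the function name are assumptions chosen for illustration.

```python
def consumption_level(times_per_week: float, low_max: float = 1, medium_max: float = 4) -> str:
    """Bin a weekly consumption frequency into 'low', 'medium', or 'high'.

    low_max follows the 0-1 times per week example in the text;
    medium_max is an assumed cut-off, not a value from the study.
    """
    if times_per_week < 0:
        raise ValueError("times_per_week must be non-negative")
    if times_per_week <= low_max:
        return "low"
    if times_per_week <= medium_max:
        return "medium"
    return "high"


# Illustrative weekly apple counts.
levels = [consumption_level(n) for n in (0, 1, 3, 7)]
```

Each level could then become its own response node in the KG, replacing the single "Yes" node used for binary consumption.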
Our KG model produced a substantial list of findings,
many with similarly high prediction scores. While edge prediction provides
prediction values between 0 and 1, equally ranked results make prioritization for
hypothesis generation challenging. As such, efforts should be made to improve the
prioritization of KG findings to enable hypothesis development.
While areas for improvement exist in this study design, we identified
multiple predicted variables, including modifiable lifestyle factors, for FRDs.
Additional results, including those resulting exclusively from KG analysis, may
result in meaningful hypotheses in future investigations of FRDs.
Introduction
Female reproductive disorders (FRDs) such as endometriosis, uterine fibroids,
and ovarian cysts significantly affect physical and emotional health, disability,
and fertility for women and those assigned female at birth [ 1 ]. FRDs fall into a category of conditions that are
often misdiagnosed and have prolonged diagnostic timeframes and limited therapeutic
options [ 2 , 3 ]. Prevalence of common FRDs such as endometriosis is often
underestimated given the clinical difficulty of identifying the condition without
invasive laparoscopic surgery and the often years-long lag between symptom onset and
diagnosis [ 2 , 4 ]. Due to their widespread prevalence and substantial impact on daily
life, ways to more easily identify FRDs as well as viable therapeutic approaches for
FRDs are highly sought after [ 5 – 7 ]. Diet and environment have been proposed as
potential intervention opportunities for FRDs [ 8 , 9 ], but standard clinical
recommendations on diet and exposures are limited. Focusing on modifiable features
such as diet, lifestyle factors, and environmental exposures may offer new options
for individuals and care providers to manage these common conditions and improve
outcomes. We present an innovative approach for assessing survey-based data to
predict links between nutrition, environmental exposures, comorbidity, and
medication and three common FRDs, namely endometriosis, uterine fibroids, and
ovarian cysts.
Endometriosis is the extrauterine growth of endometrial tissue (also
called lesions) with hallmark symptoms that include pelvic pain, dysuria,
dysmenorrhea, and sub- or infertility [ 10 ]. This FRD is estimated to occur in 10 % of women [ 11 ]. Delays in diagnosis are common with
endometriosis, and many individuals wait years for a conclusive diagnosis [ 2 , 4 ].
Accordingly, estimates of prevalence vary widely and are likely inaccurate. An
estimated 35–50 % of individuals diagnosed with endometriosis experience
pain and/or infertility [ 5 ], but
approximately 20–25 % of individuals with endometriosis do not experience
pelvic pain [ 5 , 12 , 13 ].
Because symptoms can be inconsistent, clinical diagnosis is difficult.
Endometriosis is often diagnosed during treatment for fertility issues [ 14 , 15 ]. Endometriosis can present similarly to other gynecological
disorders including primary dysmenorrhea, pelvic inflammatory disease, and
pelvic adhesions presenting as chronic pelvic pain, painful menses, tubal
pregnancies, and infertility [ 2 , 3 ]. Due to its inconsistent presentation,
surgical visualization is needed to definitively diagnose endometriosis, which
is a barrier to diagnosis and treatment [ 2 ].
Uterine fibroids, also called leiomyomas, are common benign tumors
estimated to be present in 70–80 % of women by the age of menopause
[ 16 ], and approximately 20–25 %
of those individuals present with clinical symptoms [ 17 ]. The fibroids are composed of smooth muscle cells
and fibrous extracellular matrix that is overproduced and creates tumors within
the myometrium [ 18 ]. Many women with
fibroids are not clinically diagnosed. Some have no symptoms, and some live with
significantly burdensome symptoms without a clinical diagnosis. The high
prevalence of undiagnosed fibroids means that prevalence may be underestimated
when determined using clinical records. Common fibroid symptoms include heavy
menses, pelvic pain, anemia, urinary incontinence, and infertility [ 18 – 20 ]. With symptomatic fibroids, pregnancy complications (placenta
previa, intrauterine growth restriction, increased need for cesarean section)
can be more common [ 21 ]. Diagnosis of
fibroids is usually accomplished with a variety of imaging techniques, including
transvaginal ultrasound, hysterosalpingography, saline infusion sonography,
hysteroscopy, and magnetic resonance imaging (MRI) [ 21 – 23 ].
Ovarian cysts affect approximately one in 25 women [ 7 ]. There are multiple types of ovarian cysts, but
functional cysts are the most prevalent. Functional cysts occur when a follicle
forms in the ovary, but no ovulation ensues and the follicle does not rupture,
creating a cyst [ 24 ]. The most frequently
reported symptoms of ovarian cysts are pelvic pain, abdominal pressure,
bloating, and infertility, although asymptomatic ovarian cysts can occur [ 25 , 26 ]. Asymptomatic ovarian cysts can be left untreated and may not
require intervention, with some cysts disappearing naturally. However, cysts
affecting fertility, pelvic anatomy, or quality of life in a significant way can
be surgically removed [ 27 ]. While
polycystic ovary syndrome (PCOS) is a condition that includes the presence of
ovarian cysts, this investigation does not include PCOS as a primary outcome of
interest.
Ontologies are a methodology for standardizing terminology in a
computable fashion to support the creation of logical axioms between related
terms. Prominent ontologies in the biomedical sciences include the Gene Ontology
[ 28 ] and the Human Phenotype Ontology
[ 29 ], with many others related to
foods, chemicals, and diseases [ 30 – 32 ]. Knowledge
graphs (KGs) are a method for representing knowledge such as ontology content
and instance level data in a graph structure in which nodes and edges are
explicitly connected via semantic relationships [ 33 ]. Because of their innate high dimensionality, data inquiries can
be conducted using KGs. However, the dimensionality of KGs can be reduced
through embedding so they can support other analytic methodologies [ 34 ]. In our investigation, we aligned
heterogeneous data regarding health, environment, and internal exposures to
ontology content for ingestion into a KG, which was subsequently embedded and
analyzed using machine learning techniques.