Intrinsic tumor factors and extrinsic environmental and social exposures contribute to endometrial cancer recurrence patterns | Research Square

Jesus Gonzalez Bosquet, Oyomoare Osazuwa-Peters, Vincent M. Wagner, and 19 more

This is a preprint; it has not been peer reviewed by a journal. https://doi.org/10.21203/rs.3.rs-8682460/v1 This work is licensed under a CC BY 4.0 License. Status: Posted. Version 1 posted.

Abstract

Purpose In a previous study, we trained, validated and tested models of endometrial cancer (EC) recurrence integrating clinical, genomic and pathological data from the Oncology Research Information Exchange Network (ORIEN). Preliminary studies also have demonstrated that bacterial communities may influence the risk of EC recurrence by altering the local environment within the upper female genital tract.
The objective of this study was to evaluate whether extrinsic and environmental factors, including tumor-associated bacterial communities, tumor immune contexture and air pollution, alongside clinical, pathologic and genomic features, are associated with EC recurrence across clinically relevant risk groups.

Patients and Methods We performed a retrospective, multi-institution, case–control study with data from the ORIEN network EC dataset. Data were stratified into low-risk (FIGO grade 1 or 2, stage I; N = 329), high-risk (FIGO grade 3 or stages II-IV; N = 324), and non-endometrioid histology (N = 239) groups. RNA and DNA were extracted from tumor specimens and processed to obtain the necessary genomic/metagenomic data. Genus-level microbiome data were extracted and curated from RNA sequencing using the Kraken2, Bracken and exotic software packages. Risk of EC recurrence was evaluated by integrating microbiome and environmental data alongside existing clinical, pathological and genomic data using topic modelling with latent Dirichlet allocation (LDA). Prediction models of EC recurrence were created using machine and deep learning analytics (ML and DL) with MATLAB apps and TensorFlow. Finally, the performance of both topic and prediction models was externally validated in an independent EC dataset from TCGA.

Results The resulting models, analyzed with topic modelling, demonstrated the complexity of factors involved in recurrence of disease for EC. The components of the resulting topic models, and specifically the microbiome, changed when environmental factors, such as air pollutants, were introduced in the model. In the low-risk EC group, microbes that were abundant in models before environmental factors were introduced, such as the genera Thermothielavioides, Theileria and Rhizoctonia, were scarcely seen afterwards. Bacillus was the genus with the highest per-topic probability in all risk groups, especially for low-risk EC (28%).
Ozone (O3) was a component of the resulting models in all risk groups. BMI was the sole informative clinical variable after data integration and was present only in the low-risk group. Resulting models from the high-risk and non-endometrioid groups included differentially expressed genes: MMP13, S100A7, SMOC1 and ACACA in the high-risk group, and ADD2, DLX5, SLCO2B1 and NWD1 in the non-endometrioid group. CNVs were also present in both low-risk and non-endometrioid groups, but their per-topic probabilities were low. The same was true for the immune contexture data. The components of the resulting topic models were used to train, validate and test prediction models of EC recurrence by risk group. Performance of these models was excellent (AUC approximately 0.9). Although some microbiome components of the resulting topic models were missing in TCGA, prediction models trained in the ORIEN set had similar performance in the TCGA testing set, with overlapping AUC 95% CIs.

Conclusion Both extrinsic factors (tumor-associated bacterial communities, tumor immune contexture and air pollution) and intrinsic factors predict EC recurrence. The complexity of tumor and host factors influencing cancer relapses underscores the need for more individualized prediction models of disease outcomes.

Biological sciences/Cancer Biological sciences/Computational biology and bioinformatics Biological sciences/Microbiology Health sciences/Oncology Health sciences/Risk factors

BACKGROUND

The incidence of and mortality from endometrial cancer (EC) continue to rise, 1 with a projected mortality increase of 55% by 2030. 2 These discouraging outcomes are due in part to the persistence of treatment failures, despite the recent introduction of immunotherapy and targeted therapies for this disease with notable successes (RUBY, GY-018, DUO-E).
3–5 Though non-endometrioid EC types account for a disproportionately high number of EC recurrences and cancer-related deaths, 6 the majority of treatment failures and recurrences occur in endometrioid EC, with disease recurring in approximately 10–15% of patients with early-stage EC. 6,7

In a previous study, we trained, validated and tested models of EC recurrence integrating clinical, genomic and pathological data from the Oncology Research Information Exchange Network (ORIEN). 8 The models were stratified into low-risk (FIGO grade 1 or 2, stage I; N = 329), high-risk (FIGO grade 3 or stages II-IV; N = 324), and non-endometrioid histology (N = 239) groups. That study produced validated, high-performing prediction models, with area under the curve (AUC) values of 0.9–0.95 for all 3 risk groups. While these models demonstrated excellent discrimination, they may not fully capture the biological complexity and environmental heterogeneity that influence EC recurrence across diverse patient populations. To further improve the discriminatory accuracy and generalizability of these models, we hypothesized that intrinsic tumor microenvironment (TME) variables and extrinsic environmental variables, considered alongside clinical, pathologic and genomic features, may modify the risk of EC recurrence by geographic location.

In preliminary data, we observed that the microbiome is associated with female genital tract cancers, specifically with EC, and may interact differently with tumors that have different mutation signatures. 9, 10 The human microbiome is a symbiotic community of bacteria, fungi, and viruses that live on or within the human body with specific functions, properties, and interactions within its environment. 11, 12 Bacterial communities may influence the risk of EC recurrence by modulating the local immune response, by inducing epigenetic changes, or by modulating the TME.
13–15 Additionally, other environmental factors, such as air pollution, have also been associated with incidence and recurrence in hormone-related cancers, such as breast cancer, 16 and may act as xenoestrogens or anti-androgens, inducing oxidative stress, DNA damage, epigenetic changes, and chronic inflammation in hormone-sensitive tissues. 17, 18

The objective of this study was to assess the differences in EC recurrence risk when accounting for TME factors, such as tumor-associated microbiome and immune cell infiltration, and extrinsic environmental factors, such as air pollution and environmental determinants of health. Then, we assessed the predictive accuracy of these intrinsic and extrinsic variables for EC recurrence. Performance of the prediction models was externally validated using The Cancer Genome Atlas (TCGA) EC datasets.

METHODS

Study design: We performed a retrospective, multi-institution, case–control study with data originating from the ORIEN network EC dataset. ORIEN comprises multiple cancer centers that have agreed to use the same Institutional Review Board (IRB)-approved protocol and consent (Total Cancer Care Protocol, TCC) to follow patients throughout their lifetime. 19 A copy of the protocol is included in Supplementary Material. Patients consent to donate medical records and tissue specimens for molecular profiling, as an approach to improve the design and performance of personalized cancer care. RNA and DNA were extracted from tumor specimens and processed to obtain the necessary genomic/metagenomic data, as specified previously.
8 The study analysis was carried out in several steps:

1) Step 1: selection of the models and variables from the preliminary study of EC recurrence prediction, which included clinical, genomic and pathologic data; 8
2) Step 2: extraction and curation of microbiome data (at the genus level) from RNA sequencing (RNAseq) experiments using the Kraken2, Bracken and exotic software packages;
3) Step 3: topic modelling, as described previously, 20 to determine the microbiomes (genera) associated with EC recurrence by risk group;
4) Step 4: determination of the social and environmental determinants of health associated with EC recurrence;
5) Step 5: integration of significant genomic, microbiome and environmental factors (resulting from the previous steps) with topic modelling, to identify the factors associated with EC recurrence by risk group;
6) Step 6: assessment of how the variables from significant topics associated with EC recurrence performed as prediction models of recurrence, using machine and deep learning analytics (ML and DL) with MATLAB apps and TensorFlow.

Finally, the integration of elements of EC recurrence and the EC prediction modelling were externally tested (validated) in an independent EC dataset, TCGA.

Patients' inclusion, clinical, pathological and genomic data: The inclusion of patients in risk groups, the clinical and pathological data included, and the extraction, processing and analysis of genomic data (Step 1 of the study design) are detailed in a previous publication. 8 Briefly, we included all patients in the ORIEN database with EC, of any histology, who had information about recurrent disease. Patients with EC recurrence (cases) were those in whom, after completion of treatment with no evidence of disease (NED), EC reappeared locally (vaginal), regionally (pelvis) or distally. Index cases included women with a new EC event after treatment, those who had cancer at the last surveillance visit, and those who died from cancer.
Controls were patients with NED during the whole follow-up. A total of 892 women with EC who had RNA and DNA sequenced and had recurrence information were included in this analysis: 186 with EC recurrence (cases) and 706 without (controls) (Supplementary Table 1, also in Gonzalez Bosquet J., et al. 8). Included patients entered the ORIEN database between 2004 and 2021. Patients with 2009 FIGO stage I and histological grade 1 or 2 endometrioid EC had an overall recurrence rate of 11.6% (38/329) and were considered low-risk for recurrence. Patients with histological grade 3 endometrioid EC or with FIGO stage II-IV had an overall recurrence rate of 21.3% (69/324) and were considered high-risk for recurrence. Patients with non-endometrioid EC (serous, carcinosarcoma, clear cell, undifferentiated, mixed) had an overall recurrence rate of 33.1% (79/239) and were at even higher risk for recurrence. Baseline variables were collected after surgery, when histologic type, FIGO stage and other clinical and demographic characteristics were known. The models and variables resulting from the preliminary study of EC recurrence prediction, which included clinical, genomic and pathologic data, 8 were selected and incorporated into the integrated dataset to be analyzed with topic modelling (Supplementary Table 2, also in Gonzalez Bosquet J., et al. 8).

Tumor microenvironment (TME) data: Microbiome data. Data preprocessing: CRAM files were downloaded from the ORIEN server and then converted into BAM files with samtools for further analysis. Analyses were performed as outlined by the NCI Genomic Data Commons (GDC - https://docs.gdc.cancer.gov/Data/Introduction/ ). The STAR suite (including STAR-Fusion) was used to align the transcriptome to the genome assembly version CHM13 T2T.
21, 22 We used the exotic pipeline to broadly but conservatively identify microbes present in the tumors while removing technical artifacts and contaminants from the dataset (Step 2 of the study design). 23 First, exotic maps raw reads with quality scores (FASTQs) to the human reference genome, with a second alignment pass following the standard workflow of TCGA and other large-scale sequencing efforts. Next, exotic aligns the unmapped reads to a wide range of non-human genomes, including bacteria, archaea, viruses, fungi, and a subset of other eukaryotes, using the KrakenUniq option from the Kraken2 pipeline. 24 Then, it uses Bracken to estimate abundance at the genus level from the resulting KrakenUniq classification. 25 Next, exotic filters contaminants in two phases: statistical filtering and literature matching. 23 Finally, the outputs are normalized to remove technical artifacts. In summary, the statistical filtering step discards a small fraction of the reads, although these reads represent a large fraction of the total microbes, while the literature-based filtering removes a large fraction of the reads but relatively few taxa.

Data analysis: Topic modeling with Latent Dirichlet Allocation (LDA) 26, 27 was used to assess changes in microbial communities between samples (Step 3 of the study design): 1) first, the ldatuning package determined the optimal number of latent topics for the analysis; 2) then, we used the topicmodels package to evaluate differences in microbial communities by examining topic distributions (both are R packages). 28 Differences between the two groups (controls vs. cases) were considered statistically significant at false discovery rate (FDR)-adjusted p-values < 0.05. Topic modeling, a natural language processing (NLP) technique, allows us to assess how microbiome communities differ both quantitatively and in their composition.
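To make the analogy between samples and documents concrete, the following is a minimal sketch of LDA fitted by collapsed Gibbs sampling in plain Python. It is not the R `ldatuning`/`topicmodels` workflow used in the study; the function name `lda_gibbs`, the hyperparameters `alpha` and `eta`, and the toy genus counts are all assumptions made for illustration. Each tumor sample is treated as a "document" whose "words" are genus-level reads.

```python
import random


def lda_gibbs(docs, n_topics, vocab_size, alpha=0.1, eta=0.01, n_iter=200, seed=0):
    """Fit LDA by collapsed Gibbs sampling (illustrative, not optimized).

    docs: list of samples, each a list of genus indices (one entry per read).
    Returns (theta, beta): per-sample topic proportions and
    per-topic genus probabilities.
    """
    rng = random.Random(seed)
    n_dk = [[0] * n_topics for _ in docs]               # topic counts per sample
    n_kw = [[0] * vocab_size for _ in range(n_topics)]  # genus counts per topic
    n_k = [0] * n_topics                                # total reads per topic
    z = []                                              # topic assignment per read

    # Random initial assignment of every read to a topic.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_d.append(k)
            n_dk[d][k] += 1
            n_kw[k][w] += 1
            n_k[k] += 1
        z.append(z_d)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                # Remove the current assignment from the counts.
                n_dk[d][k] -= 1
                n_kw[k][w] -= 1
                n_k[k] -= 1
                # Full conditional of the topic given all other assignments.
                weights = [
                    (n_dk[d][t] + alpha) * (n_kw[t][w] + eta) / (n_k[t] + vocab_size * eta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = k
                n_dk[d][k] += 1
                n_kw[k][w] += 1
                n_k[k] += 1

    theta = [
        [(n_dk[d][t] + alpha) / (len(doc) + n_topics * alpha) for t in range(n_topics)]
        for d, doc in enumerate(docs)
    ]
    beta = [
        [(n_kw[t][w] + eta) / (n_k[t] + vocab_size * eta) for w in range(vocab_size)]
        for t in range(n_topics)
    ]
    return theta, beta


# Hypothetical genus-level read counts, one row per tumor sample.
genera = ["Bacillus", "Escherichia", "Acinetobacter", "Candida"]
counts = [
    [30, 25, 1, 0],
    [28, 30, 0, 2],
    [1, 0, 27, 33],
    [0, 2, 35, 24],
]
# Expand each count row into a bag of genus indices.
docs = [[g for g, c in enumerate(row) for _ in range(c)] for row in counts]
theta, beta = lda_gibbs(docs, n_topics=2, vocab_size=len(genera))
```

Here `beta[t][w]` plays the role of a per-topic probability for genus `w`, and `theta[d]` is the topic distribution of sample `d`, which is the quantity that would be compared between cases and controls.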
By treating microbial communities as “topics”, much as words cluster in textual data, we were able to model the high-dimensional interactions between different genera and identify potentially meaningful patterns and associations with EC recurrence. Again, variables (genera) included in the resulting models were selected and incorporated into the integrated dataset to be analyzed with topic modelling.

Tumor immune environment. To evaluate the tumor microenvironment and the immune response induced by the tumor, we assessed the immune contexture (the type of tumor-infiltrating immune cells) 29 and the cancer-associated fibroblasts (CAF). This evaluation can be informative about the types of inflammatory, angiogenic, and desmoplastic reactions occurring in a tumor. We measured the immune contexture with quanTIseq, a computational pipeline that applies a deconvolution approach to bulk RNAseq data. 29 We used the RNAseq data resulting from previous steps. We also used the Microenvironment Cell Populations (MCP)-counter, a transcriptome-based computational method that quantifies the abundance of tissue-infiltrating immune and non-immune stromal cell populations in non-hematopoietic human tumors. 30 This method also uses the gene expression matrix resulting from RNAseq to determine the abundance score for CD3+ T cells, CD8+ T cells, cytotoxic lymphocytes, NK cells, B lymphocytes, cells originating from monocytes (monocytic lineage), myeloid dendritic cells, neutrophils, as well as endothelial cells and fibroblasts.

Social and Environmental data: Environmental variables. Air pollution data for the year 2010 were obtained at the county level for four gases (O3, CO, SO2, NO2) and two aerosols (PM10, PM2.5) from the Center for Air, Climate and Energy Solutions (CACES; https://www.caces.us/data ).
These air pollution data were linked with ORIEN data for the EC study cohort by Aster Insights collaborators, who handle the data pull, by cross-referencing unique county identifiers in the air pollution data with five-digit zip codes for each patient in the EC study cohort. Air pollution data for each eligible patient were provided with the county code only, to prevent identification of individual patients. County code information was not available for all patients.

Social and environmental determinants of health. Social and environmental determinants of health were derived from the Centers for Disease Control and Prevention’s Environmental Justice Index (EJI) dataset. The EJI is a nationwide, place-based index designed to capture cumulative health impacts from environmental and social burdens at the census tract level. It comprises 36 indicators organized into 10 domains (Racial/Ethnic Minority Status, Socioeconomic Status, Household Characteristics, Housing Type, Air Pollution, Potentially Hazardous and Toxic Sites, Built Environment, Transportation Infrastructure, Water Pollution, and Preexisting Chronic Disease Burden), which are grouped into three overarching modules: Social Vulnerability, Environmental Burden, and Health Vulnerability. For this study, we extracted the percentile rank scores for each of the 10 domains from the EJI dataset, which was downloaded from the Agency for Toxic Substances and Disease Registry website. In addition, food access data were obtained from the United States Department of Agriculture’s Food Access Research Atlas, specifically the variable low access tract at 1 mile for urban areas or 10 miles for rural areas.
This variable is defined as “a low-income tract with at least 500 people or 33% of the population living more than 1 mile (urban) or more than 10 miles (rural) from the nearest supermarket, supercenter, or large grocery store.” These census tract–level social and environmental determinants were linked to patient-level data using the county code corresponding to each census tract as the unique identifier. For counties containing multiple census tracts, data were summarized using a weighted mean, with weights based on census tract population size. Additional details on all variables used to capture social and environmental determinants of health are provided in Supplementary Table 3.

Data analysis. We used bipartite network analysis to identify clusters (subtypes) of both patients and social and environmental determinants of health. The bipartite network analysis takes input data at the county-code level and outputs a quantitative summary (number, size, and statistical significance) along with a network visualization of the identified clusters. 31 Statistical significance was assessed by comparing the observed value to a null distribution generated from 1,000 random permutations of the network. 32 Compared to traditional clustering methods such as hierarchical clustering or principal component analysis, bipartite networks offer two key advantages: (1) they operate autonomously without requiring user-defined parameters, and (2) they define clusters that include both patients and variables. 32 We used bipartite networks to detect clusters among air pollution variables and social determinants of health, and to assess associations between cluster membership and recurrence of disease.
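The population-weighted county summary described above can be sketched as follows. This is an illustration only: the function name `county_weighted_means`, the county codes, the populations and the scores are hypothetical and are not taken from the EJI dataset.

```python
from collections import defaultdict


def county_weighted_means(tracts):
    """Aggregate tract-level scores to county level, weighting by tract population.

    tracts: iterable of (county_code, population, score) tuples.
    Returns a dict mapping county_code to the population-weighted mean score.
    """
    weighted_sum = defaultdict(float)
    population = defaultdict(float)
    for county, pop, score in tracts:
        weighted_sum[county] += pop * score
        population[county] += pop
    # Counties with zero total population are skipped to avoid division by zero.
    return {c: weighted_sum[c] / population[c] for c in weighted_sum if population[c] > 0}


# Hypothetical census tracts: (county code, population, percentile-rank score).
tracts = [
    ("19103", 4000, 0.20),
    ("19103", 1000, 0.70),
    ("19113", 2500, 0.50),
]
county_scores = county_weighted_means(tracts)
for county, score in sorted(county_scores.items()):
    print(f"{county}: {score:.2f}")
```

In this toy input the larger tract dominates the first county, so its summary (0.30) sits closer to 0.20 than the unweighted mean of 0.45 would.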
The bipartite network analysis separated air pollution data from social determinants of health, so we performed a multivariate lasso regression of EC recurrence for each domain and selected the variables that were most informative for EC recurrence prediction. The selected variables from each domain were then incorporated into the integrated dataset and analyzed separately with topic modelling.

Integration of resulting models: All elements significant in the clinical, pathological, genomic, microbiological, and environmental analyses were added to integrated databases and analyzed with topic modelling to assess patterns and associations between different data types and EC recurrence (Step 5 of the study design). Because environmental and social variables were less available in the dataset, and were separated by bipartite networks, models were fitted with and without them.

Training, validating and testing EC recurrence models: Finally, we trained, validated and tested models with the integrated datasets that included all selected variables resulting from topic modelling (Step 6 of the study design). For prediction modelling we used lasso regression, other machine learning (ML) methods included in MATLAB apps, and deep learning (DL) with TensorFlow analytics. Briefly, for the MATLAB analysis, we used 10-fold cross-validation for training and held out 20% of EC samples for testing, using 35 different ML methods on the ORIEN dataset. The best models were selected for reporting. Model explanation was performed on training and testing models using Shapley values. 33 In the context of machine learning prediction, the Shapley value of a feature for a query point explains the contribution of the feature to a prediction (the score of each class, for classification) at the specified query point. We used the Shapley values of predictors to interpret which predictors have the largest (or smallest) average impact on model output magnitude.
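As an illustration of what a Shapley value measures, the sketch below computes exact Shapley values for a single query point by enumerating all feature coalitions. It is a simplified stand-in, not the MATLAB procedure used in the study: features outside a coalition are replaced by a fixed baseline value (an assumption), and the scoring function and its coefficients are hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values of each feature for one query point.

    Features absent from a coalition are set to their baseline value.
    The cost grows as 2**n_features, so this is only for small illustrations.
    """
    n = len(x)

    def value(coalition):
        point = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(point)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Weight of a coalition of this size in the Shapley average.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


def score(point):
    """Hypothetical linear recurrence score over three features."""
    return 0.5 * point[0] - 0.2 * point[1] + 0.1 * point[2]


x = [2.0, 1.0, 3.0]         # query point (hypothetical feature values)
baseline = [0.0, 0.0, 0.0]  # reference values for "absent" features
phi = shapley_values(score, x, baseline)
```

For a linear score with a zero baseline, each Shapley value reduces to the coefficient times the feature value, and the values sum to the difference between the prediction at the query point and at the baseline.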
Additionally, we used TensorFlow 34 in a Jupyter notebook with the Keras application programming interface (API) 35 as the DL method. This is a modification of the TensorFlow core tutorial ‘Classification of imbalanced data’ ( www.tensorflow.org/tutorials/structured_data/imbalanced_data ). Normalization of the data was performed using the sklearn StandardScaler. Models had 16 layers, with a dropout layer to reduce overfitting, and an output sigmoid layer that returns the probability of EC recurrence. The input layer of each model contained as many nodes as features to analyze. Training accounted for outcome class weights and for unbalanced data, using oversampling methods. Validation was done using 15% of samples, and 25% of samples were kept for testing the models.

Validation in TCGA EC data: Data preprocessing. TCGA BAM files were first converted to FASTQ files with the samtools pipeline. Then, the rest of the genomic and microbiome extraction was performed as detailed for the ORIEN dataset.

Validation analysis. Validation was performed using the TCGA EC dataset, which included endometrioid and serous EC (TCGA-UCEC) 8 and endometrial carcinosarcoma (TCGA-UCS). Briefly, after permission was granted to access controlled data by the Genomic Data Commons (GDC) Data Portal (dbGaP#29868), TCGA-UCEC RNAseq (406 endometrioid and 136 serous EC) and TCGA-UCS RNAseq (56 endometrial carcinosarcomas) files in BAM format were downloaded from women with EC. Main clinical characteristics are described in Supplementary Table 4. Of note, non-endometrioid cases in TCGA did not include any clear cell, undifferentiated, or dedifferentiated carcinomas. For validation we used only those significant variables resulting from topic modelling that were selected and included in the integrated dataset (Step 5 of the study design).
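The 15% validation and 25% testing split combined with oversampling, as described for the DL models, could look roughly like the sketch below. It uses only the Python standard library rather than TensorFlow, and it assumes that oversampling is applied to the training portion only, which the text does not specify. The function name `split_and_oversample` is hypothetical; the label counts mirror the low-risk group (38 recurrences among 329 patients).

```python
import random


def split_and_oversample(labels, val_frac=0.15, test_frac=0.25, seed=0):
    """Return index lists (train, val, test); only train is oversampled.

    labels: list of 0/1 outcomes (1 = recurrence).
    """
    rng = random.Random(seed)
    idx = list(range(len(labels)))
    rng.shuffle(idx)
    n_test = round(test_frac * len(idx))
    n_val = round(val_frac * len(idx))
    test = idx[:n_test]
    val = idx[n_test:n_test + n_val]
    train = idx[n_test + n_val:]

    pos = [i for i in train if labels[i] == 1]
    neg = [i for i in train if labels[i] == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    if minority:
        # Draw minority samples with replacement until the classes are balanced.
        extra = rng.choices(minority, k=len(majority) - len(minority))
        train = train + extra
        rng.shuffle(train)
    return train, val, test


# Low-risk group from the text: 38 recurrences among 329 patients.
labels = [1] * 38 + [0] * 291
train, val, test = split_and_oversample(labels)
n_cases = sum(labels[i] for i in train)
print(len(train), n_cases, len(val), len(test))
```

Keeping the validation and test indices free of duplicated cases means the reported performance is measured on the original class balance rather than on the oversampled one.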
We first used the TCGA dataset to externally validate the models associated with EC recurrence that included clinical, genomic and microbiome data. County codes were not available for TCGA patients, so we were not able to link the metagenomic/genomic data with environmental and social determinants of health. Additionally, we used TCGA datasets for external testing of the best prediction models for EC recurrence trained in the ORIEN set. The best prediction models of EC recurrence were tested with ML (MATLAB) and DL (TensorFlow), using TCGA data as the testing set. Survival analyses with Cox proportional hazards models and Kaplan-Meier survival curves were performed in R with the survival and ggsurvfit packages.

RESULTS

Tumor-associated microbiome communities associated with EC recurrence: First, we identified the optimal number of latent topics for each EC recurrence risk group: low-risk (85 latent topics), high-risk (70 latent topics) and non-endometrioid (55 latent topics) (Figure 1, left panels). Then, we used latent Dirichlet allocation (LDA) to identify differentially abundant topics by comparing topic distribution profiles between recurrence groups (Figure 1, middle panels). Topics were considered statistically significant if they met an FDR-corrected p-value threshold of < 0.05 and demonstrated negative log2 fold changes.

Tumor microenvironment features associated with EC recurrence: We assessed the tumor immune microenvironment and CAF using gene expression patterns derived from RNAseq (Figure 2). Topic modelling was applied to determine which of these cellular components were most informative for EC recurrence. Immune cell populations identified in this initial analysis were subsequently introduced into the integrated topic modeling framework alongside significant genomic, metagenomic and clinical features, stratified by risk group.
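The topic-selection rule stated above (FDR-corrected p-value < 0.05 together with a negative log2 fold change) can be sketched as follows. The text does not name the FDR procedure; the Benjamini-Hochberg adjustment used here is an assumption, and the topic names, p-values and fold changes are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical topics: (name, raw p-value, log2 fold change).
topics = [
    ("topic_1", 0.001, -1.2),
    ("topic_2", 0.012, 0.8),
    ("topic_3", 0.030, -0.4),
    ("topic_4", 0.200, -0.1),
]
adj = benjamini_hochberg([p for _, p, _ in topics])
significant = [
    name
    for (name, _, lfc), q in zip(topics, adj)
    if q < 0.05 and lfc < 0
]
print(significant)
```

With these toy values, `topic_2` passes the FDR threshold but is excluded by its positive fold change, while `topic_4` fails the FDR threshold.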
Environmental data associated with EC recurrence: For low-risk EC, five out of six air pollutants were informative for EC recurrence (CO, NO2, O3, PM10, PM2.5), while four out of six were informative for high-risk EC and three out of six for non-endometrioid EC (Figure 3). The aerosols PM10 and PM2.5 consistently showed increased risk (OR > 1), as did O3, while CO, NO2, and SO2 showed inverse associations (OR < 1). Variables selected by this model were then incorporated into the integrated datasets together with significant genomic, metagenomic and clinical variables for the final analysis. Results of the initial lasso regression for social and environmental determinants of health are presented in Supplementary Figure 1.

Integration of resulting models: All features identified as significant across clinical, pathological, genomic, microbiological, and environmental analyses were incorporated into an integrated dataset and analyzed using topic modelling. Three integrated models were evaluated: a) a model including clinical, genomic, and immune features (Clin+Gen+Imm); b) a model additionally incorporating air pollution variables (Clin+Gen+Imm+Pol); and c) a model further including social and environmental determinants of health data (Clin+Gen+Imm+Env). This stepwise modeling strategy was employed because county identifiers linking environmental data to patients were available for air pollution in 74% of cases and for social/environmental determinants of health in only 64% of patients. The composition of significant topics across all three EC risk groups (low-risk, high-risk, and non-endometrioid) with and without environmental variables is summarized in Figure 4. Topic model optimization and computational performance for each integrated model (Clin+Gen+Imm, Clin+Gen+Imm+Pol, and Clin+Gen+Imm+Env) are presented in Supplementary Figures 2, 3, and 4, respectively.
Per-topic variable probabilities, detailing the expected average probability for each component within a given topic, indicated that Bacillus was the most probable microbial genus across all risk groups, especially for low-risk EC (28%) but also for the non-endometrioid type (10%) (Supplementary Table 5). In addition, Stenotrophomonas (10%) and Thermothielavioides (27%) were frequently observed in significant recurrence-associated topics in low-risk EC. Among clinical variables, BMI was the only feature retained after data integration; however, it was observed exclusively in the low-risk group and at a low probability (0.6%). Variables with higher per-topic probabilities (>10%) were predominantly gene expression features. In the high-risk group, these included ENSG00000137745.12 (MMP13), ENSG00000143556.9 (S100A7), ENSG00000198732.10 (SMOC1), and ENSG00000278540.5 (ACACA). In the non-endometrioid group, highly probable genes included ENSG00000075340.23 (ADD2), ENSG00000105880.7 (DLX5), ENSG00000137491.15 (SLCO2B1), and ENSG00000188039.14 (NWD1), along with the pseudogenes ENSG00000128262.8 (POM121L9P) and ENSG00000234975.6 (FTH1P2). CNVs were detected in both low-risk and non-endometrioid groups; however, their probabilities within significant topics were consistently low. Similarly, immune contexture features, CAF, and gene isoform expression contributed at low frequency. SNVs were infrequent and were not prominent in any risk group. Although air pollutants and social/environmental determinants of health were present across all models, their per-topic probabilities were uniformly low (<1%). Notably, the inclusion of environmental variables altered the composition of microbiome-associated features within the resulting topic models, suggesting interactions between environmental exposures and tumor-associated microbial communities.
Training, validation and testing models for EC recurrence: We developed, validated and tested predictive models for EC recurrence using features from significant topics identified in the integrated dataset. This analysis evaluated whether topic-selected features were also informative predictors of recurrence. Models were trained and cross-validated using two analytical platforms: MATLAB-based machine learning (ML) and TensorFlow-based deep learning (DL). For MATLAB, only the best-performing models were retained from 35 candidate configurations. For TensorFlow, training accounted for class imbalance through oversampling strategies, as recurrence events comprised approximately 10-30% of samples. Separate recurrence prediction models were trained for each risk group: low-risk EC (Figure 5), high-risk EC (Supplementary Figure 5) and non-endometrioid EC (Supplementary Figure 6). For each group we trained models including different combinations of data. Figure 5 summarizes model performance for the low-risk group, as measured by the area under the receiver operating characteristic curve (AUC), for models incorporating: clinical and metagenomic data (Clin+Gen; Figure 5a and 5b); clinical, microbiome, genomic and immune contexture data (Clin+Gen+Imm; Figure 5c and 5d); clinical, microbiome, genomic, immune contexture, and air pollution data (Clin+Gen+Imm+Pol; Figure 5e and 5f); and clinical, microbiome, genomic, immune contexture, and social/environmental data (Clin+Gen+Imm+Env; Figure 5g and 5h). Equivalent modeling strategies were applied to the high-risk (Supplementary Figure 5) and non-endometrioid groups (Supplementary Figure 6). Across all three risk groups and both analytical platforms, models containing clinical, microbiome, genomic and immune contexture features (Clin+Gen+Imm) demonstrated the strongest performance in the testing set. For low-risk EC, AUCs reached 0.93 using MATLAB (Figure 5c) and 0.88 using TensorFlow (Figure 5d).
In high-risk EC, corresponding AUCs were 0.9 (MATLAB; Supplementary Figure 5c) and 0.85 (TensorFlow; Supplementary Figure 5d). In non-endometrioid EC, AUCs were 0.79 (MATLAB; Supplementary Figure 6c) and 0.76 (TensorFlow; Supplementary Figure 6d). Although inclusion of environmental variables, namely air pollution (Clin+Gen+Imm+Pol; Figure 5e and 5f) and social/environmental determinants of health (Clin+Gen+Imm+Env; Figure 5g and 5h), reduced sample size due to missing county-level data (see the confusion matrices for the low-risk and non-endometrioid groups; Supplementary Figure 6e-h), model performance in testing sets remained acceptable. This was particularly evident in the high-risk group, where AUC reached 0.89 for Clin+Gen+Imm+Pol and 0.8 for Clin+Gen+Imm+Env models (Supplementary Figure 5e-h). To assess the relative contribution of individual predictors within the best-performing models, we applied Shapley value analysis (Supplementary Figure 7). Incorporation of air pollution measures and social/environmental determinants of health consistently altered the ranking and composition of the most influential predictors across all three EC risk groups, with particularly pronounced effects on microbiome-associated features (Supplementary Figure 7d-i). These effects were most evident in the non-endometrioid group (Supplementary Figure 7i), where multiple social/environmental determinants, including proximity to high-volume roadways and airports, proximity to impaired water bodies, and limited food access, emerged as influential contributors to recurrence prediction.

External testing of models for EC recurrence: After downloading and pre-processing the TCGA EC dataset using the same pipeline applied to the ORIEN cohort, we extracted variables corresponding to those retained in the integrated topic models encompassing clinical, genomic/metagenomic and immune contexture features.
To first assess whether the EC risk group stratification derived from ORIEN was comparable in TCGA, we evaluated progression-free survival (PFS) across low-risk, high-risk, and non-endometrioid groups in both datasets (Supplementary Figure 8). Although differences in PFS were observed, the 95% CIs for all three risk groups overlapped substantially, particularly during the first 2-3 years of follow-up. TCGA represents a valuable external resource but has known limitations that may affect validation performance, 8 including missing variables, limited follow-up and case status reporting, incompletely staged cases, and differences in timing of biospecimen collection. These factors are likely to contribute to the divergence of PFS curves observed later in follow-up. County-level identifiers are not available in TCGA because they constitute personally identifying data; therefore, environmental exposures could not be linked to TCGA EC cases. In addition, several features present in the integrated ORIEN topic models were not available in TCGA: a) in low-risk EC, 29% of significant components were missing: CNVs (mainly in long non-coding RNAs) and some microbial genera, such as Thermothielavioides, with a probability of 27% of being a component of the resulting topic, Theileria and Rhizoctonia, with probabilities below 8%, and the rest, with probabilities below 4%; b) in high-risk EC, 18% of significant components were missing, such as the genera Malassezia, with a probability of 4%, and Candida, with a probability of 5%, while the remaining missing components had probabilities of 2% or below; c) in the non-endometrioid group, only 11% of components were missing, all of them with probabilities below 0.5%. Notably, Thermothielavioides was absent from all resulting topic models after air pollution variables were introduced, while Theileria, Rhizoctonia, Malassezia, and Candida appeared in only one of four topics when air pollution was included (Figure 4).
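The comparison of feature availability described above reduces to a simple calculation: for a given topic, which components are absent from the external cohort, what fraction of components they represent, and how much of the topic's probability mass they carry. The sketch below shows that arithmetic under stated assumptions; the function name `missing_share` and the feature names and probabilities in the example are hypothetical placeholders, not values from the study.

```python
def missing_share(topic_probs, available):
    """Summarize topic components that are unavailable in an external cohort.

    topic_probs: dict mapping component name -> per-topic probability.
    available:   set of component names present in the external cohort.
    Returns (missing names, fraction of components missing,
             fraction of probability mass missing).
    """
    missing = [name for name in topic_probs if name not in available]
    count_share = len(missing) / len(topic_probs)
    mass_share = sum(topic_probs[name] for name in missing) / sum(topic_probs.values())
    return missing, count_share, mass_share


if __name__ == "__main__":
    # Hypothetical topic with four generic components.
    topic = {"feat_a": 0.5, "feat_b": 0.25, "feat_c": 0.125, "feat_d": 0.125}
    external_features = {"feat_a", "feat_c", "feat_d"}
    names, by_count, by_mass = missing_share(topic, external_features)
    print(names, by_count, by_mass)
```

Reporting both shares is a design choice: a cohort can lack many low-probability components while retaining most of a topic's probability mass, or the reverse.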
We next performed topic modeling in TCGA using all features overlapping with the ORIEN-derived models (Figure 6). Despite missing variables, microbiome-related components in TCGA topic models demonstrated similar probability distributions to those observed in ORIEN, with overlapping 95% confidence intervals (Supplementary Figure 9). Two exceptions were noted: Bacillus exhibited a higher probability in TCGA than in ORIEN (86% vs. 4%; Supplementary Figure 9b), and the probability of Escherichia also differed between the two datasets (0.3% vs. 8%; Supplementary Figure 9c). Given the incomplete overlap of variables between datasets, we retrained recurrence prediction models in the ORIEN cohort using only features available in TCGA to enable external validation. As in prior analyses, models were trained and validated using MATLAB (ML) and TensorFlow (DL) approaches, with oversampling applied to address class imbalance. The TCGA cohort was then used as an independent external test set (Figure 7). Although overall model performance was reduced relative to internal testing (Figure 5), reflecting the loss of informative variables, the AUCs obtained in TCGA testing fell within the 95% confidence intervals of the newly trained ORIEN models, indicating no statistically meaningful performance degradation. Finally, Shapley value analysis was applied to both the ORIEN-trained models and TCGA-tested models to assess predictor importance (Supplementary Figure 10). In both low-risk and high-risk EC groups, the most influential contributors were concordant between training and testing models: Bacillus and Escherichia in low-risk EC (Supplementary Figures 10a and 10d), and SMOC1 (ENSG00000198732), ENSG00000214776 pseudogene expression and Acinetobacter in high-risk EC (Supplementary Figures 10b and 10e).
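The check described above, whether an external AUC falls within the 95% confidence interval of the internally evaluated model, can be sketched as follows. The text does not state how the confidence intervals were derived, so a percentile bootstrap is used here purely as an assumption; the function names `bootstrap_auc_ci` and `within_ci` are hypothetical, and the sketch assumes both outcome classes are present in the data.

```python
import random


def auc(scores, labels):
    """AUC as the probability that a positive case outscores a negative case."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc_ci(scores, labels, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the AUC.

    Assumption: the percentile method; the study's CI method is not stated.
    Resamples containing a single class are skipped because AUC is undefined.
    """
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:
            stats.append(auc([scores[i] for i in idx], ys))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


def within_ci(value, ci):
    """True if `value` lies inside the closed interval `ci`."""
    return ci[0] <= value <= ci[1]


if __name__ == "__main__":
    # Hypothetical internal test-set scores and outcomes.
    internal_scores = [0.9, 0.8, 0.65, 0.6, 0.4, 0.35, 0.3, 0.1]
    internal_labels = [1, 1, 0, 1, 0, 1, 0, 0]
    ci = bootstrap_auc_ci(internal_scores, internal_labels)
    external_auc = 0.75  # hypothetical value from an external cohort
    print(ci, within_ci(external_auc, ci))
```

An external AUC inside the interval is compatible with no performance loss, but with small test sets the interval can be wide, so the check has limited power to detect real degradation.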
In non-endometrioid EC, multiple predictors contributed consistently across training and testing models, including T cells, CD8+ T cells, CNVs involving the ADA and KRT9 genes, and microbial features such as Bacillus and Escherichia (Supplementary Figures 10c and 10f). DISCUSSION Endometrial cancer (EC) recurrence is a complex, multifactorial process that cannot be fully explained by tumor-intrinsic features alone. In this study, we applied an integrative, systems-level framework to model EC recurrence as an emergent property of interactions among clinical factors, tumor genomics, immune contexture, tumor-associated microbial communities, and environmental exposures. Using topic modeling to capture coordinated, cross-domain patterns and machine learning approaches to evaluate predictive performance, we identified reproducible, risk group-specific recurrence signatures that generalized across analytical platforms and independent datasets. Importantly, features selected through topic modeling retained strong predictive value in recurrence models, supporting the biological and clinical relevance of these integrated patterns. Together, these findings underscore the multifactorial nature of EC recurrence and demonstrate the feasibility of integrated, multi-domain modeling to interrogate recurrence biology at scale. What This Study Adds A central finding of this study is that incorporation of environmental and neighborhood-based exposures reshaped recurrence-associated topic composition across all risk groups. Microbial communities that were prominent in models incorporating only tumor-intrinsic features were attenuated or absent after inclusion of air pollution and social–environmental variables, indicating that recurrence-associated bacterial signatures are strongly context-dependent. These findings suggest that tumor-associated microbial signals reflect broader tumor–host–environment interactions rather than static or isolated microbial effects.
Tumor-Microbiome Interactions in EC Recurrence Across all risk groups, Bacillus emerged as the bacterial genus with the highest per-topic probability, although its directionality differed by risk category. Decreased representation of Bacillus was associated with recurrence in low-risk EC, whereas increased representation was linked to recurrence in high-risk and non-endometrioid disease. This bidirectional association suggests that tumor-associated bacterial signals may reflect underlying tumor biology, host factors, or treatment context rather than uniform oncogenic or protective effects. Similar context-dependent microbial associations have been reported in other hormonally influenced malignancies, 42 supporting the interpretation of these signals as ecological markers of tumor state. In the low-risk EC group, microbes that were quite abundant in models before environmental factors were introduced were scarcely seen afterwards, such as the genera Thermothielavioides, Theileria, Rhizoctonia, Malassezia, and Candida. The reason for this change in microbiome composition is difficult to determine because our study design can establish association but not causality; nevertheless, this is an intriguing observation that warrants further follow-up. Importantly, bacterial communities identified in this study were inferred from tumor-derived bulk RNA sequencing data rather than from dedicated microbiome sampling. As such, these findings should be interpreted as relative, comparative signals reflecting tumor-associated microbial nucleic acids rather than direct measures of viable or mucosal microbiota. Nevertheless, consistent identification of specific genera across modeling approaches, risk strata, and external validation supports their relevance as ecological markers of tumor–host–environment interactions rather than isolated microbial drivers. These results should therefore be viewed as hypothesis-generating.
Social and Environmental Determinants Associated with EC Recurrence Environmental exposures and social determinants of health emerged as consistent modifiers of recurrence-associated patterns across EC risk groups. Ozone (O₃) was repeatedly identified as a component of recurrence-associated topics in all three risk strata, supporting a biologically plausible link between ambient oxidative stress and EC recurrence. O₃ exposure has been implicated in oxidative DNA damage, inflammatory signaling, immune modulation, and estrogen dysregulation, pathways central to EC pathogenesis and progression, particularly in hormonally responsive tissues. 36-45 Although individual-level exposure assessment was not feasible, the reproducible association of O₃ with recurrence-associated topics suggests that environmental oxidative stress may act as a contextual modifier of tumor biology rather than an isolated risk factor. In parallel, social and environmental determinants of health contributed to the composition of recurrence-associated topics, most prominently in high-risk and non-endometrioid EC. Features such as proximity to high-volume roadways and airports, impaired water bodies, and limited food access, proxies for structural and environmental disadvantage, were among the variables influencing these patterns. These findings are consistent with growing evidence linking neighborhood-level exposures to cancer outcomes and support a model in which place-based factors shape tumor biology through indirect, cumulative mechanisms. 46,47 Notably, these associations persisted despite individual-level race or ethnicity not emerging as dominant predictors, underscoring the potential importance of structural context beyond individual demographic characteristics. Risk Group–Specific Biological Programs Underlying EC recurrence. 
Recurrence-associated patterns differed substantially by clinical risk group, reinforcing the biological heterogeneity of EC recurrence pathways and arguing against a single, unified mechanism of relapse. Low-risk endometrioid EC recurrence was driven predominantly by metabolic and microbiome-associated features, with minimal persistence of clinical variables beyond body mass index (BMI). In contrast, high-risk and non-endometrioid tumors were characterized by greater contributions from gene expression programs, immune contexture, stromal activation, and environmental domains. This stratified behavior supports the concept that recurrence mechanisms, and therefore opportunities for refined risk stratification or intervention, may differ fundamentally across EC subtypes. Obesity, and its proxy BMI, are intrinsically linked to estrogen metabolism, EC pathogenesis, metabolic syndrome, and microbiome dysbiosis. 36 Accordingly, the persistence of BMI as a component of low-risk recurrence models is biologically plausible, particularly given the estrogen-responsive nature of low-risk endometrioid tumors. In this group, recurrence-associated patterns reflected a coordinated imbalance involving reduced Bacillus , elevated ozone exposure, copy number alterations in genes primarily related to nucleocytoplasmic transport, and increased CAF representation, suggesting a convergence of hormonal, metabolic, microbial, and stromal influences that may favor tumor re-emergence. In contrast, clinical variables previously associated with recurrence risk in earlier analyses, including ethnicity, chemotherapy exposure, albumin, and red blood cell distribution width, did not persist within the integrated topic models for high-risk or non-endometrioid EC. 
Instead, recurrence in these groups was characterized by dysregulation of gene, pseudogene, and isoform expression involving T-cell signaling pathways, lipid and carbohydrate metabolism, folate transport and metabolism, and basal transcriptional machinery. These molecular programs co-occurred with pronounced immune and stromal features, including increased CAF abundance and M1 macrophage infiltration, as well as consistent microbiome dysbiosis marked by increased Bacillus and Candida and decreased Escherichia. Elevated ozone exposure was again observed, suggesting a recurring environmental backdrop across higher-risk disease. Notably, increased CAF representation emerged as a shared feature across all EC subtypes associated with recurrence, consistent with prior evidence implicating stromal remodeling in disease progression. 48 However, heightened M1 macrophage infiltration was restricted to high-risk and non-endometrioid tumors, underscoring risk group–specific immune dynamics. Together, these findings highlight that EC recurrence arises from distinct, subtype-dependent biological programs shaped by interacting tumor-intrinsic, microenvironmental, microbial, and environmental factors, rather than from a uniform recurrence pathway. Clinical and Translational Implications of Integrated Recurrence Modeling This study was not designed to produce an immediately deployable clinical prediction tool. Rather, it establishes a scalable, modular analytic framework for integrating heterogeneous biological and environmental data to interrogate EC recurrence biology at a systems level. Although the recurrence prediction models developed here performed comparably to previously published models, they consistently demonstrated that features emerging from integrated topic models encompassing clinical variables, tumor genomics, immune contexture, microbiome composition, and environmental exposures capture biologically meaningful recurrence-associated patterns.
Notably, incorporation of air pollution variables altered microbiome feature composition without degrading model performance, underscoring the tightly interconnected nature of tumor, host, microbial, and environmental factors influencing EC relapse. To minimize overfitting and assess generalizability, both topic models and recurrence prediction models were evaluated using the TCGA EC cohort as an independent external dataset. Because TCGA lacks several key variables present in ORIEN, including environmental exposures such as air pollutants, models were retrained in ORIEN using TCGA-compatible features prior to external testing. Despite these constraints, model performance in TCGA remained within the 95% confidence intervals of the retrained ORIEN models, indicating preserved predictive stability. Differences in cohort composition and data structure likely influenced external validation performance, including earlier-era sample collection in TCGA, more limited follow-up and case status reporting, and reduced histologic diversity within non-endometrioid tumors compared with ORIEN. These limitations highlight the challenges of external validation for integrated, multi-domain models and emphasize the importance of contemporary, deeply annotated cohorts for translational modeling. From an NIH translational research perspective, this work primarily occupies the T0–T1 space, generating integrated biological insights and analytically validated recurrence signatures rather than clinical decision tools. Importantly, however, it provides a foundation for progression toward T2 translation. Specifically, this framework enables prospective cohort studies incorporating longitudinal biospecimen collection, spatially resolved tumor and microenvironment profiling, and microbiome-specific assays to validate and refine recurrence-associated programs. 
Such studies can inform risk-adapted surveillance strategies, identify biologically defined subgroups most likely to benefit from targeted interventions, and guide the rational design of prevention or interception trials. By establishing a reproducible, extensible modeling architecture, this study advances the field toward clinically actionable integration of tumor biology, host context, and environmental exposures in EC recurrence. Strengths A major strength of this study is the integration of diverse data modalities within a unified analytical framework. By jointly modeling clinical, pathological, genomic, immune, microbiome, and environmental features, we move beyond traditional single-domain analyses and provide a more holistic view of EC recurrence biology. Topic modeling enabled identification of coordinated feature sets that reflect biologically meaningful processes rather than isolated variables, while subsequent machine learning models demonstrated that these topic-derived features are also robust predictors of recurrence across multiple risk groups. Another key strength is the use of complementary analytical platforms. Consistent performance across MATLAB-based machine learning and TensorFlow-based deep learning approaches supports the robustness of our findings and reduces the likelihood that results are driven by platform-specific modeling assumptions. The application of Shapley value analysis further strengthens interpretability by clarifying the relative contribution of individual predictors within the best-performing models, an important consideration for translational relevance. External validation using the TCGA endometrial cancer cohort represents an additional strength. Despite incomplete overlap of features and known limitations of TCGA data, recurrence models trained in ORIEN and tested in TCGA demonstrated performance that remained within the confidence bounds of internally validated models. 
Concordance of key predictors—particularly microbiome-associated features, immune cell populations, and select genomic alterations—between ORIEN training models and TCGA testing models provides evidence of generalizability and biological consistency across independent datasets. Limitations Several limitations should be acknowledged. First, tumor-associated microbial signals were inferred from bulk RNA sequencing rather than from dedicated microbiome sequencing platforms. As such, these findings should be interpreted as relative, comparative signals reflecting microbial nucleic acids present in tumor-derived data rather than direct measures of viable or mucosal microbiota. While consistent identification of specific genera across risk groups, modeling strategies, and external validation supports their relevance as ecological markers, functional and spatial validation will be required to clarify causal relationships. Second, integration of environmental exposures was constrained by data availability. County-level identifiers were required to link air pollution and social or environmental determinants of health to individual patients, resulting in reduced sample sizes for models incorporating these variables. This limitation likely attenuated model performance in some settings and may have reduced power to detect stronger environmental effects. Moreover, TCGA lacks county-level identifiers entirely, precluding external validation of environmental features and limiting assessment of their generalizability. Third, external validation using TCGA was affected by incomplete overlap of features between datasets, differences in follow-up duration, case status reporting, and timing of biospecimen collection. These factors necessitated retraining of recurrence models in ORIEN using TCGA-compatible features and likely contributed to reduced absolute performance in external testing. 
Nevertheless, the observation that TCGA testing performance remained within the confidence intervals of retrained ORIEN models supports the stability of the underlying predictive framework despite these constraints. CONCLUSION In summary, endometrial cancer recurrence emerges from this analysis as an emergent property of coordinated interactions among tumor-intrinsic programs and extrinsic contextual factors, rather than as the consequence of any single biological domain. By integrating clinical, pathological, genomic, immune, microbiome, and environmental features within a systems-level modeling framework, we demonstrate that these complex interactions can be quantified, interpreted, and externally validated across independent cohorts. Importantly, both intrinsic tumor and host characteristics and extrinsic environmental and social exposures contributed to recurrence-associated patterns, with their relative influence varying by clinical risk group. These findings underscore the biological heterogeneity underlying EC relapse and highlight the limitations of one-size-fits-all prediction approaches. Collectively, this work supports the need for more individualized, context-aware models of disease outcomes and establishes an extensible analytic foundation for future translational efforts aimed at improving EC recurrence risk stratification, prevention, and intervention. Declarations Ethical Approval and Consent to participate: Patients consent to donate medical records and tissue specimens for molecular profiling to the ORIEN network as an approach to improve design and performance of personalized cancer care. The ORIEN network is comprised of multiple cancer centers that have agreed to use the same Institutional Review Board (IRB)-approved protocol and consent (Total Cancer Care Protocol, TCC). 
Availability of data and materials: The data used in this study were generated through private funding by Aster Insights (www.asterinsights.com) in collaboration with the Oncology Research Information Exchange Network (ORIEN, www.oriencancer.org). Inquiries regarding access to the data or collaboration within ORIEN should be submitted at https://researchdatarequest.orienavatar.com/. For non-ORIEN academic researchers, only processed data outputs from clinical, whole exome and whole transcriptome data may be available, where applicable. Disclosure of potential conflict of interest: All authors certify that they have no affiliations with or involvement in any organization or entity with any financial interest (such as honoraria; educational grants; participation in speakers’ bureaus; membership, employment, consultancies, stock ownership, or other equity interest; and expert testimony or patent-licensing arrangements), or non-financial interest (such as personal or professional relationships, affiliations, knowledge or beliefs) in the subject matter or materials discussed in this manuscript. Consent for publication: All authors have reviewed and approved the manuscript for submission. Funding: This work was supported in part by the NIH grant 5R01CA99908-18 to Kimberly K. Leslie, where Gonzalez Bosquet was a co-investigator. Additionally, Dr. Gonzalez Bosquet received support from the basic research fund from the Department of Obstetrics & Gynecology (2014) at the University of Iowa. This work was also supported in part by the American Association of Obstetricians and Gynecologists Foundation (AAOGF, 2014) Bridge Funding Award and the Holden Comprehensive Cancer Center (HCCC) Support Grant (5P30CA086862-23). Authors’ Contributions: J.G.B., O.O.P., K.D., and D.S. wrote the main manuscript text; V.M.W., A.P., A.A.T., C.M.C., M.S.H., B.R.C., A.L.L., B.S., R.L.D., L.E.D., M.J.C., L.L., and L.C.
participated in the review and editing of the manuscript; J.G.B., D.S., A.C.T., and N.J. helped in analysis design and interpretation; J.G.B., R.H., and D.S. participated in data formatting, curation and analysis; R.J.R. and M.L.C. helped with data procurement and resources administration. References Siegel, R. L., Miller, K. D., Fuchs, H. E. & Jemal, A. Cancer statistics, 2022. CA Cancer J Clin 72, 7–33 (2022). https://doi.org:10.3322/caac.21708 Sheikh, M. A. et al. USA endometrial cancer projections to 2030: should we be concerned? Future Oncol 10, 2561–2568 (2014). https://doi.org:10.2217/fon.14.192 Westin, S. N. et al. Durvalumab Plus Carboplatin/Paclitaxel Followed by Maintenance Durvalumab With or Without Olaparib as First-Line Treatment for Advanced Endometrial Cancer: The Phase III DUO-E Trial. Journal of clinical oncology: official journal of the American Society of Clinical Oncology 42, 283–299 (2024). https://doi.org:10.1200/JCO.23.02132 Eskander, R. N. et al. Pembrolizumab plus Chemotherapy in Advanced Endometrial Cancer. The New England journal of medicine 388, 2159–2170 (2023). https://doi.org:10.1056/NEJMoa2302312 Mirza, M. R. et al. Dostarlimab for Primary Advanced or Recurrent Endometrial Cancer. The New England journal of medicine 388, 2145–2158 (2023). https://doi.org:10.1056/NEJMoa2216334 Del Carmen, M. G., Boruta, D. M., 2nd & Schorge, J. O. Recurrent endometrial cancer. Clin Obstet Gynecol 54, 266–277 (2011). https://doi.org:10.1097/GRF.0b013e318218c6d1 Restaino, S. et al. Recurrent Endometrial Cancer: Which Is the Best Treatment? Systematic Review of the Literature. Cancers (Basel) 14 (2022). https://doi.org:10.3390/cancers14174176 Gonzalez Bosquet, J. et al. Training, Validating, and Testing Machine Learning Prediction Models for Endometrial Cancer Recurrence. JCO Precis Oncol 9, e2400859 (2025). https://doi.org:10.1200/PO-24-00859 Gonzalez-Bosquet, J. et al.
Bacterial, Archaea, and Viral Transcripts (BAVT) Expression in Gynecological Cancers and Correlation with Regulatory Regions of the Genome. Cancers (Basel) 13 (2021). https://doi.org:10.3390/cancers13051109 Gonzalez-Bosquet, J. et al. Microbial Communities in Gynecological Cancers and Their Association with Tumor Somatic Variation. Cancers (Basel) 15 (2023). https://doi.org:10.3390/cancers15133316 Madhogaria, B., Bhowmik, P. & Kundu, A. Correlation between human gut microbiome and diseases. Infect Med (Beijing) 1, 180–191 (2022). https://doi.org:10.1016/j.imj.2022.08.004 Aggarwal, N. et al. Microbiome and Human Health: Current Understanding, Engineering, and Enabling Technologies. Chem Rev 123, 31–72 (2023). https://doi.org:10.1021/acs.chemrev.2c00431 Chambers, L. M. et al. The Microbiome and Gynecologic Cancer: Current Evidence and Future Opportunities. Curr Oncol Rep 23, 92 (2021). https://doi.org:10.1007/s11912-021-01079-x Laniewski, P., Ilhan, Z. E. & Herbst-Kralovetz, M. M. The microbiome and gynaecological cancer development, prevention and therapy. Nat Rev Urol 17, 232–250 (2020). https://doi.org:10.1038/s41585-020-0286-z Li, C. et al. Association between vaginal microbiota and the progression of ovarian cancer. J Med Virol 95, e28898 (2023). https://doi.org:10.1002/jmv.28898 Srikummoon, P. et al. The recurrence and mortality risk in Luminal A breast cancer patients who lived in high pollution area. PloS one 20, e0335140 (2025). https://doi.org:10.1371/journal.pone.0335140 Smotherman, C. et al. Association of air pollution with postmenopausal breast cancer risk in UK Biobank. Breast Cancer Res 25, 83 (2023). https://doi.org:10.1186/s13058-023-01681-w Calaf, G. M., Ponce-Cusi, R., Aguayo, F., Munoz, J. P. & Bleak, T. C. Endocrine disruptors from the environment affecting breast cancer. Oncol Lett 20, 19–32 (2020). https://doi.org:10.3892/ol.2020.11566 Dalton, W. S., Sullivan, D., Ecsedy, J. & Caligiuri, M. A. 
Patient Enrichment for Precision-Based Cancer Clinical Trials: Using Prospective Cohort Surveillance as an Approach to Improve Clinical Trials. Clin Pharmacol Ther 104, 23–26 (2018). https://doi.org:10.1002/cpt.1051 Polio, A., Wagner, V., Bender, D. P., Goodheart, M. J. & Gonzalez Bosquet, J. A Natural Language Processing Method Identifies an Association Between Bacterial Communities in the Upper Genital Tract and Ovarian Cancer. Int J Mol Sci 26 (2025). https://doi.org:10.3390/ijms26157432 Haas, B. J. et al. STAR-Fusion: Fast and Accurate Fusion Transcript Detection from RNA-Seq. bioRxiv (2017). Dobin, A. et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29, 15–21 (2013). https://doi.org:10.1093/bioinformatics/bts635 Hoyd, R. et al. Exogenous Sequences in Tumors and Immune Cells (Exotic): A Tool for Estimating the Microbe Abundances in Tumor RNA-seq Data. Cancer Res Commun 3, 2375–2385 (2023). https://doi.org:10.1158/2767-9764.CRC-22-0435 Breitwieser, F. P., Baker, D. N. & Salzberg, S. L. KrakenUniq: confident and fast metagenomics classification using unique k-mer counts. Genome Biol 19, 198 (2018). https://doi.org:10.1186/s13059-018-1568-0 Lu, J., Breitwieser, F. P., Thielen, P. & Salzberg, S. L. Bracken: estimating species abundance in metagenomics data. PeerJ Comput Sci 3 (2017). https://doi.org:10.7717/peerj-cs.104 Cao, J., Xia, T., Li, J., Zhang, Y. & Tang, S. A density-based method for adaptive LDA model selection. Neurocomputing 72, 1775–1781 (2009). https://doi.org:10.1016/j.neucom.2008.06.011 Griffiths, T. L. & Steyvers, M. Finding scientific topics. Proceedings of the National Academy of Sciences 101, 5228–5235 (2004). https://doi.org:10.1073/pnas.0307752101 Ponweiser, M. Latent Dirichlet Allocation in R (WU Vienna University of Economics and Business, Vienna, 2012). Finotello, F. et al. Molecular and pharmacological modulators of the tumor immune contexture revealed by deconvolution of RNA-seq data.
Genome Med 11, 34 (2019). https://doi.org:10.1186/s13073-019-0638-6 Becht, E. et al. Estimating the population abundance of tissue-infiltrating immune and stromal cell populations using gene expression. Genome Biol 17, 218 (2016). https://doi.org:10.1186/s13059-016-1070-5 Bhavnani, S. K. et al. Enabling Comprehension of Patient Subgroups and Characteristics in Large Bipartite Networks: Implications for Precision Medicine. AMIA Jt Summits Transl Sci Proc 2017, 21–29 (2017). Bhavnani, S. K. et al. Subtyping Social Determinants of Health in All of Us: Network Analysis and Visualization Approach. medRxiv (2023). https://doi.org:10.1101/2023.01.27.23285125 Ladbury, C. et al. Utilization of model-agnostic explainable artificial intelligence frameworks in oncology: a narrative review. Transl Cancer Res 11, 3853–3868 (2022). https://doi.org:10.21037/tcr-22-1626 TensorFlow Developers. TensorFlow, <https://doi.org/10.5281/zenodo.5949169> (2022). Mohammad, N., Muad, A. M., Ahmad, R. & Yusof, M. Accuracy of advanced deep learning with tensorflow and keras for classifying teeth developmental stages in digital panoramic imaging. BMC Med Imaging 22, 66 (2022). https://doi.org:10.1186/s12880-022-00794-6 Zheng, W. et al. Gut microbiota and endometrial cancer: research progress on the pathogenesis and application. Ann Med 57, 2451766 (2025). https://doi.org:10.1080/07853890.2025.2451766 Arnone, A. A. & Cook, K. L. Gut and Breast Microbiota as Endocrine Regulators of Hormone Receptor-positive Breast Cancer Risk and Therapy Response. Endocrinology 164 (2022). https://doi.org:10.1210/endocr/bqac177 Bukato, K., Kostrzewa, T., Gammazza, A. M., Gorska-Ponikowska, M. & Sawicki, S. Endogenous estrogen metabolites as oxidative stress mediators and endometrial cancer biomarkers. Cell Commun Signal 22, 205 (2024). https://doi.org:10.1186/s12964-024-01583-0 Bolton, J. L. Quinoids, quinoid radicals, and phenoxyl radicals formed from estrogens and antiestrogens. Toxicology 177, 55–65 (2002).
https://doi.org:10.1016/s0300-483x(02)00195-6 Liang, X. et al. Ozone exposure at environmental level induces female reproductive impairment via transcriptomic and alternative analysis. Ecotoxicol Environ Saf 306, 119276 (2025). https://doi.org:10.1016/j.ecoenv.2025.119276 Rousselle, D. & Silveyra, P. Acute Exposure to Ozone Affects Circulating Estradiol Levels and Gonadotropin Gene Expression in Female Mice. Int J Environ Res Public Health 22 (2025). https://doi.org:10.3390/ijerph22020222 Urbaniak, C. et al. The Microbiota of Breast Tissue and Its Association with Breast Cancer. Appl Environ Microbiol 82, 5039–5048 (2016). https://doi.org:10.1128/AEM.01235-16 Reuter, S., Gupta, S. C., Chaturvedi, M. M. & Aggarwal, B. B. Oxidative stress, inflammation, and cancer: how are they linked? Free Radic Biol Med 49, 1603–1616 (2010). https://doi.org:10.1016/j.freeradbiomed.2010.09.006 Baeza-Noci, J. & Pinto-Bonilla, R. Systemic Review: Ozone: A Potential New Chemotherapy. Int J Mol Sci 22 (2021). https://doi.org:10.3390/ijms222111796 Lunov, O. et al. Cell death induced by ozone and various non-thermal plasmas: therapeutic perspectives and limitations. Sci Rep 4, 7129 (2014). https://doi.org:10.1038/srep07129 Madison, T., Schottenfeld, D., James, S. A., Schwartz, A. G. & Gruber, S. B. Endometrial cancer: socioeconomic status and racial/ethnic differences in stage at diagnosis, treatment, and survival. Am J Public Health 94, 2104–2111 (2004). https://doi.org:10.2105/ajph.94.12.2104 Helpman, L., Pond, G. R., Elit, L., Anderson, L. N. & Seow, H. Endometrial cancer presentation is associated with social determinants of health in a public healthcare system: A population-based cohort study. Gynecologic oncology 158, 130–136 (2020). https://doi.org:10.1016/j.ygyno.2020.04.693 Wei, S., Conner, M. G., Zhang, K., Siegal, G. P. & Novak, L. Juxtatumoral stromal reactions in uterine endometrioid adenocarcinoma and their prognostic significance. Int J Gynecol Pathol 29, 562–567 (2010). 
https://doi.org/10.1097/PGP.0b013e3181e36321 Additional Declarations No competing interests reported. Supplementary Files Supplementarymaterial12726NPJPO.docx SupplementaryTable5.xlsx © Research Square 2026 | ISSN 2693-5015 (online) {"props":{"pageProps":{"initialData":{"identity":"rs-8682460","acceptedTermsAndConditions":true,"allowDirectSubmit":true,"archivedVersions":[],"articleType":"Article","associatedPublications":[],"authors":[{"id":581516769,"identity":"b1a5f0c1-dee2-4e76-bc37-b8a5f543bcea","order_by":0,"name":"Jesus Gonzalez 
Bosquet","email":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAZAAAAAyAQMAAABI0h/eAAAABlBMVEX///8AAABVwtN+AAAACXBIWXMAAA7EAAAOxAGVKw4bAAABOklEQVRIie2QMWvCQBTHXwjY5YprgkW/wkkgUjL0qyQIdbmhUBAHK5nMkg9wQ2m+Qlw633EQl2DXlnawFOziELcMtjSnLalGO3e43/L+PO7HvfcAFIp/iOb/ygyAfecBoG2o/alAqbD0uLJDqfDxT6+q6EGwMHO4aXbqYs6y9UuzFZD22+pOnNWN7hyyvqgMFqZ2A8HUOvcTzClaWDhdWpjfC2TSS6zRWVWhpNbQPhMv5j4WyBBebBDbkAp+drF+Oq4q0bscrFDESSbWWHgRJZ2c30qll+kfBxQKtoFg6MUJwgJc4fmPxAbuS4VgXTughMRyELBiBXTFQyZ3WVwbadJDZrQsOrPevtIOpq9POYya+GE6meebi3Un2WDoXNSR7PSdiuJvys7vbhnZ/vuC1raMjigKhUKhKPgCcd9+2o3qLMUAAAAASUVORK5CYII=","orcid":"","institution":"University of Iowa","correspondingAuthor":true,"prefix":"","firstName":"Jesus","middleName":"Gonzalez","lastName":"Bosquet","suffix":""},{"id":581516770,"identity":"e2cc5088-fb12-4946-9016-4c024fed5044","order_by":1,"name":"Oyomoare Osazuwa-Peters","email":"","orcid":"","institution":"Duke University","correspondingAuthor":false,"prefix":"","firstName":"Oyomoare","middleName":"","lastName":"Osazuwa-Peters","suffix":""},{"id":581516774,"identity":"15ecde25-d343-4c32-88c5-19987d4c9655","order_by":2,"name":"Vincent M. Wagner","email":"","orcid":"","institution":"University of Iowa","correspondingAuthor":false,"prefix":"","firstName":"Vincent","middleName":"M.","lastName":"Wagner","suffix":""},{"id":581516776,"identity":"f9b6ef6f-0557-43ac-8d0e-c99caed19d5a","order_by":3,"name":"Andrew Polio","email":"","orcid":"","institution":"University of Iowa","correspondingAuthor":false,"prefix":"","firstName":"Andrew","middleName":"","lastName":"Polio","suffix":""},{"id":581516778,"identity":"e5edabb0-6d60-44eb-a89b-20d4a665a680","order_by":4,"name":"Rebecca Hoyd","email":"","orcid":"","institution":"The Ohio State University","correspondingAuthor":false,"prefix":"","firstName":"Rebecca","middleName":"","lastName":"Hoyd","suffix":""},{"id":581516779,"identity":"99f29140-d68f-46cd-9ccb-bf9d729268c8","order_by":5,"name":"Ahmad A. 
Tarhini","email":"","orcid":"","institution":"Moffitt Cancer Center","correspondingAuthor":false,"prefix":"","firstName":"Ahmad","middleName":"A.","lastName":"Tarhini","suffix":""},{"id":581516781,"identity":"5828476f-69a7-444c-8570-b513a86f663f","order_by":6,"name":"Casey M. Cosgrove","email":"","orcid":"","institution":"The Ohio State University","correspondingAuthor":false,"prefix":"","firstName":"Casey","middleName":"M.","lastName":"Cosgrove","suffix":""},{"id":581516783,"identity":"67c07a6e-b517-4c92-b250-a003ab5a01f7","order_by":7,"name":"Marilyn S. Huang","email":"","orcid":"","institution":"University of Virginia","correspondingAuthor":false,"prefix":"","firstName":"Marilyn","middleName":"S.","lastName":"Huang","suffix":""},{"id":581516784,"identity":"8c87ade0-0681-4974-8fb9-114695dddd54","order_by":8,"name":"Bradley R. Corr","email":"","orcid":"","institution":"University of Colorado","correspondingAuthor":false,"prefix":"","firstName":"Bradley","middleName":"R.","lastName":"Corr","suffix":""},{"id":581516785,"identity":"3a48be6f-38e3-4547-816e-9b5dbf0b0254","order_by":9,"name":"Aliza L. Leiser","email":"","orcid":"","institution":"Rutgers, The State University of New Jersey","correspondingAuthor":false,"prefix":"","firstName":"Aliza","middleName":"L.","lastName":"Leiser","suffix":""},{"id":581516787,"identity":"2fcec9f6-caf4-428d-bd71-bd62d2e5991d","order_by":10,"name":"Bodour Salhia","email":"","orcid":"","institution":"University of Southern California","correspondingAuthor":false,"prefix":"","firstName":"Bodour","middleName":"","lastName":"Salhia","suffix":""},{"id":581516788,"identity":"f94acc8d-1650-48da-a64c-5cd534cc1b1e","order_by":11,"name":"Kathleen Darcy","email":"","orcid":"","institution":"Walter Reed National Military Medical Center","correspondingAuthor":false,"prefix":"","firstName":"Kathleen","middleName":"","lastName":"Darcy","suffix":""},{"id":581516789,"identity":"c073b716-d4c7-4833-8e32-8a3ee5bb7e1c","order_by":12,"name":"Rob L. 
Dood","email":"","orcid":"","institution":"Huntsman Cancer Institute","correspondingAuthor":false,"prefix":"","firstName":"Rob","middleName":"L.","lastName":"Dood","suffix":""},{"id":581516790,"identity":"5e256157-3681-4dc5-8936-b1424f0a1550","order_by":13,"name":"Lauren E. Dockery","email":"","orcid":"","institution":"University of Oklahoma","correspondingAuthor":false,"prefix":"","firstName":"Lauren","middleName":"E.","lastName":"Dockery","suffix":""},{"id":581516791,"identity":"f107146b-9a57-4dc4-acc7-2fc0a0260d8b","order_by":14,"name":"Michael J. Cavnar","email":"","orcid":"","institution":"University of Kentucky","correspondingAuthor":false,"prefix":"","firstName":"Michael","middleName":"J.","lastName":"Cavnar","suffix":""},{"id":581516792,"identity":"6557a0aa-da20-44f3-ab3b-bb960a8a1591","order_by":15,"name":"Lisa Landrum","email":"","orcid":"","institution":"Indiana University","correspondingAuthor":false,"prefix":"","firstName":"Lisa","middleName":"","lastName":"Landrum","suffix":""},{"id":581516793,"identity":"e72d0228-50fb-46e5-9a40-178f297daeda","order_by":16,"name":"Laura Chambers","email":"","orcid":"","institution":"The Ohio State University","correspondingAuthor":false,"prefix":"","firstName":"Laura","middleName":"","lastName":"Chambers","suffix":""},{"id":581516794,"identity":"df93f926-00f4-46a4-a012-02822fc0bfdd","order_by":17,"name":"Aik Choon Tan","email":"","orcid":"","institution":"Huntsman Cancer Institute","correspondingAuthor":false,"prefix":"","firstName":"Aik","middleName":"Choon","lastName":"Tan","suffix":""},{"id":581516795,"identity":"7ceb9001-f77b-4a4f-959c-05f42851d36c","order_by":18,"name":"Ning Jin","email":"","orcid":"","institution":"The Ohio State University","correspondingAuthor":false,"prefix":"","firstName":"Ning","middleName":"","lastName":"Jin","suffix":""},{"id":581516796,"identity":"a3053dfd-4e07-4e88-98cc-d315eb92af72","order_by":19,"name":"Robert J. 
Rounbehler","email":"","orcid":"","institution":"Aster Insights","correspondingAuthor":false,"prefix":"","firstName":"Robert","middleName":"J.","lastName":"Rounbehler","suffix":""},{"id":581516797,"identity":"c133520f-6ec3-4ca4-b721-5fdbd3ac38eb","order_by":20,"name":"Michelle L. Churchman","email":"","orcid":"","institution":"Aster Insights","correspondingAuthor":false,"prefix":"","firstName":"Michelle","middleName":"L.","lastName":"Churchman","suffix":""},{"id":581516798,"identity":"3ccbac02-4999-403e-8999-c7f0aad847be","order_by":21,"name":"Dan Spakowicz","email":"","orcid":"","institution":"The Ohio State University","correspondingAuthor":false,"prefix":"","firstName":"Dan","middleName":"","lastName":"Spakowicz","suffix":""}],"badges":[],"createdAt":"2026-01-23 21:08:14","currentVersionCode":1,"declarations":"","doi":"10.21203/rs.3.rs-8682460/v1","doiUrl":"https://doi.org/10.21203/rs.3.rs-8682460/v1","draftVersion":[],"editorialEvents":[],"editorialNote":"","failedWorkflow":false,"files":[{"id":101499832,"identity":"093a19f9-2eb4-400a-bbde-e038e072b9eb","added_by":"auto","created_at":"2026-01-30 13:13:05","extension":"png","order_by":1,"title":"Figure 1","display":"","copyAsset":false,"role":"figure","size":545876,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eTopic modelling for microbiome communities: Analysis to identify an optimal and significant topics in EC (left panels): \u003c/strong\u003eWe utilized the \u003cem\u003eFindTopicNumber\u003c/em\u003e function from the \u003cem\u003e\u003cstrong\u003eldatuning\u003c/strong\u003e\u003c/em\u003epackage to identify an optimal latent topic number for our model based on 4 different metrics: minimization for Arun2010 and CaoJuan2009, and maximization for Deveaud2014 and Griffiths2004. For minimization metrics a lower value suggests an optimal topic structure; and for maximization metrics a higher value suggests an optimal topic structure. 
Optimal topic numbers are represented in left panels: \u003cstrong\u003ea)\u003c/strong\u003e Low risk EC; \u003cstrong\u003eb)\u003c/strong\u003e High risk EC; \u003cstrong\u003ec)\u003c/strong\u003e Non-endometrioid EC.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eSelecting topics via Latent Dirichlet Allocation (LDA – Middle and Right panels): \u003c/strong\u003eLDA is a topic modelling method widely used in natural language processing. Differential abundance analysis of the LDA topics was used to identify topics that differed between recurrent and non-recurrent cohorts (middle panels). Selected in blue are those topics with log2 fold changes and false discovery rate (FDR) corrected p-values \u0026lt; 0.05: \u003cstrong\u003ea)\u003c/strong\u003e Low risk EC, topics 4,25,27,29,35,38,50,66; \u003cstrong\u003eb)\u003c/strong\u003e High risk EC, topics 9,40,43,56; \u003cstrong\u003ec)\u003c/strong\u003e Non-endometrioid EC, topics 23,30,46.\u003c/p\u003e\n\u003cp\u003eIn the \u003cstrong\u003eright panel\u003c/strong\u003e we depict the per-topic-species probability matrix to examine which genera have the highest probabilities of assignment to this topic/community: \u003cstrong\u003ea)\u003c/strong\u003e Low risk EC, topic #4; \u003cstrong\u003eb)\u003c/strong\u003e High risk EC, topic 56; \u003cstrong\u003ec)\u003c/strong\u003e Non-endometrioid EC, topic 46.\u003c/p\u003e","description":"","filename":"1.png","url":"https://assets-eu.researchsquare.com/files/rs-8682460/v1/44d622c815921a2676e1748c.png"},{"id":101499799,"identity":"54661a32-e466-40ef-bf77-7fc4f1a1de4a","added_by":"auto","created_at":"2026-01-30 13:12:52","extension":"png","order_by":2,"title":"Figure 2","display":"","copyAsset":false,"role":"figure","size":385193,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eDeconvolution of bulk RNA to assess immune contexture and cancer associated fibroblasts (CAF).\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eA. 
\u003c/strong\u003eDeconvolution of bulk RNAseq from the ORIEN EC sample set with \u003cem\u003equanTIseq\u003c/em\u003e, a computational pipeline that quantifies the tumor immune contexture (the type and density of tumor-infiltrating immune cells) from human RNA-seq data. The proportion of immune cells infiltrating the tumor was 26% on average (range 2-100%). Each type of resulting immune cell is color coded in the annotation side panel: B cells, M1 and M2 macrophages, monocytes, neutrophils, natural killer (NK) cells, non-regulatory CD4+ T cells, CD8+ T cells, T\u003csub\u003ereg\u003c/sub\u003e cells, and myeloid dendritic cells (DC).\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eB. \u003c/strong\u003eHeatmap of Microenvironment Cell Populations (MCP) counter scores, computed as the log2 geometric mean of the set of markers for each immune cell type. Samples (columns) are ordered by tissue type: HGSC or normal tube; the cell types (rows) are: CD3+ T cells, CD8+ T cells, cytotoxic lymphocytes, NK cells, B lymphocytes, cells originating from monocytes (monocytic lineage), myeloid dendritic cells, neutrophils, endothelial cells and fibroblasts.\u003c/p\u003e","description":"","filename":"2.png","url":"https://assets-eu.researchsquare.com/files/rs-8682460/v1/70cf6e00a04dc93842060081.png"},{"id":101499711,"identity":"307ef3d5-5b5c-44cc-ba36-17e5b5dbc47a","added_by":"auto","created_at":"2026-01-30 13:12:40","extension":"png","order_by":3,"title":"Figure 3","display":"","copyAsset":false,"role":"figure","size":277520,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eAir pollutants by county code. \u003c/strong\u003eWe performed a multivariate lasso regression analysis to identify which air pollutants were associated with EC recurrence. 
Those selected were integrated in the final analysis including all significant variables.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eUpper panels\u003c/strong\u003e show tables with odds ratios (OR) of selected air pollutants resulting from the lasso regression analysis for EC recurrence: a) Low risk EC: NO\u003csub\u003e2\u003c/sub\u003e, O\u003csub\u003e3\u003c/sub\u003e, PM\u003csub\u003e10\u003c/sub\u003e, PM\u003csub\u003e2.5\u003c/sub\u003e; b) High risk EC: O\u003csub\u003e3\u003c/sub\u003e, PM\u003csub\u003e10\u003c/sub\u003e, SO\u003csub\u003e2\u003c/sub\u003e; c) Non-endometrioid EC: O\u003csub\u003e3\u003c/sub\u003e, PM\u003csub\u003e10\u003c/sub\u003e, PM\u003csub\u003e2.5\u003c/sub\u003e.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eLower panels\u003c/strong\u003e represent lasso multivariate regression results (glmnet R package): the upper axis represents the number of variables included in the model; the y axis is the performance of the model measured by AUC, with the 95% confidence interval (CI): a) Low risk EC: the best performance of the model is with 5 variables; b) High risk EC: best performance with 4 variables; c) Non-endometrioid EC: best performance with 3 variables. The lower axis is the log of λ, the value used to optimize model construction. We performed 1,000 bootstrap replicates to find the most suitable λ for the model. Underneath each graphic is the performance of the model by the AUC with the 95% CI.\u003c/p\u003e","description":"","filename":"3.png","url":"https://assets-eu.researchsquare.com/files/rs-8682460/v1/c5b10851869945a186ca106f.png"},{"id":101499786,"identity":"c6440b08-71ed-40b8-868c-a01c1549c93a","added_by":"auto","created_at":"2026-01-30 13:12:49","extension":"png","order_by":4,"title":"Figure 4","display":"","copyAsset":false,"role":"figure","size":810047,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eResulting components of the significant topic models from integrated databases. 
\u003c/strong\u003eFor the 3 risk groups, low EC (a), high EC (b), and non-endometrioid EC (c), we represented the components resulting from the topic models analyses with/without environmental factors. In all 3 tables, at the left is the variable type, the first column are the components of the model without environmental variables and the next two with them:\u003c/p\u003e\n\u003cp\u003eClin+Gen+Imm: including clinical, genomic/metagenomic, immune contexture; Clin+Gen+Imm+Pol: including clinical, genomic/metagenomic, immune contexture, and air pollution. Clin+Gen+Imm+Env: including clinical, genomic/metagenomic, immune contexture, and all environmental factors.\u003c/p\u003e","description":"","filename":"4.png","url":"https://assets-eu.researchsquare.com/files/rs-8682460/v1/e953d1b2a26cf8765403eb39.png"},{"id":101499753,"identity":"c47663ea-a8a2-4099-8562-265e62b3d396","added_by":"auto","created_at":"2026-01-30 13:12:41","extension":"png","order_by":5,"title":"Figure 5","display":"","copyAsset":false,"role":"figure","size":404997,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eTraining, validation and testing of prediction models of low-risk EC recurrence using components of significant topics from the LDA analyses done with machine learning (ML - \u003c/strong\u003e\u003cem\u003e\u003cstrong\u003eMatLab\u003c/strong\u003e\u003c/em\u003e\u003cstrong\u003e) and with deep learning (DL - \u003c/strong\u003e\u003cem\u003e\u003cstrong\u003eTensorFlow\u003c/strong\u003e\u003c/em\u003e\u003cstrong\u003e).\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003ea)\u003c/strong\u003e Training, validation and testing of a low-risk EC recurrence model using MatLab (Ensemble, subspace discriminant) and components from the significant topics integrating clinic and microbiome data: the left panel shows the testing ROC curves with AUC, micro-average AUC and macro-average AUC. 
Micro-average takes imbalance into account in the sense that the resulting performance is based on the proportion of every class; the right panel shows the model testing confusion matrix.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eb)\u003c/strong\u003e Training, validation and testing of a low-risk EC recurrence model using \u003cem\u003eTensorFlow\u003c/em\u003e and components from the significant topics integrating clinic and microbiome data: the left panel shows the model testing confusion matrix; the right panel shows training and testing ROC curves with AUC.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003ec) \u003c/strong\u003eTraining, validation and testing of a low-risk EC recurrence model using MatLab (Ensemble, subspace KNN) and components from the significant topics integrating clinic, microbiome, genomic and cell immunocompetent infiltration data: the left panel shows the testing ROC curves with AUC, micro-average AUC and macro-average AUC; the right panel shows the model testing confusion matrix.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003ed)\u003c/strong\u003e Training, validation and testing of a low-risk EC recurrence model using \u003cem\u003eTensorFlow\u003c/em\u003e and components from the significant topics integrating clinic, microbiome, genomic and cell immunocompetent infiltration data: the left panel shows the model testing confusion matrix; the right panel shows training and testing ROC curves with AUC.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003ee) \u003c/strong\u003eTraining, validation and testing of a low-risk EC recurrence model using MatLab (Fine tree) and components from the significant topics integrating clinic, microbiome, genomic, cell immunocompetent infiltration, and air pollution data: the left panel shows the testing ROC curves with AUC, micro-average AUC and macro-average AUC; the right panel shows the model testing confusion matrix.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003ef)\u003c/strong\u003e Training, validation 
and testing of a low-risk EC recurrence model using \u003cem\u003eTensorFlow\u003c/em\u003e and components from the significant topics integrating clinic, microbiome, genomic, cell immunocompetent infiltration, and air pollution data: the left panel shows the model testing confusion matrix; the right panel shows training and testing ROC curves with AUC.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eg) \u003c/strong\u003eTraining, validation and testing of a low-risk EC recurrence model using MatLab (Binary GLM logistic regression) and components from the significant topics integrating clinic, microbiome, genomic, cell immunocompetent infiltration, and all social and environmental determinants of health data: the left panel shows the testing ROC curves with AUC, micro-average AUC and macro-average AUC; the right panel shows the model testing confusion matrix.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eh)\u003c/strong\u003e Training, validation and testing of a low-risk EC recurrence model using \u003cem\u003eTensorFlow\u003c/em\u003e and components from the significant topics integrating clinic, microbiome, genomic, cell immunocompetent infiltration, and all environmental data: the left panel shows the model testing confusion matrix; the right panel shows training and testing ROC curves with AUC.\u003c/p\u003e","description":"","filename":"5.png","url":"https://assets-eu.researchsquare.com/files/rs-8682460/v1/ba39fe6e34fdba0726112ff1.png"},{"id":101499839,"identity":"7853506a-ddb2-4f11-b107-0c8b047907c1","added_by":"auto","created_at":"2026-01-30 13:13:07","extension":"png","order_by":6,"title":"Figure 6","display":"","copyAsset":false,"role":"figure","size":739909,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eTCGA validation of integration of clinical, genome, bacteriome, immunocompetent cells proportions with topic modelling.\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eAnalysis to identify optimal and significant topics in EC 
(\u003cstrong\u003eleft panels\u003c/strong\u003e): To identify an optimal latent topic number for our model based on 4 different metrics: minimization for Arun2010 and CaoJuan2009, and maximization for Deveaud2014 and Griffiths2004. Optimal topic numbers are represented in left panels: \u003cstrong\u003ea)\u003c/strong\u003e Low risk EC; \u003cstrong\u003eb)\u003c/strong\u003e High risk EC; \u003cstrong\u003ec)\u003c/strong\u003e Non-endometrioid EC.\u003c/p\u003e\n\u003cp\u003eSelecting topics via LDA (\u003cstrong\u003eMiddle and Right panels\u003c/strong\u003e): To identify differentially abundant topics between recurrent and non-recurrent cohorts (\u003cstrong\u003emiddle panels\u003c/strong\u003e) we used LDA analysis. Selected in blue are those topics with negative log2 fold changes and FDR corrected p-values \u0026lt; 0.05: \u003cstrong\u003ea)\u003c/strong\u003e Low risk EC, topic 13; \u003cstrong\u003eb)\u003c/strong\u003e High risk EC, topics 10,16; \u003cstrong\u003ec)\u003c/strong\u003e Non-endometrioid EC, topic 1,5.\u003c/p\u003e\n\u003cp\u003eIn the \u003cstrong\u003eright panel\u003c/strong\u003e we depict per-topic-species probabilities matrix to examine which genus have the highest probabilities of assignment to this topic/community: \u003cstrong\u003ea)\u003c/strong\u003e Low risk EC, topic #13; \u003cstrong\u003eb)\u003c/strong\u003e High risk EC, topic #10 (with highest and lowest log2 fold changes); \u003cstrong\u003ec)\u003c/strong\u003e Non-endometrioid EC, topic #1.\u003c/p\u003e\n\u003cp\u003eGiven the incomplete overlap of variables between datasets, we retrained recurrence prediction models in the ORIEN cohort using only features available in TCGA to enable external validation. As in prior analyses, models were trained and validated using \u003cem\u003eMATLAB\u003c/em\u003e (ML) and \u003cem\u003eTensorFlow\u003c/em\u003e (DL) approaches, with oversampling applied to address class imbalance. 
The TCGA cohort was then used as an independent external test set (\u003cstrong\u003eFigure 7\u003c/strong\u003e). Although overall model performance was reduced relative to internal testing (\u003cstrong\u003eFigure 5\u003c/strong\u003e), reflecting the loss of informative variables, the AUCs obtained in TCGA testing fell within the 95% confidence intervals of the newly trained ORIEN models, indicating no statistically meaningful performance degradation.\u003c/p\u003e","description":"","filename":"6.png","url":"https://assets-eu.researchsquare.com/files/rs-8682460/v1/dc6dcac8f899f09589a62e9d.png"},{"id":101499795,"identity":"021bb014-31a7-4308-aca2-b7927119e805","added_by":"auto","created_at":"2026-01-30 13:12:50","extension":"png","order_by":7,"title":"Figure 7","display":"","copyAsset":false,"role":"figure","size":358324,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eTraining and validation of prediction models of EC recurrence using components of significant topics from the ORIEN analyses tested in TCGA dataset.\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003ea)\u003c/strong\u003e Training and validation of a low-risk EC recurrence model in ORIEN database using \u003cem\u003eMatLab\u003c/em\u003e (fine tree) and components from the significant topics microbiome, genome and immunocompetent cell invasion data (\u003cstrong\u003eleft panel\u003c/strong\u003e): ROC curve AUC 0.68 (95% CI 0.55-0.81) and testing the same model in TCGA low-risk database (\u003cstrong\u003eright panel\u003c/strong\u003e): ROC curve micro AUC 0.73 (95% CI 0.65-0.81).\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eb)\u003c/strong\u003e Training and validation of a low-risk EC recurrence model in ORIEN database using \u003cem\u003eTensorFlow\u003c/em\u003e (modified tutorial: Classification on imbalanced data) components from the significant topics microbiome, genome and immunocompetent cell invasion data (\u003cstrong\u003eleft 
panel\u003c/strong\u003e): model testing confusion matrix; and testing the same model in TCGA low-risk database (\u003cstrong\u003eright panel\u003c/strong\u003e): training and testing ROC curves (0.81 [0.70-0.91] and 0.52 [0.30-0.76]).\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003ec)\u003c/strong\u003e Training and validation of a high-risk EC recurrence model in ORIEN database using \u003cem\u003eMatLab\u003c/em\u003e (efficient logistic regression) and components from the significant topics microbiome, genome and immunocompetent cell invasion data (\u003cstrong\u003eleft panel\u003c/strong\u003e): ROC curve micro-AUC 0.79 (95% CI 0.69-0.88) and testing the same model in TCGA high-risk database (\u003cstrong\u003eright panel\u003c/strong\u003e): ROC curve micro-AUC 0.65 (95% CI 0.57-0.72).\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003ed)\u003c/strong\u003e Training and validation of a high-risk EC recurrence model in ORIEN database using \u003cem\u003eTensorFlow\u003c/em\u003e and components from the significant topics microbiome, genome and immunocompetent cell invasion data (\u003cstrong\u003eleft panel\u003c/strong\u003e): model testing confusion matrix; and testing the same model in TCGA high-risk database (\u003cstrong\u003eright panel\u003c/strong\u003e): training and testing ROC curves (0.79 [0.69-0.86] and 0.53 [0.36-0.70]).\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003ee)\u003c/strong\u003e Training and validation of a non-endometrioid EC recurrence model in ORIEN database using \u003cem\u003eMatLab\u003c/em\u003e (fine KNN) and components from the significant topics microbiome, genome and immunocompetent cell invasion data (\u003cstrong\u003eleft panel\u003c/strong\u003e): ROC curve micro-AUC 0.62 (95% CI 0.49-0.75) and testing the same model in TCGA non-endometrioid database (\u003cstrong\u003eright panel\u003c/strong\u003e): ROC curve micro-AUC 0.58 (95% CI 0.45-0.70).\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003ef)\u003c/strong\u003e Training and validation of a non-endometrioid EC 
recurrence model in ORIEN database using \u003cem\u003eTensorFlow\u003c/em\u003e components from the significant topics microbiome, genome and immunocompetent cell invasion data \u003cstrong\u003e(left panel\u003c/strong\u003e): model testing confusion matrix; and testing the same model in TCGA non-endometrioid database (\u003cstrong\u003eright panel\u003c/strong\u003e): training and testing ROC curves (0.78 [0.61-0.94] and 0.54 [0.37-0.71]).\u003c/p\u003e","description":"","filename":"7.png","url":"https://assets-eu.researchsquare.com/files/rs-8682460/v1/b91213121dda4c708f0022e9.png"},{"id":102747889,"identity":"4104bb68-519c-4883-8870-4f8652a71f51","added_by":"auto","created_at":"2026-02-16 09:05:34","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":5045318,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-8682460/v1/c84b3c44-c701-4404-8a91-dce9f0c715d8.pdf"},{"id":101499797,"identity":"79a66739-480d-41d7-a2be-28909787e615","added_by":"auto","created_at":"2026-01-30 13:12:51","extension":"docx","order_by":0,"title":"","display":"","copyAsset":false,"role":"supplement","size":5294742,"visible":true,"origin":"","legend":"","description":"","filename":"Supplementarymaterial12726NPJPO.docx","url":"https://assets-eu.researchsquare.com/files/rs-8682460/v1/c5844030242f6e597a5e51a7.docx"},{"id":101499791,"identity":"772781c4-b423-4878-a9ed-610027819287","added_by":"auto","created_at":"2026-01-30 13:12:50","extension":"xlsx","order_by":1,"title":"","display":"","copyAsset":false,"role":"supplement","size":19280,"visible":true,"origin":"","legend":"","description":"","filename":"SupplementaryTable5.xlsx","url":"https://assets-eu.researchsquare.com/files/rs-8682460/v1/03beea09a39784cc2b70666f.xlsx"}],"financialInterests":"No competing interests reported.","formattedTitle":"Intrinsic tumor factors and extrinsic environmental and social 
exposures contribute to endometrial cancer recurrence patterns ","fulltext":[{"header":"BACKGROUND","content":"\u003cp\u003eThe incidence and mortality of endometrial cancer (EC) continue to rise\u003csup\u003e\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e\u003c/sup\u003e with a projected mortality increase of 55% by 2030.\u003csup\u003e\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e\u003c/sup\u003e These discouraging outcomes are due in part to the persistence of treatment failures, despite the recent introduction of immunotherapy and targeted therapies for this disease with notable successes (RUBY, GY-018, DUO-E).\u003csup\u003e3\u0026ndash;5\u003c/sup\u003e Though non-endometrioid EC types account for a disproportionately high number of EC recurrences and cancer-related deaths,\u003csup\u003e6\u003c/sup\u003e the majority of treatment failures and recurrences occur in endometrioid EC, with disease recurring in approximately 10\u0026ndash;15% of patients with early-stage EC.\u003csup\u003e6,7\u003c/sup\u003e\u003c/p\u003e \u003cp\u003eIn a previous study, we trained, validated and tested models of EC recurrence integrating clinical, genomic and pathological data from the Oncology Research Information Exchange Network (ORIEN).\u003csup\u003e\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e\u003c/sup\u003e The models were stratified into low-risk (FIGO grade 1 and 2, stage I; N\u0026thinsp;=\u0026thinsp;329), high-risk (FIGO grade 3 or stages II, III, IV; N\u0026thinsp;=\u0026thinsp;324), and non-endometrioid histology (N\u0026thinsp;=\u0026thinsp;239) groups. This study resulted in validated high-performing prediction models, with area under the curve (AUC) performance over 0.9\u0026ndash;0.95 for all 3 risk groups. 
While these models demonstrated excellent discrimination, they may not fully capture the biological complexity and environmental heterogeneity that influence EC recurrence across diverse patient populations. To further improve the discriminatory accuracy and generalizability of these models, we hypothesized that intrinsic tumor microenvironmental (TME) variables and extrinsic environmental variables, considered alongside clinical, pathologic and genomic features, may modify the risk of EC recurrence geographically.\u003c/p\u003e \u003cp\u003eIn preliminary data, we observed that the microbiome is associated with female genital tract cancers, specifically with EC, and may interact differently with tumors with different mutation signatures.\u003csup\u003e\u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e,\u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e10\u003c/span\u003e\u003c/sup\u003e The human microbiome is a symbiotic community of bacteria, fungi, and viruses that live on or within the human body with specific functions, properties, and interactions within its environment.\u003csup\u003e\u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e,\u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e12\u003c/span\u003e\u003c/sup\u003e Bacterial communities may influence the risk of EC recurrence through modulation of the local immune response, epigenetic changes, or modulation of the TME.\u003csup\u003e\u003cspan additionalcitationids=\"CR14\" citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e\u003c/sup\u003e Additionally, other environmental factors, like air pollution, have been associated with incidence and recurrence in hormone-related cancers, like breast cancer,\u003csup\u003e16\u003c/sup\u003e and may act as xenoestrogens or anti-androgens, inducing oxidative stress, DNA damage, epigenetic changes, and 
chronic inflammation in hormone-sensitive tissues.\u003csup\u003e\u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e,\u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e\u003c/sup\u003e\u003c/p\u003e \u003cp\u003eThe objective of this study was to assess the differences in EC recurrence risk when accounting for TME factors, like tumor-associated microbiome and immune cell infiltration, and extrinsic environmental factors, like air pollution and environmental determinants of health. Then, we assessed the predictive accuracy of these intrinsic and extrinsic variables for EC recurrence. Performance of the prediction models was externally validated using the Cancer Genome Atlas (TCGA) EC datasets.\u003c/p\u003e"},{"header":"METHODS","content":"\u003cdiv id=\"Sec3\"\u003e\n \u003ch2\u003eStudy design:\u003c/h2\u003e\n \u003cp\u003eWe performed a retrospective, multi-institution, case\u0026ndash;control study with data originating from the ORIEN network EC dataset. ORIEN comprises multiple cancer centers that have agreed to use the same Institutional Review Board (IRB)-approved protocol and consent (Total Cancer Care Protocol, TCC) to follow patients throughout their lifetime.\u003csup\u003e\u003cspan\u003e19\u003c/span\u003e\u003c/sup\u003e A copy of the protocol is included in \u003cstrong\u003eSupplementary Material\u003c/strong\u003e. Patients consent to donate medical records and tissue specimens for molecular profiling, as an approach to improve design and performance of personalized cancer care. 
RNA and DNA were extracted from tumor specimens and processed to obtain the necessary genomic/metagenomic data, as specified previously.\u003csup\u003e\u003cspan\u003e8\u003c/span\u003e\u003c/sup\u003e The study analysis was carried out in several steps: 1) \u003cem\u003eStep 1\u003c/em\u003e: selection of models and variables included in the preliminary study of EC prediction of recurrence that included clinical, genomic and pathologic data;\u003csup\u003e8\u003c/sup\u003e 2) \u003cem\u003eStep 2\u003c/em\u003e: extraction and curating of microbiome data (at the taxa level of genus) from RNA sequencing (RNAseq) experiments using \u003cem\u003eKraken2\u003c/em\u003e, \u003cem\u003eBracken\u003c/em\u003e and \u003cem\u003eexotic\u003c/em\u003e software packages; 3) \u003cem\u003eStep 3\u003c/em\u003e: using topic modelling, as described previously,\u003csup\u003e20\u003c/sup\u003e to determine microbiomes (genus taxa) associated with EC recurrence by risk groups; 4) \u003cem\u003eStep 4\u003c/em\u003e: determine social and environmental determinants of health associated with EC recurrence; 5) \u003cem\u003eStep 5\u003c/em\u003e: integration of significant genomic, microbiome and environmental factors (resulting from previous steps) with topic modelling, to identify those factors associated with EC recurrence by risk groups; 6) \u003cem\u003eStep 6\u003c/em\u003e: assessment of how these variables from significant topics associated with EC recurrence performed as prediction models of recurrence using machine and deep learning analytics (ML and DL) with \u003cem\u003eMATLAB\u003c/em\u003e apps and \u003cem\u003eTensorFlow\u003c/em\u003e. 
Finally, these steps, with integration of elements of EC recurrence and EC prediction modelling, were externally tested (validated) in an independent EC dataset, TCGA.\u003c/p\u003e\n\u003c/div\u003e\n\u003ch3\u003ePatient inclusion, clinical, pathological and genomic data:\u003c/h3\u003e\n\u003cp\u003ePatient inclusion in risk groups, the clinical and pathological data included, and genomic data extraction, processing and analysis (Step 1 of the study design) are detailed in a previous publication.\u003csup\u003e\u003cspan\u003e8\u003c/span\u003e\u003c/sup\u003e Briefly, we included all patients with EC in the ORIEN database, including all histologies, who had information about recurrent disease. Patients with EC recurrence (or \u003cstrong\u003ecases\u003c/strong\u003e) were those in whom EC reappeared, either locally (vaginal), regionally (pelvis) or distantly, after completion of treatment with no evidence of disease (NED). Index cases included women with a new EC event after treatment, those who had cancer at the last surveillance visit, and those who died from cancer. \u003cstrong\u003eControls\u003c/strong\u003e were patients with NED during the whole follow-up. A total of 892 women with EC who had RNA and DNA sequenced and had recurrence information were included in this analysis: 186 with EC recurrence (cases) and 706 without (controls) (\u003cstrong\u003eSupplementary Table\u0026nbsp;1\u003c/strong\u003e, also in Gonzalez Bosquet J., et al.\u003csup\u003e8\u003c/sup\u003e). Included patients entered the ORIEN database between 2004 and 2021. Patients with 2009 FIGO stage I and histological grade 1 or 2 endometrioid EC had an overall recurrence rate of 11.6% (38/329) and were considered low-risk for recurrence. Patients with a histological grade 3 endometrioid EC or with FIGO stage II-IV had an overall recurrence rate of 21.3% (69/324) and were considered high-risk for recurrence. 
Patients with non-endometrioid type EC (serous, carcinosarcoma, clear cell, undifferentiated, mixed) had an overall recurrence rate of 33.1% (79/239) and were considered at even higher risk for recurrence.\u003c/p\u003e\n\u003cp\u003eBaseline variables were collected after surgery, when histologic type, FIGO stage and other clinical and demographic characteristics were known. Resulting models and variables included in the preliminary study of EC recurrence prediction, which included clinical, genomic and pathologic data,\u003csup\u003e8\u003c/sup\u003e were selected and incorporated into the integrated dataset to be analyzed with topic modelling (\u003cstrong\u003eSupplementary Table\u0026nbsp;2\u003c/strong\u003e, also in Gonzalez Bosquet J., et al.\u003csup\u003e8\u003c/sup\u003e).\u003c/p\u003e\n\u003ch3\u003eTumor microenvironment (TME) data:\u003c/h3\u003e\n\u003cp\u003e\u003cspan type=\"ItalicUnderline\" name=\"Emphasis\"\u003eMicrobiome data.\u003c/span\u003e \u003cem\u003eData preprocessing\u003c/em\u003e: CRAM files were downloaded from the ORIEN server and then converted into BAM files with \u003cem\u003esamtools\u003c/em\u003e for further analysis. Analyses were performed as outlined by the NCI Genomic Data Commons (GDC - \u003cspan\u003e\u003cspan\u003ehttps://docs.gdc.cancer.gov/Data/Introduction/\u003c/span\u003e\u003c/span\u003e). 
The \u003cem\u003eSTAR\u003c/em\u003e suite (including \u003cem\u003eSTAR-Fusion\u003c/em\u003e) was used to align the transcriptome to the genome assembly version CHM13 T2T.\u003csup\u003e\u003cspan\u003e21\u003c/span\u003e,\u003cspan\u003e22\u003c/span\u003e\u003c/sup\u003e We used the \u003cem\u003eexotic\u003c/em\u003e pipeline to broadly but conservatively identify microbes present in the tumors while removing technical artifacts and contaminants from the dataset (Step 2 of the study design).\u003csup\u003e\u003cspan\u003e23\u003c/span\u003e\u003c/sup\u003e First, \u003cem\u003eexotic\u003c/em\u003e maps raw reads with quality scores (FASTQs) to the human reference genome, with a second alignment pass following the standard workflow of TCGA and other large-scale sequencing efforts. Next, \u003cem\u003eexotic\u003c/em\u003e aligns the unmapped reads to a wide range of non-human genomes, including bacteria, archaea, viruses, fungi, and a subset of other eukaryotes using the \u003cem\u003eKrakenUniq\u003c/em\u003e option from the \u003cem\u003eKraken2\u003c/em\u003e pipeline.\u003csup\u003e\u003cspan\u003e24\u003c/span\u003e\u003c/sup\u003e Then, it uses \u003cem\u003eBracken\u003c/em\u003e to estimate abundance at the genus level from the resulting \u003cem\u003eKrakenUniq\u003c/em\u003e classification.\u003csup\u003e\u003cspan\u003e25\u003c/span\u003e\u003c/sup\u003e Next, \u003cem\u003eexotic\u003c/em\u003e filters contaminants in two phases: statistical filtering and literature matching.\u003csup\u003e\u003cspan\u003e23\u003c/span\u003e\u003c/sup\u003e Finally, the outputs are normalized to remove technical artifacts. 
In summary, the statistical filtering step of \u003cem\u003eexotic\u003c/em\u003e discards a small fraction of the reads, although these reads represent a large fraction of the total microbes, whereas the literature-based filtering removes a large fraction of the reads but relatively few taxa.\u003c/p\u003e\n\u003cp\u003e\u003cem\u003eData analysis\u003c/em\u003e: Topic modeling with Latent Dirichlet Allocation (LDA)\u003csup\u003e\u003cspan\u003e26\u003c/span\u003e,\u003cspan\u003e27\u003c/span\u003e\u003c/sup\u003e was used to assess changes in microbial communities between samples (Step 3 of the study design): 1) first, the \u003cem\u003eldatuning\u003c/em\u003e method determined the optimal number of latent topics for the analysis; 2) then, we used \u003cem\u003etopicmodels\u003c/em\u003e to evaluate differences in microbial communities by examining topic distributions (both R packages).\u003csup\u003e\u003cspan\u003e28\u003c/span\u003e\u003c/sup\u003e Differences between the two groups (controls vs cases) were considered statistically significant at false discovery rate (FDR)-adjusted p-values\u0026thinsp;\u0026lt;\u0026thinsp;0.05. Topic modeling, a natural language processing (NLP) technique, allows us to assess how microbiome communities differ quantitatively and in their composition. 
By treating microbial communities as \u0026ldquo;topics\u0026rdquo;, much as words cluster in textual data, we were able to model the high-dimensional interactions between different genera and identify potentially meaningful patterns and associations with EC recurrence.\u003c/p\u003e\n\u003cp\u003eAgain, variables (genera) included in the resulting models were selected and incorporated into the integrated dataset to be analyzed with topic modelling.\u003c/p\u003e\n\u003cp\u003e\u003cspan type=\"ItalicUnderline\" name=\"Emphasis\"\u003eTumor immune environment.\u003c/span\u003e To evaluate the tumor microenvironment and the immune response induced by the tumor, we assessed the immune contexture (or the type of tumor-infiltrating immune cells)\u003csup\u003e\u003cspan\u003e29\u003c/span\u003e\u003c/sup\u003e and the cancer-associated fibroblasts (CAF). This evaluation can be informative about the types of inflammatory, angiogenic, and desmoplastic reactions occurring in a tumor.\u003c/p\u003e\n\u003cp\u003eWe measured the immune contexture with \u003cem\u003equanTIseq\u003c/em\u003e, a computational pipeline that analyzes bulk RNAseq data using a novel deconvolution approach.\u003csup\u003e\u003cspan\u003e29\u003c/span\u003e\u003c/sup\u003e We used RNAseq data resulting from previous steps.\u003c/p\u003e\n\u003cp\u003eWe used the Microenvironment Cell Populations (MCP)-counter, a transcriptome-based computational method that quantifies the abundance of tissue-infiltrating immune and non-immune stromal cell populations in non-hematopoietic human tumors.\u003csup\u003e\u003cspan\u003e30\u003c/span\u003e\u003c/sup\u003e This method also uses the gene expression matrix resulting from RNAseq to determine the abundance score for CD3+ T cells, CD8+ T cells, cytotoxic lymphocytes, NK cells, B lymphocytes, cells originating from monocytes (monocytic lineage), myeloid dendritic cells, neutrophils, as 
well as endothelial cells and fibroblasts.\u003c/p\u003e\n\u003ch3\u003eSocial and Environmental data:\u003c/h3\u003e\n\u003cp\u003e\u003cspan type=\"ItalicUnderline\" name=\"Emphasis\"\u003eEnvironmental variables.\u003c/span\u003e Air pollution data for the year 2010 were obtained at the county level for four gases (O\u003csub\u003e3\u003c/sub\u003e, CO, SO\u003csub\u003e2\u003c/sub\u003e, NO\u003csub\u003e2\u003c/sub\u003e), and two aerosols (PM\u003csub\u003e10\u003c/sub\u003e, PM\u003csub\u003e2.5\u003c/sub\u003e), from the Center for Air, Climate and Energy Solutions (CACES; \u003cspan\u003e\u003cspan\u003ehttps://www.caces.us/data\u003c/span\u003e\u003c/span\u003e). These air pollution data were linked with ORIEN data for the EC study cohort by Aster Insights collaborators, who handle data pulls, by cross-referencing unique county identifiers in the air pollution data with five-digit zip codes for each patient in the EC study cohort. Air pollution data for each eligible patient were provided with the county code only, to prevent identification of individual patients. A county code was not available for all patients.\u003c/p\u003e\n\u003cp\u003e\u003cspan type=\"ItalicUnderline\" name=\"Emphasis\"\u003eSocial and environmental determinants of health.\u003c/span\u003e Social and environmental determinants of health were derived from the Centers for Disease Control and Prevention\u0026rsquo;s Environmental Justice Index (EJI) dataset. The EJI is a nationwide, place-based index designed to capture cumulative health impacts from environmental and social burdens at the census tract level. 
It comprises 36 indicators organized into 10 domains\u0026mdash;Racial/Ethnic Minority Status, Socioeconomic Status, Household Characteristics, Housing Type, Air Pollution, Potentially Hazardous and Toxic Sites, Built Environment, Transportation Infrastructure, Water Pollution, and Preexisting Chronic Disease Burden\u0026mdash;and grouped into three overarching modules: Social Vulnerability, Environmental Burden, and Health Vulnerability.\u003c/p\u003e\n\u003cp\u003eFor this study, we extracted the percentile rank scores for each of the 10 domains from the EJI dataset, which was downloaded from the Agency for Toxic Substances and Disease Registry website. In addition, food access data were obtained from the United States Department of Agriculture\u0026rsquo;s Food Access Research Atlas, specifically the variable low access tract at 1 mile for urban areas or 10 miles for rural areas. This variable is defined as \u0026ldquo;a low-income tract with at least 500 people or 33% of the population living more than 1 mile (urban) or more than 10 miles (rural) from the nearest supermarket, supercenter, or large grocery store.\u0026rdquo; These census tract\u0026ndash;level social and environmental determinants were linked to patient-level data using the county code corresponding to each census tract as the unique identifier. For counties containing multiple census tracts, data were summarized using a weighted mean, with weights based on census tract population size. Additional details on all variables used to capture social and environmental determinants of health are provided in \u003cstrong\u003eSupplementary Table\u0026nbsp;3\u003c/strong\u003e.\u003c/p\u003e\n\u003cp\u003e\u003cspan type=\"ItalicUnderline\" name=\"Emphasis\"\u003eData analysis.\u003c/span\u003e We used bipartite network analysis to identify clusters (subtypes) of both patients and social and environmental determinants of health. 
Bipartite network analysis takes input data at the county-code level and outputs a quantitative summary (number, size, and statistical significance) along with a network visualization of the identified clusters.\u003csup\u003e\u003cspan\u003e31\u003c/span\u003e\u003c/sup\u003e Statistical significance was assessed by comparing the observed value to a null distribution generated from 1,000 random permutations of the network.\u003csup\u003e\u003cspan\u003e32\u003c/span\u003e\u003c/sup\u003e Compared to traditional clustering methods such as hierarchical clustering or principal component analysis, bipartite networks offer two key advantages: (1) they operate autonomously without requiring user-defined parameters, and (2) they define clusters that include both patients and variables.\u003csup\u003e\u003cspan\u003e32\u003c/span\u003e\u003c/sup\u003e We used bipartite networks to detect clusters among air pollution variables and social determinants of health, and to assess associations between cluster membership and recurrence of disease. Because the bipartite network analysis separated air pollution data from social determinants of health, we performed a multivariate \u003cem\u003elasso\u003c/em\u003e regression of EC recurrence for each domain and selected the variables that were most informative for EC recurrence prediction. 
The selected variables from each domain were then incorporated into the integrated dataset to be analyzed separately with topic modelling.\u003c/p\u003e\n\u003ch3\u003eIntegration of resulting models:\u003c/h3\u003e\n\u003cp\u003eAll elements significant in the clinical, pathological, genomic, microbiological, and environmental analyses were added to integrated databases and analyzed with topic modelling to assess patterns and associations between different data types and EC recurrence (Step 5 of the study design). 
Because environmental and social variables were less available in the dataset, and separated by bipartite networks, models were built with and without them.\u003c/p\u003e\n\u003cdiv id=\"Sec8\"\u003e\n \u003ch2\u003e\u003cem\u003eTraining, validating and testing EC recurrence models\u003c/em\u003e:\u003c/h2\u003e\n \u003cp\u003eFinally, we trained, validated and tested models with the integrated datasets that included all selected variables resulting from topic modelling (Step 6 of the study design). For prediction modelling we used \u003cem\u003elasso\u003c/em\u003e regression, other machine learning (ML) methods included in \u003cem\u003eMATLAB\u003c/em\u003e apps, and deep learning (DL) with \u003cem\u003eTensorFlow\u003c/em\u003e analytics. Briefly, for \u003cem\u003eMATLAB\u003c/em\u003e analysis, we used 10-fold cross-validation for training and held out 20% of EC samples for testing, using 35 different ML methods on the ORIEN dataset. The best models were selected for reporting. Model explanation was performed on training and testing models using Shapley values.\u003csup\u003e\u003cspan\u003e33\u003c/span\u003e\u003c/sup\u003e In the context of machine learning prediction, the Shapley value of a feature for a query point explains the contribution of the feature to a prediction (score of each class for classification) at the specified query point. We used the Shapley values of predictors to interpret which predictors have the largest (or smallest) average impact on model output magnitude.\u003c/p\u003e\n \u003cp\u003eAdditionally, we used \u003cem\u003eTensorFlow\u003c/em\u003e\u003csup\u003e\u003cspan\u003e34\u003c/span\u003e\u003c/sup\u003e in a \u003cem\u003eJupyter\u003c/em\u003e notebook with a \u003cem\u003eKeras\u003c/em\u003e application programming interface (API)\u003csup\u003e\u003cspan\u003e35\u003c/span\u003e\u003c/sup\u003e as the DL method. 
This is a modification of the \u003cem\u003eTensorFlow\u003c/em\u003e core tutorial \u0026lsquo;Classification of imbalanced data\u0026rsquo; (\u003cspan\u003e\u003cspan\u003ewww.tensorflow.org/tutorials/structured_data/imbalanced_data\u003c/span\u003e\u003c/span\u003e). Normalization of the data was performed using the \u003cem\u003esklearn StandardScaler\u003c/em\u003e. Models had 16 layers, with a dropout layer to reduce overfitting, and an output sigmoid layer that returns the probability of EC recurrence. The input layer of each model contained as many nodes as features to analyze. Training was performed to account for weights of the outcomes as well as for unbalanced data using oversampling methods. Validation was done using 15% of samples, and 25% of samples were kept for testing the models.\u003c/p\u003e\n\u003c/div\u003e\n\u003ch3\u003eValidation in TCGA EC data:\u003c/h3\u003e\n\u003cp\u003e\u003cspan type=\"ItalicUnderline\" name=\"Emphasis\"\u003eData preprocessing.\u003c/span\u003e TCGA BAM files were initially converted to FASTQ files with the \u003cem\u003esamtools\u003c/em\u003e pipeline. Then, the rest of the genomic and microbiome extraction was performed as detailed for the ORIEN dataset.\u003c/p\u003e\n\u003cp\u003e\u003cspan type=\"ItalicUnderline\" name=\"Emphasis\"\u003eValidation analysis.\u003c/span\u003e Validation was performed using the TCGA EC dataset, which included endometrioid and serous EC (TCGA-UCEC)\u003csup\u003e\u003cspan\u003e8\u003c/span\u003e\u003c/sup\u003e and endometrial carcinosarcoma (TCGA-UCS). Briefly, after permission was granted to access controlled data by the Genomic Data Commons (GDC) Data Portal (dbGaP#29868), TCGA-UCEC RNAseq (406 endometrioid and 136 serous EC) and TCGA-UCS RNAseq (56 endometrial carcinosarcomas) files in BAM format were downloaded from women with EC. Main clinical characteristics are described in \u003cstrong\u003eSupplementary Table\u0026nbsp;4\u003c/strong\u003e. 
Of note, non-endometrioid cases in TCGA did not include any clear cell, undifferentiated, or dedifferentiated carcinomas. For validation we used only those significant variables resulting from topic modelling that were selected and included in the integrated dataset (Step 5 of the study design). We used the TCGA dataset first to externally validate the models associated with EC recurrence that included clinical, genomic and microbiome data. County codes were not available for TCGA patients, so we were not able to link metagenomic/genomic data with environmental and social determinants of health. Additionally, we used TCGA datasets for external testing of the best prediction models for EC recurrence trained in the ORIEN set. The best prediction models of EC recurrence were tested with ML (\u003cem\u003eMATLAB\u003c/em\u003e) and with DL (\u003cem\u003eTensorFlow\u003c/em\u003e), using TCGA data as the testing set. Survival analyses with Cox proportional hazards models and Kaplan-Meier survival curves were performed in R with the \u003cem\u003esurvival\u003c/em\u003e and \u003cem\u003eggsurvfit\u003c/em\u003e packages.\u003c/p\u003e"},{"header":"RESULTS","content":"\u003cp\u003e\u003cstrong\u003eTumor-associated microbiome communities associated with EC recurrence:\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eFirst, we identified the optimal number of latent topics for each EC recurrence risk group: low risk (85 latent topics), high risk (70 latent topics) and non-endometrioid group (55 latent topics) (\u003cstrong\u003eFigure 1,\u0026nbsp;\u003c/strong\u003eleft panels). Then, we used latent Dirichlet allocation (LDA) to identify differentially abundant topics by comparing topic distribution profiles between recurrence groups (\u003cstrong\u003eFigure 1,\u0026nbsp;\u003c/strong\u003emiddle panels). 
Topics were considered statistically significant if they met an FDR-corrected p-value threshold of \u0026lt; 0.05 and demonstrated negative log2 fold changes.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eTumor micro-environment\u003c/strong\u003e\u003cstrong\u003e\u0026nbsp;features associated with EC recurrence:\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eWe assessed the tumor immune microenvironment and CAF using gene expression patterns derived from RNAseq (\u003cstrong\u003eFigure 2\u003c/strong\u003e). Topic modelling was applied to determine which of these cellular components were most informative for EC recurrence. Immune cell populations identified in this initial analysis were subsequently introduced into the integrated topic modeling framework alongside significant genomic, metagenomic and clinical features, stratified by risk group. \u0026nbsp;\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eEnvironmental data associated with EC recurrence:\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eFor low-risk EC, five out of six air pollutants were informative for EC recurrence (CO, NO\u003csub\u003e2\u003c/sub\u003e, O\u003csub\u003e3\u003c/sub\u003e, PM\u003csub\u003e10\u003c/sub\u003e, and PM\u003csub\u003e2.5\u003c/sub\u003e), while four out of six were informative for high-risk EC and three out of six for non-endometrioid EC (\u003cstrong\u003eFigure 3\u003c/strong\u003e). The aerosols PM\u003csub\u003e10\u003c/sub\u003e and PM\u003csub\u003e2.5\u003c/sub\u003e, like O\u003csub\u003e3\u003c/sub\u003e, were consistently associated with increased risk (OR \u0026gt; 1), while CO, NO\u003csub\u003e2\u003c/sub\u003e, and SO\u003csub\u003e2\u003c/sub\u003e showed inverse associations (OR \u0026lt; 1). Variables selected by this model were then incorporated into the integrated datasets together with significant genomic, metagenomic and clinical data variables for the final analysis. 
The initial lasso regression results for social and environmental determinants of health are presented in \u003cstrong\u003eSupplementary Figure 1\u003c/strong\u003e.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eIntegration of resulting models:\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eAll features identified as significant across clinical, pathological, genomic, microbiological, and environmental analyses were incorporated into an integrated dataset and analyzed using topic modelling. Three integrated models were evaluated: a) a model including clinical, genomic, and immune features (Clin+Gen+Imm); b) a model additionally incorporating air pollution variables (Clin+Gen+Imm+Pol); and c) a model further including social and environmental determinants of health data (Clin+Gen+Imm+Env). This stepwise modeling strategy was employed because county identifiers linking environmental data to patients were available for air pollution in 74% of cases and for social/environmental determinants of health in only 64% of patients.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eThe composition of significant topics across all three EC risk groups (low-risk, high-risk, and non-endometrioid) with and without environmental variables is summarized in \u003cstrong\u003eFigure 4\u003c/strong\u003e. 
Topic model optimization and computational performance for each integrated model (Clin+Gen+Imm, Clin+Gen+Imm+Pol, and Clin+Gen+Imm+Env) are presented in \u003cstrong\u003eSupplementary Figures 2, 3\u003c/strong\u003e, and 4, respectively.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003ePer-topic variable probabilities, detailing the expected average probability for each component within a given topic, indicated that \u003cem\u003eBacillus\u003c/em\u003e was the most probable microbial genus across all risk groups, especially for low-risk EC (28%) but also for non-endometrioid type (10%) (\u003cstrong\u003eSupplementary Table 5\u003c/strong\u003e). 
In addition, \u003cem\u003eStenotrophomonas\u003c/em\u003e (10%) and \u003cem\u003eThermothielavioides\u003c/em\u003e (27%) were frequently observed in significant recurrence-associated topics in low-risk EC.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eAmong clinical variables, BMI was the only feature retained after data integration; however, it was observed exclusively in the low-risk group and at a low probability (0.6%). Variables with higher per-topic probabilities (\u0026gt;10%) were predominantly gene expression features. In the high-risk group, these included ENSG00000137745.12 (\u003cem\u003eMMP13\u003c/em\u003e), ENSG00000143556.9 (\u003cem\u003eS100A7\u003c/em\u003e), ENSG00000198732.10 (\u003cem\u003eSMOC1\u003c/em\u003e), and ENSG00000278540.5 (\u003cem\u003eACACA\u003c/em\u003e). In the non-endometrioid group, highly probable genes included ENSG00000075340.23 (\u003cem\u003eADD2\u003c/em\u003e), ENSG00000105880.7 (\u003cem\u003eDLX5\u003c/em\u003e), ENSG00000137491.15 (\u003cem\u003eSLCO2B1\u003c/em\u003e), and ENSG00000188039.14 (\u003cem\u003eNWD1\u003c/em\u003e), along with the pseudogenes ENSG00000128262.8 (\u003cem\u003ePOM121L9P\u003c/em\u003e) and ENSG00000234975.6 (\u003cem\u003eFTH1P2\u003c/em\u003e).\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eCNVs were detected in both low-risk and non-endometrioid groups; however, their probabilities within significant topics were consistently low. Similarly, immune contexture features, CAF, and gene isoform expression contributed at low frequency. SNVs were infrequent and were not prominent in any risk group. 
Although air pollutants and social/environmental determinants of health were present across all models, their per-topic probabilities were uniformly low (\u0026lt;1%).\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eNotably, the inclusion of environmental variables altered the composition of microbiome-associated features within the resulting topic models, suggesting interactions between environmental exposures and tumor-associated microbial communities.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eTraining, validation and testing models for EC recurrence:\u003c/strong\u003e\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eWe developed, validated and tested predictive models for EC recurrence using features from significant topics identified in the integrated dataset. This analysis evaluated whether topic-selected features were also informative predictors of recurrence. Models were trained and cross-validated using two analytical platforms: \u003cem\u003eMATLAB\u003c/em\u003e-based machine learning (ML) and \u003cem\u003eTensorFlow\u003c/em\u003e-based deep learning (DL). For \u003cem\u003eMATLAB\u003c/em\u003e, only the best performing models were retained from 35 candidate configurations. For \u003cem\u003eTensorFlow\u003c/em\u003e, training accounted for class imbalance through oversampling strategies, as recurrence events comprised approximately 10-30% of samples. \u0026nbsp;\u003c/p\u003e\n\u003cp\u003eSeparate recurrence prediction models were trained for each risk group: low-risk EC (\u003cstrong\u003eFigure 5\u003c/strong\u003e), high-risk EC (\u003cstrong\u003eSupplementary Figure 5\u003c/strong\u003e) and non-endometrioid EC (\u003cstrong\u003eSupplementary Figure 6\u003c/strong\u003e). For each group we trained models including different combinations of data. 
\u003cstrong\u003eFigure 5\u003c/strong\u003e summarizes model performance for the low-risk group, as measured by the area under the receiver operating characteristic curve (AUC), for models incorporating: clinical and metagenomic data (Clin+Gen;\u003cstrong\u003e\u0026nbsp;Figure 5a\u003c/strong\u003e and \u003cstrong\u003e5b\u003c/strong\u003e); clinical, microbiome, genomic and\u0026nbsp;immune contexture\u0026nbsp;(Clin+Gen+Imm;\u0026nbsp;\u003cstrong\u003eFigure 5c\u003c/strong\u003e and \u003cstrong\u003e5d\u003c/strong\u003e);\u0026nbsp;clinical, microbiome, genomic, immune contexture, and air pollution data\u0026nbsp;(Clin+Gen+Imm+Pol;\u0026nbsp;\u003cstrong\u003eFigure 5e\u003c/strong\u003e and \u003cstrong\u003e5f\u003c/strong\u003e); and\u0026nbsp;clinical, microbiome, genomic, immune contexture, and social/environmental data\u0026nbsp;(Clin+Gen+Imm+Env;\u0026nbsp;\u003cstrong\u003eFigure 5g\u003c/strong\u003e and \u003cstrong\u003e5h\u003c/strong\u003e). Equivalent modeling strategies were applied to the high-risk (\u003cstrong\u003eSupplementary Figure 5\u003c/strong\u003e) and non-endometrioid groups (\u003cstrong\u003eSupplementary Figure 6\u003c/strong\u003e).\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eAcross all three risk groups and both analytical platforms, models containing clinical, microbiome, genomic and immune contexture features (Clin+Gen+Imm) demonstrated the strongest performance in the testing set. For low-risk EC, AUCs reached 0.93 using \u003cem\u003eMATLAB\u003c/em\u003e (\u003cstrong\u003eFigure 5c\u003c/strong\u003e) and 0.88 using \u003cem\u003eTensorFlow\u003c/em\u003e (\u003cstrong\u003eFigure 5d\u003c/strong\u003e). In high-risk EC, corresponding AUCs were 0.9 (MATLAB; \u003cstrong\u003eSupplementary Figure 5c\u003c/strong\u003e) and 0.85 (TensorFlow;\u0026nbsp;\u003cstrong\u003eSupplementary Figure 5d\u003c/strong\u003e). 
In non-endometrioid EC, AUCs were\u0026nbsp;0.79 (MATLAB;\u0026nbsp;\u003cstrong\u003eSupplementary Figure 6c\u003c/strong\u003e)\u0026nbsp;and 0.76 (TensorFlow;\u0026nbsp;\u003cstrong\u003eSupplementary Figure 6d\u003c/strong\u003e).\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eAlthough inclusion of environmental variables, namely air pollution (Clin+Gen+Imm+Pol; \u003cstrong\u003eFigure 5e\u003c/strong\u003e and \u003cstrong\u003e5f\u003c/strong\u003e) and social/environmental determinants of health (Clin+Gen+Imm+Env; \u003cstrong\u003eFigure\u003c/strong\u003e \u003cstrong\u003e5g\u003c/strong\u003e and \u003cstrong\u003e5h\u003c/strong\u003e), reduced sample size due to missing county-level data (see confusion matrix in low-risk and non-endometrioid groups - \u003cstrong\u003eSupplementary Figure 6e-h\u003c/strong\u003e), model performance in testing sets remained acceptable. This was particularly evident in the high-risk group, where AUC reached 0.89 for\u0026nbsp;Clin+Gen+Imm+Pol\u0026nbsp;and 0.8 for\u0026nbsp;Clin+Gen+Imm+Env models\u0026nbsp;(\u003cstrong\u003eSupplementary Figure 5e-h\u003c/strong\u003e).\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eTo assess the relative contribution of individual predictors within the best-performing models, we applied Shapley value analysis (\u003cstrong\u003eSupplementary Figure 7\u003c/strong\u003e). Incorporation of air pollution measures and social/environmental determinants of health consistently altered the ranking and composition of the most influential predictors across all three EC risk groups, with particularly pronounced effects on microbiome-associated features (\u003cstrong\u003eSupplementary Figure 7d-i\u003c/strong\u003e). 
These effects were most evident in the non-endometrioid group (\u003cstrong\u003eSupplementary Figure 7i\u003c/strong\u003e), where multiple social/environmental determinants, including proximity to high-volume roadways and airports, proximity to impaired water bodies, and limited food access, emerged as influential contributors to recurrence prediction.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eExternal testing of models for EC recurrence:\u003c/strong\u003e\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eAfter downloading and pre-processing the TCGA EC dataset using the same pipeline applied to the ORIEN cohort, we extracted variables corresponding to those retained in the integrated topic models encompassing clinical, genomic/metagenomic and immune contexture features. To first assess whether the EC risk group stratification derived from ORIEN was comparable in TCGA, we evaluated progression-free survival (PFS) across low-risk, high-risk, and non-endometrioid groups in both datasets (\u003cstrong\u003eSupplementary Figure 8\u003c/strong\u003e). Although differences in PFS were observed, the 95% confidence intervals (CIs) for all three risk groups overlapped substantially, particularly during the first 2-3 years of follow-up. TCGA represents a valuable external resource but has known limitations that may affect validation performance,\u003csup\u003e8\u003c/sup\u003e including missing variables, limited follow-up and case status reporting, incompletely staged cases, and differences in timing of biospecimen collection. These factors are likely to contribute to the divergence of PFS curves observed later in follow-up.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eCounty-level identifiers are not available in TCGA because they constitute personally identifying data; therefore, environmental exposures could not be linked to TCGA EC cases. 
In addition, several features present in the integrated ORIEN topic models were not available in TCGA: a) in low-risk EC, 29% of significant components were missing, including CNVs (mainly in long non-coding RNAs) and some microbial genera, such as \u003cem\u003eThermothielavioides\u003c/em\u003e, with a probability of 27% of being a component of the resulting topic, \u003cem\u003eTheileria\u003c/em\u003e and \u003cem\u003eRhizoctonia\u003c/em\u003e, with probabilities below 8%, and the rest with probabilities below 4%; b) in high-risk EC, 18% of significant components were missing, such as the genera \u003cem\u003eMalassezia\u0026nbsp;\u003c/em\u003ewith a probability of 4% and \u003cem\u003eCandida\u003c/em\u003e with a probability of 5%; the remaining missing components had probabilities of 2% or below; c) in the non-endometrioid group, only 11% of components were missing, all of them with probabilities below 0.5%. Notably, \u003cem\u003eThermothielavioides\u003c/em\u003e was absent from all resulting topic models after air pollution variables were introduced, while \u003cem\u003eTheileria\u003c/em\u003e, \u003cem\u003eRhizoctonia\u003c/em\u003e, \u003cem\u003eMalassezia\u003c/em\u003e, and \u003cem\u003eCandida\u003c/em\u003e appeared in only one of four topics when air pollution was included (\u003cstrong\u003eFigure 4\u003c/strong\u003e).\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eWe next performed topic modeling in TCGA using all features overlapping with the ORIEN-derived models (\u003cstrong\u003eFigure 6\u003c/strong\u003e). Despite missing variables, microbiome-related components in TCGA topic models demonstrated probability distributions similar to those observed in ORIEN, with overlapping 95% confidence intervals (\u003cstrong\u003eSupplementary Figure 9\u003c/strong\u003e). Two exceptions were noted: \u003cem\u003eBacillus\u003c/em\u003e exhibited a higher probability in TCGA compared with ORIEN (86% vs. 
4%; \u003cstrong\u003eSupplementary Figure 9b\u003c/strong\u003e), whereas \u003cem\u003eEscherichia\u003c/em\u003e also showed increased probability in TCGA (0.3% vs. 8%; \u003cstrong\u003eSupplementary Figure 9c\u003c/strong\u003e).\u003c/p\u003e\n\u003cp\u003eGiven the incomplete overlap of variables between datasets, we retrained recurrence prediction models in the ORIEN cohort using only features available in TCGA to enable external validation. As in prior analyses, models were trained and validated using \u003cem\u003eMATLAB\u003c/em\u003e (ML) and \u003cem\u003eTensorFlow\u003c/em\u003e (DL) approaches, with oversampling applied to address class imbalance. The TCGA cohort was then used as an independent external test set (\u003cstrong\u003eFigure 7\u003c/strong\u003e). Although overall model performance was reduced relative to internal testing (\u003cstrong\u003eFigure 5\u003c/strong\u003e), reflecting the loss of informative variables, the AUCs obtained in TCGA testing fell within the 95% confidence intervals of the newly trained ORIEN models, indicating no statistically meaningful performance degradation.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eFinally, Shapley value analysis was applied to both the ORIEN-trained models and TCGA-tested models to assess predictor importance (\u003cstrong\u003eSupplementary Figure 10\u003c/strong\u003e). 
In both low-risk and high-risk EC groups, the most influential contributors were concordant between training and testing models: \u003cem\u003eBacillus\u003c/em\u003e and \u003cem\u003eEscherichia\u003c/em\u003e in low-risk EC (\u003cstrong\u003eSupplementary\u0026nbsp;\u003c/strong\u003e\u003cstrong\u003eFigures\u0026nbsp;\u003c/strong\u003e\u003cstrong\u003e10a\u0026nbsp;\u003c/strong\u003eand\u003cstrong\u003e\u0026nbsp;10d\u003c/strong\u003e), and \u003cem\u003eSMOC1\u003c/em\u003e (\u003cem\u003eENSG00000198732\u003c/em\u003e), \u003cem\u003eENSG00000214776\u003c/em\u003e pseudogene expression and \u003cem\u003eAcinetobacter\u003c/em\u003e in high-risk EC\u0026nbsp;(\u003cstrong\u003eSupplementary\u0026nbsp;\u003c/strong\u003e\u003cstrong\u003eFigures\u0026nbsp;\u003c/strong\u003e\u003cstrong\u003e10b\u0026nbsp;\u003c/strong\u003eand\u003cstrong\u003e\u0026nbsp;10e\u003c/strong\u003e). In non-endometrioid EC, multiple predictors contributed consistently across training and testing models, including T cells, CD8+ T cells, CNVs involving the \u003cem\u003eADA\u003c/em\u003e and \u003cem\u003eKRT9\u003c/em\u003e genes, and microbial features such as \u003cem\u003eBacillus\u003c/em\u003e and \u003cem\u003eEscherichia\u003c/em\u003e (\u003cstrong\u003eSupplementary\u0026nbsp;\u003c/strong\u003e\u003cstrong\u003eFigures\u0026nbsp;\u003c/strong\u003e\u003cstrong\u003e10c\u0026nbsp;\u003c/strong\u003eand\u003cstrong\u003e\u0026nbsp;10f\u003c/strong\u003e).\u003c/p\u003e"},{"header":"DISCUSSION","content":"\u003cp\u003eEndometrial cancer (EC) recurrence is a complex, multifactorial process that cannot be fully explained by tumor-intrinsic features alone. In this study, we applied an integrative, systems-level framework to model EC recurrence as an emergent property of interactions among clinical factors, tumor genomics, immune contexture, tumor-associated microbial communities, and environmental exposures. 
Using topic modeling to capture coordinated, cross-domain patterns and machine learning approaches to evaluate predictive performance, we identified reproducible, risk group-specific recurrence signatures that generalized across analytical platforms and independent datasets. Importantly, features selected through topic modeling retained strong predictive value in recurrence models, supporting the biological and clinical relevance of these integrated patterns. Together, these findings underscore the multifactorial nature of EC recurrence and demonstrate the feasibility of integrated, multi-domain modeling to interrogate recurrence biology at scale.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eWhat This Study Adds\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eA central finding of this study is that incorporation of environmental and neighborhood-based exposures reshaped recurrence-associated topic composition across all risk groups. Microbial communities that were prominent in models incorporating only tumor-intrinsic features were attenuated or absent after inclusion of air pollution and social\u0026ndash;environmental variables, indicating that recurrence-associated bacterial signatures are strongly context-dependent. These findings suggest that tumor-associated microbial signals reflect broader tumor\u0026ndash;host\u0026ndash;environment interactions rather than static or isolated microbial effects.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eTumor-Microbiome Interactions in EC Recurrence\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eAcross all risk groups, \u003cem\u003eBacillus\u003c/em\u003e emerged as the bacterial genus with the highest per-topic probability, although its directionality differed by risk category. Decreased representation of Bacillus was associated with recurrence in low-risk EC, whereas increased representation was linked to recurrence in high-risk and non-endometrioid disease. 
This bidirectional association suggests that tumor-associated bacterial signals may reflect underlying tumor biology, host factors, or treatment context rather than uniform oncogenic or protective effects. Similar context-dependent microbial associations have been reported in other hormonally influenced malignancies,\u003csup\u003e42\u003c/sup\u003e supporting the interpretation of these signals as ecological markers of tumor state.\u003c/p\u003e\n\u003cp\u003eIn the low-risk EC group, microbes that were quite abundant in models before environmental factors were introduced were scarcely seen afterwards, including the genera \u003cem\u003eThermothielavioides\u003c/em\u003e, \u003cem\u003eTheileria\u003c/em\u003e, \u003cem\u003eRhizoctonia\u003c/em\u003e, \u003cem\u003eMalassezia\u003c/em\u003e, and \u003cem\u003eCandida\u003c/em\u003e. The reason for this change in microbiome composition is difficult to establish, because our study design can identify associations but cannot infer causality; nevertheless, this is an intriguing observation that warrants further follow-up.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eImportantly, bacterial communities identified in this study were inferred from tumor-derived bulk RNA sequencing data rather than from dedicated microbiome sampling. As such, these findings should be interpreted as relative, comparative signals reflecting tumor-associated microbial nucleic acids rather than direct measures of viable or mucosal microbiota. Nevertheless, consistent identification of specific genera across modeling approaches, risk strata, and external validation supports their relevance as ecological markers of tumor\u0026ndash;host\u0026ndash;environment interactions rather than isolated microbial drivers. 
These results should therefore be viewed as hypothesis-generating.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eSocial and Environmental Determinants Associated with EC Recurrence\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eEnvironmental exposures and social determinants of health emerged as consistent modifiers of recurrence-associated patterns across EC risk groups. Ozone (O₃) was repeatedly identified as a component of recurrence-associated topics in all three risk strata, supporting a biologically plausible link between ambient oxidative stress and EC recurrence. O₃ exposure has been implicated in oxidative DNA damage, inflammatory signaling, immune modulation, and estrogen dysregulation, pathways central to EC pathogenesis and progression, particularly in hormonally responsive tissues.\u003csup\u003e36-45\u003c/sup\u003e Although individual-level exposure assessment was not feasible, the reproducible association of O₃ with recurrence-associated topics suggests that environmental oxidative stress may act as a contextual modifier of tumor biology rather than an isolated risk factor.\u003c/p\u003e\n\u003cp\u003eIn parallel, social and environmental determinants of health contributed to the composition of recurrence-associated topics, most prominently in high-risk and non-endometrioid EC. Features such as proximity to high-volume roadways and airports, impaired water bodies, and limited food access, proxies for structural and environmental disadvantage, were among the variables influencing these patterns. 
These findings are consistent with growing evidence linking neighborhood-level exposures to cancer outcomes and support a model in which place-based factors shape tumor biology through indirect, cumulative mechanisms.\u003csup\u003e46,47\u003c/sup\u003e Notably, these associations persisted despite individual-level race or ethnicity not emerging as dominant predictors, underscoring the potential importance of structural context beyond individual demographic characteristics.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eRisk Group\u0026ndash;Specific Biological Programs Underlying EC recurrence.\u003c/strong\u003e\u003cbr\u003e\u0026nbsp;Recurrence-associated patterns differed substantially by clinical risk group, reinforcing the biological heterogeneity of EC recurrence pathways and arguing against a single, unified mechanism of relapse. Low-risk endometrioid EC recurrence was driven predominantly by metabolic and microbiome-associated features, with minimal persistence of clinical variables beyond body mass index (BMI). In contrast, high-risk and non-endometrioid tumors were characterized by greater contributions from gene expression programs, immune contexture, stromal activation, and environmental domains. This stratified behavior supports the concept that recurrence mechanisms, and therefore opportunities for refined risk stratification or intervention, may differ fundamentally across EC subtypes.\u003c/p\u003e\n\u003cp\u003eObesity, and its proxy BMI, are intrinsically linked to estrogen metabolism, EC pathogenesis, metabolic syndrome, and microbiome dysbiosis.\u003csup\u003e36\u003c/sup\u003e Accordingly, the persistence of BMI as a component of low-risk recurrence models is biologically plausible, particularly given the estrogen-responsive nature of low-risk endometrioid tumors. 
In this group, recurrence-associated patterns reflected a coordinated imbalance involving reduced \u003cem\u003eBacillus\u003c/em\u003e, elevated ozone exposure, copy number alterations in genes primarily related to nucleocytoplasmic transport, and increased CAF representation, suggesting a convergence of hormonal, metabolic, microbial, and stromal influences that may favor tumor re-emergence.\u003c/p\u003e\n\u003cp\u003eIn contrast, clinical variables previously associated with recurrence risk in earlier analyses, including ethnicity, chemotherapy exposure, albumin, and red blood cell distribution width, did not persist within the integrated topic models for high-risk or non-endometrioid EC. Instead, recurrence in these groups was characterized by dysregulation of gene, pseudogene, and isoform expression involving T-cell signaling pathways, lipid and carbohydrate metabolism, folate transport and metabolism, and basal transcriptional machinery. These molecular programs co-occurred with pronounced immune and stromal features, including increased CAF abundance and M1 macrophage infiltration, as well as consistent microbiome dysbiosis marked by increased \u003cem\u003eBacillus\u003c/em\u003e and \u003cem\u003eCandida\u003c/em\u003e and decreased \u003cem\u003eEscherichia\u003c/em\u003e. Elevated ozone exposure was again observed, suggesting a recurring environmental backdrop across higher-risk disease.\u003c/p\u003e\n\u003cp\u003eNotably, increased CAF representation emerged as a shared feature across all EC subtypes associated with recurrence, consistent with prior evidence implicating stromal remodeling in disease progression.\u003csup\u003e48\u003c/sup\u003e However, heightened M1 macrophage infiltration was restricted to high-risk and non-endometrioid tumors, underscoring risk group\u0026ndash;specific immune dynamics. 
Together, these findings highlight that EC recurrence arises from distinct, subtype-dependent biological programs shaped by interacting tumor-intrinsic, microenvironmental, microbial, and environmental factors, rather than from a uniform recurrence pathway.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eClinical and Translational Implications of Integrated Recurrence Modeling\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThis study was not designed to produce an immediately deployable clinical prediction tool. Rather, it establishes a scalable, modular analytic framework for integrating heterogeneous biological and environmental data to interrogate EC recurrence biology at a systems level. Although the recurrence prediction models developed here performed comparably to previously published models, they consistently demonstrated that features emerging from integrated topic models encompassing clinical variables, tumor genomics, immune contexture, microbiome composition, and environmental exposures capture biologically meaningful recurrence-associated patterns. Notably, incorporation of air pollution variables altered microbiome feature composition without degrading model performance, underscoring the tightly interconnected nature of tumor, host, microbial, and environmental factors influencing EC relapse.\u003c/p\u003e\n\u003cp\u003eTo minimize overfitting and assess generalizability, both topic models and recurrence prediction models were evaluated using the TCGA EC cohort as an independent external dataset. Because TCGA lacks several key variables present in ORIEN, including environmental exposures such as air pollutants, models were retrained in ORIEN using TCGA-compatible features prior to external testing. Despite these constraints, model performance in TCGA remained within the 95% confidence intervals of the retrained ORIEN models, indicating preserved predictive stability. 
Differences in cohort composition and data structure likely influenced external validation performance, including earlier-era sample collection in TCGA, more limited follow-up and case status reporting, and reduced histologic diversity within non-endometrioid tumors compared with ORIEN. These limitations highlight the challenges of external validation for integrated, multi-domain models and emphasize the importance of contemporary, deeply annotated cohorts for translational modeling.\u003c/p\u003e\n\u003cp\u003eFrom an NIH translational research perspective, this work primarily occupies the T0\u0026ndash;T1 space, generating integrated biological insights and analytically validated recurrence signatures rather than clinical decision tools. Importantly, however, it provides a foundation for progression toward T2 translation. Specifically, this framework enables prospective cohort studies incorporating longitudinal biospecimen collection, spatially resolved tumor and microenvironment profiling, and microbiome-specific assays to validate and refine recurrence-associated programs. Such studies can inform risk-adapted surveillance strategies, identify biologically defined subgroups most likely to benefit from targeted interventions, and guide the rational design of prevention or interception trials. By establishing a reproducible, extensible modeling architecture, this study advances the field toward clinically actionable integration of tumor biology, host context, and environmental exposures in EC recurrence.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eStrengths\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eA major strength of this study is the integration of diverse data modalities within a unified analytical framework. By jointly modeling clinical, pathological, genomic, immune, microbiome, and environmental features, we move beyond traditional single-domain analyses and provide a more holistic view of EC recurrence biology. 
Topic modeling enabled identification of coordinated feature sets that reflect biologically meaningful processes rather than isolated variables, while subsequent machine learning models demonstrated that these topic-derived features are also robust predictors of recurrence across multiple risk groups.\u003c/p\u003e\n\u003cp\u003eAnother key strength is the use of complementary analytical platforms. Consistent performance across MATLAB-based machine learning and TensorFlow-based deep learning approaches supports the robustness of our findings and reduces the likelihood that results are driven by platform-specific modeling assumptions. The application of Shapley value analysis further strengthens interpretability by clarifying the relative contribution of individual predictors within the best-performing models, an important consideration for translational relevance.\u003c/p\u003e\n\u003cp\u003eExternal validation using the TCGA endometrial cancer cohort represents an additional strength. Despite incomplete overlap of features and known limitations of TCGA data, recurrence models trained in ORIEN and tested in TCGA demonstrated performance that remained within the confidence bounds of internally validated models. Concordance of key predictors\u0026mdash;particularly microbiome-associated features, immune cell populations, and select genomic alterations\u0026mdash;between ORIEN training models and TCGA testing models provides evidence of generalizability and biological consistency across independent datasets.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eLimitations\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eSeveral limitations should be acknowledged. First, tumor-associated microbial signals were inferred from bulk RNA sequencing rather than from dedicated microbiome sequencing platforms. 
As such, these findings should be interpreted as relative, comparative signals reflecting microbial nucleic acids present in tumor-derived data rather than direct measures of viable or mucosal microbiota. While consistent identification of specific genera across risk groups, modeling strategies, and external validation supports their relevance as ecological markers, functional and spatial validation will be required to clarify causal relationships.\u003c/p\u003e\n\u003cp\u003eSecond, integration of environmental exposures was constrained by data availability. County-level identifiers were required to link air pollution and social or environmental determinants of health to individual patients, resulting in reduced sample sizes for models incorporating these variables. This limitation likely attenuated model performance in some settings and may have reduced power to detect stronger environmental effects. Moreover, TCGA lacks county-level identifiers entirely, precluding external validation of environmental features and limiting assessment of their generalizability.\u003c/p\u003e\n\u003cp\u003eThird, external validation using TCGA was affected by incomplete overlap of features between datasets, differences in follow-up duration, case status reporting, and timing of biospecimen collection. These factors necessitated retraining of recurrence models in ORIEN using TCGA-compatible features and likely contributed to reduced absolute performance in external testing. 
Nevertheless, the observation that TCGA testing performance remained within the confidence intervals of retrained ORIEN models supports the stability of the underlying predictive framework despite these constraints.\u003c/p\u003e"},{"header":"CONCLUSION","content":"\u003cp\u003eIn summary, this analysis characterizes endometrial cancer recurrence as an emergent property of coordinated interactions among tumor-intrinsic programs and extrinsic contextual factors, rather than as the consequence of any single biological domain. By integrating clinical, pathological, genomic, immune, microbiome, and environmental features within a systems-level modeling framework, we demonstrate that these complex interactions can be quantified, interpreted, and externally validated across independent cohorts.\u003c/p\u003e\n\u003cp\u003eImportantly, both intrinsic tumor and host characteristics and extrinsic environmental and social exposures contributed to recurrence-associated patterns, with their relative influence varying by clinical risk group. These findings underscore the biological heterogeneity underlying EC relapse and highlight the limitations of one-size-fits-all prediction approaches. Collectively, this work supports the need for more individualized, context-aware models of disease outcomes and establishes an extensible analytic foundation for future translational efforts aimed at improving EC recurrence risk stratification, prevention, and intervention.\u003c/p\u003e"},{"header":"Declarations","content":"\u003cp\u003e\u003cstrong\u003eEthical Approval and Consent to participate:\u0026nbsp;\u003c/strong\u003ePatients consent to donate medical records and tissue specimens for molecular profiling to the ORIEN network as an approach to improve design and performance of personalized cancer care. 
The ORIEN network is comprised of multiple cancer centers that have agreed to use the same Institutional Review Board (IRB)-approved protocol and consent (Total Cancer Care Protocol, TCC).\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAvailability of data and materials:\u0026nbsp;\u003c/strong\u003eThe data used in this study was generated through private funding by Aster Insights (www.asterinsights.com) in collaboration with the Oncology Research Information Exchange Network (ORIEN, www.oriencancer.org). Inquiries regarding access to the data or collaboration within ORIEN should be submitted here at https://researchdatarequest.orienavatar.com/. For non-ORIEN academic researchers, only processed data outputs from clinical, whole exome and whole transcriptome data may be available, where applicable.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eDisclosure of potential conflict of interest:\u003c/strong\u003e All authors certify that they have no affiliations with or involvement in any organization or entity with any financial interest (such as honoraria; educational grants; participation in speakers\u0026rsquo; bureaus; membership, employment, consultancies, stock ownership, or other equity interest; and expert testimony or patent-licensing arrangements), or non-financial interest (such as personal or professional relationships, affiliations, knowledge or beliefs) in the subject matter or materials discussed in this manuscript.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eConsent for publication:\u0026nbsp;\u003c/strong\u003eAll authors have reviewed and approved the manuscript for submission.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eFunding:\u0026nbsp;\u003c/strong\u003eThis work was supported in part by the NIH grant 5R01CA99908-18 to Kimberly K. Leslie, where Gonzalez Bosquet was a co-investigator. Additionally, Dr. 
Gonzalez Bosquet received support from the basic research fund of the Department of Obstetrics \u0026amp; Gynecology (2014) at the University of Iowa. Support was also provided in part by the American Association of Obstetricians and Gynecologists Foundation (AAOGF, 2014) Bridge Funding Award and the Holden Comprehensive Cancer Center (HCCC) Support Grant (5P30CA086862-23).\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAuthors\u0026rsquo; Contributions:\u003c/strong\u003e J.G.B., O.O.P., K.D., and D.S. wrote the main manuscript text; V.M.W., A.P., A.A.T., C.M.C., M.S.H., B.R.C., A.L.L., B.S., R.L.D., L.E.D., M.J.C., L.L., and L.C. participated in the review and editing of the manuscript; J.G.B., D.S., A.C.T., and N.J. helped in analysis design and interpretation; J.G.B., R.H., and D.S. participated in data formatting, curating and analysis; R.J.R. and M.L.C. helped with data procurement and resources administration.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\u003cli\u003e\u003cspan\u003eSiegel, R. L., Miller, K. D., Fuchs, H. E. \u0026amp; Jemal, A. Cancer statistics, 2022. \u003cem\u003eCA Cancer J Clin\u003c/em\u003e 72, 7\u0026ndash;33 (2022). \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org:10.3322/caac.21708\u003c/span\u003e\u003cspan address=\"https://doi.org:10.3322/caac.21708\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSheikh, M. A. \u003cem\u003eet al.\u003c/em\u003e USA endometrial cancer projections to 2030: should we be concerned? \u003cem\u003eFuture Oncol\u003c/em\u003e 10, 2561\u0026ndash;2568 (2014). 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org:10.2217/fon.14.192\u003c/span\u003e\u003cspan address=\"https://doi.org:10.2217/fon.14.192\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWestin, S. N. \u003cem\u003eet al.\u003c/em\u003e Durvalumab Plus Carboplatin/Paclitaxel Followed by Maintenance Durvalumab With or Without Olaparib as First-Line Treatment for Advanced Endometrial Cancer: The Phase III DUO-E Trial. \u003cem\u003eJournal of clinical oncology: official journal of the American Society of Clinical Oncology\u003c/em\u003e 42, 283\u0026ndash;299 (2024). \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org:10.1200/JCO.23.02132\u003c/span\u003e\u003cspan address=\"https://doi.org:10.1200/JCO.23.02132\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eEskander, R. N. \u003cem\u003eet al.\u003c/em\u003e Pembrolizumab plus Chemotherapy in Advanced Endometrial Cancer. \u003cem\u003eThe New England journal of medicine\u003c/em\u003e 388, 2159\u0026ndash;2170 (2023). \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org:10.1056/NEJMoa2302312\u003c/span\u003e\u003cspan address=\"https://doi.org:10.1056/NEJMoa2302312\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMirza, M. R. \u003cem\u003eet al.\u003c/em\u003e Dostarlimab for Primary Advanced or Recurrent Endometrial Cancer. \u003cem\u003eThe New England journal of medicine\u003c/em\u003e 388, 2145\u0026ndash;2158 (2023). 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org:10.1056/NEJMoa2216334\u003c/span\u003e\u003cspan address=\"https://doi.org:10.1056/NEJMoa2216334\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eDel Carmen, M. G., Boruta, D. M., 2nd \u0026amp; Schorge, J. O. Recurrent endometrial cancer. \u003cem\u003eClin Obstet Gynecol\u003c/em\u003e 54, 266\u0026ndash;277 (2011). \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org:10.1097/GRF.0b013e318218c6d1\u003c/span\u003e\u003cspan address=\"https://doi.org:10.1097/GRF.0b013e318218c6d1\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eRestaino, S. \u003cem\u003eet al.\u003c/em\u003e Recurrent Endometrial Cancer: Which Is the Best Treatment? Systematic Review of the Literature. \u003cem\u003eCancers (Basel)\u003c/em\u003e 14 (2022). \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org:10.3390/cancers14174176\u003c/span\u003e\u003cspan address=\"https://doi.org:10.3390/cancers14174176\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGonzalez Bosquet, J. \u003cem\u003eet al.\u003c/em\u003e Training, Validating, and Testing Machine Learning Prediction Models for Endometrial Cancer Recurrence. \u003cem\u003eJCO Precis Oncol\u003c/em\u003e 9, e2400859 (2025). \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org:10.1200/PO-24-00859\u003c/span\u003e\u003cspan address=\"https://doi.org:10.1200/PO-24-00859\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGonzalez-Bosquet, J. 
\u003cem\u003eet al.\u003c/em\u003e Bacterial, Archaea, and Viral Transcripts (BAVT) Expression in Gynecological Cancers and Correlation with Regulatory Regions of the Genome. \u003cem\u003eCancers (Basel)\u003c/em\u003e 13 (2021). \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org:10.3390/cancers13051109\u003c/span\u003e\u003cspan address=\"https://doi.org:10.3390/cancers13051109\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGonzalez-Bosquet, J. \u003cem\u003eet al.\u003c/em\u003e Microbial Communities in Gynecological Cancers and Their Association with Tumor Somatic Variation. \u003cem\u003eCancers (Basel)\u003c/em\u003e 15 (2023). \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org:10.3390/cancers15133316\u003c/span\u003e\u003cspan address=\"https://doi.org:10.3390/cancers15133316\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMadhogaria, B., Bhowmik, P. \u0026amp; Kundu, A. Correlation between human gut microbiome and diseases. \u003cem\u003eInfect Med (Beijing)\u003c/em\u003e 1, 180\u0026ndash;191 (2022). \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org:10.1016/j.imj.2022.08.004\u003c/span\u003e\u003cspan address=\"https://doi.org:10.1016/j.imj.2022.08.004\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eAggarwal, N. \u003cem\u003eet al.\u003c/em\u003e Microbiome and Human Health: Current Understanding, Engineering, and Enabling Technologies. \u003cem\u003eChem Rev\u003c/em\u003e 123, 31\u0026ndash;72 (2023). 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org:10.1021/acs.chemrev.2c00431\u003c/span\u003e\u003cspan address=\"https://doi.org:10.1021/acs.chemrev.2c00431\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eChambers, L. M. \u003cem\u003eet al.\u003c/em\u003e The Microbiome and Gynecologic Cancer: Current Evidence and Future Opportunities. \u003cem\u003eCurr Oncol Rep\u003c/em\u003e 23, 92 (2021). \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org:10.1007/s11912-021-01079-x\u003c/span\u003e\u003cspan address=\"https://doi.org:10.1007/s11912-021-01079-x\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLaniewski, P., Ilhan, Z. E. \u0026amp; Herbst-Kralovetz, M. M. The microbiome and gynaecological cancer development, prevention and therapy. \u003cem\u003eNat Rev Urol\u003c/em\u003e 17, 232\u0026ndash;250 (2020). \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org:10.1038/s41585-020-0286-z\u003c/span\u003e\u003cspan address=\"https://doi.org:10.1038/s41585-020-0286-z\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLi, C. \u003cem\u003eet al.\u003c/em\u003e Association between vaginal microbiota and the progression of ovarian cancer. \u003cem\u003eJ Med Virol\u003c/em\u003e 95, e28898 (2023). \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org:10.1002/jmv.28898\u003c/span\u003e\u003cspan address=\"https://doi.org:10.1002/jmv.28898\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSrikummoon, P. 
\u003cem\u003eet al.\u003c/em\u003e The recurrence and mortality risk in Luminal A breast cancer patients who lived in high pollution area. \u003cem\u003ePloS one\u003c/em\u003e 20, e0335140 (2025). \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org:10.1371/journal.pone.0335140\u003c/span\u003e\u003cspan address=\"https://doi.org:10.1371/journal.pone.0335140\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSmotherman, C. \u003cem\u003eet al.\u003c/em\u003e Association of air pollution with postmenopausal breast cancer risk in UK Biobank. \u003cem\u003eBreast Cancer Res\u003c/em\u003e 25, 83 (2023). \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org:10.1186/s13058-023-01681-w\u003c/span\u003e\u003cspan address=\"https://doi.org:10.1186/s13058-023-01681-w\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eCalaf, G. M., Ponce-Cusi, R., Aguayo, F., Munoz, J. P. \u0026amp; Bleak, T. C. Endocrine disruptors from the environment affecting breast cancer. \u003cem\u003eOncol Lett\u003c/em\u003e 20, 19\u0026ndash;32 (2020). \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org:10.3892/ol.2020.11566\u003c/span\u003e\u003cspan address=\"https://doi.org:10.3892/ol.2020.11566\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eDalton, W. S., Sullivan, D., Ecsedy, J. \u0026amp; Caligiuri, M. A. Patient Enrichment for Precision-Based Cancer Clinical Trials: Using Prospective Cohort Surveillance as an Approach to Improve Clinical Trials. \u003cem\u003eClin Pharmacol Ther\u003c/em\u003e 104, 23\u0026ndash;26 (2018). 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org:10.1002/cpt.1051\u003c/span\u003e\u003cspan address=\"https://doi.org:10.1002/cpt.1051\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003ePolio, A., Wagner, V., Bender, D. P., Goodheart, M. J. \u0026amp; Gonzalez Bosquet, J. A Natural Language Processing Method Identifies an Association Between Bacterial Communities in the Upper Genital Tract and Ovarian Cancer. \u003cem\u003eInt J Mol Sci\u003c/em\u003e 26 (2025). \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org:10.3390/ijms26157432\u003c/span\u003e\u003cspan address=\"https://doi.org:10.3390/ijms26157432\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHaas, B. J. \u003cem\u003eet al.\u003c/em\u003e STAR-Fusion: Fast and Accurate Fusion Transcript Detection from RNA-Seq. \u003cem\u003ebioRxiv\u003c/em\u003e (2017).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eDobin, A. \u003cem\u003eet al.\u003c/em\u003e STAR: ultrafast universal RNA-seq aligner. \u003cem\u003eBioinformatics\u003c/em\u003e 29, 15\u0026ndash;21 (2013). \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org:10.1093/bioinformatics/bts635\u003c/span\u003e\u003cspan address=\"https://doi.org:10.1093/bioinformatics/bts635\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHoyd, R. \u003cem\u003eet al.\u003c/em\u003e Exogenous Sequences in Tumors and Immune Cells (Exotic): A Tool for Estimating the Microbe Abundances in Tumor RNA-seq Data. \u003cem\u003eCancer Res Commun\u003c/em\u003e 3, 2375\u0026ndash;2385 (2023). 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org:10.1158/2767-9764.CRC-22-0435\u003c/span\u003e\u003cspan address=\"https://doi.org:10.1158/2767-9764.CRC-22-0435\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBreitwieser, F. P., Baker, D. N. \u0026amp; Salzberg, S. L. KrakenUniq: confident and fast metagenomics classification using unique k-mer counts. \u003cem\u003eGenome Biol\u003c/em\u003e 19, 198 (2018). \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org:10.1186/s13059-018-1568-0\u003c/span\u003e\u003cspan address=\"https://doi.org:10.1186/s13059-018-1568-0\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLu, J., Breitwieser, F. P., Thielen, P. \u0026amp; Salzberg, S. L. Bracken: estimating species abundance in metagenomics data. \u003cem\u003ePeerJ Comput Sci\u003c/em\u003e 3 (2017). \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org:10.7717/peerj-cs.104\u003c/span\u003e\u003cspan address=\"https://doi.org:10.7717/peerj-cs.104\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eCao, J., Xia, T., Li, J., Zhang, Y. \u0026amp; Tang, S. A density-based method for adaptive LDA model selection. \u003cem\u003eNeurocomputing\u003c/em\u003e 72, 1775\u0026ndash;1781 (2009). https://doi.org:\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1016/j.neucom.2008.06.011\u003c/span\u003e\u003cspan address=\"10.1016/j.neucom.2008.06.011\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGriffiths, T. L. \u0026amp; Steyvers, M. Finding scientific topics. 
\u003cem\u003eProceedings of the National Academy of Sciences\u003c/em\u003e 101, 5228\u0026ndash;5235 (2004). \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org:doi:10.1073/pnas.0307752101\u003c/span\u003e\u003cspan address=\"https://doi.org:doi:10.1073/pnas.0307752101\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003ePonweiser, M. \u003cem\u003eLatent Dirichlet Allocation in R\u003c/em\u003e (WU Vienna University of Economics and Business, Vienna, 2012).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eFinotello, F. \u003cem\u003eet al.\u003c/em\u003e Molecular and pharmacological modulators of the tumor immune contexture revealed by deconvolution of RNA-seq data. \u003cem\u003eGenome Med\u003c/em\u003e 11, 34 (2019). \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org:10.1186/s13073-019-0638-6\u003c/span\u003e\u003cspan address=\"https://doi.org:10.1186/s13073-019-0638-6\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBecht, E. \u003cem\u003eet al.\u003c/em\u003e Estimating the population abundance of tissue-infiltrating immune and stromal cell populations using gene expression. \u003cem\u003eGenome Biol\u003c/em\u003e 17, 218 (2016). \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org:10.1186/s13059-016-1070-5\u003c/span\u003e\u003cspan address=\"https://doi.org:10.1186/s13059-016-1070-5\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBhavnani, S. K. \u003cem\u003eet al.\u003c/em\u003e Enabling Comprehension of Patient Subgroups and Characteristics in Large Bipartite Networks: Implications for Precision Medicine. 
\u003cem\u003eAMIA Jt Summits Transl Sci Proc\u003c/em\u003e 2017, 21\u0026ndash;29 (2017).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBhavnani, S. K. \u003cem\u003eet al.\u003c/em\u003e Subtyping Social Determinants of Health in All of Us: Network Analysis and Visualization Approach. \u003cem\u003emedRxiv\u003c/em\u003e (2023). \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org:10.1101/2023.01.27.23285125\u003c/span\u003e\u003cspan address=\"https://doi.org:10.1101/2023.01.27.23285125\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLadbury, C. \u003cem\u003eet al.\u003c/em\u003e Utilization of model-agnostic explainable artificial intelligence frameworks in oncology: a narrative review. \u003cem\u003eTransl Cancer Res\u003c/em\u003e 11, 3853\u0026ndash;3868 (2022). \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org:10.21037/tcr-22-1626\u003c/span\u003e\u003cspan address=\"https://doi.org:10.21037/tcr-22-1626\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eDevelopers., T. TensorFlow, \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e%3Chttps://doi.org/10.5281/zenodo.5949169%3E\u003c/span\u003e\u003cspan address=\"%3C10.5281/zenodo.5949169%3E\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e (2022).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMohammad, N., Muad, A. M., Ahmad, R. \u0026amp; Yusof, M. Accuracy of advanced deep learning with tensorflow and keras for classifying teeth developmental stages in digital panoramic imaging. \u003cem\u003eBMC Med Imaging\u003c/em\u003e 22, 66 (2022). 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org:10.1186/s12880-022-00794-6\u003c/span\u003e\u003cspan address=\"https://doi.org:10.1186/s12880-022-00794-6\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eZheng, W. \u003cem\u003eet al.\u003c/em\u003e Gut microbiota and endometrial cancer: research progress on the pathogenesis and application. \u003cem\u003eAnn Med\u003c/em\u003e 57, 2451766 (2025). \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org:10.1080/07853890.2025.2451766\u003c/span\u003e\u003cspan address=\"https://doi.org:10.1080/07853890.2025.2451766\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eArnone, A. A. \u0026amp; Cook, K. L. Gut and Breast Microbiota as Endocrine Regulators of Hormone Receptor-positive Breast Cancer Risk and Therapy Response. \u003cem\u003eEndocrinology\u003c/em\u003e 164 (2022). \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org:10.1210/endocr/bqac177\u003c/span\u003e\u003cspan address=\"https://doi.org:10.1210/endocr/bqac177\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBukato, K., Kostrzewa, T., Gammazza, A. M., Gorska-Ponikowska, M. \u0026amp; Sawicki, S. Endogenous estrogen metabolites as oxidative stress mediators and endometrial cancer biomarkers. \u003cem\u003eCell Commun Signal\u003c/em\u003e 22, 205 (2024). 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org:10.1186/s12964-024-01583-0\u003c/span\u003e\u003cspan address=\"https://doi.org:10.1186/s12964-024-01583-0\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBolton, J. L. Quinoids, quinoid radicals, and phenoxyl radicals formed from estrogens and antiestrogens. \u003cem\u003eToxicology\u003c/em\u003e 177, 55\u0026ndash;65 (2002). \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org:10.1016/s0300-483x(02)00195-6\u003c/span\u003e\u003cspan address=\"https://doi.org:10.1016/s0300-483x(02)00195-6\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLiang, X. \u003cem\u003eet al.\u003c/em\u003e Ozone exposure at environmental level induces female reproductive impairment via transcriptomic and alternative analysis. \u003cem\u003eEcotoxicol Environ Saf\u003c/em\u003e 306, 119276 (2025). \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org:10.1016/j.ecoenv.2025.119276\u003c/span\u003e\u003cspan address=\"https://doi.org:10.1016/j.ecoenv.2025.119276\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eRousselle, D. \u0026amp; Silveyra, P. Acute Exposure to Ozone Affects Circulating Estradiol Levels and Gonadotropin Gene Expression in Female Mice. \u003cem\u003eInt J Environ Res Public Health\u003c/em\u003e 22 (2025). 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org:10.3390/ijerph22020222\u003c/span\u003e\u003cspan address=\"https://doi.org:10.3390/ijerph22020222\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eUrbaniak, C. \u003cem\u003eet al.\u003c/em\u003e The Microbiota of Breast Tissue and Its Association with Breast Cancer. \u003cem\u003eAppl Environ Microbiol\u003c/em\u003e 82, 5039\u0026ndash;5048 (2016). \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org:10.1128/AEM.01235-16\u003c/span\u003e\u003cspan address=\"https://doi.org:10.1128/AEM.01235-16\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eReuter, S., Gupta, S. C., Chaturvedi, M. M. \u0026amp; Aggarwal, B. B. Oxidative stress, inflammation, and cancer: how are they linked? \u003cem\u003eFree Radic Biol Med\u003c/em\u003e 49, 1603\u0026ndash;1616 (2010). \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org:10.1016/j.freeradbiomed.2010.09.006\u003c/span\u003e\u003cspan address=\"https://doi.org:10.1016/j.freeradbiomed.2010.09.006\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBaeza-Noci, J. \u0026amp; Pinto-Bonilla, R. Systemic Review: Ozone: A Potential New Chemotherapy. \u003cem\u003eInt J Mol Sci\u003c/em\u003e 22 (2021). \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org:10.3390/ijms222111796\u003c/span\u003e\u003cspan address=\"https://doi.org:10.3390/ijms222111796\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLunov, O. 
\u003cem\u003eet al.\u003c/em\u003e Cell death induced by ozone and various non-thermal plasmas: therapeutic perspectives and limitations. \u003cem\u003eSci Rep\u003c/em\u003e 4, 7129 (2014). \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org:10.1038/srep07129\u003c/span\u003e\u003cspan address=\"https://doi.org:10.1038/srep07129\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMadison, T., Schottenfeld, D., James, S. A., Schwartz, A. G. \u0026amp; Gruber, S. B. Endometrial cancer: socioeconomic status and racial/ethnic differences in stage at diagnosis, treatment, and survival. \u003cem\u003eAm J Public Health\u003c/em\u003e 94, 2104\u0026ndash;2111 (2004). \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org:10.2105/ajph.94.12.2104\u003c/span\u003e\u003cspan address=\"https://doi.org:10.2105/ajph.94.12.2104\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHelpman, L., Pond, G. R., Elit, L., Anderson, L. N. \u0026amp; Seow, H. Endometrial cancer presentation is associated with social determinants of health in a public healthcare system: A population-based cohort study. \u003cem\u003eGynecologic oncology\u003c/em\u003e 158, 130\u0026ndash;136 (2020). \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org:10.1016/j.ygyno.2020.04.693\u003c/span\u003e\u003cspan address=\"https://doi.org:10.1016/j.ygyno.2020.04.693\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWei, S., Conner, M. G., Zhang, K., Siegal, G. P. \u0026amp; Novak, L. Juxtatumoral stromal reactions in uterine endometrioid adenocarcinoma and their prognostic significance. 
\u003cem\u003eInt J Gynecol Pathol\u003c/em\u003e 29, 562\u0026ndash;567 (2010). \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org:10.1097/PGP.0b013e3181e36321\u003c/span\u003e\u003cspan address=\"https://doi.org:10.1097/PGP.0b013e3181e36321\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":true,"hideJournal":true,"highlight":"","institution":"","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"
[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true},"keywords":"","lastPublishedDoi":"10.21203/rs.3.rs-8682460/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-8682460/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003ch2\u003ePurpose\u003c/h2\u003e \u003cp\u003eIn a previous study, we trained, validated and tested models of endometrial cancer (EC) recurrence integrating clinical, genomic and pathological data from the Oncology Research Information Exchange Network (ORIEN). Preliminary studies have also demonstrated that bacterial communities may influence the risk of EC recurrence by altering the local environment within the upper female genital tract. The objective of this study was to evaluate whether extrinsic and environmental factors, including tumor-associated bacterial communities, tumor immune contexture and air pollution, alongside clinical, pathologic and genomic features, are associated with EC recurrence across clinically relevant risk groups.\u003c/p\u003e\u003ch2\u003ePatients and Methods:\u003c/h2\u003e \u003cp\u003eWe performed a retrospective, multi-institution, case\u0026ndash;control study with data from the ORIEN network EC dataset. Data were stratified into three groups: low-risk (FIGO grade 1 and 2, stage I; N\u0026thinsp;=\u0026thinsp;329), high-risk (FIGO grade 3 or stages II-IV; N\u0026thinsp;=\u0026thinsp;324), and non-endometrioid histology (N\u0026thinsp;=\u0026thinsp;239). RNA and DNA were extracted from tumor specimens and processed to obtain the necessary genomic/metagenomic data. 
Genus-level microbiome data were extracted and curated from RNA sequencing using \u003cem\u003eKraken2\u003c/em\u003e, \u003cem\u003eBracken\u003c/em\u003e and \u003cem\u003eexotic\u003c/em\u003e software packages. Risk of EC recurrence was evaluated by integrating microbiome and environmental data alongside existing clinical, pathological and genomic data using topic modelling with latent Dirichlet allocation (LDA). Prediction models of EC recurrence were created using machine and deep learning analytics (ML and DL) with \u003cem\u003eMATLAB\u003c/em\u003e apps and \u003cem\u003eTensorFlow\u003c/em\u003e. Finally, the performance of both topic and prediction models was externally validated in an independent EC dataset from TCGA.\u003c/p\u003e\u003ch2\u003eResults\u003c/h2\u003e \u003cp\u003eThe resulting models, analyzed with topic modelling, demonstrated the complexity of factors involved in recurrence of disease for EC. The components of the resulting topic models, and specifically the microbiome, changed when environmental factors, like air pollutants, were introduced in the model. In the low-risk EC group, microbes that were quite abundant in models before introducing environmental factors were scarcely seen afterwards, such as the genera \u003cem\u003eThermothielavioides\u003c/em\u003e, \u003cem\u003eTheileria\u003c/em\u003e and \u003cem\u003eRhizoctonia\u003c/em\u003e. \u003cem\u003eBacillus\u003c/em\u003e was the genus with the highest per-topic probability within all risk groups, especially for low-risk EC (28%). Ozone (O\u003csub\u003e3\u003c/sub\u003e) was a resulting component of all risk groups\u0026rsquo; models. BMI was the sole informative clinical variable after data integration, and was present only in the low-risk group. Resulting models from the high-risk and non-endometrioid groups included differential gene expressions: \u003cem\u003eMMP13, S100A7, SMOC1, ACACA\u003c/em\u003e and \u003cem\u003eADD2, DLX5, SLCO2B1, NWD1\u003c/em\u003e respectively. 
CNVs were also present in both low-risk and non-endometrioid groups, but their per-topic probabilities were low. The same was true for the immune contexture data. The components of the resulting topic models were used to train, validate and test prediction models of EC recurrence by risk group. Performances of these models were excellent (@ 0.9). Despite some microbiome data from the resulting topic models being missing in TCGA, prediction models trained in the ORIEN set had similar performance in the TCGA testing set, with overlapping AUC 95% CIs.\u003c/p\u003e\u003ch2\u003eConclusion\u003c/h2\u003e \u003cp\u003eBoth extrinsic factors (tumor-associated bacterial communities, tumor immune contexture and air pollution) and intrinsic factors predict EC recurrence. The complexity of tumor and host factors influencing cancer relapses underscores the need for more individualized prediction models of disease outcomes.\u003c/p\u003e","manuscriptTitle":"Intrinsic tumor factors and extrinsic environmental and social exposures contribute to endometrial cancer recurrence patterns","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2026-01-30 13:10:10","doi":"10.21203/rs.3.rs-8682460/v1","editorialEvents":[{"type":"communityComments","content":0}],"status":"published","journal":{"display":true,"email":"
[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"e7345d9b-0144-4f2d-9e2a-86d393c18f36","owner":[],"postedDate":"January 30th, 2026","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"posted","subjectAreas":[{"id":61863849,"name":"Biological sciences/Cancer"},{"id":61863850,"name":"Biological sciences/Computational biology and bioinformatics"},{"id":61863851,"name":"Biological sciences/Microbiology"},{"id":61863852,"name":"Health sciences/Oncology"},{"id":61863853,"name":"Health sciences/Risk factors"}],"tags":[],"updatedAt":"2026-02-14T21:53:40+00:00","versionOfRecord":[],"versionCreatedAt":"2026-01-30 13:10:10","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-8682460","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-8682460","identity":"rs-8682460","version":["v1"]},"buildId":"XKTyCvWXoU3ODBz1xrDgd","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}