Methods
The UKB [ 11 ] is a large-scale study that recruited participants across the United Kingdom and was approved by the North West Multi-Centre Research Ethics Committee as a Research Tissue Bank, and all participants provided informed consent. Whole exome sequencing (WES) data, array genotypes data and phenotype data used in our study were accessed under application number 29900. Genome-wide association statistics of FinnGen [ 12 ] and Biobank Japan (BBJ) [ 13 ] are previously published.
For this study, we included ~ 200,000 participants with available array genotype, WES, and phenotype data. We identified 3,736 individuals with endometriosis based on the ICD-10 codes N80 (Field 41,270, 41,280) and self-reported diagnoses of 1402 (Field 20,002). 196,907 individuals with no N80 diagnosis and no self-reported endometriosis were treated as controls. Only females of European ancestry were included (Field 22,006, 21,000). Subjects were excluded if there was a mismatch between genetic sex (Field 22,001) and self-reported sex (Field 31), with extreme heterozygosity (Field 22,027) or close relatives (kinship coefficient > 0.0442). This retained 87,100 individuals for subsequent analyses.
Genetic association analyses were performed using REGENIE [ 14 ], a fast two-step whole-genome regression method that effectively handles unbalanced case–control phenotypes, population structure and relatedness. We included in step1 of REGENIE array variants with minor allele frequency (MAF) \documentclass[12pt]{minimal}
\usepackage{amsmath}
\usepackage{wasysym}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{amsbsy}
\usepackage{mathrsfs}
\usepackage{upgreek}
\setlength{\oddsidemargin}{-69pt}
\begin{document}$$\ge $$\end{document} 1%, \documentclass[12pt]{minimal}
\usepackage{amsmath}
\usepackage{wasysym}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{amsbsy}
\usepackage{mathrsfs}
\usepackage{upgreek}
\setlength{\oddsidemargin}{-69pt}
\begin{document}$$\ge $$\end{document} 90% genotyping rate, Hardy–Weinberg equilibrium test \documentclass[12pt]{minimal}
\usepackage{amsmath}
\usepackage{wasysym}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{amsbsy}
\usepackage{mathrsfs}
\usepackage{upgreek}
\setlength{\oddsidemargin}{-69pt}
\begin{document}$${P\ge 10}^{-15}$$\end{document} , minor allele count (MAC) \documentclass[12pt]{minimal}
\usepackage{amsmath}
\usepackage{wasysym}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{amsbsy}
\usepackage{mathrsfs}
\usepackage{upgreek}
\setlength{\oddsidemargin}{-69pt}
\begin{document}$$\ge 100$$\end{document} , not in long-range linkage disequilibrium (LD) regions, and LD pruning (1,000 variant windows, 100 variant sliding windows and \documentclass[12pt]{minimal}
\usepackage{amsmath}
\usepackage{wasysym}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{amsbsy}
\usepackage{mathrsfs}
\usepackage{upgreek}
\setlength{\oddsidemargin}{-69pt}
\begin{document}$${r}^{2}<0.9$$\end{document} ). These thresholds follow the UKB analysis recommendations in the REGENIE documentation and the exome sequencing analysis by Backman et al. [ 10 ]. We adjusted for age at recruitment and the first four genetic principal components (PCs) to account for population structure. These PCs were calculated using PLINK v2.0 [ 15 ] from the quality-controlled array data. In step 2, approximate Firth logistic regression was applied to test rare exome variants (MAF < 1%), and we included additionally four PCs derived from rare variants as previous research showed that PCs from common variants inadequately adjust for fine-scale population structure [ 16 , 17 ]. The Genetic variants with P values \documentclass[12pt]{minimal}
\usepackage{amsmath}
\usepackage{wasysym}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{amsbsy}
\usepackage{mathrsfs}
\usepackage{upgreek}
\setlength{\oddsidemargin}{-69pt}
\begin{document}$${<5\times 10}^{-8}$$\end{document} were considered significant.
Gene-based association analysis was performed using the burden test in step 2 of REGENIE to aggregate the effects of rare variants within genes, thereby increasing statistical power for detecting associations driven by rare functional variation. We constructed a mask consisting of rare (MAF < 1%) loss-of-function (LoF) variants and deleterious missense (Dmis) variants, defined following Xie et al. [ 18 ], based on concordant predictions from nine in silico tools, including SIFT [ 19 ], PolyPhen-2 (HVAR and HDIV), M-CAP [ 20 , 21 ]˒ MetaLR, MetaSVM [ 22 ], LRT [ 23 ], PROVEAN, and MutationTaster [ 24 , 25 ]. The covariates were the same as those used in step 2 of the exome-wide single-variant tests. The significance level was set as 0.05 divided by the number of genes (16,773) examined.
To evaluate the robustness and generalizability of the gene-based findings across different populations, we downloaded two independent publicly available GWAS summary statistics from FinnGen and BBJ of endometriosis, with cases defined by ICD codes in FinnGen and by clinical diagnosis in the BBJ. As individual-level genotype data were not accessible, we performed gene-based association testing using GATES [ 26 ], which combines SNP-level association evidence across genes while accounting for LD. The 1000 Genomes Project phase 3 FIN and JPT population datasets were used as LD reference panels to ensure appropriate adjustment for population-specific genetic variation. Other parameters adhered to the default settings of GATES.
Pathway analyses were performed using the clusterProfiler R package [ 27 ] to identify enriched biological processes of relevant genes. Gene Ontology (GO) [ 28 ] terms were analyzed. For these analyses, we selected the top genes ( \documentclass[12pt]{minimal}
\usepackage{amsmath}
\usepackage{wasysym}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{amsbsy}
\usepackage{mathrsfs}
\usepackage{upgreek}
\setlength{\oddsidemargin}{-69pt}
\begin{document}$${P<5\times 10}^{-5}$$\end{document} ) based on gene-based associations, as well as genes retrieved from BioGRID [ 29 ] that interact with the identified candidate genes.
To explore the broader impact of the candidate risk genes on health outcomes, we conducted a phenome-wide association study (PheWAS). The ICD-10 data from the UK Biobank were downloaded using data field ID 41720 and preprocessed as a phenotype table using the PheWAS package [ 30 ]. ICD codes were mapped to phecodes using mapping tables [ 31 ], available at https://wei-lab.app.vumc.org/phecode . In total, 1532 phecodes were mapped. Carrier status for each identified candidate gene was determined as True if the participant had at least one LoF or Dmis variant in the exonic region of the gene, and as False otherwise. Age, sex, and the top four PCs were adjusted in the PheWAS analysis to account for demographic and population structure.
To identify EMs relevant cell types and evaluate expression patterns of candidate genes, we performed cell type specific [ 32 ] and differential expression analyses using publicly available single-cell RNA sequencing (scRNA-seq) datasets from Fonseca et al. [ 33 ] and Tan Y et al. [ 34 ]. For the cell type-specific analysis, we examined nine annotated cell populations (endothelial, epithelial, smooth muscle, mesenchymal, myeloid, mast, B/plasma, erythrocytes and T/NK cells) across five lesions in Fonseca et al. dataset, and five major populations (stromal, endothelial, lymphocytes, epithelial and myeloid) from four lesions in Tan Y et al. dataset, respectively. To further assess the consistency between the two datasets, the nine cell types from Fonseca et al. were grouped into six broader categories (stromal, endothelial, lymphocytes, epithelial, myeloid and erythrocytes), corresponding to those reported by Tan et al. Differential expression analyses were subsequently performed using the Wilcoxon test. Details of the single-cell transcriptome acquisition and processing are reported in the original studies. A workflow for the data analyses is presented in Fig. 1 . Fig. 1 A workflow for the data analyses in this study
A workflow for the data analyses in this study
Results
The basic characteristics of participants in this study are shown in Table 1 . Table 1 Participant characteristics Sample size Age, y; mean (SD) Sex Ancestry (Cases/controls) Cases Controls Discovery cohort UK Biobank 3,118/83,982 53.5(8.0) 56.7(8.0) Females European Replication cohort FinnGen 10,029/81,593 – – Females Finnish Biobank Japan 734/102,372 42.5 (8.1) 61.4 (14.3) Females Japanese
Participant characteristics
We performed single variants analyses for the 1,423,086 rare variants with MAC above 8. Genomic inflation was 1.05. There was no statistically significant association between tested variants and EMs at a genome-wide significance level ( \documentclass[12pt]{minimal}
\usepackage{amsmath}
\usepackage{wasysym}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{amsbsy}
\usepackage{mathrsfs}
\usepackage{upgreek}
\setlength{\oddsidemargin}{-69pt}
\begin{document}$${p<5\times 10}^{-8}$$\end{document} ). Only one variant in LMO1 was marginally significant ( \documentclass[12pt]{minimal}
\usepackage{amsmath}
\usepackage{wasysym}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{amsbsy}
\usepackage{mathrsfs}
\usepackage{upgreek}
\setlength{\oddsidemargin}{-69pt}
\begin{document}$${p=4.84\times 10}^{-7}$$\end{document} , Fig. S1, Table S1).
We aggregated 16,778,137 LoF and Dmis variants into gene-based tests using REGENIE, corresponding to 16,773 genes. One gene ( SOGA1 ) was significantly associated with EMs after Bonferroni correction ( \documentclass[12pt]{minimal}
\usepackage{amsmath}
\usepackage{wasysym}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{amsbsy}
\usepackage{mathrsfs}
\usepackage{upgreek}
\setlength{\oddsidemargin}{-69pt}
\begin{document}$${p=1.32\times 10}^{-6}$$\end{document} , \documentclass[12pt]{minimal}
\usepackage{amsmath}
\usepackage{wasysym}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{amsbsy}
\usepackage{mathrsfs}
\usepackage{upgreek}
\setlength{\oddsidemargin}{-69pt}
\begin{document}$$<0.05/\mathrm{16,773}$$\end{document} , Table S2). There were 15 LoF rare variants in total distributed in SOGA1 in this test (Table S3). Two additional genes, MYBPC2 ( \documentclass[12pt]{minimal}
\usepackage{amsmath}
\usepackage{wasysym}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{amsbsy}
\usepackage{mathrsfs}
\usepackage{upgreek}
\setlength{\oddsidemargin}{-69pt}
\begin{document}$${p=1.53\times 10}^{-5}$$\end{document} ) and RAB19 ( \documentclass[12pt]{minimal}
\usepackage{amsmath}
\usepackage{wasysym}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{amsbsy}
\usepackage{mathrsfs}
\usepackage{upgreek}
\setlength{\oddsidemargin}{-69pt}
\begin{document}$${p=4.51\times 10}^{-5}$$\end{document} ) reached suggestive significance (Table S2). These results were presented using a quantile–quantile (Q-Q) plot and a Manhattan plot (Fig. 2 A and B). Fig. 2 QQ plot and Manhattan plot for exome-wide gene-based analysis. A Q-Q plot shows observed vs. expected p-values in exome-wide association analysis. B Manhattan plots show the genome along the x-axis and the gene-based association significance levels on the y-axis. Red line represents the Bonferroni corrected threshold with \documentclass[12pt]{minimal}
\usepackage{amsmath}
\usepackage{wasysym}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{amsbsy}
\usepackage{mathrsfs}
\usepackage{upgreek}
\setlength{\oddsidemargin}{-69pt}
\begin{document}$${P=2.98\times 10}^{-6}$$\end{document} , blue line represents the suggestive significance level at \documentclass[12pt]{minimal}
\usepackage{amsmath}
\usepackage{wasysym}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{amsbsy}
\usepackage{mathrsfs}
\usepackage{upgreek}
\setlength{\oddsidemargin}{-69pt}
\begin{document}$${P=5\times 10}^{-5}$$\end{document}
QQ plot and Manhattan plot for exome-wide gene-based analysis. A Q-Q plot shows observed vs. expected p-values in exome-wide association analysis. B Manhattan plots show the genome along the x-axis and the gene-based association significance levels on the y-axis. Red line represents the Bonferroni corrected threshold with \documentclass[12pt]{minimal}
\usepackage{amsmath}
\usepackage{wasysym}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{amsbsy}
\usepackage{mathrsfs}
\usepackage{upgreek}
\setlength{\oddsidemargin}{-69pt}
\begin{document}$${P=2.98\times 10}^{-6}$$\end{document} , blue line represents the suggestive significance level at \documentclass[12pt]{minimal}
\usepackage{amsmath}
\usepackage{wasysym}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{amsbsy}
\usepackage{mathrsfs}
\usepackage{upgreek}
\setlength{\oddsidemargin}{-69pt}
\begin{document}$${P=5\times 10}^{-5}$$\end{document}
We replicated the association between SOGA1 and EMs risk using gene-based testing in FinnGen and BBJ, respectively. The SOGA1 was associated with EMs ( \documentclass[12pt]{minimal}
\usepackage{amsmath}
\usepackage{wasysym}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{amsbsy}
\usepackage{mathrsfs}
\usepackage{upgreek}
\setlength{\oddsidemargin}{-69pt}
\begin{document}$$p=0.015$$\end{document} and \documentclass[12pt]{minimal}
\usepackage{amsmath}
\usepackage{wasysym}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{amsbsy}
\usepackage{mathrsfs}
\usepackage{upgreek}
\setlength{\oddsidemargin}{-69pt}
\begin{document}$$p=0.011$$\end{document} , respectively, in Finnish and Japanese populations) at a nominal significance level of 0.05 (Table S4).
We searched BioGRID 4.4 for interactions involving SOGA1 . A total of 130 interactors with 165 interactions were constructed (Fig. S2). All interactions were curated from published literature-based evidence. After mapping these interactors to the FinnGen and BBJ GWAS summary statistics, 116 genes were included in the gene-based analyses. Using GATES, 11 of the 116 genes showed significant associations with EMs at the threshold of \documentclass[12pt]{minimal}
\usepackage{amsmath}
\usepackage{wasysym}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{amsbsy}
\usepackage{mathrsfs}
\usepackage{upgreek}
\setlength{\oddsidemargin}{-69pt}
\begin{document}$$p<0.05$$\end{document} . This corresponds to a two-fold enrichment (11/5.8) with a statistical significance of 0.031. Several of these genes, including GSK3B [ 35 ], PPP2CA [ 36 ], ESR2 [ 37 ], SAMHD1 [ 38 ], CYLD [ 39 ], and AURKA [ 40 ], have also been reported to be associated with endometriosis in previous studies. Of the 11 genes, six were significant in the FinnGen analysis, and five were significant in the BBJ analysis (Table S4), showing very high enrichments (6/0.55 with p-value of 6 × 10 –7 for FinnGen and 5/0.55 with p-value of 1.1 × 10 –4 for BBJ).
Three genes in the gene-based analyses that were significantly or suggestively associated with EMs were examined in GO terms. In GO term enrichment analysis, the identified significant terms included gluconeogenesis, relevant biological processes, muscle filament sliding, and autophagosome assembly ( \documentclass[12pt]{minimal}
\usepackage{amsmath}
\usepackage{wasysym}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{amsbsy}
\usepackage{mathrsfs}
\usepackage{upgreek}
\setlength{\oddsidemargin}{-69pt}
\begin{document}$${p}_{adjust}={3.29\times 10}^{-3}\sim 4.4\times {10}^{-2}$$\end{document} . We also looked up SOGA1 in previously reported literature. As described [ 41 , 42 ], SOGA1 was identified as a microtubule-associated factor, related to chromosome segregation. It is also involved in autophagy and glucose.
We further examined the GO terms of SOGA1 related gene sets identified in the interaction networks. Ten genes related to significant GO terms and 21 significant biological processes/cellular components/molecular functions that might be involved in the migration and cell proliferation of EMs. Most of the enriched GO terms were related to Regulation of Protein Binding ( \documentclass[12pt]{minimal}
\usepackage{amsmath}
\usepackage{wasysym}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{amsbsy}
\usepackage{mathrsfs}
\usepackage{upgreek}
\setlength{\oddsidemargin}{-69pt}
\begin{document}$${p}_{adjust}={6.21\times 10}^{-3}$$\end{document} ), Wnt Signalosome/Beta-Catenin Destruction ( \documentclass[12pt]{minimal}
\usepackage{amsmath}
\usepackage{wasysym}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{amsbsy}
\usepackage{mathrsfs}
\usepackage{upgreek}
\setlength{\oddsidemargin}{-69pt}
\begin{document}$${p}_{adjust}={4.42\times 10}^{-2}$$\end{document} ), Cortical Actin Cytoskeleton ( \documentclass[12pt]{minimal}
\usepackage{amsmath}
\usepackage{wasysym}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{amsbsy}
\usepackage{mathrsfs}
\usepackage{upgreek}
\setlength{\oddsidemargin}{-69pt}
\begin{document}$${p}_{adjust}={3.03\times 10}^{-2}$$\end{document} ), Spindle/Spindle Pole/Mitotic Spindle Midzone/Spindle Pole Centrosome ( \documentclass[12pt]{minimal}
\usepackage{amsmath}
\usepackage{wasysym}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{amsbsy}
\usepackage{mathrsfs}
\usepackage{upgreek}
\setlength{\oddsidemargin}{-69pt}
\begin{document}$${p}_{adjust}={4.42\times 10}^{-2}$$\end{document} ).
We tested SOGA1 identified in gene-based analysis for association with 1,532 traits. SOGA1 was statistically significantly associated with Endometriosis ( \documentclass[12pt]{minimal}
\usepackage{amsmath}
\usepackage{wasysym}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{amsbsy}
\usepackage{mathrsfs}
\usepackage{upgreek}
\setlength{\oddsidemargin}{-69pt}
\begin{document}$$p={2.47\times 10}^{-6}$$\end{document} , < 0.05/1532), Dermatophytosis of the body ( \documentclass[12pt]{minimal}
\usepackage{amsmath}
\usepackage{wasysym}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{amsbsy}
\usepackage{mathrsfs}
\usepackage{upgreek}
\setlength{\oddsidemargin}{-69pt}
\begin{document}$$p={3.31\times 10}^{-6}$$\end{document} ) and Noninflammatory disorders of cervix ( \documentclass[12pt]{minimal}
\usepackage{amsmath}
\usepackage{wasysym}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{amsbsy}
\usepackage{mathrsfs}
\usepackage{upgreek}
\setlength{\oddsidemargin}{-69pt}
\begin{document}$$p={2.50\times 10}^{-5}$$\end{document} ) (Fig. 3 ). Fig. 3 Manhattan Plot of PheWAS analysis. This plot showed the relationship between SOGA1 and tested phenotypes. The x-axis represents the phenotypes tested, color-coded by ICD-10 categories, while the y-axis shows the − log10( p -values) of the association significance. The red line represents the significance threshold (0.05/1,532), and the blue line represents the suggestive significance threshold (0.05)
Manhattan Plot of PheWAS analysis. This plot showed the relationship between SOGA1 and tested phenotypes. The x-axis represents the phenotypes tested, color-coded by ICD-10 categories, while the y-axis shows the − log10( p -values) of the association significance. The red line represents the significance threshold (0.05/1,532), and the blue line represents the suggestive significance threshold (0.05)
We identified EMs relevant cell types by analyzing scRNA-seq expression datasets together with summary statistics of EMs. For scRNA-seq from Fonseca et al., EMs has been categorized into five major classes of tissue types, and the corresponding cell types in these datasets have been classified into nine categories. Of the five endometrial types, endometrioma exhibited epithelial and mesenchymal cell-type enrichments ( \documentclass[12pt]{minimal}
\usepackage{amsmath}
\usepackage{wasysym}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{amsbsy}
\usepackage{mathrsfs}
\usepackage{upgreek}
\setlength{\oddsidemargin}{-69pt}
\begin{document}$$p=0.01$$\end{document} , \documentclass[12pt]{minimal}
\usepackage{amsmath}
\usepackage{wasysym}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{amsbsy}
\usepackage{mathrsfs}
\usepackage{upgreek}
\setlength{\oddsidemargin}{-69pt}
\begin{document}$$p=0.02$$\end{document} ), no endometriosis detected exhibited epithelial-specific enrichments ( \documentclass[12pt]{minimal}
\usepackage{amsmath}
\usepackage{wasysym}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{amsbsy}
\usepackage{mathrsfs}
\usepackage{upgreek}
\setlength{\oddsidemargin}{-69pt}
\begin{document}$$p=0.03$$\end{document} ), and the unaffected ovary exhibited strong enrichments in endothelial cells ( \documentclass[12pt]{minimal}
\usepackage{amsmath}
\usepackage{wasysym}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{amsbsy}
\usepackage{mathrsfs}
\usepackage{upgreek}
\setlength{\oddsidemargin}{-69pt}
\begin{document}$$p=0.03$$\end{document} ). When the five endometrial types were considered together, no statistically significant enrichments were observed for any specific cell type ( \documentclass[12pt]{minimal}
\usepackage{amsmath}
\usepackage{wasysym}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{amsbsy}
\usepackage{mathrsfs}
\usepackage{upgreek}
\setlength{\oddsidemargin}{-69pt}
\begin{document}$$p=0.06\sim 0.99$$\end{document} , Fig. 4 A). The nine identified cell types were then grouped into six broader categories, including stromal, endothelial, lymphocytes, epithelial, myeloid and erythrocytes. We found significant enrichments in stromal, epithelial and endothelial relative to immune and erythrocytes for most endometrial types ( \documentclass[12pt]{minimal}
\usepackage{amsmath}
\usepackage{wasysym}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{amsbsy}
\usepackage{mathrsfs}
\usepackage{upgreek}
\setlength{\oddsidemargin}{-69pt}
\begin{document}$$p=0.01\sim 0.03$$\end{document} , Fig. 4 A). Additionally, there were three marginally significant enrichments in stromal cells ( \documentclass[12pt]{minimal}
\usepackage{amsmath}
\usepackage{wasysym}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{amsbsy}
\usepackage{mathrsfs}
\usepackage{upgreek}
\setlength{\oddsidemargin}{-69pt}
\begin{document}$$p=0.05$$\end{document} ). For the scRNA-seq dataset from Tan Y et al., stromal enrichment was also observed (Fig. 4 B). Fig. 4 Heritability enrichment of endometriosis and differential expression of SOGA1 across cell types. A Heritability enrichment of endometriosis in nine cell types and six broader cell groups evaluated by LDSC. The color of dots represents different cell groups. B Heritability enrichment of endometriosis in five cell groups based on data from Tan Y et al. C – E Dot plots showing SOGA1 expression levels and the proportion of expressing cells across various cell types. Dot size indicates the proportion of SOGA1 -expressing cells, and color intensity reflects expression levels. The left side displays lesion-specific groups, and the right side shows the combined EMs group
Heritability enrichment of endometriosis and differential expression of SOGA1 across cell types. A Heritability enrichment of endometriosis in nine cell types and six broader cell groups evaluated by LDSC. The color of dots represents different cell groups. B Heritability enrichment of endometriosis in five cell groups based on data from Tan Y et al. C – E Dot plots showing SOGA1 expression levels and the proportion of expressing cells across various cell types. Dot size indicates the proportion of SOGA1 -expressing cells, and color intensity reflects expression levels. The left side displays lesion-specific groups, and the right side shows the combined EMs group
We further evaluated SOGA1 gene expression using these datasets. For nine cell types, SOGA1 was mostly highly expressed in endothelial cells and showed relatively medium expressions in smooth muscle, mesenchymal and epithelial cells in the context of five endometrial-types and combined EMs group (Fig. 4 C). While their log 2 FC values differed by less than 0.29. For six cell types in Fonseca et al. as well as five cell types in Tan Y et al., there was a higher expression of SOGA1 in stromal, endothelial and epithelial cells (Fig. 4 D and E).
Background
Endometriosis (EMs) is a chronic, inflammatory condition characterized by the presence of endometrial-like tissue outside the uterus, affecting approximately 10% of reproductive-age women globally. This disease often results in severe pelvic pain, infertility, and reduced quality of life, with diagnostic delays of 5–12 years further exacerbating patient burden [ 1 ]. The etiology of EMs remains unclear. While several hypotheses have been proposed, none fully explains the disease's development across all observed sites.
Endometriosis is estimated to have a heritability of 47–51% based on twin studies in Swedish cohorts [ 2 ]. Danish population studies show daughters of women with endometriosis are more than twice as likely to develop the condition [ 3 ], indicating a strong genetic component in its pathogenesis. Genome-wide association studies (GWAS) have identified 42 significant risk loci (49 independent signals) through meta-analyses of 60,674 European and East Asian cases and 701,926 controls, with the strongest effects in the American Society of Reproductive Medicine (ASRM) stage III/IV disease, yet these explain only 5.01% of disease variance [ 4 ]. Koller et al. [ 5 ] identified one locus in the UK Biobank data from female participants of European descent. Kloeve-Mogensen et al. [ 6 ] calculated a polygenic risk score (PRS) using 14 single-nucleotide polymorphisms (SNPs) from a GWAS meta-analysis of 17,000 EM European cases, achieving modest prediction (AUC = 0.64). The focus of the genetic architecture of endometriosis has mainly been on common variants. While the common variants collectively account for ~ 26% of heritability [ 7 ], a substantial proportion of genetic risk remains unexplained.
The substantial missing heritability in complex diseases has motivated increasing interest in rare, functional variants [ 8 ]. Some reports indicate that rare variants (especially large-effect functional loss or destructive missense mutations) can directly identify key pathogenic pathways and core genes that are often obscured in common variant studies due to extreme polygenicity. Whereas others suggest their overall contribution may be modest [ 9 , 10 ]. Despite these differences, rare loss-of-function and damaging missense variants remain powerful tools for identifying core pathogenic mechanisms, and convergence between rare and common variant signals has been observed across multiple complex traits. These considerations highlight the value of investigating rare coding variation in endometriosis.
In this study, we investigated the contribution of rare variants to EMs and uncovered potential pathogenic mechanisms. Utilizing exome sequence data from the UK Biobank (UKB), we identified genetic associations with EMs through single-variant and gene-based analyses. Our findings were replicated in individuals of Finnish and Japanese ancestry. Additionally, we explored the functional characterization of the identified gene and examined its expression using single-cell RNA expression datasets of EMs.
Discussion
Our WES study identified a gene-wide significant association with SOGA1 in a gene-based test including 15 LoF rare variants ( \documentclass[12pt]{minimal}
\usepackage{amsmath}
\usepackage{wasysym}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{amsbsy}
\usepackage{mathrsfs}
\usepackage{upgreek}
\setlength{\oddsidemargin}{-69pt}
\begin{document}$${p=1.32\times 10}^{-6}$$\end{document} ). Consistent results were also obtained using SKAT and SKAT-O methods implemented in REGENIE, which allow for heterogeneous effect directions (data not shown). The association was supported at nominal significance in the Finnish and East Asian cohorts ( \documentclass[12pt]{minimal}
\usepackage{amsmath}
\usepackage{wasysym}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{amsbsy}
\usepackage{mathrsfs}
\usepackage{upgreek}
\setlength{\oddsidemargin}{-69pt}
\begin{document}$$p=0.015$$\end{document} and \documentclass[12pt]{minimal}
\usepackage{amsmath}
\usepackage{wasysym}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{amsbsy}
\usepackage{mathrsfs}
\usepackage{upgreek}
\setlength{\oddsidemargin}{-69pt}
\begin{document}$$p=0.011$$\end{document} ). As expected, we did not observe associations with single variants in this study due to the sample size limitation.
We applied GATES to FinnGen and BBJ GWAS summary statistics to test for gene-based associations between SOGA1 interacting genes and endometriosis. Eleven of the 116 evaluable genes showed significant associations, representing a two-fold enrichment. Several of these genes, including GSK3B , PPP2CA , ESR2 , SAMHD1 , CYLD , and AURKA , have also been linked to endometriosis in previous studies, providing independent support for their relevance. Functional enrichment analyses of the SOGA1 -interacting gene set suggested potential involvement in pathways related to glucose metabolism and autophagy, although these signals remain exploratory and should not be interpreted as direct mechanistic evidence. Previous experimental studies have shown that SOGA1 encodes a microtubule-associated protein that integrates centrosomal and Golgi microtubules on the Golgi membrane, aiding in directional cell migration [ 41 , 43 ]. SOGA1 is essential for proper chromosome segregation during mitosis and has been demonstrated to regulate autophagy and glucose homeostasis by suppressing glucose production in an adiponectin- and insulin-dependent manner [ 44 , 45 ]. These biological functions overlap with several well-recognized pathophysiological processes in endometriosis, including autophagy, fibrosis, angiogenesis, migration, and glucose metabolism [ 46 , 47 ]. A recent GWAS meta-analysis identified 42 significant risk loci associated with EMs [ 4 ]. Although our identified gene does not overlap with those of large GWAS studies focused on common variants, it can be partly attributed to our focus on rare variants (MAF < 0.01), which may not have been well covered by GWAS arrays or imputation. These studies also emphasized the diverse biological pathways implicated in endometriosis.
Endometriosis is characterized by abnormal migration and invasion of endometrial-like tissue outside the uterus, which involves complex cell migration mechanisms. To explore potential cellular contexts in which SOGA1 may act, we examined its expression across cell types using publicly available scRNA-seq datasets. SOGA1 was observed highly expressed in endothelial and stromal cells in EMs lesions, and LDSC-SEG analyses identified these cell types were relevant to EMs. Endothelial and stromal cells are recognized contributors to the migratory, invasive, and tissue-remodeling behavior observed in EMs [ 48 – 51 ]. Our data do not provide direct evidence that SOGA1 mediates these processes. Nevertheless, the enrichment of SOGA1 expression in these cells raises the possibility that it may contribute to behaviors involved in lesion development, such as migration, invasion, or tissue remodeling. Determining whether SOGA1 has a mechanistic role in these pathways will require further functional investigation.
Our findings suggest that rare variants in SOGA1 may contribute to EMs pathogenesis via altered glucose metabolism, autophagy, and cell migration. Although PheWAS results showed pleiotropic effects of SOGA1 on noninflammatory disorders of cervix, the consistently low SOGA1 expression in immune-related cells across most lesions of EMs, suggest that SOGA1 may not exert its effects on endometriosis through direct immune pathways. Clinically, identifying SOGA1 -related pathways could guide therapeutic strategies targeting metabolic and migratory processes in EMs. However, additional biological validation is required before translation to practice.
The main strength of this study lies in its large WES dataset (87,100 participants), enabling rare variant analysis at an unprecedented scale in EMs. Replication in independent cohorts of different ancestries adds robustness. Integration with scRNA-seq data further strengthens the biological plausibility of SOGA1 .
Nevertheless, several limitations of this study should be acknowledged. First, while our findings strongly implicate SOGA1 , definitive functional validation remains essential to establish causal links to disease processes. Our team is actively collecting intraoperative tissues (ectopic/eutopic endometrium), follicular fluid, granulosa cells, and peripheral blood, to enable future functional and molecular investigations.
Second, endometriosis is a clinically heterogeneous disorder comprising multiple subtypes with both shared and distinct biological mechanisms. In population-based resources such as the UKB, phenotype misclassification represents a major challenge. Asymptomatic or minimally symptomatic endometriosis is not uncommon, leading to contamination of controls [ 52 ]. In addition, phenotype definitions in population-based cohorts often rely on clinical assessment or self-report rather than surgical and histological confirmation. Even when laparoscopy is performed, visual inspection alone has limited sensitivity and specificity, with well-documented false-positive and false-negative rates [ 53 , 54 ], further complicating reliable phenotype assignment. Consistent with these limitations, the prevalence of endometriosis observed in this study (3.6%) is substantially lower than epidemiological estimates of approximately 10%, suggesting considerable underdiagnosis, particularly among controls. Such misclassification is expected to bias effect estimates toward the null and reduce statistical power. These challenges are especially consequential for rare-variant association analyses. Individual rare variants within SOGA1 were observed at extremely low frequencies, typically one or two heterozygous carriers across the entire cohort. Under this allelic architecture, even modest phenotype misclassification or sample restriction can lead to a disproportionate loss of statistical power. Narrowing the phenotype may increase heritability and strengthen genetic signals, but formal subtype-specific analyses were not feasible in the present study due to limited phenotypic annotation and sample size. Instead, we performed a series of restriction-based sensitivity analyses, including exclusion of overlapping diagnoses (N70, N73, K58, and N30.1), restriction of cases to those diagnosed before 35 years of age, and more conservative control definitions excluding individuals with any ICD-10 diagnosis records. While these restrictions improved phenotypic specificity, they also resulted in pronounced reductions in effective sample size, particularly among cases. Under such conditions, statistical power for rare-variant burden tests collapses nonlinearly, and association results become highly sensitive to stochastic variation in carrier distribution. Consequently, increasingly stringent phenotype definitions did not lead to systematic strengthening of association signals but instead produced instability across single-variant and gene-based results. This pattern reflects a fundamental trade-off between phenotype refinement and statistical power in the context of allelic rarity.
Finally, although replication analyses were performed in independent cohorts, the results should be interpreted with caution. Due to the lack of independent rare-variant summary statistics for rare-variant association replication, gene-based testing relied on common-variant summary statistics aggregated using GATES. As such, this analysis does not represent a direct replication of rare-variant effects but rather provides supportive gene-level evidence driven by common variants. Future replication efforts using independent sequencing-based cohorts will be critical for confirming rare-variant associations.
Conclusions
In our study, we used WES data from 87,100 participants in the UK Biobank, identifying a significant association between SOGA1 and the risk of EMs in European ancestry populations. The association was also repeated in both Finnish and Japanese cohorts. Our findings will provide insight into disease mechanisms and targets for biological experiments to gain further understanding about the role of SOGA1 in EMs pathogenesis.
Supplementary Material
Supplementary material 1.
Supplementary material 2.
Supplementary material 1.
Supplementary material 2.
Text is read by the "Ask this paper" AI Q&A widget below.
Extraction quality varies by source — PMC NXML preserves structure
cleanly, OA-HTML may include some navigation residue, and OA-PDF can
have broken hyphenation. The publisher copy
(via DOI)
is the canonical version.