Results
As shown in Figure 1, we adapted one of the most sensitive assays currently available, the TruSeq Methyl Capture EPIC Library Prep Kit (TMC-EPIC kit), for EOC methylation marker discovery. With a coverage of over 3.3 million CpG sites, the TMC-EPIC kit far exceeds the commonly used 450K and 850K assays, which survey approximately 450,000 and 850,000 CpG sites, respectively. The TMC-EPIC kit can therefore reveal the methylation status of many sites that were not previously studied. A limitation of this kit is that it requires at least 500 ng of input DNA for library construction, which is very difficult to obtain from a single plasma sample. To overcome this challenge, we constructed libraries from pooled cfDNA samples: 5 early-stage and 6 advanced-stage pooled samples were obtained from a total of >220 invasive EOC patients, and 10 pooled samples were derived from healthy females (>300 healthy subjects in total). These pooled samples are referred to as the pool cohort in the following text, and the average sequencing depth of each site was more than 20X in this cohort (Table S1; Figure S1A).
Figure 1. Flowchart of this study
cfDNA methylation markers were first screened in the pool cohort with the TMC-EPIC kit across over 3.3 million CpG sites, and 500 sites were selected based on their methylation differences and p values for subsequent examination in the individual cohort. As a result, 493 methylation markers with acceptable sequencing quality were retained for constructing the EOC diagnostic and prognostic prediction models. The diagnostic models were constructed by two strategies for comparison: the first was a conventional LASSO-logistic regression approach, and the second was based on a pretrained methylation transformer called MethylBERT. Meanwhile, the best marker, OV1, was further assayed on the ddPCR platform in the ddPCR cohort. Lastly, OV1 was examined on the ddPCR platform in the prospective cohort, and the result was confirmed by imaging.
Because CpG sites with significant methylation differences between EOC and healthy females were identified in the pool cohort (Figures S2 and S3A), it was critical to verify them in individual samples to ensure that the signal in pooled samples reflected that of individuals. Since the plasma available from a single individual is limited, we used a targeted capture approach for methylation site evaluation in the individual validation. With this approach, a sample with >10 ng cfDNA yields qualified methylation sequencing data; however, fewer CpG sites can be examined. For cost considerations, we selected 500 CpG sites as candidate methylation markers and designed and synthesized the corresponding probes for targeted capture. Site selection was based mainly on the methylation difference and p values between EOC and healthy pools; the detailed criteria are provided in the STAR Methods. These probes were applied to cfDNA samples from 1,909 individuals, all of which were completely distinct from the samples constituting the pool cohort. The probes successfully captured 493 of the 500 candidate markers in these individual samples; after merging sequencing reads that shared the same unique molecular identifier (UMI), 1,872 samples (754 EOC patients and 1,118 healthy females) yielded on average more than 10 reads (with different UMIs) per CpG position (Figure S1B) and were retained for further analysis (Data S1). These individual samples are referred to as the individual cohort (Table S1) in the following text. The results showed good consistency in methylation change between the individual and pool cohorts (Figure S3B).
We employed bidirectional encoder representations from transformers (BERT), a transformer-based language model that can learn broad clinical and biological knowledge and feature representations. We applied the BERT paradigm to all available cancer DNA methylation datasets to exploit the knowledge and interactions among chromosome, position, methylation level, and gene function in over 110,000 cancer samples from the GEO database (https://www.ncbi.nlm.nih.gov/gds) and The Cancer Genome Atlas (TCGA) database (https://portal.gdc.cancer.gov). We thereby built a model, named MethylBERT, that learns representations of individual CpG sites and relationships among multiple CpG sites (Figure 2A). MethylBERT was then applied to a training dataset consisting of a randomly selected 2/3 of the individual cohort; binary classification using a fully connected layer and a sigmoid activation function assigned an EOC probability to each sample. A probability cutoff of 0.837 was determined to distinguish EOC from normal samples, yielding the MethylBERT-EOC diagnostic model (STAR Methods).
Figure 2. cfDNA methylation analysis of the MethylBERT-EOC diagnosis model in the individual cohort
(A) Overview of the MethylBERT model. Over 110,000 whole-genome bisulfite sequencing (WGBS) and reduced representation bisulfite sequencing (RRBS) datasets were collected from GEO and TCGA (upper left); their chromosome embedding, position embedding, methylation level embedding, and gene embedding were combined into a CpG site embedding scheme (upper right), which was fed into a matrix decomposition-based transformer model with a certain percentage of CpG sites randomly masked; the model was pretrained to predict the methylation level of the masked CpG sites given the context of their surrounding CpG sites (lower left). Lastly, the fine-tuned pretrained model (MethylBERT) was employed to process the methylation data of input samples (lower right).
(B and C) Confusion tables of binary results of the MethylBERT-EOC diagnostic model in the training (B) and validation (C) datasets.
(D) Receiver operating characteristic (ROC) curves of the MethylBERT-EOC diagnostic model in EOC diagnostic prediction of the training and validation datasets.
(E and F) ROC curves of the MethylBERT-EOC diagnostic model in early and advanced EOC diagnostic prediction of the training (E) and validation (F) datasets.
(G) MethylBERT-EOC diagnosis-based EOC prediction score of healthy female samples and EOC samples of different stages. ∗∗∗ p < 0.001.
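The final classification step described for the MethylBERT-EOC model (a fully connected layer, a sigmoid activation, and a 0.837 probability cutoff) can be sketched roughly as follows. This is a minimal illustration only: the function names and the example logits are hypothetical, the choice of ">=" at the cutoff is an assumption, and the actual architecture and weights are those described in the STAR Methods.

```python
import math

EOC_CUTOFF = 0.837  # probability cutoff reported for the MethylBERT-EOC model


def sigmoid(logit: float) -> float:
    """Map a real-valued logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-logit))


def call_eoc(logit: float, cutoff: float = EOC_CUTOFF) -> bool:
    """Return True when the EOC probability reaches the cutoff.

    Using ">=" rather than ">" is an assumption of this sketch.
    """
    return sigmoid(logit) >= cutoff


# Hypothetical logits from the fully connected layer, for illustration only.
example_logits = [-1.0, 1.0, 3.0]
calls = [call_eoc(z) for z in example_logits]
```

Note that a logit of 1.0 corresponds to a probability of about 0.73, which falls below the 0.837 cutoff and is therefore called negative.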
The MethylBERT-EOC diagnostic model achieved 93.24% sensitivity at 95.3% specificity (area under the curve, AUC = 0.98) in the training dataset, which consisted of 503 EOC and 744 healthy female samples, and 89.24% sensitivity at 94.39% specificity (AUC = 0.97) in the validation dataset of 251 EOC and 374 healthy female samples (Figures 2B–2D). The positive predictive value (PPV), negative predictive value, and false-positive rate (FPR) were 93.06%, 95.3%, and 4.71%, respectively, in the training dataset and 91.43%, 94.39%, and 5.53%, respectively, in the validation dataset. Furthermore, the training dataset contained 132 early EOC samples, of which the MethylBERT-EOC diagnostic model correctly diagnosed 111 (84.09% sensitivity), while in the validation dataset of 73 early EOC samples, the model diagnosed 58 (79.45% sensitivity) (Figures 2E and 2F). With respect to different EOC subtypes, in the validation dataset the MethylBERT-EOC diagnostic model achieved 76.74% and 96.69% sensitivities in early and advanced serous carcinoma, respectively, and 72.73% and 90.91% sensitivities in early and advanced endometrioid carcinoma, respectively (Table S2). For other EOC subtypes, MethylBERT-EOC also showed good diagnostic sensitivities except for the mixed subtype (Data S1); however, since the sample size was too small for each of these subtypes, their sensitivities should be further validated in future studies.
Because EOC and healthy subjects in the individual cohort were not strictly age matched, we assessed whether the MethylBERT-EOC diagnostic model could be affected by age. We examined the EOC probability (the prediction output of the MethylBERT-EOC diagnostic model) in different age groups of healthy female subjects and EOC subjects, and no significant variation among age groups was observed in either subject group (Figures S4A–S4C).
In addition, we estimated the sensitivities of the MethylBERT-EOC diagnostic model at different specificities (85%, 90%, 95%, and 99%; cutoff = 0.6396193, 0.7194187, 0.8636678, and 0.9731606, respectively) and found that even when the specificity was set to >99%, the sensitivity remained over 70% in the validation dataset (Table 1).

Table 1. Corresponding sensitivities of the MethylBERT-EOC diagnostic model when the specificities were over 85%, 90%, 95%, and 99% in distinguishing EOC from healthy females

| Sensitivity | Specificity 85.04% (0.64) | Specificity 90.10% (0.72) | Specificity 95.19% (0.86) | Specificity 99.20% (0.97) |
|---|---|---|---|---|
| Early EOC | 90.41% | 87.67% | 79.45% | 63.01% |
| Advanced EOC | 96.07% | 94.38% | 93.26% | 74.16% |
| Overall EOC | 94.42% | 92.43% | 89.24% | 70.92% |

Cutoff values are indicated in parentheses after each specificity. Healthy female: 374 subjects; early EOC: 73 subjects; advanced EOC: 178 subjects.
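The idea of fixing a target specificity and reading off the corresponding sensitivity can be illustrated with the sketch below. It assumes that a sample is called EOC positive when its score is greater than or equal to the cutoff, and it picks the smallest observed healthy score that meets the target; the procedure actually used in the study is not specified here, and the scores in the example are hypothetical.

```python
def cutoff_for_specificity(healthy_scores, target_specificity):
    """Smallest candidate cutoff whose specificity meets the target.

    Specificity is the fraction of healthy scores strictly below the cutoff,
    because scores >= cutoff are called EOC positive (an assumption).
    """
    n = len(healthy_scores)
    candidates = sorted(set(healthy_scores)) + [float("inf")]
    for cutoff in candidates:
        specificity = sum(s < cutoff for s in healthy_scores) / n
        if specificity >= target_specificity:
            return cutoff
    return float("inf")


def sensitivity_at(cutoff, eoc_scores):
    """Fraction of EOC samples with a score at or above the cutoff."""
    return sum(s >= cutoff for s in eoc_scores) / len(eoc_scores)


# Hypothetical EOC probabilities, not data from the study.
healthy = [0.05, 0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90]
eoc = [0.95, 0.92, 0.85, 0.99, 0.40]

cutoff_90 = cutoff_for_specificity(healthy, 0.9)
sens_90 = sensitivity_at(cutoff_90, eoc)
```

Raising the target specificity can only keep the cutoff the same or move it up, which is why the sensitivities in Table 1 decrease from left to right.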
To test whether the MethylBERT-EOC diagnostic model could discriminate EOC from other gynecological diseases, we examined the 493 markers in 96 endometriosis cfDNA samples and applied their methylation data to the model. The EOC probability was significantly lower in these endometriosis samples than in the EOC samples of the validation dataset, and the model discriminated EOC samples from endometriosis samples with 89.24% sensitivity at 91.66% specificity (Figures S5A–S5C; Data S2).
Since CA125 and HE4 assays are still the most commonly used EOC screening tests in clinical practice despite their unsatisfactory sensitivities, we compared the diagnostic sensitivities of CA125 and HE4 with that of the MethylBERT-EOC diagnostic model using the 715 EOC samples of the individual cohort that had complete CA125 and HE4 information. CA125 and HE4 alone provided sensitivities of 48.81% and 46.85%, respectively, for these samples; when CA125 and HE4 were used in combination, the sensitivity rose to 60.6%. In contrast, our MethylBERT-EOC diagnostic model demonstrated 92.73% sensitivity in these samples. Importantly, of the 282 EOC samples that were missed by both the CA125 and HE4 assays, 255 (90.43%) were correctly diagnosed by our MethylBERT-EOC diagnostic model (Figure S6A).
Although CA125 differed significantly between EOC and healthy female subjects, we did not observe a difference between early- and advanced-stage EOC in the individual cohort (n = 1,058 with CA125 information) (Figure S6B); a similar finding was reported by a previous study. 21 However, with our MethylBERT-EOC diagnostic model, the same EOC samples showed a higher EOC probability in advanced EOC (Figure 2G), suggesting a potential correlation with tumor load. This will need to be evaluated in a larger study but does highlight the potential of our MethylBERT-EOC diagnostic model for monitoring disease progression, treatment efficacy, and disease recurrence.
Next, we evaluated whether combining our MethylBERT-EOC diagnostic model with CA125 would improve sensitivity. In the 1,058 samples of the individual cohort with CA125 information, combining our MethylBERT-EOC diagnostic model with CA125 increased the sensitivity by 3.21 percentage points (from 92.47% to 95.68%) but reduced the specificity by 3.81 percentage points (from 97.36% to 93.55%) compared with the MethylBERT-EOC diagnostic model alone. However, for the early-stage EOC samples (n = 183), the sensitivity rose markedly from 82.51% to 89.62% at 93.55% specificity in the combined model (Table 2); more importantly, this sensitivity remained nearly 85% when the specificity exceeded 95% (Table S3).

Table 2. Specificities and sensitivities of the MethylBERT-EOC diagnostic model and CA125 for distinguishing ovarian cancer from healthy female samples

| Group | Sample size | CA125 sensitivity (%) | CA125 specificity (%) | MethylBERT-EOC model sensitivity (%) | MethylBERT-EOC model specificity (%) | Combined sensitivity (%) | Combined specificity (%) |
|---|---|---|---|---|---|---|---|
| Healthy female | 341 | – | 96.19 | – | 97.36 | – | 93.55 |
| Early EOC | 183 | 44.26 | – | 82.51 | – | 89.62 | – |
| Advanced EOC | 534 | 50.26 | – | 95.88 | – | 97.75 | – |
| Total EOC | 717 | 48.95 | – | 92.47 | – | 95.68 | – |

Note: only samples with CA125 information are summarized in this table.
Using the same training dataset as for the MethylBERT-EOC diagnostic model, we built another diagnostic model with a conventional LASSO-logistic regression strategy. To reduce the number of markers, the 493 markers were analyzed by 500 LASSO runs, and the 21 markers that were selected in more than 450 of these runs were subsequently entered into a logistic regression for diagnostic modeling. In this way, a LASSO-EOC diagnostic model was obtained (Table S4).
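The frequency-based filtering step (keep markers that appear in more than 450 of 500 LASSO runs) can be sketched as below. This sketch assumes that each run has already produced the set of markers with non-zero coefficients; how the individual runs were generated is not detailed in this section. The function name, marker names, and toy run counts are hypothetical.

```python
from collections import Counter


def stable_markers(selected_per_run, min_runs):
    """Return markers selected in more than min_runs runs, sorted by name."""
    counts = Counter(m for run in selected_per_run for m in set(run))
    return sorted(m for m, c in counts.items() if c > min_runs)


# Hypothetical toy input: 5 runs instead of 500, threshold 4 instead of 450.
runs = [
    {"cg_a", "cg_b", "cg_c"},
    {"cg_a", "cg_b"},
    {"cg_a", "cg_c"},
    {"cg_a", "cg_b", "cg_d"},
    {"cg_a", "cg_b"},
]
kept_strict = stable_markers(runs, min_runs=4)
kept_loose = stable_markers(runs, min_runs=3)
```

In the study's setting, the call would correspond to 500 selection sets and `min_runs=450`, with the retained markers then passed to logistic regression.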
In the training dataset, this model exhibited 88.27% sensitivity at 93.82% specificity (AUC = 0.97) for EOC diagnosis. When it was applied to the same validation dataset as the MethylBERT-EOC diagnostic model, it gave 83.67% sensitivity at 89.04% specificity (AUC = 0.92) (Figures 3A–3C); both were lower than those of the MethylBERT-EOC diagnostic model (p < 0.01 in McNemar’s test). This LASSO-logistic model also showed a significant difference in the combined diagnosis score (cd-score) between early- and advanced-stage EOC samples; however, it identified only 49 of 73 early EOC samples in the validation dataset, a sensitivity of 67.12% (Figures 3E and 3F), which was about 12 percentage points lower than that of the MethylBERT-EOC diagnostic model.
Figure 3. cfDNA methylation analysis for the LASSO-EOC diagnosis model in the individual cohort
(A and B) Confusion tables of binary results of the LASSO-EOC diagnostic model in the training (A) and validation (B) datasets.
(C) ROC curves of the LASSO-EOC diagnostic model in EOC diagnostic prediction of the training and validation datasets.
(D and E) ROC curves of the LASSO-based EOC diagnostic model in early and advanced EOC diagnostic prediction of the training (D) and validation (E) datasets.
(F) LASSO-EOC diagnosis-based EOC prediction score of healthy female samples and EOC samples of different stages. ∗∗∗ p < 0.001.
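The comparison of the two models on the same validation samples relies on McNemar's test, which uses only the discordant pairs (samples that one model classifies correctly and the other does not). A minimal sketch of the exact (binomial) form of the test is shown below. Whether the exact or the chi-square form was used in the study is not stated here, and the counts in the usage lines are hypothetical.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the discordant pair counts.

    b: samples classified correctly by model A only
    c: samples classified correctly by model B only
    Under the null hypothesis, min(b, c) follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical discordant counts, for illustration only.
p_unbalanced = mcnemar_exact(1, 9)
p_balanced = mcnemar_exact(5, 5)
```

Samples that both models classify correctly, or both misclassify, do not enter the statistic; only an imbalance between the two discordant counts yields a small p value.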
With respect to the diagnostic sensitivities for different EOC subtypes, the MethylBERT-EOC diagnostic model outperformed the LASSO-EOC diagnostic model by 11%, 18%, and 25% in early serous, endometrioid, and mucinous carcinoma, respectively, while the two models achieved equal sensitivities in early clear cell and poorly differentiated carcinoma (Tables S2 and S5). In advanced EOC diagnosis, the MethylBERT-EOC diagnostic model showed higher sensitivities in serous carcinoma, mucinous carcinoma, and poorly differentiated adenocarcinoma, lower sensitivity in endometrioid carcinoma, and equal sensitivities in clear cell carcinoma and undifferentiated carcinoma (Tables S2 and S5). It should be noted that the sample sizes were small for all subtypes except serous carcinoma; therefore, further validation is necessary for these subtypes.
To investigate the prognostic prediction potential of the 493 methylation markers selected from the pool cohort, we used the 151 markers that showed over 10% methylation change between EOC and healthy females in the individual cohort. A total of 437 EOC patients in the individual cohort with complete survival information were selected and randomly split into training and validation datasets in a 2:1 ratio. Univariate Cox regression (UniCox) and LASSO were applied to reduce the dimensionality to three markers, and a Cox model of an EOC prognostic panel (OCPP) was constructed (Figure 4A). Kaplan-Meier curves were generated in the training and validation datasets using a combined prognosis score (cp-score) of OCPP (Figures 4B and 4C). The high-risk group had 149 observations with 62 events in the training dataset and 59 observations with 27 events in the validation dataset, while the low-risk group had 154 observations with 29 events in the training dataset and 74 observations with 14 events in the validation dataset. The median survival time in the high-risk group was significantly shorter than that in the low-risk group by log rank test in both the training (p < 0.01) and validation (p < 0.01) datasets (Figures 4B and 4C).
Figure 4. Utility of the OCPP for prognosis prediction of EOC
(A) Characteristics of the three methylation markers and their coefficients in EOC prognosis. SE, standard errors of coefficients; z value, Wald z-statistic value.
(B and C) Kaplan-Meier plots for the overall survival of EOC patients in the low- and high-risk groups determined by the OCPP in the training (B) and validation (C) datasets.
(D and E) Kaplan-Meier plots for the overall survival of high-grade serous carcinoma (D) and other histological subtypes (E) of EOC samples in the low- and high-risk groups determined by the OCPP in the validation datasets.
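A score such as the cp-score is typically the linear predictor of the Cox model, that is, the sum of each marker's coefficient multiplied by its methylation level, with patients then divided into risk groups by a cutoff. The sketch below illustrates that general idea only. The coefficients, methylation values, marker names, and cutoff are all hypothetical placeholders; the actual coefficients are those reported in Figure 4A, and the cutoff rule used in the study is not stated in this section.

```python
def cp_score(methylation, coefficients):
    """Linear predictor: sum of coefficient * methylation level per marker."""
    return sum(coefficients[m] * methylation[m] for m in coefficients)


def risk_group(score, cutoff):
    """Assign 'high' risk when the score exceeds the cutoff, else 'low'."""
    return "high" if score > cutoff else "low"


# Hypothetical coefficients and methylation levels, for illustration only.
coefs = {"marker_1": 2.0, "marker_2": -1.0, "marker_3": 0.5}
patient = {"marker_1": 0.5, "marker_2": 0.25, "marker_3": 0.5}

score = cp_score(patient, coefs)        # 2.0*0.5 - 1.0*0.25 + 0.5*0.5
group = risk_group(score, cutoff=0.5)   # the cutoff is a hypothetical value
```

A positive coefficient means that higher methylation at that marker raises the predicted hazard, while a negative coefficient lowers it.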
For high-grade serous carcinoma, in the validation dataset, the high-risk group had 44 observations with 21 events while the low-risk group had 41 observations with 9 events. Although the difference in median survival time between the high- and low-risk groups was not as pronounced as when calculated with all samples, it was still significant (p = 0.026) (Figure 4D). For the other EOC histological subtypes, since the sample size of each individual subtype was too small in the validation dataset, they were combined and analyzed as a group. In the validation dataset, the high-risk group of the other EOC subtypes had 15 observations with 6 events, while the low-risk group had 34 observations with 5 events. Despite the small sample size, the difference in median survival time between the high- and low-risk groups was still significant (p < 0.01) (Figure 4E).
Furthermore, we analyzed the performance of CA125 and HE4 for EOC prognostic prediction. Because CA125 and HE4 information was missing for some samples, the training and validation datasets for these two indicators were not exactly the same as each other or as the OCPP datasets. Nevertheless, both indicators could distinguish high-risk groups from low-risk groups in Kaplan-Meier curves (Figures S7A, S7B, S7E, and S7F). For high-grade serous carcinoma, prognostic prediction using CA125 and HE4 remained effective, but for the other EOC subtypes, the median survival time did not differ significantly between the high- and low-risk CA125 groups (Figures S7C, S7D, S7G, and S7H). This could be caused by two factors: first, the sample size for other EOC histological subtypes was small, and, second, prognostic models trained on a dataset consisting mainly of one EOC subtype (high-grade serous carcinoma) might not fully represent the characteristics of other EOC subtypes. On the other hand, in the ROC analysis, OCPP showed a slightly larger AUC in high-grade serous carcinoma than CA125 and HE4, and combining all three resulted in an ROC curve with a significantly larger AUC (Figure S7I). This indicates a potential application of OCPP in supporting the widely used EOC indicators, CA125 and HE4, in prognostic prediction.
Given the good performance of our diagnostic models, we evaluated the potential of developing a simple and cost-effective ddPCR-based assay. We chose the methylation site OV1, which exhibited the largest and most significant methylation difference between EOC and normal samples in the individual cohort, to develop a ddPCR assay, and validated its utility in an independent cohort (referred to as the ddPCR cohort; Table S1) of 305 EOC patients and 480 healthy female subjects (Figure S8, Table S6, and Data S3). Overall, ddPCR showed a significant methylation difference in OV1 between EOC and healthy female samples (Figure 5F); hence, this cohort was randomly split in a 2:1 ratio into training and validation datasets, and logistic regression was applied to the training dataset to determine the threshold value. As a result, OV1 achieved 77.4% sensitivity at 92.59% specificity (AUC = 0.912) in the training dataset and 72.16% sensitivity at 92.95% specificity (AUC = 0.877) in the validation dataset (Figures 5A–5C). In contrast, CA125 showed only 48.85% sensitivity at 95% specificity in this ddPCR cohort (Table S7; Data S3). With respect to different EOC subtypes, the OV1 ddPCR assay showed over 70% sensitivity in all subtypes except mucinous carcinoma, where only 2 of 8 samples were detected (Table S8), suggesting that OV1 might not be a marker for this EOC subtype. However, since the sample size of mucinous carcinoma was too small, more samples are needed to validate this result.
Figure 5. Performance of the ddPCR assay with OV1 for discriminating ovarian cancer and healthy females
(A) ROC curves of OV1 in the training (red) and validation (blue) datasets of the ddPCR cohort.
(B and C) Confusion tables of binary results of the OV1 prediction model in the training (B) and validation (C) datasets.
(D) ROC curves of OV1 and the OV1-CA125 combination in distinguishing early (red and green) and advanced (blue and purple) EOC from healthy females in the ddPCR cohort.
(E) Confusion table summarizing OV1 distinguishing early and advanced EOC.
(F) Beeswarm plots presenting the methylation levels of OV1 in the ddPCR cohort between ovarian cancer and healthy females; red dots are healthy female samples, and blue dots are EOC samples (∗∗∗ p < 0.001).
For the diagnosis of EOC at different stages in this cohort, OV1 and CA125 achieved 57.66% sensitivity at 92.71% specificity and 38.74% sensitivity at 95% specificity, respectively, in early EOC (n = 111), and 86.08% sensitivity at 92.71% specificity and 54.64% sensitivity at 95% specificity, respectively, in advanced EOC (n = 194) (Figures 5D and 5E; Table S7). Although the diagnostic sensitivity for early EOC was not satisfactory with either OV1 or CA125 alone, combining OV1 and CA125 (a sample was predicted to be EOC positive if either OV1 or CA125 was positive) markedly improved the sensitivity to 72.07% while the specificity remained as high as 88.12% (p < 0.01, McNemar’s test) (Table S7).
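The "either test positive" rule used to combine OV1 and CA125 can be illustrated with a small sketch. It shows why such a rule can only raise sensitivity and can only lower specificity relative to either test alone. The per-sample calls below are hypothetical toy values, not data from the study, and the function names are illustrative.

```python
def combine_or(ov1_positive, ca125_positive):
    """Combined call per sample: positive if either test is positive."""
    return [a or b for a, b in zip(ov1_positive, ca125_positive)]


def sensitivity(calls_on_eoc):
    """Fraction of true EOC samples called positive."""
    return sum(calls_on_eoc) / len(calls_on_eoc)


def specificity(calls_on_healthy):
    """Fraction of healthy samples called negative."""
    return sum(not c for c in calls_on_healthy) / len(calls_on_healthy)


# Hypothetical toy calls (True = test positive), not data from the study.
eoc_ov1 = [True, True, False, False]
eoc_ca125 = [False, True, True, False]
healthy_ov1 = [False, False, True, False]
healthy_ca125 = [False, False, False, True]

eoc_combined = combine_or(eoc_ov1, eoc_ca125)
healthy_combined = combine_or(healthy_ov1, healthy_ca125)
```

In this toy example, sensitivity rises from 0.5 (OV1 alone) to 0.75 (combined), while specificity falls from 0.75 to 0.5, mirroring the direction of the trade-off reported above.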
More importantly, when evaluating OV1 at different sensitivities (80%, 85%, 90%, and 95%), we found very good consistency in the corresponding specificity values between the ddPCR assay and the sequencing results (Tables S9 and S10), indicating that ddPCR performed well and could be a suitable substitute for the sequencing strategy.
As the next step, we evaluated the utility of OV1 as a methylation marker in a longitudinal cancer screening cohort consisting of 2,117 participants at high risk of EOC. OV1 ddPCR values and CA125 were measured and used as the first-line tests for this high-risk cohort. Participants who received an EOC-positive prediction from the OV1 and CA125 combined model underwent a TVU imaging study (the second-line test) by two senior sonographers. If no adnexal mass was seen, the participant took two more TVU tests in the following 6 months to verify the negative result. If a mass was found, the participant took a TVU test each month for the next 3 months. Based on the sonographers’ subjective assessment, any participant who was found to be EOC positive or with any suspicion of malignancy on TVU underwent an abdominal MRI for further evaluation, 7 , 8 , 22 and the gynecologists then decided whether a biopsy was necessary for histological confirmation.
In this prospective study, 314 participants were predicted to be EOC positive by the first-line tests; 4 of them were confirmed to have EOC (3 at stage I and 1 at stage III). In addition, after more than 10 months of follow-up of the EOC-positive participants of this cohort, with each participant taking at least two TVU tests during these 10 months, only one of the negatively predicted participants was reported to have EOC. We therefore assumed that the remaining 1,802 participants with an EOC-negative prediction were true negatives at the time of examination; hence, among the 2,112 participants without EOC, 310 were falsely predicted to be EOC positive by the OV1 and CA125 combined model, giving a specificity of 85.3%. Meanwhile, the model detected 4 of the 5 EOC cases, giving a sensitivity of 80%; these results were close to those in our retrospective ddPCR cohort (84.92% sensitivity at 88.12% specificity).
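The arithmetic behind the prospective-cohort sensitivity and specificity can be reproduced from the counts reported above, as in the short sketch below. The variable names are illustrative, and the calculation relies on the same assumption stated in the text, namely that all negatively predicted participants except the one reported case were true negatives.

```python
total_participants = 2117
predicted_positive = 314
confirmed_eoc_among_positive = 4   # true positives
eoc_among_negative = 1             # false negative

predicted_negative = total_participants - predicted_positive
true_negative = predicted_negative - eoc_among_negative
false_positive = predicted_positive - confirmed_eoc_among_positive

sensitivity = confirmed_eoc_among_positive / (
    confirmed_eoc_among_positive + eoc_among_negative
)
specificity = true_negative / (true_negative + false_positive)

print(f"sensitivity = {sensitivity:.1%}, specificity = {specificity:.1%}")
```

With only 5 EOC cases in the cohort, the sensitivity estimate is necessarily coarse: each additional detected or missed case shifts it by 20 percentage points.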
Discussion
In this study, we used pooled cfDNA samples for the primary screening of EOC methylation markers. Previous cfDNA methylation-based cancer diagnostic studies commonly employed tissues or cell lines for first-stage marker screening, 23 , 24 , 25 , 26 but since the methylation pattern in tissues or cell lines differs from that in cfDNA, screening directly on cfDNA samples might identify cfDNA methylation markers more accurately. Furthermore, using a kit that surveys 3.3 million CpG sites revealed more potential markers than the 450K or 850K assays. For example, the most efficient marker identified by our study, OV1, resides in a non-genetic, non-regulatory region; such regions are not covered by most methylation screening assays. Moreover, we adopted the transformer concept and trained a methylation transformer, MethylBERT, on methylation data from over 110,000 cancer samples. Applying MethylBERT to the marker-targeted sequencing data of individual cfDNA samples, we constructed an EOC diagnostic model that outperformed the conventional LASSO-logistic regression model in sensitivity and specificity by 6% and 5%, respectively. Lastly, we developed a fast and cost-effective EOC screening method that combines the CA125 test and the OV1 ddPCR assay and achieved >70% sensitivity in early EOC detection while the specificity was as high as 88%.
The global cancer burden lies mainly in late detection. This is particularly true for EOC, where the mortality rate is high when the disease is detected at later stages. Reducing the mortality burden in EOC patients relies heavily on early detection. Genetic and epigenetic analysis of cfDNA obtained from liquid biopsies is a promising approach to obtaining diagnostic information from just a blood sample. 27 Circulating tumor DNA (ctDNA) can be shed by tumor cells and, importantly, retains the same copy-number alterations, mutations, and epigenetic markers. 11 , 28 Therefore, analysis of cfDNA could detect early epigenetic changes correlated with malignant transformation. Compared to invasive approaches such as TVU, colonoscopy, gastroscopy, and tissue-based histological examinations, a cfDNA-based diagnostic assay has the advantages of being easy to conduct, cost-effective, and less harmful to patients, making it more suitable as a first-line and regular cancer screening strategy. Promising results have been reported for the use of cfDNA in the diagnosis of different cancers. 15 , 16 , 24 , 25 Our study provides another example of the usefulness of cfDNA in cancer diagnosis. The MethylBERT-EOC diagnostic model achieved nearly 90% sensitivity in EOC diagnosis, with a particular improvement in early EOC diagnosis, where it gave a sensitivity of 80%, which was 30%–40% higher than that of the CA125 assay and about 12% higher than that of our conventional model. Although the improvement in the detection of advanced EOC by the MethylBERT-EOC model was not as large as the improvement in early EOC detection, a higher early EOC detection rate would be more clinically significant, since the 5-year survival rate of early EOC is over 70%, compared to an estimated 40% and 20% for stages III and IV, respectively. 4
Traditional CA125 biomarker is an effective indicator of EOC; our individual and ddPCR cohorts revealed its over 50% sensitivity in all-stage EOC and nearly 40% sensitivity in early EOC detection. An over-200,000-participants clinical trial has shown that annual CA125 measurement increased early EOC incidence by 39.2% and decreased advanced EOC incidence by 10.2%. 29 It should be noted that an accurate CA125 sensitivity is hard to assess. In retrospective studies, large numbers of ovarian cancer subjects are diagnosed due to high or elevated CA125 value, resulting in a biased sample of OC cohorts with higher CA125 value. In prospective studies, on the other hand, it is difficult to determine whether CA125-negative results are true negative or false negative, since no histological confirmation could be obtained from the negative subjects. Alternatively, researchers have recruited women with pelvic or adnexal mass who were scheduled for surgical removal for prospective study, where histological examination of all removed mass exhibited CA125 with 79%–91% sensitivities on 59%–79% specificities, 30 , 31 , 32 but whether the subjects were pre-selected by CA125 examination was not addressed in these studies. Skates et al. conducted a more rigorous prospective estimation of CA125 sensitivity in 3,992 women using both CA125 and risk of ovarian cancer (ROCA) value, a personalized OC algorithm based on CA125 change among longitudinal measurements, for OC screening. Their results showed that 50% of the invasive OC (3 early-stage and 3 advanced-stage) were detected by ROCA before CA125 exceeded 35 U/mL, suggesting that the standard cutoff would give less than 50% sensitivity in OC detection. 33 In another OC screening trial on 46,237 general females, it was estimated that CA125 only gave OC 41% sensitivity (34 early-stage and 36 advanced-stage) at the 35 U/mL cutoff. 34 This sensitivity is inefficient for first-line, general population screening. 
Unsurprisingly, the EOC-related mortality was not significantly reduced in another clinical trial 29 ; thus, reducing mortality will require a more sensitive screening strategy. Meanwhile, the same trial also indicated that annual TVU is not a good first-line screening strategy despite its accuracy, because it performed worse than CA125 in early EOC detection. 29 Intriguingly, the 39.2% increase in early EOC incidence by annual CA125 measurement is close to the sensitivity of CA125 in detecting early EOC in our individual and ddPCR cohorts (44.2% and 38.7%, respectively), indicating that this ∼40% increase in early EOC incidence might come from its ∼40% sensitivity in early EOC detection; if so, CA125 combined with our MethylBERT diagnostic model would increase the early incidence to 80%. All in all, our MethylBERT diagnostic model would be an excellent substitute or supplement for CA125 testing, as it increased early EOC diagnostic sensitivity to nearly 80% while the specificity was close to 95%.
Three characteristics define the ideal EOC diagnostic test: high sensitivity, high PPV, and low FPR. A highly sensitive EOC diagnostic test, particularly in the early EOC stage domain will lead to an improvement in cancer mortality. Computer simulations have suggested that improving the EOC detection sensitivity, currently relying on CA125, could reduce overall mortality by up to 25%. 35 An EOC test with high PPV will help alleviate the anxiety of the patient while waiting for confirmatory TVU results. Finally, a test with low FPR will be a true benefit to healthcare systems because the number of unnecessary TVU tests will be kept to a minimum. In this study, the sensitivity, PPV, and FPR of our MethylBERT-EOC diagnostic model were estimated to be 89.24%, 91.43%, and 5.53%, respectively, in the validation dataset of the individual cohort that comprised 251 EOC patients and 374 healthy females, while the sensitivity was increased by over 40% compared to CA125; PPV and FPR were at an acceptable level. Moreover, combining the model with CA125 would further increase the sensitivity and PPV to 95.68% and 96.89%, respectively, though FPR was compromisingly increased a little to 6.29%.
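To make the relationship between the three quantities explicit, the following minimal sketch derives sensitivity, PPV, and FPR from confusion-matrix counts. The counts are hypothetical and chosen only to keep the arithmetic easy to follow; they are not taken from the individual cohort.

```python
def screening_metrics(tp, fp, tn, fn):
    """Return (sensitivity, PPV, FPR) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # detected cases among all true cases
    ppv = tp / (tp + fp)           # true cases among all positive calls
    fpr = fp / (fp + tn)           # positive calls among all healthy subjects
    return sensitivity, ppv, fpr


# Hypothetical counts, for illustration only.
sens, ppv, fpr = screening_metrics(tp=90, fp=10, tn=190, fn=10)
print(f"sensitivity={sens:.2%}, PPV={ppv:.2%}, FPR={fpr:.2%}")
```

Note that PPV, unlike sensitivity and FPR, depends on the ratio of cases to healthy subjects in the dataset being evaluated.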
As a conventional approach for binary classification modeling, LASSO-logistic method was widely adopted by previous studies. Though its selection and reduction of variables based on difference level conferred the model better stability by discarding less representative variables, it excluded the potential for exploring connections of features in higher dimensionality. Recently emerged deep learning-based neural network techniques could overcome this challenge and therefore have largely replaced the conventional approach in biological studies. 36 , 37 Our MethylBERT is such a technique; by pretraining it with large-scale methylation datasets, a subset of methylation knowledge could be generalized. Previously developed transformers such as GeneFormer 19 and scBERT 20 utilized single-cell transcriptomes from various samples and incorporated all available data. This was because, in the realm of foundational model training, leveraging a diverse array of data was a research paradigm. Hence, in this study, we utilized the methylation data of not only EOC but also all available cancers, to train our MethylBERT. There are two key advantages to adopt such paradigm. Firstly, pan-cancer cohorts offer a significantly larger dataset compared to EOC cohorts, differing by 1–2 orders of magnitude. This abundance of data facilitated a smoother training process, minimizing the likelihood of encountering overfitting. Secondly, training on pan-cancer cohorts enabled the model to learn more generalized and stable associations between diverse methylation sites, enhanced robustness, and result reliability. As a result, 493 methylation sites were expanded through MethylBERT to a larger-scaled feature output, providing more hyperplanes to distinguish EOC from healthy females. 
However, there is a major limitation of MethylBERT: pretraining of this model was only applied with methylation data, and other biological knowledge such as signaling pathways and methylation-gene expression relationships was not utilized due to the lack of RNA-methylation matched data. Since gene expression and signaling pathway are downstream effectors of methylation change, integrating methylation and gene expression matched data, along with incorporating known regulatory pathways, in future work should improve MethylBERT’s accuracy and generalizability, leading to more accurate and general early EOC detections. Moreover, as a proof-of-concept study, our work has demonstrated the utility of transformer technology in methylation pattern prediction. In the application of cfDNA methylation-based EOC detection, it largely improved the diagnostic performance, suggesting that it may not be limited to facilitating EOC but also other cancers methylation-related diagnosis.
The first limitation of this study is that the collected EOC samples were predominantly serous carcinoma; as a result, the developed diagnostic and prognostic models may be biased toward distinguishing this particular EOC subtype from healthy females. Therefore, the high sensitivities of the diagnostic model for the other EOC subtypes need to be confirmed in future studies. The second limitation lies in our prospective study: the sensitivity and specificity calculations for the OV1 and CA125 combined model were based largely on EOC confirmation by TVU. However, as indicated by previous findings, TVU is not a gold standard for EOC detection; it could miss up to 30% of EOC cases 33 and may not have the resolution to detect EOC at low CA125 levels. 34 Another drawback of our prospective study design is the short follow-up period compared to other classic OC screening prospective trials, which usually lasted over five years. 10 , 33 , 34 All in all, these limitations could lead to an incorrect estimation of the true-positive and true-negative numbers, resulting in inaccurate sensitivity and specificity outcomes in our prospective cohort.
Introduction
Ovarian cancer was an important cause of cancer death in women, with an incidence of 313,000 and over 200,000 deaths in 2020; 85%–95% of ovarian cancers were from epithelial cells. 1 , 2 Although breast and cervical cancers are more common, epithelial ovarian cancer (EOC) has a much lower 5-year survival rate after diagnosis, which makes EOC more lethal to females when compared with breast and cervical cancers. 1 This high mortality and low 5-year survival rate of EOC were mainly related to a late diagnosis, with more than 80% of patients already at advanced stages when diagnosed. 3 Based on current data, if EOC was diagnosed at stage I, its 5-year survival rate would be at around 90% 4 ; this rapidly declines to around 20% if diagnosed at the later stage III/IV.
At present, serum biomarker cancer antigen 125 (CA125) and transvaginal ultrasound (TVU) examination are the two most commonly used tests for EOC screening. Serum Human Epididymis Protein 4 (HE4) has also emerged as an important serum biomarker for EOC diagnosis and is implicated in the detection of recurrence. 5 The use of CA125 alone for EOC detection has a low sensitivity 6 ; on the other hand, although TVU is highly sensitive and accurate for EOC detection, 7 routine TVU use for first-line mass EOC screening is clinically not feasible due to its inconvenience and time-consuming nature, and conclusion largely depends on the experience of the sonographer. 7 , 8 In practice, two large clinical studies have found that annual evaluation of CA125 alone, or in combination with TVU, did not reduce EOC-related mortality. 9 , 10 These findings highlighted the urgent need for a highly sensitive and specific EOC test that is effective for the early detection of EOC.
Circulating cell-free DNAs (cfDNAs) are extracellular nucleic acid fragments found in liquid biopsies. When cfDNAs are shed by tumor cells, for instance during apoptosis, they are potentially useful in the diagnosis of cancer because they contain the same genetic and epigenetic alterations of the tumor cells from which they derive. 11 The application of cfDNA in EOC screening has been demonstrated with some promising results by previous studies 12 , 13 , 14 , 15 ; however, these studies were limited by a relatively small sample size and a bias toward later-stage EOC. Hence, the utility of cfDNA tests for the diagnosis of early EOC was not well characterized. 16 Another limitation is that cfDNAs were fragmented and fast degraded; thereby, only a small part of them could give high enough copy number (>10) for analysis by sequencing. In turn, the number of potential markers found in cfDNA was much less than that in the tissue or cell samples.
By analyzing cfDNA from cancer and healthy samples, differences in their genetic or epigenetic patterns can be identified and utilized as markers for building a diagnostic model that distinguishes cancer from healthy samples. As a classic model construction strategy, least absolute shrinkage and selection operator (LASSO)-based dimensionality reduction followed by logistic regression for binary classification was widely adopted in previous studies. However, such an approach is limited in the number of biomarkers that can be included for modeling due to constraints on events per variable (EPV), the ratio of the number of outcome events to the number of candidate variables. For example, a logistic regression model generally requires >10 EPV for good prediction, 17 , 18 which means that each marker to be included in the model selection needs to be examined in at least 10 samples. This significantly limits the number of candidate markers that can be considered for model construction.
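The EPV constraint can be turned into a simple bound on the number of candidate markers. The sketch below assumes the common convention that "events" means the number of samples in the less frequent outcome class; the event count used is hypothetical.

```python
def max_candidate_markers(n_events, min_epv=10):
    """Largest number of candidate markers compatible with an EPV threshold.

    n_events: number of samples in the less frequent outcome class
              (an assumed convention, see the lead-in).
    """
    return n_events // min_epv


# Hypothetical example: 200 events allow at most 20 candidate markers at EPV > 10.
print(max_candidate_markers(200))
```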
Therefore, a method that is able to predict unexamined genetic or epigenetic patterns from examined markers, without being constrained by the number of input markers, would be ideal for constructing cfDNA-based diagnostic models. Recent groundbreaking work on single-cell transcriptome data processing has employed a state-of-the-art deep learning technology called the transformer, whose idea is to pretrain a model on large-scale general datasets and then fine-tune the learned knowledge toward a vast array of downstream tasks with limited task-specific data. 19 , 20 These works inspired us, since large-scale prediction in settings with limited data is also a need in cfDNA studies, and incorporating deep learning technology into cfDNA-based cancer diagnosis may not only expand limited cfDNA methylation data to a larger-scale methylation pattern but also allow more markers to be included in the construction of a cancer diagnostic model.
In this article, we surveyed over 3.3 million CpG sites in over 420 EOC and healthy female pooled cfDNA samples and validated the 493 most significant methylation markers in 754 EOC (205 early EOC) and 1,118 healthy female individual cfDNA samples. In the subsequent diagnostic model construction, we not only employed the conventional LASSO-logistic regression approach but also pretrained a methylation transformer called MethylBERT by transferring labels from labeled to unlabeled datasets in over 110,000 cancer methylation samples and then applied this transformer to construct a deep learning-based diagnostic model. At the end of this work, we selected the most significant marker as the target, adapted our methylation screening assay to a fast and low-cost droplet digital PCR (ddPCR) platform, and validated its utility in early EOC screening.
STAR★Methods
REAGENT or RESOURCE | SOURCE | IDENTIFIER
Biological samples
Plasma | Guangzhou Women and Children’s Medical Center; Zhuhai People’s Hospital; Dazhou Central Hospital | N/A
Critical commercial assays
cfDNA extraction Kit | Magen | D3182-04
Qubit dsDNA High Sensitivity Kit | Invitrogen | Q33231
TruSeq Methyl Capture EPIC Library Prep Kit | Illumina | FC-151-1003
NEBNext® Ultra™ II DNA Library Prep Kit | NEB | 7645L
xGen Hybridization and Wash Kit | IDT | 1080584
EZ-96 DNA Methylation-Lightning Mag Prep Kit | ZYMO RESEARCH | D5047
ddPCR Supermix for Probes (No dUTP) | BIO-RAD | 1863024
DG8 gaskets for ddPCR | BIO-RAD | 1863009
PCR Plate Heat Seal, foil, pieceable | BIO-RAD | 1814040
ddPCR 96-Well Plates | BIO-RAD | 12001925
Software and algorithms
Code for MethylBERT development | This paper | https://github.com/methylbert/methylbert
Deposited data
Sequencing data of the individual cohort | This paper | GSA: HRA011451
Oligonucleotides
IDT8 UDI (UDI-UMI) adaptor | IDT | N/A
Customized sequencing probes | IDT | N/A
ddPCR probes and primers | Thermo Fisher Scientific | N/A
Further information and requests for resources and reagents should be directed to and will be fulfilled by the lead contact, Kang Zhang ([email protected]).
The customized sequencing probes, IDT8 UDI (UDI-UMI) adaptor and ddPCR probes and primers generated in this study will be made available on request, but we may require a payment and/or a completed Materials Transfer Agreement if there is potential for commercial application.
The sequencing data reported in this paper are deposited in the Genome Sequence Archive (GSA) with project number HRA011451 ( https://ngdc.cncb.ac.cn/gsa-human/browse/HRA011451 ) and will be publicly available as of the date of publication. There are no restrictions on data access for academic use. Requests to access data should follow the GSA’s "Data Access Request Guidance" available at https://ngdc.cncb.ac.cn/gsa-human/document . The code for the MethylBERT development is deposited in GitHub at https://github.com/methylbert/methylbert , which is publicly accessible. Any additional information required to reanalyze the data reported in this paper is available from the lead contact upon request.
From each subject, 1.5–2 mL of plasma was collected and stored at −80°C until cfDNA extraction. Patient recruitment and blood sample collection were done at Guangzhou Women and Children’s Medical Center, Zhuhai People’s Hospital, and Dazhou Central Hospital from January 2017 to May 2020.
The retrospective study was approved by the Research Ethics Committee of Guangzhou Women and Children’s Medical Center. This investigation includes three retrospective cohorts of EOC and healthy female plasma samples: the pool cohort, the individual cohort, and the ddPCR cohort. The pool cohort was composed of pooled samples, each mixed from >20 individual cfDNA samples, whereas the other two cohorts consisted entirely of individual cfDNA samples.
The prospective study was approved by the Research Ethics Committee of the Zhuhai People’s Hospital. We conducted a prospective EOC screening cohort study from August 2022 to July 2023 to evaluate the utility of the OV1 methylation marker in combination with a conventional screening method. All the participants were enrolled due to an increased EOC risk, including (i) female, (ii) post-menopausal, (iii) history of breast cancer or family history of cancers, and (iv) BRCA1/2 mutations.
For the high-risk prospective cohort, 2,117 subjects were screened by OV1 ddPCR and CA125 as the first-line tests; samples predicted to be EOC positive by the OV1 (35U/mL) were further examined by TVU as the second-line test. Any TVU-positive or suspicious finding was validated by abdominal MRI. Based on the TVU and MRI results, the gynecologists advised the participants on whether surgical removal of the mass was needed; if the mass was removed, it was sent to pathologists for histologic confirmation.
Cell-free DNA was isolated from plasma by using Magen cfDNA extraction Kit (D3182-04) following the manufacturer’s instructions. The quantity of cfDNA was determined by Qubit 2.0 fluorometer (Invitrogen, Life Technologies) with the Qubit dsDNA High Sensitivity Kit (Invitrogen).
cfDNA samples were extracted individually from early-stage EOC patients, advanced-stage EOC patients, and healthy females, and the Qubit 2.0 fluorometer (Invitrogen, Life Technologies) was used to estimate the DNA amount of each sample. For the EOC (early or advanced stage) samples, those with a cfDNA amount of 20 to 100 ng were selected; for healthy female samples, those with 10 to 50 ng were selected. Then 20 age-matched early or advanced EOC cfDNA samples, or 30 age-matched healthy female samples, were pooled together, and the DNA amount of each pool was assayed by the Qubit 2.0 fluorometer. If the amount was over 500 ng, the pool was used for the subsequent library construction; if it was less than 500 ng, more cfDNA samples around the median age were extracted and added to the pool until 500 ng was reached.
The methylation library of each pooled sample was prepared by using the TruSeq Methyl Capture EPIC Library Prep Kit (TMC-EPIC kit, FC-151-1003, Illumina) according to the manufacturer’s instructions, except that the fragmentation step was skipped. The screening regions covered by this kit are indicated in Data S4 . The concentration of prepared libraries was determined by the Qubit 2.0 fluorometer (Invitrogen, Life Technologies), and the libraries’ quality was assessed by capillary electrophoresis (Qsep100, Bioptic). Qualified libraries were sequenced on the Illumina HiSeq X10 platform (Illumina).
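The pooling rule can be summarized as a short procedure: start from the initial set of age-matched samples and keep adding samples until the 500 ng input requirement is met. The sketch below is illustrative only; the function name, the tuple layout, and the assumption that candidates are pre-sorted by suitability (for example, by closeness to the median age) are hypothetical.

```python
def build_pool(candidates, initial_size, min_total_ng=500.0):
    """Pool the first `initial_size` samples, then top up to the mass target.

    candidates: list of (sample_id, cfdna_ng) tuples, assumed to be
                pre-sorted so the most suitable samples come first.
    Returns the pooled samples and their total cfDNA mass in ng.
    """
    pool = list(candidates[:initial_size])
    reserve = list(candidates[initial_size:])
    total = sum(ng for _, ng in pool)
    # Add further samples only while the pool is below the input requirement.
    while total < min_total_ng and reserve:
        sample = reserve.pop(0)
        pool.append(sample)
        total += sample[1]
    return pool, total


# Hypothetical example: 20 EOC samples of 20 ng each give 400 ng,
# so five more samples are needed to reach 500 ng.
samples = [(f"S{i:02d}", 20.0) for i in range(30)]
pool, total_ng = build_pool(samples, initial_size=20)
print(len(pool), total_ng)
```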
500 methylation markers were selected from the pool cohort based on following criteria: 1. Methylation difference >10% between EOC and healthy female datasets; 2. p -value 10% between early EOC and healthy female datasets ( p -value not considered here); 4. For sites with methylation difference between 10% and 15%, sites within a differentially methylated region (DMR, difference>10%, p -value 3 CpG sites within a 200bp region) were retained; 5. Sites on genes that were reported to be involved in tumorigenesis by previous articles were preferentially selected.
cfDNA extracted from the plasma samples was ligated to methylation adaptors by using the NEBNext Ultra II DNA Library Prep Kit for Illumina (NEB #7645L). The methylation adaptors, composed of an 8-bp index and an 8-bp index linked to a 9-bp UMI sequence, were customized from Integrated DNA Technologies (reference number: 04099708Q). Adaptor-ligated cfDNA samples were mixed 12-to-1 and hybridized with the customized probes (Integrated DNA Technologies) by using the xGen hybridization capture of DNA libraries Kit (Integrated DNA Technologies). Hybridized mixture samples were eluted following the reagents and steps of the “Second Elution” part of the TruSeq Methyl Capture EPIC Library Prep Kit (FC-151-1003, Illumina) and then bisulfite converted by using the EZ-96 DNA Methylation-Lightning Mag Prep Kit (D5047, ZYMO RESEARCH). Bisulfite-converted samples were amplified following the reagents and steps of the “Amplify Enriched Library” section of the TruSeq Methyl Capture EPIC Library Prep Kit (FC-151-1003, Illumina). The concentration of prepared libraries was determined by the Qubit 2.0 fluorometer (Invitrogen, Life Technologies), and the libraries’ quality was assessed by capillary electrophoresis (Qsep100, Bioptic). Qualified libraries were sequenced on the Illumina NovaSeq platform (Illumina).
For the pool cohort, raw methylation data were preprocessed using fastp (version 0.20.0) with default parameters. Clean reads were then aligned to human genome build hg19 using bitmapperBS (version 1.0.2.3) with default parameters, and BAM format results were sorted by sambamba (version 0.7.0). DNA methylation calling was performed using MethylDackel (version 0.4.0) extract with default parameters, and DNA methylation calls for methylated and unmethylated controls were extracted from the alignment file. The methylation values located in target regions were extracted using bedtools (version 2.29.0).
For the individual cohort, raw methylation data were processed by umi-tools (version 1.0.1) with the extract program, and the reads were preprocessed using fastp (version 0.20.0) with default parameters. Clean reads were then aligned to human genome build hg19 using bitmapperBS (version 1.0.2.3) in “pbat” mode, and bam format results were sorted by sambamba (version 0.7.0). Aligned reads were deduplicated based on UMIs using the umi-tools dedup program. DNA methylation calling was performed using MethylDackel (version 0.4.0) extract with the “--keepDups” parameter and DNA methylation calls for methylated and unmethylated controls were extracted from the alignment file. The methylated values located in target regions were extracted using bedtools (version 2.29.0).
To facilitate the pretraining of MethylBERT, we collected extensive DNA methylation data from two primary sources: the GEO-methyl dataset and the TCGA-methyl dataset. In total, we amassed over 110,000 samples, with the data exceeding 3 terabytes in size. Both datasets contain comprehensive genome-wide methylation data from diverse tissue types and conditions, providing a rich resource for training and evaluating the model’s performance.
The GEO-methyl dataset is derived from the National Center for Biotechnology Information (NCBI) Gene Expression Omnibus (GEO) repository, which houses a large collection of high-throughput sequencing and microarray datasets. For our purposes, we focused on whole-genome bisulfite sequencing (WGBS) and reduced representation bisulfite sequencing (RRBS) datasets that provide genome-wide methylation data across various tissues, diseases, and conditions. We retrieved Accession data from several platforms, including GPL13534 , GPL21145 , GPL8490 , GPL23976 , and GPL9183 , and extracted beta values files for further analysis. After data cleaning and preprocessing, the GEO-methyl dataset comprises methylation data from 95,995 samples, covering a wide range of species, including humans, mice, and plants. This dataset offers a diverse and extensive resource for pretraining the MethylBERT model.
The TCGA-methyl dataset is derived from The Cancer Genome Atlas (TCGA), a comprehensive resource containing multi-omics data across different cancer types. For MethylBERT, we focused on the DNA methylation data generated using the Illumina Infinium HumanMethylation450 BeadChip platform. The TCGA-methyl dataset includes methylation data from 15,439 human samples, comprising both tumor and adjacent normal tissues. By incorporating this dataset, we enable MethylBERT to learn the diverse methylation patterns associated with various cancer types and stages.
MethylBERT features an innovative CpG site embedding scheme, comprising four distinct embedding types: chromosome embedding ($E_c$), position embedding ($E_p$), methylation level embedding ($E_l$), and gene embedding ($E_g$). Each embedding captures unique aspects of CpG site information, enhancing the model’s performance.
Chromosome embedding $E_c$: By representing each CpG site’s chromosome as an embedding, the model can learn and differentiate among various chromosomal contexts, taking into account functional and structural variations across chromosomes.
Position embedding $E_p$: CpG sites are assigned to bins, each $L$ base pairs (bps) in length, with a unique embedding assigned to each bin. This allows the model to learn relationships between neighboring CpG sites and capture the spatial organization of methylation patterns within the genomic landscape. In this implementation, $L$ is set to 2000.
Methylation level embedding $E_l$: To facilitate the learning of methylation patterns, continuous methylation levels, which range from 0 to 1, are discretized into $B$ bins. This approach enables the model to effectively capture the nuances of methylation dynamics. In this case, $B$ is set to 20.
Gene embedding $E_g$: Gene embeddings are employed for CpG sites to associate them with their potential functional roles. For sites with known correlations, the closest gene is used; for those without, the nearest downstream gene is selected. These gene embeddings are derived from gene2vec, which learns gene-gene association information based on gene expression profiles across various tissues and conditions, enabling the creation of gene-gene co-expression networks.
The final CpG site embedding is obtained by summing these embeddings: $E = E_c + E_p + E_l + E_g$.
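The embedding scheme above can be sketched as four lookup tables whose outputs are summed. In the sketch below, the bin width of 2000 bp and the 20 methylation-level bins come from the text; everything else is a hypothetical stand-in: the embedding width of 8, the random vectors used in place of learned or gene2vec-derived weights, the gene name, and the choice to index position bins without regard to chromosome.

```python
import random

BIN_BP = 2000    # L: width of a position bin in base pairs (from the text)
N_LEVELS = 20    # B: number of methylation-level bins (from the text)
DIM = 8          # embedding width; hypothetical, kept small for readability


def position_bin(pos_bp):
    """Index of the 2000-bp bin that contains a genomic coordinate."""
    return pos_bp // BIN_BP


def level_bin(beta):
    """Discretize a methylation level in [0, 1]; 1.0 falls into the last bin."""
    return min(int(beta * N_LEVELS), N_LEVELS - 1)


class EmbeddingTable:
    """Random vector per key; a stand-in for a learned lookup table."""

    def __init__(self, dim, seed):
        self.dim = dim
        self.rng = random.Random(seed)
        self.table = {}

    def __getitem__(self, key):
        if key not in self.table:
            self.table[key] = [self.rng.gauss(0, 1) for _ in range(self.dim)]
        return self.table[key]


E_c, E_p, E_l, E_g = (EmbeddingTable(DIM, seed) for seed in range(4))


def embed_cpg(chrom, pos_bp, beta, gene):
    """E = E_c + E_p + E_l + E_g for one CpG site (element-wise sum)."""
    parts = [E_c[chrom], E_p[position_bin(pos_bp)], E_l[level_bin(beta)], E_g[gene]]
    return [sum(values) for values in zip(*parts)]


vector = embed_cpg("chr1", 4500, 0.37, "GENE_A")  # hypothetical site and gene
print(len(vector))
```

Because both position and methylation level are binned, two nearby CpG sites with similar methylation levels and the same assigned gene receive the same input embedding.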
The attention mechanism of the Transformer architecture exhibits quadratic computational complexity, posing a significant challenge when handling more than 20 million CpG sites. To address this issue, we employ the Performer, a matrix decomposition-based Transformer model designed to reduce computational complexity from quadratic to linear ($O(L^2)$ to $O(L)$), enabling efficient processing of large-scale data. The Performer utilizes an approximation technique termed "kernelized attention" with random feature maps. In contrast to the standard Transformer attention, represented as $\mathrm{Att}(Q, K, V) = \mathrm{softmax}(QK^{T}/\sqrt{d_k})\,V$, the Performer attention mechanism is formulated as $\mathrm{Att}(Q, K, V) = \mathrm{softmax}(\Phi(Q)\Phi(K)^{T}/\sqrt{d_k})\,V$. Here, $Q$, $K$, and $V$ denote the query, key, and value matrices, respectively, and $\Phi$ signifies a feature map function that projects input into a new space, facilitating efficient approximation of the dot-product attention. In this study, the number of Transformer layers is set to six.
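The linear cost of kernelized attention comes from reordering the matrix product so that the sequence-by-sequence weight matrix is never formed. The sketch below illustrates that reordering under two assumptions that are not taken from this work: a simple positive feature map (elu(x) + 1) stands in for the Performer's random feature maps, and row normalization replaces an explicit softmax. It shows that the reordered computation gives the same output as the explicit quadratic one for the same kernel.

```python
import math


def feature_map(x):
    """Positive feature map phi; elu(x) + 1 is an illustrative stand-in."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


def quadratic_attention(Q, K, V):
    """Forms every query-key weight explicitly: O(L^2) in sequence length."""
    phi_q = [feature_map(q) for q in Q]
    phi_k = [feature_map(k) for k in K]
    out = []
    for q in phi_q:
        weights = [sum(a * b for a, b in zip(q, k)) for k in phi_k]
        norm = sum(weights)
        out.append([sum(w * v[j] for w, v in zip(weights, V)) / norm
                    for j in range(len(V[0]))])
    return out


def linear_attention(Q, K, V):
    """Summarizes keys and values once, then reuses them: O(L)."""
    phi_q = [feature_map(q) for q in Q]
    phi_k = [feature_map(k) for k in K]
    d, d_v = len(phi_k[0]), len(V[0])
    # phi(K)^T V, a d x d_v matrix that does not grow with sequence length.
    kv = [[sum(k[a] * v[j] for k, v in zip(phi_k, V)) for j in range(d_v)]
          for a in range(d)]
    # phi(K)^T 1, used for the per-query normalization.
    k_sum = [sum(k[a] for k in phi_k) for a in range(d)]
    out = []
    for q in phi_q:
        norm = sum(q[a] * k_sum[a] for a in range(d))
        out.append([sum(q[a] * kv[a][j] for a in range(d)) / norm
                    for j in range(d_v)])
    return out


# Toy inputs: three positions, feature width 2, value width 2.
Q = [[0.2, -0.5], [1.0, 0.3], [-0.7, 0.8]]
K = [[0.1, 0.4], [-0.3, 0.9], [0.6, -0.2]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
slow, fast = quadratic_attention(Q, K, V), linear_attention(Q, K, V)
```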
In the pretraining phase, we adapt the masked language model (MLM) objective, used in BERT, to suit the methylation data. We refer to this as masked methylation level prediction (MMLP). The goal of MMLP is to predict the methylation level of some masked CpG sites, given the context of their surrounding CpG sites. To achieve this, a certain percentage $P_m$ of CpG sites in the input sequence is masked, and the model is trained to predict their methylation levels. Following BERT, we set $P_m$ to 15%. Formally, let $x_i$ be the input CpG sequence, where $i \in \{1, \dots, L\}$, and $y_i$ be the corresponding ground truth methylation levels. During pretraining, we randomly mask $P_m$ of CpG sites, replacing their methylation level embeddings with a special [MASK] token. The MMLP loss is calculated as the cross-entropy between the predicted methylation levels $\hat{y}_i$ and the ground truth $y_i$ for the masked positions: $L_{mask} = -\sum_{i \in M} y_i \log(\hat{y}_i)$, where $\hat{y}_i = \mathrm{softmax}(f(x_i))$, $M$ is the set of CpGs with masked methylation levels, and $f(x_i)$ is the output of the MethylBERT model for the masked CpG site $i$. The model is optimized to minimize this loss function, encouraging it to learn biologically relevant patterns and correlations between CpG sites.
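The MMLP objective can be sketched in a few lines: mask 15% of the methylation-level bins, then sum the cross-entropy over the masked positions only. In the sketch below, the helper names, the fixed random seed, and the uniform logits used in place of real model outputs are hypothetical. Because $y_i$ is a one-hot vector over the level bins, $-y_i \log(\hat{y}_i)$ reduces to the negative log-probability of the true bin.

```python
import math
import random

MASK = "[MASK]"


def mask_levels(level_bins, mask_frac=0.15, rng=None):
    """Replace a random fraction of methylation-level bins with [MASK]."""
    rng = rng or random.Random(0)
    n_mask = max(1, round(mask_frac * len(level_bins)))
    masked_idx = sorted(rng.sample(range(len(level_bins)), n_mask))
    masked = list(level_bins)
    for i in masked_idx:
        masked[i] = MASK
    return masked, masked_idx


def softmax(logits):
    """Numerically stable softmax over one site's logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mmlp_loss(logits_per_site, true_bins, masked_idx):
    """Cross-entropy summed over the masked sites only."""
    loss = 0.0
    for i in masked_idx:
        probs = softmax(logits_per_site[i])
        loss -= math.log(probs[true_bins[i]])
    return loss


# Toy example: 20 sites, 20 level bins, uniform (uninformative) logits.
levels = [i % 20 for i in range(20)]
masked, idx = mask_levels(levels)
logits = [[0.0] * 20 for _ in levels]
loss = mmlp_loss(logits, levels, idx)
```

With uniform logits, each masked site contributes log(20) to the loss, which is the value an untrained model would be expected to start near.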
To ensure computational efficiency and stability during the training process, we further adopt two strategies. First, we segment the data chromosome-wise, allowing the model to concentrate on smaller, more manageable portions of the data while preserving the unique characteristics of each chromosome. By training on individual chromosome sections, the model can learn essential features and relationships specific to each genomic region. Second, we implement a random down-sampling strategy, selecting $N$ CpG sites per training sample. Initially, a contiguous set of $N_m$ CpG sites is chosen, with $N_m$ ranging from $N$ to $10N$. Subsequently, $N$ CpG sites are randomly selected from this set. This method ensures a representative subset of CpG sites is obtained, capturing relevant information while maintaining computational efficiency. In this study, $N$ is set to 8192. By adopting these strategies, MethylBERT effectively addresses the computational challenges posed by the large-scale nature of DNA methylation data, enabling the model to learn meaningful CpG site representations and relationships without compromising performance or scalability.
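The two-stage down-sampling can be sketched as follows: draw a contiguous window of between N and 10N sites, then sample N sites from it. The function name is hypothetical, and three details are assumptions not stated in the text: the window size and start are drawn uniformly, the selected indices are returned in genomic order, and a chromosome with no more than N sites is used in full.

```python
import random


def sample_cpg_window(n_sites_on_chrom, n=8192, rng=None):
    """Return indices of N CpG sites drawn from a contiguous window of N..10N sites."""
    rng = rng or random.Random(0)
    if n_sites_on_chrom <= n:
        # Assumed fallback: too few sites to down-sample, so keep them all.
        return list(range(n_sites_on_chrom))
    window = rng.randint(n, min(10 * n, n_sites_on_chrom))   # N_m
    start = rng.randint(0, n_sites_on_chrom - window)
    chosen = rng.sample(range(start, start + window), n)
    return sorted(chosen)   # assumed: keep genomic order


# Hypothetical chromosome with 200,000 covered CpG sites.
indices = sample_cpg_window(200_000)
print(len(indices), indices[0], indices[-1])
```

Varying the window size changes how densely a region is sampled, so the model sees both tightly clustered and widely spaced sets of sites.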
We fine-tuned the pretrained MethylBERT model for EOC detection tasks. To this end, we first employed MethylBERT to obtain a representation for each chromosome and then concatenated these representations to form a sample representation, which was used for the final prediction. For a given sample, let $h_c$ denote the representation of chromosome $c$ obtained from MethylBERT, where $c \in \{1, \dots, C\}$. We concatenate these representations to form the final sample representation $z = \mathrm{Concat}(h_1, h_2, \dots, h_C)$. Here, for human samples, $C$ is 23. The model was trained on a training dataset of 503 EOC and 744 healthy female samples that were randomly selected from the individual cohort. To predict the presence or absence of EOC in a sample, we employed a binary classification approach using a fully connected layer followed by a sigmoid activation function, obtaining the probability $\hat{y}_i$ of the $i$-th sample belonging to the EOC class ($y = 1$). The model was trained to minimize the binary cross-entropy loss $L_{binary}(\hat{y}_i, y_i)$, where $y_i$ denotes the ground truth label. By fine-tuning the MethylBERT model in a supervised manner for EOC detection, we enabled it to capture EOC-specific methylation patterns and relationships, ultimately resulting in the MethylBERT-EOC diagnostic model for detecting EOC in DNA methylation data.
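The classification head described above amounts to three steps: concatenate the per-chromosome representations, apply one fully connected layer with a sigmoid, and score the result with binary cross-entropy. The sketch below uses hypothetical two-dimensional chromosome representations and zero-initialized weights purely for illustration; the real representation width and the trained weights are not given in the text.

```python
import math


def concat(chrom_reprs):
    """z = Concat(h_1, ..., h_C): flatten per-chromosome vectors into one."""
    return [value for h in chrom_reprs for value in h]


def predict_proba(z, weights, bias):
    """Fully connected layer followed by a sigmoid: P(sample is EOC)."""
    logit = sum(w * x for w, x in zip(weights, z)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


def binary_cross_entropy(y_hat, y, eps=1e-12):
    """Loss for one sample; y is 1 for EOC and 0 for healthy."""
    y_hat = min(max(y_hat, eps), 1.0 - eps)   # guard against log(0)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1.0 - y_hat))


# Hypothetical input: C = 23 chromosome representations of width 2.
h = [[0.1, 0.2] for _ in range(23)]
z = concat(h)
p = predict_proba(z, weights=[0.0] * len(z), bias=0.0)
loss = binary_cross_entropy(p, y=1)
```

With zero weights the head outputs 0.5 for every sample, which gives a loss of log(2); training moves the weights so that EOC samples score closer to 1.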
The samples in the training and validation datasets of the individual cohort were the same as those used for the MethylBERT-EOC diagnostic model construction. The 493 markers screened out from the pool cohort were processed by LASSO in the training dataset to distinguish EOC from healthy female samples. LASSO was performed 500 times, each time on a randomly selected 70% of the samples. Markers that appeared in over 450 of the LASSO runs were retained and applied to the training dataset to construct an EOC diagnostic model based on logistic regression; the diagnostic model was then tested in the validation dataset.
EOC samples with complete survival information in the individual cohort were randomly split at a 2:1 ratio into training and validation datasets. Markers were pre-screened in the pool and individual cohorts. The screened-out markers were processed by LASSO in the training dataset to distinguish samples of incidence from samples of other observations. LASSO was performed 100 times, each time on a randomly selected 70% of the samples, and markers that appeared in over 90 of the LASSO runs were retained. Concurrently, UniCox was also employed to process the selected markers in the training dataset to distinguish samples of incidence from samples of other observations. Markers with p -value < 0.05 in UniCox were overlapped with the markers retained from LASSO. These overlapping markers were applied to the training dataset to construct an EOC prognostic model based on logistic regression; the prognostic model was then tested in the validation dataset. The formula for the cp-score calculation was:
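The repeated-LASSO step is a selection-frequency filter: a marker is kept only if it is selected in more than 450 of 500 subsampled runs. The sketch below captures that bookkeeping only. The `select_fn` callable is hypothetical and stands in for an actual LASSO fit that returns the markers with non-zero coefficients; the toy selector and marker names are for illustration.

```python
import random
from collections import Counter


def stable_markers(samples, select_fn, n_rounds=500, frac=0.7, min_hits=450, seed=0):
    """Keep markers selected in more than `min_hits` of `n_rounds` subsampled runs.

    select_fn: hypothetical callable that fits LASSO on a subset of samples
               and returns the markers with non-zero coefficients.
    """
    rng = random.Random(seed)
    hits = Counter()
    subset_size = int(frac * len(samples))
    for _ in range(n_rounds):
        subset = rng.sample(samples, subset_size)
        hits.update(set(select_fn(subset)))
    return sorted(marker for marker, count in hits.items() if count > min_hits)


def toy_select(subset):
    """Toy selector: one marker is always chosen, one only if sample 0 is drawn."""
    return ["OV1", "unstable_marker"] if 0 in subset else ["OV1"]


kept = stable_markers(list(range(100)), toy_select)
print(kept)
```

In the toy run, the second marker is selected in roughly 70% of rounds and therefore falls below the 450-of-500 threshold.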
$$\text{cp-score} = \mathrm{Sigmoid}(-0.995 + 1.678 \cdot \mathrm{OV27} - 2.359 \cdot \mathrm{OV16} + 2.029 \cdot \mathrm{OV56}), \qquad \mathrm{Sigmoid}(x) = \frac{1}{1 + e^{-x}}$$
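The cp-score formula translates directly into code. The sketch below uses the coefficients given above and assumes that each marker value is a methylation level between 0 and 1; the input values in the example are hypothetical.

```python
import math


def cp_score(ov27, ov16, ov56):
    """cp-score from the methylation values of OV27, OV16, and OV56."""
    x = -0.995 + 1.678 * ov27 - 2.359 * ov16 + 2.029 * ov56
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical methylation values, for illustration only.
print(round(cp_score(ov27=0.6, ov16=0.2, ov56=0.5), 3))
```

The signs of the coefficients show the direction of each marker's contribution: higher OV27 and OV56 values raise the score, whereas higher OV16 values lower it.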
cfDNA samples were extracted from plasma and bisulfite converted by using the EZ DNA Methylation-Lightning Kit (Zymo Research, Irvine, CA, USA) according to the manufacturer’s instructions. The subsequent examination and analysis were performed on the QX200 droplet digital PCR system according to the manufacturer’s instructions (Bio-Rad, Pleasanton, California, USA). FAM and HEX fluorophores were used to label the methylated and unmethylated probes, respectively, and the sequences of the probes and primers are indicated in Table S6 . For each reaction, the system and parameters were as follows.
2× ddPCR Supermix for Probes (No dUTP) (Bio-Rad): 10 μL
Primer mix (10 μM): 1.6 μL
Probe mix (10 μM): 0.8 μL
Bisulfite-converted DNA: 0.6 μL
Nuclease-free water (AM9937, Life Technologies Corp.): 0.6 μL
(1) 98°C, 10 min
(2) 98°C, 30 s
(3) 45.7°C, 60 s
(4) Repeat steps 2 and 3 for 39 rounds
(5) 98°C, 10 min
(6) 4°C, 20 min
For both TMC EPIC and targeted EPIC methylation sequencing, differentially methylated CpGs between healthy and tumor samples were identified with DMRfinder (version 0.3) using beta-binomial hierarchical modeling and the Wald test with the significance cutoff of p 0.1. ROC analyses were conducted with the pROC package to assess diagnostic performance. The cd-score between clinical characteristics was evaluated by the Wilcoxon rank-sum test, and a p -value of <0.05 was considered statistically significant. The p -values for performance differences between models, and between the OV1-CA125 combined assay and CA125, were calculated by McNemar’s test.
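McNemar's test compares two classifiers on the same samples by looking only at the discordant pairs, that is, the samples that one method classifies correctly and the other does not. The text does not state which variant of the test was used; the sketch below assumes the exact (binomial) two-sided form, and the counts in the example are hypothetical.

```python
from math import comb


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the two discordant counts.

    b: samples correct under method A only
    c: samples correct under method B only
    """
    n = b + c
    if n == 0:
        return 1.0   # no discordant pairs, so no evidence of a difference
    # Under the null hypothesis each discordant pair favors either method
    # with probability 0.5, so the smaller count follows Binomial(n, 0.5).
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical discordant counts, for illustration only.
print(mcnemar_exact(0, 5))
```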