Construction of a nomogram model for predicting benignity and malignancy in adnexal masses.

OA: gold CC-BY-NC-4.0

Introduction

Malignant ovarian tumors are associated with a low rate of early detection and poor prognosis and rank highest among gynecological malignancies in terms of mortality [ 1 , 2 ]. Some studies [ 3 ] have reported that patients with stage I ovarian cancer have a 5-year survival rate greater than 90%, whereas the rate decreases to 27% in stage III and 13% in stage IV disease. Therefore, timely and accurate discrimination between benign and malignant ovarian masses is critical. Several international expert consensus guidelines and predictive models have been proposed to differentiate benign from malignant ovarian masses [ 4 ]. Ultrasound features such as septations, papillary or solid components, bilateral involvement, ascites, and metastases are generally recognized as key indicators of malignancy. The Ovarian-Adnexal Reporting and Data System (O-RADS) is widely regarded as having strong diagnostic performance. However, the authors’ previous research demonstrated that accurate application of the O-RADS model remains dependent on the subjective experience of ultrasound physicians [ 5 ]. For example, the diagnostic accuracy for typical benign lesions is significantly influenced by operator experience [ 5 ]. Furthermore, implementation of O-RADS may be challenging in primary hospitals with limited diagnostic capacity and a large number of junior physicians, who may require prolonged training to achieve proficiency [ 6 ]. These limitations may hinder broader dissemination and rapid clinical adoption of the model. Previous studies by the authors of this paper indicated that the Risk of Malignancy Index (RMI) is relatively simple to learn and apply. The RMI4 model incorporates four variables: ultrasound score, maximum tumor diameter, menopausal status, and serum carbohydrate antigen 125 (CA125) level [ 7 ]. 
The present research team previously reported excellent interobserver agreement for RMI4, with a kappa value of 0.92 among ultrasound physicians with varying levels of experience [ 8 ]. The diagnostic specificity was high (95.3% for senior physicians and 96.4% for junior physicians). However, the sensitivity and area under the receiver operating characteristic curve (AUC) were suboptimal. Specifically, sensitivity values were 60.7% and 58.6% for senior and junior physicians, respectively, and the AUC was 0.78 in both groups [ 8 ]. RMI4 tended to miss malignant tumors without CA125 elevation and misclassify benign tumors with elevated CA125 levels [ 8 ], which may explain its limited diagnostic performance. Given the limitations of CA125 in distinguishing benign from malignant tumors, researchers have explored combining additional biomarkers with RMI to improve diagnostic performance while maintaining simplicity. In 2003, Hellström first proposed human epididymis protein 4 (HE4) as a potentially useful biomarker for ovarian cancer [ 9 ]. In 2008, Moore et al. [ 10 ] reported that HE4 demonstrated the highest AUC among single tumor markers for ovarian cancer detection. Even in stage I disease, HE4 achieved a sensitivity of 45.9% and an AUC of 0.765. Subsequent studies [ 10 – 13 ] confirmed that HE4 complements CA125, particularly in distinguishing endometriosis in premenopausal women. For example, Moore et al. [ 13 ] reported that 67% of patients with endometriosis had elevated CA125 levels, whereas only 3% had elevated HE4 levels (P<0.001). Abdalla et al. [ 14 ] replaced CA125 with HE4 in the RMI1–4 models, but this modification did not improve diagnostic performance. Lof et al. [ 15 ] combined HE4 with RMI1 and found that the AUC improved compared with RMI alone (0.762 vs. 0.710) but remained lower than HE4 alone (0.762 vs. 0.799). Therefore, the optimal method of incorporating HE4 into the RMI framework remains unclear. 
Accordingly, this study aimed to construct a nomogram based on the diagnostic variables included in RMI4 and the HE4 level. It further compared its diagnostic performance with that of the RMI4 and O-RADS models to develop a more accurate and practical tool for distinguishing benign from malignant ovarian masses.

Results

This study included 794 female patients with a total of 855 ovarian masses. Patients ranged in age from 6 to 83 years. Among the included patients, 594 (74.8%) were premenopausal and 200 (25.2%) were postmenopausal. Overall, 110 (18.5%) of the premenopausal patients and 105 (52.5%) of the postmenopausal patients had malignant ovarian masses. In total, 320 patients (40.3%) had elevated serum CA125 levels, including 231 (72.2%) premenopausal and 89 (27.8%) postmenopausal patients. Elevated serum HE4 levels were observed in 168 patients (21.2%), including 96 (57.1%) premenopausal and 72 (42.9%) postmenopausal patients. The training dataset included 490 patients with 525 masses, the internal validation dataset included 210 patients with 225 masses, and the external validation dataset included 94 patients with 105 masses. Comparisons of baseline characteristics across the three datasets showed no statistically significant differences in ultrasound score, menopausal status, maximum tumor diameter, CA125 value, HE4 value, O-RADS category, or RMI4 score (all P>0.05). However, the maximum tumor diameter differed between the two validation datasets (P<0.05). Ultrasound score, maximum tumor diameter, CA125 and HE4 values, O-RADS category, and RMI4 score were higher in malignant than in benign masses (all P<0.05), and malignant tumors accounted for a larger proportion of masses in postmenopausal patients (P<0.05) ( Table 1 ). Interobserver agreement between the two physicians was high. The kappa values for the RMI4 and O-RADS classifications were κ = 0.91 (95% confidence interval [CI], 0.86 to 0.97) and κ = 0.82 (95% CI, 0.76 to 0.87), respectively. These results indicate good consistency for both models. Among the 855 ovarian masses, postoperative histopathology identified 627 (73.3%) benign lesions and 228 (26.7%) malignant lesions. Mature cystic teratoma was the most common benign lesion, and cystadenocarcinoma was the most common malignant lesion. 
The detailed histopathological classification of the 855 ovarian masses is provided in Table 2. In the variable analyses, ultrasound score, menopausal status, maximum tumor diameter, and serum CA125 and HE4 levels were statistically significant (P<0.05). Multicollinearity diagnostics showed TOL values above 0.10 and VIF values ranging from 1.22 to 2.80 (all <10), indicating no severe multicollinearity among variables included in the model (Table 3). The nomogram constructed using the screened predictors is shown in Fig. 1. The regression equation was as follows: p=–4.865+0.973×U+0.892×M+0.016×S+0.002×CA125+0.009×HE4. In the training dataset, the likelihood ratio test indicated that adding HE4 to the original RMI4 variables in the nomogram significantly improved model fit (χ²=90.544, degrees of freedom=6, P<0.001). The receiver operating characteristic curve for the training dataset is shown in Fig. 2A, with an AUC of 0.912 (95% CI, 0.882 to 0.942). After bootstrap validation, the optimism-corrected AUC was 0.911 (95% CI, 0.886 to 0.935), and the optimism value was 0.001. The calibration curve indicated good agreement between predicted and observed outcomes (Fig. 3A), which was supported by the Hosmer-Lemeshow test (χ²=12.12, P>0.05). The receiver operating characteristic curves for the internal and external validation datasets are shown in Fig. 2B and C, with AUCs of 0.906 (95% CI, 0.859 to 0.953) and 0.949 (95% CI, 0.905 to 0.994), respectively. Calibration curves in both validation datasets also indicated good model fit (Fig. 3B, C). Hosmer-Lemeshow test results were χ²=6.40 and χ²=5.47, respectively, with P>0.05 in both datasets. Performance metrics for the nomogram, RMI4, and O-RADS models are summarized in Table 4. Across the three datasets, the nomogram demonstrated good discriminative ability, as reflected by consistently high AUC values. In the training dataset, the AUC of the nomogram was higher than those of the RMI4 and O-RADS models (P<0.05).
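As an illustration of how the reported regression equation could be turned into a predicted probability, the following is a minimal sketch rather than the authors' implementation. It assumes that the left-hand side of the equation is the log-odds of malignancy (the usual form for a logistic-regression nomogram), which the text does not state explicitly. The coding and units of U, M, S, CA125, and HE4 are also not given here, so the function arguments are hypothetical placeholders. The cut-off of 0.277 is the value reported in the Results; treating a probability equal to the cut-off as malignant is an assumption.

```python
import math

# Coefficients copied from the regression equation reported in the Results.
INTERCEPT = -4.865
COEFFICIENTS = {"U": 0.973, "M": 0.892, "S": 0.016, "CA125": 0.002, "HE4": 0.009}


def linear_predictor(u, m, s, ca125, he4):
    """Right-hand side of the reported equation (assumed to be the log-odds)."""
    values = {"U": u, "M": m, "S": s, "CA125": ca125, "HE4": he4}
    return INTERCEPT + sum(COEFFICIENTS[name] * values[name] for name in COEFFICIENTS)


def malignancy_probability(u, m, s, ca125, he4):
    """Logistic transform of the linear predictor (assumption: logistic model)."""
    z = linear_predictor(u, m, s, ca125, he4)
    return 1.0 / (1.0 + math.exp(-z))


def classify(probability, cutoff=0.277):
    """Dichotomize at the reported cut-off; '>=' at the boundary is an assumption."""
    return "malignant" if probability >= cutoff else "benign"
```

Because every coefficient is positive, the predicted probability under this reading rises with each predictor, which matches the direction of the group differences reported above.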
In the internal validation dataset, the nomogram AUC was higher than that of RMI4 (P<0.05) but did not differ significantly from that of O-RADS (P>0.05). In the external validation dataset, AUCs did not differ significantly among the nomogram, O-RADS, and RMI4 models (P>0.05). Using a cut-off value of 0.277, nomogram sensitivity values in the training, internal validation, and external validation datasets were 79.9%, 71.2%, and 93.3%, respectively, and specificity values were 87.6%, 90.4%, and 86.7%, respectively. Compared with RMI4, the nomogram demonstrated higher sensitivity in all three datasets (P<0.05), with slightly lower specificity (P<0.05 [...] P>0.05). Nomogram specificity was higher than that of O-RADS in the training dataset (P<0.05 [...] P>0.05). In the external validation dataset, sensitivity and specificity did not differ significantly among the three models (P>0.05). Representative model classifications for an ovarian mass are shown in Fig. 4. Decision curves for the nomogram model in the three datasets are shown in Fig. 5. Across the full range of threshold probabilities, the nomogram curve remained farther from the two extreme strategies than the comparator curves, indicating favorable overall decision performance. Overall, the nomogram demonstrated good net benefit and clinical utility across a broad range of threshold probabilities. In the training dataset, the nomogram provided higher net benefit than O-RADS, RMI4, and the two extreme strategies across an approximate threshold range of 5% to 70%. The decision curves for the nomogram and O-RADS intersected at a low-risk threshold, with similar net benefit before the intersection; beyond that point, the nomogram showed higher net benefit than O-RADS. The decision curves for the nomogram and RMI4 intersected at a higher threshold; before the intersection, the nomogram demonstrated higher net benefit than RMI4. In clinical practice, low-to-moderate thresholds (e.g., 10%–40%) are often emphasized.
Within this range, the starting threshold at which the O-RADS and RMI4 curves rose above the “All” line was farther to the right than that of the nomogram. The nomogram advantage was more evident at lower thresholds (0%–40%), suggesting improved sensitivity for identifying high-risk cases and reducing missed diagnoses ( Fig. 5A ). In the internal validation dataset, the training-dataset advantage of the nomogram was replicated, with net benefit across an approximate threshold range of 10% to 70%. The nomogram and O-RADS curves intersected at a moderate-risk threshold. Before this intersection, O-RADS showed slightly higher net benefit; however, after the curves rose above the “All” line, the nomogram curve remained more consistently separated from the “All” line than the O-RADS curve. Beyond the intersection point, the nomogram demonstrated higher net benefit than O-RADS. Compared with RMI4, the nomogram curve did not intersect with the RMI4 curve and maintained higher net benefit across a wider range of threshold probabilities ( Fig. 5B ). In the external validation dataset, the nomogram continued to demonstrate robust performance. The nomogram curve spanned a wider range of threshold probabilities and generally provided the highest net benefit across an approximate threshold range of 5% to 65%. The nomogram and O-RADS curves intersected at a low-risk threshold and showed similar net benefit before the intersection. After this point, the nomogram generally demonstrated higher net benefit than O-RADS across most thresholds, except where the curves intersected again at a moderate-to-high threshold. The nomogram and RMI4 curves intersected at a moderate threshold: before the intersection, the nomogram showed higher net benefit, whereas beyond the intersection the net benefits were similar. Notably, within the clinically relevant decision threshold range of 20%–40%, the nomogram provided the greatest improvement in net benefit ( Fig. 5C ). 
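The net-benefit quantity underlying these decision curves can be sketched as follows. This uses the standard decision curve analysis definition, net benefit = TP/n − (FP/n) × pt/(1 − pt), which is assumed here because the text does not spell out the formula. The function names and the toy data in the usage note are hypothetical and are not study data.

```python
def net_benefit(y_true, y_prob, threshold):
    """Net benefit of calling a mass malignant when predicted risk >= threshold."""
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    # False positives are weighted by the odds of the threshold probability.
    return tp / n - (fp / n) * threshold / (1.0 - threshold)


def net_benefit_treat_all(y_true, threshold):
    """Net benefit of the 'All' strategy (every mass treated as malignant)."""
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1.0 - prevalence) * threshold / (1.0 - threshold)


# The 'None' strategy (no mass treated as malignant) has net benefit 0 by definition.
NET_BENEFIT_TREAT_NONE = 0.0
```

For example, with the toy labels `[1, 1, 0, 0]` and predicted risks `[0.9, 0.6, 0.4, 0.1]`, a threshold of 0.2 gives the model a net benefit of 0.4375 against 0.375 for the "All" strategy. A model curve lying above both the "All" and "None" lines at a given threshold is what the comparisons above describe.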
For O-RADS 4 masses, the diagnostic accuracy of O-RADS and the nomogram was 52.1% (61/117) and 73.5% (86/117), respectively, in the training dataset. In the two validation datasets, corresponding accuracies were 58.8% (20/34) versus 70.6% (24/34) and 47.4% (9/19) versus 78.9% (15/19), respectively.
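The accuracies quoted for O-RADS 4 masses follow directly from the reported counts, as the short check below shows. The counts are copied from the text; the labeling of the two validation datasets as internal and then external assumes that the order matches the rest of the Results, and the dictionary layout is only illustrative.

```python
# Correctly classified O-RADS 4 masses out of the total, as reported in the text.
ORADS4_COUNTS = {
    "training": {"total": 117, "O-RADS": 61, "nomogram": 86},
    "internal validation": {"total": 34, "O-RADS": 20, "nomogram": 24},  # order assumed
    "external validation": {"total": 19, "O-RADS": 9, "nomogram": 15},   # order assumed
}


def accuracy_percent(correct, total):
    """Accuracy as a percentage, rounded to one decimal place."""
    return round(100.0 * correct / total, 1)


for dataset, counts in ORADS4_COUNTS.items():
    orads = accuracy_percent(counts["O-RADS"], counts["total"])
    nomogram = accuracy_percent(counts["nomogram"], counts["total"])
    print(f"{dataset}: O-RADS {orads}%, nomogram {nomogram}%")
```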

Discussion

Current ultrasound-based approaches for malignancy risk stratification of ovarian masses include O-RADS, IOTA, and RMI models [ 20 ]. Key sonographic features incorporated into these models include laterality, cystic versus solid components, septations, papillary projections, ascites, peritoneal nodules, tumor size, and Doppler blood-flow scores. Compared with other approaches, RMI4 is simpler to apply. Accordingly, the RMI4 variables were selected (ultrasound score, menopausal status, maximum diameter of the mass, and serum CA125 level) as core predictors for constructing the nomogram. RMI4 does not incorporate blood-flow assessment but includes CA125. In most patients with an ovarian mass, CA125 is measured preoperatively. CA125 is widely used to monitor epithelial ovarian cancer [ 21 ]; however, in clinical practice it is limited by substantial false-positive and false-negative results. The U.S. Preventive Services Task Force has noted that CA125 should not be used as a standalone test for ovarian cancer detection [ 22 ]. By contrast, HE4 is a useful biomarker for identifying malignant ovarian masses. Hamed et al. reported that HE4 is more sensitive and specific for early-stage ovarian cancer and that combining HE4 with CA125 may improve diagnostic sensitivity [ 23 ]. In the present study, HE4 levels differed between malignant and benign ovarian masses, consistent with the findings of Braicu et al. [ 24 ]. Compared with CA125, HE4 is less likely to be elevated in benign lesions such as endometriosis or adenomyosis [ 25 ]. Multiple studies, including those by Moore et al. [ 10 ] and Nolen et al. [ 26 ], have supported the complementary role of CA125 and HE4 in distinguishing pathological subtypes and stages of malignant ovarian masses. This study implemented a modeling strategy based on prior clinical knowledge and combined key ultrasound features with CA125 and HE4 to construct a nomogram. 
Before model development, these variables were identified as independent risk factors associated with malignancy in ovarian masses in the present dataset. Multicollinearity diagnostics indicated no severe multicollinearity among included predictors. In the training dataset, the likelihood ratio test showed that adding HE4 significantly improved nomogram performance beyond the fixed core variables (ultrasound score, menopausal status, maximum diameter of the mass, and serum CA125 level). In addition, the nomogram AUC was higher than that of RMI4 in the training dataset. Together, these results support an incremental diagnostic contribution of HE4 in this modeling framework. Previous studies have reported that O-RADS achieves higher AUC and sensitivity than RMI4 but lower specificity [ 8 ], which is consistent with this study’s findings. Across the three datasets, the nomogram outperformed RMI4 in AUC and sensitivity and demonstrated comparable or slightly better performance than O-RADS. Although the nomogram specificity was slightly lower than that of RMI4, its higher sensitivity may reduce missed diagnoses. These findings suggest that incorporating HE4 alongside CA125 improves the identification of malignant ovarian masses within an RMI4-based framework. In the external validation dataset, AUC, sensitivity, and specificity were similar across the three models. This may be related to the smaller external sample size and to differences in the distribution of malignant tumors without CA125 elevation and benign tumors with CA125 elevation compared with the other two datasets, consistent with the authors’ previous findings [ 7 ]. Calibration results indicated good agreement between predicted and observed outcomes in the training and internal validation datasets. Although the model showed a slight lack of fit in the external validation dataset, this difference was not statistically significant. Overall, the nomogram demonstrated acceptable calibration across datasets. 
DCA indicated that the nomogram provided higher net benefit than both O-RADS and RMI4 across a broad threshold range, although the decision-curve patterns differed somewhat across datasets. The nomogram demonstrated clinical usefulness across an approximate threshold range of 10%–65%, suggesting applicability to clinicians with varying decision thresholds. Notably, within commonly used decision thresholds (typically 0%–40%), the nomogram net benefit curve remained consistently higher than those of O-RADS and RMI4, indicating greater net benefit in routine decision-making. However, at higher thresholds (>60%), net benefits across models converged. The overall consistency of the DCA curves across datasets suggests limited overfitting and supports model generalizability, providing strong evidence supporting future clinical adoption. For O-RADS 4 masses, the estimated malignancy risk ranges from 10% to 50% [ 19 ], reflecting a broad intermediate-risk category. While O-RADS provides standardized management recommendations by risk category, accurate discrimination between benign and malignant masses remains important to streamline clinical pathways. For O-RADS 4 masses, O-RADS recommends considering menopausal status, expert ultrasound assessment, magnetic resonance imaging features, and serum biomarkers to guide referral to a gynecological oncologist. In the present study, the nomogram achieved higher diagnostic accuracy than O-RADS for O-RADS 4 masses, which may support more rapid and accurate clinical decision-making. In addition, the ultrasound variables included in the nomogram are relatively straightforward and may be easier to apply than the full O-RADS lexicon, particularly for less experienced ultrasound physicians and in primary-care settings. This study has limitations. The nomogram was developed primarily using data from one medical center and was externally validated in only one additional center. 
In addition, the external validation sample size was relatively small. Further validation in multicenter studies with larger samples is needed. In summary, the nomogram demonstrated higher diagnostic performance than RMI4 and was comparable to O-RADS, while offering simpler application than O-RADS and less dependence on physician experience. Considering both diagnostic performance and net benefit, the nomogram may serve as a practical tool to support less experienced ultrasound physicians and facilitate broader implementation for predicting the benign or malignant nature of ovarian tumors.

Materials and Methods

This study was approved by the Clinical Research Ethics Committees of the Second Xiangya Hospital and the Third Xiangya Hospital of Central South University (No. 2021-038 and 2022-056). The requirement for informed consent was waived. This study included 794 female patients who underwent surgery for ovarian masses at two medical centers at Central South University between January 2017 and September 2024. Among them, 700 patients were treated at center 1 (the Second Xiangya Hospital) and 94 at center 2 (the Third Xiangya Hospital). Patients from center 1 were randomly divided into training and internal validation datasets at a ratio of 7:3. Patients from center 2 comprised the external validation dataset. The inclusion criteria were as follows: (1) patients who underwent surgical treatment at one of the two medical centers, had complete clinical data, and had definitive histopathological results; and (2) patients who underwent a comprehensive preoperative ultrasound examination (abdominal ultrasound combined with transvaginal or transrectal ultrasound) with adequate image quality. The exclusion criteria were as follows: (1) pregnancy complicated by ovarian masses; (2) receipt of antitumor therapy (chemotherapy, radiotherapy, hormone therapy, or targeted therapy) before surgery; (3) coexistence of ovarian malignancy and other primary malignancies (excluding metastatic ovarian tumors); (4) severe hepatic or renal dysfunction, including liver or kidney failure; (5) acute infection; (6) coagulation disorders; and (7) autoimmune diseases. For patients with bilateral lesions, if both sides had identical pathological diagnoses, both masses were included. If the pathological diagnoses differed, only the lesion with the higher malignant potential was included. This study included samples from center 1 that were also used in the authors’ previous publications [ 5 , 8 ], in addition to newly collected cases from centers 1 and 2. 
The overlapping samples from center 1 contributed to prior publications [ 5 , 8 ], whereas the newly added cases have not been previously published. The following data were collected: age, menopausal status, ultrasonographic characteristics of the ovarian masses, serum CA125 and HE4 levels, surgical approach, and detailed histopathological results. Postmenopausal status was defined as meeting any of the following criteria: age ≥50 years, history of hysterectomy, or amenorrhea for ≥1 year. Elevated CA125 was defined as ≥35 U/mL [ 16 ]. Elevated HE4 was defined as ≥70 pmol/L for premenopausal patients and ≥140 pmol/L for postmenopausal patients [ 17 ]. All ultrasound images were acquired by physicians with more than 5 years of experience in gynecological ultrasonography using 9–15 MHz intracavitary transducers. The ultrasound systems included GE Voluson S6, GE E8, and GE E10 (GE Healthcare, Milwaukee, WI, USA), SonoScape P60 (SonoScape, Shenzhen, China), and Mindray Resona R7 (Mindray Medical, Shenzhen, China). Lesions were described in accordance with the 2000 International Ovarian Tumor Analysis (IOTA) consensus on ultrasound terminology for ovarian masses [ 18 ]. For each mass, the following features were recorded: unilateral/bilateral involvement; cystic/solid components; shape; margin; cyst wall thickness; cyst wall regularity/irregularity; presence/absence of acoustic shadows; maximum diameter of the mass; papillary projections; septations; ascites; peritoneal or pelvic wall–implanted nodules; and color Doppler blood-flow score. The modeling strategy was based on prior clinical knowledge. When incorporating variables, data-driven selection or exclusion was not applied to the four core RMI4 variables (ultrasound score, menopausal status, maximum diameter of the mass, and CA125 value). These variables have been extensively validated and were included as a priori, mandatory components of the model. 
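The categorical definitions given above (postmenopausal status, elevated CA125, and elevated HE4) can be written as simple rules. The sketch below encodes only the thresholds stated in the text; the function and parameter names are hypothetical.

```python
def is_postmenopausal(age_years, had_hysterectomy, amenorrhea_years):
    """Postmenopausal if ANY criterion holds: age >= 50 years, prior hysterectomy,
    or amenorrhea for >= 1 year."""
    return age_years >= 50 or had_hysterectomy or amenorrhea_years >= 1


def ca125_elevated(ca125_u_per_ml):
    """Elevated CA125: >= 35 U/mL."""
    return ca125_u_per_ml >= 35


def he4_elevated(he4_pmol_per_l, postmenopausal):
    """Elevated HE4: >= 70 pmol/L if premenopausal, >= 140 pmol/L if postmenopausal."""
    threshold = 140 if postmenopausal else 70
    return he4_pmol_per_l >= threshold
```

Because the criteria are joined by "any", a patient aged 50 years or older is classified as postmenopausal under this definition regardless of menstrual history.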
The incremental value of the newly added variable (HE4) was evaluated against this benchmark model with fixed core variables. Using the training dataset, a nomogram was developed to predict malignancy in ovarian masses. Bootstrap internal validation was then performed in the training dataset. Finally, the model was validated in the two validation datasets to assess generalizability. An ultrasound physician with more than 2 years of experience in gynecological ultrasonography first learned the theoretical framework for both the RMI4 and O-RADS models and then applied each model to 200 ovarian-mass cases for practice. This physician and a second physician with 10 years of experience—who had already mastered the application rules for both models—subsequently used both models to classify 40 randomly selected ovarian masses. Interobserver agreement between the two physicians was assessed. After satisfactory agreement was achieved, the junior physician applied the RMI4 and O-RADS models to all ovarian masses included in this study. The 240 cases used for practice and agreement assessment were not included in the final analysis. The RMI4 cut-off value was 450: masses with an RMI4 score<450 were considered benign, whereas those with an RMI4 score≥450 were classified as malignant [ 6 ]. For O-RADS [ 19 ], masses categorized as O-RADS 1–3 were considered benign, and those categorized as O-RADS 4–5 were classified as malignant. All model-based classifications were compared with postoperative histopathological results. Histopathological diagnoses were categorized according to the World Health Organization classification of female genital tumors. Statistical analyses were performed using R version 4.3.2 (R Foundation for Statistical Computing, Vienna, Austria) and SPSS version 26.0 (IBM Corp., Armonk, NY, USA). Unordered categorical variables were compared using the chi-square test, and ordered categorical variables were compared using the Mann-Whitney U test. 
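The dichotomization rules for the two comparator models can be sketched as below. The cut-offs (RMI4 ≥450 and O-RADS 4–5 classified as malignant) are taken from the text. The product form of the RMI4 score (U × M × S × CA125) is the form commonly described in the RMI literature and is an assumption here, since this text only lists the four variables. The function expects U, M, and S already coded as RMI factors, and that coding is likewise not specified in this text.

```python
RMI4_CUTOFF = 450  # cut-off stated in the Methods


def rmi4_score(u, m, s, ca125):
    """RMI4 as the product U x M x S x CA125 (assumed form; inputs pre-coded)."""
    return u * m * s * ca125


def classify_rmi4(score):
    """RMI4 score < 450 is considered benign; >= 450 is classified as malignant."""
    return "malignant" if score >= RMI4_CUTOFF else "benign"


def classify_orads(category):
    """O-RADS 1-3 are considered benign; O-RADS 4-5 are classified as malignant."""
    if category not in (1, 2, 3, 4, 5):
        raise ValueError("only O-RADS categories 1-5 are handled in this sketch")
    return "malignant" if category >= 4 else "benign"
```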
Continuous variables with non-normal distributions were compared using the Kruskal-Wallis H test. Univariable and multivariable analyses were conducted to identify risk factors. Multicollinearity was assessed among all included independent variables to ensure stability of the regression estimates. Tolerance (TOL) and variance inflation factor (VIF) were used as diagnostic indices; severe multicollinearity was defined as TOL<0.10 or VIF>10. A likelihood ratio test was performed to evaluate the contribution of newly added variables. Diagnostic performance of the nomogram was quantified using the AUC. The model cut-off value was selected based on the maximum Youden index, while also considering clinical applicability. Bootstrap validation (≥1,000 resamples) was performed for the final model to estimate the optimism-corrected AUC and optimism value. Model calibration was assessed using calibration curves and the Hosmer-Lemeshow test. Decision curve analysis (DCA) was conducted to quantify the clinical net benefit of the nomogram, and the clinically meaningful threshold range was described. The McNemar test was used to compare sensitivity and specificity. AUCs across diagnostic models were compared using the DeLong test. Interobserver agreement was assessed using kappa statistics, with κ≥0.75 indicating high agreement, 0.40≤κ<0.75 indicating moderate agreement, and κ<0.40 indicating low agreement. In all analyses, P<0.05 was considered statistically significant.
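Cut-off selection by the maximum Youden index (J = sensitivity + specificity − 1) can be sketched as follows. This is a generic illustration in plain Python, not the R or SPSS routine used in the study, and the toy data in the usage note are hypothetical. Because the text notes that clinical applicability was also considered, a purely data-driven search like this one would not necessarily reproduce the reported cut-off.

```python
def sensitivity_specificity(y_true, y_prob, cutoff):
    """Sensitivity and specificity when predicted risk >= cutoff is called malignant."""
    tp = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p >= cutoff)
    fn = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p < cutoff)
    tn = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p < cutoff)
    fp = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p >= cutoff)
    return tp / (tp + fn), tn / (tn + fp)


def youden_cutoff(y_true, y_prob):
    """Return (cutoff, J) maximizing the Youden index over the observed risks.

    Ties are resolved in favor of the lowest cutoff (an arbitrary choice).
    """
    best_cutoff, best_j = None, float("-inf")
    for cutoff in sorted(set(y_prob)):
        sens, spec = sensitivity_specificity(y_true, y_prob, cutoff)
        j = sens + spec - 1.0
        if j > best_j:
            best_cutoff, best_j = cutoff, j
    return best_cutoff, best_j
```

With the toy labels `[0, 0, 1, 0, 1, 1, 1]` and risks `[0.1, 0.2, 0.35, 0.4, 0.5, 0.7, 0.9]`, the search selects a cut-off of 0.5 with J = 0.75 (sensitivity 0.75, specificity 1.0).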
