Intro
Ovarian Hyperstimulation Syndrome (OHSS) remains a clinically significant complication of controlled ovarian hyperstimulation (COH) in assisted reproductive technology (ART) ( 1 , 2 ). The overall incidence of mild OHSS ranges from 20% to 30%, while moderate-to-severe cases occur in 1% to 6% of high-risk populations ( 3 – 5 ). Clinical manifestations range from mild abdominal distension, nausea, and ovarian enlargement to severe sequelae including massive ascites, pleural effusion, and thromboembolic events ( 6 , 7 ). This potentially life-threatening condition is pathologically characterized by systemic capillary hyperpermeability, leading to hemoconcentration and third-space fluid sequestration, which frequently necessitates intensive care admission in severe cases ( 8 , 9 ). The syndrome not only affects patient’s health, but also results cycle cancellation, prolongs treatment, increases healthcare costs, and inflicts substantial psychological distress ( 10 ). Current clinical strategies for OHSS prediction primarily relies on clinicians’ experiential judgment and conventional biomarkers [e.g., serum Anti-Müllerian Hormone (AMH), antral follicle count (AFC)], patient history [e.g., previous OHSS, polycystic ovary syndrome (PCOS)], and stimulation response monitoring ( 11 ). However, the pathophysiological complexity of OHSS involves non-linear interactions between genetic predisposition, angiogenic factors (notably VEGF and IL-6), gonadotropin receptor polymorphisms, and iatrogenic hCG exposure, with emerging evidence implicating specific immune cell phenotypes, plasma metabolites, and novel circulating proteins (such as SBSN, CD244, ENO3, and TXNDC12) in its pathogenesis ( 12 – 16 ). Traditional statistical methods like univariate or multivariate logistic regression are often based on linear assumptions and limited variable combinations, struggling to capture non-linear relationships, complex interactions within clinical data, thereby restricting predictive accuracy and sensitivity, and failing to effectively identify all high-risk patients ( 17 , 18 ).
The application of multiple machine learning (ML) models offers a transformative paradigm for enhancing OHSS prediction ( 19 , 20 ). ML algorithms excel at identifying intricate, non-linear patterns within large, multidimensional datasets, surpassing the analytical capacity of conventional statistics ( 17 , 18 ). The novelty of employing multiple ML models [e.g., Random Forest, eXtreme Gradient Boosting (Xgboost), etc.] has advantages ( 19 , 20 ). The theoretical justification is robust, evidenced by ML’s proven superiority over traditional models in predicting complex medical outcomes across diverse fields ( 17 , 20 , 21 ). However, ML models are often perceived as “black boxes”, with opaque decision-making processes that pose significant barriers to clinical implementation. SHapley Additive exPlanations (SHAP), a game theory-based approach, addresses this limitation by quantifying the contribution of each feature to individual predictions, thereby providing transparent, patient-level risk explanations ( 22 , 23 ). The integration of high-performance ML models with SHAP interpretability represents a promising strategy for developing clinically actionable risk stratification tools.
Therefore, we designed a comprehensive study employing three distinct ML models to predict OHSS risk in infertile patients undergoing COH based solely on readily available baseline clinical indicators, and to achieve model interpretation through SHAP methodology. This multi-model strategy aims to identify key predictive factors and provide clinicians with scientific evidence to support post-retrieval risk stratification for timely preventive interventions to mitigate OHSS-related morbidity.
Results
This study enrolled 500 infertility patients who met the inclusion and exclusion criteria. Data on 6 clinical features (type of infertility, etiology of infertility, age, infertility duration, previous IVF cycles, and BMI) and 19 laboratory parameters were collected. The missing rate for all variables was below 5%, with missing data addressed through multiple imputations. The dataset was randomly split into training (n=350) and test (n=150) sets ( Figure 1 ). Univariate logistic regression was first applied to the 25 variables within the training set. Baseline characteristics of the training set were compared between OHSS and NOHSS groups ( Table 1 ), revealing no statistically significant differences (P > 0.05) in type of infertility, infertility duration, previous IVF cycles, or BMI. However, significant differences (P < 0.05) were observed in 2 clinical features (age and infertility duration), and 9 laboratory parameters (baseline FSH level, ratio of baseline LH/FSH, AMH level, FSH level on the first day of ovulation induction, P and E 2 level on trigger day, number of follicles diameter ≥ 14mm, oocytes retrieved, and mature oocytes) between OHSS and NOHSS patients. Variables with P < 0.05 were retained as candidate set A (11 features). Separately, LASSO regression was applied to the same 25 variables, strictly within the training set. The penalty parameter λ was tuned with an inner ten-fold cross-validation loop, and the value that yielded the minimum mean cross-validated deviance (λ =1.1883757) was retained. At this λ, only four coefficients remained non-zero: Age, P levels on the trigger day, number of oocytes retrieved, and ratio of baseline LH/FSH, forming candidate set B. The shrinkage profiles of all 25 variables are displayed in Figures 2A, B , where the vertical line marks the chosen λ. To select the minimal set of features for OHSS, the final set of features for model construction was taken as the intersection of set A (univariate-selected) and set B (LASSO-selected). The final set of predictors for model construction was defined as the intersection of set A and set B, which resulted in the same four variables. These were subsequently carried forward for final model fitting and evaluation.
Characteristics of the patients in training set.
(A) Trajectory plot showing the selection process of variables from the dataset using LASSO regression for multivariate analysis; (B) Confidence intervals for each lambda value during LASSO regression.
The discriminative ability of the models was assessed using the C-index. All three models demonstrated acceptable discriminative performance (C-index > 0.80). Using bootstrap resampling with 1000 iterations, the optimism-corrected C-index was 0.90 (95% CI 0.81-0.97) ( Supplementary Figure S1B ) for the Random Forest model, 0.88 (95% CI 0.76-0.97) for the Xgboost model ( Supplementary Figure S1C ), and 0.81 (95% CI 0.68-0.91) for the Logistic model ( Supplementary Figure S1A ).The RandomForest model demonstrated the highest discrimination.
On the training set ( Figure 3A ), Random Forest achieved perfect AUC score of 1.00 (95% CI: 1.00-1.00). Xgboost also performed excellently with AUC of 0.94 (95% CI: 0.90-0.98). They were followed by Logistic with AUC of (95% CI: 0.76-0.91)). On the test set ( Figure 3B ), model discrimination was as follows: Random Forest, AUC 0.81 (95% CI: 0.68-0.95); Logistic, AUC 0.78 (95% CI: 0.65-0.92); and XGBoost, AUC 0.77 (95% CI: 0.64-0.89).
(A) ROC curves for each model in the training set; (B) ROC curves for each model in the test set. In the image, Logistic, logistic regression; RandomForest; Xgboost, extreme gradient boosting; ROC, receiver operating characteristic; AUC, area under the curve.
The observed overfitting phenomenon can be attributed primarily to class imbalance within the dataset, where the minority class (OHSS) was significantly underrepresented compared to the majority class (NOHSS). To address this issue, the SMOTE was applied exclusively to the training set to generate synthetic samples of the minority class and balance the class distribution. However, while SMOTE effectively mitigated imbalance-related bias during model training, it concurrently introduced artificial data points that amplified model complexity and fostered over-optimistic performance estimates on the training set. This manifested as near-perfect apparent discrimination on the training set (AUC 0.83-1.00 across models). However, the optimism-corrected C-index from bootstrap resampling (0.81-0.90) revealed substantially lower performance. This discrepancy indicates that the apparent AUC was inflated by the SMOTE-augmented training distribution, whereas the bootstrap C-index, derived from out-of-bag original samples, provided a more realistic estimate of discriminative capacity. These findings underscore the critical importance of independent test set evaluation in ML model development, as training set metrics alone, including those derived from resampled data, proved insufficient to ascertain true predictive capability.
Table 2 summarizes detailed metrics for both training and test sets. On the training set, Random Forest achieved perfect scores (1.00) across all metrics-accuracy, sensitivity, specificity, precision, and F1 score. Xgboost attained F1 score (0.88), along with an accuracy of 0.88, sensitivity of 0.88, specificity of 0.89, and precision of 0.89. Logistic reached 0.98 sensitivity, along with an accuracy of 0.79, specificity of 0.59, precision of 0.71, and F1 score of 0.82. In the unseen test set, performance varied among models. RandomForest attained the highest F1 score (0.37), along with an accuracy of 0.79, sensitivity of 0.75, specificity of 0.80, and precision of 0.24.Xgboost and Logistic reached 0.67 sensitivity, along with an accuracy of 0.77, specificity of 0.67, precision of 0.21, and F1 score of 0.32.
Comparison of each model’s performance in terms of accuracy, sensitivity, specificity, precision and F1 in the training set and test set.
On the training set, Random Forest model exhibited relatively smooth curves close to the diagonal line. In the low predicted probability range, the Random Forest and Logistic models performed reasonably well. However, on the test set, all models deviated significantly from the diagonal, indicating a substantial deterioration in calibration performance. In the low predicted probability range (0-25%), the Random Forest model performed relatively well ( Figures 4A, B ). In terms of clinical applicability, the training set DCA curves demonstrated strong net benefit; however, performance on the test set was more limited ( Figures 5A, B ). Despite this, within the clinically relevant risk threshold range of 1-10%, where OHSS incidence is most concentrated, the RandomForest model exhibited competitive and robust net benefit, making them the most suitable choices for post-retrieval OHSS risk stratification ( Figure 5B ).
(A) Calibration curves for each model in the training set; (B) Calibration curves for each model in the test set.
(A) DCA, Decision curve analysis curves for each model in the training set; (B) DCA curves for each model in the test set.
Bootstrap feature selection stability analysis (1000 iterations) revealed heterogeneous reliability among the candidate predictors. The number of oocytes retrieved demonstrated the highest stability (mean rank 1.84 ± 0.99, selection frequency 0.914), retained in the top three across 91.4% of resamples. Ratio of baseline LH/FSH showed high stability (mean rank 2.29 ± 1.08, selection frequency 0.819). Age showed moderate stability (mean rank 2.69 ± 1.09, selection frequency 0.695), while P levels on the trigger day exhibited low stability (mean rank 3.18 ± 0.84, selection frequency 0.572), with rankings varying substantially across bootstrap samples ( Table 3 ). These findings indicate that the predictive signal is concentrated in a limited subset of baseline variables, with the number of oocytes retrieved and ratio of baseline LH/FSH representing the most robust and reproducible predictors of post-retrieval OHSS risk stratification.
Bootstrap feature selection stability analysis (n=1000 iterations).
In order to visually explain the selected variables, we used SHAP to illustrate how these variables affect the OHSS risk in the RandomForest model. Figure 6A shows the top 4 risk variables evaluated by the average absolute SHAP value. Higher SHAP value indicates greater contribution to the OHSS risk. Figure 6B uses a bee swarm plot to show how each feature influences the model’s predictions. The points are color-coded, with purple representing low risk and yellow representing high risk, making the relationship between a feature’s value and its impact easy to see. Among all features, number of oocytes retrieved has the strongest influence. Higher values of this feature (shown in yellow) correspond to a strong positive effect on the prediction, meaning they are associated with higher OHSS risk. Age is another important feature, where a higher value (older age) generally lowers the predicted risk. P levels on the trigger day also contribute to the model’s output. Figures 6C–E provided another view with scatter plots, confirming these relationships. An increase in number of oocytes retrieved significantly raises the OHSS risk. In contrast, age shows a negative correlation with the prediction. P levels on the trigger day demonstrates a non-linear relationship with OHSS risk, where low P levels (2.0 ng/mL) paradoxically demonstrates negative contributions. This analysis helps clarify the role of each feature under different conditions, improving the model’s interpretability and providing clinical insights for OHSS risk assessment.
Interpretability analysis of RandomForest model. (A) Importance ranking plot of features of the RandomForest model. (B) SHAP dendrogram of features of the RandomForest model. (C–E) Scatter plots of SHAP values for Top 3 features.
Figure 7A , B are waterfall plots that break down the prediction for a single patient. Each bar represents a feature’s contribution: purple for variables that decrease risk and yellow for those that increase it. In the low-risk patient example ( Figure 7A ), the model identified age of 35 years (SHAP value: -0.11) as the principal variables reduce OHSS risk. The final prediction score of 0.00 was significantly lower than the mean prediction value of 0.23 indicating low OHSS risk. Conversely, in the high-risk patient example ( Figure 7B ), number of oocytes retrieved of 15 emerged as the predominant risk-enhancing factor (SHAP value: +0.327). Additional contributors included P levels on the trigger day of 0.736 pg/mL (SHAP value: +0.21) and younger age of 28 years (SHAP value: +0.135). The prediction score of 1.00 significantly higher than the mean value, indicating high OHSS risk.
Local model explanation by the SHAP method. (A) Waterfall plot for one NOHSS patient. (B) Waterfall plot for one OHSS patient.
Discussion
This study exploratorily developed and internally validated three ML models for post-retrieval OHSS risk stratification in patients undergoing ART treatment. In the test set, the RandomForest model demonstrated optimal discriminatory performance, achieving an AUC of 0.81, sensitivity of 0.75, and specificity of 0.80. Furthermore, LASSO and SHAP analysis enhanced model transparency by identifying the number of oocytes retrieved, age, P levels on the trigger day, and ratio of baseline LH/FSH as the most influential predictors, thereby providing clinically interpretable insights.
Notably, the RandomForest model achieved 0.80 sensitivity, indicating that the vast majority of patients who actually developed OHSS were correctly identified, with minimal proportion of missed evaluation. This is particularly crucial in the clinical context of OHSS risk stratification, where even a small number of false negatives may result in high-risk patients not receiving necessary preventive measures. Although specificity was relatively lower (0.75), implying that some low-risk patients may be over-warned, such over-warning carries substantially lower clinical costs compared to missed evaluation in practice. The importance of accurate OHSS risk stratification is further underscored by its significant impact on pregnancy outcomes. A 2021 meta-analysis demonstrated that OHSS patients experience adverse pregnancy outcomes 8.8 more frequently than NOHSS patients ( 30 ). Consistent with these findings, previous studies have also reported miscarriage rates of 12.2-29.8% in OHSS-affected pregnancies ( 7 , 31 , 32 ), as well as higher rates of pre-eclampsia and preterm birth relative to NOHSS IVF controls ( 32 , 33 ). Through further clinical assessment and stratified management (such as cycle segmentation, GnRH antagonist protocol modifications, or low-dose hCG triggers), these false-positive patients can be reclassified without delaying treatment or causing substantial harm ( 34 ).
SHAP analysis revealed that the number of oocytes retrieved, age, and P levels on the trigger day were the three most important features for predicting OHSS risk at the population level ( 35 – 37 ). This finding is highly consistent with existing literature and pathophysiological mechanisms. The number of oocytes retrieved directly reflects ovarian response to gonadotropin stimulation and serves as the most direct indicator of OHSS risk; large numbers of retrieved oocytes lead to significant ovarian enlargement and substantial secretion of vasoactive substances such as VEGF, thereby increasing vascular permeability-the central pathogenic mechanism of OHSS ( 15 , 38 ). Age functions as a negative predictor, closely related to ovarian reserve and responsiveness-younger patients typically possess better ovarian reserve and more sensitive responses to stimulation, thus carrying higher OHSS risk ( 37 ). P levels on the trigger day, as a biomarker of follicular luteinization and ovarian hyper-response, predict extensive granulosa cell maturation and premature progesterone elevation at high levels, positively correlating with OHSS risk ( 39 ). However, our research showed that the combination of extremely low P (2.0 ng/mL) paradoxically lose predictive value, suggesting an “optimal interval” effect for P-level-based OHSS risk stratification. Beyond population level interpretations, SHAP analysis enables individualized risk stratification by revealing how feature contributions vary across patients. For example, Figure 7 illustrates two contrasting cases: Patient A exhibited minimal SHAP values for oocyte-related variables, with age driving the prediction instead, whereas Patient B showed dominant contributions from retrieved oocyte counts. This heterogeneity demonstrates that the model does not apply uniform rules to all patients but dynamically adjusts its predictive logic based on individual feature profiles. Such capability is clinically valuable for identifying patient specific risk patterns. Thus, SHAP-enhanced interpretability transforms the model from a “black box” predictor into a clinically actionable tool for personalized OHSS prevention strategies.
This study possesses several key strengths. First, the comprehensive comparison of three ML algorithms identified RandomForest as particularly effective for this risk stratification. Second, the integration of SHAP addresses the critical barrier of model interpretability, rendering model logic accessible to clinicians while enabling intuitive understanding of each patient’s risk composition. For instance, young patients with high AMH and number of oocyte retrieved has higher risk scores with clear factor contributions, whereas older patients with diminished ovarian reserve maintain relatively low predicted risk despite higher gonadotropin doses - facilitating transparent risk communication and collaborative decision-making. Third, the use of readily available clinical variables enhances generalizability to routine practice, these tools might eventually serve as hypothesis-generating frameworks to inform-but not replace-clinical judgment in post-retrieval risk stratification, potentially informing consideration of preventive interventions (e.g., cancellation of fresh embryo transfer, additional use of GnRH antagonists, albumin supplementation, or deferred embryo transfer) based on actual oocyte numbers.
This study has several limitations. First, the relatively low OHSS incidence (8%) resulted in class imbalance, which may have influenced model calibration and contributed to the modest F1 score (0.37). While our random oversampling during training partially mitigated this, future studies should consider advanced resampling techniques or synthetic data generation. Second, the small sample size, particularly with only 40 OHSS-positive cases, substantially limited model stability and generalizability. Future studies should prioritize multi-center recruitment to achieve adequate sample size and event counts. Third, this study employed a single-center, retrospective design, susceptible to selection and information bias. The lack of external validation means that our findings may not generalize to other settings. Future work must prioritize external validation using the following concrete strategies: (1) temporal validation using data from a later period at the same center to assess model stability over time; (2) geographic validation across two or more independent centers with different patient demographics and IVF protocols; and (3) prospective validation in a separate cohort to evaluate real-world performance. Only after successfully passing these external validation steps could the model be considered for clinical implementation. Fourth, model specificity requires improvement; future work may reduce false-positive rates and enhance prediction precision through feature engineering (incorporating additional biomarkers, ultrasound parameters) and algorithm optimization (cost-sensitive learning, threshold adjustment). Moreover, there remains a gap between prediction and clinical action: the model can identify high-risk patients, but how to translate this prediction into specific preventive interventions (e.g., individualized gonadotropin dosing, cycle cancellation, or elective freeze-all) requires further implementation research.
Furthermore, this study did not assess intervention effectiveness, whether clinical decision adjustments based on model predictions can actually reduce OHSS incidence and improve patient outcomes. Conducting randomized controlled trials comparing ML-guided individualized protocols versus standard protocols will be a critical step in validating the clinical value of these models.
Conclusions
In summary, the RandomForest-based ML model developed in this study stratify OHSS risk in ART patients using routinely collected clinical data, demonstrating high sensitivity and good discriminatory capacity. The integration of SHAP analysis provides explicit, individualized risk interpretation, helping clinicians intuitively understand the impact of key predictive features such as number of oocytes retrieved, age, and P levels on the trigger day. This tool shows promise as an important component of risk stratification support systems, identifying high-risk patients, generating hypotheses for future research. Ultimately, it may help reduce the incidence and severity of OHSS, thereby improving the safety and efficiency of ART treatment, although these findings are exploratory, prospective external validation remains warranted before any clinical application.
Materials|Methods
A retrospective cohort study was conducted at Jinxin Xinan Women & Children Hospital from January 2020 to December 2023. The study population comprised patients undergoing IVF or ICSI cycles who received either a single hCG trigger, a single GnRH agonist (GnRHa) trigger, or a dual trigger (hCG plus GnRHa) for COH. Participants were included when they demonstrated preserved ovarian reserve, defined by basal FSH ≤15 IU/L and E 2 ≤60 pg/mL, with anatomically normal reproductive organs verified through hysterosalpingographic or hysteroscopic assessment. Participants were excluded if they presented a history of previous poor ovarian response, gonadotropin treatment within 3 months before the study, hydrosalpinx, ovarian endometrioma, or relevant uterine disorders (adenomyosis, congenital malformations, endometrial polyps, intrauterine adhesions), as well as parental chromosomal abnormalities. Cycles initially planned for IUI and later converted to IVF, and fertility preservation for malignancy were also disqualified.The study followed the TRIPOD guideline ( 24 ). Further details are available in Supplementary Table 1 .
OHSS was diagnosed based on the 2024 American Society for Reproductive Medicine (ASRM) guideline on Prevention of moderate and severe OHSS: a guideline ( 18 , 25 ).
Classical statistical modelling was originally designed for datasets with a few dozen input variables and small to moderate sample sizes. As the number of variables per subject increases, statistical inferences become less precise. ML algorithms can effectively capture complex relationships in high-dimensional data, even when the data is gathered without a carefully controlled experimental design ( 26 ). These capabilities have demonstrated excellent outcomes across diverse disciplines ( 17 , 26 ). In the present study, we developed a predictive model for OHSS using relevant variables. Key features associated with OHSS risk were identified through Least Absolute Shrinkage and Selection Operator (LASSO) logistic regression analysis. Subsequently, three ML algorithms: Logistic Regression (Logistic), Random Forest, and Xgboost, were trained and optimized using bootstrap resampling (1000 iterations). Model performance was assessed using multiple metrics, such as the area under the receiver operating characteristic curve (AUC-ROC), accuracy, sensitivity, specificity, precision, F1 score, calibration curves, and decision curve analysis.
The dataset was first randomly splited into training and test sets in a 70%:30% ratio ( Figure 1 ). Feature selection was conducted strictly within the training set using two parallel approaches. First, univariate logistic regression on the 25 variables identified features with P < 0.05 (candidate set A). Second, LASSO regression was applied to the original 25 variables, with the penalty parameter optimized via 10−fold inner cross−validation (minimum mean cross−validated deviance; λ), yielding non−zero coefficient features (candidate set B). The final set of predictors for model construction was defined as the intersection of set A and set B. All feature selection steps were completed before any evaluation on the validation set, ensuring no information leakage. The initial dataset exhibited a significant class imbalance. Hence, the training set was augmented with synthetic samples generated by Synthetic Minority Over-sampling Technique (SMOTE) to achieve a 1:1 ratio and enhance the ML model’s performance ( 27 – 29 ) ( Figure 1 ). ML models were built on the training set. Bootstrap resampling with 1000 iterations was used to optimize the hyper-parameters and assess the stability of ML models. Finally, the performance of each model was evaluated and compared in the test set. Three ML models were trained with fixed random seeds to ensure reproducibility ( Figure 1 ). Through comprehensive evaluation of multiple evaluation indicators (ROC curves, F1 score, Calibration curves and Decision curves analysis), the best performing model among the three models was selected for further risk stratification to distinguish between OHSS and non-OHSS (NOHSS) ( Figure 1 ).
Analysis flow for the development and evaluation of models.
The stability of predictor rankings was evaluated using bootstrap resampling (1000 iterations) from the test set. At each iteration, the Random Forest model was refitted on a bootstrap sample, and permutation-based feature importance was recorded. Selection frequency (proportion of resamples with ranking ≤ 3), mean rank, and rank standard deviation were computed for each feature. Stability was classified as High (selection frequency ≥ 0.8), Moderate (0.6 ≤ selection frequency < 0.8), or Low (selection frequency < 0.6).
Drawing on cooperative game theory, we employed SHAP to render the model’s outputs and optimization trajectory interpretable ( 22 , 23 ) ( Figure 1 ). SHAP values quantified each feature’s marginal contribution to the predictive outcome; the resulting swarm plot depicted the complete distribution of these contributions across observations, whereas case-level SHAP profiles elucidated how individual features influenced specific predictions. Recursive feature elimination was subsequently applied to retain only the most informative variables, yielding a parsimonious yet transparent model.
In the study, the Kolmogorov-Smirnov test was used for continuous variables. As all continuous variables were non-normally distributed, they are represented as median (IQR) and compared using the Mann-Whitney U test. Categorical data were presented as frequencies (proportions) and compared using the Pearson chi-square test. The ML models used all predictor variables as independent variables. All models were built with OHSS or NOHSS as the response variable. All statistical tests were two-tailed, with significance set at P < 0.05. Data analysis was conducted using R (version 4.3.3) and RStudio.
Text is read by the "Ask this paper" AI Q&A widget below.
Extraction quality varies by source — PMC NXML preserves structure
cleanly, OA-HTML may include some navigation residue, and OA-PDF can
have broken hyphenation. The publisher copy
is the canonical version.