Results
A total of 543 infertile patients were included in this study, of whom 105 underwent fresh embryo transfer and 438 underwent FET. According to clinical pregnancy outcomes, the patients were categorized into a pregnancy group ( n = 294) and a non-pregnancy group ( n = 249). All patients were randomly allocated to either a training set ( n = 380) or a test set ( n = 163) at a ratio of 7:3. The training set comprised 206 patients with clinical pregnancy and 174 without, while the test set comprised 88 and 75 patients, respectively. The baseline characteristics of the study participants are shown in Table 1 .

Table 1 Basic characteristics and analysis of differences between non-clinical and clinical pregnancy groups

| Characteristics | Total ( n = 543) | Non-clinical pregnancy ( n = 249) | Clinical pregnancy ( n = 294) | P |
|---|---|---|---|---|
| Age (years), Mean ± SD | 32.10 ± 3.77 | 32.71 ± 3.75 | 31.59 ± 3.72 | < 0.001* |
| BMI (kg/m²), Mean ± SD | 22.29 ± 3.05 | 22.16 ± 3.03 | 22.41 ± 3.06 | 0.344 |
| AMH (ng/mL), Mean ± SD | 4.36 ± 3.58 | 3.93 ± 3.31 | 4.73 ± 3.76 | 0.009* |
| AFC, Mean ± SD | 13.42 ± 6.86 | 12.43 ± 6.68 | 14.26 ± 6.91 | 0.002* |
| Duration of infertility (years), M (Q₁, Q₃) | 1.0 (1.0, 3.0) | 2.0 (1.0, 3.0) | 1.0 (1.0, 3.8) | 0.939 |
| No. of prior ET cycles, M (Q₁, Q₃) | 1.0 (0.0, 1.0) | 1.0 (0.0, 2.0) | 0.0 (0.0, 1.0) | 0.033* |
| Infertility type, n (%) | | | | 0.237 |
| Primary infertility | 363 (66.85) | 160 (64.26) | 203 (69.05) | |
| Secondary infertility | 180 (33.15) | 89 (35.74) | 91 (30.95) | |
| Infertility factor, n (%) | | | | 0.532 |
| Tubal factor | 128 (23.57) | 56 (22.49) | 72 (24.49) | |
| Ovulatory disorder | 26 (4.79) | 12 (4.82) | 14 (4.76) | |
| Endometriosis | 11 (2.03) | 7 (2.81) | 4 (1.36) | |
| DOR | 40 (7.37) | 16 (6.43) | 24 (8.16) | |
| Male factor infertility | 80 (14.73) | 33 (13.25) | 47 (15.99) | |
| Unexplained infertility | 86 (15.84) | 37 (14.86) | 49 (16.67) | |
| Combined factors | 172 (31.68) | 88 (35.34) | 84 (28.57) | |
| Endometrial preparation protocol, n (%) | | | | 0.262 |
| Fresh cycle | 105 (19.34) | 52 (20.88) | 53 (18.03) | |
| Natural cycle | 40 (7.37) | 17 (6.83) | 23 (7.82) | |
| HRT | 290 (53.41) | 126 (50.60) | 164 (55.78) | |
| GnRH-a cycle | 72 (13.26) | 34 (13.65) | 38 (12.93) | |
| Ovulation stimulation cycle | 22 (4.05) | 15 (6.02) | 7 (2.38) | |
| Other | 14 (2.58) | 5 (2.01) | 9 (3.06) | |
| EMT (mm), Mean ± SD | 9.93 ± 2.22 | 9.91 ± 2.45 | 9.95 ± 2.02 | 0.818 |
| Endometrial pattern, n (%) | | | | 0.498 |
| Type A | 370 (68.14) | 164 (65.86) | 206 (70.07) | |
| Type B | 170 (31.31) | 84 (33.73) | 86 (29.25) | |
| Type C | 3 (0.55) | 1 (0.40) | 2 (0.68) | |
| Endometrial blood flow, n (%) | | | | 0.089 |
| Type I | 291 (53.59) | 127 (51.00) | 164 (55.78) | |
| Type II | 189 (34.81) | 85 (34.14) | 104 (35.37) | |
| Type III | 63 (11.60) | 37 (14.86) | 26 (8.84) | |
| No. of embryos transferred, M (Q₁, Q₃) | 1.0 (1.0, 1.0) | 1.0 (1.0, 1.0) | 1.0 (1.0, 1.0) | 0.987 |
| 1 | 508 (93.55) | 233 (93.57) | 275 (93.54) | 0.986 |
| 2 | 35 (6.45) | 16 (6.43) | 19 (6.46) | |
| Stage of embryo transferred, n (%) | | | | < 0.001* |
| Cleavage-stage embryos | 159 (29.28) | 95 (38.15) | 64 (21.77) | |
| Blastocyst | 384 (70.72) | 154 (61.85) | 230 (78.23) | |
| High-quality embryo, n (%) | | | | 0.114 |
| Yes | 445 (81.95) | 197 (79.12) | 248 (84.35) | |
| No | 98 (18.05) | 52 (20.88) | 46 (15.65) | |

AFC, antral follicle count; AMH, anti-Müllerian hormone; BMI, body mass index; DOR, diminished ovarian reserve; HRT, hormone replacement therapy; EMT, endometrial thickness; SD, standard deviation; M, median; Q₁, 1st quartile; Q₃, 3rd quartile
* P < 0.05 was considered statistically significant
Significant differences between the pregnancy and non-pregnancy groups were observed for five variables ( P < 0.05). Patients in the pregnancy group were younger (31.59 ± 3.72 vs. 32.71 ± 3.75 years) and had higher AMH levels (4.73 ± 3.76 vs. 3.93 ± 3.31 ng/mL) and AFC (14.26 ± 6.91 vs. 12.43 ± 6.68). In addition, the pregnancy group had fewer previous transfer attempts (0.0 [0.0, 1.0] vs. 1.0 [0.0, 2.0]) and a higher proportion of blastocyst transfers (78.23% vs. 61.85%). No statistically significant differences were observed between the two groups for the remaining variables ( P > 0.05) (Table 1 ).
Clinical variables were analyzed using univariate and multivariate logistic regression. The results of the univariate analysis are presented in Additional Table 1. Variables with P < 0.05 in the univariate analysis were included in the multivariate regression analysis. The results indicated that age and the stage of the embryo transferred were independent predictors of clinical pregnancy (age: OR = 0.93, 95% CI = 0.89–0.97, P = 0.002; blastocyst vs. cleavage-stage embryo: OR = 2.03, 95% CI = 1.39–2.99, P < 0.001). Additionally, compared with type I endometrial blood flow, type III endometrial blood flow was a negative predictor of clinical pregnancy (OR = 0.53, 95% CI = 0.30–0.94, P = 0.029) (Table 2 ). Furthermore, BMI, EMT, and high-quality embryo status were incorporated into the model construction. Based on prior research, these factors hold potential value in predicting pregnancy outcomes and were therefore retained to enhance the model's clinical utility.

Table 2 Multivariate logistic regression analysis of variables significant in the univariate analysis

| Characteristics | Univariate OR (95% CI) | Univariate P value | Multivariate OR (95% CI) | Multivariate P value |
|---|---|---|---|---|
| Age (years), Mean ± SD | 0.92 (0.88–0.97) | < 0.001* | 0.93 (0.89–0.97) | 0.002* |
| Endometrial blood flow, n (%) | | | | |
| Type I | 1.00 (Reference) | | 1.00 (Reference) | |
| Type II | 0.95 (0.66–1.37) | 0.774 | 0.96 (0.66–1.40) | 0.836 |
| Type III | 0.54 (0.31–0.95) | 0.031* | 0.53 (0.30–0.94) | 0.029* |
| Stage of embryo transferred, n (%) | | | | |
| Cleavage-stage embryos | 1.00 (Reference) | | 1.00 (Reference) | |
| Blastocyst | 2.22 (1.52–3.23) | < 0.001* | 2.03 (1.39–2.99) | < 0.001* |

CI, confidence interval; OR, odds ratio; SD, standard deviation
* P < 0.05
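As a consistency check, the univariate odds ratios for the two categorical predictors can be approximately reproduced from the group counts in Table 1. The sketch below is illustrative and is not the authors' code: the function name `odds_ratio_wald` is hypothetical, and the Wald interval on the log scale is an assumption, since the text does not state how the confidence intervals were computed.

```python
from math import exp, log, sqrt

def odds_ratio_wald(exposed_pos, exposed_neg, ref_pos, ref_neg, z=1.96):
    """Odds ratio of an exposed vs. a reference category, with a Wald 95% CI.

    *_pos / *_neg are counts with / without clinical pregnancy.
    """
    odds_ratio = (exposed_pos / exposed_neg) / (ref_pos / ref_neg)
    # Standard error of log(OR) for a 2x2 table (assumed Wald method).
    se = sqrt(1 / exposed_pos + 1 / exposed_neg + 1 / ref_pos + 1 / ref_neg)
    lower = exp(log(odds_ratio) - z * se)
    upper = exp(log(odds_ratio) + z * se)
    return odds_ratio, lower, upper

# Counts taken from Table 1 (clinical pregnancy, no clinical pregnancy).
blastocyst = odds_ratio_wald(230, 154, 64, 95)   # reference: cleavage stage
type_iii = odds_ratio_wald(26, 37, 164, 127)     # reference: type I blood flow

for name, (odds_ratio, lower, upper) in [("Blastocyst", blastocyst),
                                         ("Type III blood flow", type_iii)]:
    print(f"{name}: OR = {odds_ratio:.2f} (95% CI {lower:.2f}-{upper:.2f})")
```

Under these assumptions the crude estimates agree with the univariate column of Table 2 to two decimal places. The multivariate estimates differ slightly because they are adjusted for the other covariates.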
A total of 19 basic imaging features were initially extracted from the ROI. These features were clustered using the K-means algorithm, and the clustering performance was evaluated with the CH index. The results showed that when the number of clusters (k) was 4, the CH index reached its maximum value, thus dividing the ROI into four sub-regions with significant differences (Fig. 3 ). Based on this, radiomic features were systematically extracted from the entire ROI (h0) and the four sub-regions (h1-h4), yielding 1,561 features from each region, for a total of 7,805 features. The specific distribution is shown in Additional Fig. 1. To select key features, three rounds of feature selection were performed using the U-test, Pearson correlation analysis, and the mRMR algorithm, ultimately retaining the 15 most discriminative radiomic features. The complete list is provided in Additional Table 2. Furthermore, to assess the correlations between the features, we analyzed the relationship between these 15 radiomic features and 6 clinical characteristics, with the results presented as a heatmap in Fig. 4 .
Fig. 3 Feature extraction and clustering. a Representative radiomic feature maps extracted from the endometrial ROI. A total of 19 features were extracted, with eight shown here as examples. b CH values for cluster numbers (k) ranging from 2 to 8. The line chart shows that the CH index peaks when the number of clusters (k) is set to 4, indicating the optimal clustering structure with high intra-class compactness and maximum inter-class separation under this condition. CH, Calinski-Harabasz; ROI, region of interest
Fig. 4 Correlation heatmap of predictive variables for model construction. This heatmap displays the correlations between the selected radiomic features and clinical variables. The intensity of the color represents the magnitude of the Pearson correlation coefficient, with red indicating a positive correlation and blue indicating a negative correlation. The diagonal of the matrix represents the self-correlation of each feature ( r = 1)
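The selection of the cluster number by the CH index can be illustrated with a minimal, standard-library sketch. Everything here is an assumption made for illustration: the toy one-dimensional data, the deterministic farthest-point seeding, and the helper names (`farthest_point_init`, `kmeans`, `calinski_harabasz`) are hypothetical and do not reproduce the study's pipeline, which clustered 19 local features and scanned k from 2 to 8.

```python
from math import dist

def farthest_point_init(points, k):
    """Deterministic seeding: start at the first point, then repeatedly add
    the point that is farthest from all centres chosen so far."""
    centres = [points[0]]
    while len(centres) < k:
        centres.append(max(points, key=lambda p: min(dist(p, c) for c in centres)))
    return centres

def kmeans(points, k, n_iter=100):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    centres = farthest_point_init(points, k)
    labels = None
    for _ in range(n_iter):
        new_labels = [min(range(k), key=lambda j: dist(p, centres[j])) for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the previous centre if a cluster becomes empty
                centres[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels

def calinski_harabasz(points, labels):
    """CH = [B / (k - 1)] / [W / (n - k)], where B and W are the between- and
    within-cluster sums of squared distances."""
    n = len(points)
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    k = len(clusters)
    grand_mean = tuple(sum(col) / n for col in zip(*points))
    between = within = 0.0
    for members in clusters.values():
        centre = tuple(sum(col) / len(members) for col in zip(*members))
        between += len(members) * dist(centre, grand_mean) ** 2
        within += sum(dist(p, centre) ** 2 for p in members)
    return (between / (k - 1)) / (within / (n - k))

# Toy data (hypothetical): three well-separated groups of one-feature "pixels".
points = [(v,) for v in (0.0, 1.0, 2.0, 10.0, 11.0, 12.0, 20.0, 21.0, 22.0)]
scores = {k: calinski_harabasz(points, kmeans(points, k)) for k in range(2, 6)}
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

On this toy input the CH index is largest at k = 3, the true number of groups. In the study the same criterion selected k = 4.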
We constructed a clinical-radiomic fusion model based on clinical features and ultrasound radiomic features, utilizing 11 machine learning methods. In the training set, ROC curve analysis showed that algorithms such as RF, XGBoost, LightGBM, and Gradient Boost exhibited relatively better discriminatory performance (Fig. 5 a). The radar chart further demonstrated that these models performed well across multiple evaluation metrics (Fig. 5 c). Comprehensive performance metrics for each model in the training set are presented in Additional Table 3.
Fig. 5 Performance evaluation of predictive models for clinical pregnancy. a , b ROC curves for the training and test sets, with the dotted line (AUC = 0.5) representing a random classifier. c , d Radar charts showing AUC, specificity, sensitivity, F1 score, and accuracy for both sets. e Confusion matrices of three optimal models on the test set, displaying true positives, true negatives, false positives, and false negatives to illustrate classification performance and error patterns. AUC, area under the curve; KNN, k-nearest neighbors; MLP, multilayer perceptron; ROC, receiver operating characteristic; SVM, support vector machine
In the test set, the ExtraTrees model showed the best discrimination, with an AUC of 0.766 (95% CI: 0.689–0.830), followed by LR (AUC = 0.741, 95% CI: 0.664–0.812) and RF (AUC = 0.734, 95% CI: 0.651–0.806). In contrast, the MLP and KNN models discriminated less well, with AUC values of 0.648 and 0.649, respectively (Fig. 5 b). Radar chart results further indicated that ExtraTrees demonstrated stable performance across multiple metrics (Fig. 5 d).
Overall, ExtraTrees achieved the best balance across the evaluation metrics, with the highest AUC and F1 score (0.726), a sensitivity of 73.9%, and a specificity of 65.3%. Although RF and LR also showed strong predictive power, both models had certain limitations in terms of sensitivity or specificity. Performance metrics for all models on the test set are summarized in Table 3 .

Table 3 Performance comparison of different machine learning models in test set

| Model | AUROC (95% CI) | Accuracy | F1-score | Sensitivity | Specificity | PPV | NPV |
|---|---|---|---|---|---|---|---|
| LR | 0.741 (0.664–0.812) | 69.9% | 0.713 | 69.3% | 70.7% | 73.5% | 66.3% |
| RF | 0.734 (0.651–0.806) | 69.3% | 0.722 | 73.9% | 64.0% | 70.7% | 67.6% |
| XGBoost | 0.704 (0.619–0.784) | 68.1% | 0.720 | 76.1% | 58.7% | 68.4% | 67.7% |
| SVM | 0.700 (0.612–0.780) | 65.0% | 0.692 | 72.7% | 56.0% | 66.0% | 63.6% |
| LightGBM | 0.693 (0.614–0.778) | 63.8% | 0.674 | 69.3% | 57.3% | 65.6% | 61.4% |
| ExtraTrees | 0.766 (0.689–0.830) | 69.9% | 0.726 | 73.9% | 65.3% | 71.4% | 68.1% |
| MLP | 0.648 (0.567–0.728) | 58.3% | 0.699 | 89.8% | 21.3% | 57.2% | 64.0% |
| KNN | 0.649 (0.559–0.728) | 60.7% | 0.640 | 64.8% | 56.0% | 63.3% | 57.5% |
| AdaBoost | 0.723 (0.642–0.796) | 66.3% | 0.678 | 65.9% | 66.7% | 69.9% | 62.5% |
| Gradient Boost | 0.685 (0.604–0.767) | 63.8% | 0.670 | 68.2% | 58.7% | 65.9% | 61.1% |
| Naïve Bayes | 0.665 (0.578–0.748) | 55.2% | 0.407 | 28.4% | 86.7% | 71.4% | 50.8% |

AUROC, area under the receiver operating characteristic curve; CI, confidence interval; KNN, k-nearest neighbors; LR, logistic regression; MLP, multilayer perceptron; NPV, negative predictive value; PPV, positive predictive value; RF, random forest; SVM, support vector machine
Furthermore, the confusion matrices for the three top-performing models on the test set are presented in Fig. 5 e, demonstrating robust classification capabilities for both positive and negative pregnancy outcome categories. Confusion matrices for the remaining models are detailed in Additional Fig. 2.
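The threshold-based metrics reported for a model all follow from its confusion matrix. The sketch below is a hedged illustration rather than the authors' evaluation code: the cell counts are not taken from Fig. 5 e but are back-calculated, as an assumption, from the reported ExtraTrees sensitivity and specificity and the test-set class sizes (88 with and 75 without clinical pregnancy). The helper names are hypothetical.

```python
def confusion_from_rates(sensitivity, specificity, n_pos, n_neg):
    """Recover integer cell counts from reported rates (assumes the rates
    were rounded from whole-patient counts)."""
    tp = round(sensitivity * n_pos)
    tn = round(specificity * n_neg)
    return tp, n_pos - tp, tn, n_neg - tn  # TP, FN, TN, FP

def summary_metrics(tp, fn, tn, fp):
    """Standard definitions of the metrics listed in Table 3."""
    return {
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "f1": 2 * tp / (2 * tp + fp + fn),
    }

# Reported ExtraTrees sensitivity and specificity on the test set.
tp, fn, tn, fp = confusion_from_rates(0.739, 0.653, n_pos=88, n_neg=75)
metrics = summary_metrics(tp, fn, tn, fp)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these reconstructed counts (TP = 65, FN = 23, TN = 49, FP = 26), the derived accuracy, PPV, NPV, and F1 score match the ExtraTrees row of Table 3 after rounding.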
To enhance the interpretability of the model, we conducted an interpretability analysis on the best-performing ExtraTrees model. The contribution of each feature to the model output was visualized through SHAP values. Fig. 6 a shows the SHAP beeswarm plot, in which each dot represents a patient and the color reflects the feature value, with red indicating high values and blue indicating low values. Among the features, the stage of the transferred embryo had a markedly stronger impact on the model output than the others. Many higher-order texture features (such as wavelet_HHH_glcm_JointAverage_h4 and wavelet_LHH_glcm_ClusterShade_h2) captured subtle changes in local image regions and made considerable contributions to the model's predictions. The predicted probability of clinical pregnancy decreased gradually with increasing age.
Fig. 6 SHAP visualisation results for the Extra Trees model. a Feature importance ranking plot, with features ordered by their average impact on model output, the most influential features positioned at the top; the right-hand side displays the impact of features on the model, with red indicating positive effects and blue indicating negative effects, where darker colours denote greater influence. b Individual prediction explanation plot based on SHAP. This plot illustrates the decision-making process by which the Extra Trees model successfully predicted patients without clinical pregnancy. The baseline value E[ f(x) ] represents the model's average predicted value across the entire training set. Each feature arrow illustrates its contribution to shifting the sample's predicted value from the baseline towards the final output. The length of red and blue arrows respectively indicates the extent to which the feature increases or decreases the predicted value. f(x) denotes the final predicted value. BMI, body mass index; EMT, endometrial thickness
We also examined the SHAP values of one randomly selected infertile patient (Fig. 6 b). The results showed that both the transfer of a poor-quality embryo and an age of 37 years contributed negatively to the predicted clinical pregnancy outcome, with the transfer of a poor-quality embryo having the largest effect (SHAP value = –0.06). In contrast, blastocyst transfer contributed positively to the predicted outcome (SHAP value = +0.04).
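The additive reading of such a plot, in which the baseline plus the per-feature contributions equals the individual prediction, can be made concrete with a brute-force Shapley computation. This is a sketch under stated assumptions and not the SHAP library or the fitted model: `toy_model` and its coefficients are hypothetical, and a single reference patient stands in for the training-set average E[ f(x) ].

```python
from itertools import combinations
from math import factorial

def shapley_values(model, x, baseline):
    """Exact Shapley values of model(x) relative to model(baseline),
    enumerating every subset of the remaining features."""
    names = list(x)
    n = len(names)

    def value(subset):
        # Features in `subset` take the patient's value, the rest the baseline's.
        return model({k: (x[k] if k in subset else baseline[k]) for k in names})

    phi = {}
    for i in names:
        others = [k for k in names if k != i]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                total += weight * (value(set(subset) | {i}) - value(set(subset)))
        phi[i] = total
    return phi

def toy_model(f):
    """Hypothetical score with made-up coefficients, NOT the ExtraTrees model."""
    return 0.5 + 0.04 * f["blastocyst"] + 0.06 * f["high_quality"] - 0.004 * (f["age"] - 32)

patient = {"blastocyst": 1, "high_quality": 0, "age": 37}
baseline = {"blastocyst": 0, "high_quality": 1, "age": 32}
phi = shapley_values(toy_model, patient, baseline)
print(phi)
```

The contributions sum to the difference between the patient's score and the baseline score, which is the property that lets each arrow in an individual explanation plot be read as a shift from E[ f(x) ] towards f(x). Tree-based explainers obtain the same quantities far more efficiently than this enumeration.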
Material
This study was reviewed and approved by the hospital's ethics committee (TJ-IRB202509050), and the requirement for informed consent was waived. The detailed steps involved in the research methods are presented in Fig. 1 .
Fig. 1 The flowchart of this study. LR, logistic regression; ML, machine learning; MLP, multilayer perceptron; mRMR, minimum redundancy maximum relevance; ROC, receiver operating characteristic; SVM, support vector machine
A total of 800 infertile patients who underwent fresh or frozen-thawed embryo transfer (FET) at our center between December 2023 and June 2025 were initially screened. A total of 543 eligible patients were finally included and randomly assigned in a 7:3 ratio to a training set ( n = 380) and a test set ( n = 163). The screening process is illustrated in Fig. 2 .
Fig. 2 Flowchart of patient selection. HCG, human chorionic gonadotropin; IVF, in vitro fertilization; ICSI, intracytoplasmic sperm injection; PGT, preimplantation genetic testing; RSA, recurrent spontaneous abortion
The inclusion criteria were as follows: (1) age ≤ 40 years, (2) underwent IVF or ICSI treatment with at least one transferable embryo. The exclusion criteria were as follows: (1) congenital uterine abnormalities (e.g., unicornuate or septate uterus) or untreated uterine lesions such as endometrial polyps, submucosal fibroids, and intrauterine adhesions, (2) sequential embryo transfer, (3) incomplete clinical data or poor-quality US images, (4) recurrent miscarriage or repeated implantation failure, defined as two or more pregnancy losses before 28 weeks of gestation with the same partner, or failure to achieve clinical pregnancy after three or more fresh or frozen cycles with three or more high-quality embryos in women under 40 years of age, (5) severe autoimmune or malignant diseases.
The criteria for high-quality embryos in our center are based on the Istanbul consensus with appropriate modifications [ 23 ]. At the cleavage stage, embryos exhibiting 7–10 cells on day 3 with a fragmentation rate < 20% were defined as high-quality cleavage-stage embryos; at the blastocyst stage, high-quality blastocysts are defined as those scoring ≥ 3BB according to the Gardner grading system [ 24 ].
Patients who met the criteria for fresh embryo transfer received one or two day-3 (D3) embryos on day 3 after oocyte retrieval or one high-quality day-5 (D5) blastocyst. The endometrial preparation protocol for patients undergoing FET was determined based on the patients' individual clinical conditions. For patients with regular menstruation, follicular development was monitored by US from days 8 to 12 of the menstrual cycle, and the time for synchronous embryo transfer was calculated from the day of ovulation. Patients with menstrual irregularities underwent a hormone replacement therapy (HRT) cycle. From days 2–4 of the menstrual cycle, they were given 4 mg/day of oral estradiol or 4 pumps/day of estradiol gel applied to the skin. Once the EMT reached the expected level, progesterone was administered for endometrial transformation. In downregulation cycles, long-acting GnRH agonist (GnRH-a) was injected on day 2 of the menstrual cycle. After 28 days, patients returned to the hospital for a follow-up US scan and assessment of sex hormone levels. Upon meeting the criteria for downregulation, an HRT cycle was initiated to prepare the endometrium. In ovulation stimulation cycles, ovulation induction agents were started on days 3–5 of the cycle, and EMT and follicular growth were monitored. When the follicle diameter reached ≥ 18 mm and the EMT was appropriate, ovulation was triggered, and frozen-thawed embryo transfer was performed on day 3 or day 5–6 after ovulation.
The number of embryos transferred strictly followed the ASRM guidelines, with a maximum of 2 blastocysts per transfer [ 25 ]. Serum β-HCG testing was performed 12–14 days after ET to confirm pregnancy. If pregnancy was confirmed, a US examination was conducted 4 weeks post-transfer. Clinical pregnancy was defined as one or more gestational sacs visible on US.
Clinical data for all patients were collected from the hospital medical records system, and a descriptive analysis of the baseline data was conducted. The clinical characteristics collected included age, body mass index (BMI), anti-Müllerian hormone (AMH), antral follicle count (AFC), type, duration, and cause of infertility, number of prior embryo transfers, fresh or frozen-thawed embryo transfer protocol, EMT, endometrial morphology, blood flow distribution pattern on the day of HCG administration or endometrial transformation, number of embryos transferred, embryo stage, high-quality embryo status, and clinical pregnancy outcome.
All patients included in the study underwent transvaginal ultrasound (TVUS) examination on the day of HCG injection or the endometrial transformation day. The specific ultrasound device models are detailed in Additional Material 1 . The examination process strictly adhered to the International Endometrial Tumor Analysis (IETA) consensus standards [ 26 ], with patients placed in the lithotomy position to ensure image consistency and repeatability. Using 2D US mode, EMT was first measured along the midsagittal plane, and endometrial morphology was evaluated. Subsequently, color Doppler was employed to evaluate endometrial blood flow distribution, categorized into three grades according to the modified Applebaum criteria [ 27 ]: Grade I: Blood flow signal does not reach the hyperechoic edge of the endometrium; Grade II: Blood flow signal enters the endometrium but does not exceed half of the single-layer thickness of the endometrium; Grade III: Blood flow signal reaches the center of the endometrium. All images were stored in Digital Imaging and Communications in Medicine (DICOM) format.
A sonographer with more than five years of experience manually delineated the endometrial region on mid-sagittal uterine ultrasound images using 3D Slicer software (version 5.8.1), tracing along the hyperechoic margin of the endometrium. All delineations were independently reviewed by a second senior sonographer to ensure segmentation accuracy. Where discrepancies arose between the two practitioners, consensus was reached through joint discussion.
We employed the open-source Python package PyRadiomics, combined with a local neighborhood approach, to systematically extract radiomic features from the ROI. Following feature extraction, unsupervised clustering analysis was conducted using the K-means algorithm, with the number of clusters (k) ranging from 2 to 8. The Calinski-Harabasz (CH) index was utilized to evaluate the separation effectiveness of each cluster count, wherein a higher index value indicates greater inter-class differentiation and superior clustering structure. Following the determination of the optimal number of clusters, PyRadiomics was further employed to extract radiomics feature sets from the whole ROI and its subregions, encompassing first-order statistics, morphological features, and higher-order texture features. To prevent model overfitting and enhance its generalization capability, all features underwent Z-score standardization followed by a two-stage feature selection process. Initial dimensionality reduction was performed using the Mann–Whitney U test and Pearson correlation coefficients (with a threshold of 0.9). Subsequently, the Minimum Redundancy Maximum Relevance (mRMR) algorithm was applied to further select a discriminant-capable subset of features. Additionally, univariate and multivariate logistic regression analyses were conducted to identify clinically independent predictors associated with pregnancy outcomes. Variables yielding p < 0.05 in univariate analysis were incorporated into multivariate regression models to further evaluate their independent predictive value.
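The correlation-based redundancy step can be sketched as follows. This is a minimal illustration under assumptions: the text gives only the threshold (0.9), so the rule used here, which keeps the first feature of each highly correlated pair in input order, is hypothetical, as are the helper names and the toy values. The U-test and mRMR steps are not shown.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def drop_redundant(features, threshold=0.9):
    """Greedy filter: walk the features in order and keep one only if its
    absolute correlation with every feature already kept is <= threshold."""
    kept = []
    for name, values in features.items():
        if all(abs(pearson(values, features[k])) <= threshold for k in kept):
            kept.append(name)
    return kept

# Toy feature table (hypothetical values, one list entry per patient).
features = {
    "f1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "f2": [2.0, 4.0, 6.0, 8.0, 11.0],   # nearly a multiple of f1
    "f3": [5.0, 1.0, 4.0, 2.0, 3.0],    # weakly related to f1
}
selected = drop_redundant(features)
print(selected)
```

Z-score standardization does not change Pearson correlations, so the filter gives the same result whether it runs before or after that step.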
The selected clinical and radiomic features were used to train 11 machine learning algorithms: Logistic Regression, KNN, RF, Multilayer perceptron (MLP), ExtraTrees, SVM, LightGBM, AdaBoost, XGBoost, Gradient Boost, and Naïve Bayes. Five-fold stratified cross-validation was performed on the training set through grid search to optimize hyperparameters. Additionally, label smoothing was introduced to enhance the generalization ability of the model. The model performance was evaluated using the area under the curve (AUC), accuracy, sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and F1 score. To explain our classification model, SHAP was introduced to quantify the contribution of predictive variables to model predictions.
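The fold construction behind five-fold stratified cross-validation can be sketched with the standard library alone. The example is illustrative and makes several assumptions: the function name `stratified_folds`, the seed, and the round-robin assignment are hypothetical, and the hyperparameter grids, which the text does not report, are omitted. Only the training-set class sizes (206 and 174) come from the study.

```python
import random

def stratified_folds(labels, n_folds=5, seed=0):
    """Assign sample indices to folds so that each fold keeps roughly the
    same class proportions as the full set."""
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for cls in sorted(set(labels)):
        indices = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(indices)
        # Deal the shuffled indices of this class out to the folds in turn.
        for position, i in enumerate(indices):
            folds[position % n_folds].append(i)
    return folds

# Training-set labels: 206 clinical pregnancies (1) and 174 without (0).
labels = [1] * 206 + [0] * 174
folds = stratified_folds(labels)

for fold_id, validation in enumerate(folds):
    held_out = set(validation)
    training = [i for i in range(len(labels)) if i not in held_out]
    positives = sum(labels[i] for i in validation)
    print(f"fold {fold_id}: train = {len(training)}, "
          f"validation = {len(validation)} ({positives} positive)")
```

In a grid search, each candidate hyperparameter setting would be fitted on the four training folds and scored on the held-out fold, and the setting with the best mean validation score would be retained.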
Statistical analysis, data preprocessing, and feature engineering were conducted using Python 3.9 (Python Software Foundation, USA) and R 4.2.0 (R Foundation for Statistical Computing, Vienna, Austria). Continuous variables with normal distribution were expressed as mean ± standard deviation, and intergroup comparisons were performed using the independent samples t-test. Variables with non-normal distribution were expressed as median (interquartile range), and inter-group comparisons were conducted using the Wilcoxon rank sum test. Categorical variables were presented as frequency and percentage, and inter-group comparisons were made using the chi-square test. When the expected frequency of any cell was less than 5, the Fisher's exact test was used. A two-sided test with P -value < 0.05 was considered statistically significant.
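As a worked example of the independent samples t-test, the age comparison can be recomputed from the summary statistics in Table 1 alone. This is a sketch under assumptions: a pooled-variance test is assumed because the text does not say whether equal variances were assumed, the p-value uses a normal approximation that is reasonable only because the degrees of freedom are large, and the function name is hypothetical.

```python
from math import erfc, sqrt

def pooled_t_from_summary(mean1, sd1, n1, mean2, sd2, n2):
    """Two-sample t statistic from group means, SDs, and sizes."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    t = (mean1 - mean2) / sqrt(pooled_var * (1 / n1 + 1 / n2))
    # Two-sided p-value from the standard normal distribution (approximation).
    p = erfc(abs(t) / sqrt(2))
    return t, df, p

# Age in the non-pregnancy (n = 249) and pregnancy (n = 294) groups, Table 1.
t, df, p = pooled_t_from_summary(32.71, 3.75, 249, 31.59, 3.72, 294)
print(f"t = {t:.2f}, df = {df}, p = {p:.4f}")
```

Under these assumptions the p-value falls below 0.001, in line with the value reported for age in Table 1.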
Conclusion
In conclusion, by integrating endometrial US images and clinical characteristics, we have developed and validated a machine learning model with clinical application potential. This model can non-invasively predict the clinical pregnancy outcomes for both fresh embryo transfer and FET cycles. The SHAP method is employed to provide explainable analysis of the model's decision-making process, thereby identifying the key factors influencing individual pregnancy outcomes. This approach provides a new intelligent tool for the field of assisted reproduction and is expected to offer strong support to clinicians in formulating personalized ET strategies and individualized fertility management.
Discussion
This study applied habitat sub-region clustering analysis to develop an ML model that integrated endometrial US images with clinical features for predicting clinical pregnancy outcomes following fresh and frozen-thawed embryo transfers. Previous studies predicting IVF pregnancy outcomes have primarily focused on clinical indicators such as age, BMI, or US morphological parameters, achieving some progress but still underutilizing the deep information contained in US images [ 28 , 29 ]. In this study, we constructed predictive models based on eleven different machine learning algorithms, among which ExtraTrees demonstrated the best overall discrimination performance. We also conducted an interpretability analysis of the best model using the SHAP method, further revealing the internal decision-making mechanism of the model and indicating that the integration of clinical and imaging information can enhance prediction performance.
US imaging has been widely applied in the screening and diagnosis of various diseases. However, it still has certain limitations in routine clinical practice. On the one hand, US images are susceptible to noise and artifacts, and imaging quality is constrained by examination conditions. On the other hand, image acquisition and interpretation largely depend on operator experience, resulting in limited inter-operator consistency. ML can address these limitations by meeting different clinical demands in US imaging, playing an important role in classification, segmentation, and outcome prediction, thereby improving the reproducibility and consistency of image evaluation. For example, in breast ultrasound, machine learning–based classification models have been widely applied to assist in differentiating benign and malignant lesions. By extracting morphological and textural features of the lesions, these models can achieve diagnostic accuracy comparable to that of experienced breast radiologists [ 30 ]. In the field of echocardiography, deep learning methods have been successfully used for automatic segmentation of cardiac chambers and myocardium, enabling automated calculation of functional parameters such as ejection fraction and thus improving measurement efficiency and reproducibility [ 31 ]. It should be emphasized that machine learning models based on ultrasound imaging are not intended to replace clinicians’ judgment, but rather to serve as auxiliary decision-support tools that provide more objective references for optimizing individualized diagnostic and treatment strategies.
Building on traditional radiomics, habitat analysis radiomics models identify sub-regions with similar features through voxel clustering, which can more accurately quantify the spatial heterogeneity of tissues [ 20 ]. The approach has therefore attracted increasing attention and application in recent years. In model construction, this method has demonstrated superior predictive performance compared with whole-region and purely clinical models. This advantage has been validated in diagnostic and prognostic studies of malignant tumours such as glioblastoma, lung cancer, and ovarian cancer [ 32 – 34 ]. From a biological perspective, the endometrium undergoes complex structural and functional changes throughout the menstrual cycle [ 35 ], and its thickness, echo pattern, and internal blood perfusion exhibit significant spatial and temporal heterogeneity [ 36 ]. Such heterogeneity may be closely associated with the receptivity of the endometrium, thereby affecting subsequent embryo implantation and pregnancy outcomes. Traditional radiomics models extract statistical features from the entire ROI, which may overlook morphological and functional variations among different subregions of the endometrium [ 37 ]. In contrast, by using the habitat-based radiomics model to segment subregions and quantify internal feature differences within the endometrium, we can better reflect the microstructural characteristics of the endometrium and thereby improve the accuracy of pregnancy outcome prediction.
In the field of FET, ML models based on US imaging and clinical features have made significant progress. Yang et al. constructed a nomogram by combining the US features of the endometrium and its junctional zone with clinical variables, which improved the prediction of pregnancy outcomes [ 38 ]. Another study evaluated ROIs of varying size in the endometrial junctional zone and identified a 4 mm subendometrial zone as possibly the optimal region for predicting clinical pregnancy following FET [ 39 ]. Additionally, Liang et al. used a deep learning radiomics approach to predict the FET outcomes of high-quality embryos (≥ 4BB), with a model AUC of 0.825 [ 40 ], demonstrating the potential of combining radiomics and AI in the assessment of FET outcomes. However, fresh embryo transfer and FET differ in embryo handling, transfer timing, and physiological environment, and the choice of strategy often depends on individualized conditions. Although the current literature does not provide a unified conclusion regarding which strategy yields superior pregnancy outcomes, multiple randomized controlled trials and meta-analyses have indicated that the two strategies are generally comparable in terms of cumulative pregnancy rate or live birth rate [ 41 ]. In our study, in addition to the FET cycles, we also included patients with fresh embryo transfer cycles and achieved good predictive performance.
Embryo quality and ER are widely regarded as the two core factors determining the clinical pregnancy outcome [ 7 , 8 , 28 ]. This study found that the type of transferred embryo was one of the most predictive variables in the model. Blastocyst transfer significantly increased the clinical pregnancy rate (CPR) compared with cleavage-stage embryo transfer, which is consistent with previous research [ 42 ]. Multiple studies have confirmed that blastocyst transfer leads to higher implantation and pregnancy rates in both fresh and frozen-thawed cycles [ 42 , 43 ]. This may be related to the superior developmental potential of blastocysts and their better synchronization with the endometrium, which together improve the likelihood of successful implantation.
Our research has several limitations. First, the images and data used were from our center and did not cover patients from different regions. A larger, more diverse dataset could help improve the robustness and generalization ability of the model. Second, two US devices were used in our study, and image preprocessing was conducted to reduce the impact of machine differences. However, some studies have suggested that, because image brightness, contrast, and other features vary among devices, the performance of radiomics models may be affected by the type of US device [ 44 ]. Third, as this study was based on data from a single center, the model has not been validated with external datasets, so its generalization performance still needs to be confirmed. Lastly, some studies have used deep learning to build predictive models. Deep learning can outperform traditional machine learning algorithms in handling large and complex datasets. Owing to the limited sample size in this study, we did not explore the potential of deep learning. In the future, we plan to conduct a prospective, multi-center, large-sample study to develop a deep learning multimodal radiomics model that further supports clinical diagnosis and treatment.
Introduction
Infertility is a disease defined by the failure to achieve clinical pregnancy after 12 months or more of regular, unprotected sexual intercourse [ 1 ]. According to the latest statistics, approximately 17.5% of adults worldwide have experienced infertility at some point in their lives [ 2 ]. Due to the increasing incidence of infertility and the expanding indications for in vitro fertilization-embryo transfer (IVF-ET), the number of patients undergoing assisted reproductive technology (ART) has significantly increased. Currently, over 10 million babies have been born through this technology [ 3 ]. However, the widespread use of this technology also highlights its limited success rate at the individual level. Data aggregated by the Society for Assisted Reproductive Technology (SART) in 2023 show that even among women under 40, the live birth rate (LBR) per single egg retrieval cycle remains at approximately 20% to 45%, which places significant economic and psychological pressure on infertile couples [ 4 ]. Therefore, during treatment, patients generally wish to understand their own likelihood of IVF success in order to make more rational and informed decisions. However, the outcome of IVF is influenced by multiple factors, including female age, ovarian reserve function, embryo quality, endometrial receptivity (ER), and lifestyle [ 5 , 6 ]. Because these factors interact and carry different weights, it is difficult for clinicians to integrate all variables comprehensively and precisely in clinical practice, and the prediction of individualized pregnancy outcomes therefore remains limited.
In IVF-ET cycles, insufficient ER accounts for approximately two-thirds of embryo implantation failures [ 7 , 8 ]. Various methods have been proposed to assess ER. Among them, ultrasound (US) examination has become the preferred method for reproductive specialists due to its simplicity, noninvasive nature, and high repeatability. As early as 2012, researchers proposed using 2D and 3D US parameters to predict the pregnancy outcomes of IVF patients and suggested that endometrial thickness (EMT), morphology, volume, blood flow, and endometrial elasticity could enhance the prediction of pregnancy outcomes to some extent [ 9 – 11 ]. However, these traditional methods rely on the experience and subjective judgment of sonographers, so their results are subject to variability. Moreover, the aforementioned indicators are largely confined to visually discernible morphological features and quantitative parameters, which limits predictive accuracy and fails to fully capture the underlying information contained in US images.
In the medical field, machine learning (ML) methods have shown potential for clinical decision support using routine clinical data, including assisting medical staff in screening patients at high risk of early miscarriage [ 12 , 13 ], identifying special infertile populations (such as those with POI or PCOS) [ 14 , 15 ], and optimizing the trigger timing during controlled ovarian stimulation in IVF procedures [ 16 ]. Various ML algorithms, such as logistic regression (LR), support vector machine (SVM), K-nearest neighbor (KNN), decision tree (DT), and random forest (RF), have been widely applied for outcome prediction. Building on this, radiomics integrates multiple ML methods to extract and analyze a large set of quantitative features from medical images (such as MRI, CT, and US). For instance, using sperm images and deep learning algorithms, Leung et al. developed a binary classification model to predict sperm-zona pellucida binding ability, achieving an accuracy exceeding 96% and significantly outperforming traditional manual analysis methods [ 17 ].
In recent years, the combination of radiomics with clinical features to construct feature-fusion models has also shown broad application prospects. Salih et al. and Mikołaj et al. extracted radiomic features from embryo and fetal US images and combined them with patients' clinical variables using different ML models. This approach enhanced the accuracy of predicting pregnancy outcomes and improved fetal age estimation [ 18 , 19 ]. Additionally, the rise of habitat radiomics has further advanced medical radiomics. By combining it with ML techniques, habitat radiomics can reveal the tissue characteristics of different sub-regions and analyze the tissue heterogeneity within the same organ, thereby uncovering individualized features of patients [ 20 ]. Jin et al. successfully predicted the long-term efficacy of uterine artery embolization treatment for patients with adenomyosis using this method and showed that its predictive performance was superior to that of the whole-region radiomics model and the pure clinical model [ 21 ].
Although many complex ML models can make accurate predictions, their decision-making processes are often difficult to understand, a limitation known as the 'black box problem'. Explainable artificial intelligence (XAI) has emerged to address this problem. By introducing Shapley values, the specific contribution of each feature to the prediction result can be quantified [ 22 ]. This greatly enhances the transparency and credibility of the model, providing more reliable decision support for clinical applications.
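To make the idea of a Shapley value concrete, the following minimal sketch computes exact Shapley values for one prediction by enumerating every feature subset. It is an illustration under stated assumptions, not the implementation used in this study: `toy_score` is a hypothetical three-feature model, and "removing" a feature is approximated by replacing it with a baseline value. Practical tools typically rely on approximations, because exact enumeration grows exponentially with the number of features.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values of each feature for the prediction predict(x).

    Features outside a coalition are set to their `baseline` value
    (an assumption; other ways of marginalising features exist).
    """
    n = len(x)

    def value(subset):
        # Coalition members take the patient's values, the rest the baseline.
        z = [x[j] if j in subset else baseline[j] for j in range(n)]
        return predict(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Weight |S|! (n - |S| - 1)! / n! for coalitions S of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for coalition in combinations(others, size):
                s = set(coalition)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


def toy_score(z):
    """Hypothetical model with one interaction term, for illustration only."""
    return 0.4 * z[0] - 0.2 * z[1] + 0.1 * z[0] * z[2]
```

For any model, the values returned by `shapley_values` sum to `predict(x) - predict(baseline)`, which is the property that lets each feature's contribution to an individual prediction be read off directly.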
The main purpose of this study is to identify sub-regions of the endometrium using habitat radiomics analysis and combine patients' clinical information and endometrial US image features to construct a prediction model integrating multiple ML methods. Through the interpretability analysis of the model, the internal decision-making process of the model can be better understood. This model aims to assess clinical pregnancy probabilities following embryo transfer in both fresh and frozen cycles, thereby enabling early optimization of patient transfer protocols and medication regimens to achieve improved clinical outcomes.
Supplementary Material
Supplementary Material 1.
Supplementary Material 2.
Supplementary Material 3.
Supplementary Material 4.
Supplementary Material 5.
Supplementary Material 6.