Immunological risk factors for recurrent implantation failure using a deep learning model: a multicenter retrospective cohort study.

OA: gold CC-BY-NC-ND-4.0

Results

Overall, 2,463 RIF patients were included in the study (Fig. 1). The median age was 24 years (range: 18–40). Table 1 summarizes patient characteristics and compares them between the live birth and implantation failure groups. The two groups differed significantly in all characteristics except TSH (p = 0.47), CD3 (p = 0.057), CD3/CD4 (p = 0.17), anti-dsDNA (p = 0.26), and embryo quality (p = 0.36) (Table 1).

Table 1 Comparison of characteristics between live birth and implantation failure groups.

| Characteristics | Normal range or threshold | Overall, N = 2463 | Live birth, N = 1616 | Implantation failure, N = 847 | p-value |
|---|---|---|---|---|---|
| Age (years) | 18–45 | 25.24 ± 5.98 | 26.81 ± 6.55 | 22.23 ± 2.90 | p < 0.01 |
| BMI (kg/m²) | < 30 | 27.78 ± 5.34 | 29.24 ± 5.93 | 25.01 ± 1.99 | p < 0.01 |
| Vitamin D3 (ng/mL) | 30–100 | 34.99 ± 14.88 | 34.23 ± 14.76 | 36.43 ± 15.02 | p < 0.001 |
| TSH (mIU/L) | 0.5–5.0 | 3.25 ± 1.34 | 3.26 ± 1.35 | 3.22 ± 1.31 | p = 0.47 |
| Th1/Th2 ratio | < 10 | 14.88 ± 9.32 | 18.64 ± 9.15 | 7.71 ± 3.80 | p < 0.01 |
| T-cell (CD3+ cells) (cells/µl) (%) | < 85 | 76.47 ± 1.49 | 76.41 ± 1.50 | 76.53 ± 1.48 | p = 0.057 |
| CD16+ cells (cells/µl) | < 19 | 14.26 ± 2.47 | 14.34 ± 2.46 | 14.10 ± 2.50 | p = 0.02 |
| B-cells (CD19+ cells) (cells/µl) (%) | < 14 | 10.70 ± 2.30 | 7.98 ± 2.37 | 10.75 ± 2.32 | p < 0.01 |
| CD56+ cells (cells/µl) | < 19.5 | 14.76 ± 2.59 | 14.83 ± 2.59 | 14.61 ± 2.59 | p = 0.04 |
| Helper T-cells (CD3/CD4+ cells) (cells/µl) (%) | < 44 | 31.93 ± 3.78 | 32.04 ± 3.76 | 31.82 ± 3.79 | p = 0.17 |
| Cytotoxic T-cells (CD3/CD8+ cells) (cells/µl) (%) | < 35 | 43.64 ± 3.78 | 43.47 ± 3.75 | 43.81 ± 3.82 | p = 0.03 |
| NK cells (CD16/56+ cells) (cells/µl) (%) | < 14 | 9.74 ± 2.33 | 9.83 ± 2.33 | 9.56 ± 2.33 | p = 0.006 |
| Antiphospholipid antibody (IgG, IgM) (µg/ml) | < 10 | 10.95 ± 7.39 | 13.76 ± 6.97 | 6.52 ± 5.66 | p < 0.01 |
| Anticardiolipin antibody (IgG, IgM) (µg/ml) | < 10 | 9.87 ± 6.50 | 12.14 ± 6.84 | 6.07 ± 3.41 | p < 0.01 |
| Anti b2 glycoprotein antibody (IgG, IgM) (µg/ml) | < 20 | 13.99 ± 8.10 | 16.12 ± 8.66 | 10.27 ± 5.26 | p < 0.01 |
| Anti TTG antibody (IgG, IgA) (U/ml) | < 11 | 7.63 ± 5.23 | 8.89 ± 5.77 | 5.25 ± 2.75 | p < 0.01 |
| ANA | < 1.2 RFU | 1.00 ± 0.77 | 0.86 ± 0.67 | 1.25 ± 0.86 | p < 0.01 |
| Anti ds-DNA (IU/mL) | < 10 | 6.40 ± 4.78 | 6.32 ± 4.63 | 6.55 ± 5.02 | p = 0.26 |
| Anti TPO antibodies (U/mL) | < 30 | 17.05 ± 12.23 | 13.14 ± 7.60 | 24.16 ± 15.43 | p < 0.01 |
| Anti TG antibodies (IU/mL) | < 116 | 64.23 ± 40.67 | 56.28 ± 30.78 | 78.70 ± 51.16 | p < 0.01 |
| Embryo quality (Grade A) | A-B | 77.3% (1906/2463) | 77.9% (1260/1616) | 76.2% (646/847) | p = 0.36 |

BMI: Body mass index; TSH: Thyroid stimulating hormone; Th: T-helper cell; TTG: Tissue transglutaminases; ANA: Antinuclear antibody; ds-DNA: double-stranded DNA; TPO: Thyroid peroxidase; and TG: thyroglobulin. All data are reported as mean ± SD. P < 0.05 for RIF groups in comparison to control group. A p-value < 0.05 was considered significant.

To ensure that the training and validation sets were statistically comparable across these characteristics, we used a grid-based process in which the data were shuffled (randomized) and divided, yielding a validation set whose characteristics closely matched those of the training set. Univariate and multivariate logistic regression identified several variables as independent risk factors. In the multivariate analysis, the Th1/Th2 ratio and most autoantibodies (all except ANA, with antiphospholipid antibody borderline at p = 0.05) were significant risk factors, as shown in Table 2.

Table 2 Univariable and multivariable logistic regression analysis of RIF predictors.
| Characteristics | Univariate OR (95% CI) | p-value | Multivariate OR (95% CI) | p-value |
|---|---|---|---|---|
| Age | 1.19 (1.16–1.22) | p < 0.001 | 0.96 (0.91–1.01) | p = 0.20 |
| BMI | 1.26 (1.23–1.30) | p < 0.001 | 1.03 (0.97–1.09) | p = 0.25 |
| Vitamin D3 | 0.99 (0.98–0.99) | p < 0.001 | 0.99 (0.98–1.00) | p = 0.09 |
| TSH | 1.02 (0.96–1.09) | p = 0.41 | 1.03 (0.93–1.14) | p = 0.49 |
| Th1/Th2 ratio | 1.30 (1.27–1.33) | p < 0.001 | 1.23 (1.18–1.27) | p < 0.001 |
| CD3+ cells | 0.91 (1.27–1.33) | p < 0.001 | 0.86 (0.51–1.46) | p = 0.59 |
| CD16+ cells | 1.03 (1.00–1.06) | p = 0.04 | 0.92 (0.71–1.19) | p = 0.55 |
| CD19+ cells | 0.98 (0.95–1.01) | p = 0.37 | 0.97 (0.90–1.03) | p = 0.38 |
| CD56+ cells | 1.02 (0.99–1.05) | p = 0.07 | 1.03 (0.84–1.26) | p = 0.74 |
| CD3/CD4+ cells | 0.90 (0.89–0.91) | p < 0.001 | 1.03 (0.61–1.74) | p = 0.89 |
| CD3/CD8+ cells | 0.91 (0.89–0.90) | p < 0.001 | 1.01 (0.60–1.70) | p = 0.94 |
| CD16/56+ cells | 1.04 (1.01–1.08) | p = 0.01 | 1.06 (0.89–1.25) | p = 0.48 |
| Antiphospholipid antibody (IgG, IgM) | 1.08 (1.07–1.09) | p < 0.001 | 1.03 (0.99–1.06) | p = 0.05 |
| Anticardiolipin antibody (IgG, IgM) | 1.11 (1.10–1.13) | p < 0.001 | 1.07 (1.03–1.10) | p < 0.001 |
| Anti b2 glycoprotein antibody (IgG, IgM) | 1.06 (1.05–1.07) | p < 0.001 | 1.02 (1.00–1.05) | p = 0.03 |
| Anti TTG antibody (IgG, IgA) | 1.17 (1.15–1.20) | p < 0.001 | 1.09 (1.05–1.13) | p < 0.001 |
| ANA | 0.48 (0.43–0.54) | p < 0.001 | 1.15 (0.89–1.50) | p = 0.26 |
| Anti ds-DNA | 0.97 (0.95–0.98) | p < 0.001 | 1.19 (1.15–1.23) | p < 0.001 |
| Anti TPO antibodies | 0.91 (0.91–0.92) | p < 0.001 | 0.97 (0.96–0.99) | p < 0.01 |
| Anti TG antibodies | 0.98 (0.98–0.98) | p < 0.001 | 0.99 (0.99–0.99) | p < 0.01 |
| Embryo quality | 0.90 (0.74–1.10) | p = 0.33 | 1.07 (0.78–1.47) | p = 0.65 |

OR: Odds ratio; BMI: Body mass index; TSH: Thyroid stimulating hormone; Th: T-helper cell; TTG: Tissue Transglutaminases; ANA: Antinuclear antibody; ds-DNA: double-stranded DNA; TPO: Thyroid peroxidase; and TG: thyroglobulin. A p-value < 0.05 was considered significant.
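The odds ratios and intervals in Table 2 are derived from fitted logistic regression coefficients. As a minimal sketch, assuming a Wald-type interval (the text does not state which interval method was used), a coefficient β with standard error SE gives OR = exp(β) and a 95% CI of exp(β ± 1.96·SE). The function name and the example coefficient are hypothetical and are not taken from the study.

```python
import math

def odds_ratio_ci(beta, se, z=1.96):
    """Convert a logistic-regression coefficient into an odds ratio with a Wald CI.

    beta: fitted log-odds coefficient for one predictor
    se:   standard error of that coefficient
    z:    normal quantile (1.96 for a two-sided 95% interval)
    """
    odds_ratio = math.exp(beta)
    lower = math.exp(beta - z * se)
    upper = math.exp(beta + z * se)
    return odds_ratio, lower, upper

# Hypothetical coefficient and standard error, chosen only for illustration.
or_, lo, hi = odds_ratio_ci(beta=0.207, se=0.019)
print(f"OR = {or_:.2f} (95% CI {lo:.2f}-{hi:.2f})")
```

An interval that excludes 1 corresponds to a predictor whose association is significant at the 5% level under this approximation.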
Moreover, the dataset exhibited varying levels of missingness across several immunological and cellular biomarkers, with the highest proportion (29%) observed in the T-cell panel features CD3, CD3/CD4, and CD3/CD8 (Table 3). This degree of missingness in key immune variables may limit the model’s robustness and generalizability in real-world clinical settings, where data completeness is less controlled.

Table 3 Feature missingness and uniqueness analysis.

| Feature | Missing count | Missing % | Unique values |
|---|---|---|---|
| CD3/CD4 | 713 | 29 | 326 |
| CD3/CD8 | 713 | 29 | 332 |
| CD3 | 713 | 29 | 52 |
| Anti Phospho | 277 | 11 | 1352 |
| Anti Cardio | 195 | 8 | 1403 |
| Anti B2 Glyco | 138 | 6 | 1572 |
| ANA | 95 | 4 | 270 |
| anti ds-DNA | 95 | 4 | 1122 |
| Anti-TG | 75 | 3 | 2176 |
| Anti TPO | 75 | 3 | 1719 |
| Anti TTG | 16 | 1 | 1276 |
| CD16 | 10 | 0 | 252 |
| CD56 | 10 | 0 | 261 |
| CD19 | 10 | 0 | 230 |
| CD16/CD56 | 10 | 0 | 234 |
| Th1:Th2 Ratio | 0 | 0 | 337 |
| TSH | 0 | 0 | 453 |
| VitD | 0 | 0 | 51 |
| Age | 0 | 0 | 23 |
| BMI | 0 | 0 | 21 |
| embryo quality | 0 | 0 | 2 |
| R-LiveBirth | 0 | 0 | 2 |

According to the confusion matrix (Fig. 2A), our model demonstrated strong classification performance, with 284 true positives (TP) and 145 true negatives (TN) against 27 false positives (FP) and 37 false negatives (FN). This distribution indicates that the model distinguishes between positive and negative outcomes, with balanced performance across sensitivity and specificity. The TabNet model achieved an AUROC of 0.952, accuracy of 0.874, sensitivity of 0.885, specificity of 0.855, precision of 0.919, and F1 score of 0.902 on the validation set (see Fig. 2A).

Fig. 2 TabNet model performance: (A) confusion matrix, (B) precision-recall curve, (C) ROC curve for the training set, (D) ROC curve for the validation set.
Abbreviations: TP = true positive, FP = false positive, FN = false negative, TN = true negative.

The precision-recall curve (Fig. 2B) highlights the model’s high reliability in positive classification. Precision remains nearly perfect (close to 1.00) throughout most of the recall range, only beginning to decline modestly after recall exceeds approximately 0.90. This indicates that the model maintains strong confidence in its positive predictions even as it captures a high proportion of true positives. The eventual drop in precision near maximum recall reflects the typical trade-off in which capturing the last few positives introduces a small number of false positives. Overall, the curve reflects a highly performant classifier well-suited for applications in which precision in identifying true positives is critical.

The ROC curves for both the training and validation sets (Fig. 2C and D) demonstrate strong model performance in distinguishing between the positive and negative classes, with AUROC values of 0.988 on the training set and 0.952 on the validation set. The relatively small gap between training and validation AUROC suggests that the model generalizes well and does not suffer from significant overfitting. Nonetheless, additional external validation on independent cohorts would further confirm the model’s robustness and reliability in real-world settings. The evaluation metrics are summarized in Table 4.

Table 4 Performance of TabNet model with 95% confidence intervals.

| Metric | Mean | 95% CI |
|---|---|---|
| Accuracy | 0.870 | 0.840–0.901 |
| Precision | 0.913 | 0.877–0.942 |
| Sensitivity | 0.885 | 0.848–0.920 |
| Specificity | 0.843 | 0.787–0.893 |
| F1-score | 0.898 | 0.871–0.922 |
| AUC | 0.946 | 0.928–0.963 |

AUC: Area under the curve.
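As a worked check, the threshold-based metrics can be recomputed from the four confusion-matrix counts reported for Fig. 2A (TP = 284, TN = 145, FP = 27, FN = 37). The sketch below uses the standard textbook definitions; the function name is hypothetical and this is not the authors' evaluation code.

```python
def classification_metrics(tp, tn, fp, fn):
    """Standard binary-classification metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)                 # recall, true-positive rate
    specificity = tn / (tn + fp)                 # true-negative rate
    precision = tp / (tp + fp)                   # positive predictive value
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
    }

# Counts as reported for the validation-set confusion matrix (Fig. 2A).
metrics = classification_metrics(tp=284, tn=145, fp=27, fn=37)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts the point estimates come out at roughly 0.870 (accuracy), 0.913 (precision), 0.885 (sensitivity), 0.843 (specificity), and 0.899 (F1), in line with the means in Table 4.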
Calibration plots (Fig. 3A and B) illustrate the alignment between the model’s predicted probabilities and the observed outcomes following post-hoc isotonic calibration. On the training set (Fig. 3A), the model showed slight overfitting at lower probabilities but remained well aligned above a predicted probability of 0.4, with most points lying close to the diagonal. This is expected with isotonic calibration on internal data, as the non-parametric fit may capture local patterns more tightly. In contrast, the validation set (Fig. 3B) displayed near-perfect calibration across the full probability range, with the curve closely tracking the ideal diagonal line. This indicates that the calibrated model provides probability estimates suitable for clinical interpretation and decision-making. The excellent alignment supports the reliability of risk predictions, particularly in threshold-based applications such as patient stratification or treatment decision support.

Moreover, in the training calibration curve (Fig. 3A), the apparent deviation below a predicted probability of 0.4 reflects data sparsity rather than true miscalibration. Very few cases were assigned probabilities in this range, which exaggerates small differences between predicted and observed outcomes. This apparent underestimation therefore represents sampling noise rather than a systematic bias. As shown in the validation calibration plot (Fig. 3B), the model aligns closely with the ideal diagonal, confirming that predicted probabilities are well calibrated across the full range when applied to unseen data.

Fig. 3 Calibration plots: training set (A) showed mild overestimation (intercept = 0.23, slope = 0.76) with some miscalibration in the mid-probability range but strong discrimination (C-statistic = 0.97); validation set (B) showed improved calibration (intercept = 0.25, slope = 0.89) with excellent discrimination (C-statistic = 0.97).

Feature importance was extracted from TabNet’s learned feature selection masks and aggregated to determine the relative contribution of each biomarker. The overall importance of the features when all input variables were available is shown in Fig. 4. The most important feature was age, followed by Th1/Th2 ratio, BMI, anti-TPO, ANA, anti-dsDNA, anti-TTG, CD16, CD3, and anti-cardiolipin. To preserve accuracy and AUROC, we did not exclude any features.

Fig. 4 Feature importance (lambda plots): relative contribution of each variable to the model’s prediction of pregnancy outcomes; x-axis reflects importance score, with higher values indicating greater influence on decision-making.

Nevertheless, feature importance is less interpretable with deep learning models such as TabNet because these models often capture interactions between features, making the importance of each feature dependent on the context or combination of other features.
Additionally, these models (e.g., neural networks) use multiple layers of non-linear transformations, making it difficult to isolate the contribution of any single feature. At first glance, some features such as smoking, CD16/CD56, CD3/CD8, vitamin D, anti-TG, CD19, and others appeared less important. However, we undertook various tests with different combinations of features ranked by importance (Table S2). When all 23 biomarkers are available at inference, the model can indeed perform well with only about six inputs. Nonetheless, data are often missing in practice, and in those cases the low-importance features play a key role in allowing the deep learning model to still perform well.

The correlation heatmap (Fig. 5) summarizes the relationships among all biochemical, immunological, and clinical variables measured in the RIF cohort. Several distinct clusters are apparent. The first cluster comprises anti-phospholipid antibodies, anti-cardiolipin, anti-B2 glycoprotein, anti-TTG, anti-TPO, anti-TG, and ANA, whose moderate to strong positive correlations indicate a degree of co-activation within the autoimmune antibody network.

Fig. 5 Correlation matrix among candidate predictors in the RIF cohort. Positive and negative Pearson correlations are denoted by red and blue shades, respectively. There are clear correlation clusters between lymphocyte subpopulations (e.g., CD3, CD4, CD8, CD16, CD19, and CD56) and autoimmune antibody markers (e.g., Anti-Phospholipid, Anti-Cardiolipin, Anti-B2 Glycoprotein, and ANA). Conversely, there are weak to negligible correlations between immunological parameters and biochemical and lifestyle factors such as age, BMI, and embryo quality.

The lymphocyte subpopulation markers (CD3, CD3/CD4, CD3/CD8, CD16, CD56, CD19, and CD16/CD56) form a second, well-defined block. The strong correlations between these variables reflect the biological coherence of the cellular immune profile. The Th1:Th2 ratio correlates negatively with certain NK markers; both a pro-inflammatory Th1 shift and excessive NK-cell activity have been linked to recurrent pregnancy loss. Conversely, immunological markers exhibit weaker correlations with lifestyle and physiological variables such as age, BMI, and embryo quality. This suggests that immune-related factors in this dataset vary largely independently of general metabolic or demographic parameters. Notably, R-LiveBirth exhibits mildly negative associations with a number of immune markers, especially those that belong to the NK/T-cell cluster.

As summarized in Table 5, a series of quantitative validation tests was undertaken to confirm the model’s reliability, robustness, and clinical interpretability. Repeated 5 × 5 stratified cross-validation demonstrated excellent internal discrimination with a mean AUROC of 0.952 ± 0.01 and minimal variance, consistent with the single hold-out result and showing no indication of overfitting 30, 31. A permuted-label sanity check produced an AUROC of 0.570, confirming that performance collapsed to chance when labels were randomly reassigned, and thereby excluding any feature-to-label leakage or data contamination 32.
Calibration testing returned a Brier score of 0.036, well below the 0.07 threshold, with the calibration curve closely following the ideal diagonal and showing only mild over-prediction at low probability values. Together these metrics indicate that the model’s probabilistic outputs are well calibrated and reliable for clinical interpretation 33. Each diagnostic yields a single quantitative measure rather than a continuous relationship; therefore, the results are reported numerically rather than through graphical representation.

Table 5 Model performance & validation tests.

| Test name | Purpose of the test | Pass criterion (pre-specified) | Observed outcome | Verdict/commentary |
|---|---|---|---|---|
| Repeated 5 × 5 stratified cross-validation | Estimates internal discrimination and variance; exposes classic over-fitting | Mean AUROC within 0.03 of single hold-out and SD ≤ 0.01 | Mean 0.954 ± 0.01 (range 0.944–0.964); only 0.0123 below original 0.952 | Pass. Low variance, no indication of over-fit |
| Permuted-label sanity check | Detects any feature-to-label leakage (IDs, timestamps, future info) | AUROC with shuffled labels ≈ 0.50 ± 0.02 | 0.519 | Pass. Model collapses to chance → no leakage |
| Calibration curve & Brier score | Assesses probability calibration for clinical reliability | Brier ≤ 0.07 and calibration line near identity | Brier 0.036; curve mildly over-predicts at low risk but tracks diagonal overall | Pass. Well within threshold; optional isotonic scaling could fine-tune low-prob band |
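Two of the quantities used in Table 5 reduce to short formulas, sketched here in plain Python under stated assumptions. The Brier score is the mean squared difference between predicted probability and the 0/1 outcome, and the AUROC is the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. The function names and toy values are hypothetical. A full permuted-label check would additionally retrain the model on shuffled labels and expect an AUROC near 0.5, which this sketch does not attempt.

```python
def brier_score(probs, labels):
    """Mean squared difference between predicted probability and 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)

def auroc(scores, labels):
    """Probability that a random positive is scored above a random negative.

    Ties count as half. This pairwise form is O(n_pos * n_neg), which is
    fine for small illustrative inputs.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy predictions, invented for illustration only.
probs = [0.9, 0.1, 0.8, 0.3]
labels = [1, 0, 1, 0]
print(brier_score(probs, labels))
print(auroc(probs, labels))
```

Lower Brier scores indicate better-calibrated and sharper probabilities, while an AUROC of 0.5 corresponds to chance-level ranking.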

Materials

The data were collected from fertility centers in Tehran, Shiraz, Ardebil, Mashhad, and Tabriz between December 2014 and January 2025. Each clinic provided its data set after internally removing any patient identifiers. We confirmed that each clinic used the same standardized testing procedures in order to avoid heterogeneity in the data. The research team then consolidated the assembled data in preparation for analysis. Overall, the final data set consisted of 2,463 RIF patients.

The latest ESHRE guidelines were used to select patients. We included women aged between 18 and 45 years who had experienced RIF meeting the ESHRE diagnostic criteria for different age groups 2. We used strict exclusion criteria as follows: (1) Chromosomal abnormalities in either partner or in the products of conception from the index pregnancy (as confirmed by karyotyping or genetic testing), (2) Abnormal pregnancy outcomes such as ectopic pregnancy or hydatidiform mole, (3) Loss to follow-up, (4) Endocrine dysfunction such as thyroid dysfunction (including patients receiving thyroid medications) and hyperprolactinemia, (5) Chronic endometritis, (6) Polyps, (7) Intrauterine adhesion, (8) Endometriosis, (9) Hydrosalpinx, (10) Adenomyosis, (11) Rheumatic diseases, (12) Uterine anomalies, (13) Uterine fibroids, (14) Abnormal ovarian reserve, (15) Müllerian abnormalities, (16) Inherited/acquired thrombophilias, and (17) Polycystic ovarian syndrome (PCOS). In short, all known-cause RIF cases were excluded, focusing only on unexplained or idiopathic RIF without significant reproductive abnormalities. Figure 1 displays the patient selection and workflow diagram.

Fig. 1 Workflow and patient selection: strict exclusion criteria were applied to remove patients with anatomical or gynecological anomalies.
All methods were carried out in accordance with relevant guidelines and regulations. This retrospective study used fully de-identified datasets provided by participating clinics, each of which removed all patient identifiers prior to data sharing. The consolidated dataset was used solely for research purposes to develop a deep learning model. Informed consent was obtained from all participants and/or their legal guardian(s) in accordance with institutional and national ethical standards. Since this was a retrospective study, informed consent had been routinely obtained during standard clinical care prior to data use for research purposes. The study protocol was reviewed and approved by the Research Ethics Committee of Tabriz University of Medical Sciences, under approval number IR.TBZMED.REC.1404.198.

The main pregnancy outcome in this study was “live birth,” defined as the delivery of an infant beyond 24 weeks of gestation, which was marked as a “positive outcome”. “Implantation failure” was defined as the transfer of a good-quality embryo into the uterus that did not establish a pregnancy, where pregnancy is shown by ultrasound visualization of an intrauterine gestational sac 18. In addition to implantation failure, other pregnancy complications and conditions that did not result in a live birth, such as miscarriage and ectopic pregnancy, were marked as a “negative outcome.” Participants were followed up through clinic visits or by phone call.

Candidate variables were classified into five categories: (1) Demographic characteristics: age, BMI; (2) Biochemical tests: Vitamin D3, TSH; (3) Immune assays: Th1/Th2 ratio, CD3, CD16, CD19, CD56, CD3/CD4, CD3/CD8, CD16/56; (4) Measurement of autoantibodies: Antiphospholipid antibody (IgG, IgM), Anticardiolipin antibody (IgG, IgM), Anti-β2 glycoprotein antibody (IgG, IgM), Anti-TTG antibody (IgG, IgA), ANA, Anti-dsDNA, Anti-TPO, Anti-TG; and (5) Embryo quality (Gardner grading: A or B) 19.
All these tests were performed at least 12 weeks prior to the embryo transfer. This timing was chosen as part of the clinical work-up to aid in diagnosis and guide treatment decisions for these participants. We used ESHRE criteria to determine abnormal tests for thyroid function and autoantibodies. The reference ranges supplied by the testing laboratory were used to define abnormal values for immune assays. We also used the International APS Classification Criteria (Updated Sydney Criteria) for antiphospholipid antibody syndrome (APS) evaluation, including anticardiolipin antibodies, anti-β2 glycoprotein, and lupus anticoagulant 20.

Conventional IVF was used for patients without severe male factor infertility, while intracytoplasmic sperm injection (ICSI) was used for cases with severe male factor infertility or prior fertilization failure. Embryos were cultured under standard laboratory conditions until they reached the blastocyst stage (Day 5 or 6). Depending on the stage of embryo development, embryo transfers (ETs) were carried out on Day 3, 5, or 6. High-quality embryos were defined as day 3 embryos with 7–9 cells that were rated as I or II. High-quality blastocysts were defined as those at stage ≥ 3 with at least one inner cell mass (ICM) or trophectoderm (TE) graded as B or higher 19.

There were two types of embryo transfers: fresh and frozen-thawed (FET). The choice between fresh transfer and FET was based on ovarian response, endometrial thickness, and clinical indication. For frozen transfers, a hormone replacement cycle or a natural cycle was used, depending on the situation. These procedures were intended to keep each participant’s endometrial and endocrine environments similar. The hormonal differences between fresh and FET cycles were unlikely to have introduced systematic bias in immune modulation or model performance because immunotherapy was customized based on each patient’s immune profile rather than the type of ET. Implantation was defined as an increase in serum β-HCG levels.
Tacrolimus 8 and IVIG 9 , 21 treatment protocols were previously published. In short, patients received 16 days of tacrolimus medication (Prograf, Astellas Pharma Ltd, Staines, UK) prior to an endometrial biopsy during the first treatment cycle. Similarly, we administered tacrolimus for 16 days (two days prior to ET and up to 14 days following ET) in the second treatment cycle. Tacrolimus dosage was modified based on the level of Th1/Th2 elevation 8 . Regarding IVIG therapy, 400–2000 mg/kg of IVIG was injected 3 days before ET 9 . While immunomodulatory therapy was administered to all participants, the specific type of treatment was chosen based on the immune profiles of each patient. For instance, IVIG was administered to patients with broader immune abnormalities or increased NK-cell activity, while tacrolimus was administered to patients with elevated Th1/Th2 ratios. A deep learning-based classifier was selected to capture complex, non-linear relationships within the high-dimensional biomarker dataset. Among several architectures trialed in earlier unpublished experiments, TabNet emerged as the most effective, demonstrating superior performance in both accuracy and generalizability. TabNet, a decision-aware model that incorporates sequential attention mechanisms and feature selection within its architecture, is particularly well-suited to clinical datasets where interpretability and robustness are critical 22 , 23 . Unlike traditional feedforward networks, TabNet can inherently model tabular data without requiring extensive preprocessing or feature engineering. It also supports native handling of missing values through its sparse attention mechanisms and learned embeddings, making it a favorable choice in biomedical domains where missingness itself may carry diagnostic significance 23 – 25 . Model training was conducted using a fixed random seed to ensure reproducibility. 
The dataset was fed in mini batches of 128 samples, with training proceeding for a maximum of 1,000 epochs. Early stopping was employed with a patience threshold of 40 epochs, allowing the training process to halt once validation performance plateaued. Model evaluation was carried out on both training and validation sets at each epoch, with accuracy as the primary evaluation metric. Parallel data loading across four worker threads was utilized to optimize performance, and the final model was saved upon completion for downstream analysis. There were several steps taken to prepare the dataset to ensure our data was balanced and produced an expected outcome. To ensure a complete dataset for model training, missing values were addressed using a univariate mean imputation strategy, wherein each missing entry was replaced by the mean of the corresponding feature across the dataset. This method assumes data are missing at random and does not incorporate feature interdependencies, but it remains a widely adopted and computationally efficient approach for handling incomplete clinical data 26 . The fully imputed dataset was subsequently used for training a deep learning model based on the TabNet architecture, which is designed for interpretable learning from tabular biomedical inputs 23 . Binary variables, including Live Birth (encoded as 0 for negative and 1 for positive outcomes) and Embryo Quality (with ‘A’ mapped to 0 and ‘B’ to 1), were converted into numeric representations to ensure compatibility for model training. The dataset was randomly divided into training and validation sets in a ratio of 80/20. Stratification was applied to preserve the proportional distribution of outcome classes in both subsets. To ensure consistency in experimental results, the randomization procedure was fixed using a reproducible seed. This separation allows for an unbiased evaluation of the model’s capacity to generalize beyond the training data. 
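The preparation steps above (mean imputation, then a seeded, stratified 80/20 split) can be sketched in plain Python. This is an illustration under stated assumptions rather than the authors' pipeline: the function names, the use of `None` to mark a missing value, the seed, and the toy rows are all hypothetical. One deliberate difference is that the sketch computes feature means from the training rows only and reuses them for the validation rows, a common precaution against information leaking from validation data, whereas the text describes means taken across the dataset.

```python
import random
from statistics import fmean

def stratified_split(labels, test_fraction=0.2, seed=42):
    """Return (train_idx, valid_idx) with class proportions preserved."""
    rng = random.Random(seed)              # fixed seed -> reproducible split
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    train_idx, valid_idx = [], []
    for idx in by_class.values():
        rng.shuffle(idx)
        n_valid = round(len(idx) * test_fraction)
        valid_idx.extend(idx[:n_valid])
        train_idx.extend(idx[n_valid:])
    return sorted(train_idx), sorted(valid_idx)

def column_means(rows):
    """Per-feature mean over observed (non-None) entries.

    Assumes every column has at least one observed value.
    """
    n_cols = len(rows[0])
    return [fmean(r[j] for r in rows if r[j] is not None) for j in range(n_cols)]

def impute(rows, means):
    """Replace each missing entry with the supplied mean of its feature."""
    return [[means[j] if v is None else v for j, v in enumerate(r)] for r in rows]

# Toy rows (hypothetical values); None marks a missing measurement.
rows = [[31.0, 12.0], [None, 15.0], [33.0, None], [35.0, 9.0], [30.0, 11.0],
        [32.0, 14.0], [29.0, None], [34.0, 10.0], [36.0, 13.0], [28.0, 16.0]]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]

train_idx, valid_idx = stratified_split(labels)
train_rows = [rows[i] for i in train_idx]
means = column_means(train_rows)           # fitted on training rows only
train_X = impute(train_rows, means)
valid_X = impute([rows[i] for i in valid_idx], means)
print(len(train_X), len(valid_X))
```

Calling `stratified_split` again with the same seed returns the same indices, which is what makes the split reproducible.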
The training data showed a notable skew in live birth results (1,286 positive vs. 684 negative), which can bias the model toward the majority class. To counter this, we applied the Synthetic Minority Over-sampling Technique (SMOTE) 27, which constructs new minority class examples by interpolating between existing ones. This method improved balance without simply duplicating data, helping the model learn the underlying patterns in less common outcomes more effectively.

To ensure that the model’s predicted probabilities accurately reflected the true likelihood of live birth, we applied post-hoc probability calibration using isotonic regression. While the classifier demonstrated strong discriminatory performance (AUROC > 0.91), its raw output probabilities deviated from perfect calibration, particularly in the mid-probability range. Poorly calibrated models can be misleading in clinical contexts where decisions rely not only on classification but also on risk estimation. Isotonic regression, a non-parametric calibration method that fits a piecewise-constant, non-decreasing function to the model’s probability estimates, has been shown to significantly improve reliability without compromising discrimination 28, 29. After calibration, the model exhibited excellent probability alignment, as demonstrated by a well-fitting validation calibration curve and a reduced Brier score, supporting its use for threshold-based and individualized clinical decision-making.

Descriptive statistics (mean ± standard deviation or median with interquartile range for continuous variables, and percentage distributions for categorical variables) were used to summarize patient characteristics. Comparisons between the live birth and non-live-birth groups were performed via t-tests, Mann–Whitney U tests, or chi-square tests, as appropriate. All analyses were conducted in R software. A p-value < 0.05 was considered statistically significant.
Moreover, univariate and multivariate logistic regression analyses were performed on the training group to generate odds ratios. The evaluation metrics for the model comprised the area under the receiver operating characteristic curve (AUROC), the ROC curve itself, calibration plots, and the confusion matrix.
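The isotonic calibration step described in this section can be illustrated with the pool-adjacent-violators algorithm (PAVA), the standard way to fit a least-squares non-decreasing function. The sketch below is a simplified, self-contained version and not the authors' implementation: the function names and toy data are hypothetical, scores are assumed to be distinct, and new scores are mapped with a step-function lookup (library implementations often interpolate between fitted points instead).

```python
import bisect

def pava(values):
    """Pool-adjacent-violators: least-squares non-decreasing fit to `values`."""
    blocks = []                                  # each block is [sum, count]
    for v in values:
        blocks.append([v, 1])
        # Merge backwards while the previous block's mean exceeds the last one's.
        while (len(blocks) > 1 and
               blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fitted = []
    for s, c in blocks:
        fitted.extend([s / c] * c)
    return fitted

def fit_isotonic_calibrator(scores, labels):
    """Sort (score, label) pairs by score and fit PAVA to the 0/1 labels."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    xs = [scores[i] for i in order]
    ys = pava([float(labels[i]) for i in order])
    return xs, ys

def calibrate(score, xs, ys):
    """Step lookup: calibrated value of the largest fitted score <= score."""
    idx = bisect.bisect_right(xs, score) - 1
    return ys[max(idx, 0)]

# Toy raw scores and outcomes, invented for illustration only.
raw_scores = [0.2, 0.4, 0.5, 0.7, 0.9]
outcomes = [0, 1, 0, 1, 1]
xs, ys = fit_isotonic_calibrator(raw_scores, outcomes)
print(ys)
print(calibrate(0.45, xs, ys))
```

Because the fitted mapping is monotone, it can change the predicted probabilities without reversing the ranking of any two cases, which is why discrimination is largely preserved.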

Conclusion

This study demonstrates the success of a deep learning approach (TabNet) for predicting live birth among RIF patients using immune biomarkers and embryo quality data. With an AUROC of 0.952 and an accuracy of 87.4%, the model underscores the pivotal role of age, BMI, and immune dysfunction (particularly Th1/Th2 imbalance, autoantibodies, and T cell subsets) in RIF. These findings advocate for early and targeted immune profiling in RIF, guiding personalized immunomodulatory therapies to enhance implantation success in IVF. Further large-scale external validation and randomized trials are warranted to confirm these findings and to explore how best to integrate AI-driven predictions into routine fertility care.

Discussion

To the best of our knowledge, this is the first study that investigates these variables - especially the immunological profile - as risk factors in a RIF population with no anatomical or gynecological abnormalities. Previous studies paid less attention to these parameters. Instead, they primarily identified ovarian reserve and anatomical abnormalities as key contributors. However, as some patients may present with none of these conditions, immune dysregulation may play a critical role in implantation failure. Therefore, these patients are candidates for further work-up by clinical immunologists, specifically evaluating immune markers such as lymphocyte subsets, autoantibodies, and thyroid antibodies. We used these variables to construct our model, guided by both literature and clinical experience, which led to a model with an AUROC of 0.952 and an accuracy of 87.4%. Immunoassays were conducted prior to the start of treatment, even though immunomodulatory therapy was administered to every participant. Immune parameters’ continued use as powerful predictors in the model implies that immune dysregulation at baseline affects reproductive outcomes even after treatment, potentially due to inconsistent or heterogeneous treatment responses. We showed that in this population, risk factors such as age, Th1/Th2 ratio, BMI, anti-TPO, ANA, anti-dsDNA, anti-TTG, CD16 (NK cells, neutrophils, monocytes, and macrophages), CD3 (T-cells), and anti-cardiolipin are among the most important predictors for identifying patients likely to have immune-related dysfunction and who may benefit from immunotherapy. Advanced age is a recognized risk factor for implantation failure, and the definition of RIF was categorized by age in the 2023 ESHRE guidelines 2 , 15 . Women of younger age generally possess a greater proportion of euploid embryos, which are essential for successful implantation 2 . The impact of BMI on RIF is complex and multifaceted. 
It can influence immune response, ovarian reserve, and overall reproductive health. BMI affects both peripheral and endometrial immune cells during the implantation window and has been shown to reduce NK cells and macrophages, thereby impairing implantation success 34 . Moreover, a systematic review indicated that elevated BMI adversely influences live birth rates after frozen embryo transfer, although it does not significantly alter implantation rates 35 . Dominance of Th2 cells is necessary for a healthy pregnancy 8 . Reproductive dysfunction is linked to an imbalance of Th1/Th2. As the Th1 immune response in RIF patients increases, so does the Th1/Th2 ratio in peripheral blood 36 . Tumour necrosis factor-alpha (TNF-α) and interferon-gamma (IFN-γ), which are primarily released by Th1 cells, are examples of pro-inflammatory cytokines found to be elevated in the peripheral blood of patients with RIF 8 , 37 . In this regard, Cai et al. 38 measured lymphocyte subsets in peripheral blood and used them as potential biomarkers to predict RIF. According to their findings, T cells, Tregs, T follicular helper 1 (Tfh1), Tfh2, Tfh17, NK cells, and early inhibitory NK cells can all serve as predictive biomarkers and may play significant regulatory roles in embryo implantation. They employed logistic regression rather than AI-based models, which yielded an AUROC of 0.900 when using a combination of three biomarkers: Treg, Tfh17, and early inhibitory NK cells 38 . The underlying mechanisms linking autoantibodies to implantation failure are multifaceted, involving immunological, metabolic, and adhesion-related pathways. For instance, patients with autoantibodies often exhibit an increased rate of biochemical pregnancies, and women with anti-cardiolipin antibodies have shown morphological abnormalities in embryos. Although correlations exist, there is no definitive evidence that these antibodies are the direct cause of implantation failure 13 . 
Moreover, thyroid autoimmunity is more prevalent in women with subfertility, largely based on elevated levels of TPO antibodies alone 11 . A clinical trial also found that seropositivity for ANAs and antiphospholipid antibodies is correlated with IVF implantation failure 39 . Anti-TTG antibodies have been associated with trophoblast damage that may impair embryo transfer and pregnancy outcomes 40 . Another study showed that in both fresh and frozen-thawed embryo transfer cycles, the ANA+/anti-dsDNA + group had the lowest rates of fertilization, implantation, and clinical pregnancy, along with the highest incidence of early miscarriage 41 . In a related study, Wang et al. 15 showed that their support vector machine (SVM) model yielded an AUROC of 0.83 (95% CI: 0.80–0.86). They also selected 32 risk factors based on literature and clinical experience. According to their findings, anti-müllerian hormone (AMH) had the greatest impact on RIF risk, while the second, third, and fourth most important risk factors were chronic endometritis, intrauterine adhesions, and BMI, respectively. In comparison, our model yielded a higher AUROC and accuracy. However, the main differences between the two studies lie in study populations and inclusion criteria, as many of their selected risk factors were among our exclusion criteria - such as intrauterine adhesions, endometriosis, hydrosalpinx, PCOS, uterine anomalies, and abnormal ovarian reserve. Ozer et al. 16 used machine learning algorithms to predict risk factors for first-trimester pregnancy loss in good-quality frozen-thawed embryo transfer (FET) cycles. Their variables were based on clinical and cycle characteristics. They found that a history of RPL, RIF, advanced maternal age, presence of PCOS, and high BMI (> 30 kg/m²) were associated with increased risk of first-trimester pregnancy loss 16 . Notably, immunological variables were not included in their study. Their random forest (RF) model yielded an AUROC of 0.766. 
Because autoimmune antibodies (anti-phospholipid antibodies, anti-cardiolipin, anti-B2 glycoprotein, anti-TTG, anti-TPO, anti-TG, and ANA) cluster together, patients who have one positive autoantibody are likely to have higher levels of the others, which is consistent with a generalized autoimmune tendency as opposed to isolated antigen-specific responses. Given that T-cell and NK-cell subsets frequently rise or fall together in immune dysregulation linked to implantation failure, the strong positive associations between them seen in the lymphocyte subpopulation cluster are to be expected. The idea that increased immune activation is associated with decreased live-birth potential is supported by the observed inverse relationship between NK activity and live-birth outcome. Overall, a biologically plausible separation of autoantibody-mediated and cell-mediated immune pathways is supported by the dataset’s correlation structure. The multifaceted character of immune dysregulation in RIF is further supported by the internal coherence and restricted overlap of both immune clusters. This correlation pattern offers a logical foundation for integrating these partially independent predictors of reproductive outcome using multivariate or machine-learning techniques, like TabNet. Identifying immune dysfunction in RIF patients before undergoing further IVF cycles could enable more personalized management. Therapies such as corticosteroids, cell therapies, IVIG, or tacrolimus have shown benefit in patients exhibiting Th1/Th2 imbalances or NK cell activity 42 , 43 . The high accuracy of our TabNet model suggests that a targeted immune workup - integrated into routine IVF evaluation - could stratify patients by risk and guide immunomodulatory interventions. More broadly, these results support the notion of dedicated reproductive immunology services where patients undergo detailed immune assessments and receive personalized treatments informed by machine learning insights.
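The AUROC values compared in this discussion (0.952, 0.900, 0.83, and 0.766) can be read as the probability that a randomly chosen positive case receives a higher predicted score than a randomly chosen negative case. The following is a minimal illustrative sketch of that pairwise definition in plain Python; the function name `auroc`, the toy labels, and the toy scores are assumptions made for illustration and are not data or code from this study.

```python
def auroc(y_true, scores):
    """Pairwise (Mann-Whitney) estimate of the area under the ROC curve.

    y_true: 0/1 labels (hypothetical coding: 1 = positive outcome).
    scores: predicted scores, higher meaning "more likely positive".
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))


# Toy example (illustrative values only, not study data).
y_true = [1, 1, 1, 0, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.7, 0.2, 0.6]
print(auroc(y_true, scores))  # 14 of 16 positive/negative pairs are ordered correctly -> 0.875
```

Because the statistic depends only on the ranking of scores, it is insensitive to the classification threshold, which is one reason AUROC values from different models are often compared. Such comparisons remain indirect when, as noted above, the study populations and inclusion criteria differ.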

Limitations

The primary limitation of this study is that it is based on a single-country dataset with internal validation only. Further limitations are the modest sample size and the retrospective design, which may restrict the generalizability of the findings. To address class imbalance, SMOTE was applied to synthetically augment the minority (negative outcome) class within the training set only. While this approach improves training stability and model discrimination, it may not fully capture the complexity of biological variability. As such, performance metrics should be interpreted with appropriate caution. Larger, multicenter studies are warranted to determine whether the model’s high AUROC is maintained across diverse patient populations and laboratory conditions. Furthermore, although TabNet provided informative feature importance scores, more granular immune profiling - such as endometrial cytokine measurement or single-cell immune analysis - could further enhance predictive accuracy and biological insight. Finally, we acknowledge that further external validation, multicenter studies, and clinical trials are needed to assess the model’s performance and generalizability, which we aim to undertake in future work.
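To make concrete what applying SMOTE "within the training set only" means, the sketch below splits toy data first and then creates synthetic minority rows by interpolating between neighbouring minority rows of the training split, leaving the held-out rows untouched. It is a simplified, SMOTE-style illustration in plain Python, not the implementation used in this study; the function names, the label coding (0 = negative outcome), the neighbour count `k`, and the toy data are all assumptions.

```python
import random


def smote_like_rows(minority, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating between a minority row
    and one of its k nearest minority neighbours (the core idea of SMOTE)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Other minority rows, sorted by squared Euclidean distance to `base`.
        others = [row for row in minority if row is not base]
        others.sort(key=lambda row: sum((a - b) ** 2 for a, b in zip(base, row)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


def balance_training_set(X_train, y_train, minority_label=0, seed=0):
    """Oversample the minority class of the TRAINING split until both classes
    have the same size. Held-out data is never passed to this function."""
    minority = [x for x, y in zip(X_train, y_train) if y == minority_label]
    n_new = (len(y_train) - len(minority)) - len(minority)
    if n_new <= 0 or len(minority) < 2:
        return list(X_train), list(y_train)
    synthetic = smote_like_rows(minority, n_new, seed=seed)
    return list(X_train) + synthetic, list(y_train) + [minority_label] * n_new


# Toy data (illustrative only): 20 rows with two features each.
rows = [[float(i), float(i % 5)] for i in range(20)]
labels = [0 if i % 3 == 0 else 1 for i in range(20)]

# Split FIRST (every 4th row is held out), then oversample the training part.
test_idx = [i for i in range(20) if i % 4 == 0]
train_idx = [i for i in range(20) if i % 4 != 0]
X_train = [rows[i] for i in train_idx]
y_train = [labels[i] for i in train_idx]
X_test = [rows[i] for i in test_idx]
y_test = [labels[i] for i in test_idx]

X_bal, y_bal = balance_training_set(X_train, y_train)
print(len(X_train), len(X_bal), len(X_test))  # 15 20 5
```

Splitting before oversampling keeps synthetic rows, which are derived from real training rows, out of the evaluation data. Oversampling before the split would let near-copies of held-out patients leak into training and could inflate the reported metrics.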

Introduction

The development of assisted reproductive technology (ART) has given hope to numerous infertile couples. However, even after multiple high-quality embryo transfers (ET), 15–20% of couples are unable to conceive 1 . The phrase “recurrent implantation failure” (RIF) is frequently used to describe failure that follows multiple attempts at in vitro fertilization (IVF) 2 . Similar to recurrent pregnancy loss (RPL), the clinical definition of RIF is not always consistent. Most definitions in use today are predicated on the quantity of transferred embryos that do not result in pregnancy 2 . RIF was defined by the European Society of Human Reproduction and Embryology (ESHRE) as implantation failure following at least three, four, and six cycles for women under 35, 35–39, and over 40 years old, respectively, who undergo IVF or intracytoplasmic sperm injection (ICSI) 2 . Although the ESHRE guidelines for RIF diagnosis stratify by age and the significance of age in the effectiveness of ART treatment is widely recognized, other risk variables are also linked to this complex condition 2 . According to ESHRE guidelines, most of these risk factors are gynecological, anatomical, and embryological conditions 2 . However, the pathogenesis of RIF is poorly understood, and there are groups of patients who, despite the absence of anatomical or endocrine abnormalities, experience RIF. Although this approach is not based on the current guidelines, a portion of these patients are referred to an immunologist for further work-up of their immune profile and to receive appropriate immune-regulating treatment. RIF, particularly unexplained RIF, is thought to be associated with immune factors 3 . Important participants at the feto-maternal interface, regulatory T-cells (Tregs) and peripheral and uterine natural killer cells (pNK, uNK cells), appear to play a role in the pathophysiology of RIF 3 , 4 . 
Additionally, this disorder is linked to an imbalance between T helper 1 (Th1) and Th2 cytokines 5 . Accordingly, RIF is closely related to changes in peripheral blood T-cell subsets and B-cells 4 , 6 . Moreover, it has been demonstrated that vitamin D supplementation enhances NK cell function, cytokine profiles, and the Th1/Th2 ratio in RIF patients 7 . For this population with immunological conditions, tacrolimus and intravenous immunoglobulin (IVIG) used as immune regulation treatment have been observed to significantly increase live birth and pregnancy rates 8 , 9 . In this group of patients, lifestyle factors such as body mass index (BMI) and stress can reduce the efficacy of IVF-ET 10 . Another etiology that may affect the rate of implantation is the embryo factor, as chromosomal abnormalities in embryos are a major contributing factor to implantation failure or miscarriage 10 . Thyroid dysfunction is one endocrine condition that might lead to implantation failure 11 . Although these risk factors are not based on a consensus perspective, RIF patients have also been studied for several autoimmune disorders or autoantibodies, including anti-thyroid and anti-transglutaminase antibodies, which may be linked to decreased pregnancy rates 12 . Other autoantibodies that may contribute to biochemical pregnancy loss include anti-cardiolipin, anti-β2 glycoprotein, anti-phospholipid, and anti-nuclear antibodies 13 . Artificial intelligence (AI) approaches have developed rapidly and are now widely used in various fields of healthcare, and reproductive medicine is no exception. Several previous studies have used AI to find the best embryo transfer strategy 14 , define risk factors for RIF according to ESHRE 15 , identify risk factors for first-trimester pregnancy loss in frozen-thawed, high-quality embryo transfer cycles 16 , and examine the endometrial immune profile 17 . 
In clinical practice, immune problems are generally diagnosed from abnormal laboratory values; however, some patients with apparently normal values may still have underlying immune dysregulation. By analyzing multiple immune, biochemical, and clinical parameters simultaneously, a deep learning model may detect complex patterns of immune dysfunction that are not visible through conventional assessment, potentially guiding more effective immunotherapy. Therefore, the aim of this study was to construct a deep learning model to help identify RIF patients with an immune problem who may benefit from appropriate immune regulatory therapies.

Supplementary Material

Below is the link to the electronic supplementary material. Supplementary Material 1
