Machine learning-based diagnostic prediction of IgA nephropathy: model development and validation study

doi:10.21203/rs.3.rs-4203860/v1

2024 · doi:10.21203/rs.3.rs-4203860/v1

preprint OA: closed

Ryunosuke Noda, Daisuke Ichikawa, Yugo Shibagaki

This is a preprint; it has not been peer reviewed by a journal.

This work is licensed under a CC BY 4.0 License.

Status: Under Review. Version 1 posted 10

Abstract

IgA nephropathy progresses to kidney failure, making early detection important. However, definitive diagnosis depends on invasive kidney biopsy. This study aimed to develop non-invasive prediction models for IgA nephropathy using machine learning. We collected retrospective data on demographic characteristics, blood tests, and urine tests of the patients who underwent kidney biopsy. The dataset was divided into derivation and validation cohorts, with temporal validation.
We employed four machine learning models, namely eXtreme Gradient Boosting (XGBoost), LightGBM, Random Forest, and Artificial Neural Networks, as well as logistic regression. We evaluated performance via the area under the receiver operating characteristic curve (AUROC) and explored variable importance with the SHapley Additive exPlanations method. The study included 1268 participants, with 353 (28%) diagnosed with IgA nephropathy. In the derivation cohort, LightGBM achieved the highest AUROC of 0.913 (95% CI 0.906–0.919), significantly higher than logistic regression and the Artificial Neural Network but not significantly different from XGBoost and Random Forest. In the validation cohort, XGBoost demonstrated the highest AUROC of 0.894 (95% CI 0.850–0.935), maintaining its robust performance from the derivation phase. Key predictors identified were age, serum albumin, serum IgA/C3 ratio, and urine red blood cells, aligning with existing clinical insights. Machine learning can be a valuable non-invasive tool for the diagnostic prediction of IgA nephropathy.

Keywords: IgA nephropathy, kidney biopsy, artificial intelligence, machine learning, glomerulonephritis

INTRODUCTION

IgA nephropathy (IgAN) is the most common primary glomerulonephritis worldwide, leading to end-stage kidney failure in 30–40% of patients within two decades of diagnosis [1]. For favorable outcomes in IgAN patients, early detection and timely treatment are essential. IgAN presents variable clinical courses, characterized by various degrees of hematuria and/or proteinuria, complicating diagnosis with general laboratory tests [2,3].
Definitive diagnosis requires kidney biopsy, which has several contraindications and entails risks such as significant bleeding [4]. In severe cases, bleeding may require interventions such as transfusion, arterial embolization, or surgery, which represents a crucial clinical challenge [5].

The potential for predicting the diagnosis of IgAN before or without kidney biopsy has been a topic of discussion [6]. Specifically, the study of non-invasive diagnostic approaches for IgAN through blood and urine biomarkers has gained attention. Biomarkers such as microscopic hematuria, persistent proteinuria, serum IgA levels, and the serum IgA/C3 ratio have been identified as effective for distinguishing IgAN from other kidney diseases [7,8]. Although these variables are measurable in routine clinical settings, their diagnostic capability is limited, serving primarily to aid differential diagnosis. Recent studies have emphasized the importance of galactose-deficient IgA1 (Gd-IgA1), Gd-IgA1-specific IgG, and Gd-IgA1-containing immune complexes in IgAN pathogenesis [9,10]. Elevated serum levels of these markers have been observed in IgAN, suggesting their potential as specific biomarkers [11]. However, their practical application is limited, as their measurement requires advanced equipment not available in general medical facilities.

Machine learning, a subset of artificial intelligence, is instrumental in analyzing extensive clinical data from electronic health records, facilitating the development of predictive models [12,13]. In nephrology, it is expected to advance the prediction of acute kidney injury onset [14], the prognosis of chronic kidney disease [15], and the onset of intradialytic hypotension [16], and to assist kidney pathological diagnosis [17]. IgAN diagnostic prediction studies have predominantly employed logistic regression [8,18,19], a conventional statistical model that assumes linear relationships, which limits predictive performance.
Advanced machine learning algorithms, capable of modeling non-linear relationships and complex interactions, could improve predictive performance [20]. However, the efficacy of machine learning in predicting IgAN diagnosis remains unexplored.

This study aims to develop and validate diagnostic prediction models for IgAN using machine learning, based on patient demographics, blood tests, and urine tests, which can be easily obtained in clinical practice. A further goal is to show that machine learning models can offer a non-invasive, highly accurate, and reliable diagnostic approach for IgAN compared with conventional clinical parameters or conventional statistical models.

METHODS

Study design and study participants

This study is a retrospective cohort study involving patients at St. Marianna University Hospital. It included all adult patients who underwent native kidney biopsy from January 1, 2006, to September 30, 2022. Patients with inconclusive diagnoses and those with multiple primary diagnoses were excluded. The data for the cohort were collected from electronic health records. Patients who underwent kidney biopsy between January 1, 2006, and December 31, 2019, were included in the derivation cohort, and those who underwent biopsy between January 1, 2020, and September 30, 2022, were included in the validation cohort. Details of patient selection for the derivation and validation cohorts are shown in Fig. 1.

Ethics approval

This study was conducted according to “The Declaration of Helsinki”, “the Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis Statement” [21], and “Guidelines for Developing and Reporting Machine Learning Predictive Models in Biomedical Research: A Multidisciplinary View” [22]. The study protocol was approved by the institutional review board of St. Marianna University Hospital (approval number 6025). As the study was retrospective and involved minimal risk, the requirement for informed consent was waived.
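The temporal split described under Study design can be sketched as a simple date rule. This is a minimal illustration, not the authors' code: the cut-off dates come from the text above, while the function name `assign_cohort` and the example records are hypothetical.

```python
from datetime import date

# Cut-off dates taken from the Study design paragraph.
DERIVATION_START = date(2006, 1, 1)
DERIVATION_END = date(2019, 12, 31)
VALIDATION_END = date(2022, 9, 30)


def assign_cohort(biopsy_date: date) -> str:
    """Return the cohort label for one biopsy date (temporal validation)."""
    if DERIVATION_START <= biopsy_date <= DERIVATION_END:
        return "derivation"
    if DERIVATION_END < biopsy_date <= VALIDATION_END:
        return "validation"
    return "excluded"  # outside the study window


# Hypothetical records: (patient id, biopsy date).
records = [
    ("p1", date(2010, 5, 3)),
    ("p2", date(2019, 12, 31)),
    ("p3", date(2020, 1, 1)),
    ("p4", date(2023, 2, 1)),
]
cohorts = {pid: assign_cohort(d) for pid, d in records}
```

Splitting by calendar time rather than at random means the validation cohort contains only patients biopsied after every derivation patient, which mimics how a deployed model would meet future cases.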

Predictor variables

We utilized information that is routinely measured in clinical practice as potential predictor variables. Baseline data of patients before the native kidney biopsy were retrospectively collected from electronic health records. These included demographic characteristics, blood tests, and urine tests. Demographic characteristics included age, sex, height, weight, and body mass index. Blood test items comprised white blood cells, hemoglobin, total protein, albumin, blood urea nitrogen, creatinine, uric acid, aspartate aminotransferase, alanine aminotransferase, alkaline phosphatase, lactate dehydrogenase (LDH), creatine kinase (CK), total cholesterol, glucose, hemoglobin A1c, C-reactive protein, immunoglobulin G (IgG), immunoglobulin A (IgA), immunoglobulin M (IgM), complement C3, complement C4, IgA/complement C3 ratio (IgA/C3), and antinuclear antibodies. Urine test items included urine protein/creatinine ratio (UPCR) and urine red blood cells (Urine RBC). Urine RBC was scored as follows: 0 for < 1 per high power field (HPF), 2.5 for 1–4/HPF, 7.5 for 5–9/HPF, 20 for 10–29/HPF, 40 for 30–49/HPF, 75 for 50–99/HPF, and 100 for ≥ 100/HPF.

Outcome measures

The outcome of this study is the diagnosis of IgA nephropathy. The definitive diagnoses made through kidney biopsy by nephrologists were collected. IgA nephropathy was assigned the positive label (1) and all other diagnoses the negative label (0).

Data preprocessing

The number and proportion of missing values for each variable are shown in Supplementary Table S1. Variables with more than 20% missing values were not included in the analysis. To avoid potential bias arising from excluding patients with missing data, imputation was adopted. The k-nearest neighbor imputation algorithm was employed to fill in missing values for continuous variables. The variables were standardized to have a mean of 0 and a standard deviation of 1.
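The preprocessing steps above (dropping sparse variables, k-nearest neighbor imputation, standardization) can be sketched with scikit-learn. This is an illustration under stated assumptions rather than the authors' pipeline: the paper does not report the number of neighbors (5, the scikit-learn default, is assumed here), the order of imputation and scaling follows the order of the text, fitting on the derivation data only is an assumption, and the data are synthetic.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler


def drop_sparse_columns(X, max_missing=0.20):
    """Keep only columns whose missing fraction is at most max_missing."""
    missing_frac = np.isnan(X).mean(axis=0)
    keep = missing_frac <= max_missing
    return X[:, keep], keep


# Synthetic stand-in for the derivation and validation feature matrices.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(50, 4))
X_val = rng.normal(size=(20, 4))
X_train[::10, 0] = np.nan    # 10% missing in column 0
X_train[1::10, 1] = np.nan   # 10% missing in column 1
X_train[:30, 3] = np.nan     # 60% missing in column 3 (will be dropped)
X_val[::5, 2] = np.nan       # some missing values at validation time

# 1) Drop variables with more than 20% missing values.
X_train_kept, keep = drop_sparse_columns(X_train)
X_val_kept = X_val[:, keep]

# 2) k-nearest neighbor imputation (n_neighbors=5 is an assumption).
imputer = KNNImputer(n_neighbors=5)
X_train_imp = imputer.fit_transform(X_train_kept)
X_val_imp = imputer.transform(X_val_kept)

# 3) Standardize to mean 0 and standard deviation 1.
scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train_imp)
X_val_std = scaler.transform(X_val_imp)
```

Fitting the imputer and scaler on the derivation cohort and only applying them to the validation cohort keeps validation data from influencing the preprocessing parameters; whether the study did this is not stated.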

Variable selection

Variable reduction was performed to prevent overfitting of the machine learning models and to reduce computational costs. Predictor variables with nearly zero variance, i.e., variables whose proportion of unique values was less than 5%, were excluded from the analysis. Four variable selection methods were applied to identify subsets of predictor variables: Least Absolute Shrinkage and Selection Operator, Random Forest-Recursive Feature Elimination, Random Forest-Filtering, and SelectFromModel with Extra Trees. The final predictor variables for model development were determined by integrating the results from the four methods, choosing variables that were selected by three or more of the methods.

Model development and evaluation

For model development, four machine learning algorithms (eXtreme Gradient Boosting (XGBoost), LightGBM, Random Forest, and Artificial Neural Network) and logistic regression were applied to the data of the derivation cohort. XGBoost, LightGBM, and Random Forest are tree-based algorithms that combine decision trees with ensemble learning. XGBoost and LightGBM enhance predictive performance by sequentially building decision trees, with each tree correcting the errors of the previous trees through boosting [23,24]. Random Forest mitigates overfitting by independently training multiple decision trees and integrating their predictions through bagging [25]. Artificial Neural Networks consist of an input layer, hidden layers, and an output layer, and can handle complex relationships between inputs and outputs using non-linear activation functions [26]. Logistic regression is a statistical linear model widely used for binary classification, producing probabilistic outputs and classifying cases as positive or negative based on a specific threshold [27].
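The voting rule from the Variable selection paragraph, which keeps a variable when at least three of the four selection methods choose it, can be sketched as follows. The per-method selections shown are hypothetical placeholders, not the study's actual results (those are in Supplementary Table S5), and the helper name `vote_features` is illustrative.

```python
from collections import Counter


def vote_features(selections, min_votes=3):
    """Return variables chosen by at least min_votes selection methods."""
    counts = Counter(
        feature
        for chosen in selections.values()
        for feature in set(chosen)  # each method votes once per variable
    )
    return sorted(f for f, n in counts.items() if n >= min_votes)


# Hypothetical output of the four selection methods.
selections = {
    "lasso": ["age", "albumin", "iga_c3", "urine_rbc"],
    "rf_rfe": ["age", "albumin", "iga_c3", "upcr"],
    "rf_filter": ["age", "iga_c3", "urine_rbc", "upcr"],
    "extra_trees": ["age", "albumin", "urine_rbc", "glucose"],
}
final_features = vote_features(selections, min_votes=3)
```

In this toy input, `upcr` receives two votes and `glucose` one, so both fall below the threshold and are left out.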
To identify the optimal hyperparameters for each model, training and validation were conducted in the derivation cohort using 5-fold cross-validation repeated five times. Bayesian optimization was used for hyperparameter tuning. The hyperparameters tuned for each model are shown in Supplementary Table S2. The performance of the final prediction models was evaluated in both the derivation and validation cohorts. The model performance was assessed using the Area Under the Receiver Operating Characteristic curve (AUROC) and the Area Under the Precision-Recall Curve (AUPRC). AUROC and AUPRC were selected as they reflect performance across all classification thresholds and are less affected by class imbalance. We also included precision, recall, and F1 score (the harmonic mean of precision and recall) as common evaluation metrics for binary classification. The 95% confidence intervals (95% CI) for each metric were generated through 1000 bootstrap iterations with unique random seeds. The model calibration was evaluated using calibration plots and the Brier score. Calibration plots compare the actual positive fraction to the average predicted probability across quintiles of predicted probability. The Brier score, reflecting the mean squared difference between predicted probabilities and actual outcomes, serves as a dual measure of predictive performance and calibration.

Model interpretations

The SHapley Additive exPlanations (SHAP) method was used to explore the interpretability of the models with high diagnostic performance. SHAP provides a unified approach for the interpretation of model predictions, offering consistent and locally accurate attribution values, i.e., the SHAP values, for each variable within the predictive model [28]. The role of each variable in predicting IgA nephropathy can be explained as its contribution to the overall risk output for each case.
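The bootstrap confidence interval and Brier score described under Model development and evaluation can be sketched as below. This is a hedged illustration: the paper states 1000 bootstrap iterations but not how the interval was formed, so the percentile method is an assumption, as are the function name `bootstrap_auroc_ci` and the synthetic labels and probabilities.

```python
import numpy as np
from sklearn.metrics import brier_score_loss, roc_auc_score


def bootstrap_auroc_ci(y_true, y_prob, n_boot=1000, alpha=0.05, seed=0):
    """Mean AUROC and percentile CI from n_boot resamples with replacement."""
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)
    rng = np.random.default_rng(seed)
    n = len(y_true)
    scores = []
    while len(scores) < n_boot:
        idx = rng.integers(0, n, size=n)
        if y_true[idx].min() == y_true[idx].max():
            continue  # AUROC is undefined when only one class is drawn
        scores.append(roc_auc_score(y_true[idx], y_prob[idx]))
    lower, upper = np.percentile(scores, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return float(np.mean(scores)), float(lower), float(upper)


# Synthetic labels (about 28% positive) and predicted probabilities.
rng = np.random.default_rng(42)
y_true = (rng.random(300) < 0.28).astype(int)
y_prob = np.clip(0.2 + 0.5 * y_true + rng.normal(0, 0.2, size=300), 0, 1)

mean_auc, ci_low, ci_high = bootstrap_auroc_ci(y_true, y_prob)

# Brier score: mean squared difference between probability and outcome.
brier = brier_score_loss(y_true, y_prob)
brier_manual = float(np.mean((y_prob - y_true) ** 2))
```

A lower Brier score is better; a model that always predicts 0.5 scores exactly 0.25, which gives a rough reference point for the 0.107 to 0.137 range reported in the Results.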

Statistical analysis

Continuous variables were described using mean and standard deviation for normally distributed data, and median values along with interquartile ranges for non-normally distributed data. Categorical variables were presented as counts and percentages. For statistical comparisons, the Student's t-test was applied to normally distributed continuous variables, the Mann-Whitney U test to non-normally distributed continuous variables, and the chi-square test to categorical variables. Two-tailed p-values less than 0.05 were considered statistically significant. For variable selection, we employed the sklearn library in Python (version 3.10.12). Model development utilized the sklearn, xgboost, and lightgbm libraries, and evaluation was conducted using the sklearn, optuna, and shap libraries. R (version 4.2.2) was used for statistical analyses.

RESULTS

Patient characteristics

After excluding participants with multiple primary diagnoses or without definitive diagnosis, 1,268 participants were enrolled. Of these, 1,027 were included in the derivation cohort and 241 in the validation cohort. The baseline characteristics and outcomes for the derivation and validation cohorts are presented in Table 1. In the derivation cohort, 294 (28.6%) were diagnosed with IgA nephropathy, compared to 59 (24.5%) in the validation cohort. The baseline characteristics of the IgA nephropathy and non-IgA nephropathy groups for each cohort are detailed in Supplementary Tables S3 and S4.

Table 1 Baseline characteristics and outcomes of the patients in the derivation and validation cohorts.

| Variables | Derivation cohort (n = 1027) | Validation cohort (n = 241) | p-value |
| --- | --- | --- | --- |
| Demographic characteristics | | | |
| Age (years) | 47 [32, 65] | 59 [44, 72] | < 0.001 |
| Sex (male) | 483 (47.0) | 131 (54.4) | 0.045 |
| Height (cm) | 161.6 [155.0, 169.0] | 162.7 [155.8, 169.6] | 0.301 |
| Body Weight (kg) | 58.1 [50.5, 67.4] | 59.0 [50.5, 69.3] | 0.361 |
| Body Mass Index (kg/m²) | 22.2 [19.9, 25.2] | 22.3 [20.1, 25.4] | 0.298 |
| Blood tests | | | |
| White blood cells (/µL) | 6500 [5200, 8300] | 6100 [4500, 8100] | 0.032 |
| Hemoglobin (g/dL) | 12.4 [10.6, 14.0] | 12.1 [10.3, 13.5] | 0.035 |
| Total protein (g/dL) | 6.5 [5.7, 7.1] | 6.4 [5.7, 6.9] | 0.057 |
| Albumin (g/dL) | 3.6 [2.7, 4.1] | 3.3 [2.3, 3.8] | < 0.001 |
| BUN (mg/dL) | 16.5 [12.3, 23.2] | 18.00 [12.8, 28.2] | 0.019 |
| Creatinine (mg/dL) | 0.90 [0.69, 1.35] | 1.11 [0.76, 1.67] | < 0.001 |
| Uric acid (mg/dL) | 6.1 [4.9, 7.3] | 6.2 [5.0, 7.5] | 0.341 |
| AST (U/L) | 20 [16, 26] | 20 [17, 26] | 0.412 |
| ALT (U/L) | 16 [11, 24] | 16 [12, 24] | 0.672 |
| ALP (U/L) | 203 [162, 259] | 83 [60, 161] | < 0.001 |
| LDH (U/L) | 192 [164, 237] | 204 [166, 241] | 0.127 |
| CK (U/L) | 73 [44, 123] | 73 [42, 111] | 0.460 |
| Total cholesterol (mg/dL) | 200 [166, 243] | 195 [165, 242] | 0.292 |
| Glucose (mg/dL) | 98 [91, 111] | 99 [92, 109] | 0.407 |
| HbA1c (%) | 5.3 [4.9, 5.7] | 5.6 [5.2, 5.9] | < 0.001 |
| C-Reactive Protein (mg/dL) | 0.08 [0.03, 0.52] | 0.14 [0.04, 0.98] | 0.002 |
| IgG (mg/dL) | 1187 [861, 1572] | 1184 [817, 1547] | 0.777 |
| IgA (mg/dL) | 282 [206, 383] | 291 [211, 366] | 0.900 |
| IgM (mg/dL) | 96 [65, 143] | 82 [51, 115] | < 0.001 |
| Complement C3 (mg/dL) | 100 [82, 123] | 106 [90, 132] | 0.003 |
| Complement C4 (mg/dL) | 25 [18, 33] | 27 [20, 35] | 0.012 |
| IgA/C3 | 2.83 [1.97, 4.31] | 2.64 [1.85, 3.76] | 0.19 |
| ANA (titer) | 40 [40, 80] | 40 [40, 40] | 0.011 |
| Urine tests | | | |
| Urine RBC (/HPF) | | | 0.048 |
| < 1 | 173 (16.8) | 46 (19.1) | |
| 1–4 | 191 (18.6) | 50 (20.7) | |
| 5–9 | 148 (14.4) | 30 (12.4) | |
| 10–29 | 219 (21.3) | 49 (20.3) | |
| 30–49 | 71 (6.9) | 29 (12.0) | |
| 50–99 | 71 (6.9) | 14 (5.8) | |
| ≥ 100 | 154 (15.0) | 23 (9.5) | |
| UPCR (g/gCre) | 1.19 [0.47, 3.74] | 1.10 [0.42, 2.98] | 0.143 |
| Outcome | | | |
| IgA nephropathy | 294 (28.6) | 59 (24.5) | 0.203 |

BUN: Blood Urea Nitrogen, AST: Aspartate Aminotransferase, ALT: Alanine Aminotransferase, ALP: Alkaline Phosphatase, LDH: Lactate Dehydrogenase, CK: Creatine Kinase, HbA1c: Hemoglobin A1c, IgG: Immunoglobulin G, IgA: Immunoglobulin A, IgM: Immunoglobulin M, IgA/C3: Immunoglobulin A / Complement C3 ratio, ANA: Antinuclear antibodies, Urine RBC: Urine red blood cells, UPCR: Urine protein to creatinine ratio.

Predictor variables

A total of 14 variables were selected as predictors through the four variable selection methods and included in the machine learning models: age, hemoglobin, total protein, albumin, LDH, CK, C-reactive protein, IgG, IgA, complement C3, complement C4, IgA/C3, Urine RBC, and UPCR (Supplementary Table S5).

Model performance

The AUROC, AUPRC, precision, recall, and F1 score for each machine learning model in the derivation and validation cohorts are shown in Table 2. In the derivation cohort, LightGBM achieved the highest AUROC at 0.913 (95% CI 0.906–0.919), significantly higher than logistic regression and the Artificial Neural Network but not significantly different from XGBoost and Random Forest (Fig. 2). In the validation cohort, XGBoost had the highest AUROC at 0.894 (95% CI 0.850–0.935), though no significant differences were observed between any of the models. In the derivation cohort, the AUPRC for XGBoost was 0.779 (95% CI 0.771–0.794), significantly higher than logistic regression and the Artificial Neural Network, with no significant difference from LightGBM and Random Forest (Fig. 3). In the validation cohort, XGBoost also scored the highest AUPRC at 0.748 (95% CI 0.630–0.846), again with no significant differences between any of the models. The calibration plot demonstrated good calibration for all models, with the Brier score ranging from 0.107 to 0.137 (Supplementary Fig. S1).

Table 2 Performance of the machine learning models in the derivation cohort and validation cohort.

| Model | AUROC | AUPRC | Precision | Recall | F1 score |
| --- | --- | --- | --- | --- | --- |
| Derivation cohort | | | | | |
| XGBoost | 0.910 (0.903–0.917) | 0.779 (0.771–0.794) | 0.731 (0.716–0.744) | 0.798 (0.785–0.811) | 0.676 (0.656–0.694) |
| LightGBM | 0.913 (0.906–0.919) | 0.778 (0.770–0.795) | 0.735 (0.719–0.751) | 0.794 (0.776–0.812) | 0.687 (0.665–0.709) |
| Random Forest | 0.910 (0.904–0.916) | 0.757 (0.749–0.776) | 0.702 (0.687–0.718) | 0.816 (0.799–0.833) | 0.619 (0.597–0.640) |
| Artificial Neural Network | 0.893 (0.886–0.903) | 0.736 (0.718–0.755) | 0.703 (0.688–0.718) | 0.714 (0.697–0.730) | 0.697 (0.673–0.721) |
| Logistic Regression | 0.865 (0.854–0.874) | 0.683 (0.672–0.707) | 0.613 (0.592–0.634) | 0.691 (0.670–0.713) | 0.552 (0.528–0.576) |
| Validation cohort | | | | | |
| XGBoost | 0.894 (0.850–0.935) | 0.748 (0.630–0.846) | 0.735 (0.603–0.861) | 0.562 (0.434–0.694) | 0.634 (0.529–0.738) |
| LightGBM | 0.890 (0.839–0.935) | 0.740 (0.617–0.843) | 0.769 (0.633–0.886) | 0.507 (0.377–0.635) | 0.609 (0.488–0.716) |
| Random Forest | 0.893 (0.850–0.933) | 0.710 (0.578–0.827) | 0.761 (0.625–0.886) | 0.608 (0.475–0.730) | 0.674 (0.553–0.766) |
| Artificial Neural Network | 0.868 (0.821–0.907) | 0.597 (0.468–0.724) | 0.627 (0.500–0.755) | 0.561 (0.435–0.695) | 0.590 (0.486–0.696) |
| Logistic Regression | 0.861 (0.813–0.904) | 0.631 (0.501–0.748) | 0.639 (0.472–0.806) | 0.359 (0.246–0.478) | 0.457 (0.337–0.581) |

Mean (95% confidence interval) is listed in the columns. AUROC: area under the receiver-operating characteristic curve, AUPRC: area under the precision-recall curve.

Model Interpretations

SHAP values were calculated for the high-performing XGBoost, LightGBM, and Random Forest models to interpret these models. The SHAP bar plots showed the variables that most influenced the models' predictions (Supplementary Fig. S2). Age, albumin, IgA/C3, and Urine RBC were consistently among the top five predictor variables across all three models. Figure 4 shows the SHAP beeswarm plots, which revealed a negative correlation between age and the prediction of IgAN, while positive correlations were observed with albumin, IgA/C3, and Urine RBC.
The SHAP dependence plots indicated various complex relationships between variables and the prediction of IgAN, showing similar patterns across the models (Supplementary Fig. S3–S5).

DISCUSSION

In this study, we developed and validated machine learning-based predictive models for diagnosing IgA nephropathy. To the best of our knowledge, this is the first study to compare and evaluate the performance of multiple machine learning models in diagnosing IgA nephropathy. We evaluated several algorithms, finding that models employing XGBoost, LightGBM, and Random Forest were effective. We confirmed key predictors such as age, serum albumin, serum IgA/C3 ratio, and urine red blood cells, in line with previous findings. Additionally, our results suggested that the relationships between predictive factors and IgAN predictions could extend beyond simple linearity, hinting at the importance of analyzing diverse patterns for accurate diagnosis of IgAN. These findings suggest that machine learning has potential in the non-invasive and reliable diagnosis of IgA nephropathy.

Several machine learning models were evaluated, and XGBoost, LightGBM, and Random Forest exhibited consistently superior predictive performance in the derivation and validation cohorts compared with the conventional logistic regression model, which has traditionally been relied on for the prediction of IgAN [8,18,19]. While logistic regression is known for its high transparency and interpretability, it assumes linear relationships between predictor variables and target variables, which can limit its performance [20,27]. An IgAN prediction study involving 155 patients undergoing kidney biopsy showed the utility of machine learning models, with Bayesian Networks achieving an AUROC of 0.83 and logistic regression an AUROC of 0.75 [29].
Another study with 519 IgAN patients and 211 non-IgAN patients indicated the potential of Artificial Neural Networks, with an AUROC of 0.839 for logistic regression and 0.881 for Artificial Neural Networks [30]. However, these studies were limited to comparisons between two models without statistical analysis using 95% confidence intervals, making it difficult to generalize the results. We evaluated five models (four machine learning models and a conventional logistic regression model), with LightGBM performing best in the derivation phase, significantly better than the Artificial Neural Network and logistic regression but not significantly different from XGBoost and Random Forest. Previous studies evaluated various tabular data sets and showed that machine learning methods frequently outperformed logistic regression [31,32]. Recent studies have shown that tree-based machine learning models like XGBoost, LightGBM, and Random Forest outperform Artificial Neural Networks in general tabular data prediction [33,34]. The superior performance of XGBoost, LightGBM, and Random Forest in predicting IgA nephropathy is consistent with these previous findings, underscoring the potential value of tree-based machine learning models for non-invasive diagnosis of IgAN. However, no significant differences in model performance were observed in the validation phase, indicating the necessity for further verification of model generalizability.

We clarified the "black box" of XGBoost, LightGBM, and Random Forest through the SHAP method, identifying age, albumin, IgA/C3, and Urine RBC as important predictor variables. This method is a widely used explanatory technique for interpreting the contribution of predictor variables to model outputs [28,35,36]. Previous studies have reported that the presence of microscopic hematuria and/or persistent proteinuria, IgA, and IgA/C3 are useful for distinguishing IgA nephropathy from other kidney diseases [7,8].
Other research using multivariate logistic regression suggested that age, IgA/C3, albumin, IgA, IgG, eGFR, and the presence of hematuria are independent predictive variables for IgAN [30]. The key predictor variables identified in our study are in line with these previous studies. We additionally visualized the relationships between each variable and the predictions of these models through the SHAP dependence plots, which revealed the possibility of various complex relationships. The finding that all three models showed similar results also suggests the importance and robustness of age, albumin, IgA/C3, and Urine RBC as predictor variables. These insights may improve understanding of how these variables relate to IgAN.

Our findings have clinically significant implications. First, simple, accurate, and non-invasive predictive models for IgAN can be developed using similar methods, with potential for clinical application. Second, our models employ variables that are routinely collected in clinical settings, meaning that their adoption requires no additional tests or financial costs beyond standard clinical care procedures. Third, identifying key variables and visualizing their relationships with IgAN predictions could provide new perspectives for distinguishing IgAN in clinical settings.

This study has several limitations. First, it relied on data from a single center and lacked external validation across institutions. Assessing our models' external validity in diverse patient groups remains essential. Second, the limited sample size, particularly in the validation phase, might lead to inadequate statistical power, requiring careful interpretation of each model's performance. Third, our study cohort included all patients undergoing kidney biopsy rather than focusing on those with specific clinical manifestations of IgAN, such as chronic glomerulonephritis.
This wide scope requires careful consideration before implementing our predictive models in clinical settings. Given these limitations, future studies should aim for broader validation and verification in multiple institutions to assess the generalizability and potential clinical utility of the models.

In conclusion, this study demonstrated the utility of machine learning models using common clinical data in the diagnostic prediction of IgA nephropathy. The machine learning models (XGBoost, LightGBM, and Random Forest) showed higher diagnostic performance than a conventional statistical model and were able to handle complex relationships between predictors and predictions. These models can be helpful as non-invasive and reliable methods to predict IgAN.

Declarations

ACKNOWLEDGEMENTS

We extend our heartfelt gratitude to Ms. Yoshiko Ono and Ms. Mami Ohori for their significant contributions to the collection of patient data. Their dedication and efforts were instrumental in the advancement of our research. We wish to express our sincere gratitude to the Tateishi Science and Technology Foundation and the Nishikawa Medical Foundation for their generous support of our research. Their financial contributions were instrumental in enabling us to pursue this project and have made a significant impact on our ability to advance in our field.

AUTHOR CONTRIBUTIONS

R.N. designed the research plan and analyzed the data. R.N., D.I., and Y.S. participated in the writing of the paper. R.N., D.I., and Y.S. participated in the approval of the final manuscript.

DATA AVAILABILITY

The dataset cannot be disclosed as approval has not been received from the Ethics Committee of St Marianna University Hospital. The code underlying this article will be shared on reasonable request to the corresponding author.

CONFLICT OF INTEREST STATEMENT

R.N. was financially supported by the Tateishi Science and Technology Foundation (Grant ID: 2237009) and the Nishikawa Medical Foundation (Grant ID: 202201).

ETHICS APPROVAL AND CONSENT TO PARTICIPATE

The study was performed in accordance with the Declaration of Helsinki and Ethical Guidelines for Medical and Health Research Involving Human Subjects. The study was approved by the St. Marianna University Hospital Institutional Review Board (approval number: 6025), which allowed for analysis of patient-level data with a waiver of informed consent.

References

1. Chauveau, D. & Droz, D. Follow-up evaluation of the first patients with IgA nephropathy described at Necker Hospital. Contrib Nephrol 104, 1–5 (1993).
2. Rovin, B. H. et al. Executive summary of the KDIGO 2021 Guideline for the Management of Glomerular Diseases. Kidney Int 100, 753–779 (2021).
3. Rodrigues, J. C., Haas, M. & Reich, H. N. IgA Nephropathy. Clin J Am Soc Nephrol 12, 677–686 (2017).
4. Eiro, M., Katoh, T. & Watanabe, T. Risk factors for bleeding complications in percutaneous renal biopsy. Clin Exp Nephrol 9, 40–45 (2005).
5. Poggio, E. D. et al. Systematic Review and Meta-Analysis of Native Kidney Biopsy Complications. Clin J Am Soc Nephrol 15, 1595 (2020).
6. Tomino, Y. et al. Measurement of serum IgA and C3 may predict the diagnosis of patients with IgA nephropathy prior to renal biopsy. J Clin Lab Anal 14, 220–223 (2000).
7. Maeda, A. et al. Significance of serum IgA levels and serum IgA/C3 ratio in diagnostic analysis of patients with IgA nephropathy. J Clin Lab Anal 17, 73–76 (2003).
8. Nakayama, K. et al. Prediction of diagnosis of immunoglobulin a nephropathy prior to renal biopsy and correlation with urinary sediment findings and prognostic grading. J Clin Lab Anal 22, 114–118 (2008).
9. Kiryluk, K. et al. Aberrant Glycosylation of IgA1 is Inherited in Pediatric IgA Nephropathy and Henoch-Schönlein Purpura Nephritis. Kidney Int 80, 79–87 (2011).
10. Magistroni, R., D’Agati, V. D., Appel, G. B. & Kiryluk, K. New developments in the genetics, pathogenesis, and therapy of IgA nephropathy. Kidney Int 88, 974–989 (2015).
11. Yanagawa, H. et al. A Panel of Serum Biomarkers Differentiates IgA Nephropathy from Other Renal Diseases. PLoS ONE 9, e98081 (2014).
12. Wong, J., Horwitz, M. M., Zhou, L. & Toh, S. Using machine learning to identify health outcomes from electronic health record data. Curr Epidemiol Rep 5, 331–342 (2018).
13. Hobensack, M., Song, J., Scharp, D., Bowles, K. H. & Topaz, M. Machine learning applied to electronic health record data in home healthcare: A scoping review. Int J Med Inform 170, 104978 (2023).
14. Tomašev, N. et al. A clinically applicable approach to continuous prediction of future acute kidney injury. Nature 572, 116–119 (2019).
15. Kanda, E., Epureanu, B. I., Adachi, T. & Kashihara, N. Machine-learning-based Web system for the prediction of chronic kidney disease progression and mortality. PLOS Digit Health 2, e0000188 (2023).
16. Lee, H. et al. Deep Learning Model for Real-Time Prediction of Intradialytic Hypotension. Clin J Am Soc Nephrol 16, 396 (2021).
17. Jayapandian, C. P. et al. Development and evaluation of deep learning–based segmentation of histologic structures in the kidney cortex with multiple histologic stains. Kidney Int 99, 86–101 (2021).
18. Gao, J. et al. A novel differential diagnostic model based on multiple biological parameters for immunoglobulin A nephropathy. BMC Med Inform Decis Mak 12, 58 (2012).
19. Han, Q.-X. et al. A non-invasive diagnostic model of immunoglobulin A nephropathy and serological markers for evaluating disease severity. Chin Med J 132, 647 (2019).
20. Goldstein, B. A., Navar, A. M. & Carter, R. E. Moving beyond regression techniques in cardiovascular risk prediction: applying machine learning to address analytic challenges. Eur Heart J 38, 1805–1814 (2017).
21. Collins, G. S., Reitsma, J. B., Altman, D. G. & Moons, K. G. M. Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD): The TRIPOD Statement. Ann Intern Med 162, 55–63 (2015).
22. Luo, W. et al. Guidelines for Developing and Reporting Machine Learning Predictive Models in Biomedical Research: A Multidisciplinary View. J Med Internet Res 18, e323 (2016).
23. Chen, T. & Guestrin, C. XGBoost: A Scalable Tree Boosting System. in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 785–794 (ACM, San Francisco California USA, 2016).
24. Ke, G. et al. LightGBM: A Highly Efficient Gradient Boosting Decision Tree. in Advances in Neural Information Processing Systems vol. 30 (Curran Associates, Inc., 2017).
25. Breiman, L. Random Forests. Mach Learn 45, 5–32 (2001).
26. Jain, A. K., Mao, J. & Mohiuddin, K. M. Artificial neural networks: a tutorial. Computer 29, 31–44 (1996).
27. Cox, D. R. The Regression Analysis of Binary Sequences. J R Stat Soc Ser 20, 215–242 (1958).
28. Lundberg, S. M. & Lee, S.-I. A unified approach to interpreting model predictions. in Proceedings of the 31st International Conference on Neural Information Processing Systems 4768–4777 (Curran Associates Inc., Red Hook, NY, USA, 2017).
29. Ducher, M. et al. Comparison of a Bayesian Network with a Logistic Regression Model to Forecast IgA Nephropathy. BioMed Res Int 2013, 1–6 (2013).
30. Hou, J., Fu, S., Wang, X., Liu, J. & Xu, Z. A noninvasive artificial neural network model to predict IgA nephropathy risk in Chinese population. Sci Rep 12, 8296 (2022).
31. Caruana, R. & Niculescu-Mizil, A. An empirical comparison of supervised learning algorithms. in Proceedings of the 23rd international conference on Machine learning - ICML ’06 161–168 (ACM Press, Pittsburgh, Pennsylvania, 2006).
32. Fernández-Delgado, M., Cernadas, E., Barro, S. & Amorim, D. Do we Need Hundreds of Classifiers to Solve Real World Classification Problems? J Mach Learn Res 15, 3133–3181 (2014).
33. Borisov, V. et al. Deep Neural Networks and Tabular Data: A Survey. IEEE Trans. Neural Netw. Learning Syst. 1–21 (2022).
34. Grinsztajn, L., Oyallon, E. & Varoquaux, G. Why do tree-based models still outperform deep learning on typical tabular data? Preprint at https://doi.org/10.48550/arXiv.2207.08815 (2022).
35. Lv, Z., Cui, F., Zou, Q., Zhang, L. & Xu, L. Anticancer peptides prediction with deep representation learning features. Brief Bioinform 22, bbab008 (2021).
36. Thorsen-Meyer, H.-C. et al. Dynamic and explainable machine learning prediction of mortality in patients in the intensive care unit: a retrospective study of high-frequency data in electronic patient records. Lancet Digit Health 2, e179–e191 (2020).

Supplementary Files

Supplementarymaterial20240402.pdf

Status: Under Review

Editorial decision: Revision requested, 02 May, 2024
Reviews received at journal: 21 Apr, 2024
Reviewers agreed at journal: 13 Apr, 2024
Reviews received at journal: 12 Apr, 2024
Reviewers agreed at journal: 12 Apr, 2024
Reviewers invited by journal: 11 Apr, 2024
Editor assigned by journal: 11 Apr, 2024
Editor invited by journal: 11 Apr, 2024
Submission checks completed at journal: 09 Apr, 2024
First submitted to journal: 02 Apr, 2024
{"props":{"pageProps":{"initialData":{"identity":"rs-4203860","acceptedTermsAndConditions":true,"allowDirectSubmit":false,"archivedVersions":[],"articleType":"Article","associatedPublications":[],"authors":[{"id":290196923,"identity":"74a004c6-cb62-4d9a-8dfd-8fa711accc6c","order_by":0,"name":"Ryunosuke Noda","email":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAZAAAAAyAQMAAABI0h/eAAAABlBMVEX///8AAABVwtN+AAAACXBIWXMAAA7EAAAOxAGVKw4bAAAAy0lEQVRIiWNgGAWjYHACAxDBw8beAGJbEK9Fjp/nAIgtQbwWY8kZCSCaCC3m7M0bP1e2HU7ccPP51Q0/CiQY+Nu7E/Bqsew5Vix5FqTldk7ZzR6gwyTOnN2A31U3cgwkGyFa0m7wALUYSOQS1GL8E6zl5pm0m3+I1GIGsgXoffZjt4mz5cyxMsuGc+nAQM5huy1jIMFD2C/HmzffbCizBkbl8Wc33/yxkeNv78WvBQwY2UAkDyQZEFYOBn9ABPsDIlWPglEwCkbBSAMAcF9MAL5FFeIAAAAASUVORK5CYII=","orcid":"","institution":"St. Marianna University School of Medicine","correspondingAuthor":true,"prefix":"","firstName":"Ryunosuke","middleName":"","lastName":"Noda","suffix":""},{"id":290196924,"identity":"152bd022-cbb4-4421-bcc5-85c9c716205b","order_by":1,"name":"Daisuke Ichikawa","email":"","orcid":"","institution":"St. Marianna University School of Medicine","correspondingAuthor":false,"prefix":"","firstName":"Daisuke","middleName":"","lastName":"Ichikawa","suffix":""},{"id":290196925,"identity":"975e93da-8ac7-4acb-8e11-c1ef7bb610c9","order_by":2,"name":"Yugo Shibagaki","email":"","orcid":"","institution":"St. 
Marianna University School of Medicine","correspondingAuthor":false,"prefix":"","firstName":"Yugo","middleName":"","lastName":"Shibagaki","suffix":""}],"badges":[],"createdAt":"2024-04-02 04:52:30","currentVersionCode":1,"declarations":"","doi":"10.21203/rs.3.rs-4203860/v1","doiUrl":"https://doi.org/10.21203/rs.3.rs-4203860/v1","draftVersion":[],"editorialEvents":[],"editorialNote":"","failedWorkflow":false,"files":[{"id":54594596,"identity":"fbfd203a-1790-427e-b572-54c6420e201f","added_by":"auto","created_at":"2024-04-12 18:28:34","extension":"png","order_by":1,"title":"Figure 1","display":"","copyAsset":false,"role":"figure","size":140563,"visible":true,"origin":"","legend":"\u003cp\u003eFlow diagram of patient selection.\u003c/p\u003e","description":"","filename":"Figure1.png","url":"https://assets-eu.researchsquare.com/files/rs-4203860/v1/15f62059968f44f339023027.png"},{"id":54594598,"identity":"75a03612-faa5-48ab-9da4-6419d353d97a","added_by":"auto","created_at":"2024-04-12 18:28:34","extension":"png","order_by":2,"title":"Figure 2","display":"","copyAsset":false,"role":"figure","size":478449,"visible":true,"origin":"","legend":"\u003cp\u003eReceiver-operating characteristic curves of the machine learning models in (a) derivation cohort and (b) validation cohort.\u003c/p\u003e","description":"","filename":"Figure2.png","url":"https://assets-eu.researchsquare.com/files/rs-4203860/v1/2ab2504814dd20cccf3bc67f.png"},{"id":54594611,"identity":"e7ed8ca9-52a8-4eaf-bfc3-e7dd64fe9b3d","added_by":"auto","created_at":"2024-04-12 18:28:36","extension":"png","order_by":3,"title":"Figure 3","display":"","copyAsset":false,"role":"figure","size":566386,"visible":true,"origin":"","legend":"\u003cp\u003ePrecision-recall curves of the machine learning models in (a) derivation cohort and (b) validation 
cohort.\u003c/p\u003e","description":"","filename":"Figure3.png","url":"https://assets-eu.researchsquare.com/files/rs-4203860/v1/f60321b8efc0a43db308a68d.png"},{"id":54594639,"identity":"2804a873-3833-4c61-836b-f4191f8e31df","added_by":"auto","created_at":"2024-04-12 18:28:42","extension":"png","order_by":4,"title":"Figure 4","display":"","copyAsset":false,"role":"figure","size":1185482,"visible":true,"origin":"","legend":"\u003cp\u003eShapley additive explanations beeswarm plots of (a) XGBoost, (b) LightGBM, and (c) Random Forest for prediction of IgA nephropathy. LDH: Lactate Dehydrogenase, CK: Creatine Kinase, IgG: Immunoglobulin G, IgA: Immunoglobulin A, IgA/C3: Immunoglobulin A / Complement C3 ratio, Urine RBC: Urine red blood cells, UPCR: Urine protein to creatinine ratio.\u003c/p\u003e","description":"","filename":"Figure4.png","url":"https://assets-eu.researchsquare.com/files/rs-4203860/v1/7432b86518b362c2291cdfb4.png"},{"id":54594654,"identity":"705800c3-7b31-4f83-81bf-3736d5e501a8","added_by":"auto","created_at":"2024-04-12 18:28:53","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":894996,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-4203860/v1/120e2bcc-ba75-41ef-a394-0352260ecfaf.pdf"},{"id":54594622,"identity":"3a87e4e6-70f5-4a30-9aab-741eb42249fc","added_by":"auto","created_at":"2024-04-12 18:28:38","extension":"pdf","order_by":6,"title":"","display":"","copyAsset":false,"role":"supplement","size":1196863,"visible":true,"origin":"","legend":"","description":"","filename":"Supplementarymaterial20240402.pdf","url":"https://assets-eu.researchsquare.com/files/rs-4203860/v1/f5ee0b0f6a786e47c9aa1295.pdf"}],"financialInterests":"Competing interest reported. R.N. 
was financially supported by the Tateishi Science and Technology Foundation (Grant ID: 2237009) and the Nishikawa Medical Foundation (Grant ID: 202201).","formattedTitle":"Machine learning-based diagnostic prediction of IgA nephropathy: model development and validation study","fulltext":[{"header":"INTRODUCTION","content":"\u003cp\u003eIgA nephropathy (IgAN) is the most common primary glomerulonephritis worldwide, leading to end-stage kidney failure in 30\u0026ndash;40% of patients within two decades of diagnosis \u003csup\u003e1\u003c/sup\u003e. For favorable outcomes in IgAN patients, early detection and timely treatment are essential. IgAN follows a variable clinical course, with differing degrees of hematuria and/or proteinuria, which complicates diagnosis from general laboratory tests \u003csup\u003e2,3\u003c/sup\u003e. Definitive diagnosis requires kidney biopsy, which has several contraindications and entails risks such as significant bleeding \u003csup\u003e4\u003c/sup\u003e. In severe cases, such bleeding may require interventions such as transfusion, arterial embolization, or surgery, representing a crucial clinical challenge \u003csup\u003e5\u003c/sup\u003e.\u003c/p\u003e \u003cp\u003eThe potential for predicting the diagnosis of IgAN before or without kidney biopsy has been a topic of discussion \u003csup\u003e6\u003c/sup\u003e. Specifically, the study of non-invasive diagnostic approaches for IgAN through blood and urine biomarkers has gained attention. Biomarkers such as microscopic hematuria, persistent proteinuria, serum IgA levels, and the serum IgA/C3 ratio have been identified as effective for distinguishing IgAN from other kidney diseases \u003csup\u003e7,8\u003c/sup\u003e. Although these variables are measurable in routine clinical settings, their diagnostic capability is limited, serving primarily to aid differential diagnosis. 
Recent studies have emphasized the importance of galactose-deficient IgA1 (Gd-IgA1), Gd-IgA1-specific IgG, and Gd-IgA1-containing immune complexes in IgAN pathogenesis \u003csup\u003e9,10\u003c/sup\u003e. Elevated serum levels of these markers have been observed in IgAN, suggesting their potential as specific biomarkers \u003csup\u003e11\u003c/sup\u003e. However, their practical application is limited, as their measurement requires advanced equipment not available in general medical facilities.\u003c/p\u003e \u003cp\u003eMachine learning, a subset of artificial intelligence, is instrumental in analyzing extensive clinical data from electronic health records, facilitating the development of predictive models \u003csup\u003e12,13\u003c/sup\u003e. In nephrology, it is expected to advance the prediction of acute kidney injury onset \u003csup\u003e14\u003c/sup\u003e, chronic kidney disease prognosis \u003csup\u003e15\u003c/sup\u003e, and intradialytic hypotension onset \u003csup\u003e16\u003c/sup\u003e, and to assist kidney pathological diagnosis \u003csup\u003e17\u003c/sup\u003e. IgAN diagnostic prediction studies have predominantly employed logistic regression \u003csup\u003e8,18,19\u003c/sup\u003e, a conventional statistical model that assumes linear relationships, which can limit predictive performance. Advanced machine learning algorithms, capable of modeling non-linear relationships and complex interactions, could improve predictive performance \u003csup\u003e20\u003c/sup\u003e. However, the efficacy of machine learning in predicting IgAN diagnosis remains unexplored.\u003c/p\u003e \u003cp\u003eThis study aims to develop and validate diagnostic prediction models for IgAN using machine learning, based on patient demographics, blood tests, and urine tests, which can be easily obtained in clinical practice. 
A further goal is to show that machine learning models can provide a non-invasive, highly accurate, and reliable diagnostic approach for IgAN compared with conventional clinical parameters or conventional statistical models.\u003c/p\u003e"},{"header":"METHODS","content":"\u003cdiv id=\"Sec3\" class=\"Section2\"\u003e \u003ch2\u003eStudy design and study participants\u003c/h2\u003e \u003cp\u003eThis was a retrospective cohort study of patients at St. Marianna University Hospital. It included all adult patients who underwent native kidney biopsy from January 1, 2006, to September 30, 2022. Patients with inconclusive diagnoses and those with multiple primary diagnoses were excluded. Data were collected from electronic health records. Patients who underwent kidney biopsy between January 1, 2006, and December 31, 2019, formed the derivation cohort, and those who underwent biopsy between January 1, 2020, and September 30, 2022, formed the validation cohort. Details of patient selection for the derivation and validation cohorts are shown in Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec4\" class=\"Section2\"\u003e \u003ch2\u003eEthics approval\u003c/h2\u003e \u003cp\u003eThis study was conducted according to \u0026ldquo;The Declaration of Helsinki\u0026rdquo;, \u0026ldquo;the Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis Statement\u0026rdquo; \u003csup\u003e21\u003c/sup\u003e, and \u0026ldquo;Guidelines for Developing and Reporting Machine Learning Predictive Models in Biomedical Research: A Multidisciplinary View\u0026rdquo; \u003csup\u003e22\u003c/sup\u003e. The study protocol was approved by the institutional review board of St. Marianna University Hospital (approval number 6025). 
As the study was retrospective and involved minimal risk, the requirement for informed consent was waived.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec5\" class=\"Section2\"\u003e \u003ch2\u003ePredictor variables\u003c/h2\u003e \u003cp\u003eWe used information that is routinely measured in clinical practice as potential predictor variables. Baseline data of patients before the native kidney biopsy were retrospectively collected from electronic health records. These included demographic characteristics, blood tests, and urine tests. Demographic characteristics comprised age, sex, height, weight, and body mass index. Blood test items comprised white blood cells, hemoglobin, total protein, albumin, blood urea nitrogen, creatinine, uric acid, aspartate aminotransferase, alanine aminotransferase, alkaline phosphatase, lactate dehydrogenase (LDH), creatine kinase (CK), total cholesterol, glucose, hemoglobin A1c, C-reactive protein, immunoglobulin G (IgG), immunoglobulin A (IgA), immunoglobulin M (IgM), complement C3, complement C4, IgA/complement C3 ratio (IgA/C3), and antinuclear antibodies. Urine test items comprised urine protein/creatinine ratio (UPCR) and urine red blood cells (Urine RBC). Urine RBC was scored as follows: 0 for \u0026lt;\u0026thinsp;1/high power field (HPF), 2.5 for 1\u0026ndash;4/HPF, 7.5 for 5\u0026ndash;9/HPF, 20 for 10\u0026ndash;29/HPF, 40 for 30\u0026ndash;49/HPF, 75 for 50\u0026ndash;99/HPF, and 100 for \u0026ge;\u0026thinsp;100/HPF.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec6\" class=\"Section2\"\u003e \u003ch2\u003eOutcome measures\u003c/h2\u003e \u003cp\u003eThe outcome of this study is the diagnosis of IgA nephropathy. 
The definitive diagnoses made through kidney biopsy by nephrologists were collected. IgA nephropathy was assigned the positive label (1) and all other diagnoses the negative label (0).\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec7\" class=\"Section2\"\u003e \u003ch2\u003eData preprocessing\u003c/h2\u003e \u003cp\u003eThe number and proportion of missing values for each variable are shown in Supplementary Table \u003cspan refid=\"MOESM1\" class=\"InternalRef\"\u003eS1\u003c/span\u003e. Variables with more than 20% missing values were not included in the analysis. To avoid potential bias arising from excluding patients with missing data, imputation was adopted. The k-nearest neighbor imputation algorithm was employed to fill in missing values for continuous variables. The variables were standardized to have a mean of 0 and a standard deviation of 1.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec8\" class=\"Section2\"\u003e \u003ch2\u003eVariable selection\u003c/h2\u003e \u003cp\u003eVariable reduction was performed to prevent overfitting of machine learning models and to reduce computational costs. Predictor variables with nearly zero variance, i.e., variables whose proportion of unique values was less than 5%, were excluded from the analysis. Four variable selection methods were applied to identify subsets of predictor variables: Least Absolute Shrinkage and Selection Operator (LASSO), Random Forest-Recursive Feature Elimination, Random Forest-Filtering, and SelectFromModel with Extra Trees. 
The final predictor variables for model development were determined by integrating the results from the four methods, choosing variables selected by three or more of the four methods.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec9\" class=\"Section2\"\u003e \u003ch2\u003eModel development and evaluation\u003c/h2\u003e \u003cp\u003eFor model development, four machine learning algorithms (eXtreme Gradient Boosting (XGBoost), LightGBM, Random Forest, and Artificial Neural Network) and logistic regression were applied to the data of the derivation cohort. XGBoost and LightGBM, along with Random Forest, are tree-based algorithms that combine decision trees with ensemble learning. XGBoost and LightGBM enhance predictive performance by building decision trees sequentially, with each tree correcting the errors of the previous trees through boosting \u003csup\u003e23,24\u003c/sup\u003e. Random Forest mitigates overfitting by independently training multiple decision trees and integrating their predictions through bagging \u003csup\u003e25\u003c/sup\u003e. Artificial Neural Networks consist of an input layer, hidden layers, and an output layer, and can handle complex relationships between inputs and outputs using non-linear activation functions \u003csup\u003e26\u003c/sup\u003e. Logistic regression is a statistical linear model widely used for binary classification, producing probabilistic outputs and classifying cases as positive or negative based on a specific threshold \u003csup\u003e27\u003c/sup\u003e. To identify the optimal hyperparameters for each model, training and validation were conducted in the derivation cohort using five repeats of 5-fold cross-validation. Bayesian optimization was used for hyperparameter tuning. The hyperparameters tuned for each model are shown in Supplementary Table S2.\u003c/p\u003e \u003cp\u003eThe performance of the final prediction models was evaluated in both the derivation and validation cohorts. 
The model performance was assessed using the Area Under the Receiver Operating Characteristic curve (AUROC) and the Area Under the Precision-Recall Curve (AUPRC). AUROC and AUPRC were selected as they reflect performance across all classification thresholds and are less affected by class imbalance. We also included precision, recall, and F1 score (the harmonic mean of precision and recall) as common evaluation metrics for binary classification. The 95% confidence intervals (95% CI) for each metric were generated through 1000 bootstrap iterations with unique random seeds. The model calibration was evaluated using calibration plots and the Brier score. Calibration plots compare the actual positive fraction to the average predicted probability across quintiles of predicted probability. The Brier score, reflecting the mean squared difference between predicted probabilities and actual outcomes, serves as a dual measure of predictive performance and calibration.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec10\" class=\"Section2\"\u003e \u003ch2\u003eModel interpretations\u003c/h2\u003e \u003cp\u003eThe SHapley Additive exPlanations (SHAP) method was used to explore the interpretability of the models with high diagnostic performance. SHAP provides a unified approach for the interpretation of model predictions, offering consistent and locally accurate attribution values, i.e., the SHAP values, for each variable within the predictive model \u003csup\u003e28\u003c/sup\u003e. The role of each variable in predicting IgA nephropathy can be explained as their collective contributions to the overall risk output for each case.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec11\" class=\"Section2\"\u003e \u003ch2\u003eStatistical analysis\u003c/h2\u003e \u003cp\u003eContinuous variables were described using mean and standard deviation for normally distributed data, and median values along with interquartile ranges for non-normally distributed data. 
Categorical variables were presented as counts and percentages. For statistical comparisons, the Student's t-test was applied to normally distributed continuous variables, the Mann-Whitney U test to non-normally distributed continuous variables, and the chi-square test to categorical variables. Variables with a two-tailed p-value less than 0.05 were considered statistically significant. For variable selection, we employed the sklearn library in Python (version 3.10.12). Model development utilized the sklearn, xgboost, and lightgbm libraries and evaluation was conducted using the sklearn, optuna, and shap libraries. R (version 4.2.2) was used for statistical analyses.\u003c/p\u003e \u003c/div\u003e"},{"header":"RESULTS","content":"\u003cdiv id=\"Sec13\" class=\"Section2\"\u003e\n \u003ch2\u003ePatient characteristics\u003c/h2\u003e\n \u003cp\u003eAfter excluding participants with multiple primary diagnoses or without definitive diagnosis, 1,268 participants were enrolled. Of these, 1,027 were included in the derivation cohort and 241 in the validation cohort. The baseline characteristics and outcomes for the derivation and validation cohorts are presented in Table\u0026nbsp;\u003cspan class=\"InternalRef\"\u003e1\u003c/span\u003e. In the derivation cohort, 294 (28.6%) were diagnosed with IgA nephropathy, compared to 59 (24.5%) in the validation cohort. 
The baseline characteristics of the IgA nephropathy and non-IgA nephropathy groups for each cohort are detailed in Supplementary Tables S3 and S4.\u003c/p\u003e\n \u003cdiv class=\"gridtable\"\u003e\n \u003ctable id=\"Tab1\" border=\"1\"\u003e\n \u003ccaption\u003e\n \u003cdiv class=\"CaptionNumber\"\u003eTable 1\u003c/div\u003e\n \u003cdiv class=\"CaptionContent\"\u003e\n \u003cp\u003eBaseline characteristics and outcomes of the patients in the derivation and validation cohorts.\u003c/p\u003e\n \u003c/div\u003e\n \u003c/caption\u003e\n \u003cthead\u003e\n \u003ctr\u003e\n \u003cth colspan=\"2\" align=\"left\"\u003e\n \u003cp\u003eVariables\u003c/p\u003e\n \u003c/th\u003e\n \u003cth colspan=\"2\" align=\"left\"\u003e\n \u003cp\u003eDerivation cohort (n\u0026thinsp;=\u0026thinsp;1027)\u003c/p\u003e\n \u003c/th\u003e\n \u003cth colspan=\"2\" align=\"left\"\u003e\n \u003cp\u003eValidation cohort (n\u0026thinsp;=\u0026thinsp;241)\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003ep-value\u003c/p\u003e\n \u003c/th\u003e\n \u003c/tr\u003e\n \u003c/thead\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd colspan=\"3\" align=\"left\"\u003e\n \u003cp\u003e\u003cstrong\u003eDemographic characteristics\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd colspan=\"4\" align=\"left\"\u003e\u0026nbsp;\u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd colspan=\"2\" align=\"left\"\u003e\n \u003cp\u003eAge (years)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd colspan=\"2\" align=\"left\"\u003e\n \u003cp\u003e47 [32, 65]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd colspan=\"2\" align=\"left\"\u003e\n \u003cp\u003e59 [44, 72]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u0026lt;\u0026thinsp;0.001\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd colspan=\"2\" align=\"left\"\u003e\n \u003cp\u003eSex (male)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd colspan=\"2\" 
align=\"left\"\u003e\n \u003cp\u003e483 (47.0)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd colspan=\"2\" align=\"left\"\u003e\n \u003cp\u003e131 (54.4)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.045\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd colspan=\"2\" align=\"left\"\u003e\n \u003cp\u003eHeight (cm)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd colspan=\"2\" align=\"left\"\u003e\n \u003cp\u003e161.6 [155.0, 169.0]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd colspan=\"2\" align=\"left\"\u003e\n \u003cp\u003e162.7 [155.8, 169.6]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.301\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd colspan=\"2\" align=\"left\"\u003e\n \u003cp\u003eBody Weight (kg)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd colspan=\"2\" align=\"left\"\u003e\n \u003cp\u003e58.1 [50.5, 67.4]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd colspan=\"2\" align=\"left\"\u003e\n \u003cp\u003e59.0 [50.5, 69.3]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.361\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd colspan=\"2\" align=\"left\"\u003e\n \u003cp\u003eBody Mass Index (kg/m\u003csup\u003e2\u003c/sup\u003e)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd colspan=\"2\" align=\"left\"\u003e\n \u003cp\u003e22.2 [19.9, 25.2]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd colspan=\"2\" align=\"left\"\u003e\n \u003cp\u003e22.3 [20.1, 25.4]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.298\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u003cstrong\u003eBlood tests\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd colspan=\"4\" align=\"left\"\u003e\u0026nbsp;\u003c/td\u003e\n \u003ctd colspan=\"2\" align=\"left\"\u003e\u0026nbsp;\u003c/td\u003e\n \u003c/tr\u003e\n 
\u003ctr\u003e\n \u003ctd colspan=\"2\" align=\"left\"\u003e\n \u003cp\u003eWhite blood cells (/\u0026micro;L)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd colspan=\"2\" align=\"left\"\u003e\n \u003cp\u003e6500 [5200, 8300]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd colspan=\"2\" align=\"left\"\u003e\n \u003cp\u003e6100 [4500, 8100]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.032\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd colspan=\"2\" align=\"left\"\u003e\n \u003cp\u003eHemoglobin (g/dL)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd colspan=\"2\" align=\"left\"\u003e\n \u003cp\u003e12.4 [10.6, 14.0]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd colspan=\"2\" align=\"left\"\u003e\n \u003cp\u003e12.1 [10.3, 13.5]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.035\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd colspan=\"2\" align=\"left\"\u003e\n \u003cp\u003eTotal protein (g/dL)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd colspan=\"2\" align=\"left\"\u003e\n \u003cp\u003e6.5 [5.7, 7.1]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd colspan=\"2\" align=\"left\"\u003e\n \u003cp\u003e6.4 [5.7, 6.9]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.057\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd colspan=\"2\" align=\"left\"\u003e\n \u003cp\u003eAlbumin (g/dL)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd colspan=\"2\" align=\"left\"\u003e\n \u003cp\u003e3.6 [2.7, 4.1]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd colspan=\"2\" align=\"left\"\u003e\n \u003cp\u003e3.3 [2.3, 3.8]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u0026lt;\u0026thinsp;0.001\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd colspan=\"2\" align=\"left\"\u003e\n \u003cp\u003eBUN (mg/dL)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd colspan=\"2\" 
align=\"left\"\u003e\n \u003cp\u003e16.5 [12.3, 23.2]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd colspan=\"2\" align=\"left\"\u003e\n \u003cp\u003e18.0 [12.8, 28.2]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.019\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd colspan=\"2\" align=\"left\"\u003e\n \u003cp\u003eCreatinine (mg/dL)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd colspan=\"2\" align=\"left\"\u003e\n \u003cp\u003e0.90 [0.69, 1.35]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd colspan=\"2\" align=\"left\"\u003e\n \u003cp\u003e1.11 [0.76, 1.67]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u0026lt;\u0026thinsp;0.001\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd colspan=\"2\" align=\"left\"\u003e\n \u003cp\u003eUric acid (mg/dL)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd colspan=\"2\" align=\"left\"\u003e\n \u003cp\u003e6.1 [4.9, 7.3]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd colspan=\"2\" align=\"left\"\u003e\n \u003cp\u003e6.2 [5.0, 7.5]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.341\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd colspan=\"2\" align=\"left\"\u003e\n \u003cp\u003eAST (U/L)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd colspan=\"2\" align=\"left\"\u003e\n \u003cp\u003e20 [16, 26]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd colspan=\"2\" align=\"left\"\u003e\n \u003cp\u003e20 [17, 26]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.412\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd colspan=\"2\" align=\"left\"\u003e\n \u003cp\u003eALT (U/L)\u003c/p\u003e\n \u003c/td\u003e\n 
\u003ctd colspan=\"2\" align=\"left\"\u003e\n \u003cp\u003e16 [11, 24]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd colspan=\"2\" align=\"left\"\u003e\n \u003cp\u003e16 [12, 24]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.672\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd colspan=\"2\" align=\"left\"\u003e\n \u003cp\u003eALP (U/L)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd colspan=\"2\" align=\"left\"\u003e\n \u003cp\u003e203 [162, 259]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd colspan=\"2\" align=\"left\"\u003e\n \u003cp\u003e83 [60, 161]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u0026lt;\u0026thinsp;0.001\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd colspan=\"2\" align=\"left\"\u003e\n \u003cp\u003eLDH (U/L)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd colspan=\"2\" align=\"left\"\u003e\n \u003cp\u003e192 [164, 237]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd colspan=\"2\" align=\"left\"\u003e\n \u003cp\u003e204 [166, 241]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.127\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd colspan=\"2\" align=\"left\"\u003e\n \u003cp\u003eCK (U/L)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd colspan=\"2\" align=\"left\"\u003e\n \u003cp\u003e73 [44, 123]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd colspan=\"2\" align=\"left\"\u003e\n \u003cp\u003e73 [42, 111]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.460\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd colspan=\"2\" align=\"left\"\u003e\n \u003cp\u003eTotal cholesterol (mg/dL)\u003c/p\u003e\n 
\u003c/td\u003e\n \u003ctd colspan=\"2\" align=\"left\"\u003e\n \u003cp\u003e200 [166, 243]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd colspan=\"2\" align=\"left\"\u003e\n \u003cp\u003e195 [165, 242]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.292\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd colspan=\"2\" align=\"left\"\u003e\n \u003cp\u003eGlucose (mg/dL)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd colspan=\"2\" align=\"left\"\u003e\n \u003cp\u003e98 [91, 111]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd colspan=\"2\" align=\"left\"\u003e\n \u003cp\u003e99 [92, 109]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.407\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd colspan=\"2\" align=\"left\"\u003e\n \u003cp\u003eHbA1c (%)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd colspan=\"2\" align=\"left\"\u003e\n \u003cp\u003e5.3 [4.9, 5.7]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd colspan=\"2\" align=\"left\"\u003e\n \u003cp\u003e5.6 [5.2, 5.9]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u0026lt;\u0026thinsp;0.001\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd colspan=\"2\" align=\"left\"\u003e\n \u003cp\u003eC-Reactive Protein (mg/dL)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd colspan=\"2\" align=\"left\"\u003e\n \u003cp\u003e0.08 [0.03, 0.52]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd colspan=\"2\" align=\"left\"\u003e\n \u003cp\u003e0.14 [0.04, 0.98]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.002\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd colspan=\"2\" align=\"left\"\u003e\n \u003cp\u003eIgG (mg/dL)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd colspan=\"2\" align=\"left\"\u003e\n \u003cp\u003e1187 [861, 1572]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd colspan=\"2\" align=\"left\"\u003e\n \u003cp\u003e1184 
[817, 1547]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.777\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd colspan=\"2\" align=\"left\"\u003e\n \u003cp\u003eIgA (mg/dL)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd colspan=\"2\" align=\"left\"\u003e\n \u003cp\u003e282 [206, 383]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd colspan=\"2\" align=\"left\"\u003e\n \u003cp\u003e291 [211, 366]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.900\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd colspan=\"2\" align=\"left\"\u003e\n \u003cp\u003eIgM (mg/dL)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd colspan=\"2\" align=\"left\"\u003e\n \u003cp\u003e96 [65, 143]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd colspan=\"2\" align=\"left\"\u003e\n \u003cp\u003e82 [51, 115]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u0026lt;\u0026thinsp;0.001\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd colspan=\"2\" align=\"left\"\u003e\n \u003cp\u003eComplement C3 (mg/dL)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd colspan=\"2\" align=\"left\"\u003e\n \u003cp\u003e100 [82, 123]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd colspan=\"2\" align=\"left\"\u003e\n \u003cp\u003e106 [90, 132]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.003\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd colspan=\"2\" align=\"left\"\u003e\n \u003cp\u003eComplement C4 (mg/dL)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd colspan=\"2\" align=\"left\"\u003e\n \u003cp\u003e25 [18, 33]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd colspan=\"2\" align=\"left\"\u003e\n \u003cp\u003e27 [20, 
35]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.012\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd colspan=\"2\" align=\"left\"\u003e\n \u003cp\u003eIgA/C3\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd colspan=\"2\" align=\"left\"\u003e\n \u003cp\u003e2.83 [1.97, 4.31]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd colspan=\"2\" align=\"left\"\u003e\n \u003cp\u003e2.64 [1.85, 3.76]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.19\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd colspan=\"2\" align=\"left\"\u003e\n \u003cp\u003eANA (titer)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd colspan=\"2\" align=\"left\"\u003e\n \u003cp\u003e40 [40, 80]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd colspan=\"2\" align=\"left\"\u003e\n \u003cp\u003e40 [40, 40]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.011\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u003cstrong\u003eUrine tests\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd colspan=\"5\" align=\"left\"\u003e\u0026nbsp;\u003c/td\u003e\n \u003ctd align=\"left\"\u003e\u0026nbsp;\u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eUrine RBC (/HPF)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd colspan=\"5\" align=\"left\"\u003e\u0026nbsp;\u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.048\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd colspan=\"2\" align=\"left\"\u003e\n \u003cp\u003e\u0026lt;\u0026thinsp;1\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd colspan=\"2\" align=\"left\"\u003e\n \u003cp\u003e173 (16.8)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd colspan=\"2\" align=\"left\"\u003e\n \u003cp\u003e46 (19.1)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd 
align=\"left\"\u003e\u0026nbsp;\u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd colspan=\"2\" align=\"left\"\u003e\n \u003cp\u003e1\u0026thinsp;~\u0026thinsp;4\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd colspan=\"2\" align=\"left\"\u003e\n \u003cp\u003e191 (18.6)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd colspan=\"2\" align=\"left\"\u003e\n \u003cp\u003e50 (20.7)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\u0026nbsp;\u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd colspan=\"2\" align=\"left\"\u003e\n \u003cp\u003e5\u0026thinsp;~\u0026thinsp;9\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd colspan=\"2\" align=\"left\"\u003e\n \u003cp\u003e148 (14.4)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd colspan=\"2\" align=\"left\"\u003e\n \u003cp\u003e30 (12.4)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\u0026nbsp;\u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd colspan=\"2\" align=\"left\"\u003e\n \u003cp\u003e10\u0026thinsp;~\u0026thinsp;29\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd colspan=\"2\" align=\"left\"\u003e\n \u003cp\u003e219 (21.3)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd colspan=\"2\" align=\"left\"\u003e\n \u003cp\u003e49 (20.3)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\u0026nbsp;\u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd colspan=\"2\" align=\"left\"\u003e\n \u003cp\u003e30\u0026thinsp;~\u0026thinsp;49\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd colspan=\"2\" align=\"left\"\u003e\n \u003cp\u003e71 (6.9)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd colspan=\"2\" align=\"left\"\u003e\n \u003cp\u003e29 (12.0)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\u0026nbsp;\u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd colspan=\"2\" align=\"left\"\u003e\n \u003cp\u003e50\u0026thinsp;~\u0026thinsp;99\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd colspan=\"2\" align=\"left\"\u003e\n \u003cp\u003e71 (6.9)\u003c/p\u003e\n 
\u003c/td\u003e\n \u003ctd colspan=\"2\" align=\"left\"\u003e\n \u003cp\u003e14 (5.8)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\u0026nbsp;\u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd colspan=\"2\" align=\"left\"\u003e\n \u003cp\u003e\u0026ge;\u0026thinsp;100\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd colspan=\"2\" align=\"left\"\u003e\n \u003cp\u003e154 (15.0)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd colspan=\"2\" align=\"left\"\u003e\n \u003cp\u003e23 (9.5)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\u0026nbsp;\u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd colspan=\"2\" align=\"left\"\u003e\n \u003cp\u003eUPCR (g/gCre)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd colspan=\"2\" align=\"left\"\u003e\n \u003cp\u003e1.19 [0.47, 3.74]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd colspan=\"2\" align=\"left\"\u003e\n \u003cp\u003e1.10 [0.42, 2.98]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.143\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd colspan=\"2\" align=\"left\"\u003e\n \u003cp\u003e\u003cstrong\u003eOutcome\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd colspan=\"2\" align=\"left\"\u003e\u0026nbsp;\u003c/td\u003e\n \u003ctd colspan=\"2\" align=\"left\"\u003e\u0026nbsp;\u003c/td\u003e\n \u003ctd align=\"left\"\u003e\u0026nbsp;\u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd colspan=\"2\" align=\"left\"\u003e\n \u003cp\u003eIgA nephropathy\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd colspan=\"2\" align=\"left\"\u003e\n \u003cp\u003e294 (28.6)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd colspan=\"2\" align=\"left\"\u003e\n \u003cp\u003e59 (24.5)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.203\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n \u003ctfoot\u003e\n \u003ctr\u003e\n \u003ctd colspan=\"7\"\u003eBUN: Blood Urea Nitrogen, AST: 
Aspartate Aminotransferase, ALT: Alanine Aminotransferase, ALP: Alkaline Phosphatase, LDH: Lactate Dehydrogenase, CK: Creatine Kinase, HbA1c: Hemoglobin A1c, IgG: Immunoglobulin G, IgA: Immunoglobulin A, IgM: Immunoglobulin M, IgA/C3: Immunoglobulin A / Complement C3 ratio, ANA: Antinuclear antibodies, Urine RBC: Urine red blood cells, UPCR: Urine protein to creatinine ratio.\u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tfoot\u003e\n \u003c/table\u003e\n \u003c/div\u003e\n\u003c/div\u003e\n\u003cdiv id=\"Sec14\" class=\"Section2\"\u003e\n \u003ch2\u003ePredictor variables\u003c/h2\u003e\n \u003cp\u003eFour variable selection methods identified a total of 14 predictors, which were included in the machine learning models: age, hemoglobin, total protein, albumin, LDH, CK, C-reactive protein, IgG, IgA, complement C3, complement C4, IgA/C3, Urine RBC, and UPCR (Supplementary Table S5).\u003c/p\u003e\n\u003c/div\u003e\n\u003cdiv id=\"Sec15\" class=\"Section2\"\u003e\n \u003ch2\u003eModel performance\u003c/h2\u003e\n \u003cp\u003eTable\u0026nbsp;\u003cspan class=\"InternalRef\"\u003e2\u003c/span\u003e shows the AUROC, AUPRC, precision, recall, and F1 score of each machine learning model in the derivation and validation cohorts. In the derivation cohort, LightGBM achieved the highest AUROC at 0.913 (95% CI 0.906\u0026ndash;0.919), which was significantly higher than that of logistic regression and the Artificial Neural Network but not significantly different from that of XGBoost or Random Forest (Fig.\u0026nbsp;\u003cspan class=\"InternalRef\"\u003e2\u003c/span\u003e). In the validation cohort, XGBoost had the highest AUROC at 0.894 (95% CI 0.850\u0026ndash;0.935), although the differences from the other models were not significant. 
In the derivation cohort, XGBoost achieved the highest AUPRC at 0.779 (95% CI 0.771\u0026ndash;0.794), which was significantly higher than that of logistic regression and the Artificial Neural Network but not significantly different from that of LightGBM or Random Forest (Fig.\u0026nbsp;\u003cspan class=\"InternalRef\"\u003e3\u003c/span\u003e). In the validation cohort, XGBoost also had the highest AUPRC at 0.748 (95% CI 0.630\u0026ndash;0.846), although the differences from the other models were not significant. The calibration plots demonstrated good calibration for all models, with Brier scores ranging from 0.107 to 0.137 (Supplementary Fig. \u003cspan class=\"InternalRef\"\u003eS1\u003c/span\u003e).\u003c/p\u003e\n \u003cdiv class=\"gridtable\"\u003e\n \u003ctable id=\"Tab2\" border=\"1\"\u003e\n \u003ccaption\u003e\n \u003cdiv class=\"CaptionNumber\"\u003eTable 2\u003c/div\u003e\n \u003cdiv class=\"CaptionContent\"\u003e\n \u003cp\u003ePerformance of the machine learning models in the derivation and validation cohorts.\u003c/p\u003e\n \u003c/div\u003e\n \u003c/caption\u003e\n \u003cthead\u003e\n \u003ctr\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eModel\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eAUROC\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eAUPRC\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003ePrecision\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eRecall\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eF1 score\u003c/p\u003e\n \u003c/th\u003e\n \u003c/tr\u003e\n \u003c/thead\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u003cstrong\u003eDerivation cohort\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\u0026nbsp;\u003c/td\u003e\n \u003ctd align=\"left\"\u003e\u0026nbsp;\u003c/td\u003e\n \u003ctd align=\"left\"\u003e\u0026nbsp;\u003c/td\u003e\n \u003ctd 
align=\"left\"\u003e\u0026nbsp;\u003c/td\u003e\n \u003ctd align=\"left\"\u003e\u0026nbsp;\u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eXGBoost\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.910 (0.903\u0026ndash;0.917)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.779 (0.771\u0026ndash;0.794)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.731 (0.716\u0026ndash;0.744)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.798 (0.785\u0026ndash;0.811)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.676 (0.656\u0026ndash;0.694)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eLightGBM\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.913 (0.906\u0026ndash;0.919)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.778 (0.770\u0026ndash;0.795)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.735 (0.719\u0026ndash;0.751)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.794 (0.776\u0026ndash;0.812)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.687 (0.665\u0026ndash;0.709)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eRandom Forest\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.910 (0.904\u0026ndash;0.916)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.757 (0.749\u0026ndash;0.776)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.702 (0.687\u0026ndash;0.718)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.816 
(0.799\u0026ndash;0.833)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.619 (0.597\u0026ndash;0.640)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eArtificial Neural Network\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.893 (0.886\u0026ndash;0.903)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.736 (0.718\u0026ndash;0.755)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.703 (0.688\u0026ndash;0.718)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.714 (0.697\u0026ndash;0.730)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.697 (0.673\u0026ndash;0.721)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eLogistic Regression\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.865 (0.854\u0026ndash;0.874)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.683 (0.672\u0026ndash;0.707)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.613 (0.592\u0026ndash;0.634)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.691 (0.670\u0026ndash;0.713)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.552 (0.528\u0026ndash;0.576)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u003cstrong\u003eValidation cohort\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\u0026nbsp;\u003c/td\u003e\n \u003ctd align=\"left\"\u003e\u0026nbsp;\u003c/td\u003e\n \u003ctd align=\"left\"\u003e\u0026nbsp;\u003c/td\u003e\n \u003ctd align=\"left\"\u003e\u0026nbsp;\u003c/td\u003e\n \u003ctd 
align=\"left\"\u003e\u0026nbsp;\u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eXGBoost\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.894 (0.850\u0026ndash;0.935)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.748 (0.630\u0026ndash;0.846)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.735 (0.603\u0026ndash;0.861)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.562 (0.434\u0026ndash;0.694)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.634 (0.529\u0026ndash;0.738)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eLightGBM\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.890 (0.839\u0026ndash;0.935)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.740 (0.617\u0026ndash;0.843)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.769 (0.633\u0026ndash;0.886)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.507 (0.377\u0026ndash;0.635)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.609 (0.488\u0026ndash;0.716)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eRandom Forest\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.893 (0.850\u0026ndash;0.933)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.710 (0.578\u0026ndash;0.827)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.761 (0.625\u0026ndash;0.886)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.608 (0.475\u0026ndash;0.730)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n 
\u003cp\u003e0.674 (0.553\u0026ndash;0.766)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eArtificial Neural Network\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.868 (0.821\u0026ndash;0.907)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.597 (0.468\u0026ndash;0.724)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.627 (0.500\u0026ndash;0.755)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.561 (0.435\u0026ndash;0.695)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.590 (0.486\u0026ndash;0.696)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eLogistic Regression\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.861 (0.813\u0026ndash;0.904)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.631 (0.501\u0026ndash;0.748)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.639 (0.472\u0026ndash;0.806)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.359 (0.246\u0026ndash;0.478)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.457 (0.337\u0026ndash;0.581)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n \u003ctfoot\u003e\n \u003ctr\u003e\n \u003ctd colspan=\"6\"\u003eMean (95% confidence interval) is listed in the columns.\u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd colspan=\"6\"\u003eAUROC: area under the receiver-operating characteristic curve, AUPRC: area under the precision-recall curve.\u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tfoot\u003e\n \u003c/table\u003e\n \u003c/div\u003e\n\u003c/div\u003e\n\u003cdiv id=\"Sec16\" class=\"Section2\"\u003e\n \u003ch2\u003eModel 
interpretations\u003c/h2\u003e\n \u003cp\u003eSHAP values were calculated to interpret the high-performing XGBoost, LightGBM, and Random Forest models. The SHAP bar plots showed the variables with the greatest influence on the models\u0026apos; predictions (Supplementary Fig. S2). Age, albumin, IgA/C3, and Urine RBC were consistently among the top five predictor variables across all three models. Figure\u0026nbsp;\u003cspan class=\"InternalRef\"\u003e4\u003c/span\u003e shows the SHAP beeswarm plots, which revealed a negative correlation between age and the prediction of IgAN and positive correlations between albumin, IgA/C3, and Urine RBC and the prediction of IgAN. The SHAP dependence plots indicated various complex relationships between variables and the prediction of IgAN, with similar patterns across the models (Supplementary Figs. S3\u0026ndash;S5).\u003c/p\u003e\n\u003c/div\u003e"},{"header":"DISCUSSION","content":"\u003cp\u003eIn this study, we developed and validated machine learning-based predictive models for diagnosing IgA nephropathy. To the best of our knowledge, this is the first study to compare and evaluate the performance of multiple machine learning models in diagnosing IgA nephropathy. We evaluated several algorithms and found that models employing XGBoost, LightGBM, and Random Forest were effective. We identified age, serum albumin, the serum IgA/C3 ratio, and urine red blood cells as key predictors, in line with previous findings. Additionally, our results suggested that the relationships between predictors and IgAN predictions could extend beyond simple linearity, hinting at the importance of analyzing diverse patterns for accurate diagnosis of IgAN. 
These findings suggest that machine learning has potential for the non-invasive and reliable diagnosis of IgA nephropathy.\u003c/p\u003e \u003cp\u003eOf the machine learning models evaluated, XGBoost, LightGBM, and Random Forest consistently showed higher predictive performance in the derivation and validation cohorts than the conventional logistic regression model, which has traditionally been used to predict IgAN \u003csup\u003e8,18,19\u003c/sup\u003e. While logistic regression is known for its high transparency and interpretability, it assumes a linear relationship between the predictor variables and the log-odds of the outcome, which can limit its performance \u003csup\u003e20,27\u003c/sup\u003e. An IgAN prediction study involving 155 patients undergoing kidney biopsy showed the utility of machine learning models, with Bayesian Networks achieving an AUROC of 0.83 and logistic regression an AUROC of 0.75 \u003csup\u003e29\u003c/sup\u003e. Another study with 519 IgAN patients and 211 non-IgAN patients indicated the potential of Artificial Neural Networks, reporting an AUROC of 0.881 for Artificial Neural Networks and 0.839 for logistic regression \u003csup\u003e30\u003c/sup\u003e. However, these studies were limited to comparisons between two models without statistical analysis using 95% confidence intervals, making it difficult to generalize the results. We evaluated five models (four machine learning models and a conventional logistic regression model). LightGBM performed best in the derivation phase, with an AUROC significantly higher than those of the Artificial Neural Network and logistic regression but not significantly different from those of XGBoost and Random Forest. Previous studies evaluated various tabular data sets and showed that machine learning methods frequently outperformed logistic regression \u003csup\u003e31,32\u003c/sup\u003e. 
Recent studies have shown that tree-based machine learning models such as XGBoost, LightGBM, and Random Forest outperform Artificial Neural Networks in general tabular data prediction \u003csup\u003e33,34\u003c/sup\u003e. The superior performance of XGBoost, LightGBM, and Random Forest in predicting IgA nephropathy is consistent with these previous findings, underscoring the potential value of tree-based machine learning models for non-invasive diagnosis of IgAN. However, no significant differences in model performance were observed in the validation phase, indicating the necessity for further verification of model generalizability.\u003c/p\u003e \u003cp\u003eWe used the SHAP method to open the \"black box\" of XGBoost, LightGBM, and Random Forest, identifying age, albumin, IgA/C3, and Urine RBC as important predictor variables. This method is a widely used explanatory technique for interpreting the contribution of predictor variables to model outputs \u003csup\u003e28,35,36\u003c/sup\u003e. Previous studies have reported that the presence of microscopic hematuria and/or persistent proteinuria, IgA, and IgA/C3 are useful for distinguishing IgA nephropathy from other kidney diseases \u003csup\u003e7,8\u003c/sup\u003e. Other research using multivariate logistic regression suggested that age, IgA/C3, albumin, IgA, IgG, eGFR, and the presence of hematuria are independent predictive variables for IgAN \u003csup\u003e30\u003c/sup\u003e. The key predictor variables identified in our study are in line with these previous studies. We additionally visualized the relationships between each variable and the predictions of these models through the SHAP dependence plots, which pointed to various complex relationships. The finding that all three models showed similar results also suggests the importance and robustness of age, albumin, IgA/C3, and Urine RBC as predictor variables. 
These insights may improve our understanding of how these variables relate to IgAN.\u003c/p\u003e \u003cp\u003eOur findings have important clinical implications. First, simple, accurate, and non-invasive predictive models for IgAN can be developed using similar methods, with potential for clinical application. Second, our models employ variables that are routinely collected in clinical settings, meaning that their adoption requires no additional tests or financial costs beyond standard clinical care procedures. Third, identifying key variables and visualizing their relationships with IgAN predictions could provide new perspectives for distinguishing IgAN in clinical settings.\u003c/p\u003e \u003cp\u003eThis study has several limitations. First, it relied on data from a single center and lacked external validation at other institutions. Assessing our models' external validity in diverse patient groups remains essential. Second, the limited sample size, particularly in the validation phase, might lead to inadequate statistical power, so each model's performance estimates require careful interpretation. Third, our study cohort included all patients undergoing kidney biopsy rather than only those with specific clinical manifestations of IgAN such as chronic glomerulonephritis. This wide scope requires careful consideration before implementing our predictive models in clinical settings. Given these limitations, future studies should aim for broader validation and verification in multiple institutions to assess the generalizability and potential clinical utility of the models.\u003c/p\u003e \u003cp\u003eIn conclusion, this study demonstrated the utility of machine learning models using common clinical data in the diagnostic prediction of IgA nephropathy. 
The machine learning models (XGBoost, LightGBM, and Random Forest) showed higher diagnostic performance than a conventional statistical model and could capture complex relationships between predictors and predictions. These models may serve as non-invasive and reliable tools for predicting IgAN.\u003c/p\u003e"},{"header":"Declarations","content":"\u003cp\u003e\u003cstrong\u003eACKNOWLEDGEMENTS\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eWe extend our heartfelt gratitude to Ms. Yoshiko Ono and Ms. Mami Ohori for their significant contributions to the collection of patient data. Their dedication and efforts were instrumental in the advancement of our research. We wish to express our sincere gratitude to the Tateishi Science and Technology Foundation and the Nishikawa Medical Foundation for their generous support of our research. Their financial contributions enabled us to pursue this project and have made a significant impact on our ability to advance in our field.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAUTHOR CONTRIBUTIONS\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eR.N. designed the research plan and analyzed the data. R.N., D.I., and Y.S. participated in the writing of the paper. R.N., D.I., and Y.S. participated in the approval of the final manuscript.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eDATA AVAILABILITY\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe dataset cannot be disclosed as approval has not been received from the Ethics Committee of St. Marianna University Hospital. The code underlying this article will be shared on reasonable request to the corresponding author.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eCONFLICT OF INTEREST STATEMENT\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eR.N. 
was financially supported by the Tateishi Science and Technology Foundation (Grant ID: 2237009) and the Nishikawa Medical Foundation (Grant ID: 202201).\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eETHICS APPROVAL AND CONSENT TO PARTICIPATE\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe study was performed in accordance with the Declaration of Helsinki and Ethical Guidelines for Medical and Health Research Involving Human Subjects. The study was approved by the St. Marianna University Hospital Institutional Review Board (approval number: 6025) which allowed for analysis of patient-level data with a waiver of informed consent.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\n\u003cli\u003eChauveau, D. \u0026amp; Droz, D. Follow-up evaluation of the first patients with IgA nephropathy described at Necker Hospital. \u003cem\u003eContrib Nephrol\u003c/em\u003e \u003cstrong\u003e104\u003c/strong\u003e, 1\u0026ndash;5 (1993).\u003c/li\u003e\n\u003cli\u003eRovin, B. H. \u003cem\u003eet al.\u003c/em\u003e Executive summary of the KDIGO 2021 Guideline for the Management of Glomerular Diseases. \u003cem\u003eKidney Int\u003c/em\u003e \u003cstrong\u003e100\u003c/strong\u003e, 753\u0026ndash;779 (2021).\u003c/li\u003e\n\u003cli\u003eRodrigues, J. C., Haas, M. \u0026amp; Reich, H. N. IgA Nephropathy. \u003cem\u003eClin J Am Soc Nephrol\u003c/em\u003e \u003cstrong\u003e12\u003c/strong\u003e, 677\u0026ndash;686 (2017).\u003c/li\u003e\n\u003cli\u003eEiro, M., Katoh, T. \u0026amp; Watanabe, T. Risk factors for bleeding complications in percutaneous renal biopsy. \u003cem\u003eClin Exp Nephrol\u003c/em\u003e \u003cstrong\u003e9\u003c/strong\u003e, 40\u0026ndash;45 (2005).\u003c/li\u003e\n\u003cli\u003ePoggio, E. D. \u003cem\u003eet al.\u003c/em\u003e Systematic Review and Meta-Analysis of Native Kidney Biopsy Complications. 
\u003cem\u003eClin J Am Soc Nephrol\u003c/em\u003e \u003cstrong\u003e15\u003c/strong\u003e, 1595 (2020).\u003c/li\u003e\n\u003cli\u003eTomino, Y. \u003cem\u003eet al.\u003c/em\u003e Measurement of serum IgA and C3 may predict the diagnosis of patients with IgA nephropathy prior to renal biopsy. \u003cem\u003eJ Clin Lab Anal\u003c/em\u003e \u003cstrong\u003e14\u003c/strong\u003e, 220\u0026ndash;223 (2000).\u003c/li\u003e\n\u003cli\u003eMaeda, A. \u003cem\u003eet al.\u003c/em\u003e Significance of serum IgA levels and serum IgA/C3 ratio in diagnostic analysis of patients with IgA nephropathy. \u003cem\u003eJ Clin Lab Anal\u003c/em\u003e \u003cstrong\u003e17\u003c/strong\u003e, 73\u0026ndash;76 (2003).\u003c/li\u003e\n\u003cli\u003eNakayama, K. \u003cem\u003eet al.\u003c/em\u003e Prediction of diagnosis of immunoglobulin a nephropathy prior to renal biopsy and correlation with urinary sediment findings and prognostic grading. \u003cem\u003eJ Clin Lab Anal\u003c/em\u003e \u003cstrong\u003e22\u003c/strong\u003e, 114\u0026ndash;118 (2008).\u003c/li\u003e\n\u003cli\u003eKiryluk, K. \u003cem\u003eet al.\u003c/em\u003e Aberrant Glycosylation of IgA1 is Inherited in Pediatric IgA Nephropathy and Henoch-Sch\u0026ouml;nlein Purpura Nephritis. \u003cem\u003eKidney Int\u003c/em\u003e \u003cstrong\u003e80\u003c/strong\u003e, 79\u0026ndash;87 (2011).\u003c/li\u003e\n\u003cli\u003eMagistroni, R., D\u0026rsquo;Agati, V. D., Appel, G. B. \u0026amp; Kiryluk, K. New developments in the genetics, pathogenesis, and therapy of IgA nephropathy. \u003cem\u003eKidney Int\u003c/em\u003e \u003cstrong\u003e88\u003c/strong\u003e, 974\u0026ndash;989 (2015).\u003c/li\u003e\n\u003cli\u003eYanagawa, H. \u003cem\u003eet al.\u003c/em\u003e A Panel of Serum Biomarkers Differentiates IgA Nephropathy from Other Renal Diseases. \u003cem\u003ePLoS ONE\u003c/em\u003e \u003cstrong\u003e9\u003c/strong\u003e, e98081 (2014).\u003c/li\u003e\n\u003cli\u003eWong, J., Horwitz, M. M., Zhou, L. \u0026amp; Toh, S. 
Using machine learning to identify health outcomes from electronic health record data. \u003cem\u003eCurr Epidemiol Rep\u003c/em\u003e \u003cstrong\u003e5\u003c/strong\u003e, 331\u0026ndash;342 (2018).\u003c/li\u003e\n\u003cli\u003eHobensack, M., Song, J., Scharp, D., Bowles, K. H. \u0026amp; Topaz, M. Machine learning applied to electronic health record data in home healthcare: A scoping review. \u003cem\u003eInt J Med Inform\u003c/em\u003e \u003cstrong\u003e170\u003c/strong\u003e, 104978 (2023).\u003c/li\u003e\n\u003cli\u003eToma\u0026scaron;ev, N. \u003cem\u003eet al.\u003c/em\u003e A clinically applicable approach to continuous prediction of future acute kidney injury. \u003cem\u003eNature\u003c/em\u003e \u003cstrong\u003e572\u003c/strong\u003e, 116\u0026ndash;119 (2019).\u003c/li\u003e\n\u003cli\u003eKanda, E., Epureanu, B. I., Adachi, T. \u0026amp; Kashihara, N. Machine-learning-based Web system for the prediction of chronic kidney disease progression and mortality. \u003cem\u003ePLOS Digit Health\u003c/em\u003e \u003cstrong\u003e2\u003c/strong\u003e, e0000188 (2023).\u003c/li\u003e\n\u003cli\u003eLee, H. \u003cem\u003eet al.\u003c/em\u003e Deep Learning Model for Real-Time Prediction of Intradialytic Hypotension. \u003cem\u003eClin J Am Soc Nephrol\u003c/em\u003e \u003cstrong\u003e16\u003c/strong\u003e, 396 (2021).\u003c/li\u003e\n\u003cli\u003eJayapandian, C. P. \u003cem\u003eet al.\u003c/em\u003e Development and evaluation of deep learning\u0026ndash;based segmentation of histologic structures in the kidney cortex with multiple histologic stains. \u003cem\u003eKidney Int\u003c/em\u003e \u003cstrong\u003e99\u003c/strong\u003e, 86\u0026ndash;101 (2021).\u003c/li\u003e\n\u003cli\u003eGao, J. \u003cem\u003eet al.\u003c/em\u003e A novel differential diagnostic model based on multiple biological parameters for immunoglobulin A nephropathy. 
\u003cem\u003eBMC Med Inform Decis Mak\u003c/em\u003e \u003cstrong\u003e12\u003c/strong\u003e, 58 (2012).\u003c/li\u003e\n\u003cli\u003eHan, Q.-X. \u003cem\u003eet al.\u003c/em\u003e A non-invasive diagnostic model of immunoglobulin A nephropathy and serological markers for evaluating disease severity. \u003cem\u003eChin Med J\u003c/em\u003e \u003cstrong\u003e132\u003c/strong\u003e, 647 (2019).\u003c/li\u003e\n\u003cli\u003eGoldstein, B. A., Navar, A. M. \u0026amp; Carter, R. E. Moving beyond regression techniques in cardiovascular risk prediction: applying machine learning to address analytic challenges. \u003cem\u003eEur Heart J\u003c/em\u003e \u003cstrong\u003e38\u003c/strong\u003e, 1805\u0026ndash;1814 (2017).\u003c/li\u003e\n\u003cli\u003eCollins, G. S., Reitsma, J. B., Altman, D. G. \u0026amp; Moons, K. G. M. Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD): The TRIPOD Statement. \u003cem\u003eAnn Intern Med\u003c/em\u003e \u003cstrong\u003e162\u003c/strong\u003e, 55\u0026ndash;63 (2015).\u003c/li\u003e\n\u003cli\u003eLuo, W. \u003cem\u003eet al.\u003c/em\u003e Guidelines for Developing and Reporting Machine Learning Predictive Models in Biomedical Research: A Multidisciplinary View. \u003cem\u003eJ Med Internet Res\u003c/em\u003e \u003cstrong\u003e18\u003c/strong\u003e, e323 (2016).\u003c/li\u003e\n\u003cli\u003eChen, T. \u0026amp; Guestrin, C. XGBoost: A Scalable Tree Boosting System. in \u003cem\u003eProceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining\u003c/em\u003e 785\u0026ndash;794 (ACM, San Francisco California USA, 2016).\u003c/li\u003e\n\u003cli\u003eKe, G. \u003cem\u003eet al.\u003c/em\u003e LightGBM: A Highly Efficient Gradient Boosting Decision Tree. in \u003cem\u003eAdvances in Neural Information Processing Systems\u003c/em\u003e vol. 30 (Curran Associates, Inc., 2017).\u003c/li\u003e\n\u003cli\u003eBreiman, L. Random Forests. 
\u003cem\u003eMach Learn\u003c/em\u003e \u003cstrong\u003e45\u003c/strong\u003e, 5\u0026ndash;32 (2001).\u003c/li\u003e\n\u003cli\u003eJain, A. K., Mao, J. \u0026amp; Mohiuddin, K. M. Artificial neural networks: a tutorial. \u003cem\u003eComputer\u003c/em\u003e \u003cstrong\u003e29\u003c/strong\u003e, 31\u0026ndash;44 (1996).\u003c/li\u003e\n\u003cli\u003eCox, D. R. The Regression Analysis of Binary Sequences. \u003cem\u003eJ R Stat Soc Ser B\u003c/em\u003e \u003cstrong\u003e20\u003c/strong\u003e, 215\u0026ndash;242 (1958).\u003c/li\u003e\n\u003cli\u003eLundberg, S. M. \u0026amp; Lee, S.-I. A unified approach to interpreting model predictions. in \u003cem\u003eProceedings of the 31st International Conference on Neural Information Processing Systems\u003c/em\u003e 4768\u0026ndash;4777 (Curran Associates Inc., Red Hook, NY, USA, 2017).\u003c/li\u003e\n\u003cli\u003eDucher, M. \u003cem\u003eet al.\u003c/em\u003e Comparison of a Bayesian Network with a Logistic Regression Model to Forecast IgA Nephropathy. \u003cem\u003eBioMed Res Int\u003c/em\u003e \u003cstrong\u003e2013\u003c/strong\u003e, 1\u0026ndash;6 (2013).\u003c/li\u003e\n\u003cli\u003eHou, J., Fu, S., Wang, X., Liu, J. \u0026amp; Xu, Z. A noninvasive artificial neural network model to predict IgA nephropathy risk in Chinese population. \u003cem\u003eSci Rep\u003c/em\u003e \u003cstrong\u003e12\u003c/strong\u003e, 8296 (2022).\u003c/li\u003e\n\u003cli\u003eCaruana, R. \u0026amp; Niculescu-Mizil, A. An empirical comparison of supervised learning algorithms. in \u003cem\u003eProceedings of the 23rd international conference on Machine learning - ICML \u0026rsquo;06\u003c/em\u003e 161\u0026ndash;168 (ACM Press, Pittsburgh, Pennsylvania, 2006).\u003c/li\u003e\n\u003cli\u003eFern\u0026aacute;ndez-Delgado, M., Cernadas, E., Barro, S. \u0026amp; Amorim, D. Do we Need Hundreds of Classifiers to Solve Real World Classification Problems? 
\u003cem\u003eJ Mach Learn Res\u003c/em\u003e \u003cstrong\u003e15\u003c/strong\u003e, 3133\u0026ndash;3181 (2014).\u003c/li\u003e\n\u003cli\u003eBorisov, V. \u003cem\u003eet al.\u003c/em\u003e Deep Neural Networks and Tabular Data: A Survey. \u003cem\u003eIEEE Trans. Neural Netw. Learning Syst.\u003c/em\u003e 1\u0026ndash;21 (2022).\u003c/li\u003e\n\u003cli\u003eGrinsztajn, L., Oyallon, E. \u0026amp; Varoquaux, G. Why do tree-based models still outperform deep learning on typical tabular data? Preprint at https://doi.org/10.48550/arXiv.2207.08815 (2022).\u003c/li\u003e\n\u003cli\u003eLv, Z., Cui, F., Zou, Q., Zhang, L. \u0026amp; Xu, L. Anticancer peptides prediction with deep representation learning features. \u003cem\u003eBrief Bioinform\u003c/em\u003e \u003cstrong\u003e22\u003c/strong\u003e, bbab008 (2021).\u003c/li\u003e\n\u003cli\u003eThorsen-Meyer, H.-C. \u003cem\u003eet al.\u003c/em\u003e Dynamic and explainable machine learning prediction of mortality in patients in the intensive care unit: a retrospective study of high-frequency data in electronic patient records. 
\u003cem\u003eLancet Digit Health\u003c/em\u003e \u003cstrong\u003e2\u003c/strong\u003e, e179\u0026ndash;e191 (2020).\u003c/li\u003e\n\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":false,"hideJournal":false,"highlight":"","institution":"","isAcceptedByJournal":true,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"[email protected]","identity":"scientific-reports","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"scirep","sideBox":"Learn more about [Scientific Reports](http://www.nature.com/srep/)","snPcode":"","submissionUrl":"","title":"Scientific Reports","twitterHandle":"","acdcEnabled":true,"dfaEnabled":true,"editorialSystem":"stoa","reportingPortfolio":"Scientific Reports","inReviewEnabled":true,"inReviewRevisionsEnabled":true},"keywords":"IgA nephropathy, kidney biopsy, artificial intelligence, machine learning, glomerulonephritis","lastPublishedDoi":"10.21203/rs.3.rs-4203860/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-4203860/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003eIgA nephropathy progresses to kidney failure, making early detection important. However, definitive diagnosis depends on invasive kidney biopsy. This study aimed to develop non-invasive prediction models for IgA nephropathy using machine learning. We collected retrospective data on demographic characteristics, blood tests, and urine tests of the patients who underwent kidney biopsy. The dataset was divided into derivation and validation cohorts, with temporal validation. 
We employed four machine learning models\u0026mdash;eXtreme Gradient Boosting (XGBoost), LightGBM, Random Forest, and Artificial Neural Networks\u0026mdash;and logistic regression, evaluating performance via the area under the receiver operating characteristic curve (AUROC) and explored variable importance through SHapley Additive exPlanations method. The study included 1268 participants, with 353 (28%) diagnosed with IgA nephropathy. In the derivation cohort, LightGBM achieved the highest AUROC of 0.913 (95% CI 0.906\u0026ndash;0.917), significantly higher than logistic regression and Artificial Neural Network, not significantly different from XGBoost and Random Forest. In the validation cohort, XGBoost demonstrated the highest AUROC of 0.894 (95% CI 0.850\u0026ndash;0.935), maintaining its robust performance from the derivation phase. Key predictors identified were age, serum albumin, serum IgA/C3 ratio, and urine red blood cells, aligning with existing clinical insights. Machine learning can be a valuable non-invasive tool for IgA nephropathy.\u003c/p\u003e","manuscriptTitle":"Machine learning-based diagnostic prediction of IgA nephropathy: model development and validation study","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2024-04-12 18:26:07","doi":"10.21203/rs.3.rs-4203860/v1","editorialEvents":[{"type":"communityComments","content":0},{"type":"decision","content":"Revision 
requested","date":"2024-05-02T04:56:32+00:00","index":"","fulltext":""},{"type":"editorInvitedReview","content":"","date":"2024-04-21T15:07:17+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"d9b7c63f-9acf-49ef-b97f-9b6f39450d04","date":"2024-04-13T10:41:28+00:00","index":"hide","fulltext":""},{"type":"editorInvitedReview","content":"","date":"2024-04-12T10:42:49+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"151c8029-7633-47ba-b105-04e0a1a1c86d","date":"2024-04-12T07:03:57+00:00","index":"hide","fulltext":""},{"type":"reviewersInvited","content":"","date":"2024-04-11T14:28:02+00:00","index":"","fulltext":""},{"type":"editorAssigned","content":"","date":"2024-04-11T13:58:48+00:00","index":"","fulltext":""},{"type":"editorInvited","content":"","date":"2024-04-11T13:55:24+00:00","index":"","fulltext":""},{"type":"checksComplete","content":"","date":"2024-04-09T07:16:15+00:00","index":"","fulltext":""},{"type":"submitted","content":"Scientific Reports","date":"2024-04-02T04:51:11+00:00","index":"","fulltext":""}],"status":"published","journal":{"display":true,"email":"[email protected]","identity":"scientific-reports","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"scirep","sideBox":"Learn more about [Scientific Reports](http://www.nature.com/srep/)","snPcode":"","submissionUrl":"","title":"Scientific Reports","twitterHandle":"","acdcEnabled":true,"dfaEnabled":true,"editorialSystem":"stoa","reportingPortfolio":"Scientific Reports","inReviewEnabled":true,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"15d87857-22e8-47a6-91a1-4569a966565f","owner":[],"postedDate":"April 12th, 2024","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"under-review","subjectAreas":[{"id":30566083,"name":"Health sciences/Nephrology"},{"id":30566084,"name":"Health sciences/Nephrology/Kidney"},{"id":30566085,"name":"Health 
sciences/Nephrology/Kidney diseases"},{"id":30566086,"name":"Health sciences/Medical research/Biomarkers/Diagnostic markers"},{"id":30566087,"name":"Health sciences/Medical research/Biomarkers/Predictive markers"},{"id":30566088,"name":"Health sciences/Health care/Diagnosis"}],"tags":[],"updatedAt":"2024-05-28T05:14:16+00:00","versionOfRecord":[],"versionCreatedAt":"2024-04-12 18:26:07","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-4203860","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-4203860","identity":"rs-4203860","version":["v1"]},"buildId":"8U1c8b4HqxoKbykW_rLl7","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}


