A Machine Learning Framework for Genomic Prediction of Paratuberculosis Predisposition in Goats: Discrimination–Calibration Dissociation Across Learning Architectures | Research Square
Article
Yalçın YAMAN, Ahmet ESER, Devran COŞKUN, Ramazan AYMAZ, Yiğit Emir KİŞİ, and 6 more
This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-9421190/v1
This work is licensed under a CC BY 4.0 License
Status: Under Revision
Version 1 posted 18
Abstract
Genomic prediction of complex disease resistance demands frameworks that jointly optimise discriminative power and probabilistic calibration.
We systematically benchmark 14 predictive frameworks — spanning regularised linear models, GBLUP, kernel-based classifiers, tree-based ensembles, deep neural networks, and meta-ensemble strategies — for paratuberculosis predisposition classification in 474 goats representing seven indigenous Turkish breeds across 11 provinces. Mutual information-based feature selection distilled 44,375 quality-controlled SNPs into 5,000 maximally informative markers. Seven architecturally diverse frameworks converged to a statistically indistinguishable discrimination ceiling (mean AUC ≈ 0.982); GBLUP's competitive standing within this ceiling implicates predominantly additive genetic architecture consistent with the infinitesimal model framework. The central finding is a discrimination–calibration dissociation: ROC-AUC varied only 1.14-fold across all 14 models, whereas Brier score varied 4.85-fold (0.046–0.223), revealing that probabilistic fidelity diverges far beyond rank-ordering capacity. Post-hoc prevalence recalibration across the plausible range of MAP field deployment scenarios (π = 0.05–0.25) confirmed that this tier-level calibration ordering was fully preserved at all tested prevalences, including the estimated true seroprevalence of 14.07%, indicating that the dissociation reflects an architectural property of the models rather than an artefact of the balanced training design. Critically, tree-based ensembles exhibit architecturally heterogeneous failure modes — erratic fold-instability, systematic decision boundary asymmetry, and structural miscalibration — profiles with distinct remedial implications that composite ranking obscures. Cross-fold prediction stability emerged as a strong integrative proxy for both discrimination and calibration, exhibiting high correlations with both ROC-AUC (r = 0.924) and Brier score (r = − 0.847). The stacking ensemble achieved the highest composite performance (accuracy = 0.945; F1 = 0.947; Brier = 0.046). 
These findings suggest that genomic prediction model selection should be governed by calibration-aware, multi-dimensional evaluation frameworks.
Subject areas: Biological sciences/Computational biology and bioinformatics; Physical sciences/Mathematics and computing
Keywords: genomic prediction, paratuberculosis, machine learning, deep learning, discrimination–calibration dissociation, mutual information, model benchmarking, goat
Introduction
The challenge of predicting phenotypic outcomes from genomic variation has transformed both agricultural breeding and precision medicine, from genomic selection (GS) accelerating genetic improvement in livestock to polygenic risk scores (PRS) assessing disease susceptibility in humans [1, 2]. GS has revolutionised genetic improvement by leveraging dense genome-wide markers to predict breeding values and accelerate genetic progress, traditionally relying on parametric linear mixed models such as genomic best linear unbiased predictor (GBLUP) and Bayesian regression approaches [3–5]. A shared challenge across both paradigms is that conventional parametric models, while tractable under additive genetic architectures, may inadequately represent the non-linear, epistatic, and context-dependent relationships that characterise high-dimensional genomic data [6, 7]. Standard GBLUP and Bayesian regression implementations predominantly (and in most operational deployments exclusively) model additive SNP substitution effects, leaving dominance deviations and inter-locus epistatic interactions either unmodeled or absorbed into the residual variance [7].
This structural constraint carries predictive consequences that are architecture-dependent: systematic benchmarking across 14 prediction models demonstrates that parametric methods yield superior accuracy under purely additive gene action but are consistently outperformed by non-parametric alternatives when epistasis underlies phenotypic variation, with inadequate architectural representation producing measurable prediction bias [6]. Fitness-related traits — including immune responsiveness and infectious disease susceptibility — represent the class for which non-additive genetic variance is expected to be disproportionately relevant [7], directly motivating the evaluation of architecturally flexible alternatives in the present paratuberculosis resistance context. Critically, analogous constraints extend to human disease genetics, where polygenic risk scores derived from parametric linear frameworks demonstrate reduced predictive portability across ancestral groups and underestimate phenotypic variance attributable to non-additive and population-specific genetic variation, underscoring the need for architecturally flexible and ancestry-aware prediction frameworks broadly [8–11]. Classical genomic prediction models such as GBLUP and Bayesian regression primarily capture additive genetic effects, limiting their ability to represent the full biological complexity of high-dimensional data [3, 12, 13]. These formulations often overlook non-linear relationships, dominance, epistasis, and higher-order interactions pervasive in complex traits [3, 13, 14]. Machine learning (ML) and deep learning (DL) approaches have emerged as flexible, model-free alternatives capable of learning genotype–phenotype mappings without rigid assumptions about the underlying genetic architecture [13, 15, 16], with particular advantages in capturing non-linear and high-order effects and integrating diverse data modalities [12, 17, 18]. 
However, these gains come at the cost of increased computational demands, reduced interpretability, and greater sensitivity to data quality and sample size [3, 12]. Importantly, emerging comparative benchmarks across binary disease and health trait prediction contexts — including populations of moderate size where additive genetic architecture predominates — have repeatedly found GBLUP to perform indistinguishably from structurally more complex ML architectures such as RF, SVM, XGBoost, and MLP [19], raising fundamental questions about when model complexity translates into practical predictive gain. Paratuberculosis (PTB), or Johne’s disease, is a pervasive, incurable chronic granulomatous enteritis affecting ruminants worldwide [20]. Its etiological agent, Mycobacterium avium subsp. paratuberculosis (MAP), causes substantial economic losses through decreased milk yield, impaired fertility, and premature culling [21, 22]. Beyond its agricultural impact, MAP is recognised as a growing zoonotic threat with epidemiological links to Crohn’s disease and autoimmune conditions in humans [23, 24], facilitated by MAP’s ability to survive pasteurisation and persist in the environment [25, 26]. Elucidating the genetic architecture of host resistance is therefore imperative for addressing this “One Health” challenge [23]. Paratuberculosis resistance is polygenic, with host genotypes significantly determining MAP infection trajectories [27]. In dairy cattle, heritability estimates range from 0.03 to 0.27, and GWAS have identified QTLs across nearly all bovine chromosomes, with robust signals on BTA23 (MHC region), BTA3, and BTA5 [28, 29]. Candidate genes including ATG4D (autophagy–MAP clearance) and LRP1 (inflammatory modulation) have been characterised, and expression QTL analyses have identified regulatory variants influencing macrophage activation [29–31]. In small ruminants, goats exhibit higher clinical susceptibility than sheep [32]. 
A significant QTL on OAR20 explains up to 18% of the genetic variance in antibody response in sheep [20]; targeted studies have identified SLC11A1 microsatellite variants [33, 34] and a TLR2 mutation conferring 6.6-fold increased resistance in Turkish sheep [35]. In goats, polymorphisms within the SLC11A1 B7 allele remain pivotal determinants of infection risk [33]. Here, we systematically benchmarked 14 genomic prediction frameworks for paratuberculosis resistance classification in goats — a trait with complex polygenic architecture and significant agricultural and public health implications. Using the Illumina 65K Goat BeadChip, we applied mutual information-based dimensionality reduction to extract a refined 5,000-SNP feature set — an information-theoretic selection strategy whose utility for livestock genomic prediction has been independently validated [36] — then subjected this substrate to comparative evaluation spanning parametric linear models, GBLUP, kernel-based classifiers, tree-based ensembles, deep neural networks, and meta-ensemble strategies across 474 animals from seven indigenous Turkish goat breeds. Through repeated cross-validation and comprehensive statistical testing incorporating both discriminative and calibration-aware metrics [37], we dissect the relative merits of model-free flexibility versus parametric parsimony, revealing which algorithmic strategies most effectively capture genetic signatures of host–pathogen interactions and providing a methodological template applicable to other polygenic disease traits in livestock [12, 13, 38–42]. Methods Study population and phenotype data Sampling was conducted across 36 herds in 11 provinces of Turkey, encompassing seven indigenous goat breeds: Hairy, Honamlı, Damascus, Angora, Kilis, Turkish Saanen, and Maltese. A total of 3,069 animals were screened, yielding a mean MAP seroprevalence of 14.07%. 
From this population, 474 animals were selected for genotyping with a balanced case–control composition of 237 seropositive cases and 237 seronegative controls, stratified to ensure representation of all seven breeds. Disease status was determined using a commercially validated ELISA assay for anti-Mycobacterium avium subsp. paratuberculosis antibodies. Phenotype data were encoded as binary outcomes (0 for controls, 1 for cases) for compatibility with classification algorithms and probabilistic modelling frameworks. Genotyping and quality control Genomic DNA was genotyped using the Illumina 65K Goat BeadChip, interrogating approximately 65,000 SNPs distributed across the caprine genome. Quality control was implemented using PLINK v1.9 [43]. Sex chromosome markers and SNPs lacking chromosome assignments or physical positions were excluded. Stringent filtering thresholds were applied: call rate < 95% (SNP-level), call rate < 90% (individual-level), minor allele frequency < 0.05, and Hardy–Weinberg equilibrium deviation at p < 5×10⁻⁵. Following filtering, 44,375 high-quality SNPs were retained. Residual missing genotypes were imputed using Beagle v5.0 [44] on a per-chromosome basis, and LD pruning was applied in PLINK v1.9 [43] to mitigate multicollinearity. Feature preselection via mutual information Mutual information (MI) was implemented as a filter-based feature preselection method to quantify the statistical dependence between individual SNP genotypes and a binary phenotype. Unlike correlation-based measures, MI captures both linear and non-linear associations and remains invariant to monotonic transformations [45–47]. MI was estimated with a non-parametric kNN estimator, a distribution-free approach suited to discrete–discrete genotype–phenotype relationships [45, 47–49], demonstrating robustness against model misspecification and nonlinearity in high-dimensional genomic contexts [47, 50, 51]. 
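As a rough illustration of MI-based marker ranking, the sketch below scores each SNP against a binary phenotype and keeps the top-k columns. It is not the authors' implementation: the study reports a kNN estimator, whereas this sketch swaps in a simple count-based (plug-in) estimator, which is a natural choice for discrete genotype codes. The function names `mutual_information` and `top_k_snps` and the toy data are hypothetical. To mirror a leakage-free design, the rows passed in would be those of a single training partition.

```python
import math
import random
from collections import Counter


def mutual_information(genotypes, phenotypes):
    """Plug-in mutual information (in nats) between two discrete sequences."""
    n = len(genotypes)
    joint = Counter(zip(genotypes, phenotypes))
    px = Counter(genotypes)
    py = Counter(phenotypes)
    mi = 0.0
    for (x, y), count in joint.items():
        # p(x, y) * log(p(x, y) / (p(x) * p(y))), expressed with raw counts
        mi += (count / n) * math.log(count * n / (px[x] * py[y]))
    return mi


def top_k_snps(genotype_rows, phenotypes, k):
    """Score every SNP column and return (indices of the k best, all scores)."""
    n_snps = len(genotype_rows[0])
    scores = [
        mutual_information([row[j] for row in genotype_rows], phenotypes)
        for j in range(n_snps)
    ]
    ranked = sorted(range(n_snps), key=lambda j: scores[j], reverse=True)
    return ranked[:k], scores


# Toy data (hypothetical): 200 animals, 50 SNPs coded 0/1/2.
rng = random.Random(0)
y = [i % 2 for i in range(200)]
X = [[rng.choice([0, 1, 2]) for _ in range(50)] for _ in y]
for row, label in zip(X, y):
    row[0] = 2 if label == 1 else 0  # make SNP 0 perfectly informative

selected, mi_scores = top_k_snps(X, y, k=5)
```

In this toy setting the perfectly informative marker reaches the upper bound H(Y) = ln 2 and is ranked first. In a scikit-learn workflow, `sklearn.feature_selection.mutual_info_classif` provides a nearest-neighbour-based estimator; whether the study used that function is not stated in the text.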
MI-based preselection concentrates computational effort on the most informative SNP subsets, enabling more reliable detection of non-additive effects where relevant [49, 51]. To prevent information leakage, MI ranking was performed strictly within each training partition. For every fold, MI scores were computed from the training subset only, and the top 5,000 markers — approximately 11.3% of the genome-wide panel — were retained for model training and applied unchanged to the corresponding validation set. Nesting feature selection within cross-validation preserved the integrity of out-of-sample performance evaluation across all predictive frameworks. Machine learning framework Fourteen classification algorithms were implemented across five methodological families: (i) parametric linear models — Ridge Regression (L2), LASSO (L1), Elastic Net (L1/L2), and GBLUP, which fits a linear mixed model using a genomic relationship matrix derived from SNP markers; (ii) kernel-based classifiers — SVC-Linear and SVC-RBF; (iii) tree-based ensembles — Random Forest, Gradient Boosting, AdaBoost (SAMME.R), XGBoost, and LightGBM; (iv) deep learning — an MLP with batch normalisation and dropout regularisation, and a CNN with one-dimensional convolutional layers; (v) meta-learning — a Stacking ensemble implemented via scikit-learn's StackingClassifier with three heterogeneous base learners (SVC-RBF, Random Forest, Logistic Regression) whose out-of-fold predicted probabilities served as the input matrix for a Logistic Regression meta-learner. Meta-learner regularisation (C = 1/α; α ∈ [0.001, 500]) was optimised via Bayesian search. Out-of-fold training of base learners prevents information leakage from training into meta-learning. 
Hyperparameter optimisation Bayesian hyperparameter optimisation was conducted using BayesSearchCV (scikit-optimize) for all models except PyTorch-based architectures, with a composite scoring function combining ROC-AUC (60% weight) and accuracy (40%), inner 5-fold stratified cross-validation, and 120 iterations per model. Regularised linear classifiers searched regularisation strength across five orders of magnitude; SVCs searched regularisation strength and kernel bandwidth; tree-based models searched ensemble size, learning rate, tree depth, subsampling, and regularisation parameters; the MLP optimised 13 hyperparameters including hidden layer configuration, activation function, learning rate schedule, and early stopping patience. For GBLUP, the search space was deliberately broad — encompassing kernel type (linear, RBF, polynomial), optional eigendecomposition-based dimensionality reduction, and probability calibration method — to allow data-driven selection of the optimal covariance structure rather than imposing additive assumptions a priori. Bayesian optimisation consistently converged on the linear kernel without dimensionality reduction, indicating that the additive genomic relationship matrix provided the most effective covariance representation for this trait–sample configuration. The final GBLUP model operated as a conventional additive genomic prediction framework with Platt-scaled probability calibration — an outcome that empirically corroborates the additive convergence interpretation reported in the Discussion. For CNN and MLP, incompatibility with BayesSearchCV necessitated a three-stage hierarchical grid search: coarse sampling, refinement around the optimum, and full-resolution search within a narrow neighbourhood of the stage-2 solution. Cross-validation strategy Model performance was evaluated under repeated stratified k-fold cross-validation (k = 5, 10 independent repetitions). 
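The repeated stratified scheme (k = 5, 10 repetitions) can be sketched in plain Python as follows. This is an illustrative reimplementation under stated assumptions, not the code used in the study: the function name `repeated_stratified_kfold`, the seed handling, and the round-robin dealing of class members are all hypothetical choices.

```python
import random
from collections import defaultdict


def repeated_stratified_kfold(labels, k=5, repeats=10, seed=0):
    """Yield (train_idx, val_idx) pairs; every repetition reshuffles the data."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    for rep in range(repeats):
        rng = random.Random(seed + rep)  # a different seed per repetition
        folds = [[] for _ in range(k)]
        for members in by_class.values():
            shuffled = members[:]
            rng.shuffle(shuffled)
            # Deal class members round-robin so each fold keeps the class ratio.
            for pos, idx in enumerate(shuffled):
                folds[pos % k].append(idx)
        for f in range(k):
            val = sorted(folds[f])
            train = sorted(i for g in range(k) if g != f for i in folds[g])
            yield train, val


# Balanced design matching the study: 237 cases and 237 controls.
labels = [1] * 237 + [0] * 237
splits = list(repeated_stratified_kfold(labels))
```

With 474 balanced samples this produces 50 train/validation pairs, each validation fold holding 94 or 96 animals with equal numbers of cases and controls. In scikit-learn, `RepeatedStratifiedKFold(n_splits=5, n_repeats=10)` offers the same kind of splitter.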
In each repetition, data were reshuffled with a different random seed and a new 5-fold split generated; each fold served as the validation set once per repetition. Models were trained on approximately 80% of the data (n ≈ 379) and validated on the remaining 20% (n ≈ 95). All reported performance metrics represent means across 10 × 5 = 50 validation folds. Goodness-of-Fit composite scoring (GoF) An integrated model ranking was derived by computing a composite GoF score for each framework. For every test metric, observed mean values were min-max normalised across all 14 models to a [0, 1] interval; for Brier Score, normalisation was inverted so that lower raw values yielded higher normalised scores. The GoF score was defined as the unweighted mean of all eight per-metric normalised values (ROC-AUC, Accuracy, Sensitivity, Specificity, Precision, F1 Score, Persistence, Brier Score), yielding a single scalar in [0, 1] that jointly rewards discrimination, calibration, and cross-fold stability without privileging any individual metric. Within this formulation, calibration is represented by Brier Score alone among eight equal-weight components (effective calibration weight: 12.5%); users with calibration-dominant decision objectives should consider explicitly reweighted composites. All composite scoring and visualisation were implemented using the Goodness-of-Fit Analyzer via a custom Python script. Post-hoc prevalence recalibration To assess the generalisability of calibration findings to field deployment conditions, Albert offset recalibration was applied post-hoc to all fourteen models. 
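A minimal sketch of this kind of prevalence recalibration is given below, assuming the standard logit-intercept shift Δ = logit(π_field) − logit(π_train) with π_train = 0.50 for the balanced design. The helper names (`logit`, `sigmoid`, `recalibrate`), the clipping constant, and the example probabilities are hypothetical.

```python
import math


def logit(p):
    return math.log(p / (1.0 - p))


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def recalibrate(p, pi_field, pi_train=0.5, eps=1e-12):
    """Shift a predicted probability from training to field prevalence."""
    p = min(max(p, eps), 1.0 - eps)  # guard against log(0) for hard 0/1 outputs
    delta = logit(pi_field) - logit(pi_train)
    return sigmoid(logit(p) + delta)


# Target prevalence scenarios named in the Methods.
scenarios = [0.25, 0.1407, 0.10, 0.05]
raw = [0.10, 0.50, 0.90]  # hypothetical model outputs
shifted = {pi: [recalibrate(p, pi) for p in raw] for pi in scenarios}
```

Because the shift adds the same constant to every logit, the ordering of predictions is unchanged, which is why ROC-AUC is expected to be invariant under this transformation while Brier score and ECE can change. A prediction of 0.50 under the balanced design maps exactly onto the target prevalence.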
The logit-scale intercept was shifted by Δ = log[π_field/(1 − π_field)] − log[π_train/(1 − π_train)], where π_train = 0.50 reflects the balanced 1:1 case-control training design, across four target prevalence scenarios: π = 0.25, π = 0.1407 (the estimated true MAP seroprevalence of the study population), π = 0.10, and π = 0.05, collectively spanning the plausible range of MAP seroprevalence in Turkish goat populations. For each scenario, Brier score and Expected Calibration Error (ECE) were computed on the recalibrated probability outputs. ROC-AUC was additionally recorded across all scenarios to verify the rank-preserving property of the logit-intercept transformation.
Computational implementation
All analyses were implemented in Python 3.8+ using scikit-learn (v1.3+), scikit-optimize, PyTorch (v2.0+), XGBoost (v2.0+), LightGBM (v4.0+), NumPy, SciPy, pandas, and Matplotlib. All BayesSearchCV optimisations were executed with n_jobs = -1, leveraging full multi-core parallelisation.
Results
Dimensionality reduction via mutual information
Mutual information (MI) analysis applied to 44,375 quality-controlled SNP markers yielded a markedly right-skewed, heavy-tailed discriminatory landscape, with the majority of markers concentrated in a low-MI mode approaching zero and a minority occupying a progressively informative upper tail, a structure consistent with a polygenic phenotypic architecture wherein variance is distributed across numerous markers each contributing individually small incremental effects. The top 5,000 SNPs, constituting 11.3% of the quality-controlled panel, were selected as the prediction feature space, reducing the genotype matrix from n = 474 × p = 44,375 to n = 474 × p = 5,000 (Fig. 1).
This information-theoretic selection strategy, which removes uninformative markers while retaining those carrying genuine phenotypic signal, is consistent with recent evidence demonstrating that MI-based SNP filtering substantially improves downstream genomic prediction accuracy relative to whole-panel approaches in livestock [36]. MI-based ranking was computed exclusively within each training partition under the nested cross-validation strategy, ensuring comparability of model evaluations under a common feature space. Violin plot depicting the MI score distribution across 44,375 genome-wide markers. The top 5,000 SNPs occupying the high-MI tail were retained for subsequent analyses. MI quantifies the reduction in uncertainty about the binary phenotype given SNP genotype, ranging from 0 (statistical independence) to min[H(X), H(Y)] (perfect dependence). Model performance and three-tier hierarchy Fourteen machine learning models were evaluated across eight performance metrics under repeated stratified cross-validation (Table 1 ), revealing a three-tier performance hierarchy rendered simultaneously in raw values and normalised perspective in Fig. 2 . The top tier — Stacking, LASSO, ElasticNet, SVC-RBF, Ridge, GBLUP, and SVC-Linear — achieved ROC-AUC values exceeding 0.970, with six regularised and kernel models spanning a compressed AUC range of only 0.0015. Within this tier, Stacking led across all metrics except ROC-AUC: accuracy (0.9453), sensitivity (0.9382), specificity (0.9539), F1 score (0.9471), persistence (0.9041), and Brier score (0.0460) — the lowest in the study. This composite superiority of meta-ensemble learning is consistent with recent livestock genomic prediction benchmarks demonstrating that heterogeneous stacking frameworks integrating SNP feature selection with diverse base learners consistently outperform individual models across independent validation populations [36, 38]. 
GBLUP's competitive standing within this upper performance tier, comparable to regularised linear and kernel-based models despite its parametric simplicity, aligns with recent systematic comparisons for binary health trait prediction demonstrating that GBLUP performs indistinguishably from structurally more complex ML architectures — including RF, SVM, XGBoost, and MLP — in populations of moderate size [19]. The intermediate band — PyTorchCNN (AUC: 0.9264; Brier: 0.0747) and PyTorchMLP (AUC: 0.9107; Brier: 0.0895) — was distinguished by declining GoF scores alongside meaningfully wider standard deviations, particularly in persistence (PyTorchCNN SD: 0.0474; PyTorchMLP SD: 0.0389). The bottom five tree-based models displayed progressively declining GoF scores with two structurally distinct failure signatures: XGBoost paired a comparatively high sensitivity (0.8616) with the lowest specificity in the table (0.7115, Δ = 0.1501) — the sharpest intra-row metric divergence across all fourteen models — while RandomForest combined a competitive ROC-AUC (0.9552) with the highest Brier score (0.2229), numerically encapsulating the discrimination–calibration dissociation that characterises tree-based ensembles. GradientBoosting recorded the most uniformly depressed values, with accuracy (0.7789), specificity (0.7666), F1 score (0.7986), and persistence (0.6153) all among the lowest in the comparison, and the Persistence column tracked the GoF ranking with particularly high fidelity across all three tiers. Table 1 Predictive performance of fourteen machine learning models evaluated across eight classification metrics under repeated stratified cross-validation. 
| Model | ROC-AUC | Accuracy | Sensitivity | Specificity | Precision | F1 Score | Persistence | Brier Score |
|---|---|---|---|---|---|---|---|---|
| Stacking | 0.9770 ± 0.0164 | 0.9453 ± 0.0199 | 0.9382 ± 0.0250 | 0.9539 ± 0.0343 | 0.9571 ± 0.0315 | 0.9471 ± 0.0200 | 0.9041 ± 0.0284 | 0.0460 ± 0.0133 |
| LASSO | 0.9823 ± 0.0079 | 0.9337 ± 0.0142 | 0.9321 ± 0.0278 | 0.9340 ± 0.0215 | 0.9419 ± 0.0194 | 0.9366 ± 0.0154 | 0.8950 ± 0.0216 | 0.0498 ± 0.0096 |
| ElasticNet | 0.9821 ± 0.0081 | 0.9337 ± 0.0142 | 0.9321 ± 0.0243 | 0.9340 ± 0.0215 | 0.9418 ± 0.0196 | 0.9366 ± 0.0150 | 0.8948 ± 0.0214 | 0.0499 ± 0.0094 |
| SVC-RBF | 0.9808 ± 0.0081 | 0.9326 ± 0.0164 | 0.9294 ± 0.0297 | 0.9379 ± 0.0249 | 0.9431 ± 0.0264 | 0.9357 ± 0.0166 | 0.8934 ± 0.0204 | 0.0505 ± 0.0094 |
| Ridge | 0.9815 ± 0.0077 | 0.9337 ± 0.0170 | 0.9299 ± 0.0269 | 0.9366 ± 0.0216 | 0.9433 ± 0.0210 | 0.9363 ± 0.0185 | 0.8879 ± 0.0194 | 0.0594 ± 0.0127 |
| GBLUP | 0.9813 ± 0.0085 | 0.9326 ± 0.0184 | 0.9228 ± 0.0248 | 0.9453 ± 0.0290 | 0.9481 ± 0.0308 | 0.9349 ± 0.0201 | 0.8851 ± 0.0185 | 0.0634 ± 0.0074 |
| SVC-Linear | 0.9822 ± 0.0072 | 0.9284 ± 0.0193 | 0.9211 ± 0.0349 | 0.9379 ± 0.0249 | 0.9425 ± 0.0272 | 0.9311 ± 0.0205 | 0.8916 ± 0.0224 | 0.0513 ± 0.0103 |
| PyTorchCNN | 0.9264 ± 0.0229 | 0.9253 ± 0.0237 | 0.9114 ± 0.0445 | 0.9413 ± 0.0295 | 0.9465 ± 0.0265 | 0.9278 ± 0.0249 | 0.8512 ± 0.0474 | 0.0747 ± 0.0237 |
| PyTorchMLP | 0.9107 ± 0.0202 | 0.9105 ± 0.0201 | 0.9085 ± 0.0350 | 0.9128 ± 0.0346 | 0.9213 ± 0.0327 | 0.9142 ± 0.0229 | 0.8205 ± 0.0389 | 0.0895 ± 0.0201 |
| RandomForest | 0.9552 ± 0.0172 | 0.8926 ± 0.0234 | 0.8784 ± 0.0450 | 0.9145 ± 0.0383 | 0.9190 ± 0.0405 | 0.8967 ± 0.0214 | 0.7845 ± 0.0278 | 0.2229 ± 0.0033 |
| AdaBoost | 0.9312 ± 0.0281 | 0.8537 ± 0.0406 | 0.8281 ± 0.0683 | 0.8863 ± 0.0557 | 0.8923 ± 0.0530 | 0.8566 ± 0.0411 | 0.7300 ± 0.0455 | 0.2180 ± 0.0034 |
| LightGBM | 0.8964 ± 0.0256 | 0.8337 ± 0.0342 | 0.8130 ± 0.0577 | 0.8597 ± 0.0569 | 0.8712 ± 0.0454 | 0.8387 ± 0.0275 | 0.6847 ± 0.0553 | 0.1582 ± 0.0377 |
| XGBoost | 0.8747 ± 0.0266 | 0.7947 ± 0.0276 | 0.8616 ± 0.0607 | 0.7115 ± 0.1052 | 0.7806 ± 0.0508 | 0.8159 ± 0.0246 | 0.6473 ± 0.0491 | 0.1568 ± 0.0172 |
| GradientBoosting | 0.8649 ± 0.1116 | 0.7789 ± 0.1189 | 0.8100 ± 0.0838 | 0.7666 ± 0.2235 | 0.8106 ± 0.1603 | 0.7986 ± 0.0977 | 0.6153 ± 0.1915 | 0.2040 ± 0.0285 |

Values represent mean ± standard deviation computed over repeated stratified cross-validation folds. ROC-AUC: area under the receiver operating characteristic curve; Accuracy: proportion of correctly classified observations; Sensitivity: true positive rate; Specificity: true negative rate; Precision: positive predictive value; F1 Score: harmonic mean of precision and sensitivity; Persistence: cross-fold prediction stability index; Brier Score: mean squared difference between predicted probabilities and observed outcomes (lower values indicate better probabilistic calibration). Models are ranked in descending order of composite GoF score.
Cell values represent mean test performance computed over repeated stratified cross-validation folds. Cell colour encodes the min-max normalised score for each metric column independently, ranging from 0.0 (red; worst observed value) to 1.0 (green; best observed value); for Brier Score, normalisation is inverted such that lower raw values receive higher normalised scores.
Calibration analysis
Calibration curves revealed a systematic and architecturally structured divergence between predicted probabilities and observed outcomes (Fig. 3). LASSO, ElasticNet, and Ridge produced trajectories closely approximating the perfect calibration diagonal (Brier scores: 0.050, 0.050, 0.059); GBLUP and both SVCs exhibited comparable calibration fidelity (Brier: 0.051–0.063). Stacking achieved the lowest Brier score across all architectures (0.046) alongside a top-tier ROC-AUC of 0.977, pairing the smallest probabilistic error in the comparison with near-ceiling discrimination (its AUC was marginally below the 0.981–0.982 of the other top-tier models). The theoretical basis for interpreting this joint optimisation resides in the formal decomposability of the Brier score into separable discrimination and calibration components [37, 52].
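To make the idea of a decomposable Brier score concrete, the sketch below uses one standard form, the Murphy decomposition: Brier = reliability − resolution + uncertainty. Reliability is the calibration term, and resolution reflects how well forecasts separate outcomes. The text does not state which decomposition references [37, 52] use, so this choice is an assumption, and the function names and toy forecasts are hypothetical.

```python
from collections import defaultdict


def brier_score(probs, outcomes):
    """Mean squared difference between predicted probabilities and outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def murphy_decomposition(probs, outcomes):
    """Return (reliability, resolution, uncertainty), grouping by forecast value."""
    n = len(probs)
    base_rate = sum(outcomes) / n
    groups = defaultdict(list)
    for p, y in zip(probs, outcomes):
        groups[p].append(y)
    # Reliability: gap between each forecast value and its observed event rate.
    reliability = sum(
        len(ys) * (p - sum(ys) / len(ys)) ** 2 for p, ys in groups.items()
    ) / n
    # Resolution: spread of group event rates around the overall base rate.
    resolution = sum(
        len(ys) * (sum(ys) / len(ys) - base_rate) ** 2 for ys in groups.values()
    ) / n
    uncertainty = base_rate * (1.0 - base_rate)
    return reliability, resolution, uncertainty


# Toy forecasts (hypothetical) taking three distinct probability values.
probs = [0.1, 0.1, 0.1, 0.1, 0.5, 0.5, 0.9, 0.9, 0.9, 0.9]
outcomes = [0, 0, 0, 1, 0, 1, 1, 1, 1, 0]
rel, res, unc = murphy_decomposition(probs, outcomes)
bs = brier_score(probs, outcomes)
```

Grouping by distinct forecast value makes the identity exact. With continuous model outputs one would bin the forecasts instead, in which case the identity holds only approximately.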
Architectures that independently minimise both components achieve composite superiority over those that attain high discrimination at the expense of probabilistic fidelity, a distinction that aggregate ranking metrics systematically obscure. The discrimination–calibration dissociation was most pronounced in tree-based ensembles: RandomForest attained ROC-AUC = 0.955 yet returned Brier = 0.223, while XGBoost, LightGBM, AdaBoost, and GradientBoosting combined moderate-to-high discrimination with markedly degraded calibration (Brier: 0.157–0.218), exhibiting characteristic sigmoid distortions — systematic overconfidence at intermediate probability values and underestimation at the tails — consistent across replicates. Deep learning architectures produced irregular, high-variance calibration trajectories reflecting stochastic optimisation instability under current sample size constraints. Each panel displays the relationship between mean predicted probability (x-axis) and observed event rate (y-axis) estimated across repeated stratified cross-validation folds; the dashed diagonal represents perfect calibration. Generalisation and overfitting Figure 4 presents a train-test generalisation map across eight metrics, with the identity diagonal serving as the zero-generalisation-gap reference. Top-tier models clustered tightly near the diagonal at high performance values, while GradientBoosting, XGBoost, LightGBM, and AdaBoost were consistently positioned below the diagonal across multiple panels — with GradientBoosting recording the most extreme train-test separation in both the ROC-AUC and Persistence panels, reaching a training persistence value near 1.0 against a test value of approximately 0.6. In the Specificity panel, a distinct pattern emerged: ElasticNet, SVC-Linear, Ridge, LASSO, and GBLUP sat above the diagonal, whereas XGBoost occupied a markedly below-diagonal position with a visually prominent train-test gap. 
Each panel plots the mean training performance (x-axis) against the mean test performance (y-axis) computed over repeated stratified cross-validation folds for a given metric. The dashed diagonal line represents the identity (zero generalisation gap); points above the diagonal indicate higher mean performance on test folds than on training folds, and points below the diagonal indicate higher mean training than test performance, consistent with overfitting. For Brier Score, lower values indicate better probabilistic calibration; axis orientation follows the raw metric scale without inversion. Model identities are indicated by colour as shown in the legend. Cross-validation stability Figures 5 and 6 jointly characterise fold-to-fold stability. GradientBoosting was the most unstable model by a substantial margin, recording CV% values of 29.2% (specificity) and 31.1% (persistence), and the highest absolute standard deviations across most metrics (mean SD = 0.1270); XGBoost ranked second most unstable (mean SD = 0.0452). Strikingly, the Brier Score column inverted the stability gradient observed in all other columns: RandomForest (1.5%) and AdaBoost (1.6%) displayed apparent maximum Brier stability — reflecting structurally consistent miscalibration rather than genuine calibration quality. The ROC-AUC column was the most uniformly stable metric, with CV% values as low as 0.7% for SVC-Linear across the top tier. ElasticNet and LASSO shared the lowest mean SD (0.0167 and 0.0172 respectively), confirming the stability dominance of the linear and kernel tier. CV% values represent the coefficient of variation computed as (standard deviation / mean) × 100 for each model-metric combination across repeated stratified cross-validation folds. Lower values indicate greater fold-to-fold stability. Colour scale uses an inverted RdYlGn mapping where green denotes lower (more stable) CV% and red denotes higher (less stable) CV%. 
Models are sorted in descending order of composite GoF score. Values represent the absolute cross-validation standard deviation per model-metric combination. Joint discrimination–calibration positioning Figure 7 maps all fourteen models onto the Brier Score–ROC-AUC two-dimensional performance space. The top-tier models formed a compact cluster in the upper-left ideal region (Brier ≈ 0.05, ROC-AUC ≈ 0.977–0.982), with Stacking displaced leftward to the lowest Brier position in the comparison (0.046) and GBLUP displaced rightward to the highest Brier within the tier (0.063) — exhibiting the most pronounced calibration–discrimination dissociation within the top cluster. RandomForest and AdaBoost occupied a distinctive upper-right position — high ROC-AUC combined with high Brier — while XGBoost, LightGBM, and GradientBoosting were displaced both rightward and downward. Figure 8 extends this analysis by examining the relationship between persistence and both primary performance indices: persistence exhibited a strong positive correlation with ROC-AUC (r = 0.924, p < 0.001) and a strong negative correlation with Brier Score (r = − 0.847, p < 0.001), confirming persistence as a valid surrogate index for both discrimination and calibration quality. The x-axis represents the mean Brier Score and the y-axis the mean ROC-AUC, both computed over repeated stratified cross-validation folds. Error bars denote ± 1 standard deviation on each axis. The ideal model position is the upper-left corner, corresponding to simultaneously low Brier Score and high ROC-AUC. Models are identified by colour as shown in the legend. Pearson correlation coefficients (r) and two-sided p-values were computed across the fourteen model means. Persistence is defined as a custom cross-fold prediction stability index; lower Brier Score values indicate better probabilistic calibration. Both axes represent test-set means computed over repeated stratified cross-validation folds. 
Model identities are indicated by colour and label.

Error profile analysis

The confusion matrices revealed three structurally distinct error profiles (Fig. 9). Stacking achieved the highest TN rate (95.4% ± 3.4%) alongside TP of 93.8% ± 2.5%, while LASSO and ElasticNet showed the most symmetric FP–FN distribution (difference of only 0.2 percentage points). GBLUP and SVC-Linear displayed a consistent FN > FP asymmetry, with FN rates of 7.7% and 7.9% respectively exceeding their FP rates, confirming the sensitivity–specificity asymmetry identified in the performance table. XGBoost presented the most visually distinctive matrix in the figure: FP = 28.8% ± 10.5% against TN = 71.2% ± 10.5%, directly encoding its sensitivity–specificity divergence. GradientBoosting exhibited the most structurally unstable error behaviour, with FP = 23.3% ± 22.4% (a standard deviation nearly equalling the mean), indicating that false positive behaviour was not merely elevated but structurally unstable across folds.

Each matrix displays the mean true negative (TN), false positive (FP), false negative (FN), and true positive (TP) rates as proportions, computed over repeated stratified cross-validation folds. Values in parentheses denote ±1 standard deviation across folds. Cell colour intensity reflects the magnitude of each rate, with darker blue indicating higher values. Specificity = TN rate; Sensitivity = TP rate.

Statistical significance testing

Pairwise bootstrap significance testing across four ranking metrics (Eff.MCC, AUC-ROC, Brier Score, ECE) over 10,000 iterations showed that several within-top-tier pairs were statistically equivalent on all metrics (0/4): LASSO vs ElasticNet, LASSO vs SVC-Linear, LASSO vs SVC-RBF, ElasticNet vs SVC-Linear, ElasticNet vs SVC-RBF, and SVC-Linear vs SVC-RBF (Figs. 10–11). All comparisons between top-tier linear/kernel models and GradientBoosting, RandomForest, and AdaBoost reached 4/4.
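The pairwise bootstrap comparison described above can be sketched for a single metric as follows. This is an illustration under stated assumptions, not the pipeline's implementation: the percentile-style two-sided p-value, the default seed, and the per-sample losses are all hypothetical choices, since the text specifies only paired resampling, 10,000 iterations, and a fixed seed.

```python
import random
from math import comb


def paired_bootstrap_pvalue(loss_a, loss_b, n_iter=10_000, seed=0):
    """Two-sided p-value for the mean difference in per-sample loss.

    Assumption: percentile-style p-value, 2 * min(P(diff <= 0), P(diff >= 0)).
    """
    if len(loss_a) != len(loss_b):
        raise ValueError("paired inputs must have equal length")
    rng = random.Random(seed)  # fixed seed for reproducibility
    diffs = [a - b for a, b in zip(loss_a, loss_b)]
    n = len(diffs)
    n_le = n_ge = 0
    for _ in range(n_iter):
        # Resample with replacement; pairing is preserved because each
        # draw picks one per-sample difference between the two models.
        total = sum(diffs[rng.randrange(n)] for _ in range(n))
        n_le += total <= 0
        n_ge += total >= 0
    return min(1.0, 2.0 * min(n_le, n_ge) / n_iter)


# Hypothetical per-sample squared errors (Brier contributions) for two models.
data_rng = random.Random(1)
model_a = [data_rng.uniform(0.00, 0.10) for _ in range(60)]
model_b = [data_rng.uniform(0.00, 0.12) for _ in range(60)]

print("model pairs among 14 models:", comb(14, 2))
print("p-value (A vs B):", paired_bootstrap_pvalue(model_a, model_b, n_iter=2_000))
```

A returned value of exactly 0 is better reported as p < 1/n_iter, since the resolution of the estimate is bounded by the number of resamples.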
The Brier Score panel showed the most extensive significant differences across all sub-panels, while the AUC-ROC panel displayed the largest proportion of non-significant cells within the top-tier cluster, indicating that calibration differences are more statistically resolvable than discriminative differences among high-performing models. DeLong test-based comparisons across all 91 model pairs with Benjamini–Hochberg FDR correction (q < 0.05) corroborated this tier structure: all top-tier versus bottom-tier comparisons reached significance on AUC, while most within-top-tier pairs did not.

Each cell in the lower triangle of each panel displays the two-sided bootstrap p-value for the corresponding model pair on the indicated ranking metric; green cells denote p < 0.05 (statistically significant difference) and grey cells denote p ≥ 0.05 (non-significant). Bootstrap resampling was performed with 10,000 iterations and a fixed random seed. Ranking metrics: Eff.MCC = effective Matthews Correlation Coefficient (MCC × coverage); AUC-ROC = area under the receiver operating characteristic curve; Brier Score = mean squared probability error; ECE = expected calibration error computed over equal-width bins.

Values denote the number of ranking metrics (out of 4) for which the corresponding model pair reached statistical significance at α = 0.05. A value of 0/4 indicates statistical equivalence across all four metrics; 4/4 indicates significant difference on every metric. Both figures were generated using the Goodness-of-Fit Analyzer v11 pipeline with 10,000 paired bootstrap iterations.

Composite performance ranking

Figure 12 consolidates the evaluation into a composite GoF score: the unweighted mean of per-metric min-max normalised values across all eight test metrics, with Brier Score normalised in the inverted direction.
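The composite definition above can be sketched as follows. The three-model, two-metric table is hypothetical, and the handling of a metric with no spread (a neutral 0.5) is an assumption the text does not address.

```python
def composite_scores(table, lower_is_better=("Brier",)):
    """Unweighted mean of per-metric min-max normalised values.

    table: {model: {metric: value}}; metrics named in `lower_is_better`
    are inverted so that lower raw values give higher normalised scores.
    """
    models = list(table)
    metrics = list(table[models[0]])
    totals = dict.fromkeys(models, 0.0)
    for metric in metrics:
        values = [table[m][metric] for m in models]
        lo, span = min(values), max(values) - min(values)
        for m in models:
            # Assumption: a metric with no spread contributes a neutral 0.5.
            norm = 0.5 if span == 0 else (table[m][metric] - lo) / span
            totals[m] += (1.0 - norm) if metric in lower_is_better else norm
    return {m: totals[m] / len(metrics) for m in models}


# Hypothetical three-model, two-metric table (not the study's values).
toy = {
    "ModelA": {"ROC-AUC": 0.98, "Brier": 0.05},
    "ModelB": {"ROC-AUC": 0.95, "Brier": 0.22},
    "ModelC": {"ROC-AUC": 0.86, "Brier": 0.15},
}
for model, score in sorted(composite_scores(toy).items(), key=lambda kv: -kv[1]):
    print(f"{model}: {score:.3f}")
```

Because each metric is rescaled to the observed range, the score is relative to the set of models compared: adding or removing a model can change every other model's composite value.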
Stacking achieved the highest composite score (0.9944), followed by LASSO (0.9488), ElasticNet (0.9485), SVC-RBF (0.9448), Ridge (0.9382), GBLUP (0.9328), and SVC-Linear (0.9293), all seven occupying a compressed upper band spanning only 0.0651 composite score units. A clear discontinuity separated this cluster from the intermediate tier of PyTorchCNN (0.8260) and PyTorchMLP (0.7276), and a second discontinuity delimited the bottom cluster: RandomForest (0.6067), AdaBoost (0.4157), LightGBM (0.3278), XGBoost (0.1478), and GradientBoosting (0.0630), the latter recording a value 15.8-fold below Stacking, reflecting simultaneous underperformance across discrimination, calibration, and fold-to-fold stability. The composite ranking is therefore not a restatement of any single metric but an integrated summary of multi-dimensional performance, providing a dimensionally robust basis for model selection in applied genomic prediction contexts.

The composite performance score represents the unweighted mean of per-metric min-max normalised values computed across eight test metrics (ROC-AUC, Accuracy, Sensitivity, Specificity, Precision, F1 Score, Persistence, and Brier Score); Brier Score normalisation is inverted such that lower raw values yield higher normalised scores. Values range from 0 (worst observed) to 1 (best observed). Models are ranked in descending order of composite score.

Prevalence recalibration analysis

Post-hoc Albert offset recalibration was applied across four deployment-prevalence scenarios (π = 0.25, π = 0.1407, π = 0.10, and π = 0.05), where π = 0.1407 represents the estimated true MAP seroprevalence of the target population. Both Brier score and Expected Calibration Error (ECE) increased monotonically as target prevalence decreased, yet the three-tier model hierarchy was preserved across all scenarios.
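The exact form of the offset is not given in this section. A minimal sketch, assuming it follows the standard prior-shift correction on the logit scale and that the training prevalence is 0.5 (the balanced design), might look like this; `shift_prevalence` is a hypothetical name.

```python
from math import exp, log


def logit(p):
    return log(p / (1.0 - p))


def expit(z):
    return 1.0 / (1.0 + exp(-z))


def shift_prevalence(p, target_prev, train_prev=0.5, eps=1e-12):
    """Shift a predicted probability from the training prevalence to a target one.

    Assumption: logit(p_new) = logit(p) + logit(target) - logit(train).
    """
    p = min(max(p, eps), 1.0 - eps)  # keep the logit finite
    offset = logit(target_prev) - logit(train_prev)
    return expit(logit(p) + offset)


# The four deployment-prevalence scenarios named in the text.
for target in (0.25, 0.1407, 0.10, 0.05):
    shifted = [shift_prevalence(p, target) for p in (0.2, 0.5, 0.9)]
    print(target, [round(s, 3) for s in shifted])
```

Under this assumed form the offset is a strictly increasing transformation of each model's outputs, so within-model rank ordering (and hence ROC-AUC) is unchanged; only calibration-sensitive metrics such as Brier score and ECE respond to the shift.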
At true field prevalence (π = 0.1407), top-tier Brier scores ranged from 0.0700 (Stacking) to 0.1785 (GBLUP), while bottom-tier tree-based models ranged from 0.2698 (XGBoost) to 0.3828 (RandomForest; Fig. 13). The ECE heatmap at the same prevalence corroborated this separation: top-tier ECE spanned 0.1069 (SVC-Linear) to 0.2607 (GBLUP), whereas bottom-tier values ranged from 0.3064 (XGBoost) to 0.3868 (AdaBoost; Fig. 14). One notable observation was that PyTorchCNN recorded the lowest ECE among all fourteen models at π = 0.1407 (0.0668), a value below that of all top-tier models at this scenario, though this did not alter its intermediate-tier composite standing.

Each cell displays the Brier score for the corresponding model–scenario combination following Albert offset recalibration. Models are sorted in descending order of composite GoF score and grouped into three performance tiers (Top, Intermediate, Bottom) separated by dashed lines. Lower values indicate better probabilistic calibration.

Each cell displays the ECE for the corresponding model–scenario combination following Albert offset recalibration, computed over ten equal-width probability bins.

Discussion

Genomic Prediction as Disciplined Inference in High-Dimensional Biology

Genomic prediction for complex disease resistance operates in a statistical regime that is simultaneously information-rich and sample-constrained. In such contexts, prediction models do more than classify: they instantiate particular assumptions about how biological signal is structured, how uncertainty should be quantified, and how evidence should be translated into decision-relevant probabilities. The present comparative analysis of fourteen frameworks therefore serves not merely as a performance benchmark, but as an examination of how different algorithmic philosophies behave under identical genomic constraints.
The foundational premise of whole-genome prediction, established by [2], is that selection decisions can be anchored to the joint information content of all markers simultaneously, a premise whose implications continue to ramify as the diversity of available learning architectures expands.

Multi-Breed, Multi-Region Population Composition and Its Analytical Consequences

A structurally consequential feature of this study is the composition of the phenotyped population, encompassing seven indigenous Turkish goat breeds sampled from 11 provinces across four geographically distinct regions of Anatolia. Single-breed designs produce performance estimates that may reflect population-specific linkage disequilibrium structure and within-group relatedness rather than biologically meaningful signal; the multi-breed design employed here reduces this risk by requiring models to generalise across animals with divergent demographic histories and different effective population sizes. The consistently high performance of top-tier models supports the interpretation that their discriminative capacity reflects partially generalisable genomic signal, while the systematic and architecturally predictable failure of tree-based ensembles is consistent with structural miscalibration arguments rather than sampling-induced noise. It should be acknowledged that between-breed linkage disequilibrium confounding and differential heritability structure are plausible; breed-stratified analyses were not conducted, and the extent to which the observed performance hierarchy replicates within individual breeds remains an open question.

Mutual Information Filtering as Epistemic Pre-structuring of the Feature Space

The transformation of 44,375 quality-controlled SNPs into a 5,000-marker subset via mutual information (MI) filtering, which retains approximately 11% of the genome-wide panel [53, 54], defines the epistemic conditions under which all downstream models operate.
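A univariate MI filter of the kind described here can be sketched with a simple plug-in estimator for discrete genotype codes. The estimator, the marker names, and the toy genotypes are all assumptions made for illustration; the section does not state which MI estimator the study used.

```python
from collections import Counter
from math import log


def mutual_information(xs, ys):
    """Plug-in mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    joint, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum(
        (c / n) * log(c * n / (px[x] * py[y])) for (x, y), c in joint.items()
    )


def top_k_markers(genotypes, labels, k):
    """Return the k marker IDs with the highest MI against the labels."""
    ranked = sorted(
        genotypes,
        key=lambda marker: mutual_information(genotypes[marker], labels),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical genotype codes (0/1/2 allele counts) for eight animals.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
genotypes = {
    "snp_informative": [0, 0, 0, 1, 2, 2, 1, 2],
    "snp_weak":        [0, 1, 0, 1, 1, 1, 0, 2],
    "snp_noise":       [0, 1, 0, 1, 0, 1, 0, 1],
}
print(top_k_markers(genotypes, labels, k=2))
```

Each marker is scored against the phenotype on its own, which is why a filter of this type cannot see information that only emerges from combinations of markers.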
In ultra-high-dimensional settings where p ≫ n (44,375 markers vs. 474 animals), unconstrained estimation is mathematically unstable: collinearity inflates variance, gradient-based optimisers converge poorly, and tree-based partitioning faces combinatorial explosion in split evaluation [53]. MI was selected because it is model-agnostic, capturing both linear and non-linear associations and preserving analytical fairness across all architectures [55, 56]; it retains original SNP identities for biological interpretability [57]; and it has demonstrated competitive performance relative to embedded strategies in high-dimensional genomic contexts [55]. The empirical MI distribution exhibited a markedly right-skewed structure consistent with polygenic theory [5, 42]. A critical caveat must be noted: as a univariate filter, MI cannot capture epistatic interactions emerging from multivariate marker combinations [54], and uniform MI filtering, while essential for valid benchmarking, does not constitute the theoretically optimal feature set for each individual learning paradigm.

Architectural Convergence and the Ceiling of Extractable Additive Signal

The most structurally striking feature of the comparative results is not the identity of the best-performing model, but the architectural breadth of the top tier. Seven frameworks spanning fundamentally different algorithmic paradigms achieved ROC-AUC values ranging from 0.9770 to 0.9823: penalised linear regression (LASSO, ElasticNet, Ridge), kernel-based classification (SVC-RBF, SVC-Linear), a genomic relationship matrix estimator (GBLUP), and a heterogeneous stacking ensemble. The six regularised and kernel models spanned a compressed range of only 0.0015 AUC units, and bootstrap testing found no significant difference on any of the four ranking metrics (0/4) for any pair among LASSO, ElasticNet, SVC-Linear, and SVC-RBF.
That architecturally unrelated models simultaneously reach the same discriminative ceiling implies that the predictive signal in this dataset is predominantly additive and low-dimensional in structure, offering little purchase to more complex non-linear learning surfaces [58], which is consistent with the infinitesimal model framework [2]. GBLUP's competitive performance is particularly informative: a model performing no internal feature selection and optimising no loss function beyond variance component estimation achieves discriminative capacity largely equivalent to purpose-built machine learning frameworks, consistent with prior reports where additive genetic architecture limits the incremental value of non-linear methods [15]. PyTorchCNN and PyTorchMLP failed to surpass the linear tier for reasons fundamentally distinct from tree-based models: deep learning architectures display elevated fold-to-fold instability consistent with stochastic optimisation that has not converged under current sample size constraints, whereas tree-based models retain competitive discrimination while exhibiting stable architectural miscalibration. Stacking's out-of-fold construction protocol imposes structural data separation mitigating overfitting at the ensemble level [59, 60], reflected in its marginally supra-diagonal positioning in the Accuracy generalisation panel and its lowest Brier score (0.046) within the top tier.

Discrimination–Calibration Dissociation Across Learning Architectures

The central analytical contribution of this study is the empirical documentation and mechanistic interpretation of a discrimination–calibration dissociation operating simultaneously between architectural tiers and within the top tier itself. The theoretical independence of rank-ordering capacity and probabilistic calibration fidelity is well established [52], but its instantiation across a diverse model set under identical data conditions carries specific inferential weight.
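The independence of rank ordering and calibration can be illustrated with a small sketch: any strictly increasing distortion of the predicted probabilities leaves ROC-AUC unchanged while altering the Brier score. The labels, probabilities, and the particular distortion (a compression toward 0.5, chosen purely for illustration) are hypothetical.

```python
def roc_auc(y_true, scores):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def brier_score(y_true, probs):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, probs)) / len(y_true)


def compress(p, factor=0.2):
    """Strictly increasing distortion that pulls probabilities toward 0.5."""
    return 0.5 + factor * (p - 0.5)


# Hypothetical labels and predicted probabilities.
y = [0, 0, 0, 0, 1, 1, 1, 1]
probs = [0.05, 0.10, 0.20, 0.60, 0.40, 0.80, 0.90, 0.95]
distorted = [compress(p) for p in probs]

print("AUC  :", roc_auc(y, probs), "->", roc_auc(y, distorted))
print("Brier:", round(brier_score(y, probs), 4), "->", round(brier_score(y, distorted), 4))
```

In this toy case the ranking of animals is identical before and after the distortion, yet the Brier score roughly doubles, which is the kind of gap a discrimination-only comparison would not reveal.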
Across fourteen models, ROC-AUC varied by a factor of 1.14 — from 0.8649 to 0.9823 — whereas Brier score varied by a factor of 4.85, spanning 0.0460 to 0.2229: a near-fivefold differential in relative dispersion against a 1.14-fold differential in discriminative range. The dissociation is most starkly illustrated by RandomForest, whose ROC-AUC of 0.9552 places it within plausible distance of the top tier, yet whose Brier score of 0.2229 — the highest in the comparison — signals that posterior probability estimates are structurally unreliable despite retained rank-ordering ability. This is mechanistically expected: ensemble averaging over unpruned trees stabilises rank predictions while pushing predicted probabilities toward the extremes of the unit interval — a known consequence of aggregating leaf-node class frequencies without explicit post-hoc calibration [61]. Within the top tier, Stacking achieves the lowest Brier score (0.0460) despite ranking last on AUC within its group (0.9770), while GBLUP exhibits the weakest calibration (Brier: 0.0634) — configurations reflecting architectural and optimisation-level properties that operate independently of predictive hierarchy. Stacking's superior calibration despite incorporating the most severely miscalibrated base learner (RandomForest; Brier = 0.223) is explained by the meta-layer architecture: the Logistic Regression meta-learner, trained on held-out base-learner predictions, implicitly learns to discount RandomForest's poorly calibrated outputs, applying implicit post-hoc recalibration through learned combination weights without requiring explicit isotonic or Platt rescaling. The CV% stability analysis reveals a further dimension: RandomForest (Brier CV% = 1.5%) and AdaBoost (1.6%) exhibit apparent maximum Brier stability — reflecting structurally consistent miscalibration rather than genuine calibration quality. 
These models are not miscalibrated because of instability; they are stably, architecturally miscalibrated. The bootstrap significance results formalise this: the greater prevalence of significant pairwise differences in the Brier Score panel relative to the grey-dominated AUC-ROC panel confirms that models statistically indistinguishable on discrimination can be formally separated on calibration fidelity [52], a distinction with direct practical consequence wherever predicted probabilities inform culling thresholds, breeding decisions, or epidemiological modelling. These findings collectively argue that model selection should be governed by calibration-aware criteria (Brier score, expected calibration error, and reliability diagram inspection) alongside, rather than subordinate to, conventional discrimination metrics.

Failure Mode Heterogeneity and Its Diagnostic Implications

The generalisation map and stability analyses jointly reveal that failure modes of underperforming models are architecturally distinct and carry different remedial implications, a dimension entirely obscured by composite ranking. GradientBoosting exhibits erratic fold-variance, with Persistence collapsing from training values approaching 1.0 to test values near 0.6. XGBoost presents systematic decision boundary asymmetry: sensitivity exceeds specificity by 0.1501 with a specificity SD of 0.1052. RandomForest, by contrast, achieves an intermediate persistence score (0.785) yet retains the highest Brier score: a structurally distinct failure mode in which cross-fold rank predictions are relatively consistent while probability assignments remain systematically miscalibrated, confirmed by its departure from the persistence–Brier regression line in a direction unexplainable by instability alone (Fig. 8). The 0/4 bootstrap equivalence between LightGBM and XGBoost indicates these models form a statistically coherent failure cluster rather than a graded performance continuum.
On the Relativity of Model Adequacy and Composite Ranking

The designation of a model as adequate or inadequate is not an intrinsic model property, but a relational judgement dependent on which region of the metric space the investigator inhabits, a choice inseparable from the inferential demands of the application [52]. The composite GoF score (Fig. 12) is most usefully interpreted not as a ranking instrument but as a diagnostic of tier membership: within the top tier, composite scores are so compressed that rank differences carry no practical inference; across tiers, discontinuities constitute substantive architectural distinctions rather than sampling artefacts. The GoF formulation assigns calibration an effective weight of only 12.5% (one Brier Score component among eight), and applications where probability fidelity is the primary decision driver may prefer explicitly calibration-weighted composites or Pareto-front approaches identifying non-dominated models without scalar reduction.

Prevalence Shift and the Architectural Basis of Miscalibration

The recalibration analysis addresses a potential limitation of the balanced case-control design: whether the calibration profiles documented under cross-validation reflect genuine architectural properties or are partly artefacts of the 1:1 training prevalence. The observation that the tier-level Brier and ECE ordering was preserved across all tested scenarios, including the estimated true field prevalence of π = 0.1407, is consistent with the interpretation that the miscalibration of bottom-tier tree-based ensembles is not primarily a training-prevalence artefact, although this design does not permit a formal causal test of that claim.
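Since the tier ordering above rests partly on ECE, a minimal sketch of a binned ECE over equal-width probability bins may help fix ideas. The binning rule for boundary values and the toy predictions are assumptions; the study's own implementation is not shown in this section.

```python
def expected_calibration_error(y_true, probs, n_bins=10):
    """ECE over equal-width bins: weighted mean |observed rate - mean prediction|."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, probs):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[index].append((y, p))
    n = len(y_true)
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins carry no weight
        observed = sum(y for y, _ in members) / len(members)
        predicted = sum(p for _, p in members) / len(members)
        ece += len(members) / n * abs(observed - predicted)
    return ece


# Hypothetical predictions: one informative model, one constant base-rate model.
y = [0, 0, 0, 1, 0, 1, 1, 1]
informative = [0.05, 0.15, 0.25, 0.35, 0.65, 0.75, 0.85, 0.95]
constant = [0.5] * len(y)

print("informative ECE:", round(expected_calibration_error(y, informative), 3))
print("constant ECE:   ", expected_calibration_error(y, constant))
```

A constant prediction equal to the base rate scores an ECE of zero here despite carrying no discriminative information, which is one reason ECE is usually read alongside Brier score and ROC-AUC rather than in isolation.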
The apparent ECE advantage of PyTorchCNN at π = 0.1407 warrants caution: given that ECE is a binned summary statistic, the unusually low value (0.0668) may partly reflect the concentration of that model's predicted probabilities within a narrow range that is less sensitive to logit-scale offset, rather than improved probabilistic fidelity, and should not be interpreted as evidence of genuine calibration superiority in isolation. More broadly, these results suggest that the discrimination–calibration dissociation documented under balanced training conditions is likely to remain relevant under realistic field deployment conditions, while acknowledging that external prospective validation would be necessary before any such inference is treated as definitive.

Limitations

Several structural limitations warrant acknowledgement. The absence of breed-stratified evaluations leaves between-breed performance heterogeneity uncharacterised; paratuberculosis prevalence, immune response architecture, and effective population size likely vary across the seven breeds. MI's univariate filtering cannot capture epistatic interactions, potentially disadvantaging architectures designed for non-additive effects [54]. The moderate sample size (n = 474) relative to the retained feature space (p = 5,000) inherently constrains complex architectures. Finally, all evaluations were conducted under cross-validation on a single cohort without external prospective validation, a limitation that must be resolved before any framework is deployed in operational genomic selection programmes.

Conclusions

Three conclusions of broad methodological relevance emerge.
First, the convergence of linear, kernel-based, and relationship-matrix estimators to a statistically indistinguishable discriminative ceiling — confirmed by bootstrap equivalence across four independent ranking metrics — is consistent with the predictive signal being predominantly captured by additive genetic structure, with the caveat that MI-based pre-filtering inherently pre-structures the feature space in favour of additive signal. Architectural complexity beyond penalised linear models confers no measurable discriminative benefit under these data conditions, and GBLUP's competitive standing is consistent with quantitative genetic theory as an appropriate primary framework for genomic prediction of polygenic disease resistance traits in livestock populations of modest size. Second, the discrimination–calibration dissociation documented across all fourteen models demonstrates that no single metric suffices to characterise a model's fitness for purpose, and that the selection of an evaluation criterion is not a neutral methodological convention but an inferential commitment that must be made explicit relative to the decision context. Third, model failure modes are architecturally heterogeneous — erratic fold-variance in GradientBoosting, stable architectural miscalibration in RandomForest, systematic decision boundary asymmetry in XGBoost — profiles with different remedial implications that composite rankings necessarily conflate. Together, these findings argue that genomic prediction model evaluation should be conducted in a multi-metric, multi-dimensional framework that explicitly distinguishes discrimination from calibration, absolute performance from stability, and statistical equivalence from practical interchangeability — with the inferential target specified before, rather than after, the selection of an evaluation criterion. 
Finally, post-hoc prevalence recalibration across scenarios spanning the plausible range of MAP field deployment conditions (π = 0.05–0.25), including the estimated true seroprevalence of π = 0.1407, indicated that the tier-level Brier score and ECE ordering observed under balanced training conditions was largely preserved. This pattern is consistent with the miscalibration of bottom-tier tree-based models reflecting an architectural rather than a prevalence-contingent deficit, though this interpretation remains subject to confirmation in external validation cohorts.

Declarations

Acknowledgements

This project was supported by the Scientific and Technological Research Council of Türkiye (TÜBİTAK), and we extend our sincere gratitude to TÜBİTAK for their support. We confirm that informed consent was obtained from all animal owners. In addition, official permission to conduct the field study was granted by the General Directorate of Agricultural Research and Policies (TAGEM), Republic of Türkiye Ministry of Agriculture and Forestry (dated January 16, 2023; document no. E-92190712-604.02-8570746). The corresponding author would like to express sincere gratitude to Mustafa Kemal Atatürk, the founding leader of modern Türkiye, for his enduring legacy of reason and scientific thought. In this spirit, we adhere to his guiding principle: “If, one day, my words contradict science, choose science.” The authors used Claude Sonnet 4.6, a large language model developed by Anthropic, solely for grammatical editing and language refinement during manuscript preparation. All scientific content, analyses, interpretations, and conclusions are entirely the work of the authors.

Funding

This research project was funded by the Scientific and Technological Research Council of Türkiye (TÜBİTAK) (Project No: TOVAG-222O107).

Author contributions

Y.Y. conceptualized and designed the study, performed the genomic prediction and machine learning analyses, and wrote the manuscript.
Y.Y., R.A., M.K., Ö.G., and K.I. conducted the laboratory work. S.Y. organized the field work. Y.Y., A.E., D.C., Y.E.K., S.S.Ş., and M.B. conducted the field work and data collection.

Data availability

The data used in this study have been deposited in the figshare.com database under https://doi.org/10.6084/m9.figshare.30845045

Ethics declarations

Competing interests

The authors declare no competing interests.

Ethical statement

This study was conducted in accordance with the guidelines of the Local Ethics Committee for Animal Experiments and with an experimental protocol approved by the “Ethics Committee for the Use of Animals in Research and Experimentation” at the Sheep Breeding and Research Institute, Türkiye (Approval No: 066/14.02.2023). Informed consent was obtained from the Ministry of Agriculture and Forestry, Siirt Provincial Directorate of Agriculture (Approval No: E-64380313-325.01-8848896/08.02.2023) prior to the study. The authors also complied with the ARRIVE guidelines.

References

Hosseini K, Anaraki N, Dastjerdi P, Kazemian S, Hasanzad M, Alkhouli M, et al. Bridging Genomics to Cardiology Clinical Practice: Artificial Intelligence in Optimizing Polygenic Risk Scores: A Systematic Review. JACC: Advances. 2025;4:101803. https://doi.org/10.1016/j.jacadv.2025.101803.
Meuwissen TH, Hayes BJ, Goddard ME. Prediction of total genetic value using genome-wide dense marker maps. Genetics. 2001;157:1819–29. https://doi.org/10.1093/genetics/157.4.1819.
Alves AAC, Espigolan R, Bresolin T, Costa RM, Fernandes Júnior GA, Ventura RV, et al. Genome-enabled prediction of reproductive traits in Nellore cattle using parametric models and machine learning methods. Anim Genet. 2021;52:32–46. https://doi.org/10.1111/age.13021.
de Los Campos G, Hickey JM, Pong-Wong R, Daetwyler HD, Calus MPL. Whole-genome regression and prediction methods applied to plant and animal breeding. Genetics. 2013;193:327–45. https://doi.org/10.1534/genetics.112.143313.
Meuwissen T, Hayes B, Goddard M. Accelerating improvement of livestock with genomic selection. Annu Rev Anim Biosci. 2013;1:221–37. https://doi.org/10.1146/annurev-animal-031412-103705.
Momen M, Mehrgardi AA, Sheikhi A, Kranis A, Tusell L, Morota G, et al. Predictive ability of genome-assisted statistical models under various forms of gene action. Sci Rep. 2018;8:12309. https://doi.org/10.1038/s41598-018-30089-2.
Varona L, Legarra A, Toro MA, Vitezica ZG. Non-additive Effects in Genomic Selection. Front Genet. 2018;9:78. https://doi.org/10.3389/fgene.2018.00078.
Hughes J, Shymka M, Ng T, Phulka JS, Safabakhsh S, Laksman Z. Polygenic Risk Score Implementation into Clinical Practice for Primary Prevention of Cardiometabolic Disease. Genes. 2024;15:1581. https://doi.org/10.3390/genes15121581.
Martin AR, Kanai M, Kamatani Y, Okada Y, Neale BM, Daly MJ. Clinical use of current polygenic risk scores may exacerbate health disparities. Nat Genet. 2019;51:584–91. https://doi.org/10.1038/s41588-019-0379-x.
Pain O. Leveraging global genetics resources to enhance polygenic prediction across ancestrally diverse populations. HGG Adv. 2025;6:100482. https://doi.org/10.1016/j.xhgg.2025.100482.
Ruan Y, Lin Y-F, Feng Y-CA, Chen C-Y, Lam M, Guo Z, et al. Improving polygenic prediction in ancestrally diverse populations. Nat Genet. 2022;54:573–80. https://doi.org/10.1038/s41588-022-01054-7.
Abdollahi-Arpanahi R, Gianola D, Peñagaricano F. Deep learning versus parametric and ensemble methods for genomic prediction of complex phenotypes. Genet Sel Evol. 2020;52:12. https://doi.org/10.1186/s12711-020-00531-z.
Gianola D. Priors in Whole-Genome Regression: The Bayesian Alphabet Returns. Genetics. 2013;194:573–96. https://doi.org/10.1534/genetics.113.151753.
González-Recio O, Rosa GJM, Gianola D. Machine learning methods and predictive ability metrics for genome-wide prediction of complex traits. Livestock Science. 2014;166:217–31. https://doi.org/10.1016/j.livsci.2014.05.036.
Montesinos-López OA, Montesinos-López A, Pérez-Rodríguez P, Barrón-López JA, Martini JWR, Fajardo-Flores SB, et al. A review of deep learning applications for genomic selection. BMC Genomics. 2021;22:19. https://doi.org/10.1186/s12864-020-07319-x.
Zingaretti LM, Gezan SA, Ferrão LFV, Osorio LF, Monfort A, Muñoz PR, et al. Exploring Deep Learning for Complex Trait Genomic Prediction in Polyploid Outcrossing Species. Front Plant Sci. 2020;11:25. https://doi.org/10.3389/fpls.2020.00025.
Liu J, Yan X, Li W, Xue S-H, Wang Z, Su R. Genomic Selection for Cashmere Traits in Inner Mongolian Cashmere Goats Using Random Forest, Gradient Boosting Decision Tree, Extreme Gradient Boosting and Light Gradient Boosting Machine Methods. Animals (Basel). 2025;15:2940. https://doi.org/10.3390/ani15202940.
Xiang T, Li T, Li J, Li X, Wang J. Using machine learning to realize genetic site screening and genomic prediction of productive traits in pigs. FASEB J. 2023;37:e22961. https://doi.org/10.1096/fj.202300245R.
Thorsrud JA, Evans KM, Quigley KC, Srikanth K, Huson HJ. Performance Comparison of Genomic Best Linear Unbiased Prediction and Four Machine Learning Models for Estimating Genomic Breeding Values in Working Dogs. Animals (Basel). 2025;15:408. https://doi.org/10.3390/ani15030408.
Usai MG, Casu S, Sechi T, Salaris SL, Miari S, Mulas G, et al. Advances in understanding the genetic architecture of antibody response to paratuberculosis in sheep by heritability estimate and LDLA mapping analyses and investigation of candidate regions using sequence-based data. Genet Sel Evol. 2024;56:5. https://doi.org/10.1186/s12711-023-00873-4.
Hasonova L, Pavlik I. Economic impact of paratuberculosis in dairy cattle herds: a review. Veterinární medicína. 2006;51:193–211. https://doi.org/10.17221/5539-VETMED.
Rasmussen P, Barkema HW, Mason S, Beaulieu E, Hall DC. Economic losses due to Johne’s disease (paratuberculosis) in dairy cattle. J Dairy Sci. 2021;104:3123–43. https://doi.org/10.3168/jds.2020-19381.
Dow CT, Alvarez BL. Mycobacterium paratuberculosis zoonosis is a One Health emergency. Ecohealth. 2022;19:164–74. https://doi.org/10.1007/s10393-022-01602-x.
Waddell LA, Rajić A, Stärk KDC, McEwen SA. The zoonotic potential of Mycobacterium avium ssp. paratuberculosis: a systematic review and meta-analyses of the evidence. Epidemiol Infect. 2015;143:3135–57. https://doi.org/10.1017/S095026881500076X.
Ayele WY, Svastova P, Roubal P, Bartos M, Pavlik I. Mycobacterium avium subspecies paratuberculosis cultured from locally and commercially pasteurized cow’s milk in the Czech Republic. Appl Environ Microbiol. 2005;71:1210–4. https://doi.org/10.1128/AEM.71.3.1210-1214.2005.
Whittington RJ, Marsh IB, Reddacliff LA. Survival of Mycobacterium avium subsp. paratuberculosis in dam water and sediment. Appl Environ Microbiol. 2005;71:5304–8. https://doi.org/10.1128/AEM.71.9.5304-5308.2005.
Kravitz A, Pelzer K, Sriranganathan N. The Paratuberculosis Paradigm Examined: A Review of Host Genetic Resistance and Innate Immune Fitness in Mycobacterium avium subsp. Paratuberculosis Infection. Front Vet Sci. 2021;8. https://doi.org/10.3389/fvets.2021.721706.
Alpay F, Zare Y, Kamalludin MH, Huang X, Shi X, Shook GE, et al. Genome-wide association study of susceptibility to infection by Mycobacterium avium subspecies paratuberculosis in Holstein cattle. PLoS One. 2014;9:e111704. https://doi.org/10.1371/journal.pone.0111704.
Sanchez M-P, Tribout T, Fritz S, Guatteo R, Fourichon C, Schibler L, et al. New insights into the genetic resistance to paratuberculosis in Holstein cattle via single-step genomic evaluation. Genet Sel Evol. 2022;54:67. https://doi.org/10.1186/s12711-022-00757-z.
Badia-Bringué G, Alonso-Hearn M. Integrating transcriptomic and genomic studies for the identification of expression quantitative trait loci associated with bovine paratuberculosis. Front Vet Sci. 2025;12:1632212. https://doi.org/10.3389/fvets.2025.1632212.
Canive M, Badia-Bringué G, Vázquez P, González-Recio O, Fernández A, Garrido JM, et al. Identification of loci associated with pathological outcomes in Holstein cattle infected with Mycobacterium avium subsp. paratuberculosis using whole-genome sequence data. Sci Rep. 2021;11:20177. https://doi.org/10.1038/s41598-021-99672-4.
Idris SM, Eltom KH, Okuni JB, Ojok L, Elmagzoub WA, El Wahed AA, et al. Paratuberculosis: The Hidden Killer of Small Ruminants. Animals (Basel). 2021;12:12. https://doi.org/10.3390/ani12010012.
Korou LM, Liandris E, Gazouli M, Ikonomopoulos J. Investigation of the association of the SLC11A1 gene with resistance/sensitivity of goats (Capra hircus) to paratuberculosis. Vet Microbiol. 2010;144:353–8. https://doi.org/10.1016/j.vetmic.2010.01.009.
Mataragka A, Klavdianos Papastathis A, Ikonomopoulos J. Association of SLC11A1 3’UTR (GT)n Microsatellite Polymorphisms with Resistance to Paratuberculosis in Sheep. Pathogens. 2025;14:1150. https://doi.org/10.3390/pathogens14111150.
Yaman Y, Aymaz R, Keleş M, Bay V, Ün C, Heaton MP. Association of TLR2 haplotypes encoding Q650 with reduced susceptibility to ovine Johne’s disease in Turkish sheep. Sci Rep. 2021;11:7088. https://doi.org/10.1038/s41598-021-86605-4.
Su R, Huang B, Tan J, Shen Z, Zhong P, Liu J. Mutual information stacking method for prediction of the growth traits in pigs. Brief Bioinform. 2025;26:bbaf231. https://doi.org/10.1093/bib/bbaf231.
Zhu K, Zheng Y, Chan KCG. Weighted Brier Score—An Overall Summary Measure for Risk Prediction Models with Clinical Utility Consideration. Stat Biosci. 2025. https://doi.org/10.1007/s12561-025-09505-5.
Chen C, Bhuiyan SA, Ross E, Powell O, Dinglasan E, Wei X, et al. Genomic prediction for sugarcane diseases including hybrid Bayesian-machine learning approaches. Front Plant Sci. 2024;15:1398903. https://doi.org/10.3389/fpls.2024.1398903.
Crossa J, Pérez-Rodríguez P, Cuevas J, Montesinos-López O, Jarquín D, de Los Campos G, et al. Genomic Selection in Plant Breeding: Methods, Models, and Perspectives. Trends Plant Sci. 2017;22:961–75. https://doi.org/10.1016/j.tplants.2017.08.011.
Irvin MR, Ge T, Patki A, Srinivasasainagendra V, Armstrong ND, Davis B, et al. Polygenic Risk for Type 2 Diabetes in African Americans. Diabetes. 2024;73:993–1001. https://doi.org/10.2337/db23-0232.
Lee SS-Y, Stapleton F, MacGregor S, Mackey DA. Genome-wide association studies, Polygenic Risk Scores and Mendelian randomisation: an overview of common genetic epidemiology methods for ophthalmic clinicians. Br J Ophthalmol. 2025;109:433–41. https://doi.org/10.1136/bjo-2024-326554.
Ndong Sima CAA, Step K, Swart Y, Schurz H, Uren C, Möller M. Methodologies underpinning polygenic risk scores estimation: a comprehensive overview. Hum Genet. 2024;143:1265–80. https://doi.org/10.1007/s00439-024-02710-0.
Chang CC, Chow CC, Tellier LC, Vattikuti S, Purcell SM, Lee JJ. Second-generation PLINK: rising to the challenge of larger and richer datasets. Gigascience. 2015;4:7. https://doi.org/10.1186/s13742-015-0047-8.
Browning BL, Zhou Y, Browning SR. A One-Penny Imputed Genome from Next-Generation Reference Panels. Am J Hum Genet. 2018;103:338–48. https://doi.org/10.1016/j.ajhg.2018.07.015.
Guo Y, Zhong Z, Yang C, Hu J, Jiang Y, Liang Z, et al. Epi-GTBN: an approach of epistasis mining based on genetic Tabu algorithm and Bayesian network. BMC Bioinformatics. 2019;20:444. https://doi.org/10.1186/s12859-019-3022-z.
Haws DC, Rish I, Teyssedre S, He D, Lozano AC, Kambadur P, et al. Variable-Selection Emerges on Top in Empirical Comparison of Whole-Genome Complex-Trait Prediction Methods. PLoS One. 2015;10:e0138903. https://doi.org/10.1371/journal.pone.0138903.
Heinrich F, Ramzan F, Rajavel A, Schmitt AO, Gültas M. MIDESP: Mutual Information-Based Detection of Epistatic SNP Pairs for Qualitative and Quantitative Phenotypes. Biology (Basel). 2021;10:921. https://doi.org/10.3390/biology10090921.
Huang H-H, Xu T, Yang J.
Comparing logistic regression, support vector machines, and permanental classification methods in predicting hypertension. BMC Proc. 2014;8 Suppl 1:S96. https://doi.org/10.1186/1753-6561-8-S1-S96. Miller DJ, Zhang Y, Yu G, Liu Y, Chen L, Langefeld CD, et al. An algorithm for learning maximum entropy probability models of disease risk that efficiently searches and sparingly encodes multilocus genomic interactions. Bioinformatics. 2009;25:2478–85. https://doi.org/10.1093/bioinformatics/btp435. Ferrario PG, König IR. Transferring entropy to the realm of GxG interactions. Brief Bioinform. 2018;19:136–47. https://doi.org/10.1093/bib/bbw086. Wang H, Yin H, Wu X. A Secure High-Order Gene Interaction Detecting Method for Infectious Diseases. Comput Math Methods Med. 2022;2022:4471736. https://doi.org/10.1155/2022/4471736. Van Calster B, McLernon DJ, van Smeden M, Wynants L, Steyerberg EW, Bossuyt P, et al. Calibration: the Achilles heel of predictive analytics. BMC Med. 2019;17:230. https://doi.org/10.1186/s12916-019-1466-7. Bermingham ML, Pong-Wong R, Spiliopoulou A, Hayward C, Rudan I, Campbell H, et al. Application of high-dimensional feature selection: evaluation for genomic prediction in man. Sci Rep. 2015;5:10312. https://doi.org/10.1038/srep10312. Pudjihartono N, Fadason T, Kempa-Liehr AW, O’Sullivan JM. A Review of Feature Selection Methods for Machine Learning-Based Disease Risk Prediction. Front Bioinform. 2022;2:927312. https://doi.org/10.3389/fbinf.2022.927312. Alzoubi H, Alzubi R, Ramzan N. Deep Learning Framework for Complex Disease Risk Prediction Using Genomic Variations. Sensors. 2023;23:4439. https://doi.org/10.3390/s23094439. Vergara JR, Estévez PA. A review of feature selection methods based on mutual information. Neural Comput & Applic. 2014;24:175–86. https://doi.org/10.1007/s00521-013-1368-0. Heinrich F, Lange TM, Kircher M, Ramzan F, Schmitt AO, Gültas M. 
Exploring the potential of incremental feature selection to improve genomic prediction accuracy. Genet Sel Evol. 2023;55:78. https://doi.org/10.1186/s12711-023-00853-8.
Goddard ME, Hayes BJ. Mapping genes for complex traits in domestic animals and their use in breeding programmes. Nat Rev Genet. 2009;10:381–91. https://doi.org/10.1038/nrg2575.
Liang M, Chang T, An B, Duan X, Du L, Wang X, et al. A Stacking Ensemble Learning Framework for Genomic Prediction. Front Genet. 2021;12. https://doi.org/10.3389/fgene.2021.600040.
Wolpert DH. Stacked generalization. Neural Networks. 1992;5:241–59. https://doi.org/10.1016/S0893-6080(05)80023-1.
Niculescu-Mizil A, Caruana R. Predicting good probabilities with supervised learning. In: Proceedings of the 22nd international conference on Machine learning. New York, NY, USA: Association for Computing Machinery; 2005. p. 625–32. https://doi.org/10.1145/1102351.1102430.
Additional Declarations
No competing interests reported.
Status: Under Revision
Version 1 posted
Editorial decision: Revision requested 14 May, 2026
Reviews received at journal 12 May, 2026
Reviews received at journal 11 May, 2026
Reviewers agreed at journal 09 May, 2026
Reviews received at journal 09 May, 2026
Reviewers agreed at journal 08 May, 2026
Reviewers agreed at journal 08 May, 2026
Reviewers agreed at journal 07 May, 2026
Reviewers agreed at journal 06 May, 2026
Reviews received at journal 03 May, 2026
Reviewers agreed at journal 27 Apr, 2026
Reviewers agreed at journal 25 Apr, 2026
Reviewers agreed at journal 21 Apr, 2026
Reviewers invited by journal 21 Apr, 2026
Editor assigned by journal 21 Apr, 2026
Editor invited by journal 21 Apr, 2026
Submission checks completed at journal 19 Apr, 2026
First submitted to journal 19 Apr, 2026
© Research Square 2026 | ISSN 2693-5015 (online)
{"props":{"pageProps":{"initialData":{"identity":"rs-9421190","acceptedTermsAndConditions":true,"allowDirectSubmit":false,"archivedVersions":[],"articleType":"Article","associatedPublications":[],"authors":[{"id":631192416,"identity":"2cbc313f-a86d-4e4b-a446-b4a0152444ef","order_by":0,"name":"Yalçın YAMAN","email":"","orcid":"","institution":"Siirt University","correspondingAuthor":true,"prefix":"","firstName":"Yalçın","middleName":"","lastName":"YAMAN","suffix":""},{"id":631192417,"identity":"fcd4db05-bcad-4090-a655-533c4665d99c","order_by":1,"name":"Ahmet ESER","email":"","orcid":"","institution":"Siirt University","correspondingAuthor":false,"prefix":"","firstName":"Ahmet","middleName":"","lastName":"ESER","suffix":""},{"id":631192418,"identity":"f0f40cc2-e847-4860-853f-e435118167f3","order_by":2,"name":"Devran
COŞKUN","email":"","orcid":"","institution":"Siirt University","correspondingAuthor":false,"prefix":"","firstName":"Devran","middleName":"","lastName":"COŞKUN","suffix":""},{"id":631192419,"identity":"97804e99-5468-4064-a5fd-570f7364794b","order_by":3,"name":"Ramazan AYMAZ","email":"","orcid":"","institution":"Siirt University","correspondingAuthor":false,"prefix":"","firstName":"Ramazan","middleName":"","lastName":"AYMAZ","suffix":""},{"id":631192420,"identity":"a83dbd12-8419-466e-aefd-7826393a6d7d","order_by":4,"name":"Yiğit Emir KİŞİ","email":"","orcid":"","institution":"General Directorate of Agricultural Research and Policies","correspondingAuthor":false,"prefix":"","firstName":"Yiğit","middleName":"Emir","lastName":"KİŞİ","suffix":""},{"id":631192421,"identity":"3ea5df24-dd9e-42bf-b964-60b82c1accec","order_by":5,"name":"Murat KELEŞ","email":"","orcid":"","institution":"General Directorate of Agricultural Research and Policies","correspondingAuthor":false,"prefix":"","firstName":"Murat","middleName":"","lastName":"KELEŞ","suffix":""},{"id":631192422,"identity":"09d4bb0b-d07a-492e-a599-03b15983027e","order_by":6,"name":"Serdar YAĞCI","email":"","orcid":"","institution":"General Directorate of Agricultural Research and Policies","correspondingAuthor":false,"prefix":"","firstName":"Serdar","middleName":"","lastName":"YAĞCI","suffix":""},{"id":631192423,"identity":"9541d8c7-56ac-4ee1-a2a3-a4673d36cdca","order_by":7,"name":"Özgül GÜLAYDIN","email":"","orcid":"","institution":"Siirt University","correspondingAuthor":false,"prefix":"","firstName":"Özgül","middleName":"","lastName":"GÜLAYDIN","suffix":""},{"id":631192424,"identity":"405e2e6a-5f7f-4728-89c9-e52abcf0df12","order_by":8,"name":"Serkan Süleyman Şengül","email":"","orcid":"","institution":"General Directorate of Agricultural Research and 
Policies","correspondingAuthor":false,"prefix":"","firstName":"Serkan","middleName":"Süleyman","lastName":"Şengül","suffix":""},{"id":631192425,"identity":"f0eafb44-33f4-47cd-b230-b53685260637","order_by":9,"name":"Kıvanç İrak","email":"","orcid":"","institution":"Siirt University","correspondingAuthor":false,"prefix":"","firstName":"Kıvanç","middleName":"","lastName":"İrak","suffix":""},{"id":631192426,"identity":"b957c7e2-328b-4106-aa86-4ac073ec3992","order_by":10,"name":"Memiş Bolacali","email":"","orcid":"","institution":"Ahi Evran University","correspondingAuthor":false,"prefix":"","firstName":"Memiş","middleName":"","lastName":"Bolacali","suffix":""}],"badges":[],"createdAt":"2026-04-15 03:38:17","currentVersionCode":1,"declarations":"","doi":"10.21203/rs.3.rs-9421190/v1","doiUrl":"https://doi.org/10.21203/rs.3.rs-9421190/v1","draftVersion":[],"editorialEvents":[],"editorialNote":"","failedWorkflow":false,"files":[{"id":108192762,"identity":"b3b4d984-addc-4cd5-beec-05843980dae7","added_by":"auto","created_at":"2026-04-30 10:11:58","extension":"png","order_by":1,"title":"Figure 1","display":"","copyAsset":false,"role":"figure","size":78175,"visible":true,"origin":"","legend":"\u003cp\u003eMutual information-based feature selection identifying discriminatory SNP markers for paratuberculosis susceptibility prediction.\u003c/p\u003e","description":"","filename":"floatimage1.png","url":"https://assets-eu.researchsquare.com/files/rs-9421190/v1/cebc9b85a3f60b36836faac0.png"},{"id":108192709,"identity":"d1483d98-d11c-4289-84aa-c1a4e0365d76","added_by":"auto","created_at":"2026-04-30 10:11:47","extension":"png","order_by":2,"title":"Figure 2","display":"","copyAsset":false,"role":"figure","size":271791,"visible":true,"origin":"","legend":"\u003cp\u003eTest performance heatmap displaying raw metric values for all fourteen models across eight evaluation criteria, with cell colour representing min-max normalised 
scores.\u003c/p\u003e","description":"","filename":"floatimage2.png","url":"https://assets-eu.researchsquare.com/files/rs-9421190/v1/8d3cb76e97d2920e929caf8c.png"},{"id":108192696,"identity":"8a08647f-dfc9-4559-8a44-b6adb13bb672","added_by":"auto","created_at":"2026-04-30 10:11:45","extension":"png","order_by":3,"title":"Figure 3","display":"","copyAsset":false,"role":"figure","size":217750,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eCalibration curves for all fourteen machine learning models plotted against the perfectly calibrated reference diagonal.\u003c/strong\u003e\u003c/p\u003e","description":"","filename":"floatimage3.png","url":"https://assets-eu.researchsquare.com/files/rs-9421190/v1/a176e47b125868b11335a5e8.png"},{"id":108192711,"identity":"13b55d75-d796-400a-821d-77e8529cfc72","added_by":"auto","created_at":"2026-04-30 10:11:47","extension":"png","order_by":4,"title":"Figure 4","display":"","copyAsset":false,"role":"figure","size":367785,"visible":true,"origin":"","legend":"\u003cp\u003eTrain-test scatter generalisation map displaying mean training performance (x-axis) against mean test performance (y-axis) for all fourteen models across eight metrics; points above the dashed diagonal indicate higher test than training performance, and points below the diagonal indicate higher training than test performance, consistent with overfitting.\u003c/p\u003e","description":"","filename":"floatimage4.png","url":"https://assets-eu.researchsquare.com/files/rs-9421190/v1/c47c6825cccd86f427ed5c00.png"},{"id":108192727,"identity":"6791d233-b694-4394-b30d-ae4191d815ac","added_by":"auto","created_at":"2026-04-30 10:11:48","extension":"png","order_by":5,"title":"Figure 5","display":"","copyAsset":false,"role":"figure","size":255324,"visible":true,"origin":"","legend":"\u003cp\u003eCV% reliability heatmap displaying the coefficient of variation (standard deviation / mean × 100) for each model-metric combination across repeated stratified 
cross-validation folds; lower values and greener cells indicate greater fold-to-fold stability.\u003c/p\u003e","description":"","filename":"floatimage5.png","url":"https://assets-eu.researchsquare.com/files/rs-9421190/v1/4d41ccdfef277c0c4fb6fb18.png"},{"id":108192764,"identity":"031da715-3da4-4b17-a24c-b96a1737c004","added_by":"auto","created_at":"2026-04-30 10:11:58","extension":"png","order_by":6,"title":"Figure 6","display":"","copyAsset":false,"role":"figure","size":314998,"visible":true,"origin":"","legend":"\u003cp\u003eStability ranking heatmap displaying the absolute cross-validation standard deviation per model-metric combination, with models sorted from most unstable (top) to most stable (bottom) and the final column representing the unweighted mean standard deviation across all eight metrics.\u003c/p\u003e","description":"","filename":"floatimage6.png","url":"https://assets-eu.researchsquare.com/files/rs-9421190/v1/c97f42ab5a70d80b3ca8cd91.png"},{"id":108491002,"identity":"98a34c34-0107-4d32-8656-f230e079e644","added_by":"auto","created_at":"2026-05-05 09:51:02","extension":"png","order_by":7,"title":"Figure 7","display":"","copyAsset":false,"role":"figure","size":147532,"visible":true,"origin":"","legend":"\u003cp\u003eBrier Score–ROC-AUC tradeoff plot displaying the joint calibration-discrimination positioning of all fourteen models in a two-dimensional performance space; error bars represent ±1 SD across cross-validation folds and the ideal region is the upper-left corner.\u003c/p\u003e","description":"","filename":"floatimage7.png","url":"https://assets-eu.researchsquare.com/files/rs-9421190/v1/8c15eede6d3814135f507a12.png"},{"id":108192698,"identity":"d55280bd-5534-4178-a89a-927c5ee2ac6b","added_by":"auto","created_at":"2026-04-30 10:11:45","extension":"png","order_by":8,"title":"Figure 8","display":"","copyAsset":false,"role":"figure","size":118629,"visible":true,"origin":"","legend":"\u003cp\u003eCorrelation between cross-validation persistence 
and test discrimination (left panel) and calibration (right panel) across fourteen machine learning models; Pearson correlation coefficients and associated p-values are shown in panel titles.\u003c/p\u003e","description":"","filename":"floatimage8.png","url":"https://assets-eu.researchsquare.com/files/rs-9421190/v1/78ba55d6e2f13b649a1306fb.png"},{"id":108491010,"identity":"c2cb0dd6-5a3b-486d-be05-b039257775ab","added_by":"auto","created_at":"2026-05-05 09:51:07","extension":"png","order_by":9,"title":"Figure 9","display":"","copyAsset":false,"role":"figure","size":390980,"visible":true,"origin":"","legend":"\u003cp\u003eConfusion matrices for all fourteen models displaying mean true negative (TN), false positive (FP), false negative (FN), and true positive (TP) rates with ±SD across cross-validation folds.\u003c/p\u003e","description":"","filename":"floatimage9.png","url":"https://assets-eu.researchsquare.com/files/rs-9421190/v1/93f45ef48a3f0acff480eaa8.png"},{"id":108192763,"identity":"13201de4-2d49-4bdc-926d-bc13b92a4a13","added_by":"auto","created_at":"2026-04-30 10:11:58","extension":"png","order_by":10,"title":"Figure 10","display":"","copyAsset":false,"role":"figure","size":333618,"visible":true,"origin":"","legend":"\u003cp\u003ePairwise bootstrap significance test results across four ranking metrics (Eff.MCC, AUC-ROC, Brier Score, ECE) for all pairs of the fourteen models; green cells indicate statistically significant differences (p \u0026lt; 0.05) and grey cells indicate non-significant pairs (p ≥ 0.05); based on 10,000 bootstrap iterations, two-sided, α = 0.05.\u003c/p\u003e","description":"","filename":"floatimage10.png","url":"https://assets-eu.researchsquare.com/files/rs-9421190/v1/c10355fd32d1187a6dff14c6.png"},{"id":108491009,"identity":"b4faa91a-9838-484d-87ad-7b0a8428a178","added_by":"auto","created_at":"2026-05-05 09:51:06","extension":"png","order_by":11,"title":"Figure
11","display":"","copyAsset":false,"role":"figure","size":155666,"visible":true,"origin":"","legend":"\u003cp\u003ePairwise significance summary heatmap displaying the count of ranking metrics (out of 4: Eff.MCC, AUC-ROC, Brier Score, ECE) for which each model pair shows a statistically significant difference at α = 0.05.\u003c/p\u003e","description":"","filename":"floatimage11.png","url":"https://assets-eu.researchsquare.com/files/rs-9421190/v1/725aea9036b4809247370bbd.png"},{"id":108192700,"identity":"9de51433-d651-4dfa-844a-ec1b78e5a4db","added_by":"auto","created_at":"2026-04-30 10:11:45","extension":"png","order_by":12,"title":"Figure 12","display":"","copyAsset":false,"role":"figure","size":88411,"visible":true,"origin":"","legend":"\u003cp\u003eComposite performance ranking of fourteen machine learning models based on overall test-set scores across eight evaluation metrics.\u003c/p\u003e","description":"","filename":"floatimage12.png","url":"https://assets-eu.researchsquare.com/files/rs-9421190/v1/04ab5d76e669675397499f57.png"},{"id":108192767,"identity":"8e068fc8-5d87-4706-8dbc-c0f0d878af69","added_by":"auto","created_at":"2026-04-30 10:12:00","extension":"png","order_by":13,"title":"Figure 13","display":"","copyAsset":false,"role":"figure","size":149078,"visible":true,"origin":"","legend":"\u003cp\u003eBrier score across four prevalence scenarios following post-hoc Albert offset recalibration; the gold column indicates the estimated true MAP seroprevalence (π = 0.1407).\u003c/p\u003e","description":"","filename":"floatimage13.png","url":"https://assets-eu.researchsquare.com/files/rs-9421190/v1/6935bfc2943c755da7764cfc.png"},{"id":108192701,"identity":"e8ce77e9-01e7-47f1-b257-32277bbc2a21","added_by":"auto","created_at":"2026-04-30 10:11:45","extension":"png","order_by":14,"title":"Figure 14","display":"","copyAsset":false,"role":"figure","size":152942,"visible":true,"origin":"","legend":"\u003cp\u003eExpected Calibration Error (ECE) across four prevalence 
scenarios following post-hoc Albert offset recalibration.\u003c/p\u003e","description":"","filename":"floatimage14.png","url":"https://assets-eu.researchsquare.com/files/rs-9421190/v1/e1d4be18e4fdafed4edaa4d5.png"},{"id":109081168,"identity":"39918d3e-3ddb-4556-bf4e-ca59b314ebf3","added_by":"auto","created_at":"2026-05-12 12:03:09","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":3015388,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-9421190/v1/cf74afaf-b676-4d99-b27f-77cae9401b5a.pdf"}],"financialInterests":"No competing interests reported.","formattedTitle":"A Machine Learning Framework for Genomic Prediction of Paratuberculosis Predisposition in Goats: Discrimination–Calibration Dissociation Across Learning Architectures","fulltext":[{"header":"Introduction","content":"\u003cp\u003eThe challenge of predicting phenotypic outcomes from genomic variation has transformed both agricultural breeding and precision medicine, from genomic selection (GS) accelerating genetic improvement in livestock to polygenic risk scores (PRS) assessing disease susceptibility in humans [1, 2]. GS has revolutionised genetic improvement by leveraging dense genome-wide markers to predict breeding values and accelerate genetic progress, traditionally relying on parametric linear mixed models such as genomic best linear unbiased predictor (GBLUP) and Bayesian regression approaches [3\u0026ndash;5]. A shared challenge across both paradigms is that conventional parametric models, while tractable under additive genetic architectures, may inadequately represent the non-linear, epistatic, and context-dependent relationships that characterise high-dimensional genomic data [6, 7]. 
Standard GBLUP and Bayesian regression implementations predominantly \u0026mdash; and in most operational deployments exclusively \u0026mdash; model additive SNP substitution effects, leaving dominance deviations and inter-locus epistatic interactions either unmodeled or absorbed into the residual variance [7]. This structural constraint carries predictive consequences that are architecture-dependent: systematic benchmarking across 14 prediction models demonstrates that parametric methods yield superior accuracy under purely additive gene action but are consistently outperformed by non-parametric alternatives when epistasis underlies phenotypic variation, with inadequate architectural representation producing measurable prediction bias [6]. Fitness-related traits \u0026mdash; including immune responsiveness and infectious disease susceptibility \u0026mdash; represent the class for which non-additive genetic variance is expected to be disproportionately relevant [7], directly motivating the evaluation of architecturally flexible alternatives in the present paratuberculosis resistance context. Critically, analogous constraints extend to human disease genetics, where polygenic risk scores derived from parametric linear frameworks demonstrate reduced predictive portability across ancestral groups and underestimate phenotypic variance attributable to non-additive and population-specific genetic variation, underscoring the need for architecturally flexible and ancestry-aware prediction frameworks broadly [8\u0026ndash;11].\u003c/p\u003e \u003cp\u003eClassical genomic prediction models such as GBLUP and Bayesian regression primarily capture additive genetic effects, limiting their ability to represent the full biological complexity of high-dimensional data [3, 12, 13]. These formulations often overlook non-linear relationships, dominance, epistasis, and higher-order interactions pervasive in complex traits [3, 13, 14]. 
Machine learning (ML) and deep learning (DL) approaches have emerged as flexible, model-free alternatives capable of learning genotype\u0026ndash;phenotype mappings without rigid assumptions about the underlying genetic architecture [13, 15, 16], with particular advantages in capturing non-linear and high-order effects and integrating diverse data modalities [12, 17, 18]. However, these gains come at the cost of increased computational demands, reduced interpretability, and greater sensitivity to data quality and sample size [3, 12]. Importantly, emerging comparative benchmarks across binary disease and health trait prediction contexts \u0026mdash; including populations of moderate size where additive genetic architecture predominates \u0026mdash; have repeatedly found GBLUP to perform indistinguishably from structurally more complex ML architectures such as RF, SVM, XGBoost, and MLP [19], raising fundamental questions about when model complexity translates into practical predictive gain.\u003c/p\u003e \u003cp\u003eParatuberculosis (PTB), or Johne\u0026rsquo;s disease, is a pervasive, incurable chronic granulomatous enteritis affecting ruminants worldwide [20]. Its etiological agent, Mycobacterium avium subsp. paratuberculosis (MAP), causes substantial economic losses through decreased milk yield, impaired fertility, and premature culling [21, 22]. Beyond its agricultural impact, MAP is recognised as a growing zoonotic threat with epidemiological links to Crohn\u0026rsquo;s disease and autoimmune conditions in humans [23, 24], facilitated by MAP\u0026rsquo;s ability to survive pasteurisation and persist in the environment [25, 26]. Elucidating the genetic architecture of host resistance is therefore imperative for addressing this \u0026ldquo;One Health\u0026rdquo; challenge [23].\u003c/p\u003e \u003cp\u003eParatuberculosis resistance is polygenic, with host genotypes significantly determining MAP infection trajectories [27]. 
In dairy cattle, heritability estimates range from 0.03 to 0.27, and GWAS have identified QTLs across nearly all bovine chromosomes, with robust signals on BTA23 (MHC region), BTA3, and BTA5 [28, 29]. Candidate genes including ATG4D (autophagy\u0026ndash;MAP clearance) and LRP1 (inflammatory modulation) have been characterised, and expression QTL analyses have identified regulatory variants influencing macrophage activation [29\u0026ndash;31]. In small ruminants, goats exhibit higher clinical susceptibility than sheep [32]. A significant QTL on OAR20 explains up to 18% of the genetic variance in antibody response in sheep [20]; targeted studies have identified SLC11A1 microsatellite variants [33, 34] and a TLR2 mutation conferring 6.6-fold increased resistance in Turkish sheep [35]. In goats, polymorphisms within the SLC11A1 B7 allele remain pivotal determinants of infection risk [33].\u003c/p\u003e \u003cp\u003eHere, we systematically benchmarked 14 genomic prediction frameworks for paratuberculosis resistance classification in goats \u0026mdash; a trait with complex polygenic architecture and significant agricultural and public health implications. Using the Illumina 65K Goat BeadChip, we applied mutual information-based dimensionality reduction to extract a refined 5,000-SNP feature set \u0026mdash; an information-theoretic selection strategy whose utility for livestock genomic prediction has been independently validated [36] \u0026mdash; then subjected this substrate to comparative evaluation spanning parametric linear models, GBLUP, kernel-based classifiers, tree-based ensembles, deep neural networks, and meta-ensemble strategies across 474 animals from seven indigenous Turkish goat breeds. 
Through repeated cross-validation and comprehensive statistical testing incorporating both discriminative and calibration-aware metrics [37], we dissect the relative merits of model-free flexibility versus parametric parsimony, revealing which algorithmic strategies most effectively capture genetic signatures of host\u0026ndash;pathogen interactions and providing a methodological template applicable to other polygenic disease traits in livestock [12, 13, 38\u0026ndash;42].\u003c/p\u003e"},{"header":"Methods","content":"\u003cdiv id=\"Sec3\" class=\"Section2\"\u003e \u003ch2\u003eStudy population and phenotype data\u003c/h2\u003e \u003cp\u003eSampling was conducted across 36 herds in 11 provinces of Turkey, encompassing seven indigenous goat breeds: Hairy, Honamlı, Damascus, Angora, Kilis, Turkish Saanen, and Maltese. A total of 3,069 animals were screened, yielding a mean MAP seroprevalence of 14.07%. From this population, 474 animals were selected for genotyping with a balanced case\u0026ndash;control composition of 237 seropositive cases and 237 seronegative controls, stratified to ensure representation of all seven breeds. Disease status was determined using a commercially validated ELISA assay for anti-Mycobacterium avium subsp. paratuberculosis antibodies. Phenotype data were encoded as binary outcomes (0 for controls, 1 for cases) for compatibility with classification algorithms and probabilistic modelling frameworks.\u003c/p\u003e \u003c/div\u003e\n\u003ch3\u003eGenotyping and quality control\u003c/h3\u003e\n\u003cp\u003eGenomic DNA was genotyped using the Illumina 65K Goat BeadChip, interrogating approximately 65,000 SNPs distributed across the caprine genome. Quality control was implemented using PLINK v1.9 [43]. Sex chromosome markers and SNPs lacking chromosome assignments or physical positions were excluded. 
Markers and samples failing any of the following thresholds were removed: SNP call rate\u0026thinsp;\u0026lt;\u0026thinsp;95%, individual call rate\u0026thinsp;\u0026lt;\u0026thinsp;90%, minor allele frequency\u0026thinsp;\u0026lt;\u0026thinsp;0.05, and Hardy\u0026ndash;Weinberg equilibrium deviation at \u003cem\u003ep\u003c/em\u003e\u0026thinsp;\u0026lt;\u0026thinsp;5\u0026times;10⁻⁵. Following filtering, 44,375 high-quality SNPs were retained. Residual missing genotypes were imputed using Beagle v5.0 [44] on a per-chromosome basis, and LD pruning was applied in PLINK v1.9 [43] to mitigate multicollinearity.\u003c/p\u003e\n\u003ch3\u003eFeature preselection via mutual information\u003c/h3\u003e\n\u003cp\u003eMutual information (MI) was implemented as a filter-based feature preselection method to quantify the statistical dependence between individual SNP genotypes and the binary phenotype. Unlike correlation-based measures, MI captures both linear and non-linear associations and remains invariant to monotonic transformations [45\u0026ndash;47]. MI was estimated with a non-parametric k-nearest-neighbour (kNN) estimator, a distribution-free approach suited to discrete\u0026ndash;discrete genotype\u0026ndash;phenotype relationships [45, 47\u0026ndash;49] that has shown robustness against model misspecification and non-linearity in high-dimensional genomic contexts [47, 50, 51]. MI-based preselection concentrates computational effort on the most informative SNP subsets, enabling more reliable detection of non-additive effects where relevant [49, 51].\u003c/p\u003e \u003cp\u003eTo prevent information leakage, MI ranking was performed strictly within each training partition. For every fold, MI scores were computed from the training subset only, and the top 5,000 markers (approximately 11.3% of the 44,375-SNP panel) were retained for model training; the same marker set was applied unchanged to the corresponding validation set.
Nesting feature selection within cross-validation preserved the integrity of out-of-sample performance evaluation across all predictive frameworks.\u003c/p\u003e\n\u003ch3\u003eMachine learning framework\u003c/h3\u003e\n\u003cp\u003eFourteen classification algorithms were implemented across five methodological families: (i) \u003cem\u003eparametric linear models\u003c/em\u003e \u0026mdash; Ridge Regression (L2), LASSO (L1), Elastic Net (L1/L2), and GBLUP, which fits a linear mixed model using a genomic relationship matrix derived from SNP markers; (ii) \u003cem\u003ekernel-based classifiers\u003c/em\u003e \u0026mdash; SVC-Linear and SVC-RBF; (iii) \u003cem\u003etree-based ensembles\u003c/em\u003e \u0026mdash; Random Forest, Gradient Boosting, AdaBoost (SAMME.R), XGBoost, and LightGBM; (iv) \u003cem\u003edeep learning\u003c/em\u003e \u0026mdash; an MLP with batch normalisation and dropout regularisation, and a CNN with one-dimensional convolutional layers; (v) \u003cem\u003emeta-learning\u003c/em\u003e \u0026mdash; a Stacking ensemble implemented via scikit-learn's StackingClassifier with three heterogeneous base learners (SVC-RBF, Random Forest, Logistic Regression) whose out-of-fold predicted probabilities served as the input matrix for a Logistic Regression meta-learner. Meta-learner regularisation (C\u0026thinsp;=\u0026thinsp;1/α; α \u0026isin; [0.001, 500]) was optimised via Bayesian search. Out-of-fold training of base learners prevents information leakage from training into meta-learning.\u003c/p\u003e\n\u003ch3\u003eHyperparameter optimisation\u003c/h3\u003e\n\u003cp\u003eBayesian hyperparameter optimisation was conducted using BayesSearchCV (scikit-optimize) for all models except PyTorch-based architectures, with a composite scoring function combining ROC-AUC (60% weight) and accuracy (40%), inner 5-fold stratified cross-validation, and 120 iterations per model. 
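A minimal sketch of the composite tuning objective described above, assuming it is a plain weighted sum of the two metrics (the text states only the 60/40 weights). The function names, the 0.5 decision threshold, and the toy labels are illustrative assumptions, and ROC-AUC is computed here from pairwise comparisons rather than through a library call.

```python
def roc_auc(y_true, scores):
    '''ROC-AUC as the share of (case, control) pairs ranked correctly; ties count 0.5.'''
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def accuracy(y_true, scores, threshold=0.5):
    '''Share of samples whose thresholded score matches the label.'''
    hits = sum(int(s >= threshold) == y for s, y in zip(scores, y_true))
    return hits / len(y_true)


def composite_score(y_true, scores, w_auc=0.6, w_acc=0.4):
    '''Weighted blend of ROC-AUC and accuracy (weights taken from the text).'''
    return w_auc * roc_auc(y_true, scores) + w_acc * accuracy(y_true, scores)


# Toy example (hypothetical labels and predicted probabilities).
y_true = [0, 0, 1, 1]
scores = [0.1, 0.6, 0.4, 0.9]
example = composite_score(y_true, scores)
```

In this toy case three of the four case-control pairs are ordered correctly (AUC 0.75) and two of the four samples are classified correctly (accuracy 0.5), so the composite is about 0.65. A search routine that accepts a custom scoring callable could maximise a function of this form.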
Regularised linear classifiers searched regularisation strength across five orders of magnitude; SVCs searched regularisation strength and kernel bandwidth; tree-based models searched ensemble size, learning rate, tree depth, subsampling, and regularisation parameters; the MLP optimised 13 hyperparameters including hidden layer configuration, activation function, learning rate schedule, and early stopping patience.\u003c/p\u003e \u003cp\u003eFor GBLUP, the search space was deliberately broad \u0026mdash; encompassing kernel type (linear, RBF, polynomial), optional eigendecomposition-based dimensionality reduction, and probability calibration method \u0026mdash; to allow data-driven selection of the optimal covariance structure rather than imposing additive assumptions a priori. Bayesian optimisation consistently converged on the linear kernel without dimensionality reduction, indicating that the additive genomic relationship matrix provided the most effective covariance representation for this trait\u0026ndash;sample configuration. The final GBLUP model operated as a conventional additive genomic prediction framework with Platt-scaled probability calibration \u0026mdash; an outcome that empirically corroborates the additive convergence interpretation reported in the Discussion. For CNN and MLP, incompatibility with BayesSearchCV necessitated a three-stage hierarchical grid search: coarse sampling, refinement around the optimum, and full-resolution search within a narrow neighbourhood of the stage-2 solution.\u003c/p\u003e \u003cdiv id=\"Sec8\" class=\"Section2\"\u003e \u003ch2\u003eCross-validation strategy\u003c/h2\u003e \u003cp\u003eModel performance was evaluated under repeated stratified k-fold cross-validation (k\u0026thinsp;=\u0026thinsp;5, 10 independent repetitions). In each repetition, data were reshuffled with a different random seed and a new 5-fold split generated; each fold served as the validation set once per repetition. 
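The repeated stratified scheme can be sketched as follows. This is a hand-rolled round-robin split written for illustration, not the splitter used in the study; the function name and seeds are assumptions, and a library implementation may allocate remainders to folds differently.

```python
import random


def repeated_stratified_folds(labels, k=5, repeats=10, seed=0):
    '''Yield (train_idx, val_idx) for each fold of each repetition.'''
    classes = sorted(set(labels))
    for rep in range(repeats):
        rng = random.Random(seed + rep)  # a different shuffle per repetition
        folds = [[] for _ in range(k)]
        for c in classes:
            members = [i for i, y in enumerate(labels) if y == c]
            rng.shuffle(members)
            # Deal each class round-robin so every fold keeps the class balance.
            for pos, idx in enumerate(members):
                folds[pos % k].append(idx)
        for f in range(k):
            val = sorted(folds[f])
            train = sorted(i for g in range(k) if g != f for i in folds[g])
            yield train, val


# 237 controls (0) and 237 cases (1), matching the balanced design in the text.
labels = [0] * 237 + [1] * 237
splits = list(repeated_stratified_folds(labels))
n_folds = len(splits)
val_sizes = sorted({len(val) for _, val in splits})
train_sizes = sorted({len(train) for train, _ in splits})
```

With these settings the sketch yields 50 validation folds of 94 or 96 animals each, with 378 or 380 animals left for training, in line with the roughly 80/20 partition used in the study.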
Models were trained on approximately 80% of the data (n\u0026thinsp;\u0026asymp;\u0026thinsp;379) and validated on the remaining 20% (n\u0026thinsp;\u0026asymp;\u0026thinsp;95). All reported performance metrics represent means across 10 \u0026times; 5\u0026thinsp;=\u0026thinsp;50 validation folds.\u003c/p\u003e \u003c/div\u003e\n\u003ch3\u003eGoodness-of-Fit composite scoring (GoF)\u003c/h3\u003e\n\u003cp\u003eAn integrated model ranking was derived by computing a composite GoF score for each framework. For every test metric, observed mean values were min-max normalised across all 14 models to a [0, 1] interval; for Brier Score, normalisation was inverted so that lower raw values yielded higher normalised scores. The GoF score was defined as the unweighted mean of all eight per-metric normalised values (ROC-AUC, Accuracy, Sensitivity, Specificity, Precision, F1 Score, Persistence, Brier Score), yielding a single scalar in [0, 1] that jointly rewards discrimination, calibration, and cross-fold stability without privileging any individual metric. Within this formulation, calibration is represented by Brier Score alone among eight equal-weight components (effective calibration weight: 12.5%); users with calibration-dominant decision objectives should consider explicitly reweighted composites. All composite scoring and visualisation were implemented using the Goodness-of-Fit Analyzer via a custom Python script.\u003c/p\u003e\n\u003ch3\u003ePost-hoc prevalence recalibration\u003c/h3\u003e\n\u003cp\u003eTo assess the generalisability of calibration findings to field deployment conditions, Albert offset recalibration was applied post-hoc to all fourteen models. 
For each target prevalence π_field, the logit-scale intercept was shifted by Δ\u0026thinsp;=\u0026thinsp;log[π_field/(1\u0026thinsp;\u0026minus;\u0026thinsp;π_field)]\u0026thinsp;\u0026minus;\u0026thinsp;log[π_train/(1\u0026thinsp;\u0026minus;\u0026thinsp;π_train)], where π_train\u0026thinsp;=\u0026thinsp;0.50 reflects the balanced 1:1 case\u0026ndash;control training design. Four target prevalence scenarios were evaluated: π\u0026thinsp;=\u0026thinsp;0.25, π\u0026thinsp;=\u0026thinsp;0.1407 (the estimated true MAP seroprevalence of the study population), π\u0026thinsp;=\u0026thinsp;0.10, and π\u0026thinsp;=\u0026thinsp;0.05, collectively spanning the plausible range of MAP seroprevalence in Turkish goat populations. For each scenario, Brier score and Expected Calibration Error (ECE) were computed on the recalibrated probability outputs. ROC-AUC was additionally recorded across all scenarios to verify the rank-preserving property of the logit-intercept transformation.\u003c/p\u003e \u003cdiv id=\"Sec11\" class=\"Section2\"\u003e \u003ch2\u003eComputational implementation\u003c/h2\u003e \u003cp\u003eAll analyses were implemented in Python 3.8+ using scikit-learn (v1.3+), scikit-optimize, PyTorch (v2.0+), XGBoost (v2.0+), LightGBM (v4.0+), NumPy, SciPy, pandas, and Matplotlib. 
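As a worked illustration of the prevalence recalibration defined in the preceding subsection, the sketch below applies the logit-intercept shift to a few predicted probabilities. The helper names and the example probabilities are hypothetical; only the shift formula, the training prevalence of 0.50, and the four target prevalences come from the text.

```python
import math


def logit(p):
    '''Log-odds of a probability strictly between 0 and 1.'''
    return math.log(p / (1.0 - p))


def sigmoid(z):
    '''Inverse of the logit.'''
    return 1.0 / (1.0 + math.exp(-z))


def recalibrate(p_model, pi_field, pi_train=0.5):
    '''Shift a predicted probability from training prevalence to field prevalence.'''
    delta = logit(pi_field) - logit(pi_train)
    return sigmoid(logit(p_model) + delta)


# The four target prevalences listed in the text.
scenarios = [0.25, 0.1407, 0.10, 0.05]
# Hypothetical model outputs under the balanced design.
p_balanced = [0.20, 0.50, 0.90]
shifted = {pi: [recalibrate(p, pi) for p in p_balanced] for pi in scenarios}
```

A prediction of 0.50 under the balanced design maps to the target prevalence itself (for example 0.1407). Because the shift is a monotone transformation, the ordering of animals, and therefore ROC-AUC, is unchanged.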
All BayesSearchCV optimisations were executed with n_jobs\u0026thinsp;=\u0026thinsp;\u0026minus;1, leveraging full multi-core parallelisation.\u003c/p\u003e \u003c/div\u003e"},{"header":"Results","content":"\u003cdiv id=\"Sec13\" class=\"Section2\"\u003e \u003ch2\u003eDimensionality reduction via mutual information\u003c/h2\u003e \u003cp\u003eMutual information (MI) analysis applied to 44,375 quality-controlled SNP markers yielded a markedly right-skewed, heavy-tailed score distribution, with the majority of markers concentrated in a low-MI mode approaching zero and a minority occupying a progressively informative upper tail, a structure consistent with a polygenic architecture in which variance is distributed across many markers of individually small effect. The top 5,000 SNPs, constituting 11.3% of the quality-controlled panel, were selected as the prediction feature space, reducing the genotype matrix from n\u0026thinsp;=\u0026thinsp;474 \u0026times; p\u0026thinsp;=\u0026thinsp;44,375 to n\u0026thinsp;=\u0026thinsp;474 \u0026times; p\u0026thinsp;=\u0026thinsp;5,000 (Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e). This information-theoretic selection strategy, which removes uninformative markers while retaining those carrying genuine phenotypic signal, is consistent with recent evidence demonstrating that MI-based SNP filtering substantially improves downstream genomic prediction accuracy relative to whole-panel approaches in livestock [36]. MI-based ranking was computed exclusively within each training partition under the nested cross-validation strategy, so that all models were evaluated on the same fold-specific feature set.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003e \u003cem\u003eViolin plot depicting the MI score distribution across 44,375 genome-wide markers. 
The top 5,000 SNPs occupying the high-MI tail were retained for subsequent analyses. MI quantifies the reduction in uncertainty about the binary phenotype given SNP genotype, ranging from 0 (statistical independence) to min[H(X), H(Y)] (perfect dependence).\u003c/em\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec14\" class=\"Section2\"\u003e \u003ch2\u003eModel performance and three-tier hierarchy\u003c/h2\u003e \u003cp\u003eFourteen machine learning models were evaluated across eight performance metrics under repeated stratified cross-validation (Table\u0026nbsp;\u003cspan refid=\"Tab1\" class=\"InternalRef\"\u003e1\u003c/span\u003e), revealing a three-tier performance hierarchy shown as both raw and min-max normalised values in Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003e. The top tier \u0026mdash; Stacking, LASSO, ElasticNet, SVC-RBF, Ridge, GBLUP, and SVC-Linear \u0026mdash; achieved ROC-AUC values exceeding 0.970, with the six models other than Stacking spanning a compressed AUC range of only 0.0015 (0.9808\u0026ndash;0.9823). Within this tier, Stacking led across all metrics except ROC-AUC: accuracy (0.9453), sensitivity (0.9382), specificity (0.9539), precision (0.9571), F1 score (0.9471), persistence (0.9041), and Brier score (0.0460), the lowest in the study. This composite superiority of meta-ensemble learning is consistent with recent livestock genomic prediction benchmarks demonstrating that heterogeneous stacking frameworks integrating SNP feature selection with diverse base learners consistently outperform individual models across independent validation populations [36, 38]. 
GBLUP's competitive standing within this upper performance tier, comparable to regularised linear and kernel-based models despite its parametric simplicity, aligns with recent systematic comparisons for binary health trait prediction demonstrating that GBLUP performs indistinguishably from structurally more complex ML architectures \u0026mdash; including RF, SVM, XGBoost, and MLP \u0026mdash; in populations of moderate size [19]. The intermediate band \u0026mdash; PyTorchCNN (AUC: 0.9264; Brier: 0.0747) and PyTorchMLP (AUC: 0.9107; Brier: 0.0895) \u0026mdash; was distinguished by declining GoF scores alongside meaningfully wider standard deviations, particularly in persistence (PyTorchCNN SD: 0.0474; PyTorchMLP SD: 0.0389). The bottom five tree-based models displayed progressively declining GoF scores with two structurally distinct failure signatures: XGBoost paired a comparatively high sensitivity (0.8616) with the lowest specificity in the table (0.7115, Δ\u0026thinsp;=\u0026thinsp;0.1501) \u0026mdash; the sharpest intra-row metric divergence across all fourteen models \u0026mdash; while RandomForest combined a competitive ROC-AUC (0.9552) with the highest Brier score (0.2229), numerically encapsulating the discrimination\u0026ndash;calibration dissociation that characterises tree-based ensembles. 
GradientBoosting recorded the most uniformly depressed values, with accuracy (0.7789), specificity (0.7666), F1 score (0.7986), and persistence (0.6153) all among the lowest in the comparison, and the Persistence column tracked the GoF ranking with particularly high fidelity across all three tiers.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab1\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 1\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003ePredictive performance of fourteen machine learning models evaluated across eight classification metrics under repeated stratified cross-validation.\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"9\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\"\u0026plusmn;\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\"\u0026plusmn;\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\"\u0026plusmn;\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\"\u0026plusmn;\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\"\u0026plusmn;\" class=\"colspec\" colname=\"c6\" colnum=\"6\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\"\u0026plusmn;\" class=\"colspec\" colname=\"c7\" colnum=\"7\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\"\u0026plusmn;\" class=\"colspec\" colname=\"c8\" colnum=\"8\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\"\u0026plusmn;\" class=\"colspec\" colname=\"c9\" colnum=\"9\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eModel\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e 
\u003cp\u003eROC-AUC\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eAccuracy\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eSensitivity\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c5\"\u003e \u003cp\u003eSpecificity\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c6\"\u003e \u003cp\u003ePrecision\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c7\"\u003e \u003cp\u003eF1 Score\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c8\"\u003e \u003cp\u003ePersistence\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c9\"\u003e \u003cp\u003eBrier Score\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eStacking\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c2\"\u003e \u003cp\u003e0.9770\u0026thinsp;\u0026plusmn;\u0026thinsp;0.0164\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c3\"\u003e \u003cp\u003e0.9453\u0026thinsp;\u0026plusmn;\u0026thinsp;0.0199\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c4\"\u003e \u003cp\u003e0.9382\u0026thinsp;\u0026plusmn;\u0026thinsp;0.0250\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c5\"\u003e \u003cp\u003e0.9539\u0026thinsp;\u0026plusmn;\u0026thinsp;0.0343\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c6\"\u003e \u003cp\u003e0.9571\u0026thinsp;\u0026plusmn;\u0026thinsp;0.0315\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c7\"\u003e \u003cp\u003e0.9471\u0026thinsp;\u0026plusmn;\u0026thinsp;0.0200\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c8\"\u003e 
\u003cp\u003e0.9041\u0026thinsp;\u0026plusmn;\u0026thinsp;0.0284\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c9\"\u003e \u003cp\u003e0.0460\u0026thinsp;\u0026plusmn;\u0026thinsp;0.0133\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eLASSO\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c2\"\u003e \u003cp\u003e0.9823\u0026thinsp;\u0026plusmn;\u0026thinsp;0.0079\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c3\"\u003e \u003cp\u003e0.9337\u0026thinsp;\u0026plusmn;\u0026thinsp;0.0142\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c4\"\u003e \u003cp\u003e0.9321\u0026thinsp;\u0026plusmn;\u0026thinsp;0.0278\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c5\"\u003e \u003cp\u003e0.9340\u0026thinsp;\u0026plusmn;\u0026thinsp;0.0215\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c6\"\u003e \u003cp\u003e0.9419\u0026thinsp;\u0026plusmn;\u0026thinsp;0.0194\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c7\"\u003e \u003cp\u003e0.9366\u0026thinsp;\u0026plusmn;\u0026thinsp;0.0154\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c8\"\u003e \u003cp\u003e0.8950\u0026thinsp;\u0026plusmn;\u0026thinsp;0.0216\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c9\"\u003e \u003cp\u003e0.0498\u0026thinsp;\u0026plusmn;\u0026thinsp;0.0096\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eElasticNet\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c2\"\u003e \u003cp\u003e0.9821\u0026thinsp;\u0026plusmn;\u0026thinsp;0.0081\u003c/p\u003e \u003c/td\u003e \u003ctd 
align=\"char\" char=\"\u0026plusmn;\" colname=\"c3\"\u003e \u003cp\u003e0.9337\u0026thinsp;\u0026plusmn;\u0026thinsp;0.0142\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c4\"\u003e \u003cp\u003e0.9321\u0026thinsp;\u0026plusmn;\u0026thinsp;0.0243\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c5\"\u003e \u003cp\u003e0.9340\u0026thinsp;\u0026plusmn;\u0026thinsp;0.0215\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c6\"\u003e \u003cp\u003e0.9418\u0026thinsp;\u0026plusmn;\u0026thinsp;0.0196\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c7\"\u003e \u003cp\u003e0.9366\u0026thinsp;\u0026plusmn;\u0026thinsp;0.0150\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c8\"\u003e \u003cp\u003e0.8948\u0026thinsp;\u0026plusmn;\u0026thinsp;0.0214\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c9\"\u003e \u003cp\u003e0.0499\u0026thinsp;\u0026plusmn;\u0026thinsp;0.0094\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eSVC-RBF\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c2\"\u003e \u003cp\u003e0.9808\u0026thinsp;\u0026plusmn;\u0026thinsp;0.0081\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c3\"\u003e \u003cp\u003e0.9326\u0026thinsp;\u0026plusmn;\u0026thinsp;0.0164\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c4\"\u003e \u003cp\u003e0.9294\u0026thinsp;\u0026plusmn;\u0026thinsp;0.0297\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c5\"\u003e \u003cp\u003e0.9379\u0026thinsp;\u0026plusmn;\u0026thinsp;0.0249\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c6\"\u003e 
\u003cp\u003e0.9431\u0026thinsp;\u0026plusmn;\u0026thinsp;0.0264\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c7\"\u003e \u003cp\u003e0.9357\u0026thinsp;\u0026plusmn;\u0026thinsp;0.0166\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c8\"\u003e \u003cp\u003e0.8934\u0026thinsp;\u0026plusmn;\u0026thinsp;0.0204\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c9\"\u003e \u003cp\u003e0.0505\u0026thinsp;\u0026plusmn;\u0026thinsp;0.0094\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eRidge\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c2\"\u003e \u003cp\u003e0.9815\u0026thinsp;\u0026plusmn;\u0026thinsp;0.0077\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c3\"\u003e \u003cp\u003e0.9337\u0026thinsp;\u0026plusmn;\u0026thinsp;0.0170\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c4\"\u003e \u003cp\u003e0.9299\u0026thinsp;\u0026plusmn;\u0026thinsp;0.0269\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c5\"\u003e \u003cp\u003e0.9366\u0026thinsp;\u0026plusmn;\u0026thinsp;0.0216\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c6\"\u003e \u003cp\u003e0.9433\u0026thinsp;\u0026plusmn;\u0026thinsp;0.0210\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c7\"\u003e \u003cp\u003e0.9363\u0026thinsp;\u0026plusmn;\u0026thinsp;0.0185\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c8\"\u003e \u003cp\u003e0.8879\u0026thinsp;\u0026plusmn;\u0026thinsp;0.0194\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c9\"\u003e \u003cp\u003e0.0594\u0026thinsp;\u0026plusmn;\u0026thinsp;0.0127\u003c/p\u003e 
\u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eGBLUP\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c2\"\u003e \u003cp\u003e0.9813\u0026thinsp;\u0026plusmn;\u0026thinsp;0.0085\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c3\"\u003e \u003cp\u003e0.9326\u0026thinsp;\u0026plusmn;\u0026thinsp;0.0184\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c4\"\u003e \u003cp\u003e0.9228\u0026thinsp;\u0026plusmn;\u0026thinsp;0.0248\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c5\"\u003e \u003cp\u003e0.9453\u0026thinsp;\u0026plusmn;\u0026thinsp;0.0290\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c6\"\u003e \u003cp\u003e0.9481\u0026thinsp;\u0026plusmn;\u0026thinsp;0.0308\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c7\"\u003e \u003cp\u003e0.9349\u0026thinsp;\u0026plusmn;\u0026thinsp;0.0201\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c8\"\u003e \u003cp\u003e0.8851\u0026thinsp;\u0026plusmn;\u0026thinsp;0.0185\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c9\"\u003e \u003cp\u003e0.0634\u0026thinsp;\u0026plusmn;\u0026thinsp;0.0074\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eSVC-Linear\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c2\"\u003e \u003cp\u003e0.9822\u0026thinsp;\u0026plusmn;\u0026thinsp;0.0072\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c3\"\u003e \u003cp\u003e0.9284\u0026thinsp;\u0026plusmn;\u0026thinsp;0.0193\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c4\"\u003e 
\u003cp\u003e0.9211\u0026thinsp;\u0026plusmn;\u0026thinsp;0.0349\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c5\"\u003e \u003cp\u003e0.9379\u0026thinsp;\u0026plusmn;\u0026thinsp;0.0249\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c6\"\u003e \u003cp\u003e0.9425\u0026thinsp;\u0026plusmn;\u0026thinsp;0.0272\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c7\"\u003e \u003cp\u003e0.9311\u0026thinsp;\u0026plusmn;\u0026thinsp;0.0205\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c8\"\u003e \u003cp\u003e0.8916\u0026thinsp;\u0026plusmn;\u0026thinsp;0.0224\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c9\"\u003e \u003cp\u003e0.0513\u0026thinsp;\u0026plusmn;\u0026thinsp;0.0103\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003ePyTorchCNN\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c2\"\u003e \u003cp\u003e0.9264\u0026thinsp;\u0026plusmn;\u0026thinsp;0.0229\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c3\"\u003e \u003cp\u003e0.9253\u0026thinsp;\u0026plusmn;\u0026thinsp;0.0237\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c4\"\u003e \u003cp\u003e0.9114\u0026thinsp;\u0026plusmn;\u0026thinsp;0.0445\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c5\"\u003e \u003cp\u003e0.9413\u0026thinsp;\u0026plusmn;\u0026thinsp;0.0295\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c6\"\u003e \u003cp\u003e0.9465\u0026thinsp;\u0026plusmn;\u0026thinsp;0.0265\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c7\"\u003e 
\u003cp\u003e0.9278\u0026thinsp;\u0026plusmn;\u0026thinsp;0.0249\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c8\"\u003e \u003cp\u003e0.8512\u0026thinsp;\u0026plusmn;\u0026thinsp;0.0474\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c9\"\u003e \u003cp\u003e0.0747\u0026thinsp;\u0026plusmn;\u0026thinsp;0.0237\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003ePyTorchMLP\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c2\"\u003e \u003cp\u003e0.9107\u0026thinsp;\u0026plusmn;\u0026thinsp;0.0202\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c3\"\u003e \u003cp\u003e0.9105\u0026thinsp;\u0026plusmn;\u0026thinsp;0.0201\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c4\"\u003e \u003cp\u003e0.9085\u0026thinsp;\u0026plusmn;\u0026thinsp;0.0350\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c5\"\u003e \u003cp\u003e0.9128\u0026thinsp;\u0026plusmn;\u0026thinsp;0.0346\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c6\"\u003e \u003cp\u003e0.9213\u0026thinsp;\u0026plusmn;\u0026thinsp;0.0327\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c7\"\u003e \u003cp\u003e0.9142\u0026thinsp;\u0026plusmn;\u0026thinsp;0.0229\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c8\"\u003e \u003cp\u003e0.8205\u0026thinsp;\u0026plusmn;\u0026thinsp;0.0389\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c9\"\u003e \u003cp\u003e0.0895\u0026thinsp;\u0026plusmn;\u0026thinsp;0.0201\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eRandomForest\u003c/p\u003e \u003c/td\u003e \u003ctd 
align=\"char\" char=\"\u0026plusmn;\" colname=\"c2\"\u003e \u003cp\u003e0.9552\u0026thinsp;\u0026plusmn;\u0026thinsp;0.0172\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c3\"\u003e \u003cp\u003e0.8926\u0026thinsp;\u0026plusmn;\u0026thinsp;0.0234\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c4\"\u003e \u003cp\u003e0.8784\u0026thinsp;\u0026plusmn;\u0026thinsp;0.0450\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c5\"\u003e \u003cp\u003e0.9145\u0026thinsp;\u0026plusmn;\u0026thinsp;0.0383\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c6\"\u003e \u003cp\u003e0.9190\u0026thinsp;\u0026plusmn;\u0026thinsp;0.0405\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c7\"\u003e \u003cp\u003e0.8967\u0026thinsp;\u0026plusmn;\u0026thinsp;0.0214\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c8\"\u003e \u003cp\u003e0.7845\u0026thinsp;\u0026plusmn;\u0026thinsp;0.0278\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c9\"\u003e \u003cp\u003e0.2229\u0026thinsp;\u0026plusmn;\u0026thinsp;0.0033\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eAdaBoost\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c2\"\u003e \u003cp\u003e0.9312\u0026thinsp;\u0026plusmn;\u0026thinsp;0.0281\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c3\"\u003e \u003cp\u003e0.8537\u0026thinsp;\u0026plusmn;\u0026thinsp;0.0406\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c4\"\u003e \u003cp\u003e0.8281\u0026thinsp;\u0026plusmn;\u0026thinsp;0.0683\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c5\"\u003e 
\u003cp\u003e0.8863\u0026thinsp;\u0026plusmn;\u0026thinsp;0.0557\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c6\"\u003e \u003cp\u003e0.8923\u0026thinsp;\u0026plusmn;\u0026thinsp;0.0530\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c7\"\u003e \u003cp\u003e0.8566\u0026thinsp;\u0026plusmn;\u0026thinsp;0.0411\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c8\"\u003e \u003cp\u003e0.7300\u0026thinsp;\u0026plusmn;\u0026thinsp;0.0455\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c9\"\u003e \u003cp\u003e0.2180\u0026thinsp;\u0026plusmn;\u0026thinsp;0.0034\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eLightGBM\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c2\"\u003e \u003cp\u003e0.8964\u0026thinsp;\u0026plusmn;\u0026thinsp;0.0256\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c3\"\u003e \u003cp\u003e0.8337\u0026thinsp;\u0026plusmn;\u0026thinsp;0.0342\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c4\"\u003e \u003cp\u003e0.8130\u0026thinsp;\u0026plusmn;\u0026thinsp;0.0577\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c5\"\u003e \u003cp\u003e0.8597\u0026thinsp;\u0026plusmn;\u0026thinsp;0.0569\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c6\"\u003e \u003cp\u003e0.8712\u0026thinsp;\u0026plusmn;\u0026thinsp;0.0454\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c7\"\u003e \u003cp\u003e0.8387\u0026thinsp;\u0026plusmn;\u0026thinsp;0.0275\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c8\"\u003e \u003cp\u003e0.6847\u0026thinsp;\u0026plusmn;\u0026thinsp;0.0553\u003c/p\u003e 
\u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c9\"\u003e \u003cp\u003e0.1582\u0026thinsp;\u0026plusmn;\u0026thinsp;0.0377\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eXGBoost\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c2\"\u003e \u003cp\u003e0.8747\u0026thinsp;\u0026plusmn;\u0026thinsp;0.0266\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c3\"\u003e \u003cp\u003e0.7947\u0026thinsp;\u0026plusmn;\u0026thinsp;0.0276\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c4\"\u003e \u003cp\u003e0.8616\u0026thinsp;\u0026plusmn;\u0026thinsp;0.0607\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c5\"\u003e \u003cp\u003e0.7115\u0026thinsp;\u0026plusmn;\u0026thinsp;0.1052\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c6\"\u003e \u003cp\u003e0.7806\u0026thinsp;\u0026plusmn;\u0026thinsp;0.0508\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c7\"\u003e \u003cp\u003e0.8159\u0026thinsp;\u0026plusmn;\u0026thinsp;0.0246\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c8\"\u003e \u003cp\u003e0.6473\u0026thinsp;\u0026plusmn;\u0026thinsp;0.0491\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c9\"\u003e \u003cp\u003e0.1568\u0026thinsp;\u0026plusmn;\u0026thinsp;0.0172\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eGradientBoosting\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c2\"\u003e \u003cp\u003e0.8649\u0026thinsp;\u0026plusmn;\u0026thinsp;0.1116\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c3\"\u003e 
\u003cp\u003e0.7789\u0026thinsp;\u0026plusmn;\u0026thinsp;0.1189\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c4\"\u003e \u003cp\u003e0.8100\u0026thinsp;\u0026plusmn;\u0026thinsp;0.0838\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c5\"\u003e \u003cp\u003e0.7666\u0026thinsp;\u0026plusmn;\u0026thinsp;0.2235\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c6\"\u003e \u003cp\u003e0.8106\u0026thinsp;\u0026plusmn;\u0026thinsp;0.1603\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c7\"\u003e \u003cp\u003e0.7986\u0026thinsp;\u0026plusmn;\u0026thinsp;0.0977\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c8\"\u003e \u003cp\u003e0.6153\u0026thinsp;\u0026plusmn;\u0026thinsp;0.1915\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c9\"\u003e \u003cp\u003e0.2040\u0026thinsp;\u0026plusmn;\u0026thinsp;0.0285\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003e \u003cem\u003eValues represent mean\u0026thinsp;\u0026plusmn;\u0026thinsp;standard deviation computed over repeated stratified cross-validation folds. ROC-AUC: area under the receiver operating characteristic curve; Accuracy: proportion of correctly classified observations; Sensitivity: true positive rate; Specificity: true negative rate; Precision: positive predictive value; F1 Score: harmonic mean of precision and sensitivity; Persistence: cross-fold prediction stability index; Brier Score: mean squared difference between predicted probabilities and observed outcomes (lower values indicate better probabilistic calibration). 
Models are ranked in descending order of composite GoF score.\u003c/em\u003e \u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003e \u003cem\u003eCell values represent mean test performance computed over repeated stratified cross-validation folds. Cell colour encodes the min-max normalised score for each metric column independently, ranging from 0.0 (red; worst observed value) to 1.0 (green; best observed value); for Brier Score, normalisation is inverted such that lower raw values receive higher normalised scores.\u003c/em\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec15\" class=\"Section2\"\u003e \u003ch2\u003eCalibration analysis\u003c/h2\u003e \u003cp\u003eCalibration curves revealed a systematic and architecturally structured divergence between predicted probabilities and observed outcomes (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003e). LASSO, ElasticNet, and Ridge produced trajectories closely approximating the perfect calibration diagonal (Brier scores: 0.050, 0.050, 0.059); GBLUP and both SVCs exhibited comparable calibration fidelity (Brier: 0.051\u0026ndash;0.063). Stacking achieved the lowest Brier score across all architectures (0.046) alongside a top-tier ROC-AUC of 0.977. No other framework among the fourteen combined the minimum probabilistic error with top-tier discrimination, although 0.977 is the lowest AUC within the top tier rather than the maximum. The theoretical basis for interpreting this joint optimisation resides in the formal decomposability of the Brier score into separable discrimination and calibration components [37, 52]. Architectures that perform well on both components achieve composite superiority over those that attain high discrimination at the expense of probabilistic fidelity, a distinction that aggregate ranking metrics systematically obscure. 
The discrimination\u0026ndash;calibration dissociation was most pronounced in tree-based ensembles: RandomForest attained ROC-AUC\u0026thinsp;=\u0026thinsp;0.955 yet returned Brier\u0026thinsp;=\u0026thinsp;0.223, while XGBoost, LightGBM, AdaBoost, and GradientBoosting combined moderate-to-high discrimination with markedly degraded calibration (Brier: 0.157\u0026ndash;0.218), exhibiting characteristic sigmoid distortions \u0026mdash; systematic overconfidence at intermediate probability values and underestimation at the tails \u0026mdash; consistent across replicates. Deep learning architectures produced irregular, high-variance calibration trajectories reflecting stochastic optimisation instability under current sample size constraints.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003e \u003cem\u003eEach panel displays the relationship between mean predicted probability (x-axis) and observed event rate (y-axis) estimated across repeated stratified cross-validation folds; the dashed diagonal represents perfect calibration.\u003c/em\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec16\" class=\"Section2\"\u003e \u003ch2\u003eGeneralisation and overfitting\u003c/h2\u003e \u003cp\u003eFigure \u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003e presents a train-test generalisation map across eight metrics, with the identity diagonal serving as the zero-generalisation-gap reference. Top-tier models clustered tightly near the diagonal at high performance values, while GradientBoosting, XGBoost, LightGBM, and AdaBoost were consistently positioned below the diagonal across multiple panels \u0026mdash; with GradientBoosting recording the most extreme train-test separation in both the ROC-AUC and Persistence panels, reaching a training persistence value near 1.0 against a test value of approximately 0.6. 
In the Specificity panel, a distinct pattern emerged: ElasticNet, SVC-Linear, Ridge, LASSO, and GBLUP sat above the diagonal, whereas XGBoost occupied a markedly below-diagonal position with a visually prominent train-test gap.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003e \u003cem\u003eEach panel plots the mean training performance (x-axis) against the mean test performance (y-axis) computed over repeated stratified cross-validation folds for a given metric. The dashed diagonal line represents the identity (zero generalisation gap); points above the diagonal indicate higher mean performance on test folds than on training folds, and points below the diagonal indicate higher mean training than test performance, consistent with overfitting. For Brier Score, lower values indicate better probabilistic calibration; axis orientation follows the raw metric scale without inversion. Model identities are indicated by colour as shown in the legend.\u003c/em\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec17\" class=\"Section2\"\u003e \u003ch2\u003eCross-validation stability\u003c/h2\u003e \u003cp\u003eFigures \u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003e and \u003cspan refid=\"Fig6\" class=\"InternalRef\"\u003e6\u003c/span\u003e jointly characterise fold-to-fold stability. GradientBoosting was the most unstable model by a substantial margin, recording CV% values of 29.2% (specificity) and 31.1% (persistence), and the highest absolute standard deviations across most metrics (mean SD\u0026thinsp;=\u0026thinsp;0.1270); XGBoost ranked second most unstable (mean SD\u0026thinsp;=\u0026thinsp;0.0452). Strikingly, the Brier Score column inverted the stability gradient observed in all other columns: RandomForest (1.5%) and AdaBoost (1.6%) displayed apparent maximum Brier stability \u0026mdash; reflecting structurally consistent miscalibration rather than genuine calibration quality. 
The ROC-AUC column was the most uniformly stable metric, with CV% values as low as 0.7% for SVC-Linear across the top tier. ElasticNet and LASSO shared the lowest mean SD (0.0167 and 0.0172 respectively), confirming the stability dominance of the linear and kernel tier.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003e \u003cem\u003eCV% values represent the coefficient of variation computed as (standard deviation / mean) \u0026times; 100 for each model-metric combination across repeated stratified cross-validation folds. Lower values indicate greater fold-to-fold stability. Colour scale uses an inverted RdYlGn mapping where green denotes lower (more stable) CV% and red denotes higher (less stable) CV%. Models are sorted in descending order of composite GoF score.\u003c/em\u003e \u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003e \u003cem\u003eValues represent the absolute cross-validation standard deviation per model-metric combination.\u003c/em\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec18\" class=\"Section2\"\u003e \u003ch2\u003eJoint discrimination\u0026ndash;calibration positioning\u003c/h2\u003e \u003cp\u003eFigure \u003cspan refid=\"Fig7\" class=\"InternalRef\"\u003e7\u003c/span\u003e maps all fourteen models onto the Brier Score\u0026ndash;ROC-AUC two-dimensional performance space. The top-tier models formed a compact cluster in the upper-left ideal region (Brier\u0026thinsp;\u0026asymp;\u0026thinsp;0.05, ROC-AUC\u0026thinsp;\u0026asymp;\u0026thinsp;0.977\u0026ndash;0.982), with Stacking displaced leftward to the lowest Brier position in the comparison (0.046) and GBLUP displaced rightward to the highest Brier within the tier (0.063) \u0026mdash; exhibiting the most pronounced calibration\u0026ndash;discrimination dissociation within the top cluster. 
RandomForest and AdaBoost occupied a distinctive upper-right position \u0026mdash; high ROC-AUC combined with high Brier \u0026mdash; while XGBoost, LightGBM, and GradientBoosting were displaced both rightward and downward. Figure\u0026nbsp;\u003cspan refid=\"Fig8\" class=\"InternalRef\"\u003e8\u003c/span\u003e extends this analysis by examining the relationship between persistence and both primary performance indices: persistence exhibited a strong positive correlation with ROC-AUC (r\u0026thinsp;=\u0026thinsp;0.924, p\u0026thinsp;\u0026lt;\u0026thinsp;0.001) and a strong negative correlation with Brier Score (r\u0026thinsp;=\u0026thinsp;\u0026minus;\u0026thinsp;0.847, p\u0026thinsp;\u0026lt;\u0026thinsp;0.001), supporting persistence as a surrogate index for both discrimination and calibration quality across the fourteen models compared.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003e \u003cem\u003eThe x-axis represents the mean Brier Score and the y-axis the mean ROC-AUC, both computed over repeated stratified cross-validation folds. Error bars denote\u0026thinsp;\u0026plusmn;\u0026thinsp;1 standard deviation on each axis. The ideal model position is the upper-left corner, corresponding to simultaneously low Brier Score and high ROC-AUC. Models are identified by colour as shown in the legend.\u003c/em\u003e \u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003e \u003cem\u003ePearson correlation coefficients (r) and two-sided p-values were computed across the fourteen model means. Persistence is defined as a custom cross-fold prediction stability index; lower Brier Score values indicate better probabilistic calibration. Both axes represent test-set means computed over repeated stratified cross-validation folds. 
Model identities are indicated by colour and label.\u003c/em\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec19\" class=\"Section2\"\u003e \u003ch2\u003eError profile analysis\u003c/h2\u003e \u003cp\u003eThe confusion matrices revealed three structurally distinct error profiles (Fig.\u0026nbsp;\u003cspan refid=\"Fig9\" class=\"InternalRef\"\u003e9\u003c/span\u003e). Stacking achieved the highest TN rate (95.4% \u0026plusmn; 3.4%) alongside TP of 93.8% \u0026plusmn; 2.5%, while LASSO and ElasticNet showed the most symmetric FP\u0026ndash;FN distribution (difference of only 0.2 percentage points). GBLUP and SVC-Linear displayed a consistent FN\u0026thinsp;\u0026gt;\u0026thinsp;FP asymmetry, with FN rates of 7.7% and 7.9% respectively exceeding their FP rates, confirming the sensitivity\u0026ndash;specificity asymmetry identified in the performance table. XGBoost presented the most visually distinctive matrix in the figure: FP\u0026thinsp;=\u0026thinsp;28.8% \u0026plusmn; 10.5% against TN\u0026thinsp;=\u0026thinsp;71.2% \u0026plusmn; 10.5%, directly encoding its sensitivity\u0026ndash;specificity divergence. GradientBoosting exhibited the least stable error behaviour, with FP\u0026thinsp;=\u0026thinsp;23.3% \u0026plusmn; 22.4%: a standard deviation nearly equalling the mean indicates that its false positive rate was not merely elevated but varied widely across folds.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003e \u003cem\u003eEach matrix displays the mean true negative (TN), false positive (FP), false negative (FN), and true positive (TP) rates as proportions, computed over repeated stratified cross-validation folds. Values in parentheses denote\u0026thinsp;\u0026plusmn;\u0026thinsp;1 standard deviation across folds. Cell colour intensity reflects the magnitude of each rate, with darker blue indicating higher values. 
Specificity\u0026thinsp;=\u0026thinsp;TN rate; Sensitivity\u0026thinsp;=\u0026thinsp;TP rate.\u003c/em\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec20\" class=\"Section2\"\u003e \u003ch2\u003eStatistical significance testing\u003c/h2\u003e \u003cp\u003ePairwise bootstrap significance testing across four ranking metrics (Eff.MCC, AUC-ROC, Brier Score, ECE) over 10,000 iterations showed that several within-top-tier pairs did not differ significantly on any metric (0/4): LASSO vs ElasticNet, LASSO vs SVC-Linear, LASSO vs SVC-RBF, ElasticNet vs SVC-Linear, ElasticNet vs SVC-RBF, and SVC-Linear vs SVC-RBF (Figs.\u0026nbsp;\u003cspan refid=\"Fig10\" class=\"InternalRef\"\u003e10\u003c/span\u003e\u0026ndash;\u003cspan refid=\"Fig11\" class=\"InternalRef\"\u003e11\u003c/span\u003e). All comparisons between top-tier linear/kernel models and GradientBoosting, RandomForest, and AdaBoost reached 4/4. The Brier Score panel showed the most extensive significant differences across all sub-panels, while the AUC-ROC panel displayed the largest proportion of non-significant cells within the top-tier cluster \u0026mdash; indicating that calibration differences are more statistically resolvable than discriminative differences among high-performing models. 
DeLong test-based comparisons across all 91 model pairs with Benjamini\u0026ndash;Hochberg FDR correction (q\u0026thinsp;\u0026lt;\u0026thinsp;0.05) corroborated this tier structure: all top-tier versus bottom-tier comparisons reached significance on AUC, while most within-top-tier pairs did not.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003e \u003cem\u003eEach cell in the lower triangle of each panel displays the two-sided bootstrap p-value for the corresponding model pair on the indicated ranking metric; green cells denote p\u0026thinsp;\u0026lt;\u0026thinsp;0.05 (statistically significant difference) and grey cells denote p\u0026thinsp;\u0026ge;\u0026thinsp;0.05 (non-significant). 
Bootstrap resampling was performed with 10,000 iterations and a fixed random seed. Ranking metrics: Eff.MCC\u0026thinsp;=\u0026thinsp;effective Matthews Correlation Coefficient (MCC \u0026times; coverage); AUC-ROC\u0026thinsp;=\u0026thinsp;area under the receiver operating characteristic curve; Brier Score\u0026thinsp;=\u0026thinsp;mean squared probability error; ECE\u0026thinsp;=\u0026thinsp;expected calibration error computed over equal-width bins.\u003c/em\u003e \u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003e \u003cem\u003eValues denote the number of ranking metrics (out of 4) for which the corresponding model pair reached statistical significance at α\u0026thinsp;=\u0026thinsp;0.05. A value of 0/4 indicates statistical equivalence across all four metrics; 4/4 indicates significant difference on every metric. Both figures were generated using the Goodness-of-Fit Analyzer v11 pipeline with 10,000 paired bootstrap iterations.\u003c/em\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec21\" class=\"Section2\"\u003e \u003ch2\u003eComposite performance ranking\u003c/h2\u003e \u003cp\u003eFigure \u003cspan refid=\"Fig12\" class=\"InternalRef\"\u003e12\u003c/span\u003e consolidates the evaluation into a composite GoF score \u0026mdash; the unweighted mean of per-metric min-max normalised values across all eight test metrics, with Brier Score normalised in the inverted direction. Stacking achieved the highest composite score (0.9944), followed by LASSO (0.9488), ElasticNet (0.9485), SVC-RBF (0.9448), Ridge (0.9382), GBLUP (0.9328), and SVC-Linear (0.9293) \u0026mdash; all seven occupying a compressed upper band spanning only 0.0651 composite score units. 
A clear discontinuity separated this cluster from the intermediate tier of PyTorchCNN (0.8260) and PyTorchMLP (0.7276), and a second discontinuity delimited the bottom cluster: RandomForest (0.6067), AdaBoost (0.4157), LightGBM (0.3278), XGBoost (0.1478), and GradientBoosting (0.0630) \u0026mdash; the latter recording a value 15.8-fold below Stacking, reflecting simultaneous underperformance across discrimination, calibration, and fold-to-fold stability. The composite ranking is therefore not a restatement of any single metric but an integrated summary of multi-dimensional performance, providing a dimensionally robust basis for model selection in applied genomic prediction contexts.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003e \u003cem\u003eThe composite performance score represents the unweighted mean of per-metric min-max normalised values computed across eight test metrics (ROC-AUC, Accuracy, Sensitivity, Specificity, Precision, F1 Score, Persistence, and Brier Score); Brier Score normalisation is inverted such that lower raw values yield higher normalised scores. Values range from 0 (worst observed) to 1 (best observed). Models are ranked in descending order of composite score.\u003c/em\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec22\" class=\"Section2\"\u003e \u003ch2\u003ePrevalence recalibration analysis\u003c/h2\u003e \u003cp\u003ePost-hoc Albert offset recalibration was applied across four deployment-prevalence scenarios (π\u0026thinsp;=\u0026thinsp;0.25, π\u0026thinsp;=\u0026thinsp;0.1407, π\u0026thinsp;=\u0026thinsp;0.10, and π\u0026thinsp;=\u0026thinsp;0.05), where π\u0026thinsp;=\u0026thinsp;0.1407 represents the estimated true MAP seroprevalence of the target population. Both Brier score and Expected Calibration Error (ECE) increased monotonically as target prevalence decreased, yet the three-tier model hierarchy was preserved across all scenarios. 
At true field prevalence (π\u0026thinsp;=\u0026thinsp;0.1407), top-tier Brier scores ranged from 0.0700 (Stacking) to 0.1785 (GBLUP), while bottom-tier tree-based models ranged from 0.2698 (XGBoost) to 0.3828 (RandomForest; Fig.\u0026nbsp;\u003cspan refid=\"Fig13\" class=\"InternalRef\"\u003e13\u003c/span\u003e). The ECE heatmap at the same prevalence corroborated this separation: top-tier ECE spanned 0.1069 (SVC-Linear) to 0.2607 (GBLUP), whereas bottom-tier values ranged from 0.3064 (XGBoost) to 0.3868 (AdaBoost; Fig.\u0026nbsp;\u003cspan refid=\"Fig14\" class=\"InternalRef\"\u003e14\u003c/span\u003e). One notable observation was that PyTorchCNN recorded the lowest ECE among all fourteen models at π\u0026thinsp;=\u0026thinsp;0.1407 (0.0668), a value below that of all top-tier models at this scenario, though this did not alter its intermediate-tier composite standing.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003e \u003cem\u003eEach cell displays the Brier score for the corresponding model\u0026ndash;scenario combination following Albert offset recalibration. Models are sorted in descending order of composite GoF score and grouped into three performance tiers (Top, Intermediate, Bottom) separated by dashed lines. Lower values indicate better probabilistic calibration.\u003c/em\u003e \u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003e \u003cem\u003eEach cell displays the ECE for the corresponding model\u0026ndash;scenario combination following Albert offset recalibration, computed over ten equal-width probability bins.\u003c/em\u003e \u003c/p\u003e \u003c/div\u003e"},{"header":"Discussion","content":"\u003cdiv id=\"Sec24\" class=\"Section2\"\u003e \u003ch2\u003eGenomic Prediction as Disciplined Inference in High-Dimensional Biology\u003c/h2\u003e \u003cp\u003eGenomic prediction for complex disease resistance operates in a statistical regime that is simultaneously information-rich and sample-constrained. 
In such contexts, prediction models do more than classify \u0026mdash; they instantiate particular assumptions about how biological signal is structured, how uncertainty should be quantified, and how evidence should be translated into decision-relevant probabilities. The present comparative analysis of fourteen frameworks therefore serves not merely as a performance benchmark, but as an examination of how different algorithmic philosophies behave under identical genomic constraints. The foundational premise of whole-genome prediction, established by [2], is that selection decisions can be anchored to the joint information content of all markers simultaneously \u0026mdash; a premise whose implications continue to ramify as the diversity of available learning architectures expands.\u003c/p\u003e \u003cdiv id=\"Sec25\" class=\"Section3\"\u003e \u003ch2\u003eMulti-Breed, Multi-Region Population Composition and Its Analytical Consequences\u003c/h2\u003e \u003cp\u003eA structurally consequential feature of this study is the composition of the phenotyped population, encompassing seven indigenous Turkish goat breeds sampled from 11 provinces across four geographically distinct regions of Anatolia. Single-breed designs produce performance estimates that may reflect population-specific linkage disequilibrium structure and within-group relatedness rather than biologically meaningful signal; the multi-breed design employed here reduces this risk by requiring models to generalise across animals with divergent demographic histories and different effective population sizes. The consistently high performance of top-tier models supports the interpretation that their discriminative capacity reflects partially generalisable genomic signal, while the systematic and architecturally predictable failure of tree-based ensembles is consistent with structural miscalibration arguments rather than sampling-induced noise. 
It should be acknowledged that between-breed linkage disequilibrium confounding and differential heritability structure are plausible; breed-stratified analyses were not conducted, and the extent to which the observed performance hierarchy replicates within individual breeds remains an open question.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec26\" class=\"Section3\"\u003e \u003ch2\u003eMutual Information Filtering as Epistemic Pre-structuring of the Feature Space\u003c/h2\u003e \u003cp\u003eThe transformation of 44,375 quality-controlled SNPs into a 5,000-marker subset via mutual information (MI) filtering \u0026mdash; retaining approximately 11% of the genome-wide panel [53, 54] \u0026mdash; defines the epistemic conditions under which all downstream models operate. In ultra-high-dimensional settings where p ≫ n (44,375 markers vs. 474 animals), unconstrained estimation is mathematically unstable: collinearity inflates variance, gradient-based optimisers converge poorly, and tree-based partitioning faces combinatorial explosion in split evaluation [53]. MI was selected because it is model-agnostic, capturing both linear and non-linear associations and preserving analytical fairness across all architectures [55, 56]; it retains original SNP identities for biological interpretability [57]; and it has demonstrated competitive performance relative to embedded strategies in high-dimensional genomic contexts [55]. The empirical MI distribution exhibited a markedly right-skewed structure consistent with polygenic theory [5, 42]. 
A critical caveat must be noted: as a univariate filter, MI cannot capture epistatic interactions emerging from multivariate marker combinations [54], and uniform MI filtering \u0026mdash; while essential for valid benchmarking \u0026mdash; does not constitute the theoretically optimal feature set for each individual learning paradigm.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec27\" class=\"Section3\"\u003e \u003ch2\u003eArchitectural Convergence and the Ceiling of Extractable Additive Signal\u003c/h2\u003e \u003cp\u003eThe most structurally striking feature of the comparative results is not the identity of the best-performing model, but the architectural breadth of the top tier. Seven frameworks spanning fundamentally different algorithmic paradigms \u0026mdash; penalised linear regression (LASSO, ElasticNet, Ridge), kernel-based classification (SVC-RBF, SVC-Linear), a genomic relationship matrix estimator (GBLUP), and a heterogeneous stacking ensemble \u0026mdash; achieved ROC-AUC values ranging from 0.9770 to 0.9823, with the six regularised and kernel models spanning a compressed range of only 0.0015 AUC units, confirmed by bootstrap equivalence testing yielding 0/4 significant metric pairs among LASSO, ElasticNet, SVC-Linear, and SVC-RBF across all four ranking criteria. That architecturally unrelated models simultaneously reach the same discriminative ceiling implies that the predictive signal in this dataset is predominantly additive and low-dimensional in structure, offering little purchase to more complex non-linear learning surfaces [58] \u0026mdash; consistent with the infinitesimal model framework [2]. 
GBLUP's competitive performance is particularly informative: a model performing no internal feature selection and optimising no loss function beyond variance component estimation achieves discriminative capacity largely equivalent to purpose-built machine learning frameworks, consistent with prior reports where additive genetic architecture limits the incremental value of non-linear methods [15]. PyTorchCNN and PyTorchMLP failed to surpass the linear tier for reasons fundamentally distinct from tree-based models: deep learning architectures display elevated fold-to-fold instability consistent with stochastic optimisation that has not converged under current sample size constraints, whereas tree-based models retain competitive discrimination while exhibiting stable architectural miscalibration. Stacking's out-of-fold construction protocol imposes structural data separation mitigating overfitting at the ensemble level [59, 60], reflected in its marginally supra-diagonal positioning in the Accuracy generalisation panel and its lowest Brier score (0.046) within the top tier.\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv id=\"Sec28\" class=\"Section2\"\u003e \u003ch2\u003eDiscrimination\u0026ndash;Calibration Dissociation Across Learning Architectures\u003c/h2\u003e \u003cp\u003eThe central analytical contribution of this study is the empirical documentation and mechanistic interpretation of a discrimination\u0026ndash;calibration dissociation operating simultaneously between architectural tiers and within the top tier itself. The theoretical independence of rank-ordering capacity and probabilistic calibration fidelity is well established [52], but its instantiation across a diverse model set under identical data conditions carries specific inferential weight. 
Across fourteen models, ROC-AUC varied by a factor of 1.14 (from 0.8649 to 0.9823), whereas Brier score varied by a factor of 4.85 (from 0.0460 to 0.2229): the max-to-min ratio for calibration error was more than four times that for discrimination. The dissociation is most starkly illustrated by RandomForest, whose ROC-AUC of 0.9552 places it within plausible distance of the top tier, yet whose Brier score of 0.2229 \u0026mdash; the highest in the comparison \u0026mdash; signals that posterior probability estimates are structurally unreliable despite retained rank-ordering ability. This is mechanistically expected: ensemble averaging over unpruned trees stabilises rank predictions while compressing predicted probabilities away from the extremes of the unit interval and toward its centre, a known consequence of averaging leaf-node class frequencies across high-variance trees without explicit post-hoc calibration [61]; a Brier score close to the 0.25 returned by a constant prediction of 0.5 is consistent with such compression. Within the top tier, Stacking achieves the lowest Brier score (0.0460) despite ranking last on AUC within its group (0.9770), while GBLUP exhibits the weakest calibration (Brier: 0.0634) \u0026mdash; configurations reflecting architectural and optimisation-level properties that operate independently of predictive hierarchy. Stacking's superior calibration despite incorporating the most severely miscalibrated base learner (RandomForest; Brier\u0026thinsp;=\u0026thinsp;0.223) is plausibly explained by the meta-layer architecture: the Logistic Regression meta-learner, trained on held-out base-learner predictions, learns to discount RandomForest's poorly calibrated outputs, in effect recalibrating through learned combination weights without requiring explicit isotonic or Platt rescaling. 
The CV% stability analysis reveals a further dimension: RandomForest (Brier CV% = 1.5%) and AdaBoost (1.6%) exhibit apparent maximum Brier stability \u0026mdash; reflecting structurally consistent miscalibration rather than genuine calibration quality. 
These models are not miscalibrated because of instability \u0026mdash; they are stably, architecturally miscalibrated. The bootstrap significance results formalise this: the greater prevalence of significant pairwise differences in the Brier Score panel relative to the grey-dominated AUC-ROC panel confirms that models statistically indistinguishable on discrimination can be formally separated on calibration fidelity [52]\u0026mdash; a distinction with direct practical consequence wherever predicted probabilities inform culling thresholds, breeding decisions, or epidemiological modelling. These findings collectively argue that model selection should be governed by calibration-aware criteria \u0026mdash; Brier score, expected calibration error, and reliability diagram inspection \u0026mdash; alongside, rather than subordinate to, conventional discrimination metrics.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec29\" class=\"Section2\"\u003e \u003ch2\u003eFailure Mode Heterogeneity and Its Diagnostic Implications\u003c/h2\u003e \u003cp\u003eThe generalisation map and stability analyses jointly reveal that failure modes of underperforming models are architecturally distinct and carry different remedial implications \u0026mdash; a dimension entirely obscured by composite ranking. GradientBoosting exhibits erratic fold-variance, with Persistence collapsing from training values approaching 1.0 to test values near 0.6. XGBoost presents systematic decision boundary asymmetry: sensitivity exceeds specificity by 0.1501 with a specificity SD of 0.1052. 
RandomForest, by contrast, achieves an intermediate persistence score (0.785) yet retains the highest Brier score \u0026mdash; a structurally distinct failure mode in which cross-fold rank predictions are relatively consistent while probability assignments remain systematically miscalibrated, confirmed by its departure from the persistence\u0026ndash;Brier regression line in a direction unexplainable by instability alone (Fig.\u0026nbsp;\u003cspan refid=\"Fig8\" class=\"InternalRef\"\u003e8\u003c/span\u003e). The 0/4 bootstrap equivalence between LightGBM and XGBoost indicates these models form a statistically coherent failure cluster rather than a graded performance continuum.\u003c/p\u003e \u003c/div\u003e\n\u003ch3\u003eOn the Relativity of Model Adequacy and Composite Ranking\u003c/h3\u003e\n\u003cp\u003eThe designation of a model as adequate or inadequate is not an intrinsic model property, but a relational judgement dependent on which region of the metric space the investigator inhabits \u0026mdash; a choice inseparable from the inferential demands of the application [52]. The composite GoF score (Fig.\u0026nbsp;\u003cspan refid=\"Fig12\" class=\"InternalRef\"\u003e12\u003c/span\u003e) is most usefully interpreted not as a ranking instrument but as a diagnostic of tier membership: within the top tier, composite scores are so compressed that rank differences carry no practical inference; across tiers, discontinuities constitute substantive architectural distinctions rather than sampling artefacts. 
The GoF formulation assigns calibration an effective weight of only 12.5% (one Brier score component among eight), so applications where probability fidelity is the primary decision driver may prefer explicitly calibration-weighted composites or Pareto-front approaches that identify non-dominated models without scalar reduction.\u003c/p\u003e \u003cdiv id=\"Sec31\" class=\"Section2\"\u003e \u003ch2\u003ePrevalence Shift and the Architectural Basis of Miscalibration\u003c/h2\u003e \u003cp\u003eThe recalibration analysis addresses a potential limitation of the balanced case-control design: whether the calibration profiles documented under cross-validation reflect genuine architectural properties or are partly artefacts of the 1:1 training prevalence. The observation that the tier-level Brier and ECE ordering was preserved across all tested scenarios, including the estimated true field prevalence of π\u0026thinsp;=\u0026thinsp;0.1407, is consistent with the interpretation that the miscalibration of bottom-tier tree-based ensembles is not primarily a training-prevalence artefact, although this design does not permit a formal causal test of that claim. The apparent ECE advantage of PyTorchCNN at π\u0026thinsp;=\u0026thinsp;0.1407 warrants caution. Because ECE is a binned summary statistic, the unusually low value (0.0668) may partly reflect the concentration of that model's predicted probabilities within a narrow range that is less sensitive to logit-scale offset, rather than improved probabilistic fidelity; taken in isolation, it should not be interpreted as evidence of genuine calibration superiority.
More broadly, these results suggest that the discrimination\u0026ndash;calibration dissociation documented under balanced training conditions is likely to remain relevant under realistic field deployment conditions, while acknowledging that external prospective validation would be necessary before any such inference is treated as definitive.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec32\" class=\"Section2\"\u003e \u003ch2\u003eLimitations\u003c/h2\u003e \u003cp\u003eSeveral structural limitations warrant acknowledgement. The absence of breed-stratified evaluations leaves between-breed performance heterogeneity uncharacterised; paratuberculosis prevalence, immune response architecture, and effective population size likely vary across the seven breeds. MI's univariate filtering cannot capture epistatic interactions, potentially disadvantaging architectures designed for non-additive effects [54]. The moderate sample size (n\u0026thinsp;=\u0026thinsp;474) relative to the retained feature space (p\u0026thinsp;=\u0026thinsp;5,000) inherently constrains complex architectures. Finally, all evaluations were conducted under cross-validation on a single cohort without external prospective validation \u0026mdash; a limitation that must be resolved before any framework is deployed in operational genomic selection programmes.\u003c/p\u003e \u003c/div\u003e"},{"header":"Conclusions","content":"\u003cp\u003eThree conclusions of broad methodological relevance emerge. First, the convergence of linear, kernel-based, and relationship-matrix estimators to a statistically indistinguishable discriminative ceiling \u0026mdash; confirmed by bootstrap equivalence across four independent ranking metrics \u0026mdash; is consistent with the predictive signal being predominantly captured by additive genetic structure, with the caveat that MI-based pre-filtering inherently pre-structures the feature space in favour of additive signal. 
Architectural complexity beyond penalised linear models confers no measurable discriminative benefit under these data conditions, and GBLUP's competitive standing supports its use, consistent with quantitative genetic theory, as a primary framework for genomic prediction of polygenic disease resistance traits in livestock populations of modest size. Second, the discrimination\u0026ndash;calibration dissociation documented across all fourteen models demonstrates that no single metric suffices to characterise a model's fitness for purpose, and that the selection of an evaluation criterion is not a neutral methodological convention but an inferential commitment that must be made explicit relative to the decision context. Third, model failure modes are architecturally heterogeneous (erratic fold-variance in GradientBoosting, stable architectural miscalibration in RandomForest, systematic decision boundary asymmetry in XGBoost), and these profiles carry different remedial implications that composite rankings necessarily conflate. Together, these findings argue that genomic prediction model evaluation should be conducted in a multi-metric, multi-dimensional framework that explicitly distinguishes discrimination from calibration, absolute performance from stability, and statistical equivalence from practical interchangeability, with the inferential target specified before, rather than after, the selection of an evaluation criterion. In addition, post-hoc prevalence recalibration across scenarios spanning the plausible range of MAP field deployment conditions (π\u0026thinsp;=\u0026thinsp;0.05\u0026ndash;0.25), including the estimated true seroprevalence of π\u0026thinsp;=\u0026thinsp;0.1407, indicated that the tier-level Brier score and ECE ordering observed under balanced training conditions was largely preserved.
This pattern is consistent with the miscalibration of bottom-tier tree-based models reflecting an architectural rather than a prevalence-contingent deficit, though this interpretation remains subject to confirmation in external validation cohorts.\u003c/p\u003e"},{"header":"Declarations","content":"\u003cp\u003e\u003cstrong\u003eAcknowledgements\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThis project was supported by the Scientific and Technological Research Council of T\u0026uuml;rkiye (T\u0026Uuml;BİTAK), and we extend our sincere gratitude to T\u0026Uuml;BİTAK for their support.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eWe confirm that informed consent was obtained from all animal owners. In addition, official permission to conduct the field study was granted by the General Directorate of Agricultural Research and Policies (TAGEM), Republic of T\u0026uuml;rkiye Ministry of Agriculture and Forestry (dated January 16, 2023; document no. E-92190712-604.02-8570746).\u003c/p\u003e\n\u003cp\u003eThe corresponding author would like to express sincere gratitude to Mustafa Kemal Atat\u0026uuml;rk, the founding leader of modern T\u0026uuml;rkiye, for his enduring legacy of reason and scientific thought. In this spirit, we adhere to his guiding principle: \u0026ldquo;If, one day, my words contradict science, choose science.\u0026rdquo;\u003c/p\u003e\n\u003cp\u003eThe authors used Claude Sonnet 4.6, a large language model developed by Anthropic, solely for grammatical editing and language refinement during manuscript preparation. 
All scientific content, analyses, interpretations, and conclusions are entirely the work of the authors.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eFunding\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThis research project was funded by the Scientific and Technological Research Council of T\u0026uuml;rkiye (T\u0026Uuml;BİTAK) (Project No: TOVAG-222O107).\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAuthor contributions\u0026nbsp;\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eY.Y. conceptualized and designed the study, performed the genomic prediction and machine learning analyses, and wrote the manuscript. Y.Y., R.A., M.K., \u0026Ouml;.G., and K.I. conducted the laboratory work. S.Y. organized the field work. Y.Y., A.E., D.C., Y.E.K., S.S.Ş., and M.B. conducted the field work and data collection.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eData availability\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe data used in this study have been deposited in the figshare.com database under https://doi.org/10.6084/m9.figshare.30845045\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eEthics declarations\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e\u003cem\u003eCompeting interests\u003c/em\u003e\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe authors declare no competing interests.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e\u003cem\u003eEthical statement\u003c/em\u003e\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThis study was conducted in accordance with the guidelines of the Local Ethics Committee for Animal Experiments and with an experimental protocol approved by the \u0026ldquo;Ethics Committee for the Use of Animals in Research and Experimentation\u0026rdquo; at the Sheep Breeding and Research Institute, T\u0026uuml;rkiye (Approval No: 066/14.02.2023). 
Informed consent was obtained from the Ministry of Agriculture and Forestry, Siirt Provincial Directorate of Agriculture (Approval No: E-64380313-325.01-8848896/08.02.2023) prior to the study. The authors also complied with the ARRIVE guidelines.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\n\u003cli\u003eHosseini K, Anaraki N, Dastjerdi P, Kazemian S, Hasanzad M, Alkhouli M, et al. Bridging Genomics to Cardiology Clinical Practice: Artificial Intelligence in Optimizing Polygenic Risk Scores: A Systematic Review. JACC: Advances. 2025;4:101803. https://doi.org/10.1016/j.jacadv.2025.101803.\u003c/li\u003e\n\u003cli\u003eMeuwissen TH, Hayes BJ, Goddard ME. Prediction of total genetic value using genome-wide dense marker maps. Genetics. 2001;157:1819\u0026ndash;29. https://doi.org/10.1093/genetics/157.4.1819.\u003c/li\u003e\n\u003cli\u003eAlves AAC, Espigolan R, Bresolin T, Costa RM, Fernandes J\u0026uacute;nior GA, Ventura RV, et al. Genome-enabled prediction of reproductive traits in Nellore cattle using parametric models and machine learning methods. Anim Genet. 2021;52:32\u0026ndash;46. https://doi.org/10.1111/age.13021.\u003c/li\u003e\n\u003cli\u003ede Los Campos G, Hickey JM, Pong-Wong R, Daetwyler HD, Calus MPL. Whole-genome regression and prediction methods applied to plant and animal breeding. Genetics. 2013;193:327\u0026ndash;45. https://doi.org/10.1534/genetics.112.143313.\u003c/li\u003e\n\u003cli\u003eMeuwissen T, Hayes B, Goddard M. Accelerating improvement of livestock with genomic selection. Annu Rev Anim Biosci. 2013;1:221\u0026ndash;37. https://doi.org/10.1146/annurev-animal-031412-103705.\u003c/li\u003e\n\u003cli\u003eMomen M, Mehrgardi AA, Sheikhi A, Kranis A, Tusell L, Morota G, et al. Predictive ability of genome-assisted statistical models under various forms of gene action. Sci Rep. 2018;8:12309. https://doi.org/10.1038/s41598-018-30089-2.\u003c/li\u003e\n\u003cli\u003eVarona L, Legarra A, Toro MA, Vitezica ZG.
Non-additive Effects in Genomic Selection. Front Genet. 2018;9:78. https://doi.org/10.3389/fgene.2018.00078.\u003c/li\u003e\n\u003cli\u003eHughes J, Shymka M, Ng T, Phulka JS, Safabakhsh S, Laksman Z. Polygenic Risk Score Implementation into Clinical Practice for Primary Prevention of Cardiometabolic Disease. Genes. 2024;15:1581. https://doi.org/10.3390/genes15121581.\u003c/li\u003e\n\u003cli\u003eMartin AR, Kanai M, Kamatani Y, Okada Y, Neale BM, Daly MJ. Clinical use of current polygenic risk scores may exacerbate health disparities. Nat Genet. 2019;51:584\u0026ndash;91. https://doi.org/10.1038/s41588-019-0379-x.\u003c/li\u003e\n\u003cli\u003ePain O. Leveraging global genetics resources to enhance polygenic prediction across ancestrally diverse populations. HGG Adv. 2025;6:100482. https://doi.org/10.1016/j.xhgg.2025.100482.\u003c/li\u003e\n\u003cli\u003eRuan Y, Lin Y-F, Feng Y-CA, Chen C-Y, Lam M, Guo Z, et al. Improving polygenic prediction in ancestrally diverse populations. Nat Genet. 2022;54:573\u0026ndash;80. https://doi.org/10.1038/s41588-022-01054-7.\u003c/li\u003e\n\u003cli\u003eAbdollahi-Arpanahi R, Gianola D, Pe\u0026ntilde;agaricano F. Deep learning versus parametric and ensemble methods for genomic prediction of complex phenotypes. Genet Sel Evol. 2020;52:12. https://doi.org/10.1186/s12711-020-00531-z.\u003c/li\u003e\n\u003cli\u003eGianola D. Priors in Whole-Genome Regression: The Bayesian Alphabet Returns. Genetics. 2013;194:573\u0026ndash;96. https://doi.org/10.1534/genetics.113.151753.\u003c/li\u003e\n\u003cli\u003eGonz\u0026aacute;lez-Recio O, Rosa GJM, Gianola D. Machine learning methods and predictive ability metrics for genome-wide prediction of complex traits. Livestock Science. 2014;166:217\u0026ndash;31. 
https://doi.org/10.1016/j.livsci.2014.05.036.\u003c/li\u003e\n\u003cli\u003eMontesinos-L\u0026oacute;pez OA, Montesinos-L\u0026oacute;pez A, P\u0026eacute;rez-Rodr\u0026iacute;guez P, Barr\u0026oacute;n-L\u0026oacute;pez JA, Martini JWR, Fajardo-Flores SB, et al. A review of deep learning applications for genomic selection. BMC Genomics. 2021;22:19. https://doi.org/10.1186/s12864-020-07319-x.\u003c/li\u003e\n\u003cli\u003eZingaretti LM, Gezan SA, Ferr\u0026atilde;o LFV, Osorio LF, Monfort A, Mu\u0026ntilde;oz PR, et al. Exploring Deep Learning for Complex Trait Genomic Prediction in Polyploid Outcrossing Species. Front Plant Sci. 2020;11:25. https://doi.org/10.3389/fpls.2020.00025.\u003c/li\u003e\n\u003cli\u003eLiu J, Yan X, Li W, Xue S-H, Wang Z, Su R. Genomic Selection for Cashmere Traits in Inner Mongolian Cashmere Goats Using Random Forest, Gradient Boosting Decision Tree, Extreme Gradient Boosting and Light Gradient Boosting Machine Methods. Animals (Basel). 2025;15:2940. https://doi.org/10.3390/ani15202940.\u003c/li\u003e\n\u003cli\u003eXiang T, Li T, Li J, Li X, Wang J. Using machine learning to realize genetic site screening and genomic prediction of productive traits in pigs. FASEB J. 2023;37:e22961. https://doi.org/10.1096/fj.202300245R.\u003c/li\u003e\n\u003cli\u003eThorsrud JA, Evans KM, Quigley KC, Srikanth K, Huson HJ. Performance Comparison of Genomic Best Linear Unbiased Prediction and Four Machine Learning Models for Estimating Genomic Breeding Values in Working Dogs. Animals (Basel). 2025;15:408. https://doi.org/10.3390/ani15030408.\u003c/li\u003e\n\u003cli\u003eUsai MG, Casu S, Sechi T, Salaris SL, Miari S, Mulas G, et al. Advances in understanding the genetic architecture of antibody response to paratuberculosis in sheep by heritability estimate and LDLA mapping analyses and investigation of candidate regions using sequence-based data. Genet Sel Evol. 2024;56:5. 
https://doi.org/10.1186/s12711-023-00873-4.\u003c/li\u003e\n\u003cli\u003eHasonova L, Pavlik I. Economic impact of paratuberculosis in dairy cattle herds: a review. Veterin\u0026aacute;rn\u0026iacute; medic\u0026iacute;na. 2006;51:193\u0026ndash;211. https://doi.org/10.17221/5539-VETMED.\u003c/li\u003e\n\u003cli\u003eRasmussen P, Barkema HW, Mason S, Beaulieu E, Hall DC. Economic losses due to Johne\u0026rsquo;s disease (paratuberculosis) in dairy cattle. J Dairy Sci. 2021;104:3123\u0026ndash;43. https://doi.org/10.3168/jds.2020-19381.\u003c/li\u003e\n\u003cli\u003eDow CT, Alvarez BL. Mycobacterium paratuberculosis zoonosis is a One Health emergency. Ecohealth. 2022;19:164\u0026ndash;74. https://doi.org/10.1007/s10393-022-01602-x.\u003c/li\u003e\n\u003cli\u003eWaddell LA, Rajić A, St\u0026auml;rk KDC, McEwen SA. The zoonotic potential of Mycobacterium avium ssp. paratuberculosis: a systematic review and meta-analyses of the evidence. Epidemiol Infect. 2015;143:3135\u0026ndash;57. https://doi.org/10.1017/S095026881500076X.\u003c/li\u003e\n\u003cli\u003eAyele WY, Svastova P, Roubal P, Bartos M, Pavlik I. Mycobacterium avium subspecies paratuberculosis cultured from locally and commercially pasteurized cow\u0026rsquo;s milk in the Czech Republic. Appl Environ Microbiol. 2005;71:1210\u0026ndash;4. https://doi.org/10.1128/AEM.71.3.1210-1214.2005.\u003c/li\u003e\n\u003cli\u003eWhittington RJ, Marsh IB, Reddacliff LA. Survival of Mycobacterium avium subsp. paratuberculosis in dam water and sediment. Appl Environ Microbiol. 2005;71:5304\u0026ndash;8. https://doi.org/10.1128/AEM.71.9.5304-5308.2005.\u003c/li\u003e\n\u003cli\u003eKravitz A, Pelzer K, Sriranganathan N. The Paratuberculosis Paradigm Examined: A Review of Host Genetic Resistance and Innate Immune Fitness in Mycobacterium avium subsp. Paratuberculosis Infection. Front Vet Sci. 2021;8.
https://doi.org/10.3389/fvets.2021.721706.\u003c/li\u003e\n\u003cli\u003eAlpay F, Zare Y, Kamalludin MH, Huang X, Shi X, Shook GE, et al. Genome-wide association study of susceptibility to infection by Mycobacterium avium subspecies paratuberculosis in Holstein cattle. PLoS One. 2014;9:e111704. https://doi.org/10.1371/journal.pone.0111704.\u003c/li\u003e\n\u003cli\u003eSanchez M-P, Tribout T, Fritz S, Guatteo R, Fourichon C, Schibler L, et al. New insights into the genetic resistance to paratuberculosis in Holstein cattle via single-step genomic evaluation. Genet Sel Evol. 2022;54:67. https://doi.org/10.1186/s12711-022-00757-z.\u003c/li\u003e\n\u003cli\u003eBadia-Bringu\u0026eacute; G, Alonso-Hearn M. Integrating transcriptomic and genomic studies for the identification of expression quantitative trait loci associated with bovine paratuberculosis. Front Vet Sci. 2025;12:1632212. https://doi.org/10.3389/fvets.2025.1632212.\u003c/li\u003e\n\u003cli\u003eCanive M, Badia-Bringu\u0026eacute; G, V\u0026aacute;zquez P, Gonz\u0026aacute;lez-Recio O, Fern\u0026aacute;ndez A, Garrido JM, et al. Identification of loci associated with pathological outcomes in Holstein cattle infected with Mycobacterium avium subsp. paratuberculosis using whole-genome sequence data. Sci Rep. 2021;11:20177. https://doi.org/10.1038/s41598-021-99672-4.\u003c/li\u003e\n\u003cli\u003eIdris SM, Eltom KH, Okuni JB, Ojok L, Elmagzoub WA, El Wahed AA, et al. Paratuberculosis: The Hidden Killer of Small Ruminants. Animals (Basel). 2021;12:12. https://doi.org/10.3390/ani12010012.\u003c/li\u003e\n\u003cli\u003eKorou LM, Liandris E, Gazouli M, Ikonomopoulos J. Investigation of the association of the SLC11A1 gene with resistance/sensitivity of goats (Capra hircus) to paratuberculosis. Vet Microbiol. 2010;144:353\u0026ndash;8. https://doi.org/10.1016/j.vetmic.2010.01.009.\u003c/li\u003e\n\u003cli\u003eMataragka A, Klavdianos Papastathis A, Ikonomopoulos J. 
Association of SLC11A1 3\u0026rsquo;UTR (GT)n Microsatellite Polymorphisms with Resistance to Paratuberculosis in Sheep. Pathogens. 2025;14:1150. https://doi.org/10.3390/pathogens14111150.\u003c/li\u003e\n\u003cli\u003eYaman Y, Aymaz R, Keleş M, Bay V, \u0026Uuml;n C, Heaton MP. Association of TLR2 haplotypes encoding Q650 with reduced susceptibility to ovine Johne\u0026rsquo;s disease in Turkish sheep. Sci Rep. 2021;11:7088. https://doi.org/10.1038/s41598-021-86605-4.\u003c/li\u003e\n\u003cli\u003eSu R, Huang B, Tan J, Shen Z, Zhong P, Liu J. Mutual information stacking method for prediction of the growth traits in pigs. Brief Bioinform. 2025;26:bbaf231. https://doi.org/10.1093/bib/bbaf231.\u003c/li\u003e\n\u003cli\u003eZhu K, Zheng Y, Chan KCG. Weighted Brier Score\u0026mdash;An Overall Summary Measure for Risk Prediction Models with Clinical Utility Consideration. Stat Biosci. 2025. https://doi.org/10.1007/s12561-025-09505-5.\u003c/li\u003e\n\u003cli\u003eChen C, Bhuiyan SA, Ross E, Powell O, Dinglasan E, Wei X, et al. Genomic prediction for sugarcane diseases including hybrid Bayesian-machine learning approaches. Front Plant Sci. 2024;15:1398903. https://doi.org/10.3389/fpls.2024.1398903.\u003c/li\u003e\n\u003cli\u003eCrossa J, P\u0026eacute;rez-Rodr\u0026iacute;guez P, Cuevas J, Montesinos-L\u0026oacute;pez O, Jarqu\u0026iacute;n D, de Los Campos G, et al. Genomic Selection in Plant Breeding: Methods, Models, and Perspectives. Trends Plant Sci. 2017;22:961\u0026ndash;75. https://doi.org/10.1016/j.tplants.2017.08.011.\u003c/li\u003e\n\u003cli\u003eIrvin MR, Ge T, Patki A, Srinivasasainagendra V, Armstrong ND, Davis B, et al. Polygenic Risk for Type 2 Diabetes in African Americans. Diabetes. 2024;73:993\u0026ndash;1001. https://doi.org/10.2337/db23-0232.\u003c/li\u003e\n\u003cli\u003eLee SS-Y, Stapleton F, MacGregor S, Mackey DA. 
Genome-wide association studies, Polygenic Risk Scores and Mendelian randomisation: an overview of common genetic epidemiology methods for ophthalmic clinicians. Br J Ophthalmol. 2025;109:433\u0026ndash;41. https://doi.org/10.1136/bjo-2024-326554.\u003c/li\u003e\n\u003cli\u003eNdong Sima CAA, Step K, Swart Y, Schurz H, Uren C, M\u0026ouml;ller M. Methodologies underpinning polygenic risk scores estimation: a comprehensive overview. Hum Genet. 2024;143:1265\u0026ndash;80. https://doi.org/10.1007/s00439-024-02710-0.\u003c/li\u003e\n\u003cli\u003eChang CC, Chow CC, Tellier LC, Vattikuti S, Purcell SM, Lee JJ. Second-generation PLINK: rising to the challenge of larger and richer datasets. Gigascience. 2015;4:7. https://doi.org/10.1186/s13742-015-0047-8.\u003c/li\u003e\n\u003cli\u003eBrowning BL, Zhou Y, Browning SR. A One-Penny Imputed Genome from Next-Generation Reference Panels. Am J Hum Genet. 2018;103:338\u0026ndash;48. https://doi.org/10.1016/j.ajhg.2018.07.015.\u003c/li\u003e\n\u003cli\u003eGuo Y, Zhong Z, Yang C, Hu J, Jiang Y, Liang Z, et al. Epi-GTBN: an approach of epistasis mining based on genetic Tabu algorithm and Bayesian network. BMC Bioinformatics. 2019;20:444. https://doi.org/10.1186/s12859-019-3022-z.\u003c/li\u003e\n\u003cli\u003eHaws DC, Rish I, Teyssedre S, He D, Lozano AC, Kambadur P, et al. Variable-Selection Emerges on Top in Empirical Comparison of Whole-Genome Complex-Trait Prediction Methods. PLoS One. 2015;10:e0138903. https://doi.org/10.1371/journal.pone.0138903.\u003c/li\u003e\n\u003cli\u003eHeinrich F, Ramzan F, Rajavel A, Schmitt AO, G\u0026uuml;ltas M. MIDESP: Mutual Information-Based Detection of Epistatic SNP Pairs for Qualitative and Quantitative Phenotypes. Biology (Basel). 2021;10:921. https://doi.org/10.3390/biology10090921.\u003c/li\u003e\n\u003cli\u003eHuang H-H, Xu T, Yang J. Comparing logistic regression, support vector machines, and permanental classification methods in predicting hypertension. BMC Proc. 2014;8 Suppl 1:S96. 
https://doi.org/10.1186/1753-6561-8-S1-S96.\u003c/li\u003e\n\u003cli\u003eMiller DJ, Zhang Y, Yu G, Liu Y, Chen L, Langefeld CD, et al. An algorithm for learning maximum entropy probability models of disease risk that efficiently searches and sparingly encodes multilocus genomic interactions. Bioinformatics. 2009;25:2478\u0026ndash;85. https://doi.org/10.1093/bioinformatics/btp435.\u003c/li\u003e\n\u003cli\u003eFerrario PG, K\u0026ouml;nig IR. Transferring entropy to the realm of GxG interactions. Brief Bioinform. 2018;19:136\u0026ndash;47. https://doi.org/10.1093/bib/bbw086.\u003c/li\u003e\n\u003cli\u003eWang H, Yin H, Wu X. A Secure High-Order Gene Interaction Detecting Method for Infectious Diseases. Comput Math Methods Med. 2022;2022:4471736. https://doi.org/10.1155/2022/4471736.\u003c/li\u003e\n\u003cli\u003eVan Calster B, McLernon DJ, van Smeden M, Wynants L, Steyerberg EW, Bossuyt P, et al. Calibration: the Achilles heel of predictive analytics. BMC Med. 2019;17:230. https://doi.org/10.1186/s12916-019-1466-7.\u003c/li\u003e\n\u003cli\u003eBermingham ML, Pong-Wong R, Spiliopoulou A, Hayward C, Rudan I, Campbell H, et al. Application of high-dimensional feature selection: evaluation for genomic prediction in man. Sci Rep. 2015;5:10312. https://doi.org/10.1038/srep10312.\u003c/li\u003e\n\u003cli\u003ePudjihartono N, Fadason T, Kempa-Liehr AW, O\u0026rsquo;Sullivan JM. A Review of Feature Selection Methods for Machine Learning-Based Disease Risk Prediction. Front Bioinform. 2022;2:927312. https://doi.org/10.3389/fbinf.2022.927312.\u003c/li\u003e\n\u003cli\u003eAlzoubi H, Alzubi R, Ramzan N. Deep Learning Framework for Complex Disease Risk Prediction Using Genomic Variations. Sensors. 2023;23:4439. https://doi.org/10.3390/s23094439.\u003c/li\u003e\n\u003cli\u003eVergara JR, Est\u0026eacute;vez PA. A review of feature selection methods based on mutual information. Neural Comput \u0026amp; Applic. 2014;24:175\u0026ndash;86. 
https://doi.org/10.1007/s00521-013-1368-0.\u003c/li\u003e\n\u003cli\u003eHeinrich F, Lange TM, Kircher M, Ramzan F, Schmitt AO, G\u0026uuml;ltas M. Exploring the potential of incremental feature selection to improve genomic prediction accuracy. Genet Sel Evol. 2023;55:78. https://doi.org/10.1186/s12711-023-00853-8.\u003c/li\u003e\n\u003cli\u003eGoddard ME, Hayes BJ. Mapping genes for complex traits in domestic animals and their use in breeding programmes. Nat Rev Genet. 2009;10:381\u0026ndash;91. https://doi.org/10.1038/nrg2575.\u003c/li\u003e\n\u003cli\u003eLiang M, Chang T, An B, Duan X, Du L, Wang X, et al. A Stacking Ensemble Learning Framework for Genomic Prediction. Front Genet. 2021;12. https://doi.org/10.3389/fgene.2021.600040.\u003c/li\u003e\n\u003cli\u003eWolpert DH. Stacked generalization. Neural Networks. 1992;5:241\u0026ndash;59. https://doi.org/10.1016/S0893-6080(05)80023-1.\u003c/li\u003e\n\u003cli\u003eNiculescu-Mizil A, Caruana R. Predicting good probabilities with supervised learning. In: Proceedings of the 22nd international conference on Machine learning. New York, NY, USA: Association for Computing Machinery; 2005. p. 625\u0026ndash;32. https://doi.org/10.1145/1102351.1102430.\u003c/li\u003e\n\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":false,"hideJournal":false,"highlight":"","institution":"","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"
[email protected]","identity":"scientific-reports","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"scirep","sideBox":"Learn more about [Scientific Reports](http://www.nature.com/srep/)","snPcode":"","submissionUrl":"","title":"Scientific Reports","twitterHandle":"","acdcEnabled":true,"dfaEnabled":true,"editorialSystem":"stoa","reportingPortfolio":"Scientific Reports","inReviewEnabled":true,"inReviewRevisionsEnabled":true},"keywords":"genomic prediction, paratuberculosis, machine learning, deep learning, discrimination–calibration dissociation, mutual information, model benchmarking, goat","lastPublishedDoi":"10.21203/rs.3.rs-9421190/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-9421190/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003eGenomic prediction of complex disease resistance demands frameworks that jointly optimise discriminative power and probabilistic calibration. We systematically benchmark 14 predictive frameworks \u0026mdash; spanning regularised linear models, GBLUP, kernel-based classifiers, tree-based ensembles, deep neural networks, and meta-ensemble strategies \u0026mdash; for paratuberculosis predisposition classification in 474 goats representing seven indigenous Turkish breeds across 11 provinces. Mutual information-based feature selection distilled 44,375 quality-controlled SNPs into 5,000 maximally informative markers. Seven architecturally diverse frameworks converged to a statistically indistinguishable discrimination ceiling (mean AUC\u0026thinsp;\u0026asymp;\u0026thinsp;0.982); GBLUP's competitive standing within this ceiling implicates predominantly additive genetic architecture consistent with the infinitesimal model framework. 
The central finding is a discrimination\u0026ndash;calibration dissociation: ROC-AUC varied only 1.14-fold across all 14 models, whereas Brier score varied 4.85-fold (0.046\u0026ndash;0.223), revealing that probabilistic fidelity diverges far beyond rank-ordering capacity. Post-hoc prevalence recalibration across the plausible range of MAP field deployment scenarios (π\u0026thinsp;=\u0026thinsp;0.05\u0026ndash;0.25) confirmed that this tier-level calibration ordering was fully preserved at all tested prevalences, including the estimated true seroprevalence of 14.07%, indicating that the dissociation reflects an architectural property of the models rather than an artefact of the balanced training design. Critically, tree-based ensembles exhibit architecturally heterogeneous failure modes \u0026mdash; erratic fold-instability, systematic decision boundary asymmetry, and structural miscalibration \u0026mdash; profiles with distinct remedial implications that composite ranking obscures. Cross-fold prediction stability emerged as a strong integrative proxy for both discrimination and calibration, exhibiting high correlations with both ROC-AUC (r\u0026thinsp;=\u0026thinsp;0.924) and Brier score (r\u0026thinsp;=\u0026thinsp;\u0026minus;\u0026thinsp;0.847). The stacking ensemble achieved the highest composite performance (accuracy\u0026thinsp;=\u0026thinsp;0.945; F1\u0026thinsp;=\u0026thinsp;0.947; Brier\u0026thinsp;=\u0026thinsp;0.046). 
These findings suggest that genomic prediction model selection should be governed by calibration-aware, multi-dimensional evaluation frameworks.\u003c/p\u003e","manuscriptTitle":"A Machine Learning Framework for Genomic Prediction of Paratuberculosis Predisposition in Goats: Discrimination–Calibration Dissociation Across Learning Architectures","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2026-04-30 10:11:14","doi":"10.21203/rs.3.rs-9421190/v1","editorialEvents":[{"type":"communityComments","content":0},{"type":"decision","content":"Revision requested","date":"2026-05-14T06:26:45+00:00","index":"","fulltext":""},{"type":"editorInvitedReview","content":"","date":"2026-05-12T21:59:45+00:00","index":"hide","fulltext":""},{"type":"editorInvitedReview","content":"","date":"2026-05-11T12:44:35+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"158754413443167736455156823208988186700","date":"2026-05-09T21:53:56+00:00","index":"hide","fulltext":""},{"type":"editorInvitedReview","content":"","date":"2026-05-09T07:01:51+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"318111460081744496830823812639705393831","date":"2026-05-08T13:28:44+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"273245800420803287473688948063815158171","date":"2026-05-08T09:03:08+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"195422697790073256611158644563747994925","date":"2026-05-07T14:50:18+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"202864569595477086937734693741155082306","date":"2026-05-06T09:12:25+00:00","index":"hide","fulltext":""},{"type":"editorInvitedReview","content":"","date":"2026-05-04T01:30:59+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"155916430029581189519235068350924146376","date":"2026-04-27T07:40:26+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"13582807392805109888327981631542069694","da
te":"2026-04-25T08:59:33+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"319325982638739848221639377274576064541","date":"2026-04-21T16:11:36+00:00","index":"hide","fulltext":""},{"type":"reviewersInvited","content":"","date":"2026-04-21T16:08:22+00:00","index":"","fulltext":""},{"type":"editorAssigned","content":"","date":"2026-04-21T15:33:58+00:00","index":"","fulltext":""},{"type":"editorInvited","content":"","date":"2026-04-21T08:55:33+00:00","index":"","fulltext":""},{"type":"checksComplete","content":"","date":"2026-04-19T10:44:06+00:00","index":"","fulltext":""},{"type":"submitted","content":"Scientific Reports","date":"2026-04-19T10:39:28+00:00","index":"","fulltext":""}],"status":"published","journal":{"display":true,"email":"
[email protected]","identity":"scientific-reports","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"scirep","sideBox":"Learn more about [Scientific Reports](http://www.nature.com/srep/)","snPcode":"","submissionUrl":"","title":"Scientific Reports","twitterHandle":"","acdcEnabled":true,"dfaEnabled":true,"editorialSystem":"stoa","reportingPortfolio":"Scientific Reports","inReviewEnabled":true,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"8945f822-993c-4b2f-b308-180fbb3e7f2a","owner":[],"postedDate":"April 30th, 2026","published":true,"recentEditorialEvents":[{"type":"decision","content":"Revision requested","date":"2026-05-14T06:26:45+00:00","index":"","fulltext":""},{"type":"editorInvitedReview","content":"","date":"2026-05-12T21:59:45+00:00","index":173,"fulltext":""},{"type":"editorInvitedReview","content":"","date":"2026-05-11T12:44:35+00:00","index":170,"fulltext":""},{"type":"reviewerAgreed","content":"158754413443167736455156823208988186700","date":"2026-05-09T21:53:56+00:00","index":166,"fulltext":""},{"type":"editorInvitedReview","content":"","date":"2026-05-09T07:01:51+00:00","index":165,"fulltext":""},{"type":"reviewerAgreed","content":"318111460081744496830823812639705393831","date":"2026-05-08T13:28:44+00:00","index":163,"fulltext":""},{"type":"reviewerAgreed","content":"273245800420803287473688948063815158171","date":"2026-05-08T09:03:08+00:00","index":161,"fulltext":""},{"type":"reviewerAgreed","content":"195422697790073256611158644563747994925","date":"2026-05-07T14:50:18+00:00","index":159,"fulltext":""},{"type":"reviewerAgreed","content":"202864569595477086937734693741155082306","date":"2026-05-06T09:12:25+00:00","index":152,"fulltext":""},{"type":"editorInvitedReview","content":"","date":"2026-05-04T01:30:59+00:00","index":105,"fulltext":""}],"rejectedJournal":[],"revision":"","amendment":"","status":"in-revision","subjectAreas":[{"id":67181393,"name":"Biological sciences/Computational biology and 
bioinformatics"},{"id":67181394,"name":"Physical sciences/Mathematics and computing"}],"tags":[],"updatedAt":"2026-05-14T06:40:26+00:00","versionOfRecord":[],"versionCreatedAt":"2026-04-30 10:11:14","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-9421190","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-9421190","identity":"rs-9421190","version":["v1"]},"buildId":"XKTyCvWXoU3ODBz1xrDgd","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}