Beyond Global Metrics: A Zone-Stratified Diagnostic Framework Based on Confusion Matrix Components

doi:10.21203/rs.3.rs-7539984/v1

2025 · doi:10.21203/rs.3.rs-7539984/v1

preprint OA: closed


Research Article

Tareef Fadhil Raham

This is a preprint; it has not been peer reviewed by a journal.

https://doi.org/10.21203/rs.3.rs-7539984/v1

This work is licensed under a CC BY 4.0 License

Status: Posted (Version 1)

Abstract

Objective: Traditional diagnostic metrics summarize model performance globally but often obscure localized vulnerabilities. This study proposes a zone-stratified evaluation framework to assess diagnostic reliability across confidence gradients.

Methods: Predicted probabilities were stratified into five diagnostic confidence zones using the Robust Adjusted Mean Interval (RAMI) approach: Recessive, Lower Plausible, Trusted, Upper Plausible, and Inflated.
Within each zone, confusion matrix components were used to compute: the Zone-Relative Epistemic Divergence Index (ZREDI) for directional trust asymmetry, the Epistemic Equivalence Point (EEP) for calibration symmetry, the Zone-Based Error Skew (ZBES) for misclassification bias, and the Zone-Balanced Error (ZBE) for class-neutral error burden. A higher-order construct, the Zone-Based Tension State (ZBTS), was introduced to quantify overall epistemic instability across diagnostic strata.

Results: Simulated assessments and real-world diagnostic data were used to uncover zone-specific diagnostic behaviors often masked by traditional global metrics. In both datasets, the Trusted Zone (TZ) consistently demonstrated calibration stability, with EEPs clustering near zero. In contrast, the outer zones, particularly the Recessive and Inflated regions, exhibited pronounced shifts in ZBES and ZREDI, indicating increased risks of underdiagnosis, overdiagnosis, or erosion of diagnostic trust. Furthermore, ZBTS emphasized the interplay between ZBES and ZREDI, highlighting zones of epistemic fragility and enabling localized performance audits for regulatory transparency and safety profiling.

Conclusions: Zone-stratified metrics provide actionable insights into diagnostic model behavior, improving interpretability and safety. This framework advances beyond aggregate measures to support better threshold tuning, risk calibration, and clinical deployment, especially where model trust must align with uncertainty.

Keywords: Epistemic Equivalence Point (EEP); Zone-Relative Epistemic Divergence Index (ZREDI); Zone-Based Error Skew (ZBES); Zone-Balanced Error (ZBE) Index; Zone-Based Tension State (ZBTS)

1. Introduction

Diagnostic models are increasingly used to support clinical decision-making, risk stratification, and early disease detection.
Traditional evaluation metrics, such as accuracy, sensitivity, and specificity, offer global assessments of model performance. However, these metrics often fail to capture local reliability, especially in diagnostically uncertain regions or when data are imbalanced [1,2]. Although widely used, ROC curves and AUC values summarize performance globally and may obscure important diagnostic failures within specific regions of prediction confidence, even in more localized approaches such as Zone-Restricted ROC (ZR-ROC) [3–5]. These metrics treat all errors equally, irrespective of whether they occur in high-certainty regions or in transitional, borderline zones, where misclassifications may carry greater clinical consequences.

Although the F1 score is widely used for evaluating binary classifiers, it reflects performance only on the positive class, as the harmonic mean of precision (PPV) and recall (sensitivity) [2]. This property makes it particularly useful in class-imbalanced datasets, where accurate positive detection is prioritized. However, the F1 score does not reflect the model’s performance on the negative class, nor does it incorporate the variability of predictive reliability across different score thresholds or confidence strata. As such, it overlooks error distribution across other regions of the prediction space, particularly in zones of diagnostic uncertainty [1]. Consequently, it provides a limited, and potentially misleading, view of diagnostic robustness in zone-based clinical decision contexts, particularly where false positives (FPs) and false negatives (FNs) carry asymmetric clinical consequences [3,4].

While scalar measures such as the Predictive Summary Index (PSI), defined as PPV + NPV − 1, have been proposed to summarize the overall post-test utility of diagnostic models, their interpretive value remains limited in high-stakes clinical settings.
PSI ranges from −1 (completely misleading test) to +1 (perfect prediction), providing a concise, user-facing summary of predictive usefulness [6]. However, despite its intuitive appeal, PSI suffers from several limitations: it ignores class-specific misclassification consequences, fails to account for prediction confidence stratification, and provides no insight into where within the prediction spectrum failures occur. It functions as a global average and cannot identify where in the diagnostic score range misclassification risk is concentrated, such as in boundary zones of epistemic uncertainty [7,8]. As such, it may obscure zone-specific fragility, particularly in contexts where false reassurance or overtreatment can lead to significant harm. Furthermore, it is highly sensitive to prevalence, making it unreliable across heterogeneous populations or disease distributions [7–9]. In addition, the PSI provides no insight into the directionality of prediction errors, namely whether a model tends to systematically overpredict or underpredict a condition. Nor does it capture localized collapses in reliability across the diagnostic spectrum. These limitations render PSI insufficient for nuanced interpretability and safety governance in clinical AI, particularly in settings where error asymmetry and zone-specific trust calibration are critical. As a result, there is a growing need for zone-aware evaluation metrics that can characterize directional reliability and misclassification patterns across stratified regions of epistemic uncertainty [3,8,10].

In response to these limitations, recent approaches such as ZR-ROC analysis have been developed to evaluate model performance within epistemically stratified regions, such as zones defined by predicted score distributions, likelihood intervals, or adjusted means [11]. These methods enhance diagnostic transparency by isolating performance in areas of increased uncertainty.
However, they remain focused on traditional discriminative metrics (e.g., true positive rate vs. false positive rate) and do not capture the directional reliability of predictions, such as the extent to which a model confidently rules in versus rules out a condition within a given zone. To address these gaps, we introduce a complementary framework of zone-stratified reliability metrics that emphasize both predictive trust and directional diagnostic bias within epistemically defined regions.

This suite of zone-stratified diagnostic metrics is designed to assess prediction reliability, calibration balance, and error dynamics across epistemically meaningful strata of diagnostic confidence. These metrics move beyond global performance indicators such as AUC or accuracy, offering localized insight into model behavior in clinically sensitive regions. The core components of the framework include:

Zone-Relative Epistemic Divergence Index (ZREDI): Quantifies calibration asymmetry by measuring the divergence between PPV and NPV within each zone, highlighting trust imbalance between outcome classes.

Zone-Based Error Skew (ZBES): Captures the directionality of misclassification, defined as the difference between the false negative rate (FNR) and false positive rate (FPR). It provides a signed measure of diagnostic bias and complements value-based metrics like ZREDI.

Zone-Based Tension State (ZBTS): A composite interpretive construct defined as ZREDI − ZBES, capturing the epistemic tension between predictive asymmetry and error skew. ZBTS helps identify zones where reliability and misclassification are incongruent or paradoxically aligned.

Together, these metrics enable zone-specific evaluation of diagnostic systems across the full spectrum of clinical risk and diagnostic zones. This framework supports threshold refinement, context-aware model evaluation, and epistemically governed deployment of machine learning models in high-stakes clinical settings.

2. Methods and materials

2.1. Diagnostic Zone Stratification

The dataset was stratified into diagnostic confidence zones using the RAMI framework, which partitions model predictions based on their standardized distance from an adjusted mean score. This standardized distance functions analogously to the standard deviation (SD) in classical statistics, representing how far individual predictive scores deviate from a central reference point, thereby quantifying the degree of epistemic divergence from calibrated expectation. In our epistemic framework, this robustly adjusted central reference point is termed the Credence Value (CV), a stable proxy for the adjusted mean around which diagnostic confidence and epistemic stratification are systematically evaluated. Within the RAMI framework, a unit RAMI distance corresponds approximately to the distance between the adjusted mean and ±1 SD, offering a normalized scale for assessing predictive certainty. This framework functions as an epistemic analogue to classical confidence intervals, but instead of merely bounding statistical estimates, it reflects the operational reality of model trust, calibration integrity, and interpretive stability across predictive strata.

Each sample was classified into one of five epistemic zones: the Recessive Zone (RZ), Lower Plausible Zone (LPZ), Trusted Zone (TZ), Upper Plausible Zone (UPZ), or Inflated Zone (IZ). These zones reflect differing degrees of epistemic uncertainty, with the TZ representing predictions made with the highest level of diagnostic confidence and minimal asymmetry. The Lower and Upper Plausible Zones serve as transitional regions on either side of the TZ, where model predictions become less stable and the risk of misclassification begins to rise. At the extremes, the RZ is typically dominated by true negatives (TNs) and may correspond to underdiagnosis, while the IZ is dominated by true positives (TPs) and may reflect overdiagnosis or excessive certainty.
This stratification provides a foundation for zone-specific performance analysis, emphasizing local diagnostic reliability rather than global accuracy alone. The RAMI-based stratification framework is currently under refinement [12]. All intellectual property related to its structure, thresholds, and terminology is retained by the author.

2.2. Confusion Matrix Elements per Zone and Metric Computation Workflow

The computation of diagnostic metrics was conducted on a zone-specific basis using the RAMI-stratified dataset. Each observation was assigned to one of five diagnostic confidence zones (Recessive, Lower Plausible, Trusted, Upper Plausible, or Inflated) based on RAMI values. Within each zone, the core confusion matrix components were computed: true positives (TPs), true negatives (TNs), FPs, and FNs. TP referred to correctly identified diseased cases, TN to correctly identified non-diseased cases, FP to non-diseased cases incorrectly classified as diseased, and FN to diseased cases missed by the model. These foundational counts enabled the calculation of both traditional performance metrics (e.g., sensitivity, specificity, PPV, NPV) and the novel zone-stratified metrics introduced in this study, namely ZREDI, ZBES, ZBE, and ZBTS. This localized approach allowed precise characterization of model behavior across varying levels of diagnostic confidence, offering granular insights into reliability, misclassification asymmetry, and epistemic stability.

2.3. Algorithmic Pipeline for RAMI-Stratified Epistemic Metric Extraction: Multistage Computation of ZREDI, Global Divergence Index (GDI), ZBES, ZBE, Epistemic Equivalence Point (EEP), DNP, EPEP, and ZBTS

Step 1: Stratify the dataset into diagnostic confidence zones: Recessive, Lower Plausible (LPZ), Trusted, Upper Plausible (UPZ), and Inflated, using standard RAMI-derived thresholds (±1 SE and ±2 SE).
Alternatively, users may define desired zone boundaries to align with specific diagnostic cut-offs, regulatory guidelines, or study protocols. This optional flexibility allows the framework to adapt to study-specific diagnostic thresholds, rare-event sampling considerations, and validation protocols that require non-standard confidence intervals. Within the Recessive Zone (RZ) and Inflated Zone (IZ), values that extend beyond the 1st Recessive Value (1st RV) or 1st Inflated Value (1st IV) remain part of RZ/IZ.

Step 2: For each zone z, calculate TPz, TNz, FPz, and FNz.

Step 3: Derive the Zone-Relative Epistemic Divergence Index (ZREDI), calculated as the signed difference between PPVz and NPVz:

$$ZREDI_z = PPV_z - NPV_z = \frac{TP_z}{TP_z + FP_z} - \frac{TN_z}{TN_z + FN_z}$$

Step 4: Identify the EEP, where the positive predictive value (PPV) equals the negative predictive value (NPV):

$$PPV_z = NPV_z \iff \frac{TP_z}{TP_z + FP_z} = \frac{TN_z}{TN_z + FN_z}$$

Step 5: Obtain the Global Divergence Index (GDI = total PPV − total NPV).

Step 6: To capture directional misclassification, compute the false negative rate (FNR) and false positive rate (FPR):

$$FNR_z = \frac{FN_z}{TP_z + FN_z}, \qquad FPR_z = \frac{FP_z}{TN_z + FP_z}$$

Step 7: Compute ZBES for each zone:

$$ZBES_z = \frac{FN_z}{TP_z + FN_z} - \frac{FP_z}{TN_z + FP_z} = FNR_z - FPR_z$$

All metrics were computed independently for each RAMI-defined diagnostic zone.

Step 8: Find the Diagnostic Neutrality Point (DNP), which occurs when FNRz = FPRz, i.e., ZBESz = 0.

Step 9: Find the Epistemic Pressure Equilibrium Point (EPEP), which occurs when ZREDIz = ZBESz.

Step 10: Find the Zone-Based Tension State (ZBTS): ZBTS = ZREDI − ZBES.

Step 11: Find the GDI − PSI delta.

2.4. Validation Assessment

To ensure both theoretical soundness and practical relevance, we employed a two-part validation strategy integrating both simulation and real-world data.
First, synthetic datasets were developed to emulate the behavior of a binary classifier, such as those used in HbA1c-based diagnostic systems. In this simulation, we modeled an idealized diagnostic setting with balanced class distribution and minimal noise, enabling baseline evaluation of zone-based metrics (including ZREDI, ZBES, ZBE, and ZBTS) under stable epistemic conditions. Each simulated case was assigned a RAMI score (representing standardized distance from the diagnostic threshold), a binary ground-truth label, and a predicted label based on a deterministic rule: RAMI ≥ 0 predicted positive, while RAMI < 0 predicted negative. RAMI scores were then stratified into five predefined diagnostic confidence zones to facilitate localized error analysis. This structure revealed zone-specific fragilities, with underdiagnosis concentrated in RZ and LPZ, overdiagnosis in UPZ and IZ, and more balanced behavior in the TZ. Finally, the zone-based evaluation framework was applied to a real-world AFP–HCC dataset to demonstrate clinical utility and interpretability. Together, these validation components support the reliability, generalizability, and epistemic insight offered by zone-stratified metrics across both simulated and real diagnostic contexts.

3. Results

3.1. Simulated Example

A total of 250 simulated patient cases (Appendix 1) were generated to emulate a binary classification system, with RAMI values calibrated to mimic the behavior of HbA1c-based diagnostic thresholds. These cases were stratified into five diagnostic confidence zones based on RAMI scores. Zone-specific classification performance was evaluated using standard and novel metrics, including PPV, NPV, ZREDI, ZBES, and ZBTS. The results of this zone-stratified diagnostic evaluation are summarized in Table 1.

Table 1. Zone-Stratified Diagnostic Reliability and Error Metrics

Summary of predictive and error-based diagnostic metrics across epistemic confidence zones.
Metrics include PPV, NPV, ZREDI, ZBES, and ZBTS. Global values (GDI, PSI) are included for comparison.

| Zone | TP | TN | FP | FN | PPV (TP / [TP+FP]) | NPV (TN / [TN+FN]) | ZREDI (PPV − NPV) | FNR | FPR | ZBES (FNR − FPR) | ZBTS (ZREDI − ZBES) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Recessive | 0 | 32 | 0 | 18 | NC | 0.64 | NC | 1 | 0 | 1 | NC |
| Lower Plausible | 0 | 34 | 0 | 16 | NC | 0.68 | NC | 1 | 0 | 1 | NC |
| Trusted | 13 | 13 | 13 | 11 | 0.5 | 0.541667 | -0.04167 | 0.458333 | 0.5 | -0.04167 | 0 |
| Upper Plausible | 42 | 0 | 8 | 0 | 0.84 | NC | NC | 0 | 1 | -1 | NC |
| Inflated | 50 | 0 | 0 | 0 | 1 | NC | NC | 0 | NC | NC | NC |
| Total (global) | 105 | 79 | 21 | 45 | 0.833333 | 0.637097 | GDI = 0.196 | 0.3 | 0.21 | PSI = 0.09 | 0.106 |

* NC = Not Computable due to lack of class-specific signal.

Table 1 presents zone-stratified diagnostic performance metrics derived from Example 1, including key indicators such as PPV, NPV, ZREDI (computed as PPV minus NPV), ZBES (defined as the difference between the FNR and FPR), and ZBTS (representing the residual between ZREDI and ZBES). Together, these metrics uncover latent diagnostic imbalances distributed across the epistemic confidence strata and serve to evaluate the alignment between predictive reliability and the distribution of classification errors.

As depicted in Figure 1, PPV exhibits a monotonic increase across the stratified zones, culminating at a value of 1.0 in the IZ, where every positive prediction is correct. Conversely, NPV reaches its peak in the LPZ (0.68), tapering to 0.54 in the TZ. This decline reflects diminishing confidence in negative predictions as one moves from low- to mid-epistemic regions. The TZ, notably, marks the EEP, the zone where PPV and NPV converge (0.50 and 0.54, respectively), and represents the point of maximal post-test symmetry and diagnostic balance.

In zones where TPz + FPz = 0 or TNz + FNz = 0, ZREDI becomes undefined, denoted as “Not Computable” (NC). This arises from division by zero in the predictive value calculations and signals epistemic vacuity: a complete absence of class-specific predictive or diagnostic signal.
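The per-zone computations of Section 2.3 and the NC convention can be illustrated with a minimal Python sketch. The function names (`safe_ratio`, `difference`, `zone_metrics`) and the use of `None` to stand for NC are assumptions made for this illustration, not part of the published framework; the counts are those reported in Table 1.

```python
from typing import Dict, Optional, Tuple


def safe_ratio(numerator: int, denominator: int) -> Optional[float]:
    """Return numerator / denominator, or None (NC) when the denominator is zero."""
    return numerator / denominator if denominator else None


def difference(a: Optional[float], b: Optional[float]) -> Optional[float]:
    """Return a - b, propagating None when either operand is not computable."""
    return None if a is None or b is None else a - b


def zone_metrics(tp: int, tn: int, fp: int, fn: int) -> Dict[str, Optional[float]]:
    """Compute PPV, NPV, FNR, FPR, ZREDI, ZBES and ZBTS for one zone."""
    ppv = safe_ratio(tp, tp + fp)
    npv = safe_ratio(tn, tn + fn)
    fnr = safe_ratio(fn, tp + fn)
    fpr = safe_ratio(fp, tn + fp)
    zredi = difference(ppv, npv)   # ZREDI = PPV - NPV
    zbes = difference(fnr, fpr)    # ZBES  = FNR - FPR
    zbts = difference(zredi, zbes)  # ZBTS  = ZREDI - ZBES
    return {"PPV": ppv, "NPV": npv, "FNR": fnr, "FPR": fpr,
            "ZREDI": zredi, "ZBES": zbes, "ZBTS": zbts}


# Counts (TP, TN, FP, FN) as reported in Table 1.
TABLE_1: Dict[str, Tuple[int, int, int, int]] = {
    "Recessive": (0, 32, 0, 18),
    "Lower Plausible": (0, 34, 0, 16),
    "Trusted": (13, 13, 13, 11),
    "Upper Plausible": (42, 0, 8, 0),
    "Inflated": (50, 0, 0, 0),
    "Total (global)": (105, 79, 21, 45),
}

if __name__ == "__main__":
    for zone, counts in TABLE_1.items():
        print(zone, zone_metrics(*counts))
```

Propagating `None` rather than substituting zero keeps an undefined predictive value from being read as a measured one, which mirrors the NC entries in Table 1.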
Epistemic vacuity of this kind is observed in the Recessive, Lower Plausible, Upper Plausible, and Inflated Zones, each of which lacks the necessary conditions for computing PPV, NPV, or both. These vacuous zones thus reflect either diagnostic collapse or an overconfident skew.

As further illustrated in Figure 2, the TZ stands out as the convergence nexus of three epistemic landmarks: the EEP (where PPV approximates NPV), the DNP (where ZBES equals zero), and the EPEP (defined by the equality of ZREDI and ZBES). At this critical junction, the ZBTS value is minimized (≈ 0), indicating a state of low epistemic tension and balanced diagnostic behavior.

In contrast, zones outside the TZ manifest polarized diagnostic distortions. The Recessive and Lower Plausible Zones exhibit underdiagnosis, characterized by undefined ZREDI, extreme positive ZBES (equal to +1.0), and the complete absence of TPs, reflecting a one-sided reliance on negative classifications. On the other hand, the Upper Plausible and Inflated Zones demonstrate signs of overdiagnosis. Here, ZREDI is undefined because NPV cannot be computed alongside an inflated PPV, and ZBES is either strongly negative (−1 in the UPZ) or not computable (IZ), indicating a heavy skew toward FPs or complete certainty without counterbalance.

Although global diagnostic metrics such as PPV (0.833), NPV (0.637), the Global Divergence Index (GDI = 0.196), and the Predictive Skew Index (PSI = 0.09) suggest relatively good overall model performance, these aggregate measures obscure the heterogeneity of reliability and error across zones. The Recessive and Lower Plausible Zones, for instance, suffer from complete diagnostic failure (TP = 0), with undefined ZREDI and maximal ZBES (+1.0), despite the favorable global PPV. Only the TZ maintains near-convergence of reliability and error (ZREDI ≈ ZBES ≈ −0.0417), validating its designation as a post-test equilibrium region.
The global ZBTS value of 0.106 quantifies the overall discrepancy between reliability divergence and error skew, highlighting the underlying epistemic friction and reinforcing the utility of zone-aware metrics such as ZREDI, ZBES, and ZBTS for robust and context-sensitive diagnostic evaluation.

3.1.1. Valuation of Clinical Utility and Zonal Error Burden Using ZBE

Figure 3 presents a zone-stratified analysis of clinical utility metrics in Example 1, focusing on Net Benefit, Expected Diagnostic Cost, and Zone-Balanced Error (ZBE). Net Benefit is calculated at a threshold of 0.5, while Expected Diagnostic Cost is derived under the assumption that FNs carry twice the cost of FPs. ZBE, introduced here as a class-neutral and interpretable measure, captures the average of FNR and FPR, quantifying overall diagnostic misclassification within each epistemic zone.

The TZ stands out as the optimal region of diagnostic operation, where error burden is lowest (ZBE = 0.479), and cost and benefit metrics are relatively balanced. While Net Benefit is neutral in the Recessive, Lower Plausible, and Trusted Zones, it sharply increases in the Upper Plausible (0.68) and Inflated (1.0) Zones, indicating the greatest clinical gain in high-confidence regions. These results, supported by Appendix 2, underscore that clinical utility concentrates in zones of elevated epistemic confidence.

Expected Diagnostic Cost (Appendix 3) is highest in the Recessive and Lower Plausible Zones due to the predominance of FNs and absence of TPs, with costs of 1.8 and 1.6, respectively. In contrast, cost drops sharply in the UPZ (0.16) and is neutral in the IZ (0.0), supporting efficient diagnostic decision-making when the model is highly certain. ZBE values from Appendix 4 confirm that the Recessive and Lower Plausible Zones exhibit the highest misclassification burden (ZBE = 0.5), while the IZ has perfect classification (ZBE = 0.0).
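The Net Benefit and ZBE figures quoted above can be checked against the Table 1 counts with a short Python sketch. It assumes the standard decision-curve definition of net benefit, TP/n − (FP/n) · pt/(1 − pt), which reproduces the reported 0.68 and 1.0 at pt = 0.5, and it takes ZBE as the plain average of FNR and FPR. Returning `None` when either rate has a zero denominator is a choice made for this sketch; the text reports ZBE = 0.0 for the Inflated Zone, where FPR is not computable. The function names are hypothetical.

```python
from typing import Dict, Optional, Tuple


def net_benefit(tp: int, fp: int, n: int, threshold: float) -> float:
    """Decision-curve net benefit: TP/n - (FP/n) * pt / (1 - pt)."""
    weight = threshold / (1.0 - threshold)
    return tp / n - (fp / n) * weight


def zone_balanced_error(tp: int, tn: int, fp: int, fn: int) -> Optional[float]:
    """ZBE = (FNR + FPR) / 2, or None when either rate is undefined."""
    if tp + fn == 0 or tn + fp == 0:
        return None
    fnr = fn / (tp + fn)
    fpr = fp / (tn + fp)
    return (fnr + fpr) / 2.0


# Counts (TP, TN, FP, FN) as reported in Table 1.
ZONES: Dict[str, Tuple[int, int, int, int]] = {
    "Recessive": (0, 32, 0, 18),
    "Lower Plausible": (0, 34, 0, 16),
    "Trusted": (13, 13, 13, 11),
    "Upper Plausible": (42, 0, 8, 0),
    "Inflated": (50, 0, 0, 0),
}

if __name__ == "__main__":
    for zone, (tp, tn, fp, fn) in ZONES.items():
        n = tp + tn + fp + fn
        nb = net_benefit(tp, fp, n, threshold=0.5)
        zbe = zone_balanced_error(tp, tn, fp, fn)
        print(f"{zone}: net benefit = {nb:.3f}, ZBE = {zbe}")
```

At a threshold of 0.5 the weight pt/(1 − pt) equals 1, so net benefit reduces to (TP − FP)/n, which is why it is zero in the Trusted Zone, where TP and FP are equal.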
The TZ again emerges as the most balanced region in terms of reliability and error (ZBE = 0.479), validating its role as a diagnostic anchor. Together, these findings highlight that global metrics, while informative (global Net Benefit = 0.337; global Expected Diagnostic Cost = 0.942; global ZBE = 0.396), risk obscuring critical intra-zonal variability (Appendices 2, 3, and 4).

Figure 4 illustrates the net clinical benefit of predictive decisions across diagnostic confidence zones as a function of threshold probability. The TZ exhibits the highest net benefit across a broad range of thresholds, confirming its superior diagnostic utility and balance between sensitivity and specificity. The UPZ and LPZ show moderate net benefit, indicating their potential clinical value under selective conditions. In contrast, the RZ and IZ yield the lowest net benefits, reflecting diagnostic inefficiency due to high misclassification risk (underdiagnosis and overdiagnosis, respectively). This zone-based decision curve analysis (DCA) supports threshold-sensitive decision-making by quantifying real-world trade-offs between benefit and harm across epistemic strata.

The application of clinical utility metrics at the zone level represents an innovative advancement in evaluating diagnostic model performance. By integrating Net Benefit, Expected Diagnostic Cost, and the newly introduced Zone-Balanced Error (ZBE) within RAMI-stratified diagnostic zones, this approach transcends conventional, aggregate-level assessments. While global metrics offer a useful summary, they often obscure critical intra-zonal disparities that directly impact clinical decision-making. Zone-stratified evaluation, by contrast, reveals how diagnostic reliability, misclassification burden, and decision value fluctuate across varying levels of model certainty. This granular insight enables more informed deployment of diagnostic tools, targeting zones of high confidence while flagging regions of epistemic fragility.
As such, zone-based clinical utility auditing not only enhances interpretability but also establishes a more risk-aware and evidence-aligned framework for responsible clinical model validation.

3.2. Real-World Diagnostic Example

In a diagnostic evaluation of hepatocellular carcinoma (HCC), 401 archived plasma samples were analyzed, comprising 208 samples from confirmed HCC patients and 193 from liver cirrhosis patients without HCC, who served as clinical controls. This dataset is derived from the study by Jang et al. (2016), PLoS ONE (https://doi.org/10.1371/journal.pone.0151069) [13,14]. In this context, the diagnostic performance of alpha-fetoprotein (AFP) was assessed specifically in distinguishing HCC from cirrhosis, a clinically critical challenge since patients with cirrhosis represent the key population targeted for early HCC surveillance. Accordingly, the resulting confusion matrix and all derived diagnostic metrics, summarized in Tables 2 and 3, capture AFP’s capacity to discriminate malignant (HCC) from non-malignant (cirrhosis) chronic liver disease within a clinically relevant high-risk population. This real-world dataset enables a grounded application of the zone-stratified framework, allowing for fine-grained analysis of biomarker reliability and diagnostic utility across different levels of model certainty.

In this real-world example, epistemic zone boundaries were predefined to match widely accepted AFP diagnostic ranges used in HCC surveillance rather than being derived from statistical confidence intervals. This approach ensured immediate clinical relevance while maintaining compatibility with zone-based calibration for interpretability.

Table 2. Zone-Stratified Diagnostic Classification of AFP Levels Across Epistemic Confidence Zones

This table summarizes zone-wise diagnostic classification counts (TP, FN, FP, TN) across five AFP concentration ranges, aligned with increasing diagnostic confidence.
By mapping AFP levels to epistemic zones, from low-certainty (Recessive) to high-certainty (Inflated), the stratification enables detailed assessment of AFP’s ability to distinguish HCC from cirrhosis within a high-risk population.

| AFP Zone (ng/mL) | Epistemic Zone | TP | FN | FP | TN |
|---|---|---|---|---|---|
| >1000 | Inflated | 8 | 2 | 0 | 2 |

Table 3. Epistemic metrics of AFP diagnostic performance across confidence zones

This table summarizes zone-stratified epistemic metrics for AFP performance, highlighting asymmetries in predictive reliability and error across diagnostic confidence zones. ZBTS values reveal zones of underconfidence (negative values) and overconfidence (positive values), offering a refined lens to interpret AFP’s diagnostic behavior across clinically meaningful thresholds.

| AFP Zone (ng/mL) | ZREDI | ZBES | ZBTS |
|---|---|---|---|
| >1000 | 0.5 | 0.2 | 0.3 |

Global: GDI = 0.1857; PSI = 0.2862; Delta (GDI − PSI) = −0.1005

The application of epistemic indices across stratified AFP diagnostic zones revealed distinct patterns in predictive reliability and error distribution. ZREDI (Zone-Relative Epistemic Divergence Index) and ZBES varied meaningfully across AFP ranges, with ZBTS capturing latent diagnostic tensions. The Recessive (<5 ng/mL) and Lower Plausible (5–20 ng/mL) zones exhibited negative ZREDI values and high ZBES, indicating underconfident behavior characterized by conservative predictions coexisting with substantial misclassification. In contrast, the Trusted (20–200 ng/mL), Upper Plausible (200–1000 ng/mL), and Inflated (>1000 ng/mL) zones demonstrated increasingly positive ZREDI with declining or stabilizing ZBES, suggesting overconfident diagnostic behavior: high apparent reliability despite error asymmetry.
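The predefined AFP ranges above amount to a simple lookup from concentration to epistemic zone, sketched below in Python. The cut-offs (5, 20, 200, and 1000 ng/mL) come from the text; the treatment of values that fall exactly on 20 or 200 ng/mL is not stated in the source, so assigning them to the higher zone is an assumption of this sketch, as is the function name `afp_zone`.

```python
def afp_zone(afp_ng_ml: float) -> str:
    """Map an AFP concentration (ng/mL) to one of the five epistemic zones.

    Cut-offs follow the ranges given in the text. Values exactly equal to
    20 or 200 ng/mL are placed in the higher zone (an assumption); the text
    fixes only that Recessive is < 5 and Inflated is > 1000.
    """
    if afp_ng_ml < 0:
        raise ValueError("AFP concentration cannot be negative")
    if afp_ng_ml < 5:
        return "Recessive"
    if afp_ng_ml < 20:
        return "Lower Plausible"
    if afp_ng_ml < 200:
        return "Trusted"
    if afp_ng_ml <= 1000:
        return "Upper Plausible"
    return "Inflated"


if __name__ == "__main__":
    # Hypothetical AFP values, chosen only to exercise each branch.
    for value in (3.2, 12.0, 150.0, 640.0, 2500.0):
        print(f"{value:>7.1f} ng/mL -> {afp_zone(value)}")
```

Once each sample carries a zone label, the TP, FN, FP, and TN counts of Table 2 follow from tallying ground truth against the test decision within each label.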
Three key epistemic landmarks were identified: the EEP, where ZREDI = 0 and predictive calibration is symmetrical (PPV = NPV), was found at AFP ≈ 8.56 ng/mL; the intersection of ZREDI and ZBES (AFP ≈ 50.98 ng/mL) reflected equilibrium between calibration divergence and error skew; and a DNP, defined by ZBES = 0 (FNR = FPR), occurred near AFP ≈ 300.32 ng/mL. These results highlight how epistemic analysis of diagnostic zones uncovers nuanced reliability profiles beyond conventional sensitivity or specificity, with implications for threshold optimization and interpretive guidance in HCC screening (Figs. 5 and 6).

To contextualize epistemic performance in terms of actionable clinical value, we applied a three-part Zone-Based Clinical Utility Evaluation to the real AFP dataset. First, Zone-Stratified Net Benefit was assessed using Decision Curve Analysis (DCA), capturing the clinical trade-off between TPs and FPs across AFP-defined diagnostic zones. This revealed that the Trusted and Upper Plausible zones yielded the highest net benefit, particularly at intermediate threshold probabilities. Second, Expected Diagnostic Cost was calculated by assigning penalty weights to FPs and FNs, demonstrating that cost-efficiency improves markedly once AFP levels exceed 200 ng/mL, corresponding to reduced misclassification in those zones. Third, we introduced the Zone-Balanced Error (ZBE) metric, computed as the average of false positive and false negative rates within each zone, providing a class-neutral estimate of misclassification burden. ZBE further distinguished zones of error symmetry, offering complementary insight to the directional asymmetry captured by ZREDI and ZBES.

Together, these utility metrics reveal that AFP levels above 200 ng/mL (Upper Plausible and Inflated zones) yield the highest clinical value, combining maximal net benefit, minimal cost, and lowest ZBE. The TZ (20–200 ng/mL) also shows strong utility, balancing moderate cost with high net benefit and low ZBE.
In contrast, lower AFP zones (<20 ng/mL) demonstrate high diagnostic cost, negligible net benefit, and elevated ZBE, suggesting limited utility and greater risk of misclassification. Overall, this integrated framework aligns epistemic and utility-based evaluations to inform evidence-weighted threshold selection in AFP-guided HCC screening (Fig. 7).

4. Discussion

4.1. Zone-Stratified Diagnostic Evaluation Framework

This study introduces a zone-stratified diagnostic evaluation framework anchored by four interrelated metrics: ZREDI, ZBES, Zone-Balanced Error (ZBE), and the composite ZBTS. Unlike traditional global metrics such as sensitivity, specificity, or AUC, which provide aggregate summaries, these zone-specific measures offer granular, epistemically grounded insights into diagnostic model behavior across varying levels of certainty. The framework is applied across five defined diagnostic zones (Recessive, Lower Plausible, Trusted, Upper Plausible, and Inflated), each corresponding to a distinct confidence tier that shapes interpretive value and clinical decision-making.

Researchers may rely on RAMI-derived intervals when prioritizing uncertainty quantification, or alternatively use pre-established clinical thresholds when aiming for alignment with regulatory guidelines and real-world validation protocols. This is well demonstrated by the observed misalignment between RAMI-derived probabilistic thresholds and clinically established decision levels, which highlighted the need to adopt predefined AFP cut-offs in the real-world example of AFP-based HCC diagnostics. These thresholds, widely recommended in HCC surveillance protocols, ensured both clinical relevance and practical interpretability of the diagnostic framework. This finding underscores the flexibility of the zone-stratified framework. In this HCC evaluation, we selected clinically accepted AFP ranges rather than deriving boundaries from statistical confidence intervals.
This approach preserved immediate clinical applicability while still enabling zone-based calibration, thereby demonstrating the framework’s adaptability across diverse diagnostic, regulatory, and research contexts.

4.2. Zone-Relative Predictive Symmetry and the EEP

The ZREDI quantifies directional asymmetry in predictive reliability by subtracting NPV from PPV within each diagnostic zone. A ZREDI value near zero reflects balanced trust in both positive and negative predictions, as observed in the TZ. As ZREDI diverges from zero, particularly beyond ±0.30, it indicates increasing asymmetry, signaling overconfidence in one class and reduced reliability in the other, which can elevate clinical risk. For example, highly positive ZREDI values suggest overdiagnosis potential (e.g., in IZs), while negative values imply underdiagnosis (e.g., in RZs). To summarize global post-test asymmetry, the Global Divergence Index (GDI = PPV − NPV) can be used. While informative, GDI lacks ZREDI’s localized granularity and may obscure stratified vulnerabilities. Thus, ZREDI offers a more actionable metric for zone-specific auditing and recalibration. The EEP marks the zone where ZREDI equals zero (that is, where PPV and NPV converge), indicating optimal diagnostic neutrality. This point serves as a key benchmark for model calibration and deployment readiness. In our simulation, the EEP aligned with the TZ, where sensitivity and NPV both approximated 54.17%, reflecting a rare state of post-test symmetry and maximal epistemic stability. When the EEP shifts into higher confidence zones (e.g., Upper Plausible or Inflated), it may suggest delayed calibration or overconfidence. Conversely, if it lies in lower zones, it may signal early but fragile equilibrium. Clinically, the EEP is more than a statistical construct: it acts as a meta-indicator of reliability. Its position guides diagnostic trust, threshold setting, and safety interventions.
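One way to locate the EEP numerically is to evaluate ZREDI on a grid of thresholds and interpolate where it changes sign. The sketch below assumes the ZREDI curve is approximately linear between neighbouring grid points; the function name `find_zero_crossing` and the sample values are hypothetical and are not taken from the study.

```python
from typing import Optional, Sequence, Tuple


def find_zero_crossing(points: Sequence[Tuple[float, float]]) -> Optional[float]:
    """Return the first threshold at which a piecewise-linear curve crosses zero.

    `points` holds (threshold, zredi) pairs sorted by threshold.
    Returns None when the curve never reaches or crosses zero.
    """
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if y0 == 0:
            return x0
        if y0 * y1 < 0:  # sign change between two neighbouring thresholds
            # Linear interpolation for the point where the curve equals zero.
            return x0 + (x1 - x0) * (0 - y0) / (y1 - y0)
    if points and points[-1][1] == 0:
        return points[-1][0]
    return None


# Hypothetical (AFP threshold, ZREDI) pairs, for illustration only.
curve = [(5.0, -0.25), (10.0, 0.25), (20.0, 0.5)]
eep = find_zero_crossing(curve)
print(eep)
```

The same helper could be applied to a ZBES curve to approximate the DNP, since that landmark is likewise defined by a zero value.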
Embedded within zone-stratified evaluation, the EEP and ZREDI together provide a dual-lens assessment of model performance, combining predictive directionality with balance, and supporting both interpretive clarity and deployment confidence. The ZREDI is intentionally constructed using PPV and NPV—two prediction-centered metrics that reflect the trustworthiness of a model’s outputs rather than its accuracy against ground truth. PPV (TP / [TP + FP]) measures the likelihood that a positive prediction is correct, while NPV (TN / [TN + FN]) provides the same for negative predictions. Because these values are conditioned on the model’s predictions, they align closely with how clinicians and end users interpret and act on diagnostic outputs in real time. This framing makes ZREDI a uniquely user-facing metric, offering a direct lens into the asymmetry of predictive reliability across zones. Unlike sensitivity or specificity—which are grounded in outcome-based performance—ZREDI answers a more immediate and clinically relevant question: “When the model predicts a positive or negative outcome, how reliable is that prediction?” Moreover, PPV and NPV act as practical proxies for epistemic certainty. When either metric is low, it signals a zone of local diagnostic fragility—indicating that the model is offering predictions unsupported by sufficient evidence. By taking the signed difference (PPV − NPV), ZREDI reveals the direction and magnitude of this reliability imbalance, uncovering potential overdiagnosis (when PPV ≫ NPV) or underdiagnosis (when NPV ≫ PPV) that may be obscured in global metrics. Ultimately, ZREDI enables granular, zone-specific auditing of trust asymmetry. It supports recalibration where needed, enhances interpretability for end users, and operationalizes the concept of epistemic risk. 
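The ZREDI definition above can be written as a short calculation on the confusion-matrix counts tallied within one zone. This is a minimal sketch: the function names and the example counts are hypothetical, and returning `None` for an undefined predictive value is a choice made here, not a rule stated in the source.

```python
from typing import Optional


def ppv(tp: int, fp: int) -> Optional[float]:
    """Positive predictive value TP / (TP + FP); None if no positive predictions."""
    return tp / (tp + fp) if (tp + fp) > 0 else None


def npv(tn: int, fn: int) -> Optional[float]:
    """Negative predictive value TN / (TN + FN); None if no negative predictions."""
    return tn / (tn + fn) if (tn + fn) > 0 else None


def zredi(tp: int, fp: int, tn: int, fn: int) -> Optional[float]:
    """Signed difference PPV - NPV for one zone; None if either term is undefined."""
    p, n = ppv(tp, fp), npv(tn, fn)
    if p is None or n is None:
        return None
    return p - n


# Hypothetical zone counts: PPV = 0.9 and NPV = 0.5, so ZREDI is about +0.4.
example = zredi(tp=9, fp=1, tn=5, fn=5)
print(example)
```

A zone with no positive predictions (for example `zredi(tp=0, fp=0, tn=5, fn=5)`) yields `None`, which keeps an undefined PPV from being silently treated as zero.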
By anchoring itself in predictive reliability rather than retrospective correctness, ZREDI bridges methodological precision with clinical applicability, offering a robust tool for epistemically informed model evaluation.

4.3. ZBES, DNP, and the Global Prediction Symmetry Index (PSI)

The ZBES is a directional diagnostic metric that quantifies the asymmetry of misclassification within each epistemic zone. Defined as the difference between the FNR and the FPR (ZBES = FNR − FPR), it reflects the net diagnostic bias of a model’s errors. A ZBES of zero denotes perfect error symmetry, while positive or negative values highlight skew toward underdiagnosis or overdiagnosis, respectively. This makes ZBES a powerful lens for detecting class-specific vulnerabilities that are often masked by global metrics such as accuracy or AUC. In our simulation, ZBES varied substantially across zones. The TZ achieved near-perfect symmetry (ZBES = +0.042), indicating well-balanced misclassification. Conversely, the Recessive and Inflated Zones reached extreme values of +1.0 and −1.0, respectively, indicating complete directional breakdown, dominated by FNs in the former and by FPs in the latter. Such skew reveals critical diagnostic fragility that can significantly impact clinical outcomes if left unrecognized. To guide interpretation, the magnitude of ZBES can be qualitatively categorized: |ZBES| ≤ 0.10 suggests excellent diagnostic balance; 0.11–0.30 reflects moderate skew; 0.31–0.50 indicates high misclassification risk; and values > 0.50 signify severe directional failure. Though mathematically equivalent to the difference between specificity and sensitivity (since FNR = 1 − sensitivity and FPR = 1 − specificity), ZBES offers distinct interpretive value by focusing explicitly on misclassification burden rather than correct detection. Importantly, ZBES also defines the DNP, the threshold at which FNR equals FPR (i.e., ZBES = 0). This point marks the zone of symmetrical diagnostic error, serving as a complementary benchmark to the EEP, where PPV equals NPV.
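The ZBES definition and its qualitative bands can be sketched as follows. Two assumptions are made here and are not stated in the source: the bands are applied to the absolute value of ZBES, and the gaps between the published ranges (for example between 0.10 and 0.11) are closed by treating the upper bound of each band as inclusive. Function names and counts are hypothetical.

```python
from typing import Optional


def zbes(tp: int, fp: int, tn: int, fn: int) -> Optional[float]:
    """Zone-level error skew FNR - FPR; None if either rate is undefined."""
    positives, negatives = tp + fn, fp + tn
    if positives == 0 or negatives == 0:
        return None
    fnr = fn / positives  # false negative rate = 1 - sensitivity
    fpr = fp / negatives  # false positive rate = 1 - specificity
    return fnr - fpr


def skew_category(value: float) -> str:
    """Map |ZBES| to the qualitative bands listed in the text (assumed inclusive)."""
    magnitude = abs(value)
    if magnitude <= 0.10:
        return "excellent balance"
    if magnitude <= 0.30:
        return "moderate skew"
    if magnitude <= 0.50:
        return "high misclassification risk"
    return "severe directional failure"


# Hypothetical zone: FNR = 0.5, FPR = 0.1, so ZBES is about +0.4.
moderate_zone = zbes(tp=5, fp=1, tn=9, fn=5)
# A zone in which every diseased case is missed and no false alarms occur.
all_missed = zbes(tp=0, fp=0, tn=10, fn=10)
print(moderate_zone, skew_category(moderate_zone))
print(all_missed, skew_category(all_missed))
```

The second example reproduces the +1.0 extreme described for a zone dominated by false negatives.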
While EEP signals equilibrium in predictive trust, the DNP indicates balance in diagnostic fallibility. When both occur within the same zone, typically the TZ, it suggests maximal epistemic stability and reinforces that zone’s clinical defensibility. To contextualize ZBES in a broader framework, we compare it to the Prediction Symmetry Index (PSI), a global analogue defined either as PSI = FNR − FPR or alternatively as PSI = (PPV + NPV) − 1. Both forms capture system-wide directional imbalance, with positive values indicating a tendency toward underdiagnosis and negative values suggesting overdiagnosis. While PSI provides an overarching view of error asymmetry, it lacks the spatial resolution offered by ZBES [15, 16]. By computing ZBES within each epistemic zone (e.g., Recessive, Trusted, Inflated), one can detect zone-specific misclassification patterns that may be obscured at the aggregate level. Thus, ZBES and PSI function in parallel: PSI offers global insight, while ZBES offers local granularity. Used together, they enable comprehensive auditing of diagnostic models within stratified or threshold-sensitive applications. Clinically, this integrated framework supports better triage decisions, recalibration strategies, and risk-aware deployment by identifying not just how often the model is wrong, but how and where its errors manifest.

4.4. ZBTS, Epistemic Tension, and the EPEP

The ZBTS is introduced as a composite interpretive construct derived from the interaction between the ZREDI and the ZBES. Defined as ZBTS = ZREDI − ZBES, this metric captures the divergence between predictive trust asymmetry and diagnostic error imbalance within each epistemic zone. A ZBTS value close to zero reflects epistemic balance, indicating that calibration (i.e., trust in predictions) and misclassification patterns are in alignment.
In contrast, positive ZBTS values signal overconfidence (situations where a model's predictive reliability exceeds its actual error performance), whereas negative values imply underconfidence, denoting conservative diagnostic behavior that yields higher-than-expected accuracy despite low trust indicators. A unique case arises when ZREDI equals ZBES, resulting in ZBTS = 0. This defines the EPEP, which, unlike the EEP (where PPV equals NPV) or the DNP (where FNR equals FPR), represents a convergence of both calibration and error symmetry. When EPEP occurs, often near the TZ, it marks a zone of minimal epistemic tension and optimal diagnostic harmony. In our simulation, distinct Epistemic Tension States (ETS) were observed across zones. The TZ showed balanced calibration and error (ZREDI ≈ ZBES ≈ 0), resulting in ZBTS = 0 and diagnostic neutrality. The RZ, which produced no positive predictions, had an undefined ZREDI but a maximally defined ZBES of +1.0, resulting in a low ZBTS value indicative of dysfunctional symmetry, where apparent predictive neutrality coexists with unidirectional diagnostic failure. The IZ showed perfect PPV but undefined NPV and ZBES due to absent negative predictions, implying a high-ZBTS state and overconfidence. These results illustrate how ZBTS identifies zones of latent epistemic instability, even when core metrics are not computable. At the global level, an analogous measure of epistemic divergence can be expressed as the difference between the Global Divergence Index (GDI = PPV − NPV) and the Prediction Symmetry Index (PSI = FNR − FPR). In our real data example, global GDI was 0.196 and PSI was 0.09, resulting in a delta (Δ_Global) of 0.106. This modest but non-negligible gap reflects global epistemic tension, suggesting that predictive reliability and misclassification are not perfectly aligned. However, global deltas can obscure zone-specific failures.
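The tension score and the global delta reported above follow the same subtraction, which the sketch below makes explicit. The function name `zbts` is hypothetical. The text interprets zones with an undefined ZREDI or ZBES qualitatively; this sketch instead returns `None` for them rather than assigning a number, which is an assumption of the example.

```python
from typing import Optional


def zbts(zredi: Optional[float], zbes: Optional[float]) -> Optional[float]:
    """Tension score ZREDI - ZBES; None when either input is undefined."""
    if zredi is None or zbes is None:
        return None
    return zredi - zbes


# Global values quoted in the text: PPV - NPV = 0.196 and FNR - FPR = 0.09.
delta_global = zbts(0.196, 0.09)
print(round(delta_global, 3))

# A zone with no positive predictions has no PPV, hence no numeric ZREDI.
undefined_zone = zbts(None, 1.0)
print(undefined_zone)
```

Rounding to three decimals recovers the reported delta of 0.106; the unrounded float differs only by representation error.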
ZBTS, by contrast, reveals where calibration outpaces, or falls behind, actual performance, providing a localized map of epistemic fragility that supports more granular threshold tuning and interpretive safety. Furthermore, while PSI summarizes directional misclassification imbalance across the entire dataset, it is agnostic to local context. ZBES serves as its stratified analogue, revealing spatial heterogeneity in diagnostic error patterns [1, 2]. ZREDI, ZBES, and ZBTS, taken together, offer a multidimensional lens on model behavior, exposing both global asymmetries and local imbalances. When mapped across diagnostic zones, this triad enables transparency in trust allocation, risk assessment, and deployment planning, particularly in high-stakes clinical settings where epistemic safety is essential.

4.5. Evaluating Diagnostic Utility Across Confidence Zones

In this study, we incorporated Zone-Balanced Error (ZBE) alongside Net Benefit and Expected Diagnostic Cost to provide a comprehensive view of diagnostic utility. Zone-Balanced Error (ZBE) is a class-neutral metric designed to quantify the average misclassification burden, calculated as the mean of the FNR and FPR, within each diagnostic confidence zone. Unlike conventional global metrics such as accuracy, sensitivity, or even global balanced error (BE), which aggregate performance across the entire dataset, ZBE enables localized evaluation by isolating error patterns within epistemically defined strata (e.g., Trusted, Recessive, Inflated). This is particularly critical in real-world clinical settings, where model reliability and risk may vary dramatically across prediction thresholds. ZBE provides actionable insight by identifying zones where total error rates are acceptably low (or, conversely, dangerously high), thereby supporting zone-specific deployment strategies.
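The three utility measures named above can be sketched from zone-level counts. ZBE follows the definition in the text. The net-benefit expression is the standard decision-curve form, TP/n minus FP/n weighted by the odds of the threshold probability; applying it with `n` equal to the zone size is an assumption. The cost function, its per-case normalization, and the default penalty weights are hypothetical, since the text does not specify them.

```python
from typing import Optional


def zone_balanced_error(tp: int, fp: int, tn: int, fn: int) -> Optional[float]:
    """Mean of FNR and FPR within a zone; None if either rate is undefined."""
    positives, negatives = tp + fn, fp + tn
    if positives == 0 or negatives == 0:
        return None
    return (fn / positives + fp / negatives) / 2


def net_benefit(tp: int, fp: int, n: int, threshold: float) -> float:
    """Decision-curve net benefit at a threshold probability strictly between 0 and 1."""
    if not 0 < threshold < 1:
        raise ValueError("threshold must lie strictly between 0 and 1")
    return tp / n - (fp / n) * (threshold / (1 - threshold))


def expected_cost(fp: int, fn: int, n: int,
                  cost_fp: float = 1.0, cost_fn: float = 5.0) -> float:
    """Average penalty per case; the weights are illustrative, not from the study."""
    return (cost_fp * fp + cost_fn * fn) / n


# Hypothetical zone of 100 cases.
counts = dict(tp=40, fp=10, tn=40, fn=10)
zbe_value = zone_balanced_error(**counts)
nb_value = net_benefit(tp=40, fp=10, n=100, threshold=0.2)
cost_value = expected_cost(fp=10, fn=10, n=100)
print(zbe_value, nb_value, cost_value)
```

Weighting false negatives more heavily than false positives mirrors the text's example of a higher cost for FNs; other weightings would change the cost ranking of zones.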
Complementing ZBE, additional zone-stratified utility metrics were applied: zone-level Net Benefit, which quantifies clinical utility under specified decision thresholds; and Expected Diagnostic Cost, which incorporates error penalties (e.g., higher cost for FNs). Together, these metrics form a robust toolkit for evaluating diagnostic models under epistemic uncertainty, allowing stakeholders to audit not just how much or what kind of error a model makes, but where those vulnerabilities occur within the decision landscape.

4.6. Zone-Stratified Evaluation in Real-World Context

Compared to zone-level results, the global diagnostic metrics present an oversimplified and potentially misleading picture of AFP performance. The global ZREDI of +0.190 reflects moderate predictive asymmetry overall, yet it masks critical zonal divergences such as the extreme underconfidence seen in the RZ (ZBTS = −1.026) and the strong overconfidence in the UPZ (ZBTS = +0.341). Similarly, the global ZBES (+0.0667) suggests a modest error imbalance but fails to capture the sharp error skew in zones like Recessive (+0.7983) or Inflated (+0.2000). While the global ZBTS (+0.1233) implies a net overconfident posture, this aggregate hides the diagnostic instability present in lower AFP strata. These discrepancies underscore the importance of zone-stratified analysis in revealing clinically relevant epistemic behaviors that would otherwise be obscured by aggregate statistics. The application of the proposed zone-stratified epistemic framework to real-world AFP data in the context of hepatocellular carcinoma (HCC) revealed diagnostic behaviors that align with, but also refine, prior observations in the literature. In the RZ (< 5 ng/mL), AFP demonstrated a high false-negative burden, consistent with the known limitations of AFP in detecting early-stage or AFP-negative HCC [13, 17]. This underscores the need for supplementary diagnostic strategies in AFP-negative populations.
In the LPZ (5–20 ng/mL), model behavior remained conservative, with underprediction and modest gains in reliability, reflecting transitional epistemic behavior with residual diagnostic ambiguity. The TZ (20–200 ng/mL), situated between the Lower and Upper Plausible strata, emerged as a region of relative epistemic balance, with ZREDI values approaching zero and ZBES showing minimal asymmetry. This suggests a zone of predictive stability, where both positive and negative predictions can be interpreted with greater confidence. However, the Upper Plausible (200–1000 ng/mL) and Inflated (> 1000 ng/mL) zones revealed a shift toward positive bias, characterized by high PPV, decreasing ZBE, and rising ZREDI values. While this reflects improved detection sensitivity, it also introduces the risk of overdiagnosis if interpretive safeguards are not in place [18, 19]. Notably, the lack of full convergence between key epistemic landmarks, such as the EEP and the DNP, in higher zones highlights that diagnostic confidence does not equate to perfect reliability. These findings demonstrate the practical value of zone-based stratification in modeling AFP-based HCC detection: it provides a continuous and interpretable view of reliability dynamics, moving beyond static thresholds to support context-sensitive, risk-adaptive decision-making.

4.7. Comparison with Traditional Metrics

Traditional diagnostic metrics, such as accuracy, sensitivity, specificity, and the area under the receiver operating characteristic curve (AUC), offer high-level summaries of classifier performance but often obscure localized diagnostic weaknesses that are crucial in clinical decision-making. These global indicators aggregate performance across the entire prediction spectrum, either at a single operating threshold (accuracy, sensitivity, specificity) or across all thresholds (AUC), which can mask overdiagnosis in high-certainty zones or underdiagnosis in ambiguous or low-certainty strata [15, 16, 20].
AUC, while widely reported and useful for ranking models, provides limited insight into real-world interpretability, calibration, or localized trustworthiness, especially in imbalanced datasets [2]. In contrast, the zone-stratified framework proposed in this study, including the ZREDI, ZBES, Zone-Balanced Error (ZBE), and the EEP, enables more granular and context-aware evaluation. These metrics anchor diagnostic interpretation within epistemically defined strata (e.g., Recessive, Trusted, Inflated), uncovering asymmetries in predictive reliability and misclassification risk that global metrics tend to overlook. For example, while conventional specificity might appear acceptable, ZBES may reveal directional skew due to excess FNs within a specific diagnostic zone. Likewise, EEP identifies the precise threshold where PPV equals NPV, offering a robust signal of calibration symmetry that is invisible to conventional tools. The F1 score, as noted in the introduction, does not reflect model performance on the negative class and fails to account for variability in predictive reliability across score thresholds or confidence strata. Within zone-stratified analysis, F1 may provide complementary insights, particularly in strata enriched with positive predictions, such as the IZ. However, its interpretive value is limited unless considered alongside zone-aware metrics that capture directional bias, calibration asymmetry, and diagnostic uncertainty. Integrating F1 with metrics like ZREDI, ZBES, and ZBTS ensures a more comprehensive and context-sensitive evaluation of model performance. Together, these zone-based metrics do not replace conventional performance measures but enrich them by localizing interpretability, enhancing threshold sensitivity, and offering epistemic clarity. By aligning performance evaluation with diagnostic reality, where uncertainty and risk vary across thresholds, they support safer, more adaptive deployment of AI models in clinical contexts.

4.8. Global and Clinical Implications of Zone-Stratified Diagnostic Metrics

The integration of zone-stratified diagnostic metrics, namely EEP, ZREDI, ZBES, Zone-Balanced Error (ZBE), and ZBTS, represents a significant advancement in the evaluation and safe deployment of diagnostic models. Unlike traditional metrics such as accuracy or AUC, which provide aggregate summaries, these zone-specific tools allow for fine-grained, context-aware assessments of model reliability, directional bias, and diagnostic fragility across the full spectrum of predictive certainty. Clinically, these metrics anchor performance evaluation within epistemically meaningful strata. The EEP identifies the threshold where PPV and NPV converge, signaling diagnostic neutrality. When this equilibrium occurs within the TZ, it reflects well-calibrated, stable performance suitable for clinical action. If displaced into higher zones, such as the Upper Plausible or Inflated, it may reveal threshold drift, overconfidence, or inflated certainty, necessitating secondary validation or expert oversight. ZREDI complements this by quantifying directional asymmetry in predictive trust: positive values indicate excessive confidence in disease prediction (often seen in IZs), while negative values signal a conservative underdiagnosis tendency (common in RZs). Meanwhile, ZBES captures the skew of misclassification errors, distinguishing whether the model tends to produce FPs or FNs within a given zone. Together, ZREDI and ZBES form the basis of the ZBTS, a composite lens for assessing epistemic misalignment. Zones where ZREDI and ZBES diverge reflect either overconfident or underconfident diagnostic behavior, while their convergence (EPEP) marks optimal alignment between trust and error. From a global systems perspective, these metrics have substantial implications.
They reveal hidden vulnerabilities that conventional global scores may obscure, particularly in settings with limited diagnostic oversight or high variability in risk tolerance. For instance, while global balanced error (BE ≈ 0.396) may seem acceptable, zone-based analysis uncovers that the Recessive and LPZs carry significantly higher error burdens (ZBE = 0.5), suggesting localized diagnostic instability. The TZ, by contrast, exhibits the lowest ZBE (0.479 in simulation; 0.36 in real data), affirming its centrality as a clinically robust stratum. The potential impact extends beyond clinical practice to regulation and policy. Zone-stratified metrics provide interpretable, auditable evidence of model behavior, enabling regulators and institutional stakeholders to move beyond summary statistics and toward zone-aware performance certification. In public health and telemedicine contexts, where confirmatory testing is limited, these metrics offer actionable insights into where algorithmic decisions are stable, fragile, or unsafe. In sum, this framework supports a paradigm shift in diagnostic AI evaluation—from reliance on global metrics to a nuanced, zone-based approach. It fosters epistemic transparency, improves clinical safety, and enables risk-calibrated deployment in both resource-rich and resource-constrained environments. By aligning model assessment with the realities of diagnostic uncertainty, zone-stratified metrics offer a robust foundation for equitable, context-sensitive decision-making at scale. Together, these metrics enable zone-specific evaluation of diagnostic systems across the full spectrum of clinical risk, including diagnostically fragile regions. The framework is not limited to theoretical performance assessment but directly supports threshold refinement, context-aware model auditing, and integration as an evaluation layer within clinical informatics pipelines . 
By embedding epistemic governance into machine learning deployment, the approach enhances transparency, safety, and trust in high-stakes clinical decision support, thereby aligning with the broader goals of digital health and translational informatics.

4.9. Methodological Strengths and Limitations

This study’s principal methodological strength lies in its shift from conventional global performance metrics toward a zone-stratified evaluation framework that explicitly integrates epistemic context. By introducing and operationalizing metrics such as the EEP, ZREDI, ZBES, and Zone-Balanced Error (ZBE), the framework allows for both aggregate and localized assessments of diagnostic reliability. This layered approach exposes patterns of directional bias, misclassification asymmetry, and predictive fragility that are often obscured by summary measures like AUC, sensitivity, or specificity. As such, it supports more nuanced and clinically grounded interpretations of model behavior, particularly in scenarios where decision confidence varies across the predictive spectrum. An additional strength is the framework’s comparative flexibility. It enables cross-model benchmarking by tracking variations in the EEP location, revealing how different classifiers distribute predictive balance across zones of low, moderate, or high certainty. This capability facilitates architectural audits, encouraging informed threshold selection and context-sensitive calibration. Furthermore, the successful application of the framework to both simulated and real-world diagnostic datasets underscores its scalability and relevance across diverse stages of clinical deployment, from early triage to confirmatory diagnostics. However, several limitations merit attention. The framework depends on a consistent zoning structure, typically defined by RAMI or equivalent stratification methods.
These zone definitions may not generalize across populations or diseases without recalibration, limiting transferability without prior domain-specific validation. Additionally, while the approach excels in binary classification contexts, its extension to multiclass or probabilistic output models remains an area for future development. Integrating these zone-level outputs into clinical workflows also poses practical challenges, particularly under real-time or resource-constrained conditions where interpretability, speed, and interface design are critical. Despite these limitations, the proposed zone-stratified evaluation framework represents a meaningful methodological advance. It reconceptualizes model assessment through the lens of epistemic credibility, offering a structured yet interpretable toolkit for uncovering hidden vulnerabilities and optimizing diagnostic decision-making. By aligning performance evaluation with zones of trust and uncertainty, this approach enhances clinical safety and fosters a more accountable and context-aware use of machine learning in healthcare.

5. Conclusions and Recommendations

This study proposes a zone-stratified framework for diagnostic evaluation, centered on four novel metrics: the EEP, ZREDI, ZBES, and Zone-Balanced Error (ZBE). Unlike conventional global metrics, these indicators provide localized, interpretable insights into model behavior, capturing directional bias, misclassification asymmetry, and epistemic instability across diagnostic confidence zones. Findings from both simulated and real-world examples underscore the diagnostic value of the TZ, where predictive reliability, calibration, and error symmetry converge. In contrast, zones such as Recessive and Inflated revealed heightened epistemic tension, reflecting areas of underdiagnosis, overconfidence, or structural fragility. These distinctions emphasize the importance of zone-specific auditing to mitigate localized risks that global metrics may overlook.
We recommend integrating zone-stratified metrics into model development, validation, and regulation, especially in high-stakes clinical contexts. EEP can guide threshold calibration, ZREDI can detect predictive imbalance, and ZBES/ZBE can pinpoint diagnostic instability. This framework also holds promise for enhancing regulatory transparency and supporting adaptive AI systems that monitor for epistemic drift. Future research should focus on expanding the framework to multi-class and continuous-risk models, and on validating its utility in prospective clinical settings. As AI increasingly informs diagnostic decisions, zone-based evaluation offers a critical safeguard, promoting trust, safety, and interpretive clarity in real-world deployment.

Abbreviations

DNP: Diagnostic Neutrality Point
EEP: Epistemic Equivalence Point
EPEP: Epistemic Pressure Equilibrium Point
ETS: Epistemic Tension State
GDI: Global Divergence Index
IZ: Inflated Zone
LPZ: Lower Plausible Zone
NPV: Negative Predictive Value
PPV: Positive Predictive Value
PSI: Prediction Symmetry Index
ROC: Receiver Operating Characteristic
RZ: Recessive Zone
TZ: Trusted Zone
UPZ: Upper Plausible Zone
ZBES: Zone-Based Balanced Error Skew
ZBE: Zone-Balanced Error
ZBTS: Zone-Based Tension State
ZREDI: Zone-Relevant Epistemic Divergence Index

Declarations

Ethics approval and consent to participate: This study did not involve direct experimentation on human subjects. All real-world diagnostic datasets used were fully de-identified and publicly available, and no personal or clinical identifiers were accessed or processed. Therefore, institutional ethical approval and informed consent were not required. The study was conducted in accordance with relevant guidelines for research integrity and data privacy.

Consent for publication: Not applicable.

Availability of data and materials: The datasets used for illustrative purposes are available from the corresponding author upon reasonable request.
Competing interests: The author declares no competing interests.

Funding: This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.

Authors' contributions: T.F.R., as the sole author, conceived the study, developed the methodology, performed the analyses, prepared all figures and table, and wrote the manuscript.

Acknowledgment: The author used ChatGPT (OpenAI) to assist with language editing and phrasing refinement during manuscript preparation. All conceptual and analytical content was developed by the author.

Declaration of Originality, Intellectual Ownership, Authorship, and Future Rights

The author, Dr. Tareef Fadhil Raham, affirms that this manuscript is an original work entirely conceived, developed, and authored by the undersigned. All intellectual contributions, including the design and formalization of the zone-stratified diagnostic framework, the derivation of the related metrics such as Epistemic Equivalence Point (EEP), Zone-Relevant Epistemic Divergence Index (ZREDI), Zone-Based Balanced Error Skew (ZBES), Zone-Based Tension State (ZBTS) and Zone-Balanced Error Index (ZBE), as well as the simulation modeling, data analysis, and interpretation, were solely the product of the author’s scholarly effort. This work has not been published, submitted, or disseminated elsewhere and contains no material copied from external sources without proper attribution. No co-authorship, ghostwriting, or collaborative authorship applies to any portion of this research or manuscript. The author retains full intellectual ownership over the concepts, methodologies, and metrics introduced. These contributions are part of an ongoing program of original research and may serve as the basis for future software tools, statistical packages, or decision-support systems.
Accordingly, the author reserves the right to pursue licensing, academic dissemination, or intellectual property protection (e.g., through copyright, registration, or algorithmic patents) in accordance with institutional and jurisdictional policies.

Declaration of generative AI and AI-assisted technologies in the writing process

During the preparation of this work, the author(s) used ChatGPT (OpenAI) to improve the clarity and readability of certain sections of the manuscript. After using this tool, the author(s) reviewed and edited the content as needed and take full responsibility for the content of the published article.

References

1. Japkowicz N, Shah M. Evaluating Learning Algorithms: A Classification Perspective. Cambridge University Press; 2011. pp. 113–121, 127.
2. Saito T, Rehmsmeier M. The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS One. 2015 Mar 4;10(3):e0118432. doi: 10.1371/journal.pone.0118432. PMID: 25738806; PMCID: PMC4349800.
3. Hand DJ. Evaluating diagnostic tests: The area under the ROC curve and the balance of errors. Stat Med. 2010;29(14):1502-10. doi: 10.1002/sim.3859. PMID: 20087877.
4. Kuhn M, Johnson K. Applied Predictive Modeling. 1st ed. New York: Springer; 2013. pp. 259–266, 281–283.
5. Peirce JC, Cornell RG. Integrating stratum-specific likelihood ratios with the analysis of ROC curves. Med Decis Making. 1993 Apr-Jun;13(2):141-51. doi: 10.1177/0272989X9301300208. PMID: 8483399.
6. Leeflang MM, Rutjes AW, Reitsma JB, Hooft L, Bossuyt PM. Variation of a test's sensitivity and specificity with disease prevalence. CMAJ. 2013 Aug 6;185(11):E537-44. doi: 10.1503/cmaj.121286. Epub 2013 Jun 24. PMID: 23798453; PMCID: PMC3735771.
7. Akobeng AK. Understanding diagnostic tests 2: likelihood ratios, pre- and post-test probabilities and their use in clinical practice. Acta Paediatr. 2007;96(4):487-91. doi: 10.1111/j.1651-2227.2006.00179.x. Epub 2007 Feb 14. PMID: 17306009.
8. Grimes DA, Schulz KF. Uses and abuses of screening tests. Lancet. 2002;359(9309):881-4. doi: 10.1016/S0140-6736(02)07948-5.
9. Altman DG, Bland JM. Diagnostic tests 2: Predictive values. BMJ. 1994 Jul 9;309(6947):102. doi: 10.1136/bmj.309.6947.102. PMID: 8038641; PMCID: PMC2540558.
10. Elkan C. The foundations of cost-sensitive learning. Proceedings of the 17th International Joint Conference on Artificial Intelligence (IJCAI). 2001:973–978.
11. Raham TF. From global accuracy to local credibility: zone-based ROC and DET frameworks for conservative, transparent, and clinically credible evaluation. Manuscript under editorial consideration.
12. Raham TF. From Inferential Statistics to Epistemic Credibility: A Zone-Based Framework for Conservative Estimation. Under review.
13. Jang ES, Jeong SH, Kim JW, Choi YS, Leissner P, Brechot C. Diagnostic Performance of Alpha-Fetoprotein, Protein Induced by Vitamin K Absence, Osteopontin, Dickkopf-1 and Its Combinations for Hepatocellular Carcinoma. PLoS One. 2016 Mar 17;11(3):e0151069. doi: 10.1371/journal.pone.0151069. PMID: 26986465; PMCID: PMC4795737.
14. Jang ES, Jeong SH, Kim JW, Choi YS, Leissner P, Brechot C. (2016). Data from: Diagnostic performance of alpha-fetoprotein, protein induced by vitamin K absence, osteopontin, Dickkopf-1 and its combinations for hepatocellular carcinoma [Dataset]. Dryad. https://doi.org/10.5061/dryad.3n901
15. Chicco D, Jurman G. The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genomics. 2020;21(1):6. doi: 10.1186/s12864-019-6413-7. PMID: 31898477; PMCID: PMC6941312.
16. Vickers AJ, Elkin EB. Decision curve analysis: a novel method for evaluating prediction models. Med Decis Making. 2006 Nov-Dec;26(6):565-74. doi: 10.1177/0272989X06295361. PMID: 17099194; PMCID: PMC2577036.
17. European Association for the Study of the Liver. EASL Clinical Practice Guidelines: Management of hepatocellular carcinoma. J Hepatol. 2018;69(1):182-236. doi: 10.1016/j.jhep.2018.03.019. Epub 2018 Apr 5. Erratum in: J Hepatol. 2019 Apr;70(4):817. doi: 10.1016/j.jhep.2019.01.020. PMID: 29628281.
18. Marrero JA, Feng Z, Wang Y, Nguyen MH, Befeler AS, Roberts LR, et al. Alpha-fetoprotein, des-gamma carboxyprothrombin, and lectin-bound alpha-fetoprotein in early hepatocellular carcinoma. Gastroenterology. 2009;137(1):110-8. doi: 10.1053/j.gastro.2009.04.005. Epub 2009 Apr 9. PMID: 19362088; PMCID: PMC2704256.
19. Trevisani F, D'Intino PE, Morselli-Labate AM, Mazzella G, Accogli E, Caraceni P, Domenicali M, De Notariis S, Roda E, Bernardi M. Serum alpha-fetoprotein for diagnosis of hepatocellular carcinoma in patients with chronic liver disease: influence of HBsAg and anti-HCV status. J Hepatol. 2001 Apr;34(4):570-5. doi: 10.1016/s0168-8278(00)00053-2. PMID: 11394657.
20. Habibzadeh F. Diagnostic tests performance indices: an overview. Biochem Med (Zagreb). 2025 Feb 15;35(1):010101. doi: 10.11613/BM.2025.010101. PMID: 39974192; PMCID: PMC11838712.

Additional Declarations: No competing interests reported.

Supplementary Files: 8appendix8.docx
{"props":{"pageProps":{"initialData":{"identity":"rs-7539984","acceptedTermsAndConditions":true,"allowDirectSubmit":true,"archivedVersions":[],"articleType":"Research Article","associatedPublications":[],"authors":[{"id":547768697,"identity":"d731809f-5a1e-43bb-ac46-3e881b235f67","order_by":0,"name":"Tareef Fadhil Raham","email":"","orcid":"","institution":"University of Warith Alanbiyaa","correspondingAuthor":true,"prefix":"","firstName":"Tareef","middleName":"Fadhil","lastName":"Raham","suffix":""}],"badges":[],"createdAt":"2025-09-05 02:23:14","currentVersionCode":1,"declarations":"","doi":"10.21203/rs.3.rs-7539984/v1","doiUrl":"https://doi.org/10.21203/rs.3.rs-7539984/v1","draftVersion":[],"editorialEvents":[],"editorialNote":"","failedWorkflow":false,"files":[{"id":96468430,"identity":"17637007-7ba1-4693-b3bd-e13a36f6986c","added_by":"auto","created_at":"2025-11-21 11:51:27","extension":"png","order_by":1,"title":"Figure 1","display":"","copyAsset":false,"role":"figure","size":110022,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003ePredictive Reliability (PPV and NPV) Across Epistemic Diagnostic Zones\u003c/strong\u003e\u003cbr\u003e\n \u003cem\u003eThis figure illustrates the variation in 
diagnostic predictive reliability across five epistemic strata: Recessive\u003c/em\u003e, \u003cem\u003eLower Plausible\u003c/em\u003e, \u003cem\u003eTrusted\u003c/em\u003e, \u003cem\u003eUpper Plausible, and Inflated. Positive Predictive Value (PPV) and Negative Predictive Value (NPV) are plotted to capture asymmetries in post-test interpretability. The Trusted Zone marks the \u003c/em\u003e\u003cem\u003e\u003cstrong\u003eEpistemic Equivalence Point (EEP)\u003c/strong\u003e\u003c/em\u003e\u003cem\u003e, where PPV approximates NPV, signaling diagnostic balance and maximal interpretive symmetry. Departures from this convergence across other zones highlight regions of epistemic uncertainty (e.g., Recessive) or overconfidence (e.g., Inflated), informing zone-specific reliability assessments.\u003c/em\u003e\u003c/p\u003e","description":"","filename":"image1.png","url":"https://assets-eu.researchsquare.com/files/rs-7539984/v1/13f01ef7dedde33fd1d9153e.png"},{"id":96468429,"identity":"867d0497-8fe0-41dd-8498-ac770c3b8042","added_by":"auto","created_at":"2025-11-21 11:51:27","extension":"png","order_by":2,"title":"Figure 2","display":"","copyAsset":false,"role":"figure","size":122267,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eZone-Stratified Epistemic Metrics: ZREDI, ZBES, and ZBTS\u003c/strong\u003e\u003cbr\u003e\n \u003cem\u003eThis figure illustrates zone-specific values for ZREDI (Zone-Relative Epistemic Divergence Index), ZBES (Zone-Based Error Skew), and ZBTS (Zone-Based Tension State) derived from stratified diagnostic data. In the Trusted Zone, three key epistemic landmarks co-occur: the Epistemic Equivalence Point (EEP), where PPV approximates NPV (ZREDI near zero); the Diagnostic Neutrality Point (DNP), where ZBES equals zero; and the Epistemic Pressure Equilibrium Point (EPEP), where ZREDI equals ZBES. 
This convergence marks a region of minimal diagnostic tension and optimal epistemic balance. In contrast, the Recessive and Lower Plausible zones exhibit diagnostic vacuity (ZREDI not computable, ZBES = +1), while Upper\u003c/em\u003e\u003cem\u003e\u003cstrong\u003e \u003c/strong\u003e\u003c/em\u003e\u003cem\u003ePlausible and Inflated zones reflect overdiagnosis (ZBES ≤ –1), underscoring directional misclassification and overconfident predictions. This emphasizes the unique stability of the Trusted Zone as a diagnostic anchor point.\u003c/em\u003e\u003c/p\u003e","description":"","filename":"image2.png","url":"https://assets-eu.researchsquare.com/files/rs-7539984/v1/444e90c39b7dde3128d58b0f.png"},{"id":96603786,"identity":"894a708f-fd14-4c88-8985-e3e152a8127c","added_by":"auto","created_at":"2025-11-24 09:11:31","extension":"png","order_by":3,"title":"Figure 3","display":"","copyAsset":false,"role":"figure","size":84387,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eZone-Stratified Clinical Utility Metrics for simulated example \u003c/strong\u003e\u003cbr\u003e\n \u003cem\u003eThis figure visualizes Net Benefit (NB), Expected Diagnostic Cost (EDC), and Zone-Balanced Error (ZBE) across RAMI-defined diagnostic zones. Net Benefit is calculated at a decision threshold of 0.5, with diagnostic cost assuming a false negative to false positive (FN:FP) penalty ratio of 2:1. ZBE reflects the average misclassification burden using FNR and FPR. 
The Trusted Zone shows the lowest error burden (ZBE = 0.479), while net clinical benefit peaks in the Upper Plausible and Inflated Zones.\u003c/em\u003e\u003c/p\u003e","description":"","filename":"image3.png","url":"https://assets-eu.researchsquare.com/files/rs-7539984/v1/9a337324331a6c333711599e.png"},{"id":96468433,"identity":"62e4dc2c-d824-438c-801f-9b9e687a08ec","added_by":"auto","created_at":"2025-11-21 11:51:27","extension":"png","order_by":4,"title":"Figure 4","display":"","copyAsset":false,"role":"figure","size":175654,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eZone-Stratified Decision Curve Analysis \u003c/strong\u003e\u003cem\u003eThis figure presents the net clinical benefit across a range of threshold probabilities for each diagnostic confidence zone in Example 1. The Trusted Zone demonstrates the highest net benefit over most thresholds, indicating its superior clinical utility compared to other zones within the RAMI-stratified framework.\u003c/em\u003e\u003c/p\u003e","description":"","filename":"image4.png","url":"https://assets-eu.researchsquare.com/files/rs-7539984/v1/3e59baa2d29f4aef9a486c99.png"},{"id":96603580,"identity":"82caef1f-ce66-4c04-87f1-f8bf79ea0fc8","added_by":"auto","created_at":"2025-11-24 09:10:20","extension":"png","order_by":5,"title":"Figure 5","display":"","copyAsset":false,"role":"figure","size":143735,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eEpistemic Diagnostic Landmarks Along the AFP Spectrum\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003e\u003cem\u003eZREDI and ZBES values plotted across stratified AFP concentration zones, with corresponding epistemic zone annotations. ZREDI (PPV − NPV) reflects the net predictive confidence for HCC diagnosis, while ZBES (FNR − FPR) quantifies diagnostic risk balance between missed cases and false alarms. 
The curve illustrates a transition from uncertainty at low AFP levels (\u0026lt;5 ng/mL) to high diagnostic certainty at elevated levels (\u0026gt;1000 ng/mL). The Trusted Zone (200–1000 ng/mL) shows optimal balance, whereas the Recessive Zone exhibits high epistemic uncertainty and error burden.\u003c/em\u003e\u003c/p\u003e","description":"","filename":"image5.png","url":"https://assets-eu.researchsquare.com/files/rs-7539984/v1/5d3f91f4a3caf9f3cfaa631b.png"},{"id":96468435,"identity":"098643da-c53c-4cc8-a09c-00ba360789db","added_by":"auto","created_at":"2025-11-21 11:51:27","extension":"png","order_by":6,"title":"Figure 6","display":"","copyAsset":false,"role":"figure","size":144963,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eZone-Based Tension State (ZBTS) Across AFP Diagnostic Zones\u003c/strong\u003e\u003cbr\u003e\n \u003cem\u003eThis plot illustrates ZBTS values—calculated as the difference between ZREDI and ZBES—for alpha-fetoprotein (AFP) across five diagnostic confidence zones. Negative ZBTS values in the \u0026lt;5 and 5–20 ng/mL zones indicate underconfidence, while positive values in the 20–200, 200–1000, and \u0026gt;1000 ng/mL zones reflect overconfidence. 
These trends reveal shifting alignment between predictive reliability and error structure across AFP thresholds\u003c/em\u003e.\u003c/p\u003e","description":"","filename":"image6.png","url":"https://assets-eu.researchsquare.com/files/rs-7539984/v1/7d1969428676d2a5833ba909.png"},{"id":96603135,"identity":"79e09536-4473-45e1-ae66-1b5aee39ab46","added_by":"auto","created_at":"2025-11-24 09:07:07","extension":"png","order_by":7,"title":"Figure 7","display":"","copyAsset":false,"role":"figure","size":166066,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eZone-Based Clinical Utility Evaluation Curves for AFP\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003e\u003cem\u003eThis figure illustrates zone-stratified curves for Net Benefit (from Decision Curve Analysis), Expected Diagnostic Cost (using a FN:FP cost ratio of 2:1), and Zone-Balanced Error (ZBE) across alpha-fetoprotein (AFP) concentration strata. Net Benefit peaks in the 200–1000 ng/mL range, while both cost and ZBE decline progressively with increasing AFP. 
This pattern highlights the clinical utility of higher AFP zones, particularly in supporting accurate and efficient hepatocellular carcinoma (HCC) screening within a high-risk population.\u003c/em\u003e\u003c/p\u003e","description":"","filename":"image7.png","url":"https://assets-eu.researchsquare.com/files/rs-7539984/v1/4d14e7fceb3c1435d8846e0a.png"},{"id":96708188,"identity":"df98a331-47b8-428c-b93d-044c378e193b","added_by":"auto","created_at":"2025-11-25 09:58:56","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":2784114,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-7539984/v1/a9563cd0-e5ae-465f-846c-e45799c411f2.pdf"},{"id":96602798,"identity":"1a3f99e8-413c-4fe7-b7df-f6fdb0566d9e","added_by":"auto","created_at":"2025-11-24 09:01:58","extension":"docx","order_by":0,"title":"","display":"","copyAsset":false,"role":"supplement","size":38315,"visible":true,"origin":"","legend":"","description":"","filename":"8appendix8.docx","url":"https://assets-eu.researchsquare.com/files/rs-7539984/v1/24be474230468353b812423a.docx"}],"financialInterests":"No competing interests reported.","formattedTitle":"Beyond Global Metrics: A Zone-Stratified Diagnostic Framework Based on Confusion Matrix Components","fulltext":[{"header":"1. Introduction","content":"\u003cp\u003eDiagnostic models are increasingly used to support clinical decision-making, risk stratification, and early disease detection. Traditional evaluation metrics\u0026mdash;such as accuracy, sensitivity, and specificity\u0026mdash;offer global assessments of model performance. 
However, these metrics often fail to capture local reliability, especially in diagnostically uncertain regions or when data are imbalanced [1, 2].\u003c/p\u003e\u003cp\u003eAlthough widely used, ROC curves and AUC values summarize performance globally and may obscure important diagnostic failures within specific regions of prediction confidence\u0026mdash;even in more localized approaches such as Zone-Restricted ROC (ZR-ROC) [\u003cspan additionalcitationids=\"CR4\" citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e].\u003c/p\u003e\u003cp\u003eThese metrics treat all errors equally, irrespective of whether they occur in high-certainty regions or in transitional, borderline zones\u0026mdash;where misclassifications may carry greater clinical consequences.\u003c/p\u003e\u003cp\u003eAlthough the F1 score is widely used for evaluating binary classifiers, it reflects performance only on the positive class, because it is the harmonic mean of precision (PPV) and recall (sensitivity) [2]. This property makes it particularly useful in class-imbalanced datasets, where accurate positive detection is prioritized. However, the F1 score does not reflect the model\u0026rsquo;s performance on the negative class, nor does it incorporate the variability of predictive reliability across different score thresholds or confidence strata. As such, it overlooks error distribution across other regions of the prediction space, particularly in zones of diagnostic uncertainty [\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e]. 
Consequently, it provides a limited\u0026mdash;and potentially misleading\u0026mdash;view of diagnostic robustness in zone-based clinical decision contexts, particularly where false positives (FPs) and false negatives (FNs) carry asymmetric clinical consequences [\u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e, \u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e].\u003c/p\u003e\u003cp\u003eWhile scalar measures such as the Predictive Summary Index (PSI)\u0026mdash;defined as \u003cb\u003ePPV\u0026thinsp;+\u0026thinsp;NPV\u0026thinsp;\u0026minus;\u0026thinsp;1\u003c/b\u003e\u0026mdash;have been proposed to summarize the overall post-test utility of diagnostic models, their interpretive value remains limited in high-stakes clinical settings. PSI ranges from \u003cb\u003e\u0026minus;\u0026thinsp;1\u003c/b\u003e (completely misleading test) to \u003cb\u003e+\u0026thinsp;1\u003c/b\u003e (perfect prediction), providing a concise, user-facing summary of predictive usefulness [\u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e6\u003c/span\u003e].\u003c/p\u003e\u003cp\u003eHowever, despite its intuitive appeal, PSI suffers from several limitations: it ignores class-specific misclassification consequences, fails to account for prediction confidence stratification, and provides no insight into where within the prediction spectrum failures occur. 
In particular, because PSI functions as a global average, it cannot identify where in the diagnostic score range misclassification risk is concentrated\u0026mdash;such as in boundary zones of epistemic uncertainty [\u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e, \u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e]. As such, it may obscure zone-specific fragility, particularly in contexts where false reassurance or overtreatment can lead to significant harm.\u003c/p\u003e\u003cp\u003eFurthermore, PSI is highly sensitive to prevalence, making it unreliable across heterogeneous populations or disease distributions [\u003cspan additionalcitationids=\"CR8\" citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e].\u003c/p\u003e\u003cp\u003eIn addition, the PSI provides no insight into the directionality of prediction errors\u0026mdash;namely, whether a model tends to systematically overpredict or underpredict a condition. Nor does it capture localized collapses in reliability across the diagnostic spectrum. These limitations render PSI insufficient for nuanced interpretability and safety governance in clinical AI, particularly in settings where error asymmetry and zone-specific trust calibration are critical. 
As a result, there is a growing need for zone-aware evaluation metrics that can characterize directional reliability and misclassification patterns across stratified regions of epistemic uncertainty [\u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e, \u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e, \u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e10\u003c/span\u003e].\u003c/p\u003e\u003cp\u003eIn response to these limitations, recent approaches such as \u003cb\u003eZR-ROC\u003c/b\u003e analysis have been developed to evaluate model performance within epistemically stratified regions\u0026mdash;such as zones defined by predicted score distributions, likelihood intervals, or adjusted means [\u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e]. These methods enhance diagnostic transparency by isolating performance in areas of increased uncertainty. However, they remain focused on traditional discriminative metrics (e.g., true positive rate vs. false positive rate) and do not capture the \u003cb\u003edirectional reliability\u003c/b\u003e of predictions\u0026mdash;such as the extent to which a model confidently rules in versus rules out a condition within a given zone.\u003c/p\u003e\u003cp\u003eTo address these gaps, we introduce a complementary framework of zone-stratified reliability metrics that emphasize both \u003cb\u003epredictive trust\u003c/b\u003e and \u003cb\u003edirectional diagnostic bias\u003c/b\u003e within epistemically defined regions.\u003c/p\u003e\u003cp\u003eThe framework comprises a suite of \u003cb\u003ezone-stratified diagnostic metrics\u003c/b\u003e designed to assess prediction reliability, calibration balance, and error dynamics across \u003cb\u003eepistemically meaningful strata\u003c/b\u003e of diagnostic confidence. These metrics move beyond global performance indicators such as AUC or accuracy, offering localized insight into model behavior in clinically sensitive regions. 
The core components of the framework include:\u003c/p\u003e\u003cp\u003e\u003col\u003e\u003cspan\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eZone-Relative Epistemic Divergence Index (ZREDI)\u003c/b\u003e: Quantifies \u003cb\u003ecalibration asymmetry\u003c/b\u003e by measuring the divergence between PPV and NPV within each zone, highlighting \u003cb\u003etrust imbalance\u003c/b\u003e between outcome classes.\u003c/p\u003e\u003c/li\u003e\u003c/span\u003e\u003cspan\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eZone-Based Error Skew (ZBES)\u003c/b\u003e: Captures the \u003cb\u003edirectionality of misclassification\u003c/b\u003e, defined as the difference between the false negative rate (FNR) and false positive rate (FPR). It provides a signed measure of diagnostic bias and complements value-based metrics like ZREDI.\u003c/p\u003e\u003c/li\u003e\u003c/span\u003e\u003cspan\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eZone-Based Tension State (ZBTS)\u003c/b\u003e: A composite interpretive construct defined as \u003cb\u003eZREDI\u0026thinsp;\u0026minus;\u0026thinsp;ZBES\u003c/b\u003e, capturing the \u003cb\u003eepistemic tension\u003c/b\u003e between predictive asymmetry and error skew. ZBTS helps identify zones where reliability and misclassification are \u003cb\u003eincongruent or paradoxically aligned\u003c/b\u003e.\u003c/p\u003e\u003c/li\u003e\u003c/span\u003e\u003c/ol\u003e\u003c/p\u003e\u003cp\u003eTogether, these metrics enable \u003cb\u003ezone-specific evaluation\u003c/b\u003e of diagnostic systems across the full spectrum of clinical risk, including diagnostic zones. This framework supports \u003cb\u003ethreshold refinement\u003c/b\u003e, \u003cb\u003econtext-aware model evaluation\u003c/b\u003e, and \u003cb\u003eepistemically governed deployment\u003c/b\u003e of machine learning models in high-stakes clinical settings.\u003c/p\u003e"},{"header":"2. Methods and materials:","content":"\u003cdiv id=\"Sec3\" class=\"Section2\"\u003e\u003ch2\u003e2.1. 
Diagnostic Zone Stratification\u003c/h2\u003e\u003cp\u003eThe dataset was stratified into diagnostic confidence zones using the RAMI framework, which partitions model predictions based on their standardized distance from an adjusted mean score. This standardized distance functions analogously to the standard deviation (SD) in classical statistics, representing how far individual predictive scores deviate from a central reference point, thereby quantifying the degree of epistemic divergence from calibrated expectation. In our epistemic framework, this robustly adjusted central reference point is termed the \u003cb\u003eCredence Value (CV)\u003c/b\u003e\u0026mdash;a stable proxy for the adjusted mean, around which diagnostic confidence and epistemic stratification are systematically evaluated. Within the RAMI framework, a unit RAMI distance corresponds approximately to the distance between the adjusted mean and \u0026plusmn;\u0026thinsp;1 SD, offering a normalized scale for assessing predictive certainty. This framework functions as an epistemic analogue to classical confidence intervals, but instead of merely bounding statistical estimates, it reflects the operational reality of model trust, calibration integrity, and interpretive stability across predictive strata.\u003c/p\u003e\u003cp\u003eEach sample was classified into one of five epistemic zones: the \u003cb\u003eRecessive Zone (RZ)\u003c/b\u003e, \u003cb\u003eLower Plausible Zone (LPZ)\u003c/b\u003e, \u003cb\u003eTrusted Zone (TZ)\u003c/b\u003e, \u003cb\u003eUpper Plausible Zone (UPZ)\u003c/b\u003e, \u003cb\u003eor Inflated Zone (IZ)\u003c/b\u003e. 
These zones reflect increasing degrees of epistemic uncertainty, with the TZ representing predictions made with the highest level of diagnostic confidence and minimal asymmetry.\u003c/p\u003e\u003cp\u003eThe Lower and Upper Plausible Zones serve as transitional regions on either side of the TZ, where model predictions become less stable and the risk of misclassification begins to rise. At the extremes, the RZ is typically dominated by TN and may correspond to underdiagnosis, while the \u003cem\u003eIZ\u003c/em\u003e is dominated by TP and may reflect overdiagnosis or excessive certainty. This stratification provides a foundation for zone-specific performance analysis, emphasizing local diagnostic reliability rather than global accuracy alone.\u003c/p\u003e\u003cp\u003eThe RAMI-based stratification framework is currently under refinement[\u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e12\u003c/span\u003e]. All intellectual property related to its structure, thresholds, and terminology is retained by the author.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec4\" class=\"Section2\"\u003e\u003ch2\u003e2.2. Confusion Matrix Elements per Zone and Metric Computation Workflow\u003c/h2\u003e\u003cp\u003eThe computation of diagnostic metrics was conducted on a zone-specific basis using the RAMI-stratified dataset. Each observation was assigned to one of five diagnostic confidence zones\u0026mdash;Recessive, Lower Plausible, Trusted, Upper Plausible, or Inflated\u0026mdash;based on RAMI values. Within each zone, the core confusion matrix components were computed: true positives (TPs), true negatives (TNs), FPs, and FNs. TP referred to correctly identified diseased cases, TN to correctly identified non-diseased cases, FP to non-diseased cases incorrectly classified as diseased, and FN to diseased cases missed by the model. 
These foundational counts enabled the calculation of both traditional performance metrics (e.g., sensitivity, specificity, PPV, NPV) and the novel zone-stratified metrics introduced in this study\u0026mdash;namely, ZREDI, ZBES, ZBE, and ZBTS. This localized approach allowed precise characterization of model behavior across varying levels of diagnostic confidence, offering granular insights into reliability, misclassification asymmetry, and epistemic stability.\u003c/p\u003e\u003cp\u003e\u003cb\u003e2.3. Algorithmic Pipeline for RAMI-Stratified Epistemic Metric Extraction: Multistage Computation of ZREDI, Global Divergence Index (GDI), ZBES, ZBE, Epistemic Equivalence Point (EEP), DNP, EPEP, and ZBTS\u003c/b\u003e\u003c/p\u003e\u003cp\u003e\u003cstrong\u003eStep 1\u003c/strong\u003e\u003cp\u003eStratify the dataset into diagnostic confidence zones \u0026mdash; Recessive, Lower Plausible (LPZ), Trusted, Upper Plausible (UPZ), and Inflated \u0026mdash; using standard RAMI-derived thresholds (\u0026plusmn;\u0026thinsp;1 SE and \u0026plusmn;\u0026thinsp;2 SE).\u003c/p\u003e\u003c/p\u003e\u003cp\u003eAlternatively, users may define desired zone boundaries to align with specific diagnostic cut-offs, regulatory guidelines, or study protocols. 
This optional flexibility allows the framework to adapt to study-specific diagnostic thresholds, rare-event sampling considerations, and validation protocols that require non-standard confidence intervals.\u003c/p\u003e\u003cp\u003eWithin the Recessive Zone (RZ) and Inflated Zone (IZ), values that extend beyond the 1st Recessive Value (1st RV) or 1st Inflated Value (1st IV) remain part of RZ/IZ.\u003c/p\u003e\u003cp\u003e\u003cstrong\u003eStep 2\u003c/strong\u003e\u003cp\u003eFor each zone, calculate the confusion matrix counts:\u003c/p\u003e\u003c/p\u003e\u003cp\u003eTPz, TNz, FPz, FNz\u003c/p\u003e\u003cp\u003e\u003cstrong\u003eStep 3\u003c/strong\u003e\u003cp\u003eThese counts were used to derive the Zone-Relative Epistemic Divergence Index (ZREDI), calculated as the signed difference between PPVz and NPVz:\u003c/p\u003e\u003c/p\u003e\u003cp\u003eZREDI = (TPz / [TPz\u0026thinsp;+\u0026thinsp;FPz]) \u0026minus; (TNz / [TNz\u0026thinsp;+\u0026thinsp;FNz]), i.e., ZREDI\u0026thinsp;=\u0026thinsp;PPVz\u0026thinsp;\u0026minus;\u0026thinsp;NPVz. ZREDI is not computable (NC) in a zone where TPz\u0026thinsp;+\u0026thinsp;FPz\u0026thinsp;=\u0026thinsp;0 or TNz\u0026thinsp;+\u0026thinsp;FNz\u0026thinsp;=\u0026thinsp;0.\u003c/p\u003e\u003cp\u003eStep 4: Identify the EEP, the zone in which the Positive Predictive Value (PPV) equals the Negative Predictive Value (NPV):\u003c/p\u003e\u003cp\u003ePPVz\u0026thinsp;=\u0026thinsp;NPVz, or (TPz / [TPz\u0026thinsp;+\u0026thinsp;FPz]) = (TNz / [TNz\u0026thinsp;+\u0026thinsp;FNz])\u003c/p\u003e\u003cp\u003eStep 5: Obtain the Global Divergence Index (GDI\u0026thinsp;=\u0026thinsp;total PPV\u0026thinsp;\u0026minus;\u0026thinsp;total NPV).\u003c/p\u003e\u003cp\u003eStep 6: To capture directional misclassification, compute the False Negative Rate (FNR) and False Positive Rate (FPR) for each zone: FNRz\u0026thinsp;=\u0026thinsp;FNz / (TPz\u0026thinsp;+\u0026thinsp;FNz), FPRz\u0026thinsp;=\u0026thinsp;FPz / (TNz\u0026thinsp;+\u0026thinsp;FPz)\u003c/p\u003e\u003cp\u003eStep 7: Compute ZBES for each zone:\u003cdiv id=\"Equa\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equa\" name=\"EquationSource\"\u003e\n$$ZBES_z = \\frac{FN_z}{TP_z + FN_z} - \\frac{FP_z}{TN_z + FP_z} = FNR_z - FPR_z$$\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e\u003cp\u003eAll metrics were computed independently for each RAMI-defined diagnostic zone.\u003c/p\u003e\u003cp\u003eStep 8: Find the Diagnostic Neutrality Point (DNP): DNP occurs when FNRz\u0026thinsp;=\u0026thinsp;FPRz \u0026rArr; ZBESz\u0026thinsp;=\u0026thinsp;0\u003c/p\u003e\u003cp\u003eStep 9: Find the Epistemic Pressure Equilibrium Point (EPEP): EPEP occurs when ZREDIz\u0026thinsp;=\u0026thinsp;ZBESz\u003c/p\u003e\u003cp\u003eStep 10: Find the Zone-Based Tension State (ZBTS): ZBTS\u0026thinsp;=\u0026thinsp;ZREDI\u0026thinsp;\u0026minus;\u0026thinsp;ZBES\u003c/p\u003e\u003cp\u003eStep 11: Find the GDI\u0026thinsp;\u0026minus;\u0026thinsp;PSI delta\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec5\" class=\"Section2\"\u003e\u003ch2\u003e2.4. Validation Assessment\u003c/h2\u003e\u003cp\u003eTo ensure both theoretical soundness and practical relevance, we employed a two-part validation strategy integrating both simulation and real-world data. First, synthetic datasets were developed to emulate the behavior of a binary classifier, such as those used in HbA1c-based diagnostic systems. In this simulation, we modeled an idealized diagnostic setting with balanced class distribution and minimal noise, enabling baseline evaluation of zone-based metrics\u0026mdash;including ZREDI, ZBES, ZBE, and ZBTS\u0026mdash;under stable epistemic conditions. 
Each simulated case was assigned a RAMI score (representing standardized distance from the diagnostic threshold), a binary ground-truth label, and a predicted label based on a deterministic rule: RAMI\u0026thinsp;\u0026ge;\u0026thinsp;0 predicted positive, while RAMI\u0026thinsp;\u0026lt;\u0026thinsp;0 predicted negative. RAMI scores were then stratified into five predefined diagnostic confidence zones to facilitate localized error analysis. This structure revealed zone-specific fragilities, with underdiagnosis concentrated in RZ and LPZ, overdiagnosis in UPZ and IZ, and more balanced behavior in the TZ. Second, the zone-based evaluation framework was applied to a real-world AFP\u0026ndash;HCC dataset to demonstrate clinical utility and interpretability. Together, these validation components support the reliability, generalizability, and epistemic insight offered by zone-stratified metrics across both simulated and real diagnostic contexts.\u003c/p\u003e\u003c/div\u003e"},{"header":"3. Results","content":"\u003ch3\u003e3.1. Simulated Example\u0026nbsp;\u003c/h3\u003e\n\u003cp\u003eA total of 250 simulated patient cases (Appendix 1) were generated to emulate a binary classification system, with RAMI values calibrated to mimic the behavior of HbA1c-based diagnostic thresholds. These cases were stratified into five diagnostic confidence zones based on RAMI scores. Zone-specific classification performance was evaluated using standard and novel metrics, including PPV, NPV, ZREDI, ZBES, and ZBTS. The results of this zone-stratified diagnostic evaluation are summarized in Table 1.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eTable 1. Zone-Stratified Diagnostic Reliability and Error Metrics\u003c/strong\u003e\u003cbr\u003e\u0026nbsp;Summary of predictive and error-based diagnostic metrics across epistemic confidence zones. Metrics include PPV, NPV, ZREDI, ZBES, and ZBTS. 
Global values (GDI, PSI) are included for comparison.\u0026nbsp;\u003c/p\u003e\n\u003ctable border=\"0\" cellspacing=\"0\" cellpadding=\"0\" align=\"\" width=\"117%\"\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\" style=\"width: 11px;\"\u003e\n \u003cp\u003eZone\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 5px;\"\u003e\n \u003cp\u003eTP\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 3px;\"\u003e\n \u003cp\u003eTN\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 3px;\"\u003e\n \u003cp\u003eFP\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 3px;\"\u003e\n \u003cp\u003eFN\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 13px;\"\u003e\n \u003cp\u003ePPV (TP / [TP+FP])\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 14px;\"\u003e\n \u003cp\u003eNPV (TN / [TN+FN])\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 9px;\"\u003e\n \u003cp\u003eZREDI (PPV\u0026minus;NPV)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 8px;\"\u003e\n \u003cp\u003eFNR\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 5px;\"\u003e\n \u003cp\u003eFPR\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 8px;\"\u003e\n \u003cp\u003eZBES (FNR\u0026minus;FPR)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 12px;\"\u003e\n \u003cp\u003eZBTS (ZREDI\u0026minus;ZBES)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\" style=\"width: 11px;\"\u003e\n \u003cp\u003eRecessive\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 5px;\"\u003e\n \u003cp\u003e0\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 3px;\"\u003e\n \u003cp\u003e32\u003c/p\u003e\n 
\u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 3px;\"\u003e\n \u003cp\u003e0\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 3px;\"\u003e\n \u003cp\u003e18\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 13px;\"\u003e\n \u003cp\u003eNC\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 14px;\"\u003e\n \u003cp\u003e0.64\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 9px;\"\u003e\n \u003cp\u003eNC\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 8px;\"\u003e\n \u003cp\u003e1\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 5px;\"\u003e\n \u003cp\u003e0\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 8px;\"\u003e\n \u003cp\u003e1\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 12px;\"\u003e\n \u003cp\u003eNC\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\" style=\"width: 11px;\"\u003e\n \u003cp\u003eLower Plausible\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 5px;\"\u003e\n \u003cp\u003e0\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 3px;\"\u003e\n \u003cp\u003e34\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 3px;\"\u003e\n \u003cp\u003e0\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 3px;\"\u003e\n \u003cp\u003e16\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 13px;\"\u003e\n \u003cp\u003eNC\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 14px;\"\u003e\n \u003cp\u003e0.68\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 9px;\"\u003e\n \u003cp\u003eNC\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 8px;\"\u003e\n \u003cp\u003e1\u003c/p\u003e\n \u003c/td\u003e\n 
\u003ctd valign=\"bottom\" style=\"width: 5px;\"\u003e\n \u003cp\u003e0\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 8px;\"\u003e\n \u003cp\u003e1\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 12px;\"\u003e\n \u003cp\u003eNC\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\" style=\"width: 11px;\"\u003e\n \u003cp\u003eTrusted\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 5px;\"\u003e\n \u003cp\u003e13\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 3px;\"\u003e\n \u003cp\u003e13\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 3px;\"\u003e\n \u003cp\u003e13\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 3px;\"\u003e\n \u003cp\u003e11\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 13px;\"\u003e\n \u003cp\u003e0.5\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 14px;\"\u003e\n \u003cp\u003e0.541667\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 9px;\"\u003e\n \u003cp\u003e-0.04167\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 8px;\"\u003e\n \u003cp\u003e0.458333\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 5px;\"\u003e\n \u003cp\u003e0.5\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 8px;\"\u003e\n \u003cp\u003e-0.04167\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 12px;\"\u003e\n \u003cp\u003e0\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\" style=\"width: 11px;\"\u003e\n \u003cp\u003eUpper Plausible\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 5px;\"\u003e\n \u003cp\u003e42\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 3px;\"\u003e\n 
\u003cp\u003e0\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 3px;\"\u003e\n \u003cp\u003e8\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 3px;\"\u003e\n \u003cp\u003e0\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 13px;\"\u003e\n \u003cp\u003e0.84\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 14px;\"\u003e\n \u003cp\u003eNC\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 9px;\"\u003e\n \u003cp\u003eNC\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 8px;\"\u003e\n \u003cp\u003e0\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 5px;\"\u003e\n \u003cp\u003e1\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 8px;\"\u003e\n \u003cp\u003e-1\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 12px;\"\u003e\n \u003cp\u003eNC\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\" style=\"width: 11px;\"\u003e\n \u003cp\u003eInflated\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 5px;\"\u003e\n \u003cp\u003e50\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 3px;\"\u003e\n \u003cp\u003e0\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 3px;\"\u003e\n \u003cp\u003e0\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 3px;\"\u003e\n \u003cp\u003e0\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 13px;\"\u003e\n \u003cp\u003e1\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 14px;\"\u003e\n \u003cp\u003eNC\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 9px;\"\u003e\n \u003cp\u003eNC\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 8px;\"\u003e\n \u003cp\u003e0\u003c/p\u003e\n 
\u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 5px;\"\u003e\n \u003cp\u003eNC\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 8px;\"\u003e\n \u003cp\u003eNC\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 12px;\"\u003e\n \u003cp\u003eNC\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\" style=\"width: 11px;\"\u003e\n \u003cp\u003eTotal (global)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 5px;\"\u003e\n \u003cp\u003e105\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 3px;\"\u003e\n \u003cp\u003e79\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 3px;\"\u003e\n \u003cp\u003e21\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 3px;\"\u003e\n \u003cp\u003e45\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 13px;\"\u003e\n \u003cp\u003e0.833333\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 14px;\"\u003e\n \u003cp\u003e0.637097\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 9px;\"\u003e\n \u003cp\u003eGDI = 0.196\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 8px;\"\u003e\n \u003cp\u003e0.3\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 5px;\"\u003e\n \u003cp\u003e0.21\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 8px;\"\u003e\n \u003cp\u003ePSI = 0.09\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 12px;\"\u003e\n \u003cp\u003e0.106\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n\u003c/table\u003e\n\u003cp\u003e\u003cstrong\u003e*\u003c/strong\u003e NC = Not Computable due to lack of class-specific signal.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eTable 1\u003c/strong\u003e presents zone-stratified diagnostic 
performance metrics derived from Example 1, including PPV, NPV, ZREDI (computed as PPV minus NPV), ZBES (defined as the difference between the FNR and the FPR), and ZBTS (the residual between ZREDI and ZBES). Together, these metrics uncover latent diagnostic imbalances across the epistemic confidence strata and show how well predictive reliability aligns with the distribution of classification errors.\u003c/p\u003e\n\u003cp\u003eAs depicted in \u003cstrong\u003eFigure 1\u003c/strong\u003e, PPV increases monotonically across the zones in which it is computable (0.50 in the \u003cem\u003eTZ\u003c/em\u003e, 0.84 in the \u003cem\u003eUPZ\u003c/em\u003e), culminating at 1.0 in the \u003cem\u003eIZ\u003c/em\u003e, where every positive prediction is correct. Conversely, NPV reaches its peak in the \u003cem\u003eLPZ\u003c/em\u003e (0.68), tapering to 0.54 in the \u003cem\u003eTZ\u003c/em\u003e. This decline reflects diminishing confidence in negative predictions as one moves from low- to mid-epistemic regions. The \u003cem\u003eTZ\u003c/em\u003e, notably, marks the EEP, the zone where PPV and NPV converge (0.50 and 0.54, respectively), and represents the point of maximal post-test symmetry and diagnostic balance.\u003c/p\u003e\n\u003cp\u003eIn zones where TP\u003csub\u003ez\u003c/sub\u003e + FP\u003csub\u003ez\u003c/sub\u003e = 0 or TN\u003csub\u003ez\u003c/sub\u003e + FN\u003csub\u003ez\u003c/sub\u003e = 0, ZREDI becomes undefined, denoted as \u0026ldquo;Not Computable\u0026rdquo; (NC). This arises from division by zero in the predictive value calculations and signals epistemic vacuity, that is, a complete absence of class-specific predictive or diagnostic signal. Such vacuity is observed in the Recessive, Lower Plausible, Upper Plausible, and Inflated Zones, each of which lacks the necessary conditions for computing PPV, NPV, or both. 
These vacuous zones thus reflect either diagnostic collapse or an overconfident skew.\u003c/p\u003e\n\u003cp\u003eAs further illustrated in \u003cstrong\u003eFigure 2\u003c/strong\u003e, the \u003cem\u003eTZ\u003c/em\u003e stands out as the convergence nexus of three epistemic landmarks: the EEP (where PPV approximates NPV), the DNP (where ZBES equals zero), and the EPEP (defined by the equality of ZREDI and ZBES). At this critical junction, the ZBTS value is minimized (\u0026asymp; 0), indicating a state of low epistemic tension and balanced diagnostic behavior.\u003c/p\u003e\n\u003cp\u003eIn contrast, zones outside the \u003cem\u003eTZ\u003c/em\u003e manifest polarized diagnostic distortions. The \u003cem\u003eRZ\u003c/em\u003e and \u003cem\u003eLPZ\u003c/em\u003e exhibit underdiagnosis, characterized by undefined ZREDI, extreme positive ZBES (+1.0), and the complete absence of TPs, reflecting a one-sided reliance on negative classifications. The Upper Plausible and Inflated Zones, on the other hand, show signs of overdiagnosis. Here, ZREDI is undefined because no negative predictions are made (NPV is NC) while PPV is high (0.84 and 1.0), and ZBES is either maximally negative (\u0026minus;1.0 in the \u003cem\u003eUPZ\u003c/em\u003e) or not computable (in the \u003cem\u003eIZ\u003c/em\u003e), indicating a heavy skew toward FPs or complete certainty without counterbalance.\u003c/p\u003e\n\u003cp\u003eAlthough global diagnostic metrics such as PPV (0.833), NPV (0.637), the Global Divergence Index (GDI = 0.196), and the Predictive Skew Index (PSI = 0.09) suggest relatively good overall model performance, these aggregate measures obscure the heterogeneity of reliability and error across zones. The \u003cem\u003eRZ\u003c/em\u003e and \u003cem\u003eLPZ\u003c/em\u003e, for instance, suffer from complete diagnostic failure (TP = 0), with undefined ZREDI and maximal ZBES (+1.0), despite the favorable global PPV. 
Only the \u003cem\u003eTZ\u003c/em\u003e maintains near-convergence of reliability and error (ZREDI \u0026asymp; ZBES \u0026asymp; \u0026minus;0.0417), validating its designation as a post-test equilibrium region. The global ZBTS value of 0.106 quantifies the overall discrepancy between reliability divergence and error skew, highlighting the underlying epistemic friction and reinforcing the utility of zone-aware metrics such as ZREDI, ZBES, and ZBTS for robust and context-sensitive diagnostic evaluation.\u003c/p\u003e\n\u003cp\u003e3.1.1. Evaluation of Clinical Utility and Zonal Error Burden Using ZBE\u003c/p\u003e\n\u003cp\u003eFigure 3 presents a zone-stratified analysis of clinical utility metrics in Example 1, focusing on Net Benefit, Expected Diagnostic Cost, and Zone-Balanced Error (ZBE). Net Benefit is calculated at a threshold of 0.5, while Expected Diagnostic Cost is derived under the assumption that FNs carry twice the cost of FPs. ZBE, introduced here as a class-neutral and interpretable measure, is the average of FNR and FPR and quantifies overall diagnostic misclassification within each epistemic zone.\u003c/p\u003e\n\u003cp\u003eThe \u003cem\u003eTZ\u003c/em\u003e stands out as the optimal region of diagnostic operation, where the error burden is the lowest among zones that contain misclassified cases (ZBE = 0.479) and cost and benefit metrics are relatively balanced. While Net Benefit is neutral in the Recessive, Lower Plausible, and Trusted Zones, it sharply increases in the Upper Plausible (0.68) and Inflated (1.0) Zones, indicating the greatest clinical gain in high-confidence regions. 
These results, supported by Appendix 2, underscore that clinical utility concentrates in zones of elevated epistemic confidence.\u003c/p\u003e\n\u003cp\u003eExpected Diagnostic Cost (Appendix 3) is highest in the \u003cem\u003eRZ\u003c/em\u003e and \u003cem\u003eLPZ\u003c/em\u003e due to the predominance of FNs and absence of TPs, with costs of 1.8 and 1.6, respectively. In contrast, cost drops sharply in the \u003cem\u003eUPZ\u003c/em\u003e (0.16) and is zero in the \u003cem\u003eIZ\u003c/em\u003e (0.0), supporting efficient diagnostic decision-making when the model is highly certain.\u003c/p\u003e\n\u003cp\u003eZBE values from Appendix 4 confirm that the \u003cem\u003eRZ\u003c/em\u003e and \u003cem\u003eLPZ\u003c/em\u003e exhibit the highest misclassification burden (ZBE = 0.5), while the \u003cem\u003eIZ\u003c/em\u003e has perfect classification (ZBE = 0.0). The \u003cem\u003eTZ\u003c/em\u003e again emerges as the most balanced region in terms of reliability and error (ZBE = 0.479), validating its role as a diagnostic anchor.\u003c/p\u003e\n\u003cp\u003eTogether, these findings highlight that global metrics, while informative (global Net Benefit = 0.337; global EDC = 0.942; global ZBE = 0.396), risk obscuring critical intra-zonal variability (Appendices 2, 3, and 4).\u003c/p\u003e\n\u003cp\u003eFigure 4 illustrates the net clinical benefit of predictive decisions across diagnostic confidence zones as a function of threshold probability. The \u003cem\u003eTZ\u003c/em\u003e exhibits the highest net benefit across a broad range of thresholds, confirming its superior diagnostic utility and balance between sensitivity and specificity. The \u003cem\u003eUPZ\u003c/em\u003e and \u003cem\u003eLPZ\u003c/em\u003e show moderate net benefit, indicating their potential clinical value under selective conditions. 
In contrast, the \u003cem\u003eRZ\u003c/em\u003e and \u003cem\u003eIZ\u003c/em\u003e yield the lowest net benefits, reflecting diagnostic inefficiency due to high misclassification risk (underdiagnosis and overdiagnosis, respectively). This zone-based decision curve analysis (DCA) supports threshold-sensitive decision-making by quantifying real-world trade-offs between benefit and harm across epistemic strata.\u003c/p\u003e\n\u003cp\u003eThe application of clinical utility metrics at the \u003cstrong\u003ezone level\u003c/strong\u003e represents an innovative advancement in evaluating diagnostic model performance. By integrating \u003cstrong\u003eNet Benefit\u003c/strong\u003e, \u003cstrong\u003eExpected Diagnostic Cost\u003c/strong\u003e, and the newly introduced \u003cstrong\u003eZone-Balanced Error (ZBE)\u003c/strong\u003e within RAMI-stratified diagnostic zones, this approach transcends conventional, aggregate-level assessments. While global metrics offer a useful summary, they often obscure critical intra-zonal disparities that directly impact clinical decision-making. Zone-stratified evaluation, by contrast, reveals how diagnostic reliability, misclassification burden, and decision value fluctuate across varying levels of model certainty. This granular insight enables more informed deployment of diagnostic tools, targeting zones of high confidence while flagging regions of epistemic fragility. As such, zone-based clinical utility auditing not only enhances interpretability but also establishes a more \u003cstrong\u003erisk-aware and evidence-aligned framework\u003c/strong\u003e for responsible clinical model validation.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e3.2. 
Real-World Diagnostic Example\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eIn a diagnostic evaluation of hepatocellular carcinoma (HCC), \u003cstrong\u003e401 archived plasma samples\u003c/strong\u003e were analyzed, comprising \u003cstrong\u003e208 samples from confirmed HCC patients\u003c/strong\u003e and \u003cstrong\u003e193 from liver cirrhosis patients without HCC\u003c/strong\u003e, who served as clinical controls. This dataset is derived from the study by Jang et al. (2016), \u003cem\u003ePLoS ONE\u003c/em\u003e (https://doi.org/10.1371/journal.pone.0151069) [13,14]. In this context, \u003cstrong\u003eAFP\u0026apos;s diagnostic performance\u003c/strong\u003e was assessed specifically in distinguishing \u003cstrong\u003eHCC from cirrhosis\u003c/strong\u003e, a clinically critical challenge since \u003cstrong\u003epatients with cirrhosis represent the key population targeted for early HCC surveillance\u003c/strong\u003e. Accordingly, the resulting confusion matrix and all derived diagnostic metrics (summarized in Tables 2 and 3) capture AFP\u0026rsquo;s capacity to discriminate malignant (HCC) from non-malignant (cirrhosis) chronic liver disease within a clinically relevant high-risk population. This real-world dataset enables a grounded application of the zone-stratified framework, allowing for fine-grained analysis of biomarker reliability and diagnostic utility across different levels of model certainty. In this real-world example, \u003cstrong\u003eepistemic zone boundaries\u003c/strong\u003e were \u003cstrong\u003epredefined\u003c/strong\u003e to match \u003cstrong\u003ewidely accepted AFP diagnostic ranges\u003c/strong\u003e used in HCC surveillance rather than being derived from statistical confidence intervals. 
This approach ensured \u003cstrong\u003eimmediate clinical relevance\u003c/strong\u003e while maintaining compatibility with \u003cstrong\u003ezone-based calibration\u003c/strong\u003e for interpretability.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eTable 2. Zone-Stratified Diagnostic Classification of AFP Levels Across Epistemic Confidence Zones\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThis table summarizes zone-wise diagnostic classification counts (TP, FN, FP, TN) across five AFP concentration ranges, aligned with increasing diagnostic confidence. By mapping AFP levels to epistemic zones, from low-certainty (Recessive) to high-certainty (Inflated), the stratification enables detailed assessment of AFP\u0026rsquo;s ability to distinguish HCC from cirrhosis within a high-risk population.\u003c/p\u003e\n\u003ctable border=\"0\" cellspacing=\"0\" cellpadding=\"0\" width=\"395\"\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\" style=\"width: 75px;\"\u003e\n \u003cp\u003eAFP Zone (ng/mL)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 64px;\"\u003e\n \u003cp\u003eEpistemic Zone\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 64px;\"\u003e\n \u003cp\u003eTP\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 64px;\"\u003e\n \u003cp\u003eFN\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 64px;\"\u003e\n \u003cp\u003eFP\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 64px;\"\u003e\n \u003cp\u003eTN\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\" style=\"width: 75px;\"\u003e\n \u003cp\u003e\u0026lt;5\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 64px;\"\u003e\n \u003cp\u003eRecessive\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 64px;\"\u003e\n 
\u003cp\u003e5\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 64px;\"\u003e\n \u003cp\u003e30\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 64px;\"\u003e\n \u003cp\u003e5\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 64px;\"\u003e\n \u003cp\u003e80\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\" style=\"width: 75px;\"\u003e\n \u003cp\u003e5-20\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 64px;\"\u003e\n \u003cp\u003eLower Plausible\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 64px;\"\u003e\n \u003cp\u003e35\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 64px;\"\u003e\n \u003cp\u003e25\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 64px;\"\u003e\n \u003cp\u003e6\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 64px;\"\u003e\n \u003cp\u003e60\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\" style=\"width: 75px;\"\u003e\n \u003cp\u003e20-200\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 64px;\"\u003e\n \u003cp\u003eTrusted\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 64px;\"\u003e\n \u003cp\u003e55\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 64px;\"\u003e\n \u003cp\u003e20\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 64px;\"\u003e\n \u003cp\u003e6\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 64px;\"\u003e\n \u003cp\u003e25\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\" style=\"width: 75px;\"\u003e\n \u003cp\u003e200-1000\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 64px;\"\u003e\n \u003cp\u003eUpper 
Plausible\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 64px;\"\u003e\n \u003cp\u003e25\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 64px;\"\u003e\n \u003cp\u003e3\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 64px;\"\u003e\n \u003cp\u003e2\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 64px;\"\u003e\n \u003cp\u003e7\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\" style=\"width: 75px;\"\u003e\n \u003cp\u003e\u0026gt;1000\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 64px;\"\u003e\n \u003cp\u003eInflated\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 64px;\"\u003e\n \u003cp\u003e8\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 64px;\"\u003e\n \u003cp\u003e2\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 64px;\"\u003e\n \u003cp\u003e0\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 64px;\"\u003e\n \u003cp\u003e2\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n\u003c/table\u003e\n\u003cp\u003e\u003cstrong\u003e\u0026nbsp;\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eTable 3. Epistemic metrics of AFP diagnostic performance across confidence zones\u003c/strong\u003e\u003cbr\u003e\u0026nbsp;This table summarizes zone-stratified epistemic metrics for AFP performance, highlighting asymmetries in predictive reliability and error across diagnostic confidence zones. 
ZBTS values reveal zones of underconfidence (negative values) and overconfidence (positive values), offering a refined lens to interpret AFP\u0026rsquo;s diagnostic behavior across clinically meaningful thresholds.\u003c/p\u003e\n\u003ctable border=\"0\" cellspacing=\"0\" cellpadding=\"0\" width=\"395\"\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 107px;\"\u003e\n \u003cp\u003e\u003cstrong\u003eAFP Zone (ng/mL)\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 103px;\"\u003e\n \u003cp\u003e\u003cstrong\u003eZREDI\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 92px;\"\u003e\n \u003cp\u003e\u003cstrong\u003eZBES\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 92px;\"\u003e\n \u003cp\u003e\u003cstrong\u003eZBTS\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 107px;\"\u003e\n \u003cp\u003e\u003cstrong\u003e\u0026lt;5\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 103px;\"\u003e\n \u003cp\u003e\u003cstrong\u003e-0.2273\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 92px;\"\u003e\n \u003cp\u003e\u003cstrong\u003e0.7983\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 92px;\"\u003e\n \u003cp\u003e\u003cstrong\u003e-1.0256\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 107px;\"\u003e\n \u003cp\u003e\u003cstrong\u003e5-20\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 103px;\"\u003e\n \u003cp\u003e\u003cstrong\u003e0.1478\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 92px;\"\u003e\n \u003cp\u003e\u003cstrong\u003e0.3258\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 92px;\"\u003e\n \u003cp\u003e\u003cstrong\u003e-0.178\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n 
\u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 107px;\"\u003e\n \u003cp\u003e\u003cstrong\u003e20-200\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 103px;\"\u003e\n \u003cp\u003e\u003cstrong\u003e0.3461\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 92px;\"\u003e\n \u003cp\u003e\u003cstrong\u003e0.0731\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 92px;\"\u003e\n \u003cp\u003e\u003cstrong\u003e0.273\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 107px;\"\u003e\n \u003cp\u003e\u003cstrong\u003e200-1000\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 103px;\"\u003e\n \u003cp\u003e\u003cstrong\u003e0.2259\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 92px;\"\u003e\n \u003cp\u003e\u003cstrong\u003e-0.1151\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 92px;\"\u003e\n \u003cp\u003e\u003cstrong\u003e0.341\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 107px;\"\u003e\n \u003cp\u003e\u003cstrong\u003e\u0026gt;1000\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 103px;\"\u003e\n \u003cp\u003e\u003cstrong\u003e0.5\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 92px;\"\u003e\n \u003cp\u003e\u003cstrong\u003e0.2\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 92px;\"\u003e\n \u003cp\u003e\u003cstrong\u003e0.3\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 107px;\"\u003e\n \u003cp\u003e\u003cstrong\u003eGlobal\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 103px;\"\u003e\n \u003cp\u003e\u003cstrong\u003eGDI=\u003c/strong\u003e\u003c/p\u003e\n 
\u003cp\u003e\u003cstrong\u003e0.1857\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 92px;\"\u003e\n \u003cp\u003e\u003cstrong\u003ePSI=\u003c/strong\u003e\u003c/p\u003e\n \u003cp\u003e\u003cstrong\u003e0.2862\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 92px;\"\u003e\n \u003cp\u003e\u003cstrong\u003eDelta\u003c/strong\u003e\u003c/p\u003e\n \u003cp\u003e\u003cstrong\u003e\u0026nbsp;GDI-PSI =\u003c/strong\u003e\u003c/p\u003e\n \u003cp\u003e\u003cstrong\u003e\u0026nbsp;-0.1005\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n\u003c/table\u003e\n\u003cp\u003eThe application of epistemic indices across stratified AFP diagnostic zones revealed distinct patterns in predictive reliability and error distribution. ZREDI (Zone-Relevant Epistemic Divergence Index) and ZBES varied meaningfully across AFP ranges, with ZBTS capturing latent diagnostic tensions. The Recessive (\u0026lt;5 ng/mL) and Lower Plausible (5\u0026ndash;20 ng/mL) zones exhibited negative ZBTS values (\u0026minus;1.0256 and \u0026minus;0.178), with ZREDI (\u0026minus;0.2273 and 0.1478) falling below a high ZBES (0.7983 and 0.3258), indicating underconfident behavior characterized by conservative predictions coexisting with substantial misclassification. In contrast, the Trusted (20\u0026ndash;200 ng/mL), Upper Plausible (200\u0026ndash;1000 ng/mL), and Inflated (\u0026gt;1000 ng/mL) zones showed positive ZBTS values (0.273, 0.341, and 0.3), with ZREDI (0.3461, 0.2259, and 0.5) exceeding ZBES (0.0731, \u0026minus;0.1151, and 0.2), suggesting overconfident diagnostic behavior, that is, high apparent reliability despite error asymmetry. 
Three key epistemic landmarks were identified: the EEP, where ZREDI = 0 and predictive calibration is symmetrical (PPV = NPV), at AFP \u0026asymp; 8.56 ng/mL; the EPEP, the intersection of ZREDI and ZBES, at AFP \u0026asymp; 50.98 ng/mL, reflecting equilibrium between calibration divergence and error skew; and the DNP, defined by ZBES = 0 (FNR = FPR), near AFP \u0026asymp; 300.32 ng/mL. 
These results highlight how epistemic analysis of diagnostic zones uncovers nuanced reliability profiles beyond conventional sensitivity or specificity, with implications for threshold optimization and interpretive guidance in HCC screening (Figs. 5 and 6).\u003c/p\u003e\n\u003cp\u003eTo contextualize epistemic performance in terms of actionable clinical value, we applied a three-part Zone-Based Clinical Utility Evaluation to the real AFP dataset. First, Zone-Stratified Net Benefit was assessed using Decision Curve Analysis (DCA), capturing the clinical trade-off between TPs and FPs across AFP-defined diagnostic zones. This revealed that the Trusted and Upper Plausible zones yielded the highest net benefit, particularly at intermediate threshold probabilities. Second, Expected Diagnostic Cost was calculated by assigning penalty weights to FPs and FNs, demonstrating that cost-efficiency improves markedly once AFP levels exceed 200 ng/mL, corresponding to reduced misclassification in those zones. Third, we applied the Zone-Balanced Error (ZBE) metric, computed as the average of the false positive and false negative rates within each zone, providing a class-neutral estimate of misclassification burden. ZBE further distinguished zones of error symmetry, offering complementary insight to the directional asymmetry captured by ZREDI and ZBES. Together, these utility metrics reveal that AFP levels above 200 ng/mL (Upper Plausible and Inflated zones) yield the highest clinical value, combining maximal net benefit, minimal cost, and lowest ZBE. The \u003cem\u003eTZ\u003c/em\u003e (20\u0026ndash;200 ng/mL) also shows strong utility, balancing moderate cost with high net benefit and low ZBE. In contrast, lower AFP zones (\u0026lt;20 ng/mL) demonstrate high diagnostic cost, negligible net benefit, and elevated ZBE, suggesting limited utility and greater risk of misclassification. 
Overall, this integrated framework aligns epistemic and utility-based evaluations to inform evidence-weighted threshold selection in AFP-guided HCC screening (Fig. 7).\u003c/p\u003e"},{"header":"4. Discussion","content":"\u003cdiv id=\"Sec11\" class=\"Section2\"\u003e\u003ch2\u003e4.1. Zone-Stratified Diagnostic Evaluation Framework\u003c/h2\u003e\u003cp\u003eThis study introduces a zone-stratified diagnostic evaluation framework anchored by four interrelated metrics\u0026mdash;ZREDI, ZBES, Zone-Balanced Error (ZBE), and the composite ZBTS. Unlike traditional global metrics such as sensitivity, specificity, or AUC, which provide aggregate summaries, these zone-specific measures offer granular, epistemically grounded insights into diagnostic model behavior across varying levels of certainty. The framework is applied across five defined diagnostic zones\u0026mdash;Recessive, Lower Plausible, Trusted, Upper Plausible, and Inflated\u0026mdash;each corresponding to a distinct confidence tier that shapes interpretive value and clinical decision-making.\u003c/p\u003e\u003cp\u003eResearchers may rely on RAMI-derived intervals when prioritizing uncertainty quantification, or alternatively use pre-established clinical thresholds when aiming for alignment with regulatory guidelines and real-world validation protocols. This is well demonstrated by the observed misalignment between RAMI-derived probabilistic thresholds and clinically established decision levels, which highlighted the need to adopt predefined AFP cut-offs in the real-world example of AFP-based HCC diagnostics. These thresholds, widely recommended in hepatocellular carcinoma (HCC) surveillance protocols, ensured both clinical relevance and practical interpretability of the diagnostic framework.\u003c/p\u003e\u003cp\u003eThis finding underscores the flexibility of the zone-stratified framework. 
In this HCC evaluation, we selected clinically accepted AFP ranges rather than deriving boundaries from statistical confidence intervals. This approach preserved immediate clinical applicability while still enabling zone-based calibration, thereby demonstrating the framework\u0026rsquo;s adaptability across diverse diagnostic, regulatory, and research contexts.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec12\" class=\"Section2\"\u003e\u003ch2\u003e4.2 Zone-Relative Predictive Symmetry and the EEP\u003c/h2\u003e\u003cp\u003eThe ZREDI quantifies directional asymmetry in predictive reliability by subtracting NPV from PPV within each diagnostic zone. A ZREDI value near zero reflects balanced trust in both positive and negative predictions, as observed in the TZ. As ZREDI diverges from zero\u0026mdash;particularly beyond \u0026plusmn;\u0026thinsp;0.30\u0026mdash;it indicates increasing asymmetry, signaling overconfidence in one class and reduced reliability in the other, which can elevate clinical risk. For example, highly positive ZREDI values suggest overdiagnosis potential (e.g., in IZs), while negative values imply underdiagnosis (e.g., in RZs).\u003c/p\u003e\u003cp\u003eTo summarize global post-test asymmetry, the Global Divergence Index (GDI\u0026thinsp;=\u0026thinsp;PPV\u0026thinsp;\u0026minus;\u0026thinsp;NPV) can be used. While informative, GDI lacks ZREDI\u0026rsquo;s localized granularity and may obscure stratified vulnerabilities. Thus, ZREDI offers a more actionable metric for zone-specific auditing and recalibration.\u003c/p\u003e\u003cp\u003eThe EEP marks the zone where ZREDI equals zero\u0026mdash;that is, where PPV and NPV converge\u0026mdash;indicating optimal diagnostic neutrality. This point serves as a key benchmark for model calibration and deployment readiness. In our simulation, the EEP aligned with the TZ, where sensitivity and NPV both approximated 54.17%, reflecting a rare state of post-test symmetry and maximal epistemic stability. 
When the EEP shifts into higher confidence zones (e.g., Upper Plausible or Inflated), it may suggest delayed calibration or overconfidence. Conversely, if it lies in lower zones, it may signal early but fragile equilibrium.\u003c/p\u003e\u003cp\u003eClinically, the EEP is more than a statistical construct\u0026mdash;it acts as a meta-indicator of reliability. Its position guides diagnostic trust, threshold setting, and safety interventions. Embedded within zone-stratified evaluation, the EEP and ZREDI together provide a dual-lens assessment of model performance, combining predictive directionality with balance, and supporting both interpretive clarity and deployment confidence.\u003c/p\u003e\u003cp\u003eThe ZREDI is intentionally constructed using PPV and NPV\u0026mdash;two prediction-centered metrics that reflect the trustworthiness of a model\u0026rsquo;s outputs rather than its accuracy against ground truth. PPV (TP / [TP\u0026thinsp;+\u0026thinsp;FP]) measures the likelihood that a positive prediction is correct, while NPV (TN / [TN\u0026thinsp;+\u0026thinsp;FN]) provides the same for negative predictions. Because these values are conditioned on the model\u0026rsquo;s predictions, they align closely with how clinicians and end users interpret and act on diagnostic outputs in real time.\u003c/p\u003e\u003cp\u003eThis framing makes ZREDI a uniquely user-facing metric, offering a direct lens into the asymmetry of predictive reliability across zones. Unlike sensitivity or specificity\u0026mdash;which are grounded in outcome-based performance\u0026mdash;ZREDI answers a more immediate and clinically relevant question: \u003cem\u003e\u0026ldquo;When the model predicts a positive or negative outcome, how reliable is that prediction?\u0026rdquo;\u003c/em\u003e\u003c/p\u003e\u003cp\u003eMoreover, PPV and NPV act as practical proxies for epistemic certainty. 
When either metric is low, it signals a zone of local diagnostic fragility\u0026mdash;indicating that the model is offering predictions unsupported by sufficient evidence. By taking the signed difference (PPV\u0026thinsp;\u0026minus;\u0026thinsp;NPV), ZREDI reveals the direction and magnitude of this reliability imbalance, uncovering potential overdiagnosis (when PPV ≫ NPV) or underdiagnosis (when NPV ≫ PPV) that may be obscured in global metrics.\u003c/p\u003e\u003cp\u003eUltimately, ZREDI enables granular, zone-specific auditing of trust asymmetry. It supports recalibration where needed, enhances interpretability for end users, and operationalizes the concept of epistemic risk. By anchoring itself in predictive reliability rather than retrospective correctness, ZREDI bridges methodological precision with clinical applicability, offering a robust tool for epistemically informed model evaluation.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec13\" class=\"Section2\"\u003e\u003ch2\u003e4.3. ZBES, DNP, and the Global Prediction Symmetry Index (PSI)\u003c/h2\u003e\u003cp\u003eThe ZBES is a directional diagnostic metric that quantifies the asymmetry of misclassification within each epistemic zone. Defined as the difference between the FNR and the FPR, it reflects the net diagnostic bias of a model\u0026rsquo;s errors. A ZBES of zero denotes perfect error symmetry, while positive or negative values highlight skew toward underdiagnosis or overdiagnosis, respectively. This makes ZBES a powerful lens for detecting class-specific vulnerabilities that are often masked by global metrics such as accuracy or AUC.\u003c/p\u003e\u003cp\u003eIn our simulation, ZBES varied substantially across zones. The TZ achieved near-perfect symmetry (ZBES\u0026thinsp;=\u0026thinsp;+\u0026thinsp;0.042), indicating well-balanced misclassification. 
Conversely, the Recessive and Inflated Zones reached extreme values of +\u0026thinsp;1.0 and \u0026minus;\u0026thinsp;1.0, respectively, indicating complete directional breakdown, dominated either by FNs or by FPs. Such skew reveals critical diagnostic fragility that can significantly impact clinical outcomes if left unrecognized.\u003c/p\u003e\u003cp\u003eTo guide interpretation, the magnitude of ZBES can be qualitatively categorized: absolute values\u0026thinsp;\u0026le;\u0026thinsp;0.10 suggest excellent diagnostic balance; 0.11\u0026ndash;0.30 reflects moderate skew; 0.31\u0026ndash;0.50 indicates high misclassification risk; and absolute values\u0026thinsp;\u0026gt;\u0026thinsp;0.50 signify severe directional failure. Though mathematically equivalent to specificity minus sensitivity (because FNR\u0026thinsp;=\u0026thinsp;1\u0026thinsp;\u0026minus;\u0026thinsp;sensitivity and FPR\u0026thinsp;=\u0026thinsp;1\u0026thinsp;\u0026minus;\u0026thinsp;specificity), ZBES offers distinct interpretive value by focusing explicitly on misclassification burden rather than correct detection.\u003c/p\u003e\u003cp\u003eImportantly, ZBES also defines the DNP, the threshold at which FNR equals FPR (i.e., ZBES\u0026thinsp;=\u0026thinsp;0). This point marks the zone of symmetrical diagnostic error, serving as a complementary benchmark to the EEP, where PPV equals NPV. While the EEP signals equilibrium in predictive trust, the DNP indicates balance in diagnostic fallibility. When both occur within the same zone, typically the TZ, it suggests maximal epistemic stability and reinforces that zone\u0026rsquo;s clinical defensibility.\u003c/p\u003e\u003cp\u003eTo contextualize ZBES in a broader framework, we compare it to the Prediction Symmetry Index (PSI), a global analogue defined either as PSI\u0026thinsp;=\u0026thinsp;FNR\u0026thinsp;\u0026minus;\u0026thinsp;FPR or alternatively as PSI\u0026thinsp;=\u0026thinsp;(PPV\u0026thinsp;+\u0026thinsp;NPV)\u0026thinsp;\u0026minus;\u0026thinsp;1. 
Both forms capture system-wide directional imbalance, with positive values indicating a tendency toward underdiagnosis and negative values suggesting overdiagnosis. While PSI provides an overarching view of error asymmetry, it lacks the spatial resolution offered by ZBES [\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e, \u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e]. By computing ZBES within each epistemic zone (e.g., Recessive, Trusted, Inflated), one can detect zone-specific misclassification patterns that may be obscured at the aggregate level.\u003c/p\u003e\u003cp\u003eThus, ZBES and PSI function in parallel\u0026mdash;PSI offering global insight, ZBES offering local granularity. Used together, they enable comprehensive auditing of diagnostic models within stratified or threshold-sensitive applications. Clinically, this integrated framework supports better triage decisions, recalibration strategies, and risk-aware deployment by identifying not just how often the model is wrong, but how and where its errors manifest.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec14\" class=\"Section2\"\u003e\u003ch2\u003e4.4. ZBTS, Epistemic Tension, and the EPEP\u003c/h2\u003e\u003cp\u003eThe ZBTS is introduced as a composite interpretive construct derived from the interaction between the ZREDI and the ZBES. Defined as ZBTS\u0026thinsp;=\u0026thinsp;ZREDI\u0026thinsp;\u0026minus;\u0026thinsp;ZBES, this metric captures the divergence between predictive trust asymmetry and diagnostic error imbalance within each epistemic zone. A ZBTS value close to zero reflects epistemic balance, indicating that calibration (i.e., trust in predictions) and misclassification patterns are in alignment. 
In contrast, positive ZBTS values signal overconfidence\u0026mdash;situations where a model's predictive reliability exceeds its actual error performance\u0026mdash;whereas negative values imply underconfidence, denoting conservative diagnostic behavior that yields higher-than-expected accuracy despite low trust indicators.\u003c/p\u003e\u003cp\u003eA unique case arises when ZREDI equals ZBES, resulting in ZBTS\u0026thinsp;=\u0026thinsp;0. This defines the EPEP, which, unlike the EEP\u0026mdash;where PPV equals NPV\u0026mdash;or the DNP\u0026mdash;where FNR equals FPR\u0026mdash;represents a convergence of both calibration and error symmetry. When the EPEP occurs, often near the TZ, it marks a zone of minimal epistemic tension and optimal diagnostic harmony. In our simulation, distinct Epistemic Tension States (ETS) were observed across zones. The TZ showed balanced calibration and error (ZREDI\u0026thinsp;\u0026asymp;\u0026thinsp;ZBES\u0026thinsp;\u0026asymp;\u0026thinsp;0), resulting in ZBTS\u0026thinsp;\u0026asymp;\u0026thinsp;0 and diagnostic neutrality. The RZ, which produced no positive predictions, had an undefined ZREDI but a maximally defined ZBES of +\u0026thinsp;1.0, resulting in a low ZBTS value indicative of dysfunctional symmetry\u0026mdash;where apparent predictive neutrality coexists with unidirectional diagnostic failure. The IZ showed perfect PPV but undefined NPV and ZBES due to absent negative predictions, implying a high-ZBTS state and overconfidence. These results illustrate how ZBTS identifies zones of latent epistemic instability, even when core metrics are not computable.\u003c/p\u003e\u003cp\u003eAt the global level, an analogous measure of epistemic divergence can be expressed as the difference between the Global Divergence Index (GDI\u0026thinsp;=\u0026thinsp;PPV\u0026thinsp;\u0026minus;\u0026thinsp;NPV) and the Prediction Symmetry Index (PSI\u0026thinsp;=\u0026thinsp;FNR\u0026thinsp;\u0026minus;\u0026thinsp;FPR). 
In our real data example, the global GDI was 0.196 and PSI was 0.09, resulting in a delta (Δ_Global) of 0.106. This modest but non-negligible gap reflects global epistemic tension, suggesting that predictive reliability and misclassification are not perfectly aligned. However, global deltas can obscure zone-specific failures. ZBTS, by contrast, reveals where calibration outpaces\u0026mdash;or falls behind\u0026mdash;actual performance, providing a localized map of epistemic fragility that supports more granular threshold tuning and interpretive safety.\u003c/p\u003e\u003cp\u003eFurthermore, while PSI summarizes directional misclassification imbalance across the entire dataset, it is agnostic to local context. ZBES serves as its stratified analogue, revealing spatial heterogeneity in diagnostic error patterns (1,2). ZREDI, ZBES, and ZBTS, taken together, offer a multidimensional lens on model behavior, exposing both global asymmetries and local imbalances. When mapped across diagnostic zones, this triad enables transparency in trust allocation, risk assessment, and deployment planning, particularly in high-stakes clinical settings where epistemic safety is essential.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec15\" class=\"Section2\"\u003e\u003ch2\u003e4.5. Evaluating Diagnostic Utility Across Confidence Zones\u003c/h2\u003e\u003cp\u003eIn this study, we incorporated Zone-Balanced Error (ZBE) alongside Net Benefit and Expected Diagnostic Cost to provide a comprehensive view of diagnostic utility.\u003c/p\u003e\u003cp\u003eZone-Balanced Error (ZBE) is a class-neutral metric designed to quantify the average misclassification burden\u0026mdash;calculated as the mean of the FNR and FPR\u0026mdash;within each diagnostic confidence zone. 
Unlike conventional global metrics such as accuracy, sensitivity, or even global balanced error (BE), which aggregate performance across the entire dataset, ZBE enables localized evaluation by isolating error patterns within epistemically defined strata (e.g., Trusted, Recessive, Inflated). This is particularly critical in real-world clinical settings, where model reliability and risk may vary dramatically across prediction thresholds. ZBE provides actionable insight by identifying zones where total error rates are acceptably low\u0026mdash;or conversely, dangerously high\u0026mdash;thereby supporting zone-specific deployment strategies. Complementing ZBE, additional zone-stratified utility metrics were applied: zone-level Net Benefit, which quantifies clinical utility under specified decision thresholds; and Expected Diagnostic Cost, which incorporates error penalties (e.g., higher cost for FNs). Together, these metrics form a robust toolkit for evaluating diagnostic models under epistemic uncertainty, allowing stakeholders to audit not just how much or what kind of error a model makes\u0026mdash;but where those vulnerabilities occur within the decision landscape.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec16\" class=\"Section2\"\u003e\u003ch2\u003e4.6. Zone-Stratified Evaluation in Real-World Context\u003c/h2\u003e\u003cp\u003eCompared to zone-level results, the global diagnostic metrics present an oversimplified and potentially misleading picture of AFP performance. The global ZREDI of +\u0026thinsp;0.190 reflects moderate predictive asymmetry overall, yet it masks critical zonal divergences such as the extreme underconfidence seen in the RZ (ZBTS\u0026thinsp;=\u0026thinsp;\u0026minus;\u0026thinsp;1.026) and the strong overconfidence in the UPZ (ZBTS\u0026thinsp;=\u0026thinsp;+\u0026thinsp;0.341). 
Similarly, the global ZBES (+\u0026thinsp;0.0667) suggests a modest error imbalance but fails to capture the sharp error skew in zones like Recessive (+\u0026thinsp;0.7983) or Inflated (+\u0026thinsp;0.2000). While the global ZBTS (+\u0026thinsp;0.1233) implies a net overconfident posture, this aggregate hides the diagnostic instability present in lower AFP strata. These discrepancies underscore the importance of zone-stratified analysis in revealing clinically relevant epistemic behaviors that would otherwise be obscured by aggregate statistics.\u003c/p\u003e\u003cp\u003eThe application of the proposed zone-stratified epistemic framework to real-world AFP data in the context of hepatocellular carcinoma (HCC) revealed diagnostic behaviors that align with, but also refine, prior observations in the literature. In the RZ (\u0026lt;\u0026thinsp;5 ng/mL), AFP demonstrated a high false-negative burden, consistent with the known limitations of AFP in detecting early-stage or AFP-negative HCC [\u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e, \u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e]. This underscores the need for supplementary diagnostic strategies in AFP-negative populations. In the LPZ (5\u0026ndash;20 ng/mL), model behavior remained conservative, with underprediction and modest gains in reliability, reflecting transitional epistemic behavior with residual diagnostic ambiguity.\u003c/p\u003e\u003cp\u003eThe TZ (20\u0026ndash;200 ng/mL), situated between the Lower and Upper Plausible strata, emerged as a region of relative epistemic balance, with ZREDI values approaching zero and ZBES showing minimal asymmetry. This suggests a zone of predictive stability, where both positive and negative predictions can be interpreted with greater confidence. 
However, the Upper Plausible (200\u0026ndash;1000 ng/mL) and Inflated (\u0026gt;\u0026thinsp;1000 ng/mL) zones revealed a shift toward positive bias, characterized by high PPV, decreasing ZBE, and rising ZREDI values. While this reflects improved detection sensitivity, it also introduces the risk of overdiagnosis if interpretive safeguards are not in place [\u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e, \u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e19\u003c/span\u003e].\u003c/p\u003e\u003cp\u003eNotably, the lack of full convergence between key epistemic landmarks\u0026mdash;such as the EEP and the DNP\u0026mdash;in higher zones highlights that diagnostic confidence does not equate to perfect reliability. These findings demonstrate the practical value of zone-based stratification in modeling AFP-based HCC detection: it provides a continuous and interpretable view of reliability dynamics, moving beyond static thresholds to support context-sensitive, risk-adaptive decision-making.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec17\" class=\"Section2\"\u003e\u003ch2\u003e4.7. Comparison with Traditional Metrics\u003c/h2\u003e\u003cp\u003eTraditional diagnostic metrics\u0026mdash;such as accuracy, sensitivity, specificity, and the area under the receiver operating characteristic curve (AUC)\u0026mdash;offer high-level summaries of classifier performance but often obscure localized diagnostic weaknesses that are crucial in clinical decision-making. These global indicators aggregate performance across the entire prediction spectrum and are generally threshold-independent, which can mask overdiagnosis in high-certainty zones or underdiagnosis in ambiguous or low-certainty strata [\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e, \u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e, \u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e20\u003c/span\u003e]. 
AUC, while widely reported and useful for ranking models, provides limited insight into real-world interpretability, calibration, or localized trustworthiness\u0026mdash;especially in imbalanced datasets [\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e].\u003c/p\u003e\u003cp\u003eIn contrast, the zone-stratified framework proposed in this study\u0026mdash;including the ZREDI, ZBES, Zone-Balanced Error (ZBE), and the EEP\u0026mdash;enables more granular and context-aware evaluation. These metrics anchor diagnostic interpretation within epistemically defined strata (e.g., Recessive, Trusted, Inflated), uncovering asymmetries in predictive reliability and misclassification risk that global metrics tend to overlook. For example, while conventional specificity might appear acceptable, ZBES may reveal directional skew due to excess FNs within a specific diagnostic zone. Likewise, the EEP identifies the precise threshold where PPV equals NPV\u0026mdash;offering a robust signal of calibration symmetry that is invisible to conventional tools.\u003c/p\u003e\u003cp\u003eThe F1 score, as noted in the introduction, does not reflect model performance on the negative class and fails to account for variability in predictive reliability across score thresholds or confidence strata. Within zone-stratified analysis, F1 may provide complementary insights\u0026mdash;particularly in strata enriched with positive predictions, such as the IZ. However, its interpretive value is limited unless considered alongside zone-aware metrics that capture directional bias, calibration asymmetry, and diagnostic uncertainty. Integrating F1 with metrics like ZREDI, ZBES, and ZBTS ensures a more comprehensive and context-sensitive evaluation of model performance.\u003c/p\u003e\u003cp\u003eTogether, these zone-based metrics do not replace conventional performance measures but enrich them by localizing interpretability, enhancing threshold sensitivity, and offering epistemic clarity. 
By aligning performance evaluation with diagnostic reality\u0026mdash;where uncertainty and risk vary across thresholds\u0026mdash;they support safer, more adaptive deployment of AI models in clinical contexts.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec18\" class=\"Section2\"\u003e\u003ch2\u003e4.8. Global and Clinical Implications of Zone-Stratified Diagnostic Metrics\u003c/h2\u003e\u003cp\u003eThe integration of zone-stratified diagnostic metrics\u0026mdash;EEP, ZREDI, ZBES, Zone-Balanced Error (ZBE), and ZBTS\u0026mdash;represents a significant advancement in the evaluation and safe deployment of diagnostic models. Unlike traditional metrics such as accuracy or AUC, which provide aggregate summaries, these zone-specific tools allow for fine-grained, context-aware assessments of model reliability, directional bias, and diagnostic fragility across the full spectrum of predictive certainty.\u003c/p\u003e\u003cp\u003eClinically, these metrics anchor performance evaluation within epistemically meaningful strata. The EEP identifies the threshold where PPV and NPV converge, signaling diagnostic neutrality. When this equilibrium occurs within the TZ, it reflects well-calibrated, stable performance suitable for clinical action. If displaced into higher zones\u0026mdash;such as the Upper Plausible or Inflated\u0026mdash;it may reveal threshold drift, overconfidence, or inflated certainty, necessitating secondary validation or expert oversight.\u003c/p\u003e\u003cp\u003eZREDI complements this by quantifying directional asymmetry in predictive trust: positive values indicate excessive confidence in disease prediction (often seen in IZs), while negative values signal a conservative underdiagnosis tendency (common in RZs). Meanwhile, ZBES captures the skew of misclassification errors, distinguishing whether the model tends to produce FPs or FNs within a given zone. 
Together, ZREDI and ZBES form the basis of the ZBTS, a composite lens for assessing epistemic misalignment. Zones where ZREDI and ZBES diverge reflect either overconfident or underconfident diagnostic behavior, while their convergence (EPEP) marks optimal alignment between trust and error.\u003c/p\u003e\u003cp\u003eFrom a global systems perspective, these metrics have substantial implications. They reveal hidden vulnerabilities that conventional global scores may obscure, particularly in settings with limited diagnostic oversight or high variability in risk tolerance. For instance, while global balanced error (BE\u0026thinsp;\u0026asymp;\u0026thinsp;0.396) may seem acceptable, zone-based analysis uncovers that the Recessive and LPZs carry significantly higher error burdens (ZBE\u0026thinsp;=\u0026thinsp;0.5), suggesting localized diagnostic instability. The TZ, by contrast, exhibits the lowest ZBE (0.479 in simulation; 0.36 in real data), affirming its centrality as a clinically robust stratum.\u003c/p\u003e\u003cp\u003eThe potential impact extends beyond clinical practice to regulation and policy. Zone-stratified metrics provide interpretable, auditable evidence of model behavior, enabling regulators and institutional stakeholders to move beyond summary statistics and toward zone-aware performance certification. In public health and telemedicine contexts, where confirmatory testing is limited, these metrics offer actionable insights into where algorithmic decisions are stable, fragile, or unsafe.\u003c/p\u003e\u003cp\u003eIn sum, this framework supports a paradigm shift in diagnostic AI evaluation\u0026mdash;from reliance on global metrics to a nuanced, zone-based approach. It fosters epistemic transparency, improves clinical safety, and enables risk-calibrated deployment in both resource-rich and resource-constrained environments. 
By aligning model assessment with the realities of diagnostic uncertainty, zone-stratified metrics offer a robust foundation for equitable, context-sensitive decision-making at scale.\u003c/p\u003e\u003cp\u003eTogether, these metrics enable zone-specific evaluation of diagnostic systems across the full spectrum of clinical risk, including diagnostically fragile regions. The framework is not limited to theoretical performance assessment but directly supports threshold refinement, context-aware model auditing, and integration as an \u003cb\u003eevaluation layer within clinical informatics pipelines\u003c/b\u003e. By embedding epistemic governance into machine learning deployment, the approach enhances transparency, safety, and trust in high-stakes clinical decision support, thereby aligning with the broader goals of digital health and translational informatics.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec19\" class=\"Section2\"\u003e\u003ch2\u003e4.9. Methodological Strengths and Limitations\u003c/h2\u003e\u003cp\u003eThis study\u0026rsquo;s principal methodological strength lies in its shift from conventional global performance metrics toward a zone-stratified evaluation framework that explicitly integrates epistemic context. By introducing and operationalizing metrics such as the EEP, ZREDI, ZBES, and Zone-Balanced Error (ZBE), the framework allows for both aggregate and localized assessments of diagnostic reliability. This layered approach exposes patterns of directional bias, misclassification asymmetry, and predictive fragility that are often obscured by summary measures like AUC, sensitivity, or specificity. As such, it supports more nuanced and clinically grounded interpretations of model behavior\u0026mdash;particularly in scenarios where decision confidence varies across the predictive spectrum.\u003c/p\u003e\u003cp\u003eAn additional strength is the framework\u0026rsquo;s comparative flexibility. 
It enables cross-model benchmarking by tracking variations in the EEP location\u0026mdash;revealing how different classifiers distribute predictive balance across zones of low, moderate, or high certainty. This capability facilitates architectural audits, encouraging informed threshold selection and context-sensitive calibration. Furthermore, the successful application of the framework to both simulated and real-world diagnostic datasets underscores its scalability and relevance across diverse stages of clinical deployment, from early triage to confirmatory diagnostics.\u003c/p\u003e\u003cp\u003eHowever, several limitations merit attention. The framework depends on a consistent zoning structure, typically defined by RAMI or equivalent stratification methods. These zone definitions may not generalize across populations or diseases without recalibration, limiting transferability without prior domain-specific validation. Additionally, while the approach excels in binary classification contexts, its extension to multiclass or probabilistic output models remains an area for future development. Integrating these zone-level outputs into clinical workflows also poses practical challenges, particularly under real-time or resource-constrained conditions where interpretability, speed, and interface design are critical.\u003c/p\u003e\u003cp\u003eDespite these limitations, the proposed zone-stratified evaluation framework represents a meaningful methodological advance. It reconceptualizes model assessment through the lens of epistemic credibility, offering a structured yet interpretable toolkit for uncovering hidden vulnerabilities and optimizing diagnostic decision-making. By aligning performance evaluation with zones of trust and uncertainty, this approach enhances clinical safety and fosters a more accountable and context-aware use of machine learning in healthcare.\u003c/p\u003e\u003c/div\u003e"},{"header":"5. 
Conclusions and Recommendations","content":"\u003cp\u003eThis study proposes a zone-stratified framework for diagnostic evaluation, centered on four novel metrics: the EEP, ZREDI, ZBES, and Zone-Balanced Error (ZBE). Unlike conventional global metrics, these indicators provide localized, interpretable insights into model behavior\u0026mdash;capturing directional bias, misclassification asymmetry, and epistemic instability across diagnostic confidence zones.\u003c/p\u003e\u003cp\u003eFindings from both simulated and real-world examples underscore the diagnostic value of the TZ, where predictive reliability, calibration, and error symmetry converge. In contrast, zones such as Recessive and Inflated revealed heightened epistemic tension, reflecting areas of underdiagnosis, overconfidence, or structural fragility. These distinctions emphasize the importance of zone-specific auditing to mitigate localized risks that global metrics may overlook.\u003c/p\u003e\u003cp\u003eWe recommend integrating zone-stratified metrics into model development, validation, and regulation\u0026mdash;especially in high-stakes clinical contexts. EEP can guide threshold calibration, ZREDI can detect predictive imbalance, and ZBES/ZBE can pinpoint diagnostic instability. This framework also holds promise for enhancing regulatory transparency and supporting adaptive AI systems that monitor for epistemic drift.\u003c/p\u003e\u003cp\u003eFuture research should focus on expanding the framework to multi-class and continuous-risk models, and on validating its utility in prospective clinical settings. 
As AI increasingly informs diagnostic decisions, zone-based evaluation offers a critical safeguard\u0026mdash;promoting trust, safety, and interpretive clarity in real-world deployment.\u003c/p\u003e"},{"header":"Abbreviations","content":"\u003cp\u003eDNP: Diagnostic Neutrality Point\u003c/p\u003e\n\u003cp\u003eEEP: Epistemic Equivalence Point\u003c/p\u003e\n\u003cp\u003eEPEP: Epistemic Pressure Equilibrium Point\u003c/p\u003e\n\u003cp\u003eETS: Epistemic Tension State\u003c/p\u003e\n\u003cp\u003eGDI: Global Divergence Index\u003c/p\u003e\n\u003cp\u003eIZ: Inflated Zone\u003c/p\u003e\n\u003cp\u003eLPZ: Lower Plausible Zone\u003c/p\u003e\n\u003cp\u003eNPV: Negative Predictive Value\u003c/p\u003e\n\u003cp\u003ePPV: Positive Predictive Value\u003c/p\u003e\n\u003cp\u003ePSI: Prediction Symmetry Index\u003c/p\u003e\n\u003cp\u003eROC: Receiver Operating Characteristic\u003c/p\u003e\n\u003cp\u003eRZ: Recessive Zone\u003c/p\u003e\n\u003cp\u003eTZ: Trusted Zone\u003c/p\u003e\n\u003cp\u003eUPZ: Upper Plausible Zone\u003c/p\u003e\n\u003cp\u003eZBES: Zone-Based Balanced Error Skew\u003c/p\u003e\n\u003cp\u003eZBE: Zone-Balanced Error\u003c/p\u003e\n\u003cp\u003eZBTS: Zone-Based Tension State\u003c/p\u003e\n\u003cp\u003eZREDI: Zone-Relevant Epistemic Divergence Index\u003c/p\u003e"},{"header":"Declarations","content":"\u003cp\u003eEthics approval and consent to participate: This study did not involve direct experimentation on human subjects. 
All real-world diagnostic datasets used were fully de-identified and publicly available, and no personal or clinical identifiers were accessed or processed. Therefore, institutional ethical approval and informed consent were not required. The study was conducted in accordance with relevant guidelines for research integrity and data privacy.\u003c/p\u003e\n\u003cp\u003eConsent for publication: Not applicable.\u003c/p\u003e\n\u003cp\u003eAvailability of data and materials: The datasets used for illustrative purposes are available from the corresponding author upon reasonable request.\u003c/p\u003e\n\u003cp\u003eCompeting interests: The author declares no competing interests.\u003c/p\u003e\n\u003cp\u003eFunding: This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.\u003c/p\u003e\n\u003cp\u003eAuthors\u0026apos; contributions: T.F.R., as the sole author, conceived the study, developed the methodology, performed the analyses, prepared all figures and tables, and wrote the manuscript.\u003c/p\u003e\n\u003cp\u003eAcknowledgment: The author used ChatGPT (OpenAI) to assist with language editing and phrasing refinement during manuscript preparation. All conceptual and analytical content was developed by the author.\u003c/p\u003e\n\u003ch3\u003eDeclaration of Originality, Intellectual Ownership, Authorship, and Future Rights\u003c/h3\u003e\n\u003cp\u003eThe author, Dr. Tareef Fadhil Raham, affirms that this manuscript is an original work entirely conceived, developed, and authored by the undersigned. 
All intellectual contributions\u0026mdash;including the design and formalization of the zone-stratified diagnostic framework, the derivation of the related metrics such as Epistemic Equivalence Point (EEP), Zone-Relevant Epistemic Divergence Index (ZREDI), Zone-Based Balanced Error Skew (ZBES), Zone-Based Tension State (ZBTS), and Zone-Balanced Error Index (ZBE), as well as the simulation modeling, data analysis, and interpretation\u0026mdash;were solely the product of the author\u0026rsquo;s scholarly effort.\u003c/p\u003e\n\u003cp\u003eThis work has not been published, submitted, or disseminated elsewhere and contains no material copied from external sources without proper attribution. No co-authorship, ghostwriting, or collaborative authorship applies to any portion of this research or manuscript.\u003c/p\u003e\n\u003cp\u003eThe author retains full intellectual ownership over the concepts, methodologies, and metrics introduced. These contributions are part of an ongoing program of original research and may serve as the basis for future software tools, statistical packages, or decision-support systems. Accordingly, the author reserves the right to pursue licensing, academic dissemination, or intellectual property protection (e.g., through copyright, registration, or algorithmic patents) in accordance with institutional and jurisdictional policies.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eDeclaration of generative AI and AI-assisted technologies in the writing process\u003c/strong\u003e\u003cbr\u003eDuring the preparation of this work, the author used ChatGPT (OpenAI) to improve the clarity and readability of certain sections of the manuscript. After using this tool, the author reviewed and edited the content as needed and takes full responsibility for the content of the published article.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\n \u003cli\u003eJapkowicz N, Shah M. 
\u003cem\u003eEvaluating Learning Algorithms: A Classification Perspective\u003c/em\u003e. Cambridge University Press; 2011. pp. 113\u0026ndash;121, 127.\u003c/li\u003e\n \u003cli\u003eSaito T, Rehmsmeier M. The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS One. 2015 Mar 4;10(3):e0118432. doi: 10.1371/journal.pone.0118432. PMID: 25738806; PMCID: PMC4349800.\u003c/li\u003e\n \u003cli\u003eHand DJ. Evaluating diagnostic tests: The area under the ROC curve and the balance of errors. Stat Med. 2010;29(14):1502-10. doi: 10.1002/sim.3859. PMID: 20087877.\u003c/li\u003e\n \u003cli\u003eKuhn M, Johnson K. \u003cem\u003eApplied Predictive Modeling.\u003c/em\u003e 1st ed. New York: Springer; 2013. pp. 259\u0026ndash;266, 281\u0026ndash;283.\u003c/li\u003e\n \u003cli\u003ePeirce JC, Cornell RG. Integrating stratum-specific likelihood ratios with the analysis of ROC curves. Med Decis Making. 1993 Apr-Jun;13(2):141-51. doi: 10.1177/0272989X9301300208. PMID: 8483399.\u003c/li\u003e\n \u003cli\u003eLeeflang MM, Rutjes AW, Reitsma JB, Hooft L, Bossuyt PM. Variation of a test\u0026apos;s sensitivity and specificity with disease prevalence. CMAJ. 2013 Aug 6;185(11):E537-44. doi: 10.1503/cmaj.121286. Epub 2013 Jun 24. PMID: 23798453; PMCID: PMC3735771.\u003c/li\u003e\n \u003cli\u003eAkobeng AK. Understanding diagnostic tests 2: likelihood ratios, pre- and post-test probabilities and their use in clinical practice. Acta Paediatr. 2007;96(4):487-91. doi: 10.1111/j.1651-2227.2006.00179.x. Epub 2007 Feb 14. PMID: 17306009.\u003c/li\u003e\n \u003cli\u003eGrimes DA, Schulz KF. Uses and abuses of screening tests. Lancet. 2002;359(9309):881-4. doi: 10.1016/S0140-6736(02)07948-5.\u003c/li\u003e\n \u003cli\u003eAltman DG, Bland JM. Diagnostic tests 2: Predictive values. BMJ. 1994 Jul 9;309(6947):102. doi: 10.1136/bmj.309.6947.102. 
PMID: 8038641; PMCID: PMC2540558.\u003c/li\u003e\n \u003cli\u003eElkan C. The foundations of cost-sensitive learning. \u003cem\u003eProceedings of the 17th International Joint Conference on Artificial Intelligence (IJCAI)\u003c/em\u003e. 2001:973\u0026ndash;978.\u003c/li\u003e\n \u003cli\u003eRaham TF. From global accuracy to local credibility: zone-based ROC and DET frameworks for conservative, transparent, and clinically credible evaluation. \u003cem\u003eManuscript under editorial consideration\u003c/em\u003e.\u003c/li\u003e\n \u003cli\u003eRaham TF. From Inferential Statistics to Epistemic Credibility: A Zone-Based Framework for Conservative Estimation. Under review.\u003c/li\u003e\n \u003cli\u003eJang ES, Jeong SH, Kim JW, Choi YS, Leissner P, Brechot C. Diagnostic Performance of Alpha-Fetoprotein, Protein Induced by Vitamin K Absence, Osteopontin, Dickkopf-1 and Its Combinations for Hepatocellular Carcinoma. PLoS One. 2016 Mar 17;11(3):e0151069. doi: 10.1371/journal.pone.0151069. PMID: 26986465; PMCID: PMC4795737.\u003c/li\u003e\n \u003cli\u003eJang ES, Jeong SH, Kim JW, Choi YS, Leissner P, Brechot C. (2016). Data from: Diagnostic performance of alpha-fetoprotein, protein induced by vitamin K absence, osteopontin, Dickkopf-1 and its combinations for hepatocellular carcinoma [Dataset]. Dryad. https://doi.org/10.5061/dryad.3n901\u003c/li\u003e\n \u003cli\u003eChicco D, Jurman G. The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genomics. 2020;21(1):6. doi: 10.1186/s12864-019-6413-7. PMID: 31898477; PMCID: PMC6941312.\u003c/li\u003e\n \u003cli\u003eVickers AJ, Elkin EB. Decision curve analysis: a novel method for evaluating prediction models. Med Decis Making. 2006 Nov-Dec;26(6):565-74. doi: 10.1177/0272989X06295361. PMID: 17099194; PMCID: PMC2577036.\u003c/li\u003e\n \u003cli\u003eEuropean Association for the Study of the Liver. 
EASL Clinical Practice Guidelines: Management of hepatocellular carcinoma. J Hepatol. 2018;69(1):182-236. doi: 10.1016/j.jhep.2018.03.019. Epub 2018 Apr 5. Erratum in: J Hepatol. 2019 Apr;70(4):817. doi: 10.1016/j.jhep.2019.01.020. PMID: 29628281.\u003c/li\u003e\n \u003cli\u003eMarrero JA, Feng Z, Wang Y, Nguyen MH, Befeler AS, Roberts LR, et al. Alpha-fetoprotein, des-gamma carboxyprothrombin, and lectin-bound alpha-fetoprotein in early hepatocellular carcinoma. Gastroenterology. 2009;137(1):110-8. doi: 10.1053/j.gastro.2009.04.005. Epub 2009 Apr 9. PMID: 19362088; PMCID: PMC2704256.\u003c/li\u003e\n \u003cli\u003eTrevisani F, D\u0026apos;Intino PE, Morselli-Labate AM, Mazzella G, Accogli E, Caraceni P, Domenicali M, De Notariis S, Roda E, Bernardi M. Serum alpha-fetoprotein for diagnosis of hepatocellular carcinoma in patients with chronic liver disease: influence of HBsAg and anti-HCV status. J Hepatol. 2001 Apr;34(4):570-5. doi: 10.1016/s0168-8278(00)00053-2. PMID: 11394657.\u003c/li\u003e\n \u003cli\u003eHabibzadeh F. Diagnostic tests performance indices: an overview. Biochem Med (Zagreb). 2025 Feb 15;35(1):010101. doi: 10.11613/BM.2025.010101. 
PMID: 39974192; PMCID: PMC11838712.\u003c/li\u003e\n\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":true,"hideJournal":true,"highlight":"","institution":"","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true},"keywords":"Epistemic Equivalence Point (EEP), Zone-Relevant Epistemic Divergence Index (ZREDI), Zone-Based Balanced Error Skew (ZBES), Zone-Balanced Error (ZBE) Index, Zone-Based Tension State (ZBTS)","lastPublishedDoi":"10.21203/rs.3.rs-7539984/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-7539984/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003ch2\u003eObjective:\u003c/h2\u003e\u003cp\u003eTraditional diagnostic metrics summarize model performance globally but often obscure localized vulnerabilities. This study proposes a zone-stratified evaluation framework to assess diagnostic reliability across confidence gradients.\u003c/p\u003e\u003ch2\u003eMethods:\u003c/h2\u003e\u003cp\u003ePredicted probabilities were stratified into five diagnostic confidence zones using the Robust Adjusted Mean Interval (RAMI) approach: Recessive, Lower Plausible, Trusted, Upper Plausible, and Inflated. 
Within each zone, confusion matrix components were used to compute: the Zone-Relevant Epistemic Divergence Index (ZREDI) for directional trust asymmetry, the Epistemic Equivalence Point (EEP) for calibration symmetry, the Zone-Based Balanced Error Skew (ZBES) for misclassification bias, and the Zone-Balanced Error (ZBE) for class-neutral error burden. A higher-order construct, the Zone-Based Tension State (ZBTS), was introduced to quantify overall epistemic instability across diagnostic strata.\u003c/p\u003e\u003ch2\u003eResults:\u003c/h2\u003e\u003cp\u003eSimulated assessments and real-world diagnostic data were used to uncover zone-specific diagnostic behaviors often masked by traditional global metrics. In both datasets, the Trusted Zone (TZ) consistently demonstrated calibration stability, with EEPs clustering near zero. In contrast, the outer zones\u0026mdash;particularly the Recessive and Inflated regions\u0026mdash;exhibited pronounced shifts in ZBES and ZREDI, indicating increased risks of underdiagnosis, overdiagnosis, or erosion of diagnostic trust. Furthermore, ZBTS emphasized the interplay between ZBES and ZREDI, highlighting zones of epistemic fragility and enabling localized performance audits for regulatory transparency and safety profiling.\u003c/p\u003e\u003ch2\u003eConclusions:\u003c/h2\u003e\u003cp\u003eZone-stratified metrics provide actionable insights into diagnostic model behavior, improving interpretability and safety. 
This framework advances beyond aggregate measures to support better threshold tuning, risk calibration, and clinical deployment\u0026mdash;especially where model trust must align with uncertainty.\u003c/p\u003e","manuscriptTitle":"Beyond Global Metrics: A Zone-Stratified Diagnostic Framework Based on Confusion Matrix Components","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2025-11-21 11:51:22","doi":"10.21203/rs.3.rs-7539984/v1","editorialEvents":[{"type":"communityComments","content":0}],"status":"published","journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"a983e7a6-59aa-4fdd-abed-fa284b86005d","owner":[],"postedDate":"November 21st, 2025","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"posted","subjectAreas":[],"tags":[],"updatedAt":"2025-11-21T11:51:22+00:00","versionOfRecord":[],"versionCreatedAt":"2025-11-21 11:51:22","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-7539984","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-7539984","identity":"rs-7539984","version":["v1"]},"buildId":"8U1c8b4HqxoKbykW_rLl7","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}
