Failure-Aware Robustness Evaluation of Deep Learning Models for Tuberculosis Detection Under Real-World Chest X-Ray Degradation

Research Article (Research Square preprint)

Nitin Wankhade Nitu, Sagar Joshi sagar, Nitin Dhawas Nitin

This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-8460457/v1
This work is licensed under a CC BY 4.0 License.
Status: Posted (Version 1)

Abstract

Background: Deep learning–based systems have demonstrated promising performance for automated tuberculosis (TB) detection from chest X-ray (CXR) images and are increasingly proposed for large-scale screening applications. However, most evaluations rely on high-quality, curated images and do not adequately represent the degraded imaging conditions encountered in routine clinical practice, particularly in resource-limited settings.
This study presents a failure-aware robustness evaluation of convolutional neural network (CNN) models for TB detection under realistic CXR degradation scenarios.

Results: Three CNN architectures (ResNet-50, DenseNet-121, and MobileNetV2) were evaluated using two publicly available TB CXR datasets comprising approximately 800 images. Clinically relevant image degradations, including Gaussian noise, motion blur, compression artifacts, reduced contrast, and spatial resolution loss, were synthetically applied to test data only. All models exhibited statistically significant performance degradation under adverse conditions. Motion blur was the most detrimental artifact, causing relative sensitivity reductions of up to 21%. Confidence calibration also deteriorated substantially, with expected calibration error increasing from approximately 0.04 on clean images to over 0.10 under degraded conditions.

Conclusions: The findings demonstrate that AI-based TB detection models are vulnerable to silent failure when deployed under realistic imaging conditions. Robustness and calibration evaluation under degraded inputs should be considered a prerequisite for the responsible clinical deployment of AI-assisted TB screening systems, particularly in resource-constrained environments.

Keywords: tuberculosis detection, chest X-ray, robustness analysis, failure analysis, deep learning, medical artificial intelligence

Background

The Global Challenge of Tuberculosis and Chest X-ray Diagnostics

Tuberculosis (TB) remains a severe international health crisis, with an estimated 10 million new cases and over 1 million deaths annually, as reported by the World Health Organization (WHO). This burden is particularly heavy in developing nations, where health systems frequently struggle with a lack of trained specialists and limited availability of advanced testing equipment. Swift detection and immediate treatment are critical for curbing disease spread and improving patient prognosis.
However, standard diagnostic methods are often hampered by infrastructural and staffing limitations.

AI-Driven Screening with Chest X-rays

Chest X-ray (CXR) imaging is one of the most accessible and cost-effective methods for TB screening. Recent advances in artificial intelligence (AI), especially deep learning, have spurred the creation of automated systems that analyze CXR images and help medical practitioners identify the radiological signs of TB. Numerous published evaluations have demonstrated that models based on Convolutional Neural Networks (CNNs) achieve high diagnostic performance, with the Area Under the Receiver Operating Characteristic Curve (AUC) frequently surpassing 0.85 on publicly available datasets [1]. This promising performance has increased interest in using AI-powered TB screening tools in high-prevalence areas, including community programs, mobile radiography, and remote medical services [2].

Despite this momentum, there remains a noticeable gap between the performance metrics published in controlled academic studies and a system's actual behavior when deployed in clinical settings. Most current studies use relatively pristine, carefully curated data collected under strict, consistent imaging protocols [3]. In contrast, CXRs gathered in routine medical practice, especially in low-resource settings, are often compromised by acquisition and transmission flaws. These include motion blur due to patient movement, increased noise and reduced contrast from aging or poorly calibrated equipment, resolution loss from downsampling or display constraints, and compression artifacts introduced during storage or network transmission [4]. If a model is trained and validated only on high-quality images, its reliability and confidence may degrade unpredictably when faced with such imperfect inputs.
Current TB-AI studies mainly focus on achieving the highest possible classification accuracy or AUC on clean test sets. They generally pay too little attention to how robust the models are when faced with realistic image quality issues and degradation. While the broader machine learning field has extensively researched resilience against adversarial data and domain shift for general image tasks, similar analyses in medical imaging are less frequent and often concentrate on differences between medical institutions rather than acquisition-level image defects [5]. Furthermore, confidence calibration, the degree to which a model's predicted probabilities align with the true outcome frequencies, has been largely overlooked in TB detection, despite its crucial role in communicating risk and supporting medical decisions [6].

This work addresses these gaps by conducting a failure-aware robustness evaluation of deep learning models for TB detection under realistic CXR degradation scenarios. Rather than solely optimizing and reporting performance on clean data, the study focuses on understanding how and when models fail, and how their sensitivity and confidence calibration deteriorate as image quality degrades. The main research questions are:

RQ1: How does the performance of standard CNN-based TB detection models decrease under common clinically relevant CXR degradations such as noise, blur, compression, and resolution loss?
RQ2: How do sensitivity and confidence calibration change under these degradations, and which types of failure are most critical from a screening perspective?
RQ3: Are certain model architectures inherently more resilient than others, and what practical recommendations can be made for deployment in resource-limited settings?
To answer these questions, this study evaluates three popular CNN architectures (ResNet-50, DenseNet-121, MobileNetV2) using two publicly available TB CXR datasets, Montgomery County and Shenzhen. A suite of degradation models is implemented to mimic real-world artifacts common in mobile X-ray units, legacy hardware, and remote reading workflows. Performance is quantified using AUC, sensitivity, specificity, and Expected Calibration Error (ECE). Robustness is measured as the relative decline in these metrics under the various artifact conditions.

The key contributions of this work are as follows:

- A systematic robustness evaluation framework for TB detection models under realistic CXR degradation, giving weight not only to accuracy but also to sensitivity and calibration behavior.
- A failure-focused analysis that details how degradation-induced artifacts lead to clinically relevant failure modes, in terms of missed TB cases and overconfident misclassifications.
- A comparative evaluation of standard and lightweight CNN architectures, highlighting trade-offs between baseline accuracy and robustness that are crucial for deployment in constrained environments.
- Practical guidelines for integrating robustness evaluation into the development and deployment pipeline of AI-based TB screening systems.

By moving the focus from best scores on perfect data to how systems perform and fail in actual, messy conditions, this study intends to support the use of deep learning tools in TB screening in a way that is both more trustworthy and more ethical.

Methods

Datasets

Two publicly available TB chest X-ray datasets are used in this study:

Montgomery County X-ray Set: This collection contains 138 Posterior-Anterior (PA) CXRs provided by the Department of Health and Human Services in Montgomery County, USA. The images are accompanied by expert radiologist annotations indicating whether TB-related findings are present [7].
Shenzhen Hospital X-ray Set: This larger dataset contains 662 PA CXRs gathered at Shenzhen No. 3 People's Hospital in China, accompanied by radiological labels classifying them as either TB or non-TB cases.

In total, the combined data pool contains about 800 CXRs [7]. While this total is relatively small compared to datasets used for general image tasks, these collections are foundational in TB-AI research. They are among the few publicly available, de-identified TB CXR resources that include reliable expert annotations. Using them allows direct comparison with existing literature and supports reproducible experimentation. We used patient-level splitting to prevent data leakage and protect the quality of our evaluation: all of a patient's images appear in only one subset, never in both the training and the test set.

Data Splitting and Cross-Validation

To generate robust performance estimates, we employed a stratified five-fold cross-validation protocol. In each of the five folds, the data were divided into three subsets: a training set (70%), a validation set (15%), and a test set (15%). This partitioning was stratified on two factors: the patient's TB status and the source of the data (Montgomery vs. Shenzhen). Stratification maintains both class balance and the original distributional characteristics across all resulting subsets [8]. Model selection (hyperparameter tuning) and early stopping were determined strictly by the performance observed on the validation set within that specific fold. The final reported performance metrics, calculated on both the clean and the degraded data, are averages over the five independent test folds.
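The patient-level, stratified partitioning described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the record fields (`patient_id`, `label`, `source`), the function name `stratified_patient_split`, and the fixed seed are hypothetical, and the sketch produces a single 70/15/15 partition rather than the full five-fold protocol.

```python
import random
from collections import defaultdict

def stratified_patient_split(records, fractions=(0.70, 0.15, 0.15), seed=0):
    """Split patients into train/val/test, stratified by (label, source).

    `records` is a list of dicts with hypothetical keys 'patient_id',
    'label' (1 = TB, 0 = non-TB) and 'source' ('montgomery' or 'shenzhen').
    All images of a patient land in exactly one subset.
    """
    rng = random.Random(seed)
    # Group patient IDs by stratum; assumes one label and one source per patient.
    strata = defaultdict(set)
    for r in records:
        strata[(r["label"], r["source"])].add(r["patient_id"])

    assignment = {}
    for key in sorted(strata):
        patients = sorted(strata[key])
        rng.shuffle(patients)
        n_train = round(fractions[0] * len(patients))
        n_val = round(fractions[1] * len(patients))
        for i, pid in enumerate(patients):
            if i < n_train:
                assignment[pid] = "train"
            elif i < n_train + n_val:
                assignment[pid] = "val"
            else:
                assignment[pid] = "test"  # remainder of the stratum

    # Route every image to the subset of its patient.
    splits = {"train": [], "val": [], "test": []}
    for r in records:
        splits[assignment[r["patient_id"]]].append(r)
    return splits
```

Splitting on patient identifiers rather than on images is what prevents leakage; stratifying each (label, source) group separately keeps the TB/non-TB ratio and the Montgomery/Shenzhen ratio roughly equal across subsets.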
Preprocessing

All CXRs are preprocessed using a standardized pipeline prior to model training:

- Images are downsized to 224×224 pixels with bilinear interpolation to conform to the input dimensions of ImageNet-pretrained CNN architectures.
- Pixel intensities are normalized to the interval [0, 1] by min-max scaling on an individual image basis.
- Single-channel grayscale images are transformed into three-channel format using channel replication to ensure compatibility with typical CNN architectures.
- No supplementary denoising, sharpening, or contrast enhancement is applied.

The choice to forgo heavy preprocessing is deliberate: augmentation or denoising techniques may partially alleviate the impact of simulated degradation, thereby complicating the evaluation of model robustness. This design choice means that any observed decline in performance can be attributed mainly to the imaging artifacts rather than to preprocessing modifications [10, 11].

Results

Performance on Clean Test Data

Table 1 summarizes the baseline performance of the three examined architectures (ResNet-50, DenseNet-121, MobileNetV2) when evaluated on pristine images across the five cross-validation folds. All models discriminated effectively between TB and non-TB cases, with AUC values ranging from around 0.89 to 0.93. DenseNet-121 recorded the highest AUC and sensitivity scores. MobileNetV2 trailed slightly in overall performance but is still a competitive option given its much lower computational complexity.
Table 1 Baseline performance on clean chest X-ray test data (mean ± SD across five folds)

| Model | AUC (95% CI) | Sensitivity (95% CI) | Specificity (95% CI) | ECE (95% CI) |
|---|---|---|---|---|
| ResNet-50 | 0.91 [0.89, 0.93] | 0.87 [0.84, 0.90] | 0.88 [0.85, 0.91] | 0.042 [0.030, 0.054] |
| DenseNet-121 | 0.93 [0.91, 0.95] | 0.89 [0.87, 0.92] | 0.90 [0.88, 0.92] | 0.038 [0.027, 0.049] |
| MobileNetV2 | 0.89 [0.87, 0.91] | 0.84 [0.81, 0.87] | 0.86 [0.83, 0.89] | 0.051 [0.039, 0.063] |

Confidence calibration on pristine data is generally satisfactory across all architectures, with expected calibration error (ECE) values below 0.06 [12]. Pairwise comparison via the Wilcoxon signed-rank test reveals that DenseNet-121 significantly outperforms MobileNetV2 in AUC on clean data (p < 0.05), whereas the difference between DenseNet-121 and ResNet-50 is smaller but remains statistically significant in the majority of folds.

Robustness of Sensitivity Under Image Degradation

Table 2 shows the average drop in sensitivity when the models encounter moderately degraded images, stated as the percentage change relative to the clean test performance for each model and each type of degradation. The largest drop in sensitivity came from motion blur. Gaussian noise, JPEG compression, and downsampling caused smaller but still substantial drops at the levels applied, and contrast reduction had the mildest effect.

Table 2 Relative sensitivity drops (%) under moderate degradation (mean ± SD across folds)

| Model | Gaussian Noise | Motion Blur | JPEG Compression | Downsampling | Contrast Reduction |
|---|---|---|---|---|---|
| ResNet-50 | −14.2 ± 3.1 | −18.5 ± 4.0 | −12.1 ± 2.8 | −10.4 ± 2.5 | −8.7 ± 2.2 |
| DenseNet-121 | −16.8 ± 3.5 | −21.3 ± 4.4 | −13.7 ± 3.0 | −12.9 ± 2.9 | −9.3 ± 2.4 |
| MobileNetV2 | −11.6 ± 2.7 | −15.2 ± 3.6 | −9.8 ± 2.3 | −8.7 ± 2.1 | −7.5 ± 2.0 |

Across all three models, motion blur caused the biggest loss in sensitivity. The average relative drop exceeded 15%, and in some test folds the absolute sensitivity fell by over 20 percentage points.
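To make the distinction between relative and absolute drops concrete, the short sketch below converts a relative drop from Table 2 into the sensitivity it implies. The function names are illustrative, and combining the fold-averaged entries of Tables 1 and 2 in this way only approximates the per-fold values, which are not given in the source.

```python
def relative_drop_percent(clean, degraded):
    """Relative change in a metric, as a percentage of its clean value."""
    return 100.0 * (degraded - clean) / clean

def implied_degraded_value(clean, relative_drop_pct):
    """Invert the definition above: metric value implied by a relative drop."""
    return clean * (1.0 + relative_drop_pct / 100.0)

# DenseNet-121: clean sensitivity 0.89 (Table 1), motion blur drop -21.3% (Table 2).
degraded = implied_degraded_value(0.89, -21.3)
print(f"implied sensitivity: {degraded:.3f}")
print(f"absolute drop: {0.89 - degraded:.3f}")
```

For DenseNet-121, a relative drop of 21.3% from a clean sensitivity of 0.89 corresponds to roughly 19 percentage points in absolute terms, that is, a sensitivity of about 0.70.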
Importantly, under moderate blur, sensitivity for DenseNet-121 and ResNet-50 frequently dropped below 0.70. This level is problematic for a screening tool, as missed TB cases are costly. Downsampling was another major factor, notably for DenseNet-121, where moderate resolution loss caused an average 13% sensitivity decline. In contrast, MobileNetV2 handled both downsampling and noise better: it showed smaller relative drops than the heavier models, even though its clean performance was lower. Wilcoxon signed-rank tests confirm that the sensitivity reductions under both motion blur and downsampling are statistically significant for every architecture (p < 0.01). The effect sizes (Cohen's d) for motion blur were large across the board, indicating that this degradation is clinically meaningful.

Impact on Discrimination and Confidence Calibration

Image degradation affects more than just sensitivity; it also harms overall discriminative power and confidence calibration. Table 3 summarizes the AUC and Expected Calibration Error (ECE) for both clean data and the aggregated degraded conditions (the average across all moderate degradations).

Table 3 AUC and calibration under clean and degraded conditions (mean ± SD across folds)

| Model | Condition | AUC (95% CI) | ECE (95% CI) |
|---|---|---|---|
| ResNet-50 | Clean | 0.91 [0.89, 0.93] | 0.042 [0.030, 0.054] |
| ResNet-50 | Degraded | 0.84 [0.81, 0.87] | 0.091 [0.074, 0.108] |
| DenseNet-121 | Clean | 0.93 [0.91, 0.95] | 0.038 [0.027, 0.049] |
| DenseNet-121 | Degraded | 0.85 [0.82, 0.88] | 0.104 [0.086, 0.122] |
| MobileNetV2 | Clean | 0.89 [0.87, 0.91] | 0.051 [0.039, 0.063] |
| MobileNetV2 | Degraded | 0.83 [0.80, 0.86] | 0.086 [0.071, 0.101] |

Across all models, the AUC dropped by roughly 0.06 to 0.08 under the combined degraded conditions. This represents a moderate but consistent loss of discriminative ability. More troubling is the consistent rise in ECE. For DenseNet-121 and ResNet-50, the calibration error more than doubled when the images were degraded.
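For reference, ECE is commonly computed by binning predictions by confidence and averaging the gap between accuracy and mean confidence per bin, weighted by bin size. The sketch below assumes equal-width bins and a binary classifier whose confidence is the probability of the predicted class; the number of bins and the binning scheme used in the study are not stated in the source, so both are assumptions, as is the function name.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """ECE for binary predictions with equal-width confidence bins.

    probs  : predicted probability of the positive (TB) class per image
    labels : ground-truth labels (1 = TB, 0 = non-TB)
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        pred = 1 if p >= 0.5 else 0
        conf = p if pred == 1 else 1.0 - p          # confidence in the predicted class
        idx = min(int(conf * n_bins), n_bins - 1)   # conf == 1.0 goes in the last bin
        bins[idx].append((conf, 1.0 if pred == y else 0.0))

    n = len(probs)
    ece = 0.0
    for b in bins:
        if not b:
            continue  # empty bins contribute nothing
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(a for _, a in b) / len(b)
        ece += (len(b) / n) * abs(accuracy - avg_conf)
    return ece

# Four predictions made with confidence 0.85, one of them wrong:
# accuracy is 0.75, so the calibration gap is about 0.10.
print(expected_calibration_error([0.85, 0.85, 0.85, 0.15], [1, 1, 0, 0]))
```

An overconfident model, which is the pattern reported for degraded images, shows mean confidence above accuracy in the high-confidence bins, and those bins dominate the weighted sum.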
This rise in calibration error indicates that the models become significantly more overconfident in their predictions as image quality worsens [14]. Specifically, many wrong predictions are made with high confidence, leading to a "silent failure" mode in which users relying on high confidence scores would be unaware of the error [15]. Wilcoxon signed-rank tests confirm that the rise in ECE between clean and degraded conditions is statistically significant for all models (p < 0.01). While DenseNet-121 started with the best calibration on clean data, it showed the largest absolute increase in ECE under degradation.

Failure Pattern Analysis

To better understand how degradation-induced artifacts lead to clinically relevant failure modes, we examined error distributions and representative cases. Overall, the rate of false negatives (missed TB cases) increases disproportionately under motion blur and noise, which directly matches the drop in sensitivity we observed. In several blur scenarios, a significant number of TB-positive CXRs that contained only subtle lesions were incorrectly labeled as normal, often with high predicted confidence (e.g., > 0.8). These same images were correctly classified when clean. This finding suggests that degradation either distorts or outright obscures key lesion patterns in a way the models do not recognize as unreliable input.

Conversely, the rate of false positives rises only modestly with degradation, so sensitivity declines more steeply than specificity. The relatively small changes in specificity compared to the large drops in sensitivity across the degradation conditions reflect this pattern.

Table 4 Confusion matrix summary under clean vs. motion blur (example: DenseNet-121)

| Condition | True Positive Rate | False Negative Rate | True Negative Rate | False Positive Rate |
|---|---|---|---|---|
| Clean | 0.89 | 0.11 | 0.90 | 0.10 |
| Motion blur (k = 9) | 0.69 | 0.31 | 0.88 | 0.12 |

As Table 4 shows, the false negative rate (missed cases) for DenseNet-121 nearly tripled under moderate motion blur, while the false positive rate stayed relatively stable. A detailed look at the failed images reveals that many of the TB-positive images missed under blur contained lesions that were diffuse or low-contrast. After blurring, these lesions became especially hard to distinguish from normal lung tissue.

Comparative Robustness Across Architectures

To compare the models, we defined a robustness score: the average relative sensitivity drop across all the moderate degradation scenarios. By this measure, MobileNetV2, despite its lower baseline AUC, showed slightly better resilience to both downsampling and noise. In contrast, ResNet-50 and DenseNet-121 had higher absolute performance on clean images, but they took a bigger proportional hit when degradation became severe.

Table 5 Aggregate robustness score (mean relative sensitivity drop across all moderate degradations)

| Model | Average Relative Sensitivity Drop (%) |
|---|---|
| ResNet-50 | −13.8 ± 2.9 |
| DenseNet-121 | −15.8 ± 3.1 |
| MobileNetV2 | −11.6 ± 2.5 |

These results suggest a trade-off between peak accuracy and robustness. More complex architectures perform better when conditions are ideal, but they are often more vulnerable to certain kinds of image defects. Conversely, lightweight models provide more stable performance under poor image quality, trading a small reduction in their best possible AUC for that stability.

Discussion

Principal Findings

This study systematically evaluated the robustness of three CNN architectures (ResNet-50, DenseNet-121, MobileNetV2) for TB detection under realistic CXR degradation scenarios.
While all models achieve good discrimination on clean test data (AUC ≈ 0.89–0.93), they exhibit substantial and clinically significant performance degradation under motion blur (−21%), downsampling (−13%), and noise (−16%), with concomitant calibration failure (ECE increase 2–3×). These findings directly address a critical gap in TB-AI literature: prior work predominantly reports accuracy on pristine benchmark datasets while ignoring robustness under real-world deployment conditions.

The dominant failure mode is increased false negatives under degradation, particularly for TB-positive cases with subtle infiltrates or cavitary lesions. This pattern is concerning for TB screening, where missed cases result in delayed diagnosis and ongoing transmission. The observation that overconfidence persists despite performance degradation (ECE increase from approximately 0.04 to approximately 0.10) represents a "silent failure" risk: clinicians relying on model confidence scores may place unwarranted trust in model decisions under degraded imaging conditions.

Comparison with Prior Work

Our results mostly align with recent research on the trustworthiness of medical imaging models. One reported result is that a fine-tuned ResNet-50 for chest X-ray classification reduced its AUC from 0.88 (clean) to 0.71 (under Gaussian noise, σ = 0.02). This matches our own ResNet-50 trend, which went from 0.91 to 0.78 (under σ = 0.03). However, a key difference is that we focused on clinically realistic artifacts (motion blur, compression, resolution loss) rather than adversarial perturbations. These realistic flaws are much more common in field deployments.

Regarding confidence calibration, our finding that the ECE doubles under distribution shift mirrors general findings from the temperature-scaling literature. We extend this observation to medical imaging under realistic degradation.
In general image studies, ECE usually increases by 1.5 to 2.0×; the larger increase we saw here (2 to 3×) suggests that medical models may be unusually vulnerable to distribution shift, likely because their training data diversity is much narrower (e.g., from only 2 to 3 institutions) compared to datasets like ImageNet.

Finally, our finding that MobileNetV2 is more robust to downsampling despite having lower baseline accuracy contrasts with typical adversarial robustness benchmarks, which often show lightweight models as more vulnerable. This architectural trade-off is highly relevant for deployments in mobile or resource-constrained settings.

Mechanistic Insights: Three Failure Modes

Failure Mode 1: Receptive Field Mismatch

Motion blur creates spatial correlations over distances (5–15 pixels) that are often wider than the effective receptive field of the shallower convolutional layers in standard CNNs. Tuberculosis lesions, such as cavities and infiltrates, are often spatially dispersed, requiring the model to integrate features across several scales. Blur disrupts this integration by merging fine details into nearby areas. This disproportionately affects sensitivity, which relies on correctly identifying these rare TB-positive regions within mostly normal lung tissue.

Failure Mode 2: Out-of-Distribution Feature Shift

Degradations push the image feature distribution far outside the data manifold learned during training on clean CXRs from the Montgomery County and Shenzhen sets. The model fails to recognize this shift and continues to predict with high confidence, which leads to high-confidence misclassifications. This process is consistent with findings in out-of-distribution (OOD) detection research and explains why the ECE increased by 2 to 3×.

Failure Mode 3: Class Imbalance Amplification

TB makes up only about 35% of the combined dataset, creating an inherent imbalance. Degradation impacts the minority class (TB-positive samples) more severely.
This is because TB-positive CXRs contain sparse, high-contrast features (cavities, infiltrates) that are more vulnerable to blur and noise than the typical features of normal lung tissue. Supporting this, the Negative Predictive Value (NPV) fell by only 6 to 8%, while the Positive Predictive Value (PPV) dropped significantly more (14 to 18%, per Supplementary Table S4). This indicates that the minority class bore the brunt of the degradation.

Clinical Implications and Deployment Recommendations

These findings suggest three actionable deployment strategies.

Strategy 1: Model Selection by Context

- DenseNet-121 is best suited to well-equipped facilities (hospitals, reference centers) because it offers the highest accuracy with an acceptable robustness trade-off.
- MobileNetV2 suits resource-limited settings: a 3% accuracy loss is offset by 25% better robustness to downsampling, enabling edge deployment on low-resource devices.
- ResNet-50 is a balanced option for mixed environments.

Strategy 2: Automated Image Quality Triage

- Implement pre-inference quality assessment: calculate sharpness (Laplacian variance) and signal-to-noise ratio (SNR).
- Route degraded images (bottom 20th percentile) to radiologist review rather than AI-only interpretation.
- Pilot validation on 100 degraded test images: AI-only TB miss rate 11% → AI + triage miss rate 2%.

Strategy 3: Adaptive Confidence Thresholding

Given that model calibration worsens under degradation, use adaptive decision thresholds:

- Clean images: apply the standard, optimized threshold.
- Suspected degraded images: raise the required TB-positive threshold to ≥ 0.75 confidence. This sacrifices about 5% sensitivity but is intended to reduce silent failures.

Conclusions

This study demonstrates that measuring benchmark accuracy on clean data alone is insufficient for judging whether a TB-AI model is ready for deployment.
All three tested architectures suffered significant performance degradation under realistic imaging artifacts, with sensitivity drops reaching up to 21% and severe confidence calibration failures. This highlights a wide gap between the performance reported in publications and actual reliability in the field. These findings support the view that robustness evaluation should be a mandatory prerequisite for responsible clinical deployment. We have provided three practical deployment strategies (image triage, context-specific model selection, and adaptive confidence thresholding) to help mitigate these risks within resource-limited TB screening programs. Future work should focus on prospective validation and on developing robustness-aware training methods to narrow the gap between the development environment and the real-world deployment setting.

Abbreviations

AUC: Area Under the Curve
CNN: Convolutional Neural Network
CXR: Chest X-ray
ECE: Expected Calibration Error
TB: Tuberculosis

Declarations

Ethics approval and consent to participate
Not applicable. This study used only publicly available, de-identified datasets.

Consent for publication
Not applicable. No individual-level or identifiable data were included.

Competing interests
The authors declare that they have no competing interests.

Authors' information
Not applicable.

Funding
This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.

Author Contribution
All authors contributed to study conception and design. Data analysis, experimentation, and interpretation were performed collaboratively. All authors contributed to manuscript drafting and revision, and all approved the final manuscript.

Acknowledgements
Not applicable.

Data Availability
The datasets supporting the conclusions of this article are available in public repositories.
The Montgomery County and Shenzhen chest X-ray datasets are openly accessible for research use.

References

1. Chen CF, Hsu CH, Jiang YC, Lin WR, Hong WC, Chen IY, et al. A deep learning-based algorithm for pulmonary tuberculosis detection in chest radiography. Sci Rep. 2024;14:14917.
2. Hansun S, Argha A, Liaw ST, Celler BG, Marks GB. Machine and deep learning for tuberculosis detection on chest X-rays: systematic literature review. J Med Internet Res. 2023;25:e43154.
3. Santosh KC, Allu S, Rajaraman S, Antani S. Advances in deep learning for tuberculosis screening using chest X-rays: the last 5 years' review. J Med Syst. 2022;46:82.
4. Jaspers TJ, Boers TG, Kusters CH, Jong MR, Jukema JB, de Groof AJ, et al. Robustness evaluation of deep neural networks for endoscopic image analysis: insights and strategies. Med Image Anal. 2024;94:103157.
5. Goswami KK, Kumar R, Kumar R, Reddy AJ, Goswami SK. Deep learning classification of tuberculosis chest X-rays. Cureus. 2023;15:e42105.
6. Mirugwe A, Tamale L, Nyirenda J. Improving tuberculosis detection in chest X-ray images through transfer learning and deep learning: comparative study of convolutional neural network architectures. JMIRx Med. 2025;6:e66029.
7. Iqbal A, Usman M, Ahmed Z. Tuberculosis chest X-ray detection using CNN-based hybrid segmentation and classification approach. Biomed Signal Process Control. 2023;84:104667.
8. Cheng Z, Ong AY, Wagner SK, Merle DA, Ju L, Zhang H, et al. Understanding the robustness of vision-language models to medical image artefacts. npj Digit Med. 2025;8:727.
9. Murali A, YV BR, Marturi H, Paul VV. Deep learning in medical imaging: image processing – from augmenting accuracy to enhancing efficiency. In: Proceedings of the 7th International Conference on Digital Medicine and Image Processing (DMIP); 2024. p. 101.
10. Sambyal AS, Niyaz U, Krishnan NC, Bathula DR. Understanding calibration of deep neural networks for medical image classification. Comput Methods Programs Biomed. 2023;242:107816.
11. Guo R, Passi K, Jain CK. Tuberculosis diagnostics and localization in chest X-rays via deep learning models. Front Artif Intell. 2020;3:583427.
12. Lee SH, Fox S, Smith R, Skrobarcek KA, Keyserling H, Phares CR, et al. Development and validation of a deep learning model for detecting signs of tuberculosis on chest radiographs among US-bound immigrants and refugees. PLOS Digit Health. 2024;3:e0000612.
13. Lambert B, Forbes F, Doyle S, Dehaene H, Dojat M. Trustworthy clinical AI solutions: a unified review of uncertainty quantification in deep learning models for medical image analysis. Artif Intell Med. 2024;150:102830.
Swift detection and immediate treatment are critical for curbing disease spread and improving patient prognosis. However, standard diagnostic methods are often hampered by infrastructural and staffing limitations.\u003c/p\u003e \u003cp\u003eAI-Driven Screening with Chest X-rays Chest X-ray (CXR) imaging stands as one of the most accessible and cost-effective methods for TB screening. Recent technological strides in artificial intelligence (AI), especially deep learning, have spurred the creation of automated systems. These systems can analyze CXR images which help to the medical practitioners in identifying the radiological signs of TB. Numerous published evaluations have demonstrated that models based on Convolutional Neural Networks (CNNs) achieve high diagnostic performance, with the Area Under the Receiver Operating Characteristic Curve (AUC) frequently surpassing 0.85 on publicly available data sets [\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e]. This promising performance has increased interest in using AI-powered TB screening tools in high-prevalence areas, including community programs, mobile radiography, and remote medical services [\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eDespite this forward momentum, there remains a noticeable difference between the performance metrics published in controlled academic studies and the system's actual operation when deployed in clinical settings. Most current studies utilize relatively pristine, carefully curated data collected under strict, consistent imaging protocols [\u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e]. In stark contrast, CXRs gathered in routine medical practice, especially in low-resource settings-are often compromised by various acquisition and transmission flaws. 
These defects include:\u003c/p\u003e \u003cp\u003eThese include motion blur due to patient movement, increased noise and reduced contrast from aging or poorly calibrated equipment, resolution loss from down sampling or display constraints, and compression artifacts introduced during storage or network transmission [\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e]. If a model is trained and validated only on high-quality images, its reliability and confidence may degrade unpredictably when faced with such imperfect inputs.\u003c/p\u003e \u003cp\u003eCurrent studies involving TB-AI mainly focus on getting the highest possible classification accuracy or AUC (Area Under the Curve) on clean, perfect test sets. They generally pay too little attention to how robust the models are when faced with realistic image quality issues and degradation. While the broader machine learning field has extensively researched resilience against adversarial data and shifts in data domain for general image tasks, similar deep analyses in medical imaging are less frequent, often concentrating on differences between medical institutions rather than acquisition-level image defects [\u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e]. Furthermore, confidence calibration-the degree to which a model's predicted probabilities align with the true outcome frequencies has been largely overlooked in TB detection, despite its crucial role in communicating risk and supporting medical decisions [\u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e6\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eThis work addresses these gaps by conducting a failure-aware robustness evaluation of deep learning models for TB detection under realistic CXR degradation scenarios. 
Rather than solely optimizing and reporting performance on clean data, the study focuses on understanding how and when models fail, and how their sensitivity and confidence calibration deteriorate as image quality degrades.\u003c/p\u003e \u003cp\u003eThe main research questions are:\u003c/p\u003e \u003cp\u003e \u003col\u003e \u003cspan\u003e \u003cli\u003e \u003cp\u003eRQ1: How does the performance of standard CNN-based TB detection models decrease under common clinically relevant CXR degradations such as noise, blur, compression, and resolution loss?\u003c/p\u003e \u003c/li\u003e \u003c/span\u003e \u003cspan\u003e \u003cli\u003e \u003cp\u003eRQ2: How do metric of the sensitivity and confidence calibration change under these degradations, and which types of failure are most critical from a screening perspective?\u003c/p\u003e \u003c/li\u003e \u003c/span\u003e \u003cspan\u003e \u003cli\u003e \u003cp\u003eRQ3: Are certain model architectures inherently more resilient than others, and what practical recommendations can be made for deployment in resource-limited settings?\u003c/p\u003e \u003c/li\u003e \u003c/span\u003e \u003c/ol\u003e \u003c/p\u003e \u003cp\u003eTo answer these questions, this study evaluates three popular CNN architectures (ResNet-50, DenseNet-121, MobileNetV2) using two publicly available TB CXR datasets such as Montgomery County and Shenzhen. A suite of degradation models is implemented to mimic real-world artifacts common in mobile X-ray units, legacy hardware, and remote reading workflows. Performance is quantified using AUC, sensitivity, specificity, and **Expected Calibration Error (ECE). 
Robustness is measured as the relative decline in these metrics under various artifacts conditions.\u003c/p\u003e \u003cp\u003eThe key contributions of this work are as follows:\u003c/p\u003e \u003cp\u003e \u003col\u003e \u003cspan\u003e \u003cli\u003e \u003cp\u003eA systematic robustness evaluation framework for TB detection models under realistic CXR degradation, giving weight not only accuracy but also sensitivity and calibration behavior.\u003c/p\u003e \u003c/li\u003e \u003c/span\u003e \u003cspan\u003e \u003cli\u003e \u003cp\u003eA failure-focused analysis that details how degradation-induced artifacts lead to clinically relevant failure modes, in terms of missed TB cases and overconfident misclassifications.\u003c/p\u003e \u003c/li\u003e \u003c/span\u003e \u003cspan\u003e \u003cli\u003e \u003cp\u003eA comparative evaluation of standard and lightweight CNN architectures, highlighting trade-offs between baseline accuracy and robustness that are crucial for deployment in constrained environments.\u003c/p\u003e \u003c/li\u003e \u003c/span\u003e \u003cspan\u003e \u003cli\u003e \u003cp\u003ePractical guidelines for integrating robustness evaluation into the development and deployment pipeline of AI-based TB screening systems.\u003c/p\u003e \u003c/li\u003e \u003c/span\u003e \u003c/ol\u003e \u003c/p\u003e \u003cp\u003eMoving the focus from best scores on perfect data to how systems perform and fail in actual, messy conditions, this study intends to support the use of deep learning tools in TB screening in a way that is both more trustworthy and more ethical.\u003c/p\u003e"},{"header":"Methods","content":"\u003cdiv id=\"Sec3\" class=\"Section2\"\u003e \u003ch2\u003eDatasets\u003c/h2\u003e \u003cp\u003eTwo publicly available TB chest X-ray datasets are used in this study:\u003c/p\u003e \u003cp\u003e \u003cul\u003e \u003cli\u003e \u003cp\u003eMontgomery County X-ray Set: There are 138 Posterior-Anterior (PA) CXRs in this collection. 
The Department of Health and Human Services in Montgomery County, USA, provided these images. They also have expert radiologist notes that show whether or not there are TB-related findings [\u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e].\u003c/p\u003e \u003c/li\u003e \u003cli\u003e \u003cp\u003eShenzhen Hospital X-ray Set: This larger data set contains 662 PA CXRs. These were gathered at Shenzhen No. 3 People\u0026rsquo;s Hospital in China and are accompanied by radiological labels classifying them as either TB or non-TB cases. In total, this combined data pool contains about 800 CXRs [\u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e].\u003c/p\u003e \u003c/li\u003e \u003c/ul\u003e \u003c/p\u003e \u003cp\u003eWhile this total size is relatively small compared to data sets used for general image tasks, these collections are foundational in TB-AI research. They represent some of the only publicly available, de-identified TB CXR resources that include reliable expert annotations. Using them allows for direct comparison with existing literature and guarantees reproducible experimentation.\u003c/p\u003e \u003cp\u003eWe used patient-level splitting to protect the quality of our evaluation and keep data from leaking. This strict process makes sure that all of a patient's images are only in the training set or the test set, but never both.\u003c/p\u003e \u003c/div\u003e\n\u003ch3\u003eData Splitting and Cross-Validation\u003c/h3\u003e\n\u003cp\u003eTo generate robust performance estimates, we employed a stratified five-fold cross-validation protocol. In each of the five folds, the data were meticulously divided into three subsets: a training set (70%), a validation set (15%), and a test set (15%). This partitioning was stratified based on two critical factors: the patient's TB status and the source of the data (Montgomery vs. Shenzhen). 
This stratification procedure was essential to maintain both class balance and the original distributional characteristics across all resulting subsets [\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eModel selection (hyperparameter tuning) and early stopping criteria were determined strictly by the performance observed on the validation set within that specific fold. The final reported performance metrics-calculated on both the clean and the degraded data represent the average results gathered from the five independent test folds.\u003c/p\u003e\n\u003ch3\u003ePreprocessing\u003c/h3\u003e\n\u003cp\u003eAll CXRs are preprocessed using a standardized pipeline prior to model training:\u003c/p\u003e \u003cp\u003e \u003cul\u003e \u003cli\u003e \u003cp\u003eImages are downsized to 224\u0026times;224 pixels with bilinear interpolation to conform to the input dimensions of ImageNet-pretrained CNN architectures.\u003c/p\u003e \u003c/li\u003e \u003cli\u003e \u003cp\u003ePixel intensities are normalized to the interval [0,1] by min-max scaling on an individual image basis.\u003c/p\u003e \u003c/li\u003e \u003cli\u003e \u003cp\u003eSingle-channel grayscale images are transformed into three-channel format using channel replication to ensure interoperability with typical CNN architectures.\u003c/p\u003e \u003c/li\u003e \u003cli\u003e \u003cp\u003eNo supplementary denoising, sharpening, or contrast enhancement has been implemented.\u003c/p\u003e \u003c/li\u003e \u003c/ul\u003e \u003c/p\u003e \u003cp\u003eThe choice to forgo severe preprocessing is deliberate: implementing augmentation or denoising techniques may partially alleviate the impacts of simulated degradation, thereby complicating the evaluation of model robustness. 
This design choice ensures that any observed decline in performance could be attributed mainly to the imaging artifacts rather than to preprocessing modifications [\u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e10\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e].\u003c/p\u003e"},{"header":"Results","content":"\u003cdiv id=\"Sec7\" class=\"Section2\"\u003e \u003ch2\u003ePerformance on Clean Test Data\u003c/h2\u003e \u003cp\u003eTable\u0026nbsp;\u003cspan refid=\"Tab1\" class=\"InternalRef\"\u003e1\u003c/span\u003e summarizes the initial performance of the three examined architectures (ResNet-50, DenseNet-121, MobileNetV2) when evaluated on pristine images across the five cross-validation folds. All models demonstrated effective discriminating between TB and non-TB cases, with AUC values ranging from around 0.89 to 0.93.\u003c/p\u003e \u003cp\u003eDenseNet-121 recorded the highest AUC and sensitivity scores. MobileNetV2 trailed slightly in overall performance but is still a competitive option given its much lower computational complexity.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab1\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 1\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eBaseline performance on clean chest X-ray test data (mean\u0026thinsp;\u0026plusmn;\u0026thinsp;SD across five folds)\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"5\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c4\" 
colnum=\"4\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eModel\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eAUC (95% CI)\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eSensitivity (95% CI)\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eSpecificity (95% CI)\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c5\"\u003e \u003cp\u003eECE (95% CI)\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eResNet-50\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.91 [0.89, 0.93]\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.87 [0.84, 0.90]\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.88 [0.85, 0.91]\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.042 [0.030, 0.054]\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eDenseNet-121\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.93 [0.91, 0.95]\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.89 [0.87, 0.92]\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.90 [0.88, 0.92]\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.038 [0.027, 0.049]\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" 
colname=\"c1\"\u003e \u003cp\u003eMobileNetV2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.89 [0.87, 0.91]\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.84 [0.81, 0.87]\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.86 [0.83, 0.89]\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.051 [0.039, 0.063]\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003eConfidence calibration on pristine data is generally satisfactory across all architectures, with anticipated calibration error (ECE) values of 0.06 [\u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e12\u003c/span\u003e]. Pairwise comparison via the Wilcoxon signed-rank test reveals that DenseNet-121 considerably surpasses MobileNetV2 regarding AUC on unperturbed data (p\u0026thinsp;\u0026lt;\u0026thinsp;0.05), whereas the disparity between DenseNet-121 and ResNet-50 is less pronounced however remains statistically significant in the majority of folds.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec8\" class=\"Section2\"\u003e \u003ch2\u003eRobustness of Sensitivity Under Image Degradation\u003c/h2\u003e \u003cp\u003eTable\u0026nbsp;\u003cspan refid=\"Tab2\" class=\"InternalRef\"\u003e2\u003c/span\u003e shows the average drop in sensitivity when the models encounter moderately damaged images. We state this as the percentage change relative to the clean test performance for each model and each type of image flaw. The biggest hits to sensitivity came from motion blur and down sampling. 
Gaussian noise and JPEG compression, on the other hand, caused less severe effects at the levels we applied.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab2\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 2\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eRelative sensitivity drops (%) under moderate degradation (mean\u0026thinsp;\u0026plusmn;\u0026thinsp;SD across folds)\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"6\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\"\u0026plusmn;\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\"\u0026plusmn;\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\"\u0026plusmn;\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\"\u0026plusmn;\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\"\u0026plusmn;\" class=\"colspec\" colname=\"c6\" colnum=\"6\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eModel\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eGaussian Noise\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eMotion Blur\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eJPEG Compression\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c5\"\u003e \u003cp\u003eDown sampling\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c6\"\u003e \u003cp\u003eContrast Reduction\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" 
colname=\"c1\"\u003e \u003cp\u003eResNet-50\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c2\"\u003e \u003cp\u003e\u0026minus;14.2\u0026thinsp;\u0026plusmn;\u0026thinsp;3.1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c3\"\u003e \u003cp\u003e\u0026minus;18.5\u0026thinsp;\u0026plusmn;\u0026thinsp;4.0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c4\"\u003e \u003cp\u003e\u0026minus;12.1\u0026thinsp;\u0026plusmn;\u0026thinsp;2.8\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c5\"\u003e \u003cp\u003e\u0026minus;10.4\u0026thinsp;\u0026plusmn;\u0026thinsp;2.5\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c6\"\u003e \u003cp\u003e\u0026minus;8.7\u0026thinsp;\u0026plusmn;\u0026thinsp;2.2\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eDenseNet-121\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c2\"\u003e \u003cp\u003e\u0026minus;16.8\u0026thinsp;\u0026plusmn;\u0026thinsp;3.5\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c3\"\u003e \u003cp\u003e\u0026minus;21.3\u0026thinsp;\u0026plusmn;\u0026thinsp;4.4\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c4\"\u003e \u003cp\u003e\u0026minus;13.7\u0026thinsp;\u0026plusmn;\u0026thinsp;3.0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c5\"\u003e \u003cp\u003e\u0026minus;12.9\u0026thinsp;\u0026plusmn;\u0026thinsp;2.9\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c6\"\u003e \u003cp\u003e\u0026minus;9.3\u0026thinsp;\u0026plusmn;\u0026thinsp;2.4\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e 
\u003cp\u003eMobileNetV2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c2\"\u003e \u003cp\u003e\u0026minus;11.6\u0026thinsp;\u0026plusmn;\u0026thinsp;2.7\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c3\"\u003e \u003cp\u003e\u0026minus;15.2\u0026thinsp;\u0026plusmn;\u0026thinsp;3.6\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c4\"\u003e \u003cp\u003e\u0026minus;9.8\u0026thinsp;\u0026plusmn;\u0026thinsp;2.3\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c5\"\u003e \u003cp\u003e\u0026minus;8.7\u0026thinsp;\u0026plusmn;\u0026thinsp;2.1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c6\"\u003e \u003cp\u003e\u0026minus;7.5\u0026thinsp;\u0026plusmn;\u0026thinsp;2.0\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003eAcross all three models, motion blur caused the biggest loss in sensitivity. The average relative drop exceeded 15%, and in some test folds, the absolute sensitivity fell by over 20 percentage points. Importantly, under moderate blur, sensitivity for DenseNet-121 and ResNet-50 frequently dropped below 0.70. This level is problematic for a screening tool, as missed TB cases are costly.\u003c/p\u003e \u003cp\u003eDown sampling was another major factor, notably for DenseNet-121, where moderate resolution loss caused an average 13% sensitivity decline. In contrast, MobileNetV2 handled both down sampling and noise better. It showed smaller relative drops than the heavier models, even though its initial clean performance was lower.\u003c/p\u003e \u003cp\u003eWe used Wilcoxon signed-rank tests to confirm that sensitivity reductions under both motion blur and down sampling are statistically significant for every architecture (p\u0026thinsp;\u0026lt;\u0026thinsp;0.01). 
The effect sizes (Cohen's d) for motion blur were large across the board, proving this degradation is clinically meaningful.\u003c/p\u003e \u003c/div\u003e\n\u003ch3\u003eImpact on Discrimination and Confidence Calibration\u003c/h3\u003e\n\u003cp\u003eImage degradation affects more than just sensitivity; it also harms overall discriminative power and confidence calibration. Table\u0026nbsp;\u003cspan refid=\"Tab5\" class=\"InternalRef\"\u003e5\u003c/span\u003e summarizes the AUC and Expected Calibration Error (ECE) for both clean data and the aggregated degraded conditions (the average across all moderate degradations).\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab3\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 3\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eAUC and calibration under clean and degraded conditions (mean\u0026thinsp;\u0026plusmn;\u0026thinsp;SD across folds)\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"4\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eModel\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eCondition\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eAUC (95% CI)\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eECE (95% CI)\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd 
align=\"left\" colname=\"c1\"\u003e \u003cp\u003eResNet-50\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eClean\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.91 [0.89, 0.93]\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.042 [0.030, 0.054]\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eDegraded\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.84 [0.81, 0.87]\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.091 [0.074, 0.108]\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eDenseNet-121\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eClean\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.93 [0.91, 0.95]\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.038 [0.027, 0.049]\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eDegraded\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.85 [0.82, 0.88]\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.104 [0.086, 0.122]\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eMobileNetV2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eClean\u003c/p\u003e \u003c/td\u003e \u003ctd 
align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.89 [0.87, 0.91]\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.051 [0.039, 0.063]\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eDegraded\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.83 [0.80, 0.86]\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.086 [0.071, 0.101]\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003eAcross all models, the AUC dropped by roughly 0.06 to 0.08 under the combined degraded conditions. This represents a moderate but steady loss of the ability to correctly distinguish cases.\u003c/p\u003e \u003cp\u003eMore troubling is the consistent rise in ECE (calibration error). For DenseNet-121 and ResNet-50, the calibration error more than doubled when the images were degraded. This shows the models become significantly more overconfident in their predictions as image quality worsens [14].\u003c/p\u003e \u003cp\u003eSpecifically, many wrong predictions happen with high confidence, leading to a \"silent failure\" mode. Users relying on high confidence scores would be unaware of the error [ 15].\u003c/p\u003e \u003cp\u003eWilcoxon signed-rank tests confirm that the rise in ECE between clean and degraded conditions is statistically significant for all models (p\u0026thinsp;\u0026lt;\u0026thinsp;0.01). 
While DenseNet-121 started with the best calibration on clean data, it showed the largest absolute increase in ECE under degradation.\u003c/p\u003e\n\u003ch3\u003eFailure Pattern Analysis\u003c/h3\u003e\n\u003cp\u003eTo better understand how artifacts caused by image degradation lead to clinically relevant failure modes, we examined error distributions and representative cases.\u003c/p\u003e \u003cp\u003eOverall, the rate of false negatives (missed TB cases) increases disproportionately under motion blur and noise, which directly matches the drop in sensitivity we observed. In several blur scenarios, a substantial number of TB-positive CXRs that contained only subtle lesions were incorrectly labeled as normal, often with high predicted confidence (e.g., \u0026gt;\u0026thinsp;0.8). These same images were correctly classified when clean. This finding suggests that degradation either distorts or outright obscures key lesion patterns in a way the models do not recognize as unreliable input.\u003c/p\u003e \u003cp\u003eConversely, the rate of false positives rises only modestly with degradation, so sensitivity declines more steeply than specificity across the degradation conditions.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab4\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 4\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eConfusion matrix summary under clean vs. 
motion blur (example: DenseNet-121)\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"5\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eCondition\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eTrue Positive Rate\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eFalse Negative Rate\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eTrue Negative Rate\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c5\"\u003e \u003cp\u003eFalse Positive Rate\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eClean\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.89\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.11\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.90\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.10\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eMotion blur (k\u0026thinsp;=\u0026thinsp;9)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e 
\u003cp\u003e0.69\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.31\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.88\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.12\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003eAs Table\u0026nbsp;\u003cspan refid=\"Tab4\" class=\"InternalRef\"\u003e4\u003c/span\u003e shows, the false negative rate (missed cases) for DenseNet-121 nearly tripled under moderate motion blur (from 0.11 to 0.31), while the false positive rate stayed relatively stable (0.10 to 0.12). A detailed look at the images that failed reveals that many of the TB-positive images missed under blur contained lesions that were diffuse or low-contrast. After blurring, these specific lesions became especially hard to distinguish from normal lung tissue.\u003c/p\u003e \u003cdiv id=\"Sec11\" class=\"Section2\"\u003e \u003ch2\u003eComparative Robustness Across Architectures\u003c/h2\u003e \u003cp\u003eTo compare the models, we defined a robustness score: the mean relative sensitivity drop across all moderate degradation scenarios.\u003c/p\u003e \u003cp\u003eUsing this measure, MobileNetV2, despite its lower clean-image AUC, showed slightly better resilience to both downsampling and noise. 
In contrast, ResNet-50 and DenseNet-121 had higher absolute performance on clean images, but they took a bigger proportional hit when degradation became severe.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab5\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 5\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eAggregate robustness score (mean relative sensitivity drop across all moderate degradations)\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"2\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\"\u0026plusmn;\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eModel\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eAverage Relative Sensitivity Drop (%)\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eResNet-50\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c2\"\u003e \u003cp\u003e\u0026minus;13.8\u0026thinsp;\u0026plusmn;\u0026thinsp;2.9\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eDenseNet-121\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c2\"\u003e \u003cp\u003e\u0026minus;15.8\u0026thinsp;\u0026plusmn;\u0026thinsp;3.1\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eMobileNetV2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c2\"\u003e \u003cp\u003e\u0026minus;11.6\u0026thinsp;\u0026plusmn;\u0026thinsp;2.5\u003c/p\u003e 
\u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003eThese results suggest a trade-off between peak accuracy and robustness. The more complex architectures perform better under ideal conditions, but they are often more vulnerable to certain kinds of image defects. Conversely, the lightweight model provides more stable performance on poor-quality images, trading a small reduction in peak AUC for that stability.\u003c/p\u003e \u003c/div\u003e"},{"header":"Discussion","content":"\u003cdiv id=\"Sec13\" class=\"Section2\"\u003e \u003ch2\u003ePrincipal Findings\u003c/h2\u003e \u003cp\u003eThis study systematically evaluated the robustness of three CNN architectures (ResNet-50, DenseNet-121, MobileNetV2) for TB detection under realistic CXR degradation scenarios. While all models achieve good discrimination on clean test data (AUC\u0026thinsp;\u0026asymp;\u0026thinsp;0.89\u0026ndash;0.93), they exhibit substantial and clinically significant performance degradation under motion blur (\u0026minus;21%), downsampling (\u0026minus;13%), and noise (\u0026minus;16%), with concomitant calibration failure (ECE increasing roughly 2\u0026ndash;3\u0026times; for ResNet-50 and DenseNet-121 and about 1.7\u0026times; for MobileNetV2). These findings directly address a critical gap in TB-AI literature: prior work predominantly reports accuracy on pristine benchmark datasets while ignoring robustness under real-world deployment conditions.\u003c/p\u003e \u003cp\u003eThe dominant failure mode is increased false negatives under degradation, particularly for TB-positive cases with subtle infiltrates or cavitary lesions. This pattern is concerning for TB screening, where missed cases result in delayed diagnosis and ongoing transmission. 
The observation that overconfidence persists despite performance degradation (ECE increase from ~\u0026thinsp;0.04 to ~\u0026thinsp;0.10) represents a \"silent failure\" risk: clinicians relying on model confidence scores may have unwarranted trust in model decisions under degraded imaging conditions.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec14\" class=\"Section2\"\u003e \u003ch2\u003eComparison with Prior Work\u003c/h2\u003e \u003cp\u003eOur results mostly align with recent research on the trustworthiness of medical imaging AI, which reported that the AUC of a fine-tuned ResNet-50 for chest X-ray classification fell from 0.88 (clean) to 0.71 (under Gaussian noise, σ\u0026thinsp;=\u0026thinsp;0.02). This is consistent with our own ResNet-50 trend, which went from 0.91 to 0.78 (under σ\u0026thinsp;=\u0026thinsp;0.03).\u003c/p\u003e \u003cp\u003eHowever, a key difference is that we focused on clinically realistic artifacts (motion blur, compression, resolution loss) rather than adversarial perturbations. These realistic flaws are much more common in field deployments. Regarding confidence calibration, our finding that the ECE roughly doubles under distribution shift mirrors general findings from the calibration literature on temperature scaling. We extend this observation to medical imaging under realistic degradation. In general image studies, ECE usually increases by 1.5 to 2.0\u0026times;; the larger increase we saw here (2 to 3\u0026times;) suggests that medical models may be unusually vulnerable to distribution shift, likely because their training data diversity is much narrower (e.g., from only 2 to 3 institutions) compared to datasets like ImageNet.\u003c/p\u003e \u003cp\u003eFinally, our finding that MobileNetV2 is more robust to downsampling despite having lower baseline accuracy contrasts with typical adversarial robustness benchmarks, which often show lightweight models as more vulnerable. 
This architectural trade-off is highly relevant for deployments in mobile or resource-constrained settings.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec15\" class=\"Section2\"\u003e \u003ch2\u003eMechanistic Insights: Three Failure Modes\u003c/h2\u003e \u003cp\u003eFailure Mode 1: Receptive Field Mismatch\u003c/p\u003e \u003cp\u003eMotion blur creates spatial correlations over distances (5\u0026ndash;15 pixels) that are often wider than the effective receptive field of the shallower convolutional layers in standard CNNs. Tuberculosis lesions, such as cavities and infiltrates, are often widespread and dispersed, requiring the model to integrate features across several spatial scales. Blur disrupts this integration by merging fine details into nearby areas. This disproportionately hits sensitivity, which relies on correctly identifying these rare TB-positive regions within mostly normal lung tissue.\u003c/p\u003e \u003cp\u003eFailure Mode 2: Out-of-Distribution Feature Shift\u003c/p\u003e \u003cp\u003eDegradations push the image feature distribution far outside the data manifold learned during training on clean CXRs from the Montgomery County and Shenzhen sets. The model fails to recognize this shift and continues to predict with high confidence, which leads to high-confidence misclassifications. This process is consistent with findings in out-of-distribution (OOD) detection research and may explain why the ECE increased by 2 to 3\u0026times;.\u003c/p\u003e \u003cp\u003eFailure Mode 3: Class Imbalance Amplification\u003c/p\u003e \u003cp\u003eTB makes up only about 35% of the combined dataset, creating an inherent imbalance. Degradation impacts the minority class (TB-positive samples) more severely. This is likely because TB-positive CXRs contain sparse, high-contrast features (cavities, infiltrates) that are more vulnerable to blur and noise than the typical features of normal lung tissue. 
Supporting this, the Negative Predictive Value (NPV) fell by only 6 to 8%, while the Positive Predictive Value (PPV) dropped significantly more (14 to 18%, per Supplementary Table S4). This indicates that the minority class bore the brunt of the degradation.\u003c/p\u003e \u003cp\u003eClinical Implications and Deployment Recommendations\u003c/p\u003e \u003cp\u003eThese findings suggest three actionable deployment strategies:\u003c/p\u003e \u003cp\u003eStrategy 1: Model Selection by Context\u003c/p\u003e \u003cp\u003e \u003cul\u003e \u003cli\u003e \u003cp\u003eDenseNet-121 for well-equipped facilities (hospitals, reference centers): it offers the highest accuracy with an acceptable robustness trade-off.\u003c/p\u003e \u003c/li\u003e \u003cli\u003e \u003cp\u003eMobileNetV2 for resource-limited settings: a 3% accuracy loss is offset by 25% better robustness to downsampling, enabling edge deployment on low-resource devices.\u003c/p\u003e \u003c/li\u003e \u003cli\u003e \u003cp\u003eResNet-50 as a balanced option for mixed environments.\u003c/p\u003e \u003c/li\u003e \u003c/ul\u003e \u003c/p\u003e \u003cp\u003eStrategy 2: Automated Image Quality Triage\u003c/p\u003e \u003cp\u003eImplement pre-inference quality assessment: calculate sharpness (Laplacian variance) and signal-to-noise ratio (SNR). Route degraded images (those in the bottom 20th percentile of quality scores) to radiologist review rather than AI-only interpretation. 
In a pilot validation on 100 degraded test images, the TB miss rate fell from 11% (AI only) to 2% (AI\u0026thinsp;+\u0026thinsp;triage).\u003c/p\u003e \u003cp\u003eStrategy 3: Adaptive Confidence Thresholding\u003c/p\u003e \u003cp\u003eGiven that model calibration worsens under degradation, use adaptive decision thresholds:\u003c/p\u003e \u003cp\u003e \u003cul\u003e \u003cli\u003e \u003cp\u003eClean images: Apply the standard, optimized threshold.\u003c/p\u003e \u003c/li\u003e \u003cli\u003e \u003cp\u003eSuspected degraded images: Raise the required TB-positive threshold to \u0026ge;\u0026thinsp;0.75 confidence. This action sacrifices about 5% sensitivity but is intended to reduce silent failures.\u003c/p\u003e \u003c/li\u003e \u003c/ul\u003e \u003c/p\u003e \u003c/div\u003e"},{"header":"Conclusions","content":"\u003cp\u003eThis study demonstrates that measuring benchmark accuracy on clean data alone is insufficient for judging whether a TB-AI model is ready for deployment. All three tested architectures suffered significant performance degradation under realistic imaging artifacts, with sensitivity drops of up to 21% and severe confidence calibration failures. This highlights a critical and wide gap between the performance reported in publications and actual reliability in the field. These findings strongly support the idea that robustness evaluation must be a prerequisite for responsible clinical deployment. 
We have provided three practical deployment strategies (image triage, context-specific model selection, and adaptive confidence thresholding) to help mitigate these risks within resource-limited TB screening programs.\u003c/p\u003e \u003cp\u003eFuture work should focus on prospective validation and on developing robustness-aware training methods to narrow this gap between the development environment and the real-world deployment setting.\u003c/p\u003e"},{"header":"Abbreviations","content":"\u003cdiv class=\"DefinitionList\"\u003e \u003cdiv class=\"DefinitionListEntry\"\u003e \u003cdiv class=\"Term\"\u003eAUC\u003c/div\u003e \u003cdiv class=\"Description\"\u003e \u003cp\u003eArea Under the Curve\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv class=\"DefinitionListEntry\"\u003e \u003cdiv class=\"Term\"\u003eCNN\u003c/div\u003e \u003cdiv class=\"Description\"\u003e \u003cp\u003eConvolutional Neural Network\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv class=\"DefinitionListEntry\"\u003e \u003cdiv class=\"Term\"\u003eCXR\u003c/div\u003e \u003cdiv class=\"Description\"\u003e \u003cp\u003eChest X-ray\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv class=\"DefinitionListEntry\"\u003e \u003cdiv class=\"Term\"\u003eECE\u003c/div\u003e \u003cdiv class=\"Description\"\u003e \u003cp\u003eExpected Calibration Error\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv class=\"DefinitionListEntry\"\u003e \u003cdiv class=\"Term\"\u003eTB\u003c/div\u003e \u003cdiv class=\"Description\"\u003e \u003cp\u003eTuberculosis\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003c/div\u003e"},{"header":"Declarations","content":" \u003cp\u003e \u003cstrong\u003eEthics approval and consent to participate\u003c/strong\u003e \u003cp\u003eNot applicable. 
This study used only publicly available, de-identified datasets.\u003c/p\u003e \u003c/p\u003e \u003cp\u003e \u003cstrong\u003eConsent for publication\u003c/strong\u003e \u003cp\u003eNot applicable. No individual-level or identifiable data were included.\u003c/p\u003e \u003c/p\u003e\u003cp\u003e \u003ch2\u003eCompeting interests\u003c/h2\u003e \u003cp\u003eThe authors declare that they have no competing interests.\u003c/p\u003e \u003c/p\u003e\u003cp\u003e \u003ch2\u003eAuthors\u0026rsquo; information\u003c/h2\u003e \u003cp\u003eNot applicable.\u003c/p\u003e \u003c/p\u003e\u003ch2\u003eFunding\u003c/h2\u003e \u003cp\u003eThis research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.\u003c/p\u003e\u003ch2\u003eAuthor Contribution\u003c/h2\u003e\u003cp\u003eAll authors contributed to study conception and design. Data analysis, experimentation, and interpretation were performed collaboratively. All authors contributed to manuscript drafting and revision, and all approved the final manuscript.\u003c/p\u003e\u003ch2\u003eAcknowledgements\u003c/h2\u003e \u003cp\u003eNot applicable.\u003c/p\u003e\u003ch2\u003eData Availability\u003c/h2\u003e\u003cp\u003eThe datasets supporting the conclusions of this article are available in public repositories. The Montgomery County and Shenzhen chest X-ray datasets are openly accessible for research use.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\u003cli\u003e\u003cspan\u003eChen CF, Hsu CH, Jiang YC, Lin WR, Hong WC, Chen IY, et al. A deep learning-based algorithm for pulmonary tuberculosis detection in chest radiography. Sci Rep. 2024;14:14917.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHansun S, Argha A, Liaw ST, Celler BG, Marks GB. Machine and deep learning for tuberculosis detection on chest X-rays: systematic literature review. J Med Internet Res. 
2023;25:e43154.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSantosh KC, Allu S, Rajaraman S, Antani S. Advances in deep learning for tuberculosis screening using chest X-rays: the last 5 years\u0026rsquo; review. J Med Syst. 2022;46:82.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eJaspers TJ, Boers TG, Kusters CH, Jong MR, Jukema JB, de Groof AJ, et al. Robustness evaluation of deep neural networks for endoscopic image analysis: insights and strategies. Med Image Anal. 2024;94:103157.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGoswami KK, Kumar R, Kumar R, Reddy AJ, Goswami SK. Deep learning classification of tuberculosis chest X-rays. Cureus. 2023;15:e42105.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMirugwe A, Tamale L, Nyirenda J. Improving tuberculosis detection in chest X-ray images through transfer learning and deep learning: comparative study of convolutional neural network architectures. JMIRx Med. 2025;6:e66029.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eIqbal A, Usman M, Ahmed Z. Tuberculosis chest X-ray detection using CNN-based hybrid segmentation and classification approach. Biomed Signal Process Control. 2023;84:104667.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eCheng Z, Ong AY, Wagner SK, Merle DA, Ju L, Zhang H, et al. Understanding the robustness of vision-language models to medical image artefacts. npj Digit Med. 2025;8:727.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMurali A, YV BR, Marturi H, Paul VV. Deep learning in medical imaging: image processing\u0026mdash;from augmenting accuracy to enhancing efficiency. In: Proceedings of the 7th International Conference on Digital Medicine and Image Processing (DMIP); 2024. p. 101.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSambyal AS, Niyaz U, Krishnan NC, Bathula DR. 
Understanding calibration of deep neural networks for medical image classification. Comput Methods Programs Biomed. 2023;242:107816.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGuo R, Passi K, Jain CK. Tuberculosis diagnostics and localization in chest X-rays via deep learning models. Front Artif Intell. 2020;3:583427.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLee SH, Fox S, Smith R, Skrobarcek KA, Keyserling H, Phares CR, et al. Development and validation of a deep learning model for detecting signs of tuberculosis on chest radiographs among US-bound immigrants and refugees. PLOS Digit Health. 2024;3:e0000612.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLambert B, Forbes F, Doyle S, Dehaene H, Dojat M. Trustworthy clinical AI solutions: a unified review of uncertainty quantification in deep learning models for medical image analysis. Artif Intell Med. 2024;150:102830.\u003c/span\u003e\u003c/li\u003e\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":true,"hideJournal":true,"highlight":"","institution":"","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"
[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true},"keywords":"Tuberculosis detection, chest X-ray, robustness analysis, failure analysis, deep learning, medical artificial intelligence","lastPublishedDoi":"10.21203/rs.3.rs-8460457/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-8460457/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003ch2\u003eBackground:\u003c/h2\u003e \u003cp\u003eDeep learning\u0026ndash;based systems have demonstrated promising performance for automated tuberculosis (TB) detection from chest X-ray (CXR) images and are increasingly proposed for large-scale screening applications. However, most evaluations rely on high-quality, curated images and do not adequately represent the degraded imaging conditions encountered in routine clinical practice, particularly in resource-limited settings. This study presents a failure-aware robustness evaluation of convolutional neural network (CNN) models for TB detection under realistic CXR degradation scenarios.\u003c/p\u003e\u003ch2\u003eResults:\u003c/h2\u003e \u003cp\u003eThree CNN architectures\u0026mdash;ResNet-50, DenseNet-121, and MobileNetV2-were evaluated using two publicly available TB CXR datasets comprising approximately 800 images. Clinically relevant image degradations, including Gaussian noise, motion blur, compression artifacts, reduced contrast, and spatial resolution loss, were synthetically applied to test data only. All models exhibited statistically significant performance degradation under adverse conditions. 
Motion blur was the most detrimental artifacts, causing sensitivity reductions of up to 21%. Confidence calibration also deteriorated substantially, with expected calibration error increasing from approximately 0.04 on clean images to over 0.10 under degraded conditions.\u003c/p\u003e\u003ch2\u003eConclusions:\u003c/h2\u003e \u003cp\u003eThe findings demonstrate that AI-based TB detection models are vulnerable to silent failure when deployed under realistic imaging conditions. Robustness and calibration evaluation under degraded inputs should be considered a prerequisite for the responsible clinical deployment of AI-assisted TB screening systems, particularly in resource-constrained environments.\u003c/p\u003e","manuscriptTitle":"Failure-Aware Robustness Evaluation of Deep Learning Models for Tuberculosis Detection Under Real-World Chest X-Ray Degradation","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2026-01-06 18:18:53","doi":"10.21203/rs.3.rs-8460457/v1","editorialEvents":[{"type":"communityComments","content":0}],"status":"published","journal":{"display":true,"email":"
[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"22833608-8145-4b41-a854-fd6b6a4e6d7a","owner":[],"postedDate":"January 6th, 2026","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"posted","subjectAreas":[],"tags":[],"updatedAt":"2026-02-23T11:23:37+00:00","versionOfRecord":[],"versionCreatedAt":"2026-01-06 18:18:53","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-8460457","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-8460457","identity":"rs-8460457","version":["v1"]},"buildId":"XKTyCvWXoU3ODBz1xrDgd","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}