Multi-Dataset Generalization and Explainability in AI-Based Intracranial Hemorrhage Detection from Noncontrast CT Using EfficientNet-B4

Lochan Shrestha, Huma Subhani

This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-9705343/v1
This work is licensed under a CC BY 4.0 License.
Status: Under Review. Version 1.

Abstract

Intracranial hemorrhage (ICH) is a neurological emergency with a mortality rate exceeding 40%. Deep learning models trained on benchmark datasets rarely demonstrate robust generalizability to independent external cohorts. We developed and externally validated an open-source EfficientNet-B4 pipeline for multilabel ICH detection with integrated Grad-CAM explainability.
The model was trained on 752,803 CT slices from the RSNA 2019 dataset via inverse frequency weighted loss across six outputs, with three-window RGB encoding as input. External validation was performed on the CQ500 (473 studies, New Delhi, India) without retraining. On the RSNA internal test set (112,921 slices), the model achieved a macro AUC-ROC of 0.9835 (95% CI 0.9821–0.9847), a sensitivity of 0.8192, a specificity of 0.9862, and an F1 of 0.7743. For CQ500, all sigmoid predictions fell below 0.5, yielding zero sensitivity, a primary negative finding attributable to domain shift. The threshold-independent macro AUC-ROC was 0.7276 (95% CI 0.6729–0.7815). Intraventricular hemorrhage showed the most consistent cross-dataset generalization (AUC-ROC 0.8079, 95% CI 0.691–0.904). Grad-CAM heatmaps (n = 60, two independent radiologists) achieved a combined mean localization score of 3.81 ± 0.86/5 with substantial interrater agreement (κ = 0.653). Domain shift causes complete threshold failure on external data; threshold recalibration on a representative local validation set is a nonnegotiable prerequisite for deployment. The open-source pipeline is released to facilitate adaptation in resource-limited settings.

Subject areas: Biological sciences/Computational biology and bioinformatics; Health sciences/Medical research; Health sciences/Neurology; Biological sciences/Neuroscience

Keywords: intracranial hemorrhage; deep learning; EfficientNet; multilabel classification; Grad-CAM; external validation; CT imaging; domain shift; class imbalance

1. Introduction

Intracranial hemorrhage (ICH) is a neurological emergency associated with mortality exceeding 40% and substantial long-term disability among survivors [1, 2]. Rapid and accurate diagnosis is critical: delayed identification is directly associated with worse neurological outcomes [1].
Noncontrast computed tomography (NCCT) is the primary imaging modality for the initial assessment of suspected ICH, owing to its widespread availability, speed, and high sensitivity for acute blood products. Accurate interpretation demands the recognition of five radiologically and clinically distinct subtypes (epidural, intraparenchymal, intraventricular, subarachnoid, and subdural hemorrhage), each with different etiologies, anatomical distributions, and management implications. This interpretive burden is compounded in high-volume emergency settings and in low- and middle-income countries (LMICs), where specialist neuroradiology coverage may be limited.

Deep learning has demonstrated strong diagnostic performance for ICH detection from NCCT. A 2025 systematic review and meta-analysis of 58 studies reported a pooled sensitivity of 0.92, specificity of 0.94, and AUC-ROC of 0.96 [3]. Landmark studies have achieved AUC-ROC values as high as 0.991 [4], and reader studies have confirmed that deep learning assistance improves diagnostic accuracy [1]. Multiple-dataset ensemble approaches have reported an AUC-ROC of 0.953 [5], which serves as the primary benchmark for the present work.

Despite this progress, five important limitations persist. First, most models are validated on a single dataset, and few open-source systems demonstrate robust generalization across heterogeneous CT acquisitions from geographically distinct populations [6, 7]. Second, epidural hemorrhage, the rarest ICH subtype (< 0.5% of slices), receives minimal attention. Third, the severe class imbalance inherent to ICH datasets is inconsistently addressed [8]. Fourth, structured radiologist scoring of model attention maps has not been reported. Fifth, open reproducible pipelines enabling replication in LMIC contexts remain scarce. To address these gaps, we present an open EfficientNet-B4 pipeline trained on 752,803 CT slices from the RSNA 2019 dataset [9] and externally validated on CQ500 [10].
Our contributions are as follows: (i) cross-dataset generalization with transparent reporting of domain shifts; (ii) Grad-CAM explainability with radiologist localization scoring; (iii) class-balanced imbalance handling with verified numerical accuracy; and (iv) full public release of code and model weights.

2. Related Work

2.1 Deep Learning Architectures for ICH Detection

The application of deep learning to ICH detection from NCCT has advanced rapidly since Chilamkurthy et al. demonstrated CNN-based detection across 313,000 head CT scans with an AUC of 0.91–0.97 on the CQ500 validation set [11]. Kuo et al. reported an AUC-ROC of 0.991 on 4,396 CT scans, demonstrating expert-level performance [4]. The RSNA 2019 Challenge [9] standardized evaluation on the largest publicly available annotated CT dataset. Burduja et al. finished in the top 2% of 1,345 teams, integrating Grad-CAM visualizations [12]. After the challenge, D’Angelo et al. evaluated Dense-UNet on 502 NCCT scans with an accuracy of 91.24% [13]. Ensemble methods have consistently outperformed single-model approaches [14]. EfficientNet [15], designed with compound scaling, has demonstrated strong medical imaging performance [16].

2.2 Multilabel Subtype Classification and Class Imbalance

Multilabel binary cross-entropy formulations better reflect the clinical reality that a single CT study may reveal multiple concurrent hemorrhage types [14]. Class imbalance poses a compounding difficulty, particularly for epidural hemorrhage. Strategies include focal loss [8], oversampling, and weighted loss functions. Kok et al. demonstrated the effectiveness of focal loss for intraventricular hemorrhage segmentation [17]. In the present work, we adopt a class-balanced inverse frequency weighting formula and verify it numerically against logged training values.

2.3 Cross-Dataset Generalization

Generalization to external cohorts is performed in only a minority of published ICH studies [3]. Salehinejad et al.
reported an AUC-ROC of 95.4% for external validation, with a drop from 98.4% internally [7]. Voter et al. identified failure modes of a validated algorithm during prospective deployment [5]. Nada et al. performed external validation on 5,600 heterogeneous CTs (AUC-ROC 0.954) [6]. The CQ500 dataset [11], collected in New Delhi, India, has become the most widely used external benchmark for models trained on North American data.

2.4 Explainability in Brain CT AI

Grad-CAM [18] has become the most widely applied post hoc explanation method for CNN-based medical image classifiers. Lee et al. demonstrated that saliency-based explanations improved clinician trust in ICH detection [10]. Kim et al. applied Grad-CAM to all six ICH subtypes via ResNet, providing a proof-of-concept [19]. Altuve and Pérez conducted qualitative heatmap validation for ICH detection [20]. The present study extends this work by conducting formal radiologist scoring across all six output classes.

3. Materials and methods

Reporting standards. This study follows the TRIPOD-AI and CLAIM (Checklist for Artificial Intelligence in Medical Imaging) reporting guidelines for diagnostic AI studies. A supplementary checklist is available from the authors upon request.

3.1 Datasets

RSNA 2019 Intracranial Hemorrhage Detection Dataset. The primary training dataset was sourced from the RSNA 2019 Brain CT Hemorrhage Challenge [9]. After slice-level labels were extracted, the dataset comprised 752,803 CT slices across six binary labels: any_ich, epidural, intraparenchymal, intraventricular, subarachnoid, and subdural hemorrhage. The RSNA 2019 dataset provides study-level annotations; slice-level labels were generated by propagating the study-level label to all slices within a study. This introduces label noise, as hemorrhagic subtypes may be visible in only a subset of slices within a positive study (discussed in Section 5.5). The raw DICOM files were converted to JPEG format (quality factor 95) prior to training.
CQ500 Dataset. CQ500 [11] comprises 491 noncontrast head CT studies collected in New Delhi, India. Of these, 18 studies (3.7%) were excluded because of the absence of DICOM folders in the Kaggle-hosted version of the dataset; no studies were excluded on a clinical basis. The remaining 473 studies were used for external validation. ICH was present in 205 studies (43.3%) and absent in 268 (56.7%). Each study was annotated by three radiologists; the majority vote (≥ 2/3) was the ground truth. CQ500 was used exclusively for external validation.

All the RSNA 2019 data were split at the study level: 70% training/15% validation/15% testing, yielding 526,962/112,920/112,921 slices, respectively. An assertion verified that no StudyInstanceUID appeared in more than one partition. StudyInstanceUID was used as the primary split key, constituting the closest available proxy for patient-level separation in this deidentified dataset.

3.3 Class distribution and imbalance handling

The RSNA training set exhibits severe class imbalance (Table 1). Binary cross-entropy loss with per-class positive weighting (BCEWithLogitsLoss, pos_weight) was applied during training. The positive class weight was computed as

$$w_c = \frac{N}{N_c \times C}$$

where $N$ is the total number of training slices, $N_c$ is the number of positive slices for class $c$, and $C = 6$ is the number of output classes. This ensures that each class contributes equally to the aggregate loss at initialization. Weights ranged from 1.16 (any_ich) to 39.51 (epidural). The formula was confirmed against the logged training output for all six classes. Implementation: `pos_weight = total/(pos_counts * num_classes + 1e-6)`. The ε = 1e-6 term is a numerical stabilizer used to prevent division by zero for any class with zero positive samples in the training set (not applicable here but included for robustness). This approach was preferred over focal loss [8], as it requires no additional hyperparameter tuning.
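As a check on the weighting formula in Section 3.3, the following minimal sketch recomputes the per-class weights from the positive-slice counts reported in Table 1. It uses NumPy rather than PyTorch to stay self-contained; the function name `class_balanced_pos_weight` is illustrative and is not taken from the released code. In a PyTorch training loop the resulting values would be converted to a tensor and supplied as the `pos_weight` argument of `BCEWithLogitsLoss`.

```python
import numpy as np

# Positive-slice counts per class and training-set size, copied from Table 1.
CLASS_NAMES = ["any_ich", "epidural", "intraparenchymal",
               "intraventricular", "subarachnoid", "subdural"]
POS_COUNTS = np.array([75_553, 2_223, 25_258, 18_383, 24_952, 33_108],
                      dtype=np.float64)
TOTAL_SLICES = 526_962


def class_balanced_pos_weight(total, pos_counts, eps=1e-6):
    """w_c = N / (N_c * C + eps), mirroring the expression quoted in Section 3.3."""
    num_classes = len(pos_counts)
    return total / (pos_counts * num_classes + eps)


weights = class_balanced_pos_weight(TOTAL_SLICES, POS_COUNTS)
for name, w in zip(CLASS_NAMES, weights):
    print(f"{name:18s} {w:6.2f}")
```

Rounded to two decimals, the computed values agree with the pos_weight column of Table 1, including the extremes of 1.16 (any_ich) and 39.51 (epidural).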
Table 1. Class distribution in the RSNA 2019 training set (n = 526,962 slices).

| Hemorrhage subtype | Positive slices | Prevalence (%) | pos_weight (w_c) | n_test pos. |
|---|---|---|---|---|
| any_ich | 75,553 | 14.34 | 1.16 | 16,084 |
| epidural | 2,223 | 0.42 | 39.51 | 459 |
| intraparenchymal | 25,258 | 4.80 | 3.48 | 5,279 |
| intraventricular | 18,383 | 3.49 | 4.78 | 3,900 |
| subarachnoid | 24,952 | 4.74 | 3.52 | 5,431 |
| subdural | 33,108 | 6.29 | 2.65 | 7,043 |

n_test pos.: positive slices in the held-out RSNA test set. C = 6 classes.

3.4 CT preprocessing and windowing

Noncontrast CT images were preprocessed via a three-window RGB encoding strategy [10, 21, 22]. Three clinically standard Hounsfield unit (HU) windows were applied and stacked as separate color channels: the brain (center 40 HU, width 80 HU), subdural (center 40 HU, width 130 HU), and bone (center 400 HU, width 1800 HU) channels. For each window with center C and width W, raw HU pixel values were clipped to [C − W/2, C + W/2] and linearly rescaled to unsigned 8-bit integers [0, 255] as pixel_out = 255 × (HU_clipped − (C − W/2))/W. The three resulting uint8 channels were concatenated to produce a 3-channel 512×512 image. RSNA DICOM files were converted to JPEG format via OpenCV (cv2.imwrite, IMWRITE_JPEG_QUALITY = 95) prior to training; this is a lossy compression step that further quantizes the 8-bit windowed pixel values, and the mismatch between JPEG-compressed training data and raw DICOM external data is discussed as the primary driver of domain shift in Section 5.2. The images were downsampled to 256×256 via area interpolation (INTER_AREA) and normalized via ImageNet channel statistics (mean = [0.485, 0.456, 0.406]; std = [0.229, 0.224, 0.225]). During quality control, a lightweight Otsu-based threshold was applied to the bone window channel to flag CT slices in which brain tissue coverage fell outside the range of 20–95%; no skull stripping was applied at inference time.
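The clipping and rescaling rule in Section 3.4 can be illustrated with a short NumPy sketch. It is a simplified illustration, not the released preprocessing code: the function names are hypothetical, rounding to the nearest integer is an assumption (the text does not state whether values are rounded or truncated), and the JPEG conversion, resizing, and ImageNet normalization steps are omitted.

```python
import numpy as np

# (center, width) in HU for the three windows named in Section 3.4.
WINDOWS = {"brain": (40, 80), "subdural": (40, 130), "bone": (400, 1800)}


def apply_window(hu, center, width):
    """Clip HU to [C - W/2, C + W/2] and rescale linearly to uint8 [0, 255]."""
    low = center - width / 2.0
    clipped = np.clip(hu.astype(np.float32), low, low + width)
    # Rounding to nearest is an assumption; the paper only gives the linear map.
    return np.round(255.0 * (clipped - low) / width).astype(np.uint8)


def three_window_rgb(hu):
    """Stack the brain, subdural and bone windows as three image channels."""
    channels = [apply_window(hu, c, w) for c, w in WINDOWS.values()]
    return np.stack(channels, axis=-1)


# Toy 2x2 "slice" in HU: air, water, soft tissue, dense bone (illustrative values).
hu_slice = np.array([[-1000.0, 0.0], [40.0, 1300.0]])
rgb = three_window_rgb(hu_slice)
print(rgb.shape, rgb.dtype)
print(rgb[0, 1])  # water (0 HU) under the brain, subdural and bone windows
```

Air saturates to 0 and dense bone to 255 in every channel, while intermediate values such as water map to a different gray level in each window, which is what lets the three channels carry complementary contrast.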
3.5 Model Architecture

EfficientNet-B4 [15], pretrained on ImageNet [23], was loaded via the timm library [24] with the default classification head replaced by Dropout (p = 0.3) followed by a Linear layer (1,792 inputs, 6 outputs). For training, a batch size of 128 slices per step was used. Two NVIDIA T4 GPUs were used with PyTorch DataParallel [25], with 64 slices distributed per GPU.

3.6 Training procedure

The model was trained via AdamW [26] (learning rate of 1×10⁻⁴, weight decay of 1×10⁻²) with cosine annealing scheduling (T_max = 50). Mixed-precision training [27] was enabled. Early stopping (patience = 7 epochs) monitored the macro validation AUC-ROC. The best checkpoint stores the epoch index, model state, optimizer state, and validation AUC. Early stopping was triggered at epoch 13; the best checkpoint (epoch 6) was selected for all subsequent evaluations.

3.7 Evaluation Metrics

Per-class and macro-averaged AUC-ROC, sensitivity, specificity, and F1 score were computed. For the RSNA internal test set, a fixed threshold of 0.5 was applied. For CQ500, domain shift caused all sigmoid predictions to fall below 0.5 (maximum predicted probability = 0.468), yielding zero sensitivity; this is the primary negative finding. For the secondary evaluation of the CQ500, per-class Youden index thresholds were derived from the RSNA 2019 validation set and applied unchanged to the CQ500, preserving methodological independence. AUC-ROC is threshold independent and constitutes the primary cross-dataset performance measure.

The RSNA 2019 dataset is fully deidentified; StudyInstanceUIDs do not reliably map to unique patients. Study-level aggregation of RSNA test set predictions is therefore not feasible. Internal evaluation was performed at the slice level (n = 112,921), which is the native label granularity. The Kang et al. 2023 benchmark [14] was reported at the study/exam level on a different dataset; direct numerical comparison is therefore approximate and is presented as contextual rather than equivalent.
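The Youden index threshold used for the secondary evaluation maximizes J = sensitivity + specificity − 1 over candidate thresholds. The sketch below is a minimal NumPy illustration under stated assumptions: the function name and the toy labels and scores are hypothetical, candidate thresholds are taken to be the observed scores, and the brute-force scan is intended for clarity rather than for use on hundreds of thousands of slices.

```python
import numpy as np


def youden_threshold(y_true, y_score):
    """Return (threshold, J) maximising J = sensitivity + specificity - 1."""
    y_true = np.asarray(y_true).astype(bool)
    y_score = np.asarray(y_score, dtype=np.float64)
    best_j, best_t = -1.0, None
    for t in np.unique(y_score):          # candidate thresholds: observed scores
        pred = y_score >= t
        sens = np.mean(pred[y_true])      # true-positive rate at threshold t
        spec = np.mean(~pred[~y_true])    # true-negative rate at threshold t
        j = sens + spec - 1.0
        if j > best_j:
            best_j, best_t = j, float(t)
    return best_t, best_j


# Hypothetical validation labels and low-valued sigmoid scores for one class.
val_labels = [0, 0, 0, 1, 0, 1, 1]
val_scores = [0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07]
threshold, j_stat = youden_threshold(val_labels, val_scores)
print(threshold, j_stat)
```

In this toy example the selected operating point lies far below 0.5, which mirrors the situation reported here, where the validation-derived thresholds are small. In the protocol of Section 3.7 the threshold is fixed on the validation data and then applied unchanged to the external cohort.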
3.8 Grad-CAM Explainability

Gradient-weighted class activation mapping (Grad-CAM) [18] was implemented by registering forward and backward hooks on the final convolutional block of EfficientNet-B4. For each target class, the gradient of the predicted logit with respect to the final convolutional feature map was globally average pooled to obtain per-channel importance weights. The channel-weighted sum of the feature maps was ReLU-activated, upsampled to 256×256 via bilinear interpolation, normalized to [0, 1], and overlaid as a jet colormap on the bone window channel at 50% opacity.

For localization quality assessment, 10 heatmaps per class (n = 60 total) were randomly selected from true-positive test set predictions with a predicted probability ≥ 0.5, so that the selection was representative rather than biased toward the highest-confidence cases. Two board-certified radiologists independently scored all 60 heatmaps, blinded to each other’s scores and to model architecture details, using a five-point Likert scale: 1 = attention entirely outside the hemorrhage region; 2 = predominantly outside; 3 = partial overlap; 4 = predominantly within; and 5 = attention exclusively within the hemorrhage region. Interrater reliability was quantified via Cohen’s weighted kappa (quadratic weights) and interpreted via the Landis & Koch scale [28].

3.9 External validation protocol

The CQ500 validation uses the epoch 6 checkpoint without retraining or fine-tuning. Slice-level sigmoid predictions were aggregated to the study level via mean pooling and then compared against majority-vote study-level labels. The AUC-ROC with 95% bootstrap confidence intervals (n = 1,000 study-level resamples) was the primary evaluation measure.

3.10 Statistical analysis

The 95% CI for AUC-ROC was estimated via bootstrap resampling (n = 1,000; slice level for the RSNA internal test set; study level for the CQ500). All analyses were performed in Python (scipy v1.11, NumPy v1.24).
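The bootstrap confidence interval described in Section 3.10 can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the paper does not specify the bootstrap variant, so a percentile bootstrap is assumed; AUC is computed by direct pairwise comparison (equivalent to the Mann-Whitney statistic), which is only practical for modest sample sizes; and the function names and synthetic data are hypothetical.

```python
import numpy as np


def auc_roc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    y_true = np.asarray(y_true).astype(bool)
    y_score = np.asarray(y_score, dtype=np.float64)
    pos, neg = y_score[y_true], y_score[~y_true]
    diff = pos[:, None] - neg[None, :]
    return float((np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size)


def bootstrap_auc_ci(y_true, y_score, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for AUC (the percentile method is an assumption)."""
    rng = np.random.default_rng(seed)
    y_true = np.asarray(y_true).astype(bool)
    y_score = np.asarray(y_score, dtype=np.float64)
    n = len(y_true)
    aucs = []
    while len(aucs) < n_boot:
        idx = rng.integers(0, n, size=n)       # resample cases with replacement
        if y_true[idx].all() or not y_true[idx].any():
            continue                           # AUC undefined without both classes
        aucs.append(auc_roc(y_true[idx], y_score[idx]))
    low, high = np.percentile(aucs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return auc_roc(y_true, y_score), float(low), float(high)


# Synthetic study-level example: 200 cases, scores loosely correlated with labels.
rng = np.random.default_rng(42)
labels = rng.integers(0, 2, size=200)
scores = 0.05 * labels + rng.normal(0.05, 0.05, size=200)
point, ci_low, ci_high = bootstrap_auc_ci(labels, scores)
print(f"AUC {point:.3f} (95% CI {ci_low:.3f}-{ci_high:.3f})")
```

Resampling whole cases (studies for the CQ500, slices for the RSNA test set) keeps each label paired with its score; the interval width then reflects the number of positive cases, which is why classes with few positives yield wide intervals.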
3.11 Use of AI-assisted writing tools

During the preparation of this manuscript, the authors used Claude (claude.ai; Anthropic PBC, San Francisco, CA, USA) to assist with drafting, structural editing, and revision of the manuscript text. The AI tool was used solely to improve clarity, coherence, and language; it was not used in the design or conduct of the study, the training or evaluation of the model, the collection or analysis of data, the interpretation of results, or the clinical assessment of Grad-CAM heatmaps. All AI-assisted text was reviewed, edited, and verified by the authors, who take full responsibility for the accuracy, integrity, and scientific validity of all content in this article. Claude is not listed as an author and does not meet the ICMJE criteria for authorship.

4. Results

4.1 Model training

Training converged rapidly, with the validation macro AUC-ROC reaching 0.9837 at epoch 6 (training loss = 0.0553). Training loss decreased monotonically across logged epochs from 0.2002 (epoch 1) to 0.0154 (epoch 13). The validation AUC-ROC rose steeply from 0.9726 (epoch 1) to 0.9837 (epoch 6) and then plateaued while training loss continued to decline, which is consistent with the onset of overfitting and is the rationale for early stopping. Early stopping was triggered at epoch 13 after seven consecutive epochs without improvement (Fig. 2, Table 2).

Table 2. Validation performance across training epochs. Metrics are taken from the training logs. The best checkpoint is marked with ★.

| Epoch | Train loss | Val macro AUC | Sensitivity | Specificity |
|---|---|---|---|---|
| 1 | 0.2002 | 0.9726 | 0.7606 | 0.9798 |
| 2 | 0.1420 | 0.9787 | 0.7970 | 0.9806 |
| 3 | 0.1149 | 0.9813 | 0.7970 | 0.9838 |
| 4 | 0.0910 | 0.9822 | 0.7940 | 0.9858 |
| 5 | 0.0692 | 0.9831 | 0.7912 | 0.9879 |
| 6 ★ | 0.0553 | 0.9837 | 0.8207 | 0.9864 |
| 7 | 0.0418 | 0.9827 | 0.7784 | 0.9904 |
| 8 | 0.0333 | 0.9831 | 0.8085 | 0.9877 |
| 9 | 0.0267 | 0.9821 | 0.7649 | 0.9917 |
| 10 | 0.0238 | 0.9825 | 0.7756 | 0.9917 |
| 11 | 0.0196 | 0.9835 | 0.7825 | 0.9909 |
| 12 | 0.0172 | 0.9834 | 0.7948 | 0.9906 |
| 13 | 0.0154 | 0.9822 | 0.7612 | 0.9930 |

★ Best checkpoint selected for all subsequent evaluations.
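The early-stopping behavior reported in Section 4.1 can be replayed from the validation AUC values in Table 2. The sketch below assumes that any strict increase in validation macro AUC-ROC counts as an improvement (no minimum-delta tolerance), which the text does not state explicitly; the function name is illustrative.

```python
# Validation macro AUC-ROC per epoch, copied from Table 2 (epochs 1-13).
VAL_AUC = [0.9726, 0.9787, 0.9813, 0.9822, 0.9831, 0.9837, 0.9827,
           0.9831, 0.9821, 0.9825, 0.9835, 0.9834, 0.9822]


def replay_early_stopping(val_auc, patience=7):
    """Return (best_epoch, stop_epoch), 1-based; stop_epoch is None if never triggered."""
    best_auc, best_epoch, stale = float("-inf"), None, 0
    for epoch, auc in enumerate(val_auc, start=1):
        if auc > best_auc:                 # strict improvement resets the counter
            best_auc, best_epoch, stale = auc, epoch, 0
        else:
            stale += 1
            if stale >= patience:          # patience exhausted: stop training
                return best_epoch, epoch
    return best_epoch, None


best_epoch, stop_epoch = replay_early_stopping(VAL_AUC)
print(best_epoch, stop_epoch)  # 6 13
```

With patience = 7, the counter starts at epoch 7 and reaches seven at epoch 13, reproducing the reported best checkpoint (epoch 6) and termination epoch (epoch 13).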
4.2 Internal Validation: RSNA Test Set

On the RSNA held-out test set (112,921 slices, threshold = 0.5), the model achieved a macro AUC-ROC of 0.9835 (95% CI 0.9821–0.9847), a sensitivity of 0.8192, a specificity of 0.9862, and an F1 of 0.7743 (Table 3). The per-class AUC-ROC ranged from 0.9758 (subarachnoid) to 0.9921 (intraventricular). Epidural hemorrhage, the most severely imbalanced class (w = 39.51), had an AUC-ROC of 0.9847, with a specificity of 0.9979. The macro AUC-ROC of 0.9835 is numerically higher than the Kang et al. 2023 benchmark of 0.953 [14], although this comparison is approximate given the different evaluation units.

Table 3. Per-class performance on the RSNA 2019 internal test set (n = 112,921 slices, threshold = 0.50).

| Class | AUC-ROC | 95% CI | Sens. | Spec. | F1 | n_pos |
|---|---|---|---|---|---|---|
| any_ich | 0.9820 | 0.9811–0.9829 | 0.8434 | 0.9806 | 0.8605 | 16,084 |
| epidural | 0.9847 | 0.9777–0.9908 | 0.7821 | 0.9979 | 0.6845 | 459 |
| intraparenchymal | 0.9868 | 0.9855–0.9882 | 0.8511 | 0.9870 | 0.8043 | 5,279 |
| intraventricular | 0.9921 | 0.9910–0.9931 | 0.8769 | 0.9908 | 0.8221 | 3,900 |
| subarachnoid | 0.9758 | 0.9740–0.9775 | 0.7535 | 0.9808 | 0.7060 | 5,431 |
| subdural | 0.9793 | 0.9778–0.9807 | 0.8082 | 0.9804 | 0.7685 | 7,043 |
| MACRO | 0.9835 | 0.9821–0.9847 | 0.8192 | 0.9862 | 0.7743 | n/a |

95% CI: bootstrap (n = 1,000, slice level). Sens. = sensitivity; Spec. = specificity; n_pos = positive slices in the test set.

4.3 External Validation: CQ500

Primary evaluation (threshold = 0.5): The maximum predicted sigmoid probability across all CQ500 slices was 0.468. All outputs fell below the fixed 0.5 threshold, yielding zero sensitivity across all classes (specificity = 1.0). This constitutes a primary negative finding: the model produces no positive predictions on the CQ500 data without recalibration.

Discriminative performance (AUC-ROC): The macro AUC-ROC for the CQ500 was 0.7276 (95% CI 0.6729–0.7815; Table 4). Intraventricular hemorrhage demonstrated the highest degree of cross-dataset generalizability (AUC-ROC 0.8079, 95% CI 0.691–0.904).
The subdural (0.6762) and epidural (0.6772, 95% CI 0.496–0.858) classes showed the poorest cross-dataset performance; the wide epidural CI reflects only 12 positive studies.

Secondary evaluation (RSNA-derived Youden thresholds): At the Youden index thresholds derived from the RSNA validation set (range 0.018–0.070), the macro sensitivity was 0.347 and the macro specificity was 0.937, reflecting systematically conservative predictions under domain shift. Intraventricular hemorrhage showed the most clinically useful balance (sensitivity 0.731, specificity 0.808).

Table 4. Per-class performance on the CQ500 external validation dataset (n = 473 studies). Primary measure: AUC-ROC. Secondary measures: sensitivity, specificity, and F1 at the RSNA validation-set Youden index thresholds, applied unchanged to the CQ500.

| Class | AUC | 95% CI | Sens. | Spec. | F1 | Thresh. | n_pos |
|---|---|---|---|---|---|---|---|
| any_ich | 0.7298 | 0.6775–0.7802 | 0.4721 | 0.9348 | 0.6039 | 0.070 | 197 |
| epidural | 0.6772 | 0.4956–0.8582 | 0.0833 | 0.9610 | 0.0645 | 0.000 | 12 |
| intraparenchymal | 0.7098 | 0.6508–0.7646 | 0.1550 | 0.9884 | 0.2614 | 0.068 | 129 |
| intraventricular | 0.8079 | 0.6905–0.9044 | 0.7308 | 0.8076 | 0.2901 | 0.018 | 26 |
| subarachnoid | 0.7646 | 0.6852–0.8327 | 0.3333 | 0.9808 | 0.4524 | 0.043 | 57 |
| subdural | 0.6762 | 0.5703–0.7681 | 0.3061 | 0.9505 | 0.3529 | 0.053 | 49 |
| MACRO | 0.7276 | 0.6729–0.7815 | 0.3468 | 0.9372 | 0.3375 | n/a | 205 |

AUC-ROC: threshold-independent primary measure; 95% CI by study-level bootstrap (n = 1,000). Sens., Spec., F1: computed at the RSNA validation-set Youden thresholds, applied unchanged to the CQ500. Epidural threshold 0.000: Under domain shift, no RSNA validation slice prediction exceeded the Youden operating point for this class, indicating that the model’s epidural output distribution shifted below the RSNA-derived threshold; de facto, no CQ500 slice was classified as epidural using this threshold. n_pos: positive studies in CQ500. The prevalence of each class in the CQ500 was as follows: any_ich, 43.3%; intraparenchymal, 28.3%; subarachnoid, 12.7%; subdural, 11.2%; intraventricular, 5.9%; and epidural, 2.7%.
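The study-level evaluation protocol of Section 3.9 (mean pooling of slice-level probabilities, followed by thresholding) can be sketched as below. The study identifiers, labels, and probabilities are invented toy values chosen only to reproduce the qualitative pattern reported here: positive studies still receive higher pooled scores than negative ones, yet no pooled score reaches 0.5. Function names are illustrative and are not taken from the released pipeline.

```python
import numpy as np


def pool_to_study(study_ids, slice_probs):
    """Mean-pool slice-level probabilities per study; returns (unique_ids, pooled)."""
    study_ids = np.asarray(study_ids)
    slice_probs = np.asarray(slice_probs, dtype=np.float64)
    unique_ids, inverse = np.unique(study_ids, return_inverse=True)
    sums = np.bincount(inverse, weights=slice_probs)
    counts = np.bincount(inverse)
    return unique_ids, sums / counts


def sens_spec(y_true, y_prob, threshold):
    """Sensitivity and specificity when predicting positive for y_prob >= threshold."""
    y_true = np.asarray(y_true).astype(bool)
    pred = np.asarray(y_prob) >= threshold
    return float(np.mean(pred[y_true])), float(np.mean(~pred[~y_true]))


# Hypothetical slice-level outputs for four studies (A and C positive, B and D negative).
ids = ["A", "A", "A", "B", "B", "C", "C", "D", "D"]
probs = [0.30, 0.10, 0.05, 0.02, 0.01, 0.20, 0.04, 0.03, 0.05]
study_labels = {"A": 1, "B": 0, "C": 1, "D": 0}

studies, pooled = pool_to_study(ids, probs)
labels = [study_labels[s] for s in studies]
print(dict(zip(studies.tolist(), np.round(pooled, 3).tolist())))
print("threshold 0.50:", sens_spec(labels, pooled, 0.50))
print("threshold 0.07:", sens_spec(labels, pooled, 0.07))
```

At the fixed 0.5 threshold the toy model flags nothing (sensitivity 0, specificity 1), whereas a lower operating point separates the two groups. This is the distinction the results draw between threshold failure and retained discriminative ranking.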
4.4 Grad-CAM Explainability

Two board-certified radiologists independently scored Grad-CAM heatmaps for 60 randomly selected true-positive test set predictions (10 per class, predicted probability ≥ 0.5) via a five-point localization scale (Fig. 4). The combined mean localization score was 3.81 ± 0.86/5 (Table 5). Interrater reliability was κ = 0.653, indicating substantial agreement (Landis & Koch scale) [28]. Exact agreement was achieved on 47 of 60 images (78.3%), and all 60 images (100%) had within-one-point agreement, confirming scale stability. Intraparenchymal hemorrhage achieved the highest localization scores (mean of 4.40/5 for both raters), which is consistent with its large, visually distinct parenchymal signal. Subdural hemorrhage showed perfect interrater agreement (κ = 1.000), which is consistent with the anatomically unambiguous crescent-shaped convexity pattern. Any_ich showed the lowest mean score (3.30/5) and lowest per-class kappa (0.467), which was expected given its heterogeneous composite nature spanning all five subtypes. Epidural hemorrhage showed near-perfect interrater agreement (κ = 0.808), which is consistent with the anatomically constrained peripheral location.

Table 5. Radiologist Grad-CAM localization scores by two independent board-certified radiologists (n = 60 heatmaps; 10 per class; randomly selected true-positive predictions; threshold ≥ 0.50).

| Class | Rater 1 (mean ± SD) | Rater 2 (mean ± SD) | Combined mean | Cohen’s κ | Interpretation |
|---|---|---|---|---|---|
| any_ich | 3.20 ± 1.08 | 3.40 ± 1.28 | 3.30 | 0.467 | Moderate |
| epidural | 3.60 ± 0.49 | 3.50 ± 0.67 | 3.55 | 0.808 | Almost perfect |
| intraparenchymal | 4.40 ± 0.49 | 4.40 ± 0.49 | 4.40 | 0.583 | Moderate |
| intraventricular | 3.90 ± 0.70 | 3.60 ± 0.80 | 3.75 | 0.531 | Moderate |
| subarachnoid | 4.00 ± 1.00 | 4.10 ± 0.94 | 4.05 | 0.565 | Moderate |
| subdural | 3.80 ± 0.40 | 3.80 ± 0.40 | 3.80 | 1.000 | Perfect |
| Overall | 3.82 ± 0.83 | 3.80 ± 0.89 | 3.81 ± 0.86 | 0.653 | Substantial |

Scores: 1 = attention entirely outside the hemorrhage; 5 = attention exclusively within the hemorrhage.
Interrater reliability: Cohen’s weighted kappa (quadratic weights); Landis & Koch (1977) [28] scale. Exact agreement: 47/60 (78.3%); within-one-point agreement: 60/60 (100%). κ interpretation: <0.20 slight; 0.21–0.40 fair; 0.41–0.60 moderate; 0.61–0.80 substantial; 0.81–1.00 almost perfect.

5. Discussion

5.1 Internal performance in the context of the literature

The model achieved a macro AUC-ROC of 0.9835 (95% CI 0.9821–0.9847) on the RSNA internal test set. This value is numerically greater than both the Kang et al. 2023 benchmark of 0.953 [14] and the pooled meta-analysis estimate of 0.96 [3]. As discussed in Section 4.2, this comparison is not equivalent (slice-level versus exam-level evaluation) and should be interpreted as contextual rather than confirmatory. The high macro specificity of 0.9862 indicates a low false-positive rate, which is important for clinical screening workflows [1, 23]. Epidural hemorrhage, the most severely imbalanced class (0.42% prevalence, w = 39.51), achieved an AUC-ROC of 0.9847, demonstrating that the class-balanced weighting formula effectively prevents minority class collapse. The sensitivity of 0.7821 for epidural hemorrhage reflects the intrinsic difficulty of detecting a class represented by only 459 positive test slices, which is consistent with the literature [3, 5].

5.2 Domain Shift and Cross-Dataset Generalization

The most notable finding was complete threshold failure on the CQ500: all sigmoid outputs fell below 0.5, rendering the model clinically inoperative without recalibration. The discriminative AUC-ROC of 0.7276 nevertheless indicates that the model retains some separation power on the external data.
Domain shift is attributable to multiple concurrent factors: (i) image format differences: JPEG-compressed RSNA data versus raw DICOM CQ500 acquisitions, producing systematically lower sigmoid outputs; (ii) scanner heterogeneity: CQ500 was acquired on Siemens and Philips scanners across varied slice thicknesses (1.25–5 mm) and reconstruction kernels, versus the curated RSNA multi-institutional collection; (iii) differences in slice thickness altering partial volume effects, particularly for thin extra-axial collections; (iv) population-level differences in skull morphology, brain atrophy, and hemorrhage etiology; and (v) structural label noise differences: RSNA labels propagated from the study level versus CQ500’s majority-vote study-level labels, potentially contributing to conservative sigmoid outputs. These findings reinforce the conclusions of Voter et al. [5] and Salehinejad et al. [7] that benchmark performance does not predict deployment performance and highlight threshold recalibration as a mandatory predeployment step.

Intraventricular hemorrhage showed the most consistent cross-dataset performance (AUC-ROC 0.8079, sensitivity 0.731 at the RSNA-derived threshold) and represented the most clinically deployable output class under domain shift. The anatomically distinctive periventricular distribution of intraventricular blood produces a high-contrast signal that is relatively robust to scanner and acquisition variation, a finding with practical implications for triage prioritization in LMIC deployment. The epidural AUC-ROC for the CQ500 (0.6772, 95% CI 0.496–0.858) is statistically unreliable given that only 12 positive studies exist and should not be used for subtype-specific comparisons.

5.3 Epidural Hemorrhage: Rare Subtype Performance

Epidural hemorrhage presents a dual challenge: extreme class imbalance (0.42%, w = 39.51) and a high dependence on anatomically precise skull–brain interface signals.
The class-balanced weighting formula successfully suppressed false positives (specificity of 0.9979 internally), but the sensitivity (0.7821) reflects the difficulty of correctly classifying a small number of positive slices. A dataset with substantially greater epidural incidence is needed for robust cross-domain assessment.

5.4 Grad-CAM and clinical interpretability

The two-rater Grad-CAM evaluation (n = 60, κ = 0.653, substantial agreement) provides credible evidence that the model’s attention is broadly consistent with the expected anatomical distribution of each ICH subtype. The combined mean score of 3.81 ± 0.86/5 (ranging from 3.30 to 4.40 across subtypes) indicates that both radiologists consistently judged model attention to lie predominantly within the hemorrhage region. These findings are consistent with those of Lee et al. [10] and Kim et al. [19], who demonstrated anatomically grounded Grad-CAM attention in CNN-based ICH classifiers. Notably, the intraparenchymal subtype had the highest mean (4.40/5), which is consistent with its large, visually distinctive parenchymal signal; the subdural subtype achieved perfect interrater agreement (κ = 1.000), which is consistent with the unambiguous crescent-shaped convexity pattern; and any_ich scored lowest (3.30/5, κ = 0.467), which was expected given its heterogeneous composite nature. The evaluation is limited to predictions with probabilities ≥ 0.5; attention near the decision boundary was not assessed and remains a direction for future work.

5.5 Limitations

Label noise (slice-level label propagation). The RSNA 2019 dataset provides only study-level labels. Slice-level training labels are generated by propagating the study-level label to every slice in a study, meaning that every slice in a positive study is labeled positive regardless of whether hemorrhage is visible on that slice, and every slice in a negative study is labeled negative. This introduces bidirectional label noise.
Given the focal and spatially limited nature of most intracranial hemorrhages relative to standard CT slice thickness (typically 5 mm), it is clinically reasonable to expect that only a minority of slices within a positive study contain visible hemorrhage, implying that the majority of ‘positive’ training slices may not exhibit visible pathology. This noise almost certainly inflates slice-level specificity and sensitivity estimates and likely inflates the AUC-ROC relative to what a model trained on slice-confirmed labels would achieve. It may also encourage the model to learn study-level features (e.g., patient-specific anatomy, scanner artifacts) rather than slice-level pathology. The internal AUC-ROC of 0.9835 should therefore be interpreted as an upper bound estimate. A fully slice-annotated dataset, such as a DICOM-native dataset with per-slice radiologist annotation, is needed to obtain unbiased slice-level performance estimates.

Other limitations. The model operates at the slice level without interslice volumetric context, limiting detection of small or thin hemorrhages more apparent across consecutive slices. A lightweight Otsu-based skull mask was used for quality control rather than a validated deep learning skull stripping tool (e.g., HD-BET), which may affect parenchymal signal isolation. The study-level split is a proxy for patient-level separation in this deidentified dataset. JPEG compression at a quality factor of 95 was applied to training data only; the mismatch with raw DICOM external data was a primary contributor to domain shift and threshold failure. The Grad-CAM radiologist assessment (n = 60, two raters) is limited to predictions with probabilities ≥ 0.5; model attention near the decision boundary was not assessed. The external validation was limited to a single external cohort (CQ500).

5.6 Future directions

Domain adaptation techniques should be evaluated to mitigate the JPEG-to-DICOM intensity distribution mismatch.
Candidate approaches include histogram matching (aligning CQ500 intensity histograms to the RSNA training distribution before inference), standardization (z-score normalization applied to HU values prior to windowing), and test-time adaptation methods that adjust the model's feature statistics to the target domain. Threshold calibration on a small institutional validation set should be formalized as a mandatory predeployment step. Extension to three-dimensional volumetric models would provide the interslice context that is currently unavailable. Prospective clinical studies at PAHS and other Nepali emergency departments would directly assess the pipeline's utility in LMIC settings. The preliminary Grad-CAM radiologist assessment should be extended to a larger multirater study with formal interrater reliability measurements across the full range of predicted probabilities. Validation on additional external cohorts with independently acquired DICOM data would further characterize cross-domain generalization.

6. Conclusion

EfficientNet-B4, developed on 752,803 RSNA 2019 CT slices (526,962 of them used for training) with class-balanced weighted loss, achieved a macro AUC-ROC of 0.9835 (95% CI 0.9821–0.9847) on the internal test set. External validation on CQ500 revealed complete threshold failure under domain shift, a critical negative finding that must be foregrounded in any deployment context. The threshold-independent macro AUC-ROC of 0.7276 indicated retained discriminative capacity, with intraventricular hemorrhage showing the most consistent cross-dataset generalization. The Grad-CAM attention maps received a mean radiologist localization score of 3.81/5 across all six output classes, supporting anatomically plausible model attention. These findings establish threshold recalibration on a representative local validation set as a nonnegotiable prerequisite for deployment, and the open-source pipeline is released to facilitate adaptation in resource-limited healthcare settings.
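To make the recalibration recommendation concrete, the following is a minimal sketch of selecting a per-class decision threshold on a local validation set by maximizing Youden's J (sensitivity + specificity − 1). It is an illustration under stated assumptions, not the released pipeline: the function names and the toy scores are hypothetical, NumPy is assumed to be available, and Youden's J is only one of several reasonable calibration criteria.

```python
import numpy as np


def youden_threshold(y_true, y_score):
    """Return (threshold, J) maximizing Youden's J = sensitivity + specificity - 1."""
    y_true = np.asarray(y_true, dtype=bool)
    y_score = np.asarray(y_score, dtype=float)
    n_pos = y_true.sum()
    n_neg = (~y_true).sum()
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative case")
    best_t, best_j = None, -np.inf
    # Every observed score is a candidate cut-off; predict positive when score >= t.
    for t in np.unique(y_score):
        pred = y_score >= t
        sens = (pred & y_true).sum() / n_pos
        spec = (~pred & ~y_true).sum() / n_neg
        j = sens + spec - 1.0
        if j > best_j:  # ties keep the lowest threshold
            best_t, best_j = float(t), float(j)
    return best_t, best_j


def sensitivity_at(y_true, y_score, threshold):
    """Sensitivity when predicting positive for scores >= threshold."""
    y_true = np.asarray(y_true, dtype=bool)
    pred = np.asarray(y_score, dtype=float) >= threshold
    return float((pred & y_true).sum() / y_true.sum())


# Hypothetical study-level scores compressed well below 0.5, as can occur under domain shift.
y_true = [0, 0, 0, 1, 1, 1]
y_score = [0.02, 0.05, 0.08, 0.12, 0.21, 0.31]

print(sensitivity_at(y_true, y_score, 0.5))  # 0.0: the fixed 0.5 threshold misses every positive
print(youden_threshold(y_true, y_score))     # (0.12, 1.0): recalibrated threshold separates the toy data
```

In practice the function would be applied separately to each output class, and the chosen thresholds should then be checked on data not used for calibration.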
Declarations

Ethics statement: This study used only publicly available, fully deidentified datasets (RSNA 2019 and CQ500). No patient contact or new data collection was performed. Ethical approval was not required.

Data availability: RSNA 2019: https://www.kaggle.com/c/rsna-intracranial-hemorrhage-detection. CQ500: http://headctstudy.qure.ai. Code and model weights: https://github.com/lochanshrestha-dev/ich-detection-efficientnet-

Author contributions (CRediT): Lochan Shrestha: Conceptualization, Methodology, Software, Validation, Formal analysis, Investigation, Data curation, Writing – Original Draft, Writing – Review & Editing, Visualization. Huma Subhani: Conceptualization, Resources, Writing – Review & Editing, Supervision, Project administration.

Generative AI and AI-assisted technologies in the writing process: During the preparation of this work, the authors used Claude (Anthropic) to assist with manuscript drafting, structural editing, and revision. The authors reviewed, edited, and critically evaluated all AI-assisted content and take full responsibility for the integrity, accuracy, and scientific validity of the published article. No AI tool was used in the design, conduct, or analysis of the study, in the interpretation of results, or in the clinical assessment of Grad-CAM heatmaps.

Competing interests: The authors declare that they have no competing interests.

Funding: No external funding was received for this study.

Acknowledgments: The authors thank the RSNA and Qure.ai for making their datasets publicly available and Kaggle for providing computational resources.

References

1. Caceres, J. A. & Goldstein, J. N. Intracranial hemorrhage. Emerg. Med. Clin. North Am. 30, 771–794 (2012).
2. van Asch, C. J. et al. Incidence, case fatality, and functional outcome of intracerebral hemorrhage over time, according to age, sex, and ethnic origin: a systematic review and meta-analysis. Lancet Neurol. 9, 167–176 (2010).
3. Karamian, A. & Seifi, A. Diagnostic Accuracy of Deep Learning for Intracranial Hemorrhage Detection in Non-Contrast Brain CT Scans: A Systematic Review and Meta-Analysis. J. Clin. Med. 14, 2377 (2025).
4. Kuo, W., Häne, C., Mukherjee, P., Malik, J. & Yuh, E. L. Expert-level detection of acute intracranial hemorrhage on head computed tomography using deep learning. Proc. Natl. Acad. Sci. 116, 22737–22745 (2019).
5. Voter, A. F., Meram, E., Garrett, J. W. & Yu, J. P. J. Diagnostic Accuracy and Failure Mode Analysis of a Deep Learning Algorithm for the Detection of Intracranial Hemorrhage. J. Am. Coll. Radiol. 18, 1143–1152 (2021).
6. Nada, A. et al. External validation and performance analysis of a deep learning-based model for the detection of intracranial hemorrhage. Neuroradiol. J. 38, 312–321 (2025).
7. Salehinejad, H. et al. A real-world demonstration of machine learning generalizability in the detection of intracranial hemorrhage on head computerized tomography. Sci. Rep. 11, 17051 (2021).
8. Lin, T. Y., Goyal, P., Girshick, R., He, K. & Dollar, P. Focal Loss for Dense Object Detection. in 2980–2988 (2017).
9. Flanders, A. E. et al. Construction of a Machine Learning Dataset through Collaboration: The RSNA 2019 Brain CT Hemorrhage Challenge. Radiol. Artif. Intell. 2, e190211 (2020).
10. Lee, H. et al. An explainable deep-learning algorithm for the detection of acute intracranial hemorrhage from small datasets. Nat. Biomed. Eng. 3, 173–182 (2019).
11. Chilamkurthy, S. et al. Deep learning algorithms for detection of critical findings in head CT scans: a retrospective study. Lancet 392, 2388–2396 (2018).
12. Burduja, M., Ionescu, R. T. & Verga, N. Accurate and Efficient Intracranial Hemorrhage Detection and Subtype Classification in 3D CT Scans with Convolutional and Long Short-Term Memory Neural Networks. Sensors 20, 5611 (2020).
13. D'Angelo, T. et al. Accuracy and time efficiency of a novel deep learning algorithm for Intracranial Hemorrhage detection in CT Scans. Radiol. Med. (Torino) 129, 1499–1506 (2024).
14. Kang, D. W. et al. Strengthening deep-learning models for intracranial hemorrhage detection: strongly annotated computed tomography images and model ensembles. Front. Neurol. 14, 1321964 (2023).
15. Tan, M. & Le, Q. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. in Proceedings of the 36th International Conference on Machine Learning 6105–6114 (PMLR, 2019).
16. Park, S. H. et al. Comparison between single and serial computed tomography images in classification of acute appendicitis, acute right-sided diverticulitis, and normal appendix using EfficientNet. PLOS ONE 18, e0281498 (2023).
17. Kok, Y. E. et al. Semantic Segmentation of Spontaneous Intracerebral Hemorrhage, Intraventricular Hemorrhage, and Associated Edema on CT Images Using Deep Learning. Radiol. Artif. Intell. 4, e220096 (2022).
18. Selvaraju, R. R. et al. Grad-CAM: Visual Explanations From Deep Networks via Gradient-Based Localization. in 618–626 (2017).
19. Kim, K. H., Koo, H. W., Lee, B. J., Yoon, S. W. & Sohn, M. J. Cerebral hemorrhage detection and localization with medical imaging for cerebrovascular disease diagnosis and treatment using explainable deep learning. J. Korean Phys. Soc. 79, 321–327 (2021).
20. Altuve, M. & Pérez, A. Intracerebral hemorrhage detection on computed tomography images using a residual neural network. Phys. Med. 99, 113–119 (2022).
21. Inkeaw, P. et al. Automatic hemorrhage segmentation on head CT scan for traumatic brain injury using 3D deep learning model. Comput. Biol. Med. 146, 105530 (2022).
22. Vidhya, V. et al. YOLOv5s-CAM: A Deep Learning Model for Automated Detection and Classification for Types of Intracranial Hematoma in CT Images. IEEE Access 11, 141309–141328 (2023).
23. Deng, J. et al. ImageNet: A Large-Scale Hierarchical Image Database.
24. Wightman, R. PyTorch Image Models. https://doi.org/10.5281/zenodo.4414861 (2026).
25. Paszke, A. et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. in Advances in Neural Information Processing Systems vol. 32 (Curran Associates, Inc., 2019).
26. Loshchilov, I. & Hutter, F. Decoupled Weight Decay Regularization. in (2018).
27. Micikevicius, P. et al. Mixed Precision Training. in (2018).
28. Landis, J. R. & Koch, G. G. The Measurement of Observer Agreement for Categorical Data. Biometrics 33, 159–174 (1977).
29. Cohen, J. A Coefficient of Agreement for Nominal Scales. Educ. Psychol. Meas. 20, 37–46 (1960).

Editorial history: First submitted to journal 13 May, 2026. Submission checks completed at journal 14 May, 2026. Editor assigned by journal 14 May, 2026. Editorial decision: Revision requested 17 May, 2026.
Also discoverable on Platform About Our Team In Review Editorial Policies Advisory Board Help Center Resources Author Services Accessibility API Access RSS feed Manage Cookie Preferences © Research Square 2026 | ISSN 2693-5015 (online) Privacy Policy Terms of Service Do Not Sell My Personal Information {"props":{"pageProps":{"initialData":{"identity":"rs-9705343","acceptedTermsAndConditions":true,"allowDirectSubmit":false,"archivedVersions":[],"articleType":"Article","associatedPublications":[],"authors":[{"id":641595973,"identity":"02fa5afd-bb94-45c7-8966-e62a94fcd574","order_by":0,"name":"Lochan Shrestha","email":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAZAAAAAyAQMAAABI0h/eAAAABlBMVEX///8AAABVwtN+AAAACXBIWXMAAA7EAAAOxAGVKw4bAAABFUlEQVRIiWNgGAWjYBACPgYGNhCdACKYGRhsGBgkQEw23FrYkLQwNjMwpJGu5TARWtibnz34uaMuz7y9/fnjgprzefyz2x8wfCg7zGAu3YBdC88xc8PeM4eLZc6cMWyecex2scSdMwaMM84dZrCccwC7FokcNgnetgOJMyRyGJt52G4nNtzIYWDmbTvMYHAjAacWyb9tdYkz5J8/bOb5dy5x/o30B8x/CWiR5m1jBtrCYNgMsm7DjQQDZkZ8WniOmUnLth1OnMGTYzibty+52PBGjsHBnnPpPAZ3sPuFHxhikm9BDmM//uAzzze7PLkb6Q8f/CizljO4jT3EMADYMSDjeSARRKwWCCBWyygYBaNgFAx3AADNT2EkT4sltwAAAABJRU5ErkJggg==","orcid":"","institution":"Patan Academy of Health Sciences","correspondingAuthor":true,"prefix":"","firstName":"Lochan","middleName":"","lastName":"Shrestha","suffix":""},{"id":641595976,"identity":"010fe1f0-e4a5-47c0-bea9-3f9e4e58e848","order_by":1,"name":"Huma Subhani","email":"","orcid":"","institution":"Patan Academy of Health Sciences","correspondingAuthor":false,"prefix":"","firstName":"Huma","middleName":"","lastName":"Subhani","suffix":""}],"badges":[],"createdAt":"2026-05-13 15:05:31","currentVersionCode":1,"declarations":"","doi":"10.21203/rs.3.rs-9705343/v1","doiUrl":"https://doi.org/10.21203/rs.3.rs-9705343/v1","draftVersion":[],"editorialEvents":[],"editorialNote":"","failedWorkflow":false,"files":[{"id":109759608,"identity":"1c2457c8-8e9d-47d4-a329-423009e14103","added_by":"auto","created_at":"2026-05-22 
07:27:26","extension":"png","order_by":1,"title":"Figure 1","display":"","copyAsset":false,"role":"figure","size":676399,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cem\u003eIntracranial hemorrhage (ICH) classification pipeline. he system processes raw noncontrast DICOM CT slices through a 3-window RGB encoding strategy (brain, subdural, and bone windows) and standardizes them via 256 × 256 resizing and ImageNet normalization. These processed images are analyzed by an EfficientNet-B4 backbone to extract 1,792 features, which are then passed through a classification head—comprising a 0.3 dropout layer and a linear transformation—to predict six distinct hemorrhage categories via sigmoid activation. Model transparency is provided by Grad-CAM class heatmaps generated from the final convolutional block.\u003c/em\u003e\u003c/p\u003e","description":"","filename":"floatimage1.png","url":"https://assets-eu.researchsquare.com/files/rs-9705343/v1/c76c7fbc2b61c3fd1959ea24.png"},{"id":109760188,"identity":"e06a406a-b01e-4ab4-9598-e8034c3313cc","added_by":"auto","created_at":"2026-05-22 07:28:17","extension":"png","order_by":2,"title":"Figure 2","display":"","copyAsset":false,"role":"figure","size":254090,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cem\u003eTraining dynamics: EfficientNet-B4 on RSNA 2019. Left y-axis (red): training loss. Right y-axis (blue): validation macro AUC-ROC curve. Dashed vertical line: best checkpoint (epoch 6, AUC = 0.9837). Early stopping at epoch 13.\u003c/em\u003e\u003c/p\u003e","description":"","filename":"floatimage2.png","url":"https://assets-eu.researchsquare.com/files/rs-9705343/v1/d18232d0e93bc88a31c7ec3f.png"},{"id":109759464,"identity":"1cde299a-6767-4cf9-8afc-5e6db8c1d5b9","added_by":"auto","created_at":"2026-05-22 07:27:07","extension":"png","order_by":3,"title":"Figure 3","display":"","copyAsset":false,"role":"figure","size":283214,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cem\u003eEmpirical ROC curves. 
Left: RSNA 2019 internal test set (slice level, n = 112,921), macro AUC = 0.983. Right: CQ500 external validation (study-level, n = 473), macro AUC = 0.728. The stepped appearance of the CQ500 curves reflects small positive study counts per class. Diagonal dashed line: reference classifier (AUC = 0.50).\u003c/em\u003e\u003c/p\u003e","description":"","filename":"floatimage3.png","url":"https://assets-eu.researchsquare.com/files/rs-9705343/v1/03fde84b425ec86d5b65d256.png"},{"id":109437359,"identity":"9982175d-1b47-4331-9bde-864edfce40b1","added_by":"auto","created_at":"2026-05-18 06:31:35","extension":"png","order_by":4,"title":"Figure 4","display":"","copyAsset":false,"role":"figure","size":457429,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eFigure 5. \u003c/strong\u003e\u003cem\u003eAUC-ROC comparison: RSNA internal test set (blue bars, slice level, threshold = 0.50) vs CQ500 external validation (amber bars, study level, AUC-ROC only). Error bars: 95% bootstrap CI (CQ500). Horizontal dashed line: Kang \u003c/em\u003eet al\u003cem\u003e. 2023 benchmark (AUC = 0.953).\u003c/em\u003e\u003c/p\u003e","description":"","filename":"floatimage4.png","url":"https://assets-eu.researchsquare.com/files/rs-9705343/v1/3f55ea5e854e3a0cd0cd1520.png"},{"id":109760331,"identity":"710c5891-95e8-43d4-9c5e-b80a40671d2a","added_by":"auto","created_at":"2026-05-22 07:28:33","extension":"png","order_by":5,"title":"Figure 5","display":"","copyAsset":false,"role":"figure","size":364326,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eFigure 6. \u003c/strong\u003e\u003cem\u003ePredicted sigmoid probability distributions for the RSNA 2019 internal test set (blue, slice-level, n = 112,921) and the CQ500 external validation set (amber, study-level, n = 473) per ICH subtype. The RSNA distribution spans the full [0, 1] range, indicating a well-calibrated internal classifier. 
The CQ500 predictions are entirely concentrated near zero, with no class exceeding the decision threshold of 0.5 (red dashed line). The per-class maximum CQ500 values were as follows: any_ich 0.468, epidural 0.029, intraparenchymal 0.313, intraventricular 0.276, subarachnoid 0.226, and subdural 0.261. The epidural maximum of 0.029 directly explains the degenerate Youden threshold of 0.000 in Table 4. This systematic compression of CQ500 predictions is consistent with the JPEG-to-DICOM intensity distribution mismatch identified as the primary driver of domain shift (Section 5.2).\u003c/em\u003e\u003c/p\u003e","description":"","filename":"floatimage5.png","url":"https://assets-eu.researchsquare.com/files/rs-9705343/v1/7054b1f87d142eab59ad72c7.png"},{"id":109906468,"identity":"7f0e7336-6129-4657-a39a-29f3c74e0f5e","added_by":"auto","created_at":"2026-05-25 06:40:19","extension":"png","order_by":6,"title":"Figure 6","display":"","copyAsset":false,"role":"figure","size":946064,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eFigure 4. \u003c/strong\u003e\u003cem\u003eGrad-CAM class-specific attention heatmaps (highest-confidence prediction per ICH subtype, predicted probability ≈ 1.00). Jet color map (blue = low, red = high activation) overlaid on the bone window channel at 50% opacity. Expected anatomical localization confirmed for all subtypes. Full evaluation: Sixty randomly selected true-positive predictions (10 per class) were scored by two independent radiologists; the combined mean score was 3.81 ± 0.86 / 5 (κ = 0.653). 
See Table 5 for per-class breakdown.\u003c/em\u003e\u003c/p\u003e","description":"","filename":"floatimage6.png","url":"https://assets-eu.researchsquare.com/files/rs-9705343/v1/655ad17b0b72dccd69f96ddf.png"},{"id":109907964,"identity":"7d5cd53d-8e8c-4355-b72d-d219a5c01295","added_by":"auto","created_at":"2026-05-25 06:46:14","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":3336589,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-9705343/v1/cd38e297-88c3-4552-ae27-b2eac3e9e9f2.pdf"}],"financialInterests":"No competing interests reported.","formattedTitle":"Multi-Dataset Generalization and Explainability in AI-Based Intracranial Hemorrhage Detection from Noncontrast CT Using EfficientNet-B4","fulltext":[{"header":"1. Introduction","content":"\u003cp\u003eIntracranial hemorrhage (ICH) is a neurological emergency associated with mortality exceeding 40% and substantial long-term disability among survivors \u003csup\u003e\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e,\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e\u003c/sup\u003e. Rapid and accurate diagnosis is critical: delayed identification is directly associated with worse neurological outcomes \u003csup\u003e\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e\u003c/sup\u003e. Noncontrast computed tomography (NCCT) is the primary imaging modality for the initial assessment of suspected ICH, owing to its widespread availability, speed, and high sensitivity for acute blood products. Accurate interpretation demands the recognition of five radiologically and clinically distinct subtypes (epidural, intraparenchymal, intraventricular, subarachnoid, and subdural hemorrhage), each with different etiologies, anatomical distributions, and management implications. 
This interpretive burden is compounded in high-volume emergency settings and in low- and middle-income countries (LMICs), where specialist neuroradiology coverage may be limited.\u003c/p\u003e \u003cp\u003eDeep learning has demonstrated strong diagnostic performance for ICH detection from NCCT. A 2025 systematic review and meta-analysis of 58 studies reported a pooled sensitivity of 0.92, specificity of 0.94, and AUC-ROC of 0.96 \u003csup\u003e3\u003c/sup\u003e. Landmark studies have achieved AUC-ROC curves as high as 0.991 \u003csup\u003e4\u003c/sup\u003e, and reader studies have confirmed that DL assistance improves diagnostic accuracy \u003csup\u003e\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e\u003c/sup\u003e. Multiple dataset ensemble approaches have reported an AUC-ROC of 0.953 \u003csup\u003e5\u003c/sup\u003e, which serves as the primary benchmark for the present work.\u003c/p\u003e \u003cp\u003eDespite this progress, five important limitations persist. First, most models are validated on a single dataset, and few open-source systems demonstrate robust generalization across heterogeneous CT acquisitions from geographically distinct populations \u003csup\u003e\u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e6\u003c/span\u003e,\u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e\u003c/sup\u003e. Second, epidural hemorrhage, the rarest ICH subtype (\u0026lt;\u0026thinsp;0.5% of slices), receives minimal attention. Third, the severe class imbalance inherent to ICH datasets is inconsistently addressed \u003csup\u003e\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e\u003c/sup\u003e. Fourth, structured radiologist scoring of model attention maps has not been reported. 
Fifth, open reproducible pipelines enabling replication in LMIC contexts remain scarce.\u003c/p\u003e \u003cp\u003eTo address these gaps, we present an open EfficientNet-B4 pipeline trained on 752,803 CT slices from RSNA 2019 dataset \u003csup\u003e\u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e\u003c/sup\u003e and externally validated on CQ500 \u003csup\u003e10\u003c/sup\u003e. Our contributions are as follows: (i) cross-dataset generalization with transparent reporting of domain shifts; (ii) Grad-CAM explainability with radiologist localization scoring; (iii) class-balanced imbalance handling with verified numerical accuracy; and (iv) full public release of code and model weights.\u003c/p\u003e"},{"header":"2. Related Work","content":"\u003cdiv id=\"Sec3\" class=\"Section2\"\u003e \u003ch2\u003e2.1 Deep Learning Architectures for ICH Detection\u003c/h2\u003e \u003cp\u003eThe application of deep learning to ICH detection from NCCT has advanced rapidly since Chilamkurthy et al. demonstrated CNN-based detection across 313,000 head CT scans with an AUC of 0.91\u0026ndash;0.97 on the CQ500 validation set \u003csup\u003e\u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e\u003c/sup\u003e. Kuo et al. reported an AUC-ROC of 0.991 on 4,396 CT scans, demonstrating expert-level performance \u003csup\u003e\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e\u003c/sup\u003e. The RSNA 2019 Challenge \u003csup\u003e\u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e\u003c/sup\u003e standardized evaluation on the largest publicly available annotated CT dataset. Burduja et al. finished in the top 2% of 1,345 teams, integrating Grad-CAM visualizations \u003csup\u003e\u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e12\u003c/span\u003e\u003c/sup\u003e. Postchallenge, D\u0026rsquo;Angelo et al. 
evaluated Dense-UNet on 502 NCCT scans with an accuracy of 91.24% \u003csup\u003e13\u003c/sup\u003e. Ensemble methods have consistently outperformed single-model approaches \u003csup\u003e\u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e\u003c/sup\u003e. EfficientNet \u003csup\u003e\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e\u003c/sup\u003e, designed with compound scaling, has demonstrated strong medical imaging performance \u003csup\u003e\u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e\u003c/sup\u003e.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec4\" class=\"Section2\"\u003e \u003ch2\u003e2.2 Multilabel Subtype Classification and Class Imbalance\u003c/h2\u003e \u003cp\u003eMultilabel binary cross-entropy formulations better reflect the clinical reality that a single CT study may reveal multiple concurrent hemorrhage types \u003csup\u003e\u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e\u003c/sup\u003e. Class imbalance poses a compounding difficulty, particularly for epidural hemorrhage. Strategies include focal loss \u003csup\u003e\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e\u003c/sup\u003e, oversampling, and weighted loss functions. Kok et al. demonstrated the effectiveness of focal loss for intraventricular hemorrhage segmentation \u003csup\u003e\u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e\u003c/sup\u003e. 
In the present work, we adopt a class-balanced inverse frequency weighting formula and verify it numerically against logged training values.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec5\" class=\"Section2\"\u003e \u003ch2\u003e2.3 Cross-Dataset Generalization\u003c/h2\u003e \u003cp\u003eGeneralization to external cohorts is performed in only a minority of published ICH studies \u003csup\u003e\u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e\u003c/sup\u003e. Salehinejad et al. reported an AUC-ROC of 95.4% for external validation, with a drop from 98.4% internally \u003csup\u003e\u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e\u003c/sup\u003e. Voter et al. identified failure modes of a validated algorithm during prospective deployment \u003csup\u003e\u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e\u003c/sup\u003e. Nada et al. performed external validation on 5,600 heterogeneous CTs (AUC-ROC 0.954) \u003csup\u003e\u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e6\u003c/span\u003e\u003c/sup\u003e. The CQ500 dataset \u003csup\u003e\u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e\u003c/sup\u003e, collected in New Delhi, India, has become the most widely used external benchmark for models trained on North American data.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec6\" class=\"Section2\"\u003e \u003ch2\u003e2.4 Explainability in Brain CT AI\u003c/h2\u003e \u003cp\u003eGrad-CAM \u003csup\u003e\u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e\u003c/sup\u003e has become the most widely applied post hoc explanation method for CNN-based medical image classifiers. Lee et al. demonstrated that saliency-based explanations improved clinician trust in ICH detection \u003csup\u003e\u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e10\u003c/span\u003e\u003c/sup\u003e. Kim et al. 
applied Grad-CAM to all six ICH subtypes via ResNet, providing a proof-of-concept \u003csup\u003e\u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e19\u003c/span\u003e\u003c/sup\u003e. Altuve and P\u0026eacute;rez conducted qualitative heatmap validation for ICH detection \u003csup\u003e\u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e20\u003c/span\u003e\u003c/sup\u003e. The present study extends this work by conducting formal radiologist scoring across all six output classes.\u003c/p\u003e \u003c/div\u003e"},{"header":"3. Materials and methods","content":"\u003cp\u003e \u003cb\u003eReporting standards.\u003c/b\u003e This study follows the TRIPOD-AI and CLAIM (Checklist for Artificial Intelligence in Medical Imaging) reporting guidelines for diagnostic AI studies. A supplementary checklist is available from the authors upon request.\u003c/p\u003e \u003cdiv id=\"Sec8\" class=\"Section2\"\u003e \u003ch2\u003e3.1 Datasets\u003c/h2\u003e \u003cp\u003e \u003cb\u003eRSNA 2019 Intracranial Hemorrhage Detection Dataset.\u003c/b\u003e The primary training dataset was sourced from the RSNA 2019 Brain CT Hemorrhage Challenge \u003csup\u003e\u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e\u003c/sup\u003e. After slice-level labels were extracted, the dataset comprised 752,803 CT slices across six binary labels: any_ich, epidural, intraparenchymal, intraventricular, subarachnoid, and subdural hemorrhage. The RSNA 2019 dataset provides study-level annotations; slice-level labels were generated by propagating the study-level label to all slices within a study. This introduces label noise, as hemorrhagic subtypes may be visible in only a subset of slices within a positive study (discussed in Section \u003cspan refid=\"Sec28\" class=\"InternalRef\"\u003e5.5\u003c/span\u003e). 
The raw DICOM files were converted to JPEG format (quality factor 95) prior to training.\u003c/p\u003e \u003cp\u003e \u003cb\u003eCQ500 Dataset.\u003c/b\u003e CQ500 \u003csup\u003e11\u003c/sup\u003e comprises 491 noncontrast head CT studies collected in New Delhi, India. Of these, 18 studies (3.7%) were excluded because of the absence of DICOM folders in the Kaggle-hosted version of the dataset; no studies were excluded on a clinical basis. The remaining 473 studies were used for external validation. ICH was present in 205 studies (43.3%) and absent in 268 (56.7%). Each study was annotated by three radiologists; the majority vote (\u0026ge;\u0026thinsp;2/3) was the ground truth. CQ500 was used exclusively for external validation.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eAll the RSNA 2019 data were split at the study level: 70% training/15% validation/15% testing, yielding 526,962/112,920/112,921 slices, respectively. An assertion verified that no StudyInstanceUID appeared in more than one partition. StudyInstanceUID was used as the primary split key, constituting the closest available proxy for patient-level separation in this deidentified dataset.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec9\" class=\"Section2\"\u003e \u003ch2\u003e3.3 Class distribution and imbalance handling\u003c/h2\u003e \u003cp\u003eThe RSNA training set exhibits severe class imbalance (Table\u0026nbsp;\u003cspan refid=\"Tab1\" class=\"InternalRef\"\u003e1\u003c/span\u003e). Binary cross-entropy loss with per-class positive weighting (BCEWithLogitsLoss, pos_weight) was applied during training. 
The positive class weight was computed as:\u003c/p\u003e \u003cp\u003e \u003cb\u003ew\u003c/b\u003e \u003cem\u003ec\u003c/em\u003e\u0026thinsp;=\u0026thinsp;\u003cem\u003eN\u003c/em\u003e/(\u003cem\u003eNc\u003c/em\u003e \u0026times; \u003cem\u003eC\u003c/em\u003e)\u003c/p\u003e \u003cp\u003ewhere \u003cem\u003eN\u003c/em\u003e is the total number of training slices, \u003cem\u003eNc\u003c/em\u003e is the number of positive slices for class \u003cem\u003ec\u003c/em\u003e, and \u003cem\u003eC\u003c/em\u003e\u0026thinsp;=\u0026thinsp;6 is the number of output classes. This ensures that each class contributes equally to the aggregate loss at initialization. Weights ranged from 1.16 (any_ich) to 39.51 (epidural). The formula was confirmed against the logged training output for all six classes. Implementation: \u003cspan fontcategory=\"NonProportional\" class=\"\" name=\"Emphasis\"\u003e`pos_weight\u003c/span\u003e \u0026thinsp; \u003cspan fontcategory=\"NonProportional\" class=\"\" name=\"Emphasis\"\u003e=\u003c/span\u003e \u0026thinsp; \u003cspan fontcategory=\"NonProportional\" class=\"\" name=\"Emphasis\"\u003etotal/(pos_counts * num_classes\u003c/span\u003e \u0026thinsp; \u003cspan fontcategory=\"NonProportional\" class=\"\" name=\"Emphasis\"\u003e+\u003c/span\u003e \u0026thinsp; \u003cspan fontcategory=\"NonProportional\" class=\"\" name=\"Emphasis\"\u003e1e-6)`\u003c/span\u003e. The ε\u0026thinsp;=\u0026thinsp;1e-6 term is a numerical stabilizer used to prevent division by zero for any class with zero positive samples in the training set (not applicable here but included for robustness). 
This approach was preferred over focal loss \u003csup\u003e\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e,\u003c/sup\u003e as it requires no additional hyperparameter tuning.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab1\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 1\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eClass distribution in the RSNA 2019 training set (n\u0026thinsp;=\u0026thinsp;526,962 slices).\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"5\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eHemorrhage subtype\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003ePositive slices\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003ePrevalence (%)\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003epos_weight (wc)\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c5\"\u003e \u003cp\u003en_test pos.\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eany_ich\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e75,553\u003c/p\u003e \u003c/td\u003e \u003ctd 
align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e14.34\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e1.16\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e16,084\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eepidural\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e2,223\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.42\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e39.51\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e459\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eintraparenchymal\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e25,258\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e4.80\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e3.48\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e5,279\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eintraventricular\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e18,383\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e3.49\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e4.78\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e3,900\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e 
\u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003esubarachnoid\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e24,952\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e4.74\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e3.52\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e5,431\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003esubdural\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e33,108\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e6.29\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e2.65\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e7,043\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003e \u003cem\u003en_test pos. : positive slices in the held-out RSNA test set. C\u0026thinsp;=\u0026thinsp;6 classes.\u003c/em\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec10\" class=\"Section2\"\u003e \u003ch2\u003e3.4 CT preprocessing and windowing\u003c/h2\u003e \u003cp\u003eNoncontrast CT images were preprocessed via a three-window RGB encoding strategy \u003csup\u003e\u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e10\u003c/span\u003e,\u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e,\u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e22\u003c/span\u003e\u003c/sup\u003e. 
Three clinically standard Hounsfield unit (HU) windows were applied and stacked as separate color channels: the brain (center 40 HU, width 80 HU), subdural (center 40 HU, width 130 HU), and bone (center 400 HU, width 1800 HU) channels. For each window, raw HU pixel values were clipped to [C\u0026thinsp;\u0026minus;\u0026thinsp;W/2, C\u0026thinsp;+\u0026thinsp;W/2] and linearly rescaled to unsigned 8-bit integers [0,255] as pixel_out\u0026thinsp;=\u0026thinsp;255 \u0026times; (HU_clipped \u0026minus; (C\u0026thinsp;\u0026minus;\u0026thinsp;W/2))/W. The three resulting uint8 channels were concatenated to produce a 3-channel 512\u0026times;512 image. RSNA DICOM files were converted to JPEG format via OpenCV (cv2.imwrite, IMWRITE_JPEG_QUALITY\u0026thinsp;=\u0026thinsp;95) prior to training; this is a lossy compression step that quantizes the 8-bit windowed pixel values, and the mismatch between JPEG-compressed training data and raw DICOM external data is discussed as the primary driver of domain shift in Section \u003cspan refid=\"Sec25\" class=\"InternalRef\"\u003e5.2\u003c/span\u003e. The images were downsampled to 256\u0026times;256 via area interpolation (INTER_AREA) and normalized via ImageNet channel statistics (mean = [0.485, 0.456, 0.406]; std = [0.229, 0.224, 0.225]). 
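\u003c/p\u003e \u003cp\u003eThe per-window clip-and-rescale step can be written out for a single pixel. This is a minimal pure-Python sketch under stated assumptions: window_pixel and encode_pixel are hypothetical helper names, and truncation to an integer is assumed because the text does not state a rounding rule.\u003c/p\u003e

```python
# Window settings (center, width) in HU, as stated in Section 3.4.
WINDOWS = {
    'brain': (40, 80),
    'subdural': (40, 130),
    'bone': (400, 1800),
}


def window_pixel(hu, center, width):
    """Clip one HU value to the window and rescale it to the range 0..255."""
    low = center - width / 2
    high = center + width / 2
    clipped = min(max(hu, low), high)
    # pixel_out = 255 * (HU_clipped - (C - W/2)) / W
    # Assumption: the result is truncated to an integer level.
    return int(255 * (clipped - low) / width)


def encode_pixel(hu):
    """Return the (brain, subdural, bone) channel values for one HU value."""
    return tuple(window_pixel(hu, c, w) for c, w in WINDOWS.values())


if __name__ == '__main__':
    for hu in (-1000, 40, 2000):
        print(hu, encode_pixel(hu))
```

\u003cp\u003eUnder this sketch a pixel at 40 HU maps to (127, 127, 76), while values far below or above all three windows saturate at (0, 0, 0) and (255, 255, 255). A real pipeline would presumably vectorize the same arithmetic over the full 512\u0026times;512 array, for instance with NumPy.\u003c/p\u003e \u003cp\u003e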
During quality control, a lightweight Otsu-based threshold was applied to the bone window channel to flag CT slices in which brain tissue coverage fell outside the range of 20\u0026ndash;95%; no skull stripping was applied at inference time.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec11\" class=\"Section2\"\u003e \u003ch2\u003e3.5 Model Architecture\u003c/h2\u003e \u003cp\u003eEfficientNet-B4 \u003csup\u003e15\u003c/sup\u003e, pretrained on ImageNet \u003csup\u003e\u003cspan citationid=\"CR23\" class=\"CitationRef\"\u003e23\u003c/span\u003e\u003c/sup\u003e, was loaded via the timm library \u003csup\u003e\u003cspan citationid=\"CR24\" class=\"CitationRef\"\u003e24\u003c/span\u003e\u003c/sup\u003e with the default classification head replaced by Dropout (p\u0026thinsp;=\u0026thinsp;0.3) followed by Linear (1,792, 6). For training, a batch size of 128 slices per step was used. Two NVIDIA T4 GPUs were used with PyTorch DataParallel \u003csup\u003e\u003cspan citationid=\"CR25\" class=\"CitationRef\"\u003e25\u003c/span\u003e\u003c/sup\u003e, with 64 slices distributed per GPU.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec12\" class=\"Section2\"\u003e \u003ch2\u003e3.6 Training procedure\u003c/h2\u003e \u003cp\u003eThe model was trained via AdamW \u003csup\u003e\u003cspan citationid=\"CR26\" class=\"CitationRef\"\u003e26\u003c/span\u003e\u003c/sup\u003e (learning rate of 1\u0026times;10⁻⁴, weight decay of 1\u0026times;10⁻²) with cosine annealing scheduling (T_max\u0026thinsp;=\u0026thinsp;50). Mixed-precision training \u003csup\u003e\u003cspan citationid=\"CR27\" class=\"CitationRef\"\u003e27\u003c/span\u003e\u003c/sup\u003e was enabled. Early stopping (patience\u0026thinsp;=\u0026thinsp;7 epochs) monitored the macro validation AUC-ROC. The best checkpoint stored the epoch index, model state, optimizer state, and validation AUC. 
Early termination occurred at epoch 13; the best checkpoint (epoch 6) was selected for all subsequent evaluations.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec13\" class=\"Section2\"\u003e \u003ch2\u003e3.7 Evaluation Metrics\u003c/h2\u003e \u003cp\u003ePer-class and macro-averaged AUC-ROC, sensitivity, specificity, and F1 score were computed. For the RSNA internal test set, a fixed threshold of 0.5 was applied. For CQ500, domain shift caused all sigmoid predictions to fall below 0.5 (maximum predicted probability\u0026thinsp;=\u0026thinsp;0.468), yielding zero sensitivity; this is the primary negative finding. For the secondary CQ500 evaluation, per-class Youden index thresholds were derived from the RSNA 2019 validation set and applied unchanged to the CQ500, preserving methodological independence. AUC-ROC is threshold independent and constitutes the primary cross-dataset performance measure.\u003c/p\u003e \u003cp\u003eThe RSNA 2019 dataset is fully deidentified; StudyInstanceUIDs do not reliably map to unique patients. Study-level aggregation of RSNA test set predictions is therefore not feasible. Internal evaluation was performed at the slice level (n\u0026thinsp;=\u0026thinsp;112,921), which is the native label granularity. The Kang et al. 2023 benchmark \u003csup\u003e\u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e\u003c/sup\u003e was reported at the study/exam level on a different dataset; direct numerical comparison is approximate and presented as contextual rather than equivalent.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec14\" class=\"Section2\"\u003e \u003ch2\u003e3.8 Grad-CAM Explainability\u003c/h2\u003e \u003cp\u003eGradient-weighted class activation mapping (Grad-CAM) \u003csup\u003e\u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e\u003c/sup\u003e was implemented by registering forward and backward hooks on the final convolutional block of EfficientNet-B4. 
For each target class, the gradient of the predicted logit with respect to the final convolutional feature map was globally average pooled to obtain per-channel importance weights, which were used to form a weighted sum of the feature-map channels. The resulting heatmap was ReLU-activated, upsampled to 256\u0026times;256 via bilinear interpolation, normalized to [0,1], and overlaid as a jet colormap on the bone window channel at 50% opacity. For localization quality assessment, 10 heatmaps per class (n\u0026thinsp;=\u0026thinsp;60 total) were randomly selected from true-positive test set predictions with a predicted probability\u0026thinsp;\u0026ge;\u0026thinsp;0.5, so that the selection was representative rather than biased toward the highest-confidence cases. Two board-certified radiologists, blinded to each other\u0026rsquo;s scores and to model architecture details, independently scored all 60 heatmaps on a five-point Likert scale: 1\u0026thinsp;=\u0026thinsp;attention entirely outside the hemorrhage region; 2\u0026thinsp;=\u0026thinsp;predominantly outside; 3\u0026thinsp;=\u0026thinsp;partial overlap; 4\u0026thinsp;=\u0026thinsp;predominantly within; and 5\u0026thinsp;=\u0026thinsp;attention exclusively within the hemorrhage region. Interrater reliability was quantified via Cohen\u0026rsquo;s weighted kappa (quadratic weights) and interpreted via the Landis \u0026amp; Koch scale \u003csup\u003e\u003cspan citationid=\"CR28\" class=\"CitationRef\"\u003e28\u003c/span\u003e\u003c/sup\u003e.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec15\" class=\"Section2\"\u003e \u003ch2\u003e3.9 External validation protocol\u003c/h2\u003e \u003cp\u003eThe CQ500 validation used the epoch 6 checkpoint without retraining or fine-tuning. Slice-level sigmoid predictions were aggregated to the study level via mean pooling and then compared against majority-vote study-level labels. 
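\u003c/p\u003e \u003cp\u003eThe study-level aggregation can be sketched as follows. The function names and study identifiers are hypothetical, and the majority rule shown (more than half of the readers positive) is an assumption about how the reference labels were formed; this is not the evaluation code used in the study.\u003c/p\u003e

```python
from collections import defaultdict


def mean_pool_by_study(slice_records):
    """Average slice-level probabilities per study.

    slice_records: iterable of (study_id, [p_class0, ..., p_classK]) pairs.
    Returns {study_id: [mean_p_class0, ..., mean_p_classK]}.
    """
    sums = {}
    counts = defaultdict(int)
    for study_id, probs in slice_records:
        if study_id not in sums:
            sums[study_id] = [0.0] * len(probs)
        for k, p in enumerate(probs):
            sums[study_id][k] += p
        counts[study_id] += 1
    return {sid: [s / counts[sid] for s in total] for sid, total in sums.items()}


def majority_vote(reader_labels):
    """Return 1 if more than half of the binary reader labels are positive."""
    return int(sum(reader_labels) * 2 > len(reader_labels))


if __name__ == '__main__':
    # Hypothetical slice-level outputs for two studies and two classes.
    records = [
        ('study_a', [0.2, 0.4]),
        ('study_a', [0.4, 0.0]),
        ('study_b', [0.1, 0.3]),
    ]
    print(mean_pool_by_study(records))
    print(majority_vote([1, 1, 0]))
```

\u003cp\u003eMean pooling gives every slice in a study equal weight, so a study with a few positive slices among many negative ones receives a low pooled score; this is one reason the pooled probabilities can sit well below a fixed 0.5 threshold.\u003c/p\u003e \u003cp\u003e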
The AUC-ROC with 95% bootstrap confidence intervals (n\u0026thinsp;=\u0026thinsp;1,000 study-level resamples) was the primary evaluation measure.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec16\" class=\"Section2\"\u003e \u003ch2\u003e3.10 Statistical analysis\u003c/h2\u003e \u003cp\u003e \u003cb\u003eThe\u003c/b\u003e 95% CI for AUC-ROC was estimated via bootstrap resampling (n\u0026thinsp;=\u0026thinsp;1,000; slice level for the RSNA internal test set; study level for the CQ500). All analyses were performed in Python (scipy v1.11, NumPy v1.24).\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec17\" class=\"Section2\"\u003e \u003ch2\u003e3.11 Use of AI-assisted writing tools\u003c/h2\u003e \u003cp\u003eDuring the preparation of this manuscript, the authors used Claude (claude.ai; Anthropic PBC, San Francisco, CA, USA) to assist with drafting, structural editing, and revision of the manuscript text. The AI tool was used solely to improve clarity, coherence, and language; it was not used in the design or conduct of the study, the training or evaluation of the model, the collection or analysis of data, the interpretation of results, or the clinical assessment of Grad-CAM heatmaps. All AI-assisted text was reviewed, edited, and verified by the authors, who take full responsibility for the accuracy, integrity, and scientific validity of all content in this article. Claude is not listed as an author and does not meet the ICMJE criteria for authorship.\u003c/p\u003e \u003c/div\u003e"},{"header":"4. Results","content":"\u003cdiv id=\"Sec19\" class=\"Section2\"\u003e \u003ch2\u003e4.1 Model training\u003c/h2\u003e \u003cp\u003eTraining converged rapidly, with the validation macro AUC-ROC reaching 0.9837 at epoch 6 (training loss\u0026thinsp;=\u0026thinsp;0.0553). Training loss decreased monotonically across logged epochs from 0.2002 (epoch 1) to 0.0154 (epoch 13). 
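\u003c/p\u003e \u003cp\u003eThe stopping rule (patience of 7 epochs on the validation macro AUC-ROC) can be replayed against the values logged in Table 2. The sketch below is illustrative; replay_early_stopping is a hypothetical helper rather than the training loop used in the study, and it assumes that only a strict increase counts as an improvement.\u003c/p\u003e

```python
# Validation macro AUC-ROC for epochs 1 to 13, copied from Table 2.
VAL_AUC = [0.9726, 0.9787, 0.9813, 0.9822, 0.9831, 0.9837, 0.9827,
           0.9831, 0.9821, 0.9825, 0.9835, 0.9834, 0.9822]
PATIENCE = 7


def replay_early_stopping(val_auc, patience):
    """Return (best_epoch, stop_epoch), both 1-indexed.

    stop_epoch is None if the patience budget is never exhausted.
    """
    best_auc = float('-inf')
    best_epoch = None
    epochs_without_gain = 0
    for epoch, auc in enumerate(val_auc, start=1):
        if auc > best_auc:
            # Strict improvement: record it and reset the counter.
            best_auc, best_epoch = auc, epoch
            epochs_without_gain = 0
        else:
            epochs_without_gain += 1
            if epochs_without_gain >= patience:
                return best_epoch, epoch
    return best_epoch, None


BEST_EPOCH, STOP_EPOCH = replay_early_stopping(VAL_AUC, PATIENCE)

if __name__ == '__main__':
    print('best epoch:', BEST_EPOCH, 'stopped at epoch:', STOP_EPOCH)
```

\u003cp\u003eUnder these assumptions the replay gives a best epoch of 6 and a stop at epoch 13, matching the reported values.\u003c/p\u003e \u003cp\u003e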
The validation AUC-ROC rose from 0.9726 (epoch 1) to 0.9837 (epoch 6) and then plateaued while training loss continued to decline, consistent with the onset of overfitting and supporting the use of early stopping. Early stopping was triggered at epoch 13 after seven consecutive epochs without improvement (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003e, Table\u0026nbsp;\u003cspan refid=\"Tab2\" class=\"InternalRef\"\u003e2\u003c/span\u003e).\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab2\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 2\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eValidation performance across training epochs. Metrics are taken from the training logs. The best checkpoint selected by early stopping is marked with ★.\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"5\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eEpoch\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eTrain loss\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eVal macro AUC\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eSensitivity\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c5\"\u003e \u003cp\u003eSpecificity\u003c/p\u003e \u003c/th\u003e 
\u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.2002\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.9726\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.7606\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.9798\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.1420\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.9787\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.7970\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.9806\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e3\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.1149\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.9813\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.7970\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.9838\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e4\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.0910\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.9822\u003c/p\u003e 
\u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.7940\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.9858\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e5\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.0692\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.9831\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.7912\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.9879\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003e6 ★\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e\u003cb\u003e0.0553\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e\u003cb\u003e0.9837\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e\u003cb\u003e0.8207\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e\u003cb\u003e0.9864\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e7\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.0418\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.9827\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.7784\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.9904\u003c/p\u003e \u003c/td\u003e 
\u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e8\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.0333\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.9831\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.8085\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.9877\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e9\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.0267\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.9821\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.7649\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.9917\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e10\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.0238\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.9825\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.7756\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.9917\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e11\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.0196\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.9835\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" 
char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.7825\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.9909\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e12\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.0172\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.9834\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.7948\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.9906\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e13\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.0154\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.9822\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.7612\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.9930\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003e \u003cem\u003e★ Best checkpoint selected for all subsequent evaluations.\u003c/em\u003e \u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec20\" class=\"Section2\"\u003e \u003ch2\u003e4.2 Internal Validation \u0026mdash; RSNA Test Set\u003c/h2\u003e \u003cp\u003eOn the RSNA held-out test set (112,921 slices, threshold\u0026thinsp;=\u0026thinsp;0.5), the model achieved a macro AUC-ROC of 0.9835 (95% CI 0.9821\u0026ndash;0.9847), a sensitivity of 0.8192, a specificity of 0.9862, and an F1 of 0.7743 (Table\u0026nbsp;\u003cspan refid=\"Tab3\" 
class=\"InternalRef\"\u003e3\u003c/span\u003e). The per-class AUC-ROC ranged from 0.9758 (subarachnoid) to 0.9921 (intraventricular). Epidural hemorrhage, the most severely imbalanced class (w\u0026thinsp;=\u0026thinsp;39.51), had an AUC-ROC of 0.9847, with a specificity of 0.9979. The macro AUC-ROC of 0.9835 is numerically higher than the Kang et al. 2023 benchmark of 0.953 \u003csup\u003e14\u003c/sup\u003e, although this comparison is approximate because the two results use different evaluation units and datasets.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab3\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 3\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003ePer-class performance on the RSNA 2019 internal test set (n\u0026thinsp;=\u0026thinsp;112,921 slices, threshold\u0026thinsp;=\u0026thinsp;0.50).\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"7\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c6\" colnum=\"6\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c7\" colnum=\"7\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eClass\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eAUC-ROC\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003e95% CI\u003c/p\u003e 
\u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eSens.\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c5\"\u003e \u003cp\u003eSpec.\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c6\"\u003e \u003cp\u003eF1\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c7\"\u003e \u003cp\u003en_pos\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eany_ich\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.9820\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.9811\u0026ndash;0.9829\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.8434\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.9806\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.8605\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e16,084\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eepidural\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.9847\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.9777\u0026ndash;0.9908\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.7821\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.9979\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.6845\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e459\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e 
\u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eintraparenchymal\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.9868\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.9855\u0026ndash;0.9882\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.8511\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.9870\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.8043\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e5,279\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eintraventricular\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.9921\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.9910\u0026ndash;0.9931\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.8769\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.9908\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.8221\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e3,900\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003esubarachnoid\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.9758\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.9740\u0026ndash;0.9775\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.7535\u003c/p\u003e \u003c/td\u003e 
\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.9808\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.7060\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e5,431\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003esubdural\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.9793\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.9778\u0026ndash;0.9807\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.8082\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.9804\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.7685\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e7,043\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003eMACRO\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e\u003cb\u003e0.9835\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e\u003cb\u003e0.9821\u0026ndash;0.9847\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e\u003cb\u003e0.8192\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e\u003cb\u003e0.9862\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e\u003cb\u003e0.7743\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e\u0026mdash;\u003c/p\u003e 
\u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003e \u003cem\u003e95% CI: bootstrap (n\u0026thinsp;=\u0026thinsp;1,000, slice level). Sens. = sensitivity; Spec. = specificity; n_pos\u0026thinsp;=\u0026thinsp;positive slices in the test set.\u003c/em\u003e \u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec21\" class=\"Section2\"\u003e \u003ch2\u003e4.3 External Validation \u0026mdash; CQ500\u003c/h2\u003e \u003cp\u003e \u003cb\u003ePrimary evaluation (threshold\u0026thinsp;=\u0026thinsp;0.5)\u003c/b\u003e: The maximum predicted sigmoid probability across all CQ500 slices was 0.468. All outputs fell below the fixed 0.5 threshold, yielding zero sensitivity across all classes (specificity\u0026thinsp;=\u0026thinsp;1.0). This is the primary negative finding: the model produces no positive predictions on the CQ500 data without recalibration.\u003c/p\u003e \u003cp\u003e \u003cstrong\u003eDiscriminative performance (AUC-ROC)\u003c/strong\u003e \u003cp\u003eThe macro AUC-ROC for the CQ500 was 0.7276 (95% CI 0.6729\u0026ndash;0.7815; Table\u0026nbsp;\u003cspan refid=\"Tab4\" class=\"InternalRef\"\u003e4\u003c/span\u003e). Intraventricular hemorrhage demonstrated the highest degree of cross-dataset generalizability (AUC-ROC 0.8079, 95% CI 0.691\u0026ndash;0.904). The subdural (0.6762) and epidural (0.6772, 95% CI 0.496\u0026ndash;0.858) classes showed the poorest cross-dataset performance; the wide epidural CI reflects only 12 positive studies.\u003c/p\u003e \u003c/p\u003e \u003cp\u003e \u003cstrong\u003eSecondary evaluation (RSNA-derived Youden thresholds)\u003c/strong\u003e \u003cp\u003eAt the Youden index thresholds derived from the RSNA validation set (range 0.018\u0026ndash;0.070), the macro sensitivity was 0.347 and the macro specificity was 0.937, reflecting systematically conservative predictions under domain shift. 
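\u003c/p\u003e \u003cp\u003eThreshold selection by the Youden index can be sketched as follows. This is an illustrative implementation, not the study code: youden_threshold is a hypothetical helper, a score at or above the threshold is assumed to count as positive, and the labels and scores in the example are invented toy values rather than study data.\u003c/p\u003e

```python
def youden_threshold(labels, scores):
    """Return (threshold, J) maximizing J = sensitivity + specificity - 1.

    A case is called positive when score >= threshold. Candidate thresholds
    are the distinct observed scores; ties in J keep the highest threshold.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError('both classes are required')
    best_thr, best_j = None, float('-inf')
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= thr)
        tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < thr)
        j = tp / n_pos + tn / n_neg - 1
        if j > best_j:
            best_thr, best_j = thr, j
    return best_thr, best_j


# Invented toy validation data (not study data).
TOY_LABELS = [0, 0, 0, 1, 0, 1, 1]
TOY_SCORES = [0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07]
TOY_THRESHOLD, TOY_J = youden_threshold(TOY_LABELS, TOY_SCORES)

if __name__ == '__main__':
    print('threshold:', TOY_THRESHOLD, 'J:', TOY_J)
```

\u003cp\u003eA threshold chosen this way on the validation data would then be applied unchanged to the external dataset, as described in Section 3.7.\u003c/p\u003e \u003cp\u003e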
Intraventricular hemorrhage showed the most clinically useful balance (sensitivity 0.731, specificity 0.808).\u003c/p\u003e \u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab4\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 4\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003ePer-class performance on the CQ500 external validation dataset (n\u0026thinsp;=\u0026thinsp;473 studies). Primary: AUC-ROC. Secondary: sensitivity, specificity, and F1 at the RSNA validation-set Youden index thresholds, applied unchanged to the CQ500.\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"8\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c6\" colnum=\"6\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c7\" colnum=\"7\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c8\" colnum=\"8\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eClass\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eAUC\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003e95% CI\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eSens.\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" 
colname=\"c5\"\u003e \u003cp\u003eSpec.\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c6\"\u003e \u003cp\u003eF1\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c7\"\u003e \u003cp\u003eThresh.\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c8\"\u003e \u003cp\u003en_pos\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eany_ich\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.7298\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.6775\u0026ndash;0.7802\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.4721\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.9348\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.6039\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e0.070\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e \u003cp\u003e197\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eepidural\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.6772\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.4956\u0026ndash;0.8582\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.0833\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.9610\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.0645\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e 
\u003cp\u003e0.000\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e \u003cp\u003e12\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eintraparenchymal\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.7098\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.6508\u0026ndash;0.7646\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.1550\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.9884\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.2614\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e0.068\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e \u003cp\u003e129\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eintraventricular\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.8079\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.6905\u0026ndash;0.9044\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.7308\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.8076\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.2901\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e0.018\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e \u003cp\u003e26\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e 
\u003cp\u003esubarachnoid\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.7646\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.6852\u0026ndash;0.8327\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.3333\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.9808\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.4524\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e0.043\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e \u003cp\u003e57\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003esubdural\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.6762\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.5703\u0026ndash;0.7681\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.3061\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.9505\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.3529\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e0.053\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e \u003cp\u003e49\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003eMACRO\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e\u003cb\u003e0.7276\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" 
colname=\"c3\"\u003e \u003cp\u003e\u003cb\u003e0.6729\u0026ndash;0.7815\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e\u003cb\u003e0.3468\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e\u003cb\u003e0.9372\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e\u003cb\u003e0.3375\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e\u0026mdash;\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e \u003cp\u003e\u003cb\u003e205\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003e \u003cem\u003eAUC-ROC: threshold-independent primary measure; 95% CI by study-level bootstrap (n\u0026thinsp;=\u0026thinsp;1,000). Sens., Spec., F1: computed at the RSNA validation-set Youden thresholds, applied unchanged to the CQ500. Epidural threshold 0.000: Under domain shift, no RSNA validation slice prediction exceeded the Youden operating point for this class, indicating that the model\u0026rsquo;s epidural output distribution shifted below the RSNA-derived threshold; de facto, no CQ500 slice was classified as epidural using this threshold. n_pos: positive studies in CQ500. 
The prevalence of the CQ500 class was as follows: any_ich, 43.3%; intraparenchymal, 28.3%; subarachnoid, 12.7%; subdural, 11.2%; intraventricular, 5.9%; and epidural, 2.7%.\u003c/em\u003e \u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec22\" class=\"Section2\"\u003e \u003ch2\u003e4.4 Grad-CAM Explainability\u003c/h2\u003e \u003cp\u003eTwo board-certified radiologists independently scored Grad-CAM heatmaps for 60 randomly selected true-positive test set predictions (10 per class, predicted probability\u0026thinsp;\u0026ge;\u0026thinsp;0.5) via a five-point localization scale (Fig.\u0026nbsp;\u003cspan refid=\"Fig6\" class=\"InternalRef\"\u003e4\u003c/span\u003e). The combined mean localization score was 3.81\u0026thinsp;\u0026plusmn;\u0026thinsp;0.86/5 (Table\u0026nbsp;\u003cspan refid=\"Tab5\" class=\"InternalRef\"\u003e5\u003c/span\u003e). Interrater reliability was κ\u0026thinsp;=\u0026thinsp;0.653, indicating substantial agreement (Landis \u0026amp; Koch scale) \u003csup\u003e\u003cspan citationid=\"CR29\" class=\"CitationRef\"\u003e29\u003c/span\u003e\u003c/sup\u003e. Exact agreement was achieved on 47 of 60 images (78.3%), and all 60 images (100%) had within-one-point agreement, confirming scale stability. Intraparenchymal hemorrhage achieved the highest localization scores (mean of 4.40/5 for both raters), which is consistent with its large, visually distinct parenchymal signal. Subdural hemorrhage showed perfect interrater agreement (κ\u0026thinsp;=\u0026thinsp;1.000), which was consistent with the anatomically unambiguous crescent-shaped convexity pattern. Any_ich showed the lowest mean score (3.30/5) and lowest per-class kappa (0.467), which was expected given its heterogeneous composite nature spanning all five subtypes. 
Epidural hemorrhage showed near-perfect interrater agreement (κ\u0026thinsp;=\u0026thinsp;0.808), which was consistent with the anatomically constrained peripheral location.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab5\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 5\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eRadiologist Grad-CAM localization scores by two independent board-certified radiologists (n\u0026thinsp;=\u0026thinsp;60 heatmaps; 10 per class; randomly selected true-positive predictions; threshold\u0026thinsp;\u0026ge;\u0026thinsp;0.50).\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"6\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\"\u0026plusmn;\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\"\u0026plusmn;\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c6\" colnum=\"6\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eClass\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eRater 1 (mean\u0026thinsp;\u0026plusmn;\u0026thinsp;SD)\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eRater 2 (mean\u0026thinsp;\u0026plusmn;\u0026thinsp;SD)\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eCombined mean\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c5\"\u003e \u003cp\u003eCohen\u0026rsquo;s 
κ\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c6\"\u003e \u003cp\u003eInterpretation\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eany_ich\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c2\"\u003e \u003cp\u003e3.20\u0026thinsp;\u0026plusmn;\u0026thinsp;1.08\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c3\"\u003e \u003cp\u003e3.40\u0026thinsp;\u0026plusmn;\u0026thinsp;1.28\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e3.30\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.467\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003eModerate\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eepidural\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c2\"\u003e \u003cp\u003e3.60\u0026thinsp;\u0026plusmn;\u0026thinsp;0.49\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c3\"\u003e \u003cp\u003e3.50\u0026thinsp;\u0026plusmn;\u0026thinsp;0.67\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e3.55\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.808\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003eAlmost perfect\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eintraparenchymal\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c2\"\u003e \u003cp\u003e4.40\u0026thinsp;\u0026plusmn;\u0026thinsp;0.49\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" 
char=\"\u0026plusmn;\" colname=\"c3\"\u003e \u003cp\u003e4.40\u0026thinsp;\u0026plusmn;\u0026thinsp;0.49\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e4.40\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.583\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003eModerate\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eintraventricular\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c2\"\u003e \u003cp\u003e3.90\u0026thinsp;\u0026plusmn;\u0026thinsp;0.70\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c3\"\u003e \u003cp\u003e3.60\u0026thinsp;\u0026plusmn;\u0026thinsp;0.80\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e3.75\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.531\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003eModerate\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003esubarachnoid\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c2\"\u003e \u003cp\u003e4.00\u0026thinsp;\u0026plusmn;\u0026thinsp;1.00\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c3\"\u003e \u003cp\u003e4.10\u0026thinsp;\u0026plusmn;\u0026thinsp;0.94\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e4.05\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.565\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003eModerate\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd 
align=\"left\" colname=\"c1\"\u003e \u003cp\u003esubdural\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c2\"\u003e \u003cp\u003e3.80\u0026thinsp;\u0026plusmn;\u0026thinsp;0.40\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c3\"\u003e \u003cp\u003e3.80\u0026thinsp;\u0026plusmn;\u0026thinsp;0.40\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e3.80\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e1.000\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003ePerfect\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003eOverall\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c2\"\u003e \u003cp\u003e\u003cb\u003e3.82\u0026thinsp;\u0026plusmn;\u0026thinsp;0.83\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c3\"\u003e \u003cp\u003e\u003cb\u003e3.80\u0026thinsp;\u0026plusmn;\u0026thinsp;0.89\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e\u003cb\u003e3.81\u0026thinsp;\u0026plusmn;\u0026thinsp;0.86\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e\u003cb\u003e0.653\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e\u003cb\u003eSubstantial\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003e \u003cem\u003eScores: 1\u0026thinsp;=\u0026thinsp;attention entirely outside the hemorrhage; 5\u0026thinsp;=\u0026thinsp;attention exclusively within the hemorrhage. 
Interrater reliability: Cohen\u0026rsquo;s weighted kappa (quadratic weights); Landis \u0026amp; Koch (1977)\u003c/em\u003e \u003csup\u003e\u003cspan citationid=\"CR28\" class=\"CitationRef\"\u003e28\u003c/span\u003e\u003c/sup\u003e \u003cem\u003escale. Exact agreement: 47/60 (78.3%); within-one-point agreement: 60/60 (100%). κ interpretation: \u0026lt;0.20 slight; 0.21\u0026ndash;0.40 fair; 0.41\u0026ndash;0.60 moderate; 0.61\u0026ndash;0.80 substantial; 0.81\u0026ndash;1.00 almost perfect.\u003c/em\u003e\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003c/div\u003e"},{"header":"5. Discussion","content":"\u003cdiv id=\"Sec24\" class=\"Section2\"\u003e \u003ch2\u003e5.1 Internal performance in the context of the literature\u003c/h2\u003e \u003cp\u003eThe model achieved a macro AUC-ROC of 0.9835 (95% CI 0.9821\u0026ndash;0.9847) on the RSNA internal test set. This value is numerically greater than the Kang et al. 2023 benchmark of 0.953 \u003csup\u003e14\u003c/sup\u003e and falls within the pooled meta-analysis estimate of 0.96 \u003csup\u003e3\u003c/sup\u003e. As discussed in Section \u003cspan refid=\"Sec20\" class=\"InternalRef\"\u003e4.2\u003c/span\u003e, this comparison is not equivalent (slice-level versus exam-level evaluation) and should be interpreted as contextual rather than confirmatory. The high macro specificity of 0.9862 indicates a low false-positive rate, which is important for clinical screening workflows \u003csup\u003e\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e,\u003cspan citationid=\"CR23\" class=\"CitationRef\"\u003e23\u003c/span\u003e\u003c/sup\u003e. Epidural hemorrhage, the most severely imbalanced class (0.42% prevalence, w\u0026thinsp;=\u0026thinsp;39.51), achieved an AUC-ROC of 0.9847, demonstrating that the class-balanced weighting formula effectively prevents minority class collapse. 
The sensitivity of 0.7821 for the epidural class reflects the intrinsic difficulty of detecting a class represented by only 459 positive test slices, which is consistent with the literature \u003csup\u003e\u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e,\u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e\u003c/sup\u003e.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec25\" class=\"Section2\"\u003e \u003ch2\u003e5.2 Domain Shift and Cross-Dataset Generalization\u003c/h2\u003e \u003cp\u003eThe most notable finding was complete threshold failure on the CQ500: all sigmoid outputs fell below 0.5, rendering the model clinically inoperative without recalibration. The macro AUC-ROC of 0.7276 nonetheless indicates retained discriminative power. Domain shift is attributable to multiple concurrent factors: (i) image format differences: JPEG-compressed RSNA data versus raw DICOM CQ500 acquisitions, producing systematically lower sigmoid outputs; (ii) scanner heterogeneity: CQ500 acquired on Siemens and Philips scanners across varied slice thicknesses (1.25\u0026ndash;5 mm) and reconstruction kernels versus the curated RSNA multi-institutional collection; (iii) differences in slice thickness altering partial volume effects, particularly for thin extra-axial collections; (iv) population-level differences in skull morphology, brain atrophy, and hemorrhage etiology; and (v) structural label noise differences: RSNA labels propagated from the study level versus CQ500\u0026rsquo;s majority-vote study-level labels, potentially contributing to conservative sigmoid outputs. These findings reinforce the conclusions of Voter et al. \u003csup\u003e5\u003c/sup\u003e and Salehinejad et al. \u003csup\u003e7\u003c/sup\u003e that benchmark performance does not predict deployment performance, and they highlight threshold recalibration as a mandatory predeployment step. 
Intraventricular hemorrhage showed the most consistent cross-dataset performance (AUC-ROC 0.8079, sensitivity 0.731 at the RSNA-derived threshold) and represented the most clinically deployable output class under domain shift. The anatomically distinctive periventricular distribution of intraventricular blood produces a high-contrast signal that is relatively robust to scanner and acquisition variation, a finding with practical implications for triage prioritization in LMIC deployment. The epidural AUC-ROC for the CQ500 (0.6772, 95% CI 0.496\u0026ndash;0.858) is statistically unreliable given that only 12 positive studies exist and should not be used for subtype-specific comparisons.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec26\" class=\"Section2\"\u003e \u003ch2\u003e5.3 Epidural Hemorrhage: Rare Subtype Performance\u003c/h2\u003e \u003cp\u003eEpidural hemorrhage presents a dual challenge: extreme class imbalance (0.42%, w\u0026thinsp;=\u0026thinsp;39.51) and a high dependence on anatomically precise skull\u0026ndash;brain interface signals. The class-balanced weighting formula successfully suppressed false positives (specificity of 0.9979 internally), but the sensitivity (0.7821) reflects the difficulty of correctly classifying a small number of positive slices. A dataset with substantially greater epidural incidence is needed for robust cross-domain assessment.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec27\" class=\"Section2\"\u003e \u003ch2\u003e5.4 Grad-CAM and clinical interpretability\u003c/h2\u003e \u003cp\u003eThe two-rater Grad-CAM evaluation (n\u0026thinsp;=\u0026thinsp;60, κ\u0026thinsp;=\u0026thinsp;0.653, substantial agreement) provides credible evidence that the model\u0026rsquo;s attention is broadly consistent with the expected anatomical distribution of each ICH subtype. 
The combined mean score of 3.81\u0026thinsp;\u0026plusmn;\u0026thinsp;0.86 / 5 (ranging from 3.30 to 4.40 across classes) indicates that both radiologists consistently judged model attention to lie predominantly within the hemorrhage region. These findings are consistent with those of Lee et al. \u003csup\u003e10\u003c/sup\u003e and Kim et al. \u003csup\u003e19\u003c/sup\u003e, who demonstrated anatomically grounded Grad-CAM attention in CNN-based ICH classifiers. Notably, the intraparenchymal subtype had the highest mean (4.40 / 5), which is consistent with its large, visually distinctive parenchymal signal; the subdural subtype achieved perfect interrater agreement (κ\u0026thinsp;=\u0026thinsp;1.000), which is consistent with the unambiguous crescent-shaped convexity pattern; and any_ich scored lowest (3.30 / 5, κ\u0026thinsp;=\u0026thinsp;0.467), which was expected given its heterogeneous composite nature. The evaluation is limited to predictions with probabilities\u0026thinsp;\u0026ge;\u0026thinsp;0.5; attention near the decision boundary was not assessed and remains a direction for future work.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec28\" class=\"Section2\"\u003e \u003ch2\u003e5.5 Limitations\u003c/h2\u003e \u003cp\u003e \u003cb\u003eLabel noise (slice-level label propagation).\u003c/b\u003e The RSNA 2019 dataset provides only study-level labels. Slice-level training labels are generated by propagating the study-level label to every slice in a study, meaning that every slice in a positive study is labeled positive regardless of whether hemorrhage is visible on that slice, and every slice in a negative study is labeled negative. This introduces bidirectional label noise. 
Given the focal and spatially limited nature of most intracranial hemorrhages relative to standard CT slice thickness (typically 5 mm), it is clinically reasonable to expect that only a minority of slices within a positive study contain visible hemorrhage, implying that the majority of \u0026lsquo;positive\u0026rsquo; training slices may not exhibit visible pathology. This noise almost certainly inflates slice-level specificity and sensitivity estimates and likely inflates the AUC-ROC relative to what a model trained on slice-confirmed labels would achieve. It may also encourage the model to learn study-level features (e.g., patient-specific anatomy, scanner artifacts) rather than slice-level pathology. The internal AUC-ROC of 0.9835 should therefore be interpreted as an upper bound estimate. A fully slice-annotated dataset, such as a DICOM-native dataset with per-slice radiologist annotation, is needed to obtain unbiased slice-level performance estimates. The model operates at the slice level without interslice volumetric context, limiting detection of small or thin hemorrhages more apparent across consecutive slices. A lightweight Otsu-based skull mask was used for quality control rather than a validated deep learning skull stripping tool (e.g., HD-BET), which may affect parenchymal signal isolation. The study-level split is a proxy for patient-level separation in this deidentified dataset. JPEG compression at a quality factor of 95 was applied to training data only; the mismatch with raw DICOM external data was a primary contributor to domain shift and threshold failure. The Grad-CAM radiologist assessment (n\u0026thinsp;=\u0026thinsp;60, two raters) is limited to predictions with probabilities\u0026thinsp;\u0026ge;\u0026thinsp;0.5; model attention near the decision boundary was not assessed. 
The external validation was limited to a single external cohort (CQ500).\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec29\" class=\"Section2\"\u003e \u003ch2\u003e5.6 Future directions\u003c/h2\u003e \u003cp\u003eDomain adaptation techniques should be evaluated to mitigate the JPEG-to-DICOM intensity distribution mismatch. Candidate approaches include histogram matching (mapping CQ500 intensity histograms onto the RSNA training distribution before inference), standardization (z-score normalization applied to HU values prior to windowing), and test-time adaptation methods that adjust the model\u0026rsquo;s feature statistics to the target domain. Threshold calibration on a small institutional validation set should be formalized as a mandatory predeployment step. Extension to three-dimensional volumetric models would provide the interslice context currently unavailable. Prospective clinical studies at PAHS and other Nepali emergency departments would directly assess the pipeline\u0026rsquo;s utility in LMIC settings. The preliminary Grad-CAM radiologist assessment should be extended to a larger multirater study with formal interrater reliability measurements across the full range of predicted probabilities. Validation on additional external cohorts with independently acquired DICOM data would further characterize cross-domain generalization.\u003c/p\u003e \u003c/div\u003e"},{"header":"6. Conclusion","content":"\u003cp\u003eEfficientNet-B4, trained on 752,803 RSNA 2019 CT slices with class-balanced weighted loss, achieves a macro AUC-ROC of 0.9835 (95% CI 0.9821\u0026ndash;0.9847) on the internal test set. External validation on the CQ500 reveals complete threshold failure under domain shift, a critical negative finding that must be foregrounded in any deployment context. 
The threshold-independent AUC-ROC of 0.7276 demonstrates retained discriminative capacity, with intraventricular hemorrhage showing the most consistent cross-dataset generalization. The Grad-CAM attention maps received a mean radiologist localization score of 3.81/5 across all six output classes (five ICH subtypes plus any_ich), supporting anatomically plausible model attention. These findings establish threshold recalibration on a representative local validation set as a nonnegotiable prerequisite for deployment, and the open-source pipeline is released to facilitate adaptation in resource-limited healthcare settings.\u003c/p\u003e"},{"header":"Declarations","content":"\u003cp\u003e\u003cstrong\u003eEthics statement:\u0026nbsp;\u003c/strong\u003eThis study used only publicly available, fully deidentified datasets (RSNA 2019 and CQ500). No patient contact or new data collection was performed. Ethical approval was not required.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eData availability:\u0026nbsp;\u003c/strong\u003eRSNA 2019: https://www.kaggle.com/c/rsna-intracranial-hemorrhage-detection. CQ500: http://headctstudy.qure.ai. Code and model weights: https://github.com/lochanshrestha-dev/ich-detection-efficientnet-\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAuthor contributions (CRediT): Lochan Shrestha:\u0026nbsp;\u003c/strong\u003eConceptualization, Methodology, Software, Validation, Formal analysis, Investigation, Data curation, Writing \u0026ndash; Original Draft, Writing \u0026ndash; Review \u0026amp; Editing, Visualization. \u003cstrong\u003eHuma Subhani:\u0026nbsp;\u003c/strong\u003eConceptualization, Resources, Writing \u0026ndash; Review \u0026amp; Editing, Supervision, Project administration.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eGenerative AI and AI-assisted technologies in the writing process\u003c/strong\u003e During the preparation of this work, the authors used Claude (Anthropic) to assist with manuscript drafting, structural editing, and revision. 
The authors reviewed, edited, and critically evaluated all AI-assisted content and take full responsibility for the integrity, accuracy, and scientific validity of the published article. No AI tool was used in the design, conduct, or analysis of the study, in the interpretation of results, or in the clinical assessment of Grad-CAM heatmaps.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eCompeting interests:\u0026nbsp;\u003c/strong\u003eThe authors declare that they have no competing interests.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eFunding:\u0026nbsp;\u003c/strong\u003eNo external funding was received for this study.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAcknowledgments:\u0026nbsp;\u003c/strong\u003eThe authors thank the RSNA and Qure.ai for making their datasets publicly available and Kaggle for providing computational resources.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\u003cli\u003e\u003cspan\u003eCaceres, J. A. \u0026amp; Goldstein, J. N. Intracranial hemorrhage. \u003cem\u003eEmerg. Med. Clin. North Am.\u003c/em\u003e \u003cb\u003e30\u003c/b\u003e, 771\u0026ndash;794 (2012).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003evan Asch, C. J. et al. Incidence, case fatality, and functional outcome of intracerebral hemorrhage over time, according to age, sex, and ethnic origin: a systematic review and meta-analysis. \u003cem\u003eLancet Neurol.\u003c/em\u003e \u003cb\u003e9\u003c/b\u003e, 167\u0026ndash;176 (2010).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKaramian, A. \u0026amp; Seifi, A. Diagnostic Accuracy of Deep Learning for Intracranial Hemorrhage Detection in Non-Contrast Brain CT Scans: A Systematic Review and Meta-Analysis. \u003cem\u003eJ. Clin. Med.\u003c/em\u003e \u003cb\u003e14\u003c/b\u003e, 2377 (2025).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKuo, W., Häne, C., Mukherjee, P., Malik, J. \u0026amp; Yuh, E. L. 
Expert-level detection of acute intracranial hemorrhage on head computed tomography using deep learning. \u003cem\u003eProc. Natl. Acad. Sci.\u003c/em\u003e \u003cb\u003e116\u003c/b\u003e, 22737\u0026ndash;22745 (2019).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eVoter, A. F., Meram, E., Garrett, J. W. \u0026amp; Yu, J. P. J. Diagnostic Accuracy and Failure Mode Analysis of a Deep Learning Algorithm for the Detection of Intracranial Hemorrhage. \u003cem\u003eJ. Am. Coll. Radiol.\u003c/em\u003e \u003cb\u003e18\u003c/b\u003e, 1143\u0026ndash;1152 (2021).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eNada, A. et al. External validation and performance analysis of a deep learning-based model for the detection of intracranial hemorrhage. \u003cem\u003eNeuroradiol. J.\u003c/em\u003e \u003cb\u003e38\u003c/b\u003e, 312\u0026ndash;321 (2025).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSalehinejad, H. et al. A real-world demonstration of machine learning generalizability in the detection of intracranial hemorrhage on head computerized tomography. \u003cem\u003eSci. Rep.\u003c/em\u003e \u003cb\u003e11\u003c/b\u003e, 17051 (2021).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLin, T. Y., Goyal, P., Girshick, R., He, K. \u0026amp; Dollar, P. Focal Loss for Dense Object Detection. in \u003cem\u003eProceedings of the IEEE International Conference on Computer Vision (ICCV)\u003c/em\u003e 2980\u0026ndash;2988 (2017).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eFlanders, A. E. et al. Construction of a Machine Learning Dataset through Collaboration: The RSNA 2019 Brain CT Hemorrhage Challenge. \u003cem\u003eRadiol. Artif. Intell.\u003c/em\u003e \u003cb\u003e2\u003c/b\u003e, e190211 (2020).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLee, H. et al. An explainable deep-learning algorithm for the detection of acute intracranial hemorrhage from small datasets. \u003cem\u003eNat. Biomed. 
Eng.\u003c/em\u003e \u003cb\u003e3\u003c/b\u003e, 173\u0026ndash;182 (2019).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eChilamkurthy, S. et al. Deep learning algorithms for detection of critical findings in head CT scans: a retrospective study. \u003cem\u003eLancet\u003c/em\u003e \u003cb\u003e392\u003c/b\u003e, 2388\u0026ndash;2396 (2018).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBurduja, M., Ionescu, R. T. \u0026amp; Verga, N. Accurate and Efficient Intracranial Hemorrhage Detection and Subtype Classification in 3D CT Scans with Convolutional and Long Short-Term Memory Neural Networks. \u003cem\u003eSensors\u003c/em\u003e \u003cb\u003e20\u003c/b\u003e, 5611 (2020).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eD\u0026rsquo;Angelo, T. et al. Accuracy and time efficiency of a novel deep learning algorithm for Intracranial Hemorrhage detection in CT Scans. \u003cem\u003eRadiol. Med. (Torino)\u003c/em\u003e \u003cb\u003e129\u003c/b\u003e, 1499\u0026ndash;1506 (2024).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKang, D. W. et al. Strengthening deep-learning models for intracranial hemorrhage detection: strongly annotated computed tomography images and model ensembles. \u003cem\u003eFront. Neurol.\u003c/em\u003e \u003cb\u003e14\u003c/b\u003e, 1321964 (2023).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eTan, M. \u0026amp; Le, Q. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. in \u003cem\u003eProceedings of the 36th International Conference on Machine Learning\u003c/em\u003e 6105\u0026ndash;6114 (PMLR, 2019).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003ePark, S. H. et al. Comparison between single and serial computed tomography images in classification of acute appendicitis, acute right-sided diverticulitis, and normal appendix using EfficientNet. \u003cem\u003ePLOS ONE\u003c/em\u003e 
\u003cb\u003e18\u003c/b\u003e, e0281498 (2023).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKok, Y. E. et al. Semantic Segmentation of Spontaneous Intracerebral Hemorrhage, Intraventricular Hemorrhage, and Associated Edema on CT Images Using Deep Learning. \u003cem\u003eRadiol. Artif. Intell.\u003c/em\u003e \u003cb\u003e4\u003c/b\u003e, e220096 (2022).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSelvaraju, R. R. et al. Grad-CAM: Visual Explanations From Deep Networks via Gradient-Based Localization. in \u003cem\u003eProceedings of the IEEE International Conference on Computer Vision (ICCV)\u003c/em\u003e 618\u0026ndash;626 (2017).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKim, K. H., Koo, H. W., Lee, B. J., Yoon, S. W. \u0026amp; Sohn, M. J. Cerebral hemorrhage detection and localization with medical imaging for cerebrovascular disease diagnosis and treatment using explainable deep learning. \u003cem\u003eJ. Korean Phys. Soc.\u003c/em\u003e \u003cb\u003e79\u003c/b\u003e, 321\u0026ndash;327 (2021).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eAltuve, M. \u0026amp; P\u0026eacute;rez, A. Intracerebral hemorrhage detection on computed tomography images using a residual neural network. \u003cem\u003ePhys. Med.\u003c/em\u003e \u003cb\u003e99\u003c/b\u003e, 113\u0026ndash;119 (2022).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eInkeaw, P. et al. Automatic hemorrhage segmentation on head CT scan for traumatic brain injury using 3D deep learning model. \u003cem\u003eComput. Biol. Med.\u003c/em\u003e \u003cb\u003e146\u003c/b\u003e, 105530 (2022).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eVidhya, V. et al. YOLOv5s-CAM: A Deep Learning Model for Automated Detection and Classification for Types of Intracranial Hematoma in CT Images. 
\u003cem\u003eIEEE Access\u003c/em\u003e \u003cb\u003e11\u003c/b\u003e, 141309\u0026ndash;141328 (2023).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eDeng, J. et al. ImageNet: A Large-Scale Hierarchical Image Database. in \u003cem\u003e2009 IEEE Conference on Computer Vision and Pattern Recognition\u003c/em\u003e 248\u0026ndash;255 (2009).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWightman, R. \u003cem\u003ePyTorch Image Models\u003c/em\u003e. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.5281/zenodo.4414861\u003c/span\u003e\u003cspan address=\"10.5281/zenodo.4414861\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e (2026).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003ePaszke, A. et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. in \u003cem\u003eAdvances in Neural Information Processing Systems\u003c/em\u003e vol. 32 (Curran Associates, Inc., 2019).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLoshchilov, I. \u0026amp; Hutter, F. Decoupled Weight Decay Regularization. in \u003cem\u003eInternational Conference on Learning Representations\u003c/em\u003e (2019).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMicikevicius, P. et al. Mixed Precision Training. in \u003cem\u003eInternational Conference on Learning Representations\u003c/em\u003e (2018).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLandis, J. R. \u0026amp; Koch, G. G. The Measurement of Observer Agreement for Categorical Data. \u003cem\u003eBiometrics\u003c/em\u003e \u003cb\u003e33\u003c/b\u003e, 159\u0026ndash;174 (1977).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eCohen, J. A Coefficient of Agreement for Nominal Scales. \u003cem\u003eEduc. Psychol. 
Meas.\u003c/em\u003e \u003cb\u003e20\u003c/b\u003e, 37\u0026ndash;46 (1960).\u003c/span\u003e\u003c/li\u003e\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":true,"hideJournal":false,"highlight":"","institution":"","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"[email protected]","identity":"scientific-reports","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"scirep","sideBox":"Learn more about [Scientific Reports](http://www.nature.com/srep/)","snPcode":"","submissionUrl":"","title":"Scientific Reports","twitterHandle":"","acdcEnabled":true,"dfaEnabled":true,"editorialSystem":"stoa","reportingPortfolio":"Scientific Reports","inReviewEnabled":true,"inReviewRevisionsEnabled":true},"keywords":"intracranial hemorrhage, deep learning, EfficientNet, multilabel classification, Grad-CAM, external validation, CT imaging, domain shift, class imbalance","lastPublishedDoi":"10.21203/rs.3.rs-9705343/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-9705343/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003eIntracranial hemorrhage (ICH) is a neurological emergency with a mortality rate exceeding 40%. Deep learning models trained on benchmark datasets rarely demonstrate robust generalizability to independent external cohorts. We developed and externally validated an open-source EfficientNet-B4 pipeline for multilabel ICH detection with integrated Grad-CAM explainability. 
The model was trained on 752,803 CT slices from the RSNA 2019 dataset via inverse frequency weighted loss across six outputs, with three-window RGB encoding as input. External validation was performed on CQ500 (473 studies, New Delhi, India) without retraining. On the RSNA internal test set (112,921 slices), the model achieved a macro AUC-ROC of 0.9835 (95% CI 0.9821\u0026ndash;0.9847), a sensitivity of 0.8192, a specificity of 0.9862, and an F1 of 0.7743. For CQ500, all sigmoid predictions fell below 0.5, yielding zero sensitivity, a primary negative finding attributable to domain shift. The threshold-independent macro AUC-ROC was 0.7276 (95% CI 0.6729\u0026ndash;0.7815). Intraventricular hemorrhage showed the most consistent cross-dataset generalization (AUC-ROC 0.8079, 95% CI 0.691\u0026ndash;0.904). Grad-CAM heatmaps (n\u0026thinsp;=\u0026thinsp;60, two independent radiologists) achieved a combined mean localization score of 3.81\u0026thinsp;\u0026plusmn;\u0026thinsp;0.86/5 with substantial interrater agreement (κ\u0026thinsp;=\u0026thinsp;0.653). Domain shift causes complete threshold failure on external data; threshold recalibration on a representative local validation set is a nonnegotiable prerequisite for deployment. 
The open-source pipeline is released to facilitate adaptation in resource-limited settings.\u003c/p\u003e","manuscriptTitle":"Multi-Dataset Generalization and Explainability in AI-Based Intracranial Hemorrhage Detection from Noncontrast CT Using EfficientNet-B4","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2026-05-18 06:31:27","doi":"10.21203/rs.3.rs-9705343/v1","editorialEvents":[{"type":"communityComments","content":0},{"type":"decision","content":"Revision requested","date":"2026-05-17T17:26:19+00:00","index":"","fulltext":""},{"type":"editorAssigned","content":"","date":"2026-05-14T05:27:19+00:00","index":"","fulltext":""},{"type":"checksComplete","content":"","date":"2026-05-14T05:27:13+00:00","index":"","fulltext":""},{"type":"submitted","content":"Scientific Reports","date":"2026-05-13T14:45:32+00:00","index":"","fulltext":""}],"status":"published","journal":{"display":true,"email":"[email protected]","identity":"scientific-reports","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"scirep","sideBox":"Learn more about [Scientific Reports](http://www.nature.com/srep/)","snPcode":"","submissionUrl":"","title":"Scientific Reports","twitterHandle":"","acdcEnabled":true,"dfaEnabled":true,"editorialSystem":"stoa","reportingPortfolio":"Scientific Reports","inReviewEnabled":true,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"18f16e29-e729-4d2d-a2de-b9a44b4dfdda","owner":[],"postedDate":"May 18th, 2026","published":true,"recentEditorialEvents":[{"type":"decision","content":"Revision requested","date":"2026-05-17T17:26:19+00:00","index":"","fulltext":""},{"type":"editorAssigned","content":"","date":"2026-05-14T05:27:19+00:00","index":"","fulltext":""},{"type":"checksComplete","content":"","date":"2026-05-14T05:27:13+00:00","index":"","fulltext":""},{"type":"submitted","content":"Scientific 
Reports","date":"2026-05-13T14:45:32+00:00","index":"","fulltext":""}],"rejectedJournal":[],"revision":"","amendment":"","status":"under-review","subjectAreas":[{"id":68243611,"name":"Biological sciences/Computational biology and bioinformatics"},{"id":68243612,"name":"Health sciences/Medical research"},{"id":68243613,"name":"Health sciences/Neurology"},{"id":68243614,"name":"Biological sciences/Neuroscience"}],"tags":[],"updatedAt":"2026-05-19T02:23:31+00:00","versionOfRecord":[],"versionCreatedAt":"2026-05-18 06:31:27","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-9705343","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-9705343","identity":"rs-9705343","version":["v1"]},"buildId":"8U1c8b4HqxoKbykW_rLl7","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}
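
The threshold recalibration step that the Conclusion names as a deployment prerequisite can be illustrated with a minimal sketch. It assumes a per-label cut-off chosen by maximizing Youden's J (sensitivity + specificity - 1) on a small local validation set; the paper does not state which calibration criterion it intends, so the criterion, the function name `youden_threshold`, and the toy scores below are all hypothetical.

```python
import numpy as np

def youden_threshold(y_true, y_prob):
    """Return (threshold, J) maximizing sensitivity + specificity - 1.

    Illustrative helper, not taken from the authors' released code.
    """
    y_true = np.asarray(y_true, dtype=bool)
    y_prob = np.asarray(y_prob, dtype=float)
    if y_true.all() or not y_true.any():
        raise ValueError("validation labels must contain both classes")
    best_t, best_j = 0.5, -1.0
    # Every observed score is a candidate cut-off (predict positive if >= t).
    for t in np.unique(y_prob):
        pred = y_prob >= t
        sens = np.mean(pred[y_true])      # true-positive rate
        spec = np.mean(~pred[~y_true])    # true-negative rate
        j = sens + spec - 1.0
        if j > best_j:
            best_t, best_j = float(t), float(j)
    return best_t, best_j

# Made-up scores mimicking the reported failure mode: every probability is
# below 0.5, yet positives still tend to rank above negatives.
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_prob = [0.02, 0.05, 0.04, 0.08, 0.21, 0.17, 0.30, 0.19]

t, j = youden_threshold(y_true, y_prob)
sens_default = float(np.mean(np.asarray(y_prob)[np.asarray(y_true, dtype=bool)] >= 0.5))
print(f"recalibrated threshold={t:.2f}, J={j:.2f}, sensitivity at 0.5={sens_default:.2f}")
```

On these toy scores the default 0.5 cut-off detects nothing, while the recalibrated cut-off recovers every positive at the cost of one false positive. In practice the criterion would likely be weighted toward sensitivity for an emergency triage setting, and the threshold would be fitted separately for each of the six outputs.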
