Automated Real-time Assessment of Intracranial Hemorrhage Detection AI Using an Ensembled Monitoring Model (EMM) | Research Square

Zhongnan Fang, Andrew Johnston, Lina Cheuy, Hye Sun Na, Magdalini Paschali, and 8 more

This is a preprint; it has not been peer reviewed by a journal. https://doi.org/10.21203/rs.3.rs-6683104/v1. This work is licensed under a CC BY 4.0 License. Status: Published. Journal Publication published 16 Oct, 2025; the published version is in npj Digital Medicine. Version 1 posted 10

Abstract

Artificial intelligence (AI) tools for radiology are commonly unmonitored once deployed. The lack of real-time case-by-case assessments of AI prediction confidence requires users to independently distinguish between trustworthy and unreliable AI predictions, which increases cognitive burden, reduces productivity, and potentially leads to misdiagnoses.
To address these challenges, we introduce the Ensembled Monitoring Model (EMM), a framework inspired by clinical consensus practices using multiple expert reviews. Designed specifically for black-box commercial AI products, EMM operates independently without requiring access to internal AI components or intermediate outputs, while still providing robust confidence measurements. Using intracranial hemorrhage detection as our test case on a large, diverse dataset of 2,919 studies, we demonstrate that EMM successfully categorizes confidence in the AI-generated prediction, suggesting different actions and helping improve the overall performance of AI tools to ultimately reduce cognitive burden. Importantly, we provide key technical considerations and best practices for successfully translating EMM into clinical settings.

Health sciences/Diseases/Neurological disorders/Brain injuries; Health sciences/Anatomy/Nervous system/Brain; Health sciences/Health care/Medical imaging/Tomography/Computed tomography

Introduction

The landscape of healthcare has rapidly evolved in recent years, with an exponential increase in FDA-cleared artificial intelligence (AI) software as medical devices, especially in radiology [1]. Despite the number of AI applications available, the clinical adoption of radiological AI tools has been slow due to safety concerns regarding potential increases in misdiagnosis, which can erode overall trust in AI systems [2]. Such inaccurate predictions force meticulous verification of each AI result, ultimately adding to the user's cognitive workload rather than fulfilling AI's promise to enhance clinical efficiency [3]. This mismatch between expected and actual performance of AI tools indicates a clear need for real-time monitoring to inform physicians on a case-by-case basis about how confident they can be in each prediction.
Such real-time monitoring alongside physician’s image interpretation also goes hand-in-hand with the latest guidance issued by the FDA focusing on total life-cycle management of AI tools, rather than the status-quo of pre-deployment validation 4 . However, there are currently limited guidelines or best practices for real-time monitoring to communicate reduced model confidence or uncertainty in an AI model’s prediction. Current monitoring of radiological AI devices is performed retrospectively based on concordance between AI model outputs and manual labels, which require laborious radiologist-led annotation 5 . Due to the resource-intensive nature of generating these labels, the vast majority of retrospective evaluations are limited to small data subsets, providing only a partial view of real-world performance 6 . While recent advances in large language models (LLMs) have shown promise in the analysis of clinical reports 7 – 9 , including automated extraction of diagnosis labels from radiology reports 10 – 12 , this solution remains retrospective. Moreover, with report-based monitoring, regardless of the extraction technique, the “quality control” mechanism for algorithm performance remains a manual task. An automated quality monitoring solution may help decrease user cognitive burden and provide additional objective information regarding the performance of the AI model, including performance drift. In response to these limitations, various real-time monitoring techniques have been proposed, including methods that can directly predict confidence/uncertainty using the same training dataset used to develop the AI model being monitored 13 – 18 . Common classes of confidence/uncertainty estimation rely on methods such as SoftMax probability calibration 19 – 21 , Bayesian neural networks 22 – 25 , and Monte Carlo dropout 26 , 27 . 
Deep ensemble approaches have also emerged to evaluate prediction reliability by utilizing groups of models derived from the same parent model with varying augmentations 28 – 32 . However, these methods require access to either the training dataset, model weights, or intermediate outputs, which is not practical when monitoring commercially available models. Since all FDA-cleared radiological AI models that are deployed clinically are black-box in nature, there currently exist no techniques to monitor such models in production in real time. Thus, there remains a critical need for a real-time monitoring system to automatically characterize confidence at the point-of-care (i.e., when the radiologists review the images and the black-box AI prediction in question). To address this need, we developed the Ensembled Monitoring Model (EMM) approach, which is inspired by clinical consensus practices, where individual opinions are validated through multiple expert reviews. Our EMM framework enables prospective real-time case-by-case monitoring, without requiring ground-truth labels or access to internal AI model components, making it deployable for black-box systems. Here, we demonstrate the effectiveness of the EMM approach in characterizing confidence of intracranial hemorrhage (ICH) detection AI systems (one FDA-cleared and one open-source) operating on head computed tomography (CT) imaging. In this clinically significant application requiring high reliability, we show how EMM can monitor AI model performance in real time and inform subsequent actions in cases flagged for decreased accuracy. The complementary use of a primary AI model with EMM can improve accuracy and user trust in the AI model, while potentially reducing the cognitive burden of interpreting ambiguous cases. We further investigate and provide key considerations for translating and implementing the EMM approach across different clinical scenarios. 
Results

Ensembled Monitoring Model (EMM) Overview

Emulating how clinicians reach consensus through a group of experts, the EMM framework was developed to estimate consensus among a group of models. Here, we refer to the model being monitored as the "primary model". In this study, the EMM comprised five sub-models with diverse architectures trained for the identical task of detecting the presence of ICH (Fig. 1a). Each sub-model within the EMM independently processed the same input to generate its own binary prediction (e.g., the input image is positive or negative for ICH), in parallel to the primary ICH detection model. Each of the five EMM outputs was compared to the primary model's output, and the fraction of sub-models matching the primary output gave an agreement level from 0% (none of the five EMM sub-models agreed with the primary output) to 100% (all five agreed). This level of agreement can translate into confidence in the primary output.

The level of EMM agreement with the primary ICH detection model, and the resulting degree of confidence, also enables radiologists to make different decisions on a case-by-case basis. EMM's agreement level with the primary model can stratify the primary predictions into three groups, representing cases in which the radiologist can have increased, similar, or decreased confidence after seeing EMM's agreement with the primary model (Fig. 1b). This stratification allows radiologists to adjust their actions accordingly for each image read. For example, the primary model's prediction might not be used in cases with decreased confidence, and these cases should be reviewed following a radiologist's conventional image interpretation protocol. As we show in the following sections, such optimization may potentially improve radiologist efficiency and reduce cognitive load.
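The agreement level described above can be illustrated with a short sketch. This is not the authors' code: the function name `emm_agreement` and the 0/1 encoding of predictions (1 for ICH-positive, 0 for ICH-negative) are assumptions made for illustration only.

```python
from typing import Sequence


def emm_agreement(primary_pred: int, sub_model_preds: Sequence[int]) -> float:
    """Return the fraction of EMM sub-models whose binary prediction
    matches the primary model's binary prediction (hypothetical helper)."""
    if len(sub_model_preds) == 0:
        raise ValueError("at least one sub-model prediction is required")
    # Count the sub-models that agree with the primary model.
    n_agree = sum(1 for p in sub_model_preds if p == primary_pred)
    return n_agree / len(sub_model_preds)


# Example: the primary model predicts ICH-positive and four of the five
# sub-models agree, giving an agreement level of 80%.
agreement = emm_agreement(1, [1, 1, 0, 1, 1])
print(f"{agreement:.0%}")  # 80%
```

With five sub-models the agreement level can only take the values 0%, 20%, 40%, 60%, 80%, and 100%, which is why the thresholds reported later are stated at those levels.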
EMM agreement levels are associated with different features

To identify the features most commonly found in images with high EMM agreement, we conducted a visual examination of all 2,919 analyzed CT studies (50.1% male, age range: 2 months-104.6 years). Two primary ICH detection models were evaluated: an FDA-cleared AI model, and an open-source AI model that placed second [33] in the RSNA 2019 ICH challenge [34]. The two primary AI models complemented each other for evaluation, with the FDA-cleared AI model showing lower sensitivity but higher specificity, precision, and overall accuracy than the open-source model (sFigure 1). Results of the FDA-cleared primary model are shown in Fig. 2a, and the results of the open-source model are shown in sFigure 2.

The FDA-cleared primary model and EMM demonstrated 100% agreement and correct classifications in 1,479 cases (51%, ICH positive 632, negative 847), primarily in cases with obvious hemorrhage or clearly normal brain anatomy. EMM showed partial agreement with the FDA-cleared model in 848 cases (29%, ICH positive 151, negative 697) when the FDA-cleared model was correct. EMM also showed partial agreement in 454 cases (16%, ICH positive 39, negative 415) when the FDA-cleared model was incorrect. Visual examination revealed that the cases with partial agreement typically presented with subtle ICH or contained imaging features that mimicked hemorrhage (e.g., hyperdensity, such as calcification or tumor). These cases of partial agreement provide an opportunity for further radiologist review. Finally, in 138 cases (4%, ICH positive 21, negative 117), EMM demonstrated 100% agreement with the FDA-cleared model even though the FDA-cleared model's prediction was wrong, so EMM failed to flag the error. These cases predominantly involved either extremely subtle hemorrhages or CT features that strongly mimicked hemorrhage patterns, confusing both the FDA-cleared model and EMM.
We then quantitatively examined which features affected EMM agreement using Shapley analysis 35 . This analysis was performed on a data subset (N = 281) with a comprehensive set of features manually labeled by radiologists spanning multiple categories, including pathology, patient positioning, patient information, image acquisition and reconstruction parameters. In ICH-positive cases, hemorrhage volume emerged as the dominant feature for high EMM agreement, with larger volumes strongly corresponding to higher agreement (Fig. 2 b), as seen in our visual analysis. For ICH-negative cases, the predictive features for EMM agreement were more balanced. The top predictors for EMM agreement were brain volume, patient age, and image rotation. Some directional relationships between feature values and EMM agreement were also identified ( sFigure2 ). For ICH-positive cases, high hemorrhage volume and multi-compartmental hemorrhages resulted in higher EMM agreement. In ICH-negative cases, the presence of features that mimicked hemorrhages led to lower EMM agreement. EMM Enables Confidence-based Image Review Optimization Following the EMM thresholds and suggested actions outlined in Fig. 1 b, we established stratification thresholds based on expert radiologist feedback for the FDA-cleared ICH detection model based on its accuracy at different EMM agreement levels, shown in sFigure 3 . For ICH-positive primary model predictions, we set thresholds of 100% EMM agreement for increased confidence in the primary model, 60% or 80% EMM agreement for similar confidence in the primary model, and 0%, 20% or 40% EMM agreement for decreased confidence in the primary model. For ICH-negative primary model predictions, thresholds were set at 100% EMM agreement for increased confidence, 20%, 40%, 60% or 80% EMM agreement for similar confidence, and 0% EMM agreement for decreased confidence. 
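As a minimal sketch of the stratification rule just described, the thresholds for the FDA-cleared model can be written as a lookup on the number of agreeing sub-models. The function and constant names are hypothetical, predictions are assumed to be encoded as 1 (ICH-positive) or 0 (ICH-negative), and the rule assumes exactly five sub-models as in this study.

```python
# Smallest number of agreeing sub-models (out of five) that still counts as
# "similar" confidence, keyed by the primary model's prediction.
# ICH-positive (1): 60% or 80% agreement -> 3 or 4 sub-models.
# ICH-negative (0): 20% to 80% agreement -> 1 to 4 sub-models.
SIMILAR_MIN_AGREE = {1: 3, 0: 1}
N_SUB_MODELS = 5


def stratify_confidence(primary_pred: int, n_agree: int) -> str:
    """Map a primary prediction and the count of agreeing EMM sub-models
    to 'increased', 'similar', or 'decreased' confidence."""
    if primary_pred not in SIMILAR_MIN_AGREE:
        raise ValueError("primary_pred must be 0 or 1")
    if not 0 <= n_agree <= N_SUB_MODELS:
        raise ValueError("n_agree must be between 0 and 5")
    if n_agree == N_SUB_MODELS:  # 100% agreement
        return "increased"
    if n_agree >= SIMILAR_MIN_AGREE[primary_pred]:
        return "similar"
    return "decreased"


# An ICH-positive prediction with 2 of 5 sub-models agreeing (40%)
# falls into the decreased-confidence group.
print(stratify_confidence(1, 2))  # decreased
# An ICH-negative prediction with the same agreement stays "similar".
print(stratify_confidence(0, 2))  # similar
```

Working with integer counts instead of percentages avoids floating-point comparisons at the threshold boundaries; a different primary model or ensemble size would need its own thresholds.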
We then evaluated the overall accuracy of the primary model together with EMM for cases classified as increased, similar, and decreased confidence (Fig. 3 a). This evaluation was also performed across three different prevalences of 30%, 15%, and 5%, which are close to the prevalences observed at our institution across emergency, in-patient, and out-patient settings. As expected, overall accuracy was highest for cases in which the primary model and EMM showed high agreement, and overall accuracy was lowest when EMM showed a lower agreement level with the primary model. This was observed for both ICH-positive and ICH-negative primary predictions and across all prevalence levels. Of the cases analyzed, most of the cases were classified as increased confidence based on EMM thresholds, followed by similar confidence, and lastly decreased confidence (Fig. 3 b). To assess the practical value of the EMM suggested actions, we analyzed the relative gains of the model compared to the cognitive load and loss of trust associated with incorrect classifications. Among the cases flagged for decreased confidence, those for which the primary model’s prediction remained correct despite low EMM agreement (and thus the decreased confidence classification) were considered false alarms. As shown in Fig. 3 c, the potential for radiologists to be alerted to a possible incorrect primary model output and correct these cases substantially improved relative accuracy compared to the percentage of false alarms across all prevalence levels (relative accuracy improvements of 4.7%, 11%, and 38% versus false-alarm rates of 0.89%, 0.45%, and 0.14% at prevalence levels of 30%, 15%, and 5%, respectively). However, this net benefit was only observed for cases with ICH-negative primary model predictions at 30% prevalence (false-alarm rate of 1.1% versus relative accuracy improvement of 3.4%). 
At lower prevalence levels (15% and 5%), the already-high baseline accuracy of the primary model for ICH-negative cases (i.e., accuracy = 0.93 and 0.98, respectively) meant that the relative accuracy gains (1.4% and 0.41%) no longer clearly exceeded the burden of false alarms (1.3% and 1.4%). Similar results were also observed for the open-source ICH detection AI across all prevalences (sFigure 4).

Sub-model and Data Size Considerations when Training EMM

To enable broader application and adoptability, we conducted a comprehensive analysis of how three key factors affect EMM performance: i) amount of training data used: 100% of the dataset (n = 18,370), 50% of the dataset (n = 9,185), 25% of the dataset (n = 4,592), and 5% of the dataset (n = 918), ii) number of EMM sub-models (1–5), and iii) EMM sub-model size in relation to training data volume. EMM performance was measured by its ability to detect errors made by the primary model using error detection sensitivity-PPV area under curve (ED-SPAUC) and specificity-NPV area under curve (ED-SNAUC) across prevalences, as these metrics consider the overall error detection performance regardless of the agreement level threshold applied.

Training Data

As illustrated in Fig. 4a, EMM's ED-SPAUC for the FDA-cleared primary model generally decreased as the training data was reduced from 100% to 5% of the original dataset across all three prevalences. This suggested that EMM generally improves with increased training data volume, though the benefits begin to saturate after approximately 10,000 studies. This trend was also observed in ED-SNAUC and for the open-source primary model (sFigure 5a).

Number of Ensemble Sub-models

As shown in Fig. 4b, EMM's ED-SPAUC increased as the number of sub-models increased from 1 to 4, before generally stabilizing at 5 across all three prevalence levels. Conversely, ED-SNAUC consistently improved as the number of sub-models increased from 1 to 5, across all prevalences.
Similarly, both error detection metrics showed consistent improvement as the number of networks increased for the open-source primary AI ( sFigure 5b ). These results suggest that EMM performance generally increases with more sub-models, with 4 or 5 sub-models serving as an effective starting point for future applications. Ensemble Sub-model Size and Training Data : We next investigated how combinations of EMM sub-model size and training data volume affected EMM's performance in monitoring the primary model. We examined four scenarios: i) ensemble of small networks trained with 5% of the dataset (S-Net 5%-Data), ii) ensemble of small networks trained with 100% of the dataset (S-Net 100%-Data), iii) ensemble of large networks trained with 5% of the dataset (L-Net 5%-Data), and iv) ensemble of large networks trained with 100% of the dataset (L-Net 100%-Data). As shown in Fig. 4 c, EMM generally achieved the best ED-SPAUC and ED-SNAUC with large networks and 100% of the training data and worst performance with large network and 5% of training data, suggesting that a larger training dataset could benefit EMM performance. In a 5% prevalence setting, an ensemble of small networks and 5% of the training data achieved the highest ED-SPAUC and ED-SNAUC values. Similar findings were also observed for the open-source primary AI ( sFigure 5c ). Taken together, these results provide insights into how the EMM approach can be developed and tailored for various real-time monitoring applications. Discussion As interest in AI for healthcare rapidly grows, monitoring medical AI systems has become increasingly urgent to ensure AI’s trustworthiness, safety, and effectiveness. Recently, the FDA has proposed a lifecycle management approach for medical AI devices, requiring evaluation not only during FDA approval and pre-deployment phases, but also continuous monitoring after real-world implementation, and critical risk assessments for individual cases across AI systems 4 . 
However, the black-box nature of FDA-cleared commercial AI systems creates significant challenges for real-time case-specific monitoring. In this paper, we introduce EMM, a framework that monitors black-box clinical AI systems in real-time without requiring manual labels or access to the primary model’s internal components. Using an ensemble of independently trained sub-models that mirror the primary task, our framework measures confidence in AI predictions through agreement levels between the EMM sub-models and the primary model on a case-by-case basis. These agreement levels can then be used to stratify cases by confidence in the primary model’s prediction and suggest a subsequent action. This makes EMM a valuable tool that fills the critical gap in real-time, case-by-case monitoring for FDA-cleared black-box AI systems that would otherwise remain unmonitored. Our approach enables quantification of confidence through EMM agreement levels with the primary model’s predictions. By applying appropriate thresholds to the level of agreement between the EMM and primary model, radiologists can differentiate between which predictions they can be more or less confident in, therefore optimizing their attention allocation and cognitive load. For example, with EMM, radiologists could feel more confident in over half of all cases interpreted. Notably, EMM also reliably identified cases of low confidence, allowing for focused review of these cases and greatly improving overall ICH detection accuracy. In our testing, EMM only failed alongside the primary model in a small percentage of cases (4%). These cases of both EMM and primary model being incorrect represent those with small ICH volumes or ICH-mimicking features. The stratification of confidence levels based on the EMM agreement levels also enables radiologists to make tailored decisions for each case. 
The thresholds for defining the three confidence groups in this study were established based on expert radiologist assessment and the primary model's performance at different EMM agreement levels (sFigure 3), with separate analyses for ICH-positive and ICH-negative primary model predictions. The thresholds to indicate increased and decreased confidence were specifically designated so that the overall ICH detection accuracy would be significantly higher or lower, respectively, than that with only the primary model (baseline). However, suboptimal EMM agreement thresholds (resulting in too many cases categorized as decreased confidence) can create an unfavorable trade-off where the burden of further reviewing false alarms, and the associated loss of trust in the EMM, outweighs the relative gains in accuracy. This inefficiency particularly impacts low-prevalence settings, where radiologists may waste valuable time reviewing cases that the primary model had already classified correctly (sFigure 6). This illustrates that although the overall EMM framework can be applied broadly, the agreement levels and thresholds may need case-specific definitions depending on the prevalence level.

Varying the technical parameters of EMM also revealed insight into the best practices for applying EMM to other clinical use cases. As expected, our ablation study showed that larger datasets, a larger number of sub-models, and larger sub-models generally improve the EMM's capability to detect errors in the primary AI model. We also observed that at 5% prevalence, large sub-models trained with 25% of the data or small sub-models trained with 5% of the data achieved optimal performance. This behavior can be explained by the relationship between model complexity and data volume.
Specifically, large sub-models trained on the full dataset (with 41% prevalence) likely became too calibrated/overfitted to that specific prevalence distribution, causing suboptimal performance when testing on data with significantly different prevalence (5%) 36 . Using large sub-model training with 25% of data or using small sub-model training with 5% of data may help the EMM balance the bias-variance tradeoffs by learning meaningful patterns for generalizability, while not overfitting to the training prevalence. The differences observed in optimal dataset size and sub-model size across different prevalence levels can help inform the best technical parameters to start developing an EMM for a different use case, promoting greater adoptability across diseases. Beyond using EMM to improve case-by-case primary model performance, as shown in this study, another potential application of EMM can be monitoring longitudinal changes in primary AI performance. As the EMM agreement levels are tracked over time, perturbations in the expected ranges can be identified over daily, weekly, or monthly periods. For example, any significant drifts in EMM agreement level distribution may signal changes in primary model performance due to shifts in patient demographics, image acquisition parameters, or clinical workflows. In this manner, the EMM approach can provide another dimension into the current radiology statistical process/quality control pipelines for continuous background monitoring 37 , 38 , in addition to reporting concordance. While the EMM approach demonstrates several advantages in case-by-case AI monitoring, some limitations persist. Although the EMM does not require labels to perform its monitoring task, a key constraint is the need for labeled use-case-specific datasets when training the EMM for each clinical application, which could potentially limit broader adoption across diverse clinical institutions with different computing resources. 
However, with the recent maturity of LLMs and self-supervised model training techniques, this limitation may be largely overcome. For example, labels can now be automatically extracted from existing radiology reports using LLMs [10–12]. Self-supervised training [39, 40] also enables large foundation model training without manual annotation [41, 42], and only a small amount of labeled data would be required to further fine-tune the model for each use case. These recent developments enable periodic updates to EMM, allowing it to adapt to changes in patient populations, scanners, and imaging protocols, thereby maintaining consistent and robust performance over time.

Another limitation is EMM's susceptibility to failure patterns similar to those of the primary model being monitored, such as in cases involving small, low-contrast hemorrhages or ICH-mimicking pathologies in this study. Of particular concern are instances where EMM fails simultaneously with the primary model while indicating complete consensus, as this could instill false confidence in clinicians and potentially increase misdiagnosis risk. This risk might be mitigated in the future by training EMM on synthetic datasets [43] with artificially generated difficult cases, such as those with less obvious hemorrhages, ICH-mimicking features, diverse patient populations, or various artifacts. As AI technology rapidly develops, many of the limitations currently facing EMM may be quickly overcome, presenting greater opportunities not only to improve EMM performance but also to reduce the resources required to implement the EMM approach itself.

In conclusion, our EMM framework represents a significant advancement in black-box clinical AI monitoring, enabling case-by-case confidence estimation without requiring access to primary model parameters or intermediate outputs.
By leveraging ensemble agreement levels, EMM provides actionable insights, potentially enhancing diagnostic confidence while reducing cognitive burden. As AI continues to integrate into clinical workflows, approaches like EMM that provide transparent confidence measures will be essential for maintaining trust, ensuring quality, and ultimately improving patient outcomes in resource-constrained environments.

Methods

EMM Dataset and Training

EMM consisted of 5 independently trained 3D convolutional neural networks (CNNs), comprising two versions with different numbers of trainable parameters: a large version utilizing ResNet-101 and ResNet-152 [44] together with DenseNet-121, DenseNet-169, and DenseNet-201 [45]; and a small version employing ResNet-18, ResNet-34, ResNet-50, ResNet-101, and ResNet-152. These networks were initialized using 2D ImageNet [46] pre-trained weights and adapted to 3D via the Inflated 3D (I3D) [47] method, which has shown success previously [42]. EMM sub-models were trained using the open-source RSNA 2019 ICH Detection Challenge [34, 48] dataset and were evaluated using an independent dataset collected at our institution. We trained the models on different subsets of the RSNA dataset, including 18,370 (100%), 9,185 (50%), 4,592 (25%), and 918 (5%) studies, to evaluate EMM's performance across varying training data sizes. All subsets had an ICH prevalence of about 41%. Each model was trained for 100 epochs, to ensure convergence, using the Adam optimizer with a learning rate of 10⁻⁴. All training was conducted on a server with four NVIDIA L40 GPUs using the PyTorch Lightning framework. Based on GPU memory constraints (48 GB), we set the batch size to 4 for ResNet models and 2 for DenseNet models.

EMM Evaluation Dataset

We evaluated the EMM using a dataset of 2,919 CT studies (1,315 ICH-positive and 1,604 ICH-negative, 45% ICH prevalence) with a balanced gender distribution (50.1% male, 49.8% female) and a wide age range (0.16-104.58 years; median: 67.13 years; interquartile range: 49.65-80.00 years).
Since AI model performance is known to vary with disease prevalence 49 , 50 , we evaluated both the primary AI and EMM performance across different prevalence levels. A recent internal evaluation at our institution covering 8,935 studies between July and November 2024 revealed ICH prevalences of 34.77% for in-patient, 9.09% for out-patient, and 6.52% for emergency units, with an overall average prevalence of 16.70%. Based on these observations, we selected three representative prevalence levels for evaluation: 30%, 15%, and 5%. EMM Training To prepare input data for the EMM, we preprocessed all non-contrast axial head CT DICOM images using the Medical Open Network for AI (MONAI) toolkit 51 . The preprocessing pipeline consisted of several standardization steps: reorienting images to the "left-posterior-superior" (LPS) coordinate system, normalizing the in-plane resolution to 0.45mm, and resizing (either cropping or padding depending on the matrix size) the in-plane matrix dimensions to 512×512 pixels using PyTorch's adaptive average pool method, while preserving the original slice resolution. During training, we employed random cropping in the slice dimension, selecting a contiguous block of 30 slices. For testing, we used a sliding window of 30 slices and averaged the ICH SoftMax probabilities across overlapping windows to generate the final prediction. ICH Detection AI models We evaluated EMM's monitoring capabilities on two distinct ICH detection AI systems: an FDA-cleared commercial product and an open-source model that secured second place in the RSNA 2019 ICH detection challenge 34 , 48 . The FDA-cleared model is a black-box system with undisclosed training data and architecture that provides binary labels for presence of ICH and identifies suspicious slices. We monitored this model using EMM trained on the complete (100%, N = 18,370) RSNA 2019 ICH Detection Challenge dataset. 
While this dataset's license restricts usage to academic and non-commercial purposes, we do not have access to information regarding whether the FDA-cleared ICH AI model utilized this dataset during its development. The open-source RSNA 2019 Challenge second-place model 33 employs a 2D ResNext-101 52 network for slice-level feature extraction, followed by two levels of Bidirectional LSTM networks for feature summarization and ICH detection. We selected the second-place model rather than the first-place winner because retraining the top model would require extensive time while offering only marginal performance improvement (≤ 2.3%) based on the leaderboard. Although the original open-source model was trained on the complete RSNA Challenge dataset, we retrained it using only 50% of the data and reserved the remaining 50% for EMM training. This simulates real-world deployment scenarios where the primary ICH detector and EMM are trained on different datasets. For both the FDA-cleared and the open-source ICH detection models, no additional image preprocessing was performed and the original DICOM was sent as the input. Manually Annotated Sub-dataset for Shapley Analysis To comprehensively analyze features that drive high EMM agreement, we manually annotated a smaller dataset (N = 281), including ICH segmentation, volume measurements, and identification of mimicking imaging features. This curated dataset comprised 210 ICH-positive and 71 ICH-negative subjects and their associated studies. The ICH-positive cases span 7 distinct ICH subtypes: subdural (SDH, N = 35), subarachnoid (SAH, N = 50), epidural (EDH, N = 15), intraparenchymal (IPH, N = 19), intraventricular (IVH, N = 2), diffuse axonal injury (DAI, N = 1), and multi-compartmental hemorrhages (Multi-H, N = 88). Among the 71 ICH-negative cases, 43 cases were specifically selected to include features that mimic hemorrhages (e.g., hyper-density such as calcification or tumor), while 28 were from normal subjects. 
A neuroradiology fellow reviewed and validated all clinical labels to ensure accurate ground truth for our analysis.

Comprehensive List of Features for Shapley Analysis

For the Shapley analysis, we prepared a comprehensive list of features including pathology-related metrics (ICH volume and type), patient characteristics (brain volume, age, gender), positioning parameters (rotation, translation), image acquisition parameters (pixel spacing, slice thickness, kVp, X-ray tube current, CT scanner manufacturer), and image reconstruction parameters (reconstruction convolution kernel and filter type).

Explaining Features Contributing to High EMM Agreement using Shapley Analysis

To elucidate the features contributing to a high level of agreement between the EMM sub-models and the primary AI model, we conducted Shapley analysis 35 using the Python "shap" package (v0.46.0). This analysis employed an XGBoost 53 (v2.1.1) classifier to learn the relationship between feature values and EMM agreement and to evaluate the importance of each feature in producing high EMM agreement, quantified as a probability between 0 and 1. Higher Shapley values indicate features that contribute more strongly to 100% EMM agreement.

ICH Volume Estimation for Shapley Analysis

To evaluate whether ICH volume influences EMM monitoring performance, we implemented a systematic protocol for ICH volume estimation. First, we employed Viola-UNet 54, the winning model from the INSTANCE 2022 ICH Segmentation Challenge 55,56, to generate initial ICH segmentations. A radiology resident reviewed these segmentations and marked any errors directly on the images. A trained researcher then manually corrected the marked discrepancies using 3D Slicer software (version 5.6.2) to ensure accurate hemorrhage delineation. Finally, we calculated ICH volumes using the corrected ICH masks and the image resolution data from the DICOM headers.
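The final volume calculation can be sketched as counting the voxels in the corrected mask and multiplying by the voxel size. This is a minimal illustration rather than the authors' code: the function name and the choice of millilitres as the output unit are assumptions, and the spacing arguments are assumed to have already been read from the DICOM headers (the text does not say which header fields were used).

```python
import numpy as np


def ich_volume_ml(mask, pixel_spacing_mm, slice_spacing_mm):
    """Estimate hemorrhage volume in millilitres from a binary mask.

    mask: array shaped (slices, rows, cols); nonzero voxels are hemorrhage.
    pixel_spacing_mm: (row_spacing, col_spacing) in mm, assumed to come from
        the DICOM header.
    slice_spacing_mm: distance between adjacent slices in mm (assumed input).
    """
    voxel_mm3 = pixel_spacing_mm[0] * pixel_spacing_mm[1] * slice_spacing_mm
    # 1 mL equals 1000 mm^3.
    return float(np.count_nonzero(mask)) * voxel_mm3 / 1000.0
```

For example, a mask of 1,000 voxels at 0.5 mm × 0.5 mm × 2.0 mm corresponds to 500 mm³, or 0.5 mL.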
Estimating Patient Brain Volume and Orientation Information for Shapley Analysis

Since hemorrhage detection can be challenging in brains of different sizes or in certain brain orientations, we analyzed brain volume and orientation as potential factors affecting EMM performance, alongside the previously mentioned features. Using the FMRIB Software Library 57 (FSL 6.0.7.13), we developed an automated pipeline following an established protocol 58 to extract brain masks and estimate brain volumes. We then employed FSL FLIRT (FMRIB's Linear Image Registration Tool) to perform 9-degree-of-freedom brain registration, aligning each image to the MNI 2019b non-symmetrical T1 brain template. The resulting rotation, translation, and scaling parameters were incorporated into our Shapley analysis as quantitative measures of brain orientation.

Analysis of the Tradeoff Between False Alarm Rate and Relative Accuracy Improvement

When the decreased confidence group in Fig. 1b is further reviewed by radiologists, some cases may be found to have been labeled correctly by the primary model; we consider these cases to be false alarms. The false-alarm rate is defined as the percentage of unnecessary reviews of correctly labeled cases. We assumed that radiologists reviewing the decreased confidence group would always label the cases correctly, thereby improving overall accuracy. We define the relative improvement in accuracy as the percentage increase in accuracy after review of the decreased confidence group, compared to the baseline accuracy of the primary model.

Statistical Analysis

To assess the reliability of our model's performance metrics, we calculated 95% confidence intervals (CIs) using bootstrapping. We conducted 1,000 random draws with replacement from the set of ground-truth labels and corresponding model predictions.
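A percentile bootstrap of this kind can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names and the fixed random seed are hypothetical, sensitivity stands in for any of the reported metrics, and the interval is taken from the 2.5th and 97.5th percentiles of the bootstrap distribution, as described in this section.

```python
import numpy as np


def sensitivity(y_true, y_pred):
    """True-positive rate; returns NaN if a resample has no positive cases."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    return tp / (tp + fn) if (tp + fn) > 0 else np.nan


def bootstrap_ci(y_true, y_pred, metric, n_boot=1000, seed=0):
    """Percentile bootstrap 95% CI for a metric on paired labels/predictions."""
    rng = np.random.default_rng(seed)  # fixed seed only for reproducibility
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    n = len(y_true)
    values = np.empty(n_boot)
    for b in range(n_boot):
        # Draw case indices with replacement so labels and predictions stay paired.
        idx = rng.integers(0, n, size=n)
        values[b] = metric(y_true[idx], y_pred[idx])
    lo, hi = np.nanpercentile(values, [2.5, 97.5])
    return float(lo), float(hi)
```

Resampling indices, rather than labels and predictions separately, keeps each prediction attached to its ground-truth label within every draw.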
To create evaluation datasets at target prevalence levels (30%, 15%, and 5%) different from the original distribution (45%), we down-sampled ICH-positive cases and resampled ICH-negative cases. For example, to create datasets with a controlled 30% prevalence of ICH-positive cases, we performed random sampling with replacement from our original dataset. Specifically, we randomly selected $0.3 \times N_n$ ICH-positive cases and $N_n$ ICH-negative cases (where $N_n$ represents the total number of ICH-negative cases in the original dataset). After each draw, we computed key performance metrics such as sensitivity, positive predictive value (PPV), specificity, and negative predictive value (NPV). We then determined the 95% confidence intervals by identifying the 2.5th and 97.5th percentiles of these metrics across all bootstrap iterations. To test the significance of differences in a metric between groups, bootstrapping was also applied to estimate the p-value, with the null hypothesis that there is no difference between the two paired groups.

Declarations

Author Contribution

Manuscript drafting and manuscript revision for important intellectual content: all authors; Study concepts and design: Z.F., A.S.K., D.B.L.; Data/statistical analysis: Z.F.; Data collection: D.L., A.W.C., M.I.; Data cleaning and annotation: Z.F., A.J., H.S.N., M.P., C.G., A.K., D.L., A.W.C.; Literature research: Z.F., L.Y.C., A.S.C., M.P., C.G.

Acknowledgement

Stanford Department of Radiology; Stanford 3D and Quantitative Imaging Laboratory (3DQ Lab).

Data Availability

The EMM training dataset is based on the RSNA 2019 ICH detection challenge and can be found at https://www.rsna.org/rsnai/ai-image-challenge/rsna-intracranial-hemorrhage-detection-challenge-2019. Internal validation data are under review and will be published through Stanford. EMM code and model weights will be made publicly available on GitHub. The code repository has been uploaded for review.

References

1. Joshi, G. et al.
FDA-Approved Artificial Intelligence and Machine Learning (AI/ML)-Enabled Medical Devices: An Updated Landscape. Electronics 13, 498 (2024).
2. Challen, R. et al. Artificial intelligence, bias and clinical safety. BMJ Qual Saf 28, 231–237 (2019).
3. Del Gaizo, A. J., Osborne, T. F., Shahoumian, T. & Sherrier, R. Deep Learning to Detect Intracranial Hemorrhage in a National Teleradiology Program and the Impact on Interpretation Time. Radiology: Artificial Intelligence 6, e240067 (2024).
4. Health, C. for D. and R. Blog: A Lifecycle Management Approach toward Delivering Safe, Effective AI-enabled Health Care. FDA (2025).
5. Allen, B. et al. Evaluation and Real-World Performance Monitoring of Artificial Intelligence Models in Clinical Practice: Try It, Buy It, Check It. Journal of the American College of Radiology 18, 1489–1496 (2021).
6. Chow, J., Lee, R. & Wu, H. How Do Radiologists Currently Monitor AI in Radiology and What Challenges Do They Face? An Interview Study and Qualitative Analysis. J Digit Imaging. Inform. med. (2025). doi:10.1007/s10278-025-01493-8.
7. Larson, D. B. et al. Assessing Completeness of Clinical Histories Accompanying Imaging Orders Using Adapted Open-Source and Closed-Source Large Language Models. Radiology 314, e241051 (2025).
8. Van Veen, D. et al. Adapted large language models can outperform medical experts in clinical text summarization. Nat Med 30, 1134–1142 (2024).
9. Li, L. et al. A scoping review of using Large Language Models (LLMs) to investigate Electronic Health Records (EHRs). Preprint at https://doi.org/10.48550/arXiv.2405.03066 (2024).
10. Le Guellec, B. et al. Performance of an Open-Source Large Language Model in Extracting Information from Free-Text Radiology Reports. Radiology: Artificial Intelligence 6, e230364 (2024).
11. Reichenpfader, D., Müller, H. & Denecke, K. Large language model-based information extraction from free-text radiology reports: a scoping review protocol. BMJ Open 13, e076865 (2023).
12. Reichenpfader, D., Müller, H. & Denecke, K.
A scoping review of large language model based approaches for information extraction from radiology reports. npj Digit. Med. 7, 222 (2024).
13. Lambert, B., Forbes, F., Doyle, S., Dehaene, H. & Dojat, M. Trustworthy clinical AI solutions: A unified review of uncertainty quantification in Deep Learning models for medical image analysis. Artificial Intelligence in Medicine 150, 102830 (2024).
14. Gawlikowski, J. et al. A survey of uncertainty in deep neural networks. Artif Intell Rev 56, 1513–1589 (2023).
15. Kiyasseh, D., Cohen, A., Jiang, C. & Altieri, N. A framework for evaluating clinical artificial intelligence systems without ground-truth annotations. Nat Commun 15, 1808 (2024).
16. Ramalho, T. & Miranda, M. Density Estimation in Representation Space to Predict Model Uncertainty. in Engineering Dependable and Secure Machine Learning Systems (eds. Shehory, O., Farchi, E. & Barash, G.) 84–96 (Springer International Publishing, Cham, 2020). doi:10.1007/978-3-030-62144-5_7.
17. Raghu, M. et al. Direct Uncertainty Prediction for Medical Second Opinions. in Proceedings of the 36th International Conference on Machine Learning 5281–5290 (PMLR, 2019).
18. Malinin, A. & Gales, M. Predictive Uncertainty Estimation via Prior Networks. in Advances in Neural Information Processing Systems vol. 31 (Curran Associates, Inc., 2018).
19. Kull, M. et al. Beyond temperature scaling: Obtaining well-calibrated multi-class probabilities with Dirichlet calibration. in Advances in Neural Information Processing Systems vol. 32 (Curran Associates, Inc., 2019).
20. Guo, C., Pleiss, G., Sun, Y. & Weinberger, K. Q. On calibration of modern neural networks. in Proceedings of the 34th International Conference on Machine Learning - Volume 70 1321–1330 (JMLR.org, Sydney, NSW, Australia, 2017).
21. Kumar, A., Liang, P. S. & Ma, T. Verified Uncertainty Calibration. in Advances in Neural Information Processing Systems vol. 32 (Curran Associates, Inc., 2019).
22. Louizos, C. & Welling, M.
Multiplicative Normalizing Flows for Variational Bayesian Neural Networks. in Proceedings of the 34th International Conference on Machine Learning 2218–2227 (PMLR, 2017).
23. Ritter, H., Botev, A. & Barber, D. A Scalable Laplace Approximation for Neural Networks. in (2018).
24. Welling, M. & Teh, Y. W. Bayesian learning via stochastic gradient langevin dynamics. in Proceedings of the 28th International Conference on International Conference on Machine Learning 681–688 (Omnipress, Madison, WI, USA, 2011).
25. Graves, A. Practical Variational Inference for Neural Networks. in Advances in Neural Information Processing Systems vol. 24 (Curran Associates, Inc., 2011).
26. Gal, Y. & Ghahramani, Z. Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning. in Proceedings of The 33rd International Conference on Machine Learning 1050–1059 (PMLR, 2016).
27. Lemay, A. et al. Improving the repeatability of deep learning models with Monte Carlo dropout. npj Digit. Med. 5, 1–11 (2022).
28. Egele, R. et al. AutoDEUQ: Automated Deep Ensemble with Uncertainty Quantification. in 2022 26th International Conference on Pattern Recognition (ICPR) 1908–1914 (2022). doi:10.1109/ICPR56361.2022.9956231.
29. Mehrtash, A., Wells, W. M., Tempany, C. M., Abolmaesumi, P. & Kapur, T. Confidence Calibration and Predictive Uncertainty Estimation for Deep Medical Image Segmentation. IEEE Trans Med Imaging 39, 3868–3878 (2020).
30. Wenzel, F., Snoek, J., Tran, D. & Jenatton, R. Hyperparameter Ensembles for Robustness and Uncertainty Quantification. in Advances in Neural Information Processing Systems vol. 33 6514–6527 (Curran Associates, Inc., 2020).
31. Lakshminarayanan, B., Pritzel, A. & Blundell, C. Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles. in Advances in Neural Information Processing Systems vol. 30 (Curran Associates, Inc., 2017).
32. Kwon, Y., Won, J.-H., Kim, B. J. & Paik, M. C.
Uncertainty quantification using Bayesian neural networks in classification: Application to biomedical image segmentation. Computational Statistics & Data Analysis 142, 106816 (2020).
33. Hanley, D. RSNA Intracranial Hemorrhage Detection Second Place Winner. (2024).
34. Flanders, A. E. et al. Construction of a Machine Learning Dataset through Collaboration: The RSNA 2019 Brain CT Hemorrhage Challenge. Radiology: Artificial Intelligence 2, e190211 (2020).
35. Lundberg, S. M. & Lee, S.-I. A Unified Approach to Interpreting Model Predictions. in Advances in Neural Information Processing Systems vol. 30 (Curran Associates, Inc., 2017).
36. Mutasa, S., Sun, S. & Ha, R. Understanding artificial intelligence based radiology studies: What is overfitting? Clinical Imaging 65, 96–99 (2020).
37. Feng, J. et al. Clinical artificial intelligence quality improvement: towards continual monitoring and updating of AI algorithms in healthcare. npj Digit. Med. 5, 1–9 (2022).
38. Larson, D. B. A Vision for Global CT Radiation Dose Optimization. Journal of the American College of Radiology 21, 1311–1317 (2024).
39. Huang, S.-C. et al. Multimodal Foundation Models for Medical Imaging - A Systematic Review and Implementation Guidelines. 2024.10.23.24316003 Preprint at https://doi.org/10.1101/2024.10.23.24316003 (2024).
40. Huang, S.-C. et al. Self-supervised learning for medical image classification: a systematic review and implementation guidelines. npj Digit. Med. 6, 1–16 (2023).
41. Chen, Z. et al. A Vision-Language Foundation Model to Enhance Efficiency of Chest X-ray Interpretation. Preprint at https://doi.org/10.48550/arXiv.2401.12208 (2024).
42. Blankemeier, L. et al. Merlin: A Vision Language Foundation Model for 3D Computed Tomography. Preprint at https://doi.org/10.48550/arXiv.2406.06512 (2024).
43. Bluethgen, C. et al. A vision–language foundation model for the generation of realistic chest X-ray images. Nat. Biomed. Eng 1–13 (2024). doi:10.1038/s41551-024-01246-y.
44. He, K., Zhang, X., Ren, S. & Sun, J.
Deep Residual Learning for Image Recognition. Preprint at https://doi.org/10.48550/arXiv.1512.03385 (2015).
45. Huang, G., Liu, Z., Van Der Maaten, L. & Weinberger, K. Q. Densely Connected Convolutional Networks. in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2261–2269 (IEEE, Honolulu, HI, 2017). doi:10.1109/CVPR.2017.243.
46. Russakovsky, O. et al. ImageNet Large Scale Visual Recognition Challenge. Preprint at https://doi.org/10.48550/arXiv.1409.0575 (2015).
47. Carreira, J. & Zisserman, A. Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset. in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 4724–4733 (IEEE, Honolulu, HI, 2017). doi:10.1109/CVPR.2017.502.
48. RSNA Intracranial Hemorrhage Detection Challenge (2019). https://www.rsna.org/rsnai/ai-image-challenge/rsna-intracranial-hemorrhage-detection-challenge-2019
49. Godau, P. et al. Navigating prevalence shifts in image analysis algorithm deployment. Medical Image Analysis 102, 103504 (2025).
50. Park, S. H. & Han, K. Methodologic Guide for Evaluating Clinical Performance and Effect of Artificial Intelligence Technology for Medical Diagnosis and Prediction. Radiology 286, 800–809 (2018).
51. Cardoso, M. J. et al. MONAI: An open-source framework for deep learning in healthcare. Preprint at https://doi.org/10.48550/ARXIV.2211.02701 (2022).
52. Xie, S., Girshick, R., Dollar, P., Tu, Z. & He, K. Aggregated Residual Transformations for Deep Neural Networks. in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 5987–5995 (IEEE, Honolulu, HI, 2017). doi:10.1109/CVPR.2017.634.
53. Chen, T. & Guestrin, C. XGBoost: A Scalable Tree Boosting System. in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 785–794 (Association for Computing Machinery, New York, NY, USA, 2016). doi:10.1145/2939672.2939785.
54. Liu, Q. et al.
Voxels Intersecting Along Orthogonal Levels Attention U-Net for Intracerebral Haemorrhage Segmentation in Head CT. in 2023 IEEE 20th International Symposium on Biomedical Imaging (ISBI) 1–5 (2023). doi:10.1109/ISBI53787.2023.10230843.
55. Li, X. et al. Hematoma Expansion Context Guided Intracranial Hemorrhage Segmentation and Uncertainty Estimation. IEEE Journal of Biomedical and Health Informatics 26, 1140–1151 (2022).
56. Li, X. et al. The state-of-the-art 3D anisotropic intracranial hemorrhage segmentation on non-contrast head CT: The INSTANCE challenge. Preprint at https://doi.org/10.48550/arXiv.2301.03281 (2023).
57. Smith, S. M. et al. Advances in functional and structural MR image analysis and implementation as FSL. NeuroImage 23, S208–S219 (2004).
58. Muschelli, J. et al. Validated automatic brain extraction of head CT images. NeuroImage 114, 379–385 (2015).

Additional Declarations

Competing interest reported. Z.F. Stock option holder of LVIS Corp. A.J. No relevant relationships. L.Y.C. No relevant relationships. H.S.N. No relevant relationships. M.P. No relevant relationships. C.G. No relevant relationships. B.A.A. No relevant relationships. A.K. No relevant relationships. D.L. No relevant relationships. A.W.C. No relevant relationships. M.I. No relevant relationships. D.B.L. Member of the Board of Chancellors of the American College of Radiology and Board of Trustees of the American Board of Radiology, shareholder in Bunkerhill Health; receives research funding from the Gordon and Betty Moore Foundation. A.S.C. receives research support from NIH grants R01 HL167974, R01 HL169345, R01 AR077604, R01 EB002524, R01 AR079431, P41 EB027060; ARPA-H grants AY2AX000045 and 1AYSAX0000024-01; and NIH contracts 75N92020C00008 and 75N92020C00021. Unrelated to this work, A.S.C.
receives research support from GE Healthcare, Philips, Microsoft, Amazon, Google, NVIDIA, Stability; has provided consulting services to Patient Square Capital, Chondrometrics GmbH, and Elucid Bioimaging; is co-founder of Cognita; has equity interest in Cognita, Subtle Medical, LVIS Corp, Brain Key.

Supplementary Files

SupplementaryInformation.docx

Version 1 posted

Editorial decision: Revision requested (13 Jul, 2025)
Reviews received at journal (06 Jul, 2025)
Reviews received at journal (06 Jun, 2025)
Reviewers agreed at journal (01 Jun, 2025)
Reviewers agreed at journal (01 Jun, 2025)
Reviewers agreed at journal (01 Jun, 2025)
Reviewers invited by journal (01 Jun, 2025)
Editor assigned by journal (21 May, 2025)
Submission checks completed at journal (21 May, 2025)
First submitted to journal (16 May, 2025)
Published version: https://doi.org/10.1038/s41746-025-02007-0 (published 16 Oct, 2025)
Preprint DOI: https://doi.org/10.21203/rs.3.rs-6683104/v1

Authors (all Department of Radiology, School of Medicine, Stanford University): Zhongnan Fang (corresponding author), Andrew Johnston, Lina Cheuy, Hye Sun Na, Magdalini Paschali, Camila Gonzalez, Bonnie A. Armstrong, Arogya Koirala, Derrick Laurel, Andrew Walker Campion, Michael Iv, Akshay S. Chaudhari, David B. Larson

Figure 1. Overview of Ensembled Monitoring Model (EMM) and example of how to stratify EMM agreement into suggested actions. a. Each sub-model within the EMM is trained to perform the same task as the primary ICH detection model. The independent sub-model outputs are then used to compute the level of agreement between the ICH detection model and EMM, helping quantify confidence in the reference prediction and suggesting an appropriate subsequent action. b. An example EMM use case is stratifying cases into categories of increased, similar, or decreased confidence in the primary model predictions after computing the level of EMM agreement. Through careful selection of the stratification thresholds based on the primary model's performance at different EMM agreement levels, this categorization enables radiologists to make different decisions based on the confidence level derived from the EMM agreement levels.

Figure 2. EMM agreement associated with different features. a. Example cases for which EMM showed different levels of agreement with the FDA-cleared primary ICH detection model. Cases with full EMM agreement typically showed clear presence or absence of ICH, while cases with partial agreement often displayed subtle ICH or features mimicking hemorrhage. b. Quantitative analysis on the importance of features affecting EMM agreement with the FDA-cleared primary model in ICH-positive and ICH-negative cases. The normalized weight of importance for all features sums to 100%.

Figure 3. EMM stratifies cases into different accuracy groups for the FDA-cleared model and enables customized clinical decision making. a. Cases stratified by EMM agreement levels demonstrated increased (green), similar (yellow), or decreased (red) accuracies compared to the baseline accuracy of the primary model without EMM (gray). b. Distribution (%) of cases classified as increased (green), similar (yellow), or decreased (red) confidence based on the EMM agreement thresholds. c. For the cases in which EMM indicated decreased confidence, a more detailed radiologist review is called for. Cases flagged for decreased confidence, but for which the primary model's prediction was correct, were defined as false alarms. Substantial relative gains over baseline accuracy using only the primary model were observed across all prevalence levels for ICH-positive primary model predictions, outweighing the burden of false alarms. For ICH-negative primary model predictions, however, this favorable balance between relative accuracy gains and false alarm burden was only observed at 30% prevalence.

Figure 4. EMM performance for detecting errors made by the commercial primary model generally improved with increasing (a) training data volume, (b) number of sub-models, and (c) sub-model sizes. Error detection sensitivity-PPV area under curve (ED-SPAUC) and specificity-NPV area under curve (ED-SNAUC) were measured across prevalences; higher values are desirable for both. S-Net represents small sub-model networks for EMM, L-Net represents large sub-model networks for EMM. Similar results for the open-source ICH model are shown in sFigure4.
Deep ensemble approaches have also emerged to evaluate prediction reliability by utilizing groups of models derived from the same parent model with varying augmentations\u003csup\u003e\u003cspan additionalcitationids=\"CR29 CR30 CR31\" citationid=\"CR28\" class=\"CitationRef\"\u003e28\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR32\" class=\"CitationRef\"\u003e32\u003c/span\u003e\u003c/sup\u003e. However, these methods require access to either the training dataset, model weights, or intermediate outputs, which is not practical when monitoring commercially available models. Since all FDA-cleared radiological AI models that are deployed clinically are black-box in nature, there currently exist no techniques to monitor such models in production in real time.\u003c/p\u003e \u003cp\u003eThus, there remains a critical need for a real-time monitoring system to automatically characterize confidence at the point-of-care (i.e., when the radiologists review the images and the black-box AI prediction in question). To address this need, we developed the Ensembled Monitoring Model (EMM) approach, which is inspired by clinical consensus practices, where individual opinions are validated through multiple expert reviews. Our EMM framework enables prospective real-time case-by-case monitoring, without requiring ground-truth labels or access to internal AI model components, making it deployable for black-box systems. Here, we demonstrate the effectiveness of the EMM approach in characterizing confidence of intracranial hemorrhage (ICH) detection AI systems (one FDA-cleared and one open-source) operating on head computed tomography (CT) imaging. In this clinically significant application requiring high reliability, we show how EMM can monitor AI model performance in real time and inform subsequent actions in cases flagged for decreased accuracy. 
The complementary use of a primary AI model with EMM can improve accuracy and user trust in the AI model, while potentially reducing the cognitive burden of interpreting ambiguous cases. We further investigate and provide key considerations for translating and implementing the EMM approach across different clinical scenarios.\u003c/p\u003e"},{"header":"Results","content":"\u003cp\u003e \u003c/p\u003e \u003cdiv id=\"Sec3\" class=\"Section2\"\u003e \u003ch2\u003eEnsembled Monitoring Model (EMM) Overview\u003c/h2\u003e \u003cp\u003eEmulating how clinicians achieve group consensus through a group of experts, the EMM framework was developed to estimate consensus among a group of models. Here, we refer to the model being monitored as the \u0026ldquo;\u003cem\u003eprimary model\u003c/em\u003e\u0026rdquo;. In this study, the EMM comprised five sub-models with diverse architectures trained for the identical task of detecting the presence of ICH (Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003ea). Each sub-model within the EMM independently processed the same input to generate its own binary prediction (e.g., the input image is positive or negative for ICH), in parallel to the primary ICH detection model. Each of the five EMM outputs was compared to the primary model\u0026rsquo;s output to quantify the agreement between each pair of predictions, yielding an agreement level from 0% to 100% (meaning that none or all five EMM sub-models agreed with the primary output). This level of agreement can translate into confidence in the primary output.\u003c/p\u003e \u003cp\u003eThe level of EMM agreement with the primary ICH detection model and the resulting degree of confidence also enable radiologists to make different decisions on a case-by-case basis. 
EMM\u0026rsquo;s agreement level with the primary model can stratify the primary predictions into three groups to represent cases in which the radiologist can have increased, similar, or decreased confidence after seeing EMM\u0026rsquo;s agreement with the primary model (Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003eb). This stratification allows radiologists to adjust their actions accordingly for each image read. For example, the primary model\u0026rsquo;s prediction might not be used in cases with decreased confidence, and these cases should be reviewed following a radiologist\u0026rsquo;s conventional image interpretation protocol. As we show in the following sections, such optimization may improve radiologist efficiency and reduce cognitive load.\u003c/p\u003e \u003c/div\u003e\n\u003ch3\u003eEMM agreement levels are associated with different features\u003c/h3\u003e\n\u003cp\u003eTo identify the features most commonly found in images with high EMM agreement, we conducted a visual examination of all 2,919 analyzed CT studies (50.1% male, age range: 2 months to 104.6 years). Two primary ICH detection models were evaluated: an FDA-cleared AI model, and an open-source AI model that placed second\u003csup\u003e\u003cspan citationid=\"CR33\" class=\"CitationRef\"\u003e33\u003c/span\u003e\u003c/sup\u003e in the RSNA 2019 ICH challenge\u003csup\u003e\u003cspan citationid=\"CR34\" class=\"CitationRef\"\u003e34\u003c/span\u003e\u003c/sup\u003e. 
The two primary AI models complemented each other for evaluation, with the FDA-cleared AI model showing lower sensitivity but higher specificity, precision, and overall accuracy than the open-source model (\u003cb\u003esFigure 1\u003c/b\u003e).\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eResults of the FDA-cleared primary model are shown in Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003ea, and the results of the open-source model are shown in \u003cb\u003esFigure 2\u003c/b\u003e. The FDA-cleared primary model and EMM demonstrated 100% agreement and correct classifications in 1,479 cases (51%, ICH positive 632, negative 847), primarily in cases with obvious hemorrhage or clearly normal brain anatomy. EMM showed partial agreement with the FDA-cleared model in 848 cases (29%, ICH positive 151, negative 697) when the FDA-cleared model was correct. EMM also showed partial agreement in 454 cases (16%, ICH positive 39, negative 415) when the FDA-cleared model was incorrect. Visual examination revealed that the cases with partial agreement typically presented with subtle ICH or contained imaging features that mimicked hemorrhage (e.g., hyperdensity, such as calcification or tumor). These cases of partial agreement provide an opportunity for further radiologist review. Finally, in 138 cases (4%, ICH positive 21, negative 117), EMM demonstrated 100% agreement with the FDA-cleared model, but EMM failed to detect that the FDA-cleared model\u0026rsquo;s prediction was wrong. These cases predominantly involved either extremely subtle hemorrhages or CT features that strongly mimicked hemorrhage patterns, confusing both the FDA-cleared model and EMM.\u003c/p\u003e \u003cp\u003eWe then quantitatively examined which features affected EMM agreement using Shapley analysis\u003csup\u003e\u003cspan citationid=\"CR35\" class=\"CitationRef\"\u003e35\u003c/span\u003e\u003c/sup\u003e. 
This analysis was performed on a data subset (N\u0026thinsp;=\u0026thinsp;281) with a comprehensive set of features manually labeled by radiologists spanning multiple categories, including pathology, patient positioning, patient information, and image acquisition and reconstruction parameters. In ICH-positive cases, hemorrhage volume emerged as the dominant feature for high EMM agreement, with larger volumes strongly corresponding to higher agreement (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003eb), as seen in our visual analysis. For ICH-negative cases, the predictive features for EMM agreement were more balanced. The top predictors for EMM agreement were brain volume, patient age, and image rotation. Some directional relationships between feature values and EMM agreement were also identified (\u003cb\u003esFigure 2\u003c/b\u003e). For ICH-positive cases, high hemorrhage volume and multi-compartmental hemorrhages resulted in higher EMM agreement. In ICH-negative cases, the presence of features that mimicked hemorrhages led to lower EMM agreement.\u003c/p\u003e\n\u003ch3\u003eEMM Enables Confidence-based Image Review Optimization\u003c/h3\u003e\n\u003cp\u003e \u003c/p\u003e \u003cp\u003eFollowing the EMM thresholds and suggested actions outlined in Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003eb, we established stratification thresholds for the FDA-cleared ICH detection model based on expert radiologist feedback and the model\u0026rsquo;s accuracy at different EMM agreement levels, shown in \u003cb\u003esFigure 3\u003c/b\u003e. For ICH-positive primary model predictions, we set thresholds of 100% EMM agreement for increased confidence in the primary model, 60% or 80% EMM agreement for similar confidence in the primary model, and 0%, 20% or 40% EMM agreement for decreased confidence in the primary model. 
For ICH-negative primary model predictions, thresholds were set at 100% EMM agreement for increased confidence, 20%, 40%, 60% or 80% EMM agreement for similar confidence, and 0% EMM agreement for decreased confidence. We then evaluated the overall accuracy of the primary model together with EMM for cases classified as increased, similar, and decreased confidence (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003ea). This evaluation was also performed across three different prevalences of 30%, 15%, and 5%, which are close to the prevalences observed at our institution across emergency, in-patient, and out-patient settings. As expected, overall accuracy was highest for cases in which the primary model and EMM showed high agreement, and overall accuracy was lowest when EMM showed a lower agreement level with the primary model. This was observed for both ICH-positive and ICH-negative primary predictions and across all prevalence levels. Most of the cases analyzed were classified as increased confidence based on EMM thresholds, followed by similar confidence, and lastly decreased confidence (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003eb).\u003c/p\u003e \u003cp\u003eTo assess the practical value of the EMM suggested actions, we analyzed the relative gains of the model compared to the cognitive load and loss of trust associated with incorrect classifications. Among the cases flagged for decreased confidence, those for which the primary model\u0026rsquo;s prediction remained correct despite low EMM agreement (and thus the decreased confidence classification) were considered false alarms. 
As shown in Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003ec, the potential for radiologists to be alerted to a possible incorrect primary model output and correct these cases substantially improved relative accuracy compared to the percentage of false alarms across all prevalence levels (relative accuracy improvements of 4.7%, 11%, and 38% versus false-alarm rates of 0.89%, 0.45%, and 0.14% at prevalence levels of 30%, 15%, and 5%, respectively). However, this net benefit was only observed for cases with ICH-negative primary model predictions at 30% prevalence (false-alarm rate of 1.1% versus relative accuracy improvement of 3.4%). At lower prevalence levels (15% and 5%), the already-high baseline accuracy of the primary model for ICH-negative cases (i.e. accuracy\u0026thinsp;=\u0026thinsp;0.93 and 0.98, respectively) meant that the burden of false alarms (1.3% and 1.4%) did not exceed the relative accuracy gains (1.4% and 0.41%). Similar results were also observed for the open-source ICH detection AI across all prevalences (\u003cb\u003esFigure 4\u003c/b\u003e).\u003c/p\u003e\n\u003ch3\u003eSub-model and Data Size Considerations when Training EMM\u003c/h3\u003e\n\u003cp\u003e \u003c/p\u003e \u003cp\u003eTo enable broader application and adoptability, we conducted a comprehensive analysis of how three key factors affect EMM performance: i) amount of training data used: 100% of the dataset (n\u0026thinsp;=\u0026thinsp;18,370), 50% of the dataset (n\u0026thinsp;=\u0026thinsp;9,185), 25% of the dataset (n\u0026thinsp;=\u0026thinsp;4,592), and 5% of the dataset (n\u0026thinsp;=\u0026thinsp;918), ii) number of EMM sub-models (1\u0026ndash;5), and iii) EMM sub-model size in relation to training data volume. 
EMM performance was measured by its ability to detect errors made by the primary model using error detection sensitivity-PPV area under curve (ED-SPAUC) and specificity-NPV area under curve (ED-SNAUC) across prevalences, as these metrics consider the overall error detection performance regardless of the agreement level threshold applied.\u003c/p\u003e \u003cp\u003e \u003cstrong\u003eTraining Data\u003c/strong\u003e \u003cp\u003eAs illustrated in Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003ea, EMM's ED-SPAUC for the FDA-cleared primary model generally decreased as the training data was reduced from 100% to 5% of the original dataset across all three prevalences. This suggested that EMM generally improves with increased training data volume, though the benefits begin to saturate after approximately 10,000 studies. This trend was also observed in the ED-SNAUC and for the open-source primary model (\u003cb\u003esFigure 5a\u003c/b\u003e).\u003c/p\u003e \u003c/p\u003e \u003cp\u003e \u003cstrong\u003eNumber of Ensemble Sub-models\u003c/strong\u003e \u003cp\u003eAs shown in Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003eb, EMM's ED-SPAUC increased as the number of sub-models increased from 1 to 4, before generally stabilizing at 5 across all three prevalence levels. Conversely, ED-SNAUC consistently improved as the number of sub-models increased from 1 to 5, across all prevalences. Similarly, both error detection metrics showed consistent improvement as the number of sub-models increased for the open-source primary AI (\u003cb\u003esFigure 5b\u003c/b\u003e). 
These results suggest that EMM performance generally increases with more sub-models, with 4 or 5 sub-models serving as an effective starting point for future applications.\u003c/p\u003e \u003c/p\u003e \u003cp\u003e \u003cb\u003eEnsemble Sub-model Size and Training Data\u003c/b\u003e: We next investigated how combinations of EMM sub-model size and training data volume affected EMM's performance in monitoring the primary model. We examined four scenarios: i) ensemble of small networks trained with 5% of the dataset (S-Net 5%-Data), ii) ensemble of small networks trained with 100% of the dataset (S-Net 100%-Data), iii) ensemble of large networks trained with 5% of the dataset (L-Net 5%-Data), and iv) ensemble of large networks trained with 100% of the dataset (L-Net 100%-Data). As shown in Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003ec, EMM generally achieved the best ED-SPAUC and ED-SNAUC with large networks and 100% of the training data and the worst performance with large networks and 5% of the training data, suggesting that a larger training dataset could benefit EMM performance. In a 5% prevalence setting, an ensemble of small networks trained with 5% of the training data achieved the highest ED-SPAUC and ED-SNAUC values. Similar findings were also observed for the open-source primary AI (\u003cb\u003esFigure 5c\u003c/b\u003e).\u003c/p\u003e \u003cp\u003eTaken together, these results provide insights into how the EMM approach can be developed and tailored for various real-time monitoring applications.\u003c/p\u003e"},{"header":"Discussion","content":"\u003cp\u003eAs interest in AI for healthcare rapidly grows, monitoring medical AI systems has become increasingly urgent to ensure AI\u0026rsquo;s trustworthiness, safety, and effectiveness. 
Recently, the FDA has proposed a lifecycle management approach for medical AI devices, requiring evaluation not only during FDA approval and pre-deployment phases, but also continuous monitoring after real-world implementation, and critical risk assessments for individual cases across AI systems\u003csup\u003e\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e\u003c/sup\u003e. However, the black-box nature of FDA-cleared commercial AI systems creates significant challenges for real-time case-specific monitoring. In this paper, we introduce EMM, a framework that monitors black-box clinical AI systems in real-time without requiring manual labels or access to the primary model\u0026rsquo;s internal components. Using an ensemble of independently trained sub-models that mirror the primary task, our framework measures confidence in AI predictions through agreement levels between the EMM sub-models and the primary model on a case-by-case basis. These agreement levels can then be used to stratify cases by confidence in the primary model\u0026rsquo;s prediction and suggest a subsequent action. This makes EMM a valuable tool that fills the critical gap in real-time, case-by-case monitoring for FDA-cleared black-box AI systems that would otherwise remain unmonitored.\u003c/p\u003e \u003cp\u003eOur approach enables quantification of confidence through EMM agreement levels with the primary model\u0026rsquo;s predictions. By applying appropriate thresholds to the level of agreement between the EMM and primary model, radiologists can differentiate between which predictions they can be more or less confident in, therefore optimizing their attention allocation and cognitive load. For example, with EMM, radiologists could feel more confident in over half of all cases interpreted. Notably, EMM also reliably identified cases of low confidence, allowing for focused review of these cases and greatly improving overall ICH detection accuracy. 
In our testing, EMM only failed alongside the primary model in a small percentage of cases (4%). These cases of both EMM and primary model being incorrect represent those with small ICH volumes or ICH-mimicking features.\u003c/p\u003e \u003cp\u003eThe stratification of confidence levels based on the EMM agreement levels also enables radiologists to make tailored decisions for each case. The thresholds for defining the three accuracy groups in this study were established based on expert radiologist assessment and the primary model\u0026rsquo;s performance at different EMM agreement levels (\u003cb\u003esFigure 3\u003c/b\u003e), with separate analyses for ICH-positive and ICH-negative primary model predictions. The thresholds to indicate increased and decreased confidence were specifically designated so that the overall ICH detection accuracy would be significantly higher or lower, respectively, than that with only the primary model (baseline). However, suboptimal EMM agreement thresholds (resulting in too many cases categorized as decreased confidence) can create an unfavorable trade-off where the burden of further reviewing false alarms, and the associated loss in trust in the EMM, outweighs the relative gains in accuracy. This inefficiency particularly impacts low-prevalence settings, where radiologists may waste valuable time reviewing cases that the primary model had already classified correctly (\u003cb\u003esFigure 6\u003c/b\u003e). This illustrates that although the overall EMM framework can be applied to broad applications, the agreement levels and thresholds may need case-specific definitions depending on the prevalence level.\u003c/p\u003e \u003cp\u003eVarying the technical parameters of EMM also revealed insight into the best practices for applying EMM to other clinical use cases. 
Our ablation study revealed that, as expected, larger datasets, a larger number of sub-models, and larger sub-models generally improve the EMM\u0026rsquo;s capability to detect errors in the primary AI model. We also observed that at 5% prevalence, large sub-models trained with 25% of the data or small sub-models trained with 5% of the data achieved optimal performance. This behavior can be explained by the relationship between model complexity and data volume. Specifically, large sub-models trained on the full dataset (with 41% prevalence) likely became too calibrated/overfitted to that specific prevalence distribution, causing suboptimal performance when testing on data with significantly different prevalence (5%)\u003csup\u003e36\u003c/sup\u003e. Training large sub-models with 25% of the data or small sub-models with 5% of the data may help the EMM balance the bias-variance tradeoff by learning meaningful patterns for generalizability, while not overfitting to the training prevalence. The differences observed in optimal dataset size and sub-model size across different prevalence levels can help inform the best technical parameters to start developing an EMM for a different use case, promoting greater adoptability across diseases.\u003c/p\u003e \u003cp\u003eBeyond using EMM to improve case-by-case primary model performance, as shown in this study, another potential application of EMM can be monitoring longitudinal changes in primary AI performance. As the EMM agreement levels are tracked over time, perturbations in the expected ranges can be identified over daily, weekly, or monthly periods. For example, any significant drifts in EMM agreement level distribution may signal changes in primary model performance due to shifts in patient demographics, image acquisition parameters, or clinical workflows. 
In this manner, the EMM approach can provide another dimension into the current radiology statistical process/quality control pipelines for continuous background monitoring\u003csup\u003e\u003cspan citationid=\"CR37\" class=\"CitationRef\"\u003e37\u003c/span\u003e,\u003cspan citationid=\"CR38\" class=\"CitationRef\"\u003e38\u003c/span\u003e\u003c/sup\u003e, in addition to reporting concordance.\u003c/p\u003e \u003cp\u003eWhile the EMM approach demonstrates several advantages in case-by-case AI monitoring, some limitations persist. Although the EMM does not require labels to perform its monitoring task, a key constraint is the need for labeled use-case-specific datasets when training the EMM for each clinical application, which could potentially limit broader adoption across diverse clinical institutions with different computing resources. However, with the recent maturity of LLMs and self-supervised model training techniques, this limitation may be largely overcome. For example, labels can now be automatically extracted from existing radiology reports using LLMs \u003csup\u003e\u003cspan additionalcitationids=\"CR11\" citationid=\"CR10\" class=\"CitationRef\"\u003e10\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e12\u003c/span\u003e\u003c/sup\u003e. Self-supervised training\u003csup\u003e\u003cspan citationid=\"CR39\" class=\"CitationRef\"\u003e39\u003c/span\u003e,\u003cspan citationid=\"CR40\" class=\"CitationRef\"\u003e40\u003c/span\u003e\u003c/sup\u003e also enables large foundation model training without manual annotation\u003csup\u003e\u003cspan citationid=\"CR41\" class=\"CitationRef\"\u003e41\u003c/span\u003e,\u003cspan citationid=\"CR42\" class=\"CitationRef\"\u003e42\u003c/span\u003e\u003c/sup\u003e, and only a small amount of labeled data would be required to further fine-tune the model for each use case. 
These recent developments enable periodic updates to EMM, allowing it to adapt to changes in patient populations, scanners, and imaging protocols, thereby maintaining consistent and robust performance over time. Another limitation is EMM's susceptibility to similar failure patterns as the primary model being monitored, such as in cases involving small, low-contrast hemorrhages or ICH-mimicking pathologies in this study. Of particular concern are instances where EMM fails simultaneously with the primary model while indicating complete consensus, as this could instill false confidence in clinicians and potentially increase misdiagnosis risk. This risk might be mitigated by training EMM in the future using synthetic datasets\u003csup\u003e\u003cspan citationid=\"CR43\" class=\"CitationRef\"\u003e43\u003c/span\u003e\u003c/sup\u003e with artificially generated difficult cases, such as those with less obvious hemorrhages, with ICH-mimicking features, representing diverse patient populations, or with various artifacts. As AI technology rapidly develops, many of the limitations currently facing EMM may be quickly overcome, presenting greater opportunities not only to improve EMM performance but also to reduce the resources required to implement the EMM approach itself.\u003c/p\u003e \u003cp\u003eIn conclusion, our EMM framework represents a significant advancement in black-box clinical AI monitoring, enabling case-by-case confidence estimation without requiring access to primary model parameters or intermediate outputs. By leveraging ensemble agreement levels, EMM provides actionable insights, potentially enhancing diagnostic confidence while reducing cognitive burden. 
As AI continues to integrate into clinical workflows, approaches like EMM that provide transparent confidence measures will be essential for maintaining trust, ensuring quality, and ultimately improving patient outcomes in resource-constrained environments.\u003c/p\u003e "},{"header":"Methods","content":"\u003cdiv id=\"Sec8\" class=\"Section2\"\u003e\u003cdiv id=\"Sec9\" class=\"Section3\"\u003e \u003ch2\u003eEMM Dataset and Training\u003c/h2\u003e \u003cp\u003eEMM consisted of 5 independently trained 3D convolutional neural networks (CNNs), comprising two versions with different numbers of trainable parameters: a large version utilizing ResNet\u003csup\u003e\u003cspan citationid=\"CR44\" class=\"CitationRef\"\u003e44\u003c/span\u003e\u003c/sup\u003e 101 and 152, and DenseNet\u003csup\u003e\u003cspan citationid=\"CR45\" class=\"CitationRef\"\u003e45\u003c/span\u003e\u003c/sup\u003e 121, 169, and 201; and a small version employing ResNet 18, 34, 50, 101, and 152. These networks were initialized using 2D ImageNet\u003csup\u003e\u003cspan citationid=\"CR46\" class=\"CitationRef\"\u003e46\u003c/span\u003e\u003c/sup\u003e pre-trained weights and adapted to 3D via the Inflated 3D (I3D)\u003csup\u003e\u003cspan citationid=\"CR47\" class=\"CitationRef\"\u003e47\u003c/span\u003e\u003c/sup\u003e method, which has shown success previously\u003csup\u003e\u003cspan citationid=\"CR42\" class=\"CitationRef\"\u003e42\u003c/span\u003e\u003c/sup\u003e. EMM sub-models were trained using the open-source RSNA 2019 ICH Detection Challenge\u003csup\u003e\u003cspan citationid=\"CR34\" class=\"CitationRef\"\u003e34\u003c/span\u003e,\u003cspan citationid=\"CR48\" class=\"CitationRef\"\u003e48\u003c/span\u003e\u003c/sup\u003e dataset and were evaluated using an independent dataset collected at our institution. 
We trained the models on different subsets of the RSNA dataset, including 18,370 (100%), 9,185 (50%), 4,592 (25%), and 918 (5%) studies, to evaluate EMM's performance across varying training data sizes. All subsets had an ICH prevalence of about 41%. Each model was trained for 100 epochs to ensure convergence with the Adam optimizer and a learning rate of 10\u003csup\u003e\u0026minus;\u0026thinsp;4\u003c/sup\u003e. All training was conducted on a server of four NVIDIA L40 GPUs using the PyTorch Lightning framework. Based on GPU memory constraints (48 GB), we set the batch size to 4 for ResNet models and 2 for DenseNet models.\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e\n\u003ch3\u003eEMM Evaluation Dataset\u003c/h3\u003e\n\u003cp\u003eWe evaluated the EMM using a dataset of 2,919 CT studies (1,315 ICH-positive and 1,604 ICH-negative, 45% ICH prevalence) with a balanced gender distribution (50.1% male, 49.8% female) and a wide age range (0.16-104.58 years; median: 67.13 years; interquartile range: 49.65-80.00 years). Since AI model performance is known to vary with disease prevalence\u003csup\u003e\u003cspan citationid=\"CR49\" class=\"CitationRef\"\u003e49\u003c/span\u003e,\u003cspan citationid=\"CR50\" class=\"CitationRef\"\u003e50\u003c/span\u003e\u003c/sup\u003e, we evaluated both the primary AI and EMM performance across different prevalence levels. A recent internal evaluation at our institution covering 8,935 studies between July and November 2024 revealed ICH prevalences of 34.77% for in-patient, 9.09% for out-patient, and 6.52% for emergency units, with an overall average prevalence of 16.70%. 
Based on these observations, we selected three representative prevalence levels for evaluation: 30%, 15%, and 5%.\u003c/p\u003e \u003cdiv id=\"Sec11\" class=\"Section2\"\u003e \u003ch2\u003eEMM Training\u003c/h2\u003e \u003cp\u003eTo prepare input data for the EMM, we preprocessed all non-contrast axial head CT DICOM images using the Medical Open Network for AI (MONAI) toolkit\u003csup\u003e\u003cem\u003e\u003cspan citationid=\"CR51\" class=\"CitationRef\"\u003e51\u003c/span\u003e\u003c/em\u003e\u003c/sup\u003e. The preprocessing pipeline consisted of several standardization steps: reorienting images to the \"left-posterior-superior\" (LPS) coordinate system, normalizing the in-plane resolution to 0.45 mm, and resizing (either cropping or padding depending on the matrix size) the in-plane matrix dimensions to 512\u0026times;512 pixels using PyTorch's adaptive average pooling method, while preserving the original slice resolution. During training, we employed random cropping in the slice dimension, selecting a contiguous block of 30 slices. For testing, we used a sliding window of 30 slices and averaged the ICH softmax probabilities across overlapping windows to generate the final prediction.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec12\" class=\"Section2\"\u003e \u003ch2\u003eICH Detection AI models\u003c/h2\u003e \u003cp\u003eWe evaluated EMM's monitoring capabilities on two distinct ICH detection AI systems: an FDA-cleared commercial product and an open-source model that secured second place in the RSNA 2019 ICH detection challenge\u003csup\u003e\u003cspan citationid=\"CR34\" class=\"CitationRef\"\u003e34\u003c/span\u003e,\u003cspan citationid=\"CR48\" class=\"CitationRef\"\u003e48\u003c/span\u003e\u003c/sup\u003e.\u003c/p\u003e \u003cp\u003eThe FDA-cleared model is a black-box system with undisclosed training data and architecture that provides binary labels for the presence of ICH and identifies suspicious slices. 
We monitored this model using EMM trained on the complete (100%, N\u0026thinsp;=\u0026thinsp;18,370) RSNA 2019 ICH Detection Challenge dataset. While this dataset's license restricts usage to academic and non-commercial purposes, we do not have access to information regarding whether the FDA-cleared ICH AI model utilized this dataset during its development.\u003c/p\u003e \u003cp\u003eThe open-source RSNA 2019 Challenge second-place model\u003csup\u003e\u003cspan citationid=\"CR33\" class=\"CitationRef\"\u003e33\u003c/span\u003e\u003c/sup\u003e employs a 2D ResNeXt-101\u003csup\u003e52\u003c/sup\u003e network for slice-level feature extraction, followed by two levels of bidirectional LSTM networks for feature summarization and ICH detection. We selected the second-place model rather than the first-place winner because retraining the top model would require extensive time while offering only marginal performance improvement (\u0026le;\u0026thinsp;2.3%) based on the leaderboard. Although the original open-source model was trained on the complete RSNA Challenge dataset, we retrained it using only 50% of the data and reserved the remaining 50% for EMM training. This simulates real-world deployment scenarios where the primary ICH detector and EMM are trained on different datasets. For both the FDA-cleared and the open-source ICH detection models, no additional image preprocessing was performed and the original DICOM images were used as input.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec13\" class=\"Section2\"\u003e \u003ch2\u003eManually Annotated Sub-dataset for Shapley Analysis\u003c/h2\u003e \u003cp\u003eTo comprehensively analyze features that drive high EMM agreement, we manually annotated a smaller dataset (N\u0026thinsp;=\u0026thinsp;281), including ICH segmentation, volume measurements, and identification of mimicking imaging features. This curated dataset comprised 210 ICH-positive and 71 ICH-negative subjects and their associated studies. 
The ICH-positive cases span 7 distinct ICH subtypes: subdural (SDH, N\u0026thinsp;=\u0026thinsp;35), subarachnoid (SAH, N\u0026thinsp;=\u0026thinsp;50), epidural (EDH, N\u0026thinsp;=\u0026thinsp;15), intraparenchymal (IPH, N\u0026thinsp;=\u0026thinsp;19), intraventricular (IVH, N\u0026thinsp;=\u0026thinsp;2), diffuse axonal injury (DAI, N\u0026thinsp;=\u0026thinsp;1), and multi-compartmental hemorrhages (Multi-H, N\u0026thinsp;=\u0026thinsp;88). Among the 71 ICH-negative cases, 43 were specifically selected to include features that mimic hemorrhages (e.g., hyperdensities such as calcification or tumor), while 28 were from normal subjects. A neuroradiology fellow reviewed and validated all clinical labels to ensure accurate ground truth for our analysis.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec14\" class=\"Section2\"\u003e \u003ch2\u003eComprehensive List of Features for Shapley Analysis\u003c/h2\u003e \u003cp\u003eFor the Shapley analysis, we prepared a comprehensive list of features including pathology-related metrics (ICH volume and type), patient characteristics (brain volume, age, gender), positioning parameters (rotation, translation), image acquisition parameters (pixel spacing, slice thickness, kVp, X-ray tube current, CT scanner manufacturer), and image reconstruction parameters (reconstruction convolution kernel and filter type).\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec15\" class=\"Section2\"\u003e \u003ch2\u003eExplaining Features Contributing to High EMM Agreement using Shapley Analysis\u003c/h2\u003e \u003cp\u003eTo elucidate the features contributing to the high level of agreement between EMM sub-models and the primary AI model, we conducted Shapley analysis\u003csup\u003e\u003cspan citationid=\"CR35\" class=\"CitationRef\"\u003e35\u003c/span\u003e\u003c/sup\u003e using the Python \"shap\" package (v0.46.0). 
This analysis employed an XGBoost\u003csup\u003e\u003cspan citationid=\"CR53\" class=\"CitationRef\"\u003e53\u003c/span\u003e\u003c/sup\u003e (v2.1.1) classifier to learn the relationship between feature values and EMM agreement and to evaluate each feature's contribution to high EMM agreement, quantified as a probability ranging from 0 to 1. Higher Shapley values indicate features that contribute more to 100% EMM agreement.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec16\" class=\"Section2\"\u003e \u003ch2\u003eICH Volume Estimation for Shapley Analysis\u003c/h2\u003e \u003cp\u003eTo evaluate whether ICH volume influences EMM monitoring performance, we implemented a systematic protocol for ICH volume estimation. First, we employed Viola-UNet\u003csup\u003e\u003cem\u003e\u003cspan citationid=\"CR54\" class=\"CitationRef\"\u003e54\u003c/span\u003e\u003c/em\u003e\u003c/sup\u003e, the winning model from the INSTANCE 2022 ICH Segmentation Challenge\u003csup\u003e\u003cem\u003e\u003cspan citationid=\"CR55\" class=\"CitationRef\"\u003e55\u003c/span\u003e,\u003cspan citationid=\"CR56\" class=\"CitationRef\"\u003e56\u003c/span\u003e\u003c/em\u003e\u003c/sup\u003e, to generate initial ICH segmentations. A radiology resident reviewed these segmentations and marked any errors directly on the images. A trained researcher then manually corrected the marked discrepancies using 3D Slicer software (version 5.6.2) to ensure accurate hemorrhage delineation. 
Finally, we calculated ICH volumes using the corrected ICH masks and image resolution data from the DICOM headers.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec17\" class=\"Section2\"\u003e \u003ch2\u003eEstimating Patient Brain Volume and Orientation Information for Shapley Analysis\u003c/h2\u003e \u003cp\u003eSince hemorrhage detection can be challenging in brains of different sizes or certain brain orientations, we analyzed brain volume and orientation as potential factors affecting EMM performance, alongside the previously mentioned features. Using the FMRIB Software Library\u003csup\u003e\u003cspan citationid=\"CR57\" class=\"CitationRef\"\u003e57\u003c/span\u003e\u003c/sup\u003e (FSL 6.0.7.13), we developed an automated pipeline following an established protocol\u003csup\u003e\u003cspan citationid=\"CR58\" class=\"CitationRef\"\u003e58\u003c/span\u003e\u003c/sup\u003e to extract brain masks and estimate brain volumes. We then employed FSL FLIRT (FMRIB's Linear Image Registration Tool) to perform 9-degree-of-freedom brain registration, aligning each image to the MNI 2019b non-symmetrical T1 brain template. The resulting rotation, translation, and scaling parameters were incorporated into our Shapley analysis as quantitative measures of brain orientation.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec18\" class=\"Section2\"\u003e \u003ch2\u003eAnalysis of the tradeoff between false alarm rate and the relative accuracy improvement\u003c/h2\u003e \u003cp\u003eWhen the decreased confidence group in Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003eb is further reviewed by radiologists, some cases may actually be found to be labeled correctly by the primary model; we consider these cases to be false alarms. The false-alarm rate is defined as the percentage of unnecessary reviews of correctly labeled cases. 
We assumed that radiologists always label the cases in the decreased confidence group correctly upon further review, thereby improving overall accuracy. We define relative improvement in accuracy as the percentage increase in accuracy after reviewing the decreased confidence group compared to the baseline accuracy of the primary model.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec19\" class=\"Section2\"\u003e \u003ch2\u003eStatistical Analysis\u003c/h2\u003e \u003cp\u003eTo assess the reliability of our model's performance metrics, we calculated 95% confidence intervals (CIs) using bootstrapping. We conducted 1,000 random draws with replacement from the set of ground-truth labels and corresponding model predictions. To create evaluation datasets at target prevalence levels (30%, 15%, and 5%) that differ from the original distribution (45%), we down-sampled ICH-positive cases and resampled ICH-negative cases. For example, to create datasets with a controlled 30% prevalence of ICH-positive cases, we performed random sampling with replacement from our original dataset. Specifically, we randomly selected 0.3 \u0026times; N\u003csub\u003en\u003c/sub\u003e ICH-positive cases and N\u003csub\u003en\u003c/sub\u003e ICH-negative cases (where N\u003csub\u003en\u003c/sub\u003e represents the total number of ICH-negative cases in the original dataset). After each draw, we computed key performance metrics such as sensitivity, positive predictive value (PPV), specificity, and negative predictive value (NPV). We then determined the 95% confidence intervals by identifying the 2.5th and 97.5th percentiles of these metrics across all bootstrap iterations. 
To test whether metrics differed significantly between groups, we also used bootstrapping to estimate p-values, under the null hypothesis that there is no difference between the two paired groups.\u003c/p\u003e \u003c/div\u003e"},{"header":"Declarations","content":"\u003ch2\u003eAuthor Contribution\u003c/h2\u003e\n\u003cp\u003eManuscript drafting and manuscript revision for important intellectual content, all authors; Study concepts and design: Z.F., A.S.K., D.B.L.; Data/statistical analysis: Z.F.; Data collection: D.L., A.W.C., M.I.; Data cleaning and annotation: Z.F., A.J., H.S.N., M.P., C.G., A.K., D.L., A.W.C.; Literature research: Z.F., L.Y.C., A.S.C., M.P., C.G.\u003c/p\u003e\n\u003ch2\u003eAcknowledgement\u003c/h2\u003e\n\u003cp\u003eStanford Department of Radiology; Stanford 3D and Quantitative Imaging Laboratory (3DQ Lab).\u003c/p\u003e\n\u003ch2\u003eData Availability\u003c/h2\u003e\n\u003cp\u003eThe EMM training dataset is based on the RSNA 2019 ICH Detection Challenge and can be found at https://www.rsna.org/rsnai/ai-image-challenge/rsna-intracranial-hemorrhage-detection-challenge-2019. Internal validation data are under review and will be published through Stanford. EMM code and model weights will be made publicly available on GitHub. The code repository has been uploaded for review.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\u003cli\u003e\u003cspan\u003eJoshi, G. \u003cem\u003eet al.\u003c/em\u003e FDA-Approved Artificial Intelligence and Machine Learning (AI/ML)-Enabled Medical Devices: An Updated Landscape. \u003cem\u003eElectronics\u003c/em\u003e 13, 498 (2024).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eChallen, R. \u003cem\u003eet al.\u003c/em\u003e Artificial intelligence, bias and clinical safety. \u003cem\u003eBMJ Qual Saf\u003c/em\u003e 28, 231\u0026ndash;237 (2019).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eDel Gaizo, A. J., Osborne, T. F., Shahoumian, T. \u0026amp; Sherrier, R. 
Deep Learning to Detect Intracranial Hemorrhage in a National Teleradiology Program and the Impact on Interpretation Time. \u003cem\u003eRadiology: Artificial Intelligence\u003c/em\u003e 6, e240067 (2024).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eCenter for Devices and Radiological Health. Blog: A Lifecycle Management Approach toward Delivering Safe, Effective AI-enabled Health Care. \u003cem\u003eFDA\u003c/em\u003e (2025).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eAllen, B. \u003cem\u003eet al.\u003c/em\u003e Evaluation and Real-World Performance Monitoring of Artificial Intelligence Models in Clinical Practice: Try It, Buy It, Check It. \u003cem\u003eJournal of the American College of Radiology\u003c/em\u003e 18, 1489\u0026ndash;1496 (2021).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eChow, J., Lee, R. \u0026amp; Wu, H. How Do Radiologists Currently Monitor AI in Radiology and What Challenges Do They Face? An Interview Study and Qualitative Analysis. \u003cem\u003eJ Digit Imaging. Inform. med.\u003c/em\u003e (2025) doi:\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1007/s10278-025-01493-8\u003c/span\u003e\u003cspan address=\"10.1007/s10278-025-01493-8\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLarson, D. B. \u003cem\u003eet al.\u003c/em\u003e Assessing Completeness of Clinical Histories Accompanying Imaging Orders Using Adapted Open-Source and Closed-Source Large Language Models. \u003cem\u003eRadiology\u003c/em\u003e 314, e241051 (2025).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eVan Veen, D. \u003cem\u003eet al.\u003c/em\u003e Adapted large language models can outperform medical experts in clinical text summarization. \u003cem\u003eNat Med\u003c/em\u003e 30, 1134\u0026ndash;1142 (2024).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLi, L. 
\u003cem\u003eet al.\u003c/em\u003e A scoping review of using Large Language Models (LLMs) to investigate Electronic Health Records (EHRs). Preprint at \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.48550/arXiv.2405.03066\u003c/span\u003e\u003cspan address=\"10.48550/arXiv.2405.03066\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e (2024).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLe Guellec, B. \u003cem\u003eet al.\u003c/em\u003e Performance of an Open-Source Large Language Model in Extracting Information from Free-Text Radiology Reports. \u003cem\u003eRadiology: Artificial Intelligence\u003c/em\u003e 6, e230364 (2024).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eReichenpfader, D., M\u0026uuml;ller, H. \u0026amp; Denecke, K. Large language model-based information extraction from free-text radiology reports: a scoping review protocol. \u003cem\u003eBMJ Open\u003c/em\u003e 13, e076865 (2023).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eReichenpfader, D., M\u0026uuml;ller, H. \u0026amp; Denecke, K. A scoping review of large language model based approaches for information extraction from radiology reports. \u003cem\u003enpj Digit. Med.\u003c/em\u003e 7, 222 (2024).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLambert, B., Forbes, F., Doyle, S., Dehaene, H. \u0026amp; Dojat, M. Trustworthy clinical AI solutions: A unified review of uncertainty quantification in Deep Learning models for medical image analysis. \u003cem\u003eArtificial Intelligence in Medicine\u003c/em\u003e 150, 102830 (2024).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGawlikowski, J. \u003cem\u003eet al.\u003c/em\u003e A survey of uncertainty in deep neural networks. 
\u003cem\u003eArtif Intell Rev\u003c/em\u003e 56, 1513\u0026ndash;1589 (2023).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKiyasseh, D., Cohen, A., Jiang, C. \u0026amp; Altieri, N. A framework for evaluating clinical artificial intelligence systems without ground-truth annotations. \u003cem\u003eNat Commun\u003c/em\u003e 15, 1808 (2024).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eRamalho, T. \u0026amp; Miranda, M. Density Estimation in Representation Space to Predict Model Uncertainty. in \u003cem\u003eEngineering Dependable and Secure Machine Learning Systems\u003c/em\u003e (eds. Shehory, O., Farchi, E. \u0026amp; Barash, G.) 84\u0026ndash;96 (Springer International Publishing, Cham, 2020). doi:\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1007/978-3-030-62144-5_7\u003c/span\u003e\u003cspan address=\"10.1007/978-3-030-62144-5_7\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eRaghu, M. \u003cem\u003eet al.\u003c/em\u003e Direct Uncertainty Prediction for Medical Second Opinions. in \u003cem\u003eProceedings of the 36th International Conference on Machine Learning\u003c/em\u003e 5281\u0026ndash;5290 (PMLR, 2019).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMalinin, A. \u0026amp; Gales, M. Predictive Uncertainty Estimation via Prior Networks. in \u003cem\u003eAdvances in Neural Information Processing Systems\u003c/em\u003e vol. 31 (Curran Associates, Inc., 2018).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKull, M. \u003cem\u003eet al.\u003c/em\u003e Beyond temperature scaling: Obtaining well-calibrated multi-class probabilities with Dirichlet calibration. in \u003cem\u003eAdvances in Neural Information Processing Systems\u003c/em\u003e vol. 32 (Curran Associates, Inc., 2019).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGuo, C., Pleiss, G., Sun, Y. 
\u0026amp; Weinberger, K. Q. On calibration of modern neural networks. in \u003cem\u003eProceedings of the 34th International Conference on Machine Learning - Volume 70\u003c/em\u003e 1321\u0026ndash;1330 (JMLR.org, Sydney, NSW, Australia, 2017).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKumar, A., Liang, P. S. \u0026amp; Ma, T. Verified Uncertainty Calibration. in \u003cem\u003eAdvances in Neural Information Processing Systems\u003c/em\u003e vol. 32 (Curran Associates, Inc., 2019).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLouizos, C. \u0026amp; Welling, M. Multiplicative Normalizing Flows for Variational Bayesian Neural Networks. in \u003cem\u003eProceedings of the 34th International Conference on Machine Learning\u003c/em\u003e 2218\u0026ndash;2227 (PMLR, 2017).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eRitter, H., Botev, A. \u0026amp; Barber, D. A Scalable Laplace Approximation for Neural Networks. in (2018).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWelling, M. \u0026amp; Teh, Y. W. Bayesian learning via stochastic gradient langevin dynamics. in \u003cem\u003eProceedings of the 28th International Conference on International Conference on Machine Learning\u003c/em\u003e 681\u0026ndash;688 (Omnipress, Madison, WI, USA, 2011).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGraves, A. Practical Variational Inference for Neural Networks. in \u003cem\u003eAdvances in Neural Information Processing Systems\u003c/em\u003e vol. 24 (Curran Associates, Inc., 2011).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGal, Y. \u0026amp; Ghahramani, Z. Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning. in \u003cem\u003eProceedings of The 33rd International Conference on Machine Learning\u003c/em\u003e 1050\u0026ndash;1059 (PMLR, 2016).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLemay, A. 
\u003cem\u003eet al.\u003c/em\u003e Improving the repeatability of deep learning models with Monte Carlo dropout. \u003cem\u003enpj Digit. Med.\u003c/em\u003e 5, 1\u0026ndash;11 (2022).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eEgele, R. \u003cem\u003eet al.\u003c/em\u003e AutoDEUQ: Automated Deep Ensemble with Uncertainty Quantification. in 2022 \u003cem\u003e26th International Conference on Pattern Recognition (ICPR)\u003c/em\u003e 1908\u0026ndash;1914 (2022). doi:\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1109/ICPR56361.2022.9956231\u003c/span\u003e\u003cspan address=\"10.1109/ICPR56361.2022.9956231\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMehrtash, A., Wells, W. M., Tempany, C. M., Abolmaesumi, P. \u0026amp; Kapur, T. Confidence Calibration and Predictive Uncertainty Estimation for Deep Medical Image Segmentation. \u003cem\u003eIEEE Trans Med Imaging\u003c/em\u003e 39, 3868\u0026ndash;3878 (2020).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWenzel, F., Snoek, J., Tran, D. \u0026amp; Jenatton, R. Hyperparameter Ensembles for Robustness and Uncertainty Quantification. in \u003cem\u003eAdvances in Neural Information Processing Systems\u003c/em\u003e vol. 33 6514\u0026ndash;6527 (Curran Associates, Inc., 2020).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLakshminarayanan, B., Pritzel, A. \u0026amp; Blundell, C. Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles. in \u003cem\u003eAdvances in Neural Information Processing Systems\u003c/em\u003e vol. 30 (Curran Associates, Inc., 2017).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKwon, Y., Won, J.-H., Kim, B. J. \u0026amp; Paik, M. C. Uncertainty quantification using Bayesian neural networks in classification: Application to biomedical image segmentation. 
\u003cem\u003eComputational Statistics \u0026amp; Data Analysis\u003c/em\u003e 142, 106816 (2020).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHanley, D. RSNA Intracranial Hemorrhage Detection Second Place Winner. (2024).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eFlanders, A. E. \u003cem\u003eet al.\u003c/em\u003e Construction of a Machine Learning Dataset through Collaboration: The RSNA 2019 Brain CT Hemorrhage Challenge. \u003cem\u003eRadiology: Artificial Intelligence\u003c/em\u003e 2, e190211 (2020).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLundberg, S. M. \u0026amp; Lee, S.-I. A Unified Approach to Interpreting Model Predictions. in \u003cem\u003eAdvances in Neural Information Processing Systems\u003c/em\u003e vol. 30 (Curran Associates, Inc., 2017).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMutasa, S., Sun, S. \u0026amp; Ha, R. Understanding artificial intelligence based radiology studies: What is overfitting? \u003cem\u003eClinical Imaging\u003c/em\u003e 65, 96\u0026ndash;99 (2020).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eFeng, J. \u003cem\u003eet al.\u003c/em\u003e Clinical artificial intelligence quality improvement: towards continual monitoring and updating of AI algorithms in healthcare. \u003cem\u003enpj Digit. Med.\u003c/em\u003e 5, 1\u0026ndash;9 (2022).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLarson, D. B. A Vision for Global CT Radiation Dose Optimization. \u003cem\u003eJournal of the American College of Radiology\u003c/em\u003e 21, 1311\u0026ndash;1317 (2024).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHuang, S.-C. \u003cem\u003eet al.\u003c/em\u003e Multimodal Foundation Models for Medical Imaging - A Systematic Review and Implementation Guidelines. 
2024.10.23.24316003 Preprint at \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1101/2024.10.23.24316003\u003c/span\u003e\u003cspan address=\"10.1101/2024.10.23.24316003\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e (2024).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHuang, S.-C. \u003cem\u003eet al.\u003c/em\u003e Self-supervised learning for medical image classification: a systematic review and implementation guidelines. \u003cem\u003enpj Digit. Med.\u003c/em\u003e 6, 1\u0026ndash;16 (2023).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eChen, Z. \u003cem\u003eet al.\u003c/em\u003e A Vision-Language Foundation Model to Enhance Efficiency of Chest X-ray Interpretation. Preprint at \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.48550/arXiv.2401.12208\u003c/span\u003e\u003cspan address=\"10.48550/arXiv.2401.12208\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e (2024).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBlankemeier, L. \u003cem\u003eet al.\u003c/em\u003e Merlin: A Vision Language Foundation Model for 3D Computed Tomography. Preprint at \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.48550/arXiv.2406.06512\u003c/span\u003e\u003cspan address=\"10.48550/arXiv.2406.06512\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e (2024).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBluethgen, C. \u003cem\u003eet al.\u003c/em\u003e A vision\u0026ndash;language foundation model for the generation of realistic chest X-ray images. \u003cem\u003eNat. Biomed. 
Eng\u003c/em\u003e 1\u0026ndash;13 (2024) doi:\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1038/s41551-024-01246-y\u003c/span\u003e\u003cspan address=\"10.1038/s41551-024-01246-y\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHe, K., Zhang, X., Ren, S. \u0026amp; Sun, J. Deep Residual Learning for Image Recognition. Preprint at \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.48550/arXiv.1512.03385\u003c/span\u003e\u003cspan address=\"10.48550/arXiv.1512.03385\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e (2015).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHuang, G., Liu, Z., Van Der Maaten, L. \u0026amp; Weinberger, K. Q. Densely Connected Convolutional Networks. in 2017 \u003cem\u003eIEEE Conference on Computer Vision and Pattern Recognition (CVPR)\u003c/em\u003e 2261\u0026ndash;2269 (IEEE, Honolulu, HI, 2017). doi:\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1109/CVPR.2017.243\u003c/span\u003e\u003cspan address=\"10.1109/CVPR.2017.243\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eRussakovsky, O. \u003cem\u003eet al.\u003c/em\u003e ImageNet Large Scale Visual Recognition Challenge. Preprint at \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.48550/arXiv.1409.0575\u003c/span\u003e\u003cspan address=\"10.48550/arXiv.1409.0575\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e (2015).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eCarreira, J. \u0026amp; Zisserman, A. Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset. 
in 2017 \u003cem\u003eIEEE Conference on Computer Vision and Pattern Recognition (CVPR)\u003c/em\u003e 4724\u0026ndash;4733 (IEEE, Honolulu, HI, 2017). doi:\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1109/CVPR.2017.502\u003c/span\u003e\u003cspan address=\"10.1109/CVPR.2017.502\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eRSNA Intracranial Hemorrhage Detection Challenge (2019). \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://www.rsna.org/rsnai/ai-image-challenge/rsna-intracranial-hemorrhage-detection-challenge-2019\u003c/span\u003e\u003cspan address=\"https://www.rsna.org/rsnai/ai-image-challenge/rsna-intracranial-hemorrhage-detection-challenge-2019\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGodau, P. \u003cem\u003eet al.\u003c/em\u003e Navigating prevalence shifts in image analysis algorithm deployment. \u003cem\u003eMedical Image Analysis\u003c/em\u003e 102, 103504 (2025).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003ePark, S. H. \u0026amp; Han, K. Methodologic Guide for Evaluating Clinical Performance and Effect of Artificial Intelligence Technology for Medical Diagnosis and Prediction. \u003cem\u003eRadiology\u003c/em\u003e 286, 800\u0026ndash;809 (2018).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eCardoso, M. J. \u003cem\u003eet al.\u003c/em\u003e MONAI: An open-source framework for deep learning in healthcare. 