Can Modern NLP Systems Reliably Annotate Chest Radiography Exams? A Pre-Purchase Evaluation and Comparative Study of Solutions from AWS, Google, Azure, John Snow Labs, and Open-Source Models on an Independent Pediatric Dataset

doi:10.21203/rs.3.rs-6772394/v1

2025 · doi:10.21203/rs.3.rs-6772394/v1

preprint OA: closed CC-BY-4.0

Shruti Hegde, Mabon Ninan, Jonathan R. Dillman, Shireen Hayatghaibi, and 2 more

This is a preprint; it has not been peer reviewed by a journal. https://doi.org/10.21203/rs.3.rs-6772394/v1 This work is licensed under a CC BY 4.0 License. Status: Posted. Version 1 posted.

Abstract

This study compares four commercial clinical NLP tools - Amazon Comprehend Medical, Google Healthcare NLP, Azure Clinical NLP, and SparkNLP - alongside dedicated radiograph labelers CheXpert and CheXbert for pediatric chest radiograph (CXR) report labeling.
Using 95,008 pediatric CXR reports from a large academic hospital, we extracted entities and assertion statuses (positive, negative, uncertain) from findings and impressions, mapped them to 13 categories (12 disease categories and a No Findings category), and compared performance using Fleiss' Kappa and accuracy against a pseudo-ground truth. Entity extraction varied widely: SparkNLP extracted 49,688 unique entities, Azure 31,543, AWS 27,216, and Google 16,477. Assertion accuracy ranged from 50% (AWS) to 76% (SparkNLP), while CheXpert and CheXbert achieved 56%. Results reveal substantial performance variability, emphasizing the need for validation and careful review before deploying NLP tools for pediatric clinical report labeling.

Health sciences/Medical research/Paediatric research
Health sciences/Diseases/Respiratory tract diseases

Introduction

Patient Electronic Health Records (EHRs) have become central to discovery and innovation in data-driven healthcare [ 1 ]. EHRs encompass structured data such as diagnostic codes, physiological measurements, and medication records, as well as unstructured free-text clinical health records (e.g., office visit, inpatient, and surgical notes and imaging reports), which provide valuable insights to enhance care provision and inform clinical decision-making [ 2 ]. Natural Language Processing (NLP) has emerged as a crucial tool for extracting insights from these unstructured clinical reports [ 3 ], a task that is highly tedious when performed manually. Advancements in NLP have facilitated various clinical and research applications, including automated clinical coding [ 4 ], clinical decision support systems [ 5 ], predictive analytics for patient outcomes [ 6 , 7 ], and population health management [ 8 ].
Additionally, NLP enables the extraction of diagnostic information from radiology reports [ 9 ], enhances disease surveillance [ 10 ], supports large-scale clinical research [ 7 ], improves patient care through personalized treatment recommendations [ 11 ] and enables quality improvement efforts. NLP systems for processing clinical text can be categorized into rule-based and statistical learning-based approaches. Rule-based systems utilize medical ontology libraries that host expert-curated knowledge bases encompassing medical concepts, diagnostic codes, and medication categories [ 12 ]. In contrast, machine learning (ML) and hybrid (ML + rule-based) tools have been increasingly adopted for general-purpose clinical NLP applications [ 13 , 14 ]. Notable among these are recurrent neural networks (RNNs) and long short-term memory (LSTM) networks, which have significantly advanced the handling of sequential data [ 15 ]. However, the introduction of attention mechanisms through the Transformer architecture has surpassed previous benchmarks, establishing Transformers as the industry standard in NLP [ 16 ]. Transformer-based methods have led to the development of two primary model families: the GPT (Generative Pre-trained Transformer) family [ 17 ], excelling in applications such as question answering and summarization through autoregressive generation, and the BERT (Bidirectional Encoder Representations from Transformers) family [ 18 ], optimized for tasks like named entity recognition, assertion detection, entity linking, and relationship extraction. Named Entity Recognition (NER) is the task of identifying and classifying specific pieces of information (called entities) in unstructured text into predefined categories (e.g., disease, procedure entities). Assertion detection determines the status or context of a recognized entity in clinical text (e.g., positive/negative/uncertain). 
Relationship Extraction is the task of identifying and classifying semantic relationships between two or more entities mentioned in unstructured text. In the clinical context, this means detecting how entities are connected, such as recognizing that a medication is prescribed to treat a specific condition.

Chest radiography (CXR) is pivotal for diagnosing and monitoring respiratory and cardiovascular conditions such as pneumonia [ 19 ], tuberculosis [ 20 ], lung cancer [ 21 ], and heart failure [ 22 ]. CXR reports offer detailed diagnostic information that aids in identifying and tracking these conditions, thereby supporting informed treatment and patient management decisions. However, interpreting CXR images is inherently complex, with studies revealing significant variability among physicians and radiologists in their assessments [ 23 , 24 ]. This complexity makes CXR reports an ideal use case for evaluating NLP systems, as they encapsulate nuanced diagnostic information that challenges automated extraction and analysis.

The availability of several large public CXR datasets with images, reports, and disease labels has spurred the development and assessment of advanced computer vision, NLP, and multimodal AI models tailored specifically to CXR applications [ 25 , 26 ]. Specifically, the CheXpert framework [ 27 ], which includes a comprehensive dataset of CXR images, reports, and associated disease labels, provides multiple report labeling models that are used to generate labels for other public datasets, such as the MIMIC-CXR dataset [ 28 ]. Numerous research studies rely on these datasets to develop and evaluate their model performance [ 29 – 31 ]. Concurrently, general-purpose clinical NLP systems, ranging from open-source rule-based and machine learning models like MetaMap [ 32 ] and cTAKES [ 33 ] to recent commercial systems employing proprietary algorithms, are widely utilized for various clinical note processing tasks [ 34 ].
However, the performance of these more general systems on specific note types and tasks, such as pediatric CXR report labeling, is not well documented. This lack of standardized, independent comparison leaves users unaware of inherent errors in these systems and how such inaccuracies may propagate into downstream applications. A rigorous, standardized comparison of competing NLP systems on independent datasets is crucial to understanding their limitations, assessing their uncertainties, and ensuring more reliable integration into clinical workflows.

The primary objective of this study is to compare four commercial clinical NLP systems, Amazon Comprehend Medical (AWS) [ 35 ], Google Healthcare NLP (GC) [ 36 ], Azure Clinical NLP (AZ) [ 37 ], and SparkNLP (SP) from John Snow Labs [ 38 ], for extracting clinically relevant entities and determining their assertion status from an independent database of pediatric CXR reports. The secondary objective is to evaluate the performance of benchmark CXR report labeling models, CheXpert [ 27 ] and CheXbert [ 39 ], commonly used in CXR research, against an extraction pipeline constructed using general-purpose clinical NLP systems on this independent pediatric dataset.

Methods

Figure 1 shows the overall methodology of this analysis. Patient consent for this retrospective study was waived by the Institutional Review Board, which approved this study. All exams were de-identified of Protected Health Information (PHI) by the Radiology Informatics team.

a) Dataset: Pediatric CXR reports were extracted from the radiology information system at a large pediatric hospital for the study period spanning June 2015 to June 2020. A total of 95,008 examinations constituted the study sample. The mean age of the participants was 7.5 ± 7.5 years, with a male-to-female ratio of approximately 1.1:1 (50,503 males, 44,286 females, 219 unknown). The CXR reports were typically in a semi-structured format, comprising the following sections: 1.
Clinical History, 2. Findings, and 3. Impression. Each CXR radiology report was stored as an individual text file, linked via anonymized identifiers to ensure patient confidentiality.

b) General purpose clinical NLP systems: This study compares four commercial clinical NLP systems: AWS [ 35 ], GC [ 36 ], AZ [ 37 ], and SP’s NLP models for radiology reports [ 38 ]. The clinical NLP systems for AWS, AZ, and GC were accessed via their respective cloud-based Application Programming Interfaces (APIs), all of which are compliant with the Health Insurance Portability and Accountability Act (HIPAA, USA). Deidentified radiology reports were submitted to these APIs, and the output results were returned in JSON format. Each of the three cloud-based systems performed four major tasks: entity extraction, assertion detection, entity linking, and relationship extraction [ 40 , 41 ]. In contrast, the Spark NLP (SP) radiology models were accessed through an academic license and run on local hardware. The extraction pipeline for SP consisted of BERT-based radiology models specifically designed for entity extraction and assertion detection. Unlike the cloud-based systems, the SP models are tailored to handle the unique nuances of radiology reports.

c) Entity recognition: The output data structure and the number of entity categories extracted were unique to each NLP system. A custom postprocessing pipeline was developed to handle the JSON outputs from each system. The pipeline focused exclusively on disease-related entity categories and filtered out entities in other categories for the analysis. Table 1 provides the total number of named entity categories extracted by each system and the entity categories that were retained for analysis. The categories for analysis are defined so that they are synonymous with disease symptoms or findings.
The extracted entities from each NLP system were also normalized using a lemmatization algorithm such that related terms and phrases appear consistent for calculating descriptive statistics.

Table 1 Details of the commercial NLP systems used in this study, the number of named entity categories detected by each system, and the selected categories that relate to disease symptoms or findings.

| NLP system | Named entity categories | Categories selected for analysis |
| --- | --- | --- |
| AWS – Amazon Comprehend Medical [v1.0.0, DetectEntities] | 7 | ‘MEDICAL_CONDITION’ |
| AZ – Azure Text Analytics for Health | 36 | ‘SYMPTOM_OR_SIGN’, ‘DIAGNOSIS’ |
| GC – Google Healthcare NLP system | 28 | ‘PROBLEM’ |
| SP – John Snow Labs Spark NLP [ner_radiology model] | 13 | ‘Disease_Syndrome_Disorder’, ‘Symptom’, ‘ImagingFindings’ |

d) Standardization of assertion status: Assertion refers to the context in which entities are mentioned in a clinical report, which is crucial for effective use of clinical notes in downstream applications. Each of the four commercial NLP systems identified assertions for extracted entities in different ways. Except for AWS, the systems provided a categorical assertion status with unique definitions for each category. The AWS model provided only a numerical confidence score (CS) for the negation attribute associated with each detected entity. For a standardized comparative analysis, assertions for each entity were consolidated into three categories: positive, negative, or uncertain. Table 2 outlines the original assertion categories for each NLP system and how they were mapped into these three standardized assertion statuses for analysis.

Table 2 Standardization criteria for each NLP system, mapping assertion outputs to three standardized assertion statuses: Positive, Negative, and Uncertain for analysis. CS is the negation confidence score for AWS.
| NLP system | Positive | Negative | Uncertain |
| --- | --- | --- | --- |
| AWS – Amazon Comprehend Medical [v1.0.0, DetectEntities] | *CS < 0.25 | *CS > 0.75 | 0.25 ≤ *CS ≤ 0.75 |
| AZ – Azure Text Analytics for Health | ‘positive’ | ‘negative’ | ‘positivepossible’, ‘negativepossible’, ‘neutralpossible’ |
| GC – Google Healthcare NLP system | ‘likely’ | ‘unlikely’ | ‘somewhat_likely’, ‘somewhat_unlikely’, ‘uncertain’, ‘conditional’ |
| SP – John Snow Labs Spark NLP [ner_radiology model] | ‘present’, ‘none’, ‘past’ | ‘absent’, ‘family’, ‘someone else’, ‘planned’ | ‘hypothetical’, ‘possible’ |

*CS stands for confidence score

e) CheXpert disease labels and models: This study utilized two open-source specialized models, CheXpert and CheXbert, to extract disease labels from the impression section of the chest radiograph (CXR) reports based on a predefined set of 13 categories relevant to chest radiography. These categories include enlarged cardiomediastinum, cardiomegaly, lung opacity, lung lesion, edema, consolidation, pneumonia, atelectasis, pneumothorax, pleural effusion, pleural other, fracture, and a No Findings category indicating exams without known disease findings. CheXpert is a rule-based model that employs heuristic rules and expert-curated mappings to assign disease labels from the textual content of CXR reports. In contrast, CheXbert is built on a BERT-based architecture that leverages deep contextual embeddings for a more nuanced interpretation of CXR reports, demonstrating improved accuracy on the CheXpert dataset. Both models produce labels accompanied by one of three assertion statuses (positive, negative, or uncertain) for each of the 13 output categories.

f) Mapping clinical entities to CheXpert disease categories: Lemmatization is a process in natural language processing (NLP) that reduces a word to its base or dictionary form, also known as the lemma (e.g., diagnosed, diagnoses and diagnosing would be reduced to the base word diagnose).
The extracted named entities from the commercial NLP systems, even after lemmatization, exhibited multiple variations for the same clinical concepts, depending on the system. To map these varied entities to the disease labels identified by the CheXpert models, a comprehensive regular expression (RegEx) algorithm was developed. This algorithm was primarily designed using the dictionary of keywords employed by the CheXpert model and was expanded by reviewing phrases, abbreviations, and linguistic patterns specific to the pediatric dataset that correspond to each disease category. By filtering and grouping relevant entities under the appropriate disease labels, the RegEx algorithm provided a consistent and unified categorization across the dataset. Figure 2 illustrates how the algorithm detected "enlarged cardiomediastinum", a condition not often directly mentioned in CXR reports, by identifying related phrases and patterns. The complete details of the RegEx patterns used for each disease label are provided in Appendix A. This standardization was critical for enabling reliable aggregation, comparison, and subsequent analysis of disease labels detected by the commercial NLP systems against the CheXpert and CheXbert models.

g) Evaluation metrics: To evaluate the performance of commercial NLP systems in identifying disease-related entities from chest X-ray (CXR) reports, we analyzed both the total number of entities extracted by each system within individual report sections and the average number of entities extracted per report. To assess the breadth of information captured, we also counted the number of unique disease entities identified by each system across the dataset. Differences in the number of entities extracted per report between systems were statistically compared using paired t-tests across all model combinations, with Bonferroni correction applied to adjust for multiple comparisons.
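The RegEx mapping step described in this section can be illustrated with a minimal sketch. The patterns and the `map_entity` helper below are hypothetical stand-ins chosen for illustration; the study's actual expressions are those listed in Appendix A and are not reproduced here.

```python
import re
from typing import List

# Hypothetical patterns for illustration only; the study's real patterns
# (Appendix A) are more extensive and tuned to the pediatric reports.
LABEL_PATTERNS = {
    "pneumothorax": re.compile(r"\bpneumothora(?:x|ces)\b", re.IGNORECASE),
    "pleural effusion": re.compile(r"\bpleural\s+effusions?\b", re.IGNORECASE),
    "atelectasis": re.compile(r"\batelecta(?:sis|tic)\b", re.IGNORECASE),
    # Rarely named directly, so match descriptive phrasing in either order.
    "enlarged cardiomediastinum": re.compile(
        r"\b(?:widen\w*|enlarg\w*)\b.*\bmediastin\w*\b"
        r"|\bmediastin\w*\b.*\b(?:widen\w*|enlarg\w*)\b",
        re.IGNORECASE,
    ),
}


def map_entity(entity: str) -> List[str]:
    """Return every label whose pattern matches the extracted entity text."""
    return [
        label
        for label, pattern in LABEL_PATTERNS.items()
        if pattern.search(entity)
    ]


if __name__ == "__main__":
    for text in ("small left pleural effusion", "mediastinal widening"):
        print(text, "->", map_entity(text))
```

Returning a list rather than a single label lets one entity contribute to several categories, which seems a reasonable choice for phrases that mention more than one finding; whether the study's pipeline behaves this way is not stated.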
For assertion detection, we report the distribution of extracted entities classified as positive, negative, or uncertain for each section of the CXR report, using the standardization criteria defined in Table 2 . To evaluate whether the distribution of assertion types differed significantly between NLP systems, Chi-square tests of independence were performed separately for the Findings and Impression sections. For CheXpert labels comparison, inter-model agreement for assertion classification was assessed using Fleiss’ Kappa, calculated across all six NLP systems, four commercial systems (AWS, AZ, GC, and SP) and two open-source CXR-specific models (CheXbert and CheXpert), for each disease category. The assertion category absent was assigned when an NLP system did not detect a given label. Fleiss’ Kappa values were computed under two conditions. In the first condition ( All ), all exams were included irrespective of whether the disease was detected or marked as absent by the models. In the second condition ( Excluding Absent ), exams were excluded if all six models predicted the disease as absent . This approach allows for a more accurate assessment of inter-model agreement in cases where the disease is actually present or asserted by at least one model, avoiding artificial inflation of agreement due to consistent non-detection. To estimate model-specific assertion performance, a pseudo–ground truth was established using a majority voting strategy across outputs from all six NLP systems. For each disease entity, the consensus assertion label was determined by majority vote. If no majority was reached, the entity was assigned to the uncertain category. Assertion accuracy for each model was computed by comparing its predicted assertion category to the consensus label for each disease. A match was counted only when the assertion category exactly aligned with the consensus. 
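The consensus and accuracy calculation just described can be sketched as follows. This is a minimal illustration, assuming that "majority" means more than half of the six votes; the text does not say whether a plurality would also count, and the function names are hypothetical. The two example vote lists follow the pattern of samples 2 and 5 in Table 6.

```python
from collections import Counter
from typing import List, Sequence


def consensus_label(votes: Sequence[str]) -> str:
    """Majority vote; fall back to 'uncertain' when no label exceeds half."""
    label, count = Counter(votes).most_common(1)[0]
    # Assumption: a strict majority (more than half of the votes) is required.
    return label if count > len(votes) / 2 else "uncertain"


def assertion_accuracy(predicted: Sequence[str], consensus: Sequence[str]) -> float:
    """Fraction of exams where the prediction exactly matches the consensus."""
    matches = sum(p == c for p, c in zip(predicted, consensus))
    return matches / len(consensus)


# Vote order per exam: AWS, AZ, GC, SP, CheXbert, CheXpert
votes_per_exam: List[List[str]] = [
    ["negative", "positive", "negative", "uncertain", "negative", "negative"],
    ["positive", "positive", "positive", "negative", "uncertain", "uncertain"],
]

consensus = [consensus_label(v) for v in votes_per_exam]
az_accuracy = assertion_accuracy([v[1] for v in votes_per_exam], consensus)
chexpert_accuracy = assertion_accuracy([v[5] for v in votes_per_exam], consensus)

if __name__ == "__main__":
    print(consensus, az_accuracy, chexpert_accuracy)
```

Because the reference is built from the same six systems being scored, a system that often sides with the majority is rewarded by construction, which is worth keeping in mind when reading the accuracy figures.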
Mean accuracy was reported separately for each disease, for each model, and as an overall mean, providing a summary of assertion performance.

Results

a) Named entity recognition by commercial clinical NLP models:

Table 3 Number of disease related entities extracted by each clinical NLP system for the Findings and Impressions sections of the CXR reports in the study dataset (n = 95,008). Bolded numbers signify the highest count of extracted entities in the report sections.

| Section | AWS | AZ | GC | SP |
| --- | --- | --- | --- | --- |
| FINDINGS | 417,630 | **689,774** | 333,516 | 457,540 |
| IMPRESSION | 154,397 | **196,911** | 140,201 | 170,728 |
| All sections | 846,137 | **1,175,594** | 741,958 | 900,655 |

Table 3 shows the total number of disease-related entities extracted by each system and their distribution for the Findings and Impressions sections of the CXR report, along with the overall counts including all sections (Clinical History, Comparison and Procedure Comments). Figure 2 shows the average number of disease related entities extracted per report by each clinical NLP system along with the standard deviation, whereas Figure 3 shows the unique number of disease related entities extracted by each system across the study dataset. All pairwise comparisons in the number of extracted entities per report were statistically significant (Bonferroni-adjusted p-value < 0.001). AZ extracted the most entities per report (12.4), followed by SP (9.5), AWS (8.9), and GC (7.8). In terms of uniqueness, however, SP extracted considerably more entities (49,688) than the other three systems. AZ and AWS extracted 31,543 and 27,216 unique entities, respectively, while GC had the lowest count with only 16,477 unique entities. Figure 4 illustrates the count of the top five most frequently detected diseases extracted by each of the four NLP systems from the Impression section of the reports, highlighting both detection frequency and inter-system variability.
Pneumonia is the most frequently identified disease entity across all systems, with counts ranging from 28,037 (SP) to 32,392 (AZ), a moderate spread of 4,355 entities (≈ 14.4% variation relative to the midpoint of the two counts). Viral or reactive airway disease shows the smallest variation, with counts tightly clustered between 16,391 (SP) and 16,460 (AWS), just a 0.4% variation, indicating strong agreement. Atelectasis, however, shows the largest absolute disparity, with AZ identifying 15,774 instances while SP reports only 10,056, a spread of 5,718 entities (≈ 44.3% variation), suggesting significant disagreement. Pleural effusion counts range from 3,986 (GC) to 4,793 (AZ), a moderate spread of 807 (≈ 18.4% variation). Pneumothorax is the least commonly identified entity in the top five, ranging between 1,885 (GC) and 3,065 (AZ), a difference of 1,180 (≈ 47.7% variation) and the largest relative spread. These statistics suggest that while systems largely agree on certain entities, like viral or reactive airway disease and pleural effusion, others such as atelectasis and pneumothorax show greater variability, likely due to differences in model sensitivity, training data, and variability in how medical terms are used or recognized.

b) Assertion detection by commercial clinical NLP models:

Table 4 Comparison of assertion detection performance across different models (AWS, AZ, GC, SP) for various sections in CXR reports. Percentages are rounded to one decimal place.
| Assertion | System | FINDINGS: percentage (assertion count / total entities) | IMPRESSIONS: percentage (assertion count / total entities) |
| --- | --- | --- | --- |
| Positive | AWS | 51.2% (213,654 / 417,630) | 79.5% (122,652 / 154,397) |
| Positive | AZ | 72.3% (499,003 / 689,774) | 67.5% (132,857 / 196,911) |
| Positive | GC | 36.8% (122,789 / 333,516) | 61.0% (85,551 / 140,201) |
| Positive | SP | 70.5% (322,788 / 457,540) | 58.2% (99,342 / 170,728) |
| Negative | AWS | 48.3% (201,809 / 417,630) | 19.2% (29,617 / 154,397) |
| Negative | AZ | 26.1% (180,147 / 689,774) | 4.6% (9,125 / 196,911) |
| Negative | GC | 61.8% (206,052 / 333,516) | 21.8% (30,498 / 140,201) |
| Negative | SP | 26.9% (123,421 / 457,540) | 9.2% (15,777 / 170,728) |
| Uncertain | AWS | 0.5% (2,167 / 417,630) | 1.4% (2,128 / 154,397) |
| Uncertain | AZ | 1.5% (10,624 / 689,774) | 27.9% (54,929 / 196,911) |
| Uncertain | GC | 1.4% (4,675 / 333,516) | 17.2% (24,152 / 140,201) |
| Uncertain | SP | 2.5% (11,331 / 457,540) | 32.6% (55,609 / 170,728) |

Table 4 presents the distribution (detected assertion entities / total entities in report section) of assertion classifications for medical entities detected by the four models (AWS, AZ, GC, and SP) for the Findings and Impressions sections of chest X-ray (CXR) reports. In the Findings section, most entities were either positive or negative across all models, with uncertain classifications ranging from 0.5% (AWS) to 2.5% (SP). SP classified 70.5% of Findings entities as positive, whereas GC classified only 36.8%. For the Impression section, AWS classified 79.5% of entities as positive, while SP classified only 58.2%. Negative assertions were most frequent for GC (21.8%) and least frequent for AZ (4.6%). Uncertain classifications were minimal for AWS (1.4%) but reached 32.6% for SP. The Chi-square test of independence showed that the distribution of entity assertions differed significantly between NLP systems in both the Findings and Impression sections (p < 0.001). These variations highlight differences in assertion detection performance, which may impact clinical decision support applications.
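The Chi-square test of independence reported above can be reproduced mechanically from the Findings counts in Table 4. The sketch below is a plain-Python illustration, not the authors' analysis code; it treats every extracted entity as an independent observation, which is a simplification, and the helper names are hypothetical. The p-value uses the closed-form survival function of the chi-square distribution that holds for even degrees of freedom.

```python
import math
from typing import List, Sequence, Tuple


def chi_square_independence(table: Sequence[Sequence[float]]) -> Tuple[float, int]:
    """Pearson chi-square statistic and degrees of freedom for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


def chi2_sf_even_dof(x: float, dof: int) -> float:
    """P(X > x) for a chi-square variable; closed form valid for even dof only."""
    if dof % 2 != 0:
        raise ValueError("closed form requires an even number of degrees of freedom")
    half = x / 2.0
    term, series = 1.0, 1.0
    for i in range(1, dof // 2):
        term *= half / i
        series += term
    return math.exp(-half) * series


# Findings-section counts from Table 4: rows are AWS, AZ, GC, SP;
# columns are positive, negative, uncertain.
findings: List[List[int]] = [
    [213_654, 201_809, 2_167],
    [499_003, 180_147, 10_624],
    [122_789, 206_052, 4_675],
    [322_788, 123_421, 11_331],
]

statistic, dof = chi_square_independence(findings)
p_value = chi2_sf_even_dof(statistic, dof)

if __name__ == "__main__":
    print(f"chi2 = {statistic:.1f}, dof = {dof}, p < 0.001: {p_value < 0.001}")
```

With counts in the hundreds of thousands, even small differences in proportions yield a very small p-value, so the effect sizes in Table 4 are more informative than the significance level itself.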
c) Comparison with open-source CXR report labelers on CheXpert labels: Figure 5 shows the Fleiss’ Kappa values for each of 12 disease categories and the No Findings category, based on the assertion statuses (positive, negative, uncertain and absent) assigned by all six NLP models. The highest agreement was observed for pleural effusion (Kappa: 0.89 in All; 0.59 in Excluding Absent), while the lowest was for enlarged cardiomediastinum (Kappa: 0.25 in All; 0.05 in Excluding Absent). The mean Fleiss’ Kappa across all categories was 0.68 ± 0.17 (All) and 0.35 ± 0.14 (Excluding Absent). The individual assertion performance of the six NLP models against the consensus pseudo-ground truth, calculated using majority voting for the CheXpert labels, is shown in Table 5. The results are based only on the Impression section of the CXR reports. The lowest accuracy against the consensus was observed for the Consolidation category (14 ± 3)%, where AWS had the highest accuracy of 20% and GC had the lowest accuracy of 10%. Excluding No Findings, the highest mean accuracy was observed for Pleural Effusion (72 ± 6)%, with GC having the highest value of 82% and AWS having the lowest value of 62%. When considering the aggregate across all disease categories, SP outperformed the other NLP systems with an accuracy of (76 ± 4)%, while AWS trailed with an accuracy of 50%. The mean overall accuracy was (62 ± 9)%. Both CheXbert and CheXpert had an overall accuracy of (56 ± 8)%.

Table 5 Assertion accuracy (%) of the six NLP systems on the impression section of the study dataset with 95,008 chest X-ray (CXR) reports, evaluated against the consensus pseudo–ground truth on the CheXpert labels. Percentages are rounded to the nearest whole number.
| Disease category | AWS (%) | AZ (%) | GC (%) | SP (%) | CheXbert (%) | CheXpert (%) | Mean (%) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Atelectasis | 35 | 71 | 86 | 57 | 65 | 65 | 63 ± 15 |
| Cardiomegaly | 60 | 56 | 80 | 63 | 59 | 56 | 62 ± 8 |
| Consolidation | 20 | 12 | 10 | 13 | 12 | 17 | 14 ± 3 |
| Edema | 44 | 69 | 76 | 82 | 62 | 61 | 66 ± 12 |
| Enlarged Cardiomediastinum | 32 | 30 | 51 | 26 | 19 | 13 | 28 ± 12 |
| Fracture | 20 | 36 | 33 | 47 | 29 | 28 | 32 ± 8 |
| Lung Lesion | 47 | 65 | 61 | 70 | 55 | 55 | 59 ± 7 |
| Lung Opacity | 54 | 58 | 54 | 65 | 46 | 46 | 54 ± 7 |
| Pleural Effusion | 62 | 71 | 82 | 76 | 68 | 70 | 72 ± 6 |
| Pleural Other | 47 | 57 | 47 | 78 | 66 | 66 | 60 ± 11 |
| Pneumonia | 17 | 38 | 39 | 76 | 34 | 35 | 40 ± 18 |
| Pneumothorax | 40 | 52 | 56 | 43 | 42 | 41 | 46 ± 6 |
| No Findings | 72 | 98 | 70 | 89 | 67 | 67 | 77 ± 12 |
| Overall Mean | 50 ± 0 | 69 ± 5 | 63 ± 0 | 76 ± 4 | 56 ± 8 | 56 ± 8 | 62 ± 9 |

d) Sample reports with discrepant assertion statuses extracted by the NLP systems for CheXpert labels:

Table 6 Variability in Assertion Among Commercial NLP Systems - Divergent NLP Interpretations of Radiology Reports Impressions.

| Sample | Disease label | Report impression | Systems by assertion (Positive / Uncertain / Negative) |
| --- | --- | --- | --- |
| 1 | pneumothorax | 1. Pectus bars in place. 2. No appreciable pneumothorax on the left. | SP / CheXpert, CheXbert, AWS, GC / AZ |
| 2 | pneumonia | Findings consistent with viral or reactive airways disease without focal pneumonia. | AZ / SP / CheXpert, CheXbert, AWS, GC |
| 3 | cardiomegaly | No acute cardiopulmonary abnormality with stable cardiomegaly and fracture of one of the pacemakers leads. | CheXpert, CheXbert, AZ, GC / AWS / SP |
| 4 | consolidation | Right suprahilar opacity may represent developing consolidation, superimposed on findings of viral or reactive airways disease. | AWS, AZ, SP / CheXpert, CheXbert, GC |
| 5 | atelectasis | Viral/reactive airway disease, superimposed with right upper and left lower lobe airspace disease such as atelectasis/pneumonia. | AWS, AZ, GC / CheXpert, CheXbert / SP |
| 6 | edema | Mildly prominent pulmonary vasculature. Cardiac size appears mildly enlarged. No focal airspace disease or overt pulmonary edema is suspected. | GC / AZ / CheXpert, CheXbert, AWS, SP |
| 7 | lung lesion | No consolidation. Small oval lucency in the left upper lobe. It is not clear whether this represents superimposed shadows (mock effect) or a true finding. A short-term follow-up two-view chest x-ray is suggested to evaluate the persistence of this finding. If it does persist, it may represent a tiny bleb, bulla, or pneumatocele. The surrounding lung appears normal making a cavitary lesion less likely. | CheXpert, AWS, GC / CheXbert, AZ, SP |
| 8 | lung opacity | Poorly defined left lower lobe opacity concerning for developing pneumonia. | CheXpert, CheXbert, AWS / AZ, GC, SP |

Table 6 describes variability in assertion among commercial NLP systems. It illustrates how clinical NLP systems interpret assertions based on linguistic features in the impression section. In case 1 for pneumothorax, the impression "No appreciable pneumothorax on the left" led to mixed interpretations, with SP classifying it as positive, four systems marking it uncertain, and only AZ marking it negative. For pneumonia in case 2, the phrase "without focal pneumonia" resulted in most systems classifying it as negative, with only AZ marking it positive and SP classifying it as uncertain. In the cardiomegaly example (case 3), the phrase "No acute cardiopulmonary abnormality with stable cardiomegaly" led four systems to classify it as positive, while AWS remained uncertain and SP classified it as negative. In case 4, the hedging phrase "may represent developing consolidation" resulted in three systems classifying consolidation as uncertain and three as negative. Similarly, in case 5 for atelectasis, the complex description of "airspace disease such as atelectasis/pneumonia" divided systems, with three positive, two uncertain, and one negative. Notably, in case 8, the phrase "opacity concerning for developing pneumonia" led to a split between positive and uncertain classifications of lung opacity.
These findings demonstrate how radiological language, especially uncertainty markers, qualifiers, and alternative explanations, consistently challenges commercial NLP systems, with each interpreting probabilistic medical language and hedging expressions differently.

Discussion

This study is one of the first to quantify and compare the performance of commercial clinical NLP systems on a large independent dataset, and the first to do so on a study sample composed of pediatric chest radiograph reports. Commercial clinical NLP systems from three major cloud providers, namely Amazon, Google, and Azure, along with a radiology specific model from a well-known vendor, John Snow Labs, were analyzed on the tasks of named entity recognition and assertion detection. A standardization algorithm was developed to map the entities extracted by these systems to disease labels defined by the CheXpert framework. The CheXpert and CheXbert models provided a set of disease categories that served as a reference for further comparison. Model outputs were evaluated using a consensus ground truth derived from the outputs of all six NLP systems using a majority voting approach.

a) Variability in entity extraction

Comparison of the disease related entities extracted by the four commercial NLP systems revealed considerable variability in the mean number of entities per report and the number of unique entities detected. AZ recorded the highest mean number of entities per report, while SP produced the greatest number of unique entities. In addition, the top five most frequent disease or diagnosis entities reported had variable agreement between the systems. Viral or reactive airway disease had near perfect agreement (0.4% difference in counts), whereas counts for atelectasis and pneumothorax differed by more than 40%. These results illustrate the inconsistencies in named entity recognition and underline the importance of applying standardization techniques, such as the use of regular expressions, to achieve meaningful comparisons.
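The percentage differences quoted for the top five entities can be checked with a short calculation. The sketch below assumes the spread is expressed relative to the midpoint of the lowest and highest counts; that convention is inferred from the reported numbers rather than stated in the text, and the function name is hypothetical.

```python
from typing import Dict, Tuple


def relative_spread(low: int, high: int) -> float:
    """Difference between two counts as a percentage of their midpoint."""
    midpoint = (low + high) / 2
    return (high - low) / midpoint * 100


# Lowest and highest Impression-section counts reported in the Results.
top_five: Dict[str, Tuple[int, int]] = {
    "pneumonia": (28_037, 32_392),
    "viral or reactive airway disease": (16_391, 16_460),
    "atelectasis": (10_056, 15_774),
    "pleural effusion": (3_986, 4_793),
    "pneumothorax": (1_885, 3_065),
}

spreads = {name: round(relative_spread(lo, hi), 1) for name, (lo, hi) in top_five.items()}

if __name__ == "__main__":
    for name, value in spreads.items():
        print(f"{name}: {value}%")
```

Under this convention atelectasis has the largest absolute gap between systems, while pneumothorax has the largest gap relative to its own frequency.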
b) Variability in assertion detection

Large variability was also observed in the assertion detection performance of the four commercial NLP systems. First, each system reported assertion status in its own way, and the definitions were not consistent. Second, once the assertion statuses were grouped into positive, negative, and uncertain categories, their distributions differed significantly between the systems for both the Findings and Impression sections. In particular, in the Findings section the systems reported between 0.5% (AWS) and 2.5% (SP) of entities as uncertain, whereas in the Impression section the proportion of uncertain entities ranged from 1.4% (AWS) to 32.6% (SP).

c) Performance on CheXpert labels

When disease labels based on the CheXpert framework were evaluated, inter-model agreement was substantial when absent predictions were included (mean Kappa = 0.68) but dropped to fair agreement when those cases were excluded (mean Kappa = 0.35), highlighting variability in assertion classification when disease presence was detected by at least one model. The overall mean accuracy across all six models was 62%, with a standard deviation of 9%. Mean assertion accuracy for individual diseases ranged from 14% for consolidation to 77% for pleural effusion. The poor performance for consolidation may stem from the wide variety of expressions used in reports. As a descriptive imaging finding, consolidation is mentioned in more variable and nuanced ways and can appear in multiple conditions such as pneumonia or pulmonary edema. The higher accuracy for pleural effusion possibly results from its more explicit mentions. Across the disease categories, SP outperformed the other systems with an accuracy of 76%. CheXpert and CheXbert performed similarly for most categories, with notable differences only for consolidation and enlarged cardiomediastinum. Assertion accuracy varied by category, with SP performing best in six categories, GC in five, and AWS and AZ in one category each.
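The agreement and accuracy figures above rest on two computations that can be sketched in a few lines: Fleiss' kappa across the six systems (with and without exams that every system marked as absent) and accuracy against a majority-vote consensus. The sketch below is a minimal version under stated assumptions. The label strings, the function names, and the reading of "no clear majority" as a tie for the most frequent label (which is then set to uncertain) are choices made for this example and may differ from the study's code.

```python
from collections import Counter
from typing import List, Sequence


def fleiss_kappa(ratings: Sequence[Sequence[str]]) -> float:
    """Fleiss' kappa, where ratings[i][j] is the label system j gave to exam i.

    Every exam must be rated by the same number of systems (at least two).
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    category_totals: Counter = Counter()
    observed = 0.0
    for row in ratings:
        counts = Counter(row)
        category_totals.update(counts)
        # Share of agreeing system pairs for this exam.
        observed += (sum(c * c for c in counts.values()) - n_raters) / (
            n_raters * (n_raters - 1)
        )
    p_bar = observed / n_items
    # Chance agreement from the overall category proportions.
    p_e = sum((t / (n_items * n_raters)) ** 2 for t in category_totals.values())
    if p_e == 1.0:
        return float("nan")  # undefined when every rating falls in one category
    return (p_bar - p_e) / (1.0 - p_e)


def exclude_absent(ratings: Sequence[Sequence[str]]) -> List[Sequence[str]]:
    """Drop exams in which no system detected the disease at all."""
    return [row for row in ratings if any(label != "absent" for label in row)]


def consensus_label(row: Sequence[str]) -> str:
    """Most frequent label; a tie for first place is treated as 'uncertain' (assumption)."""
    ranked = Counter(row).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return "uncertain"
    return ranked[0][0]


def accuracy(system_labels: Sequence[str], truth: Sequence[str]) -> float:
    """Fraction of exams where one system's label equals the consensus label."""
    return sum(s == t for s, t in zip(system_labels, truth)) / len(truth)


if __name__ == "__main__":
    # Made-up labels for six systems on four exams, for demonstration only.
    demo = [
        ["absent"] * 6,
        ["positive"] * 4 + ["uncertain", "negative"],
        ["negative"] * 3 + ["positive"] * 3,
        ["negative"] * 5 + ["absent"],
    ]
    truth = [consensus_label(row) for row in demo]
    print("kappa (all):", fleiss_kappa(demo))
    print("kappa (excluding absent):", fleiss_kappa(exclude_absent(demo)))
    print("accuracy of system 0:", accuracy([row[0] for row in demo], truth))
```

Because exams that all systems mark as absent count as perfect agreement, removing them can lower kappa considerably, which is the pattern the two reported mean values show.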
d) Limitations and future directions

Investigation of cases with discrepant assertion statuses revealed that linguistic nuances influence each NLP model differently. No single system demonstrated uniform superiority. Instead, performance varied by disease category and by the contextual phrasing within report sections. These findings highlight the potential benefits of model ensembling, in which the complementary strengths of different models are combined to improve overall robustness and accuracy. This is reinforced by the variability observed across the cases in Table 6. Each system shows strengths and limitations depending on the linguistic structure, diagnostic ambiguity, and type of condition described. For instance, AZ accurately identified the absence of pneumothorax in sample 1 but failed to detect the absence of pneumonia in sample 2. AWS correctly recognized the absence of pneumonia in sample 2 but was uncertain about cardiomegaly in sample 3 because of the ambiguous language, although clinically this indicates a positive finding. SP successfully detected the absence of edema in sample 6 but missed the absence of pneumothorax in sample 1. Meanwhile, CheXpert, CheXbert, and GC all correctly flagged uncertain consolidation in sample 4 but failed to identify the absence of pneumothorax in sample 1. This inconsistency shows that NLP performance is highly context-dependent, shaped by how each system interprets medical uncertainty, negation, and diagnostic probability. It underscores the need for thorough validation of clinical NLP systems for specific use cases and institutions, as well as careful evaluation of non-standard definitions of uncertainty, before these systems are deployed.

One limitation of this study is that AWS did not provide a categorical assertion status and offered only a separate negation attribute. Confidence values were used to assign uncertainty in these cases.
The latest version of Amazon Comprehend Medical includes an attribute for low confidence that may provide a better estimate of uncertainty. In addition, the size of the dataset precluded manual annotation of CheXpert labels. Therefore, a pseudo ground truth derived from all six NLP systems was used. Cases without a clear majority were assigned to the uncertain category. Although the pseudo ground truth is useful for comparing systems against each other, performance metrics may differ when compared with labels assigned by radiologists. Future research should incorporate manual annotations and investigate the effect of these variabilities on downstream clinical and research applications to ensure that inherent errors do not propagate and adversely affect patient care. Additionally, this study is limited to pediatric chest X-ray reports and does not include adult cases or a broader range of imaging modalities. While this limits the direct generalizability of the findings, the insights gained are likely applicable to other modalities and populations.

Conclusion

Significant variability exists in named entity recognition and assertion assessment on CXR radiology reports across different NLP systems. These differences arise from variations in the clinical concept definitions inherent to each system and from the absence of a uniform approach to quantifying expressions of uncertainty. In addition, the diversity in how CXR reports describe radiological findings and impressions results in systems performing optimally under different conditions. Consequently, the use of automated NLP systems for labeling imaging exams and for downstream applications such as outcome prediction must be accompanied by careful task-specific evaluation and oversight.

Declarations

Author Contribution

Shruti Hegde and Elanchezhian Somasundaram wrote the main manuscript text. Shruti Hegde, Mabon Ninan, and Elanchezhian Somasundaram worked on the code. All authors reviewed the manuscript.
References

1. Soumya Upadhyay, H.-f.H., A Qualitative Analysis of the Impact of Electronic Health Records (EHR) on Healthcare Quality and Safety: Clinicians’ Lived Experiences. 22, Mar 3.
2. Campanella, P., et al., The impact of electronic health records on healthcare quality: a systematic review and meta-analysis. The European Journal of Public Health, 2016. 26(1): p. 60–64.
3. Wang, Y., et al., Clinical information extraction applications: a literature review. Journal of biomedical informatics, 2018. 77: p. 34–49.
4. Hang Dong, M.F., William Whiteley, Beatrice Alex, Joshua Matterson, Shaoxiong Ji, Jiaoyan Chen, Honghan Wu, Automated clinical coding: what, why, and where we are? npj Digital Medicine, 2022.
5. Ahmed, U., et al., Natural language processing for clinical decision support systems: a review of recent advances in healthcare. J Intell Connect Emerg Technol, 2023. 8(2): p. 1–17.
6. Emma L Barber, R.G., Christianne Persenaire, Melissa Simon, Natural Language Processing with Machine Learning to Predict Outcomes after Ovarian Cancer Surgery. 2020, Oct 14.
7. Velupillai, S., et al., Using clinical natural language processing for health outcomes research: overview and actionable suggestions for future advances. Journal of biomedical informatics, 2018. 88: p. 11–19.
8. Jerfy, A., O. Selden, and R. Balkrishnan, The Growing Impact of Natural Language Processing in Healthcare and Public Health. INQUIRY: The Journal of Health Care Organization, Provision, and Financing, 2024. 61: p. 00469580241290095.
9. Ewoud Pons, L.M.M.B., M G Myriam Hunink, Jan A Kors, Natural Language Processing in Radiology: A Systematic Review. RSNA, 2016.
10. Mahmud Omar, D.B., Benjamin Glicksberg, Eyal Klang, Utilizing natural language processing and large language models in the diagnosis and prediction of infectious diseases: A systematic review. American Journal of Infection Control, 2024. 52(9).
11. Kevin B Johnson, W.Q.W., Dilhan Weeraratne, Mark E Frisse, Karl Misulis, Kyu Rhee, Juan Zhao, Jane L Snowdon, Precision Medicine, AI, and the Future of Personalized Health Care. CTS, 2020.
12. Papadopoulos, P., et al., A systematic review of technologies and standards used in the development of rule-based clinical decision support systems. Health and Technology, 2022. 12(4): p. 713–727.
13. Slawomir Kierner, J.K., Zofia Kierner, Taxonomy of hybrid architectures involving rule-based reasoning and machine learning in clinical decision systems: A scoping review. 2023.
14. Aryan Arbabi, D.R.A., Sanja Fidler, Michael Brudno, Identifying Clinical Terms in Medical Text Using Ontology-Guided Machine Learning. JMIR Med Inform, 2019.
15. Benyamin Ghojogh, A.G., Recurrent Neural Networks and Long Short-Term Memory Networks: Tutorial and Survey. 2023.
16. Ashish Vaswani, N.S., Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin, Attention Is All You Need. 2017.
17. Radford, A., et al., Improving language understanding by generative pre-training. 2018.
18. Jacob Devlin, M.-W.C., Kenton Lee, Kristina Toutanova, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. 2018.
19. Makhnevich, A., et al., The clinical utility of chest radiography for identifying pneumonia: accounting for diagnostic uncertainty in radiology reports. American Journal of Roentgenology, 2019. 213(6): p. 1207–1212.
20. Piccazzo, R., F. Paparo, and G. Garlaschi, Diagnostic accuracy of chest radiography for the diagnosis of tuberculosis (TB) and its role in the detection of latent TB infection: a systematic review. The Journal of Rheumatology Supplement, 2014. 91: p. 32–40.
21. Kim, J. and K.H. Kim, Role of chest radiographs in early lung cancer detection. Translational lung cancer research, 2020. 9(3): p. 522.
22. Cardinale, L., et al., Effectiveness of chest radiography, lung ultrasound and thoracic computed tomography in the diagnosis of congestive heart failure. World journal of radiology, 2014. 6(6): p. 230.
23. Gatt, M., et al., Chest radiographs in the emergency department: is the radiologist really necessary? Postgraduate medical journal, 2003. 79(930): p. 214–217.
24. Johnson, J. and J.A. Kline, Intraobserver and interobserver agreement of the interpretation of pediatric chest radiographs. Emergency radiology, 2010. 17: p. 285–290.
25. Anis, S., et al., An overview of deep learning approaches in chest radiograph. IEEE Access, 2020. 8: p. 182347–182354.
26. Tiu, E., et al., Expert-level detection of pathologies from unannotated chest X-ray images via self-supervised learning. Nature biomedical engineering, 2022. 6(12): p. 1399–1406.
27. Jeremy Irvin, P.R., Michael Ko, Yifan Yu, Silviana Ciurea-Ilcus, Chris Chute, Henrik Marklund, Behzad Haghgoo, Robyn Ball, Katie Shpanskaya, Jayne Seekins, David A. Mong, Safwan S. Halabi, Jesse K. Sandberg, Ricky Jones, David B. Larson, Curtis P. Langlotz, Bhavik N. Patel, Matthew P. Lungren, Andrew Y. Ng, CheXpert: A Large Chest Radiograph Dataset with Uncertainty Labels and Expert Comparison. 2019.
28. Alistair E. W. Johnson, T.J.P., Seth J. Berkowitz, Nathaniel R. Greenbaum, Matthew P. Lungren, Chih-ying Deng, Roger G. Mark & Steven Horng, MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. 2019.
29. Chambon, P., et al., Roentgen: vision-language foundation model for chest x-ray generation. arXiv preprint arXiv:2211.12737, 2022.
30. Weber, T., et al., Post-hoc Orthogonalization for Mitigation of Protected Feature Bias in CXR Embeddings. arXiv preprint arXiv:2311.01349, 2023.
31. McDermott, M.B., et al. Chexpert++: Approximating the chexpert labeler for speed, differentiability, and probabilistic output. in Machine Learning for Healthcare Conference. 2020. PMLR.
32. Aronson, A.R. Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program. in Proceedings of the AMIA Symposium. 2001.
33. Savova, G.K., et al., Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications. Journal of the American Medical Informatics Association, 2010. 17(5): p. 507–513.
34. Bai, L., et al. Clinical entity extraction: comparison between MetaMap, cTAKES, CLAMP and Amazon Comprehend Medical. in 2021 32nd Irish Signals and Systems Conference (ISSC). 2021. IEEE.
35. AWS. https://aws.amazon.com/comprehend/medical/.
36. GC. Google Cloud. Available from: https://cloud.google.com/healthcare-api/.
37. AZ. Microsoft Azure. Available from: https://azure.microsoft.com/en-us/.
38. JSL. John Snow Labs. Available from: https://www.johnsnowlabs.com/.
39. Akshay Smit, S.J., Pranav Rajpurkar, Anuj Pareek, Andrew Y. Ng, Matthew P. Lungren, CheXbert: Combining Automatic Labelers and Expert Annotations for Accurate Radiology Report Labeling Using BERT. 2020.
40. Priyankar Bose, S.S., William C. Sleeman, Jatinder Palta, Rishabh Kapoor, Preetam Ghosh, A Survey on Recent Named Entity Recognition and Relationship Extraction Techniques on Clinical Texts. 2021.
41. Diego Pinheiro da Silva, W.d.R.F., Blanda Helena de Mello, Renata Vieira, Sandro José Rigo, Exploring named entity recognition and relation extraction for ontology and medical records integration. 2023.

Additional Declarations

No competing interests reported.

Supplementary Files

AppendixA.docx
Authors: Shruti Hegde (corresponding author; Cincinnati Children's Hospital Medical Center), Mabon Ninan (Texas A&M University), Jonathan R. Dillman (Cincinnati Children's Hospital Medical Center), Shireen Hayatghaibi (Cincinnati Children's Hospital Medical Center), Lynn Babcock (Cincinnati Children's Hospital Medical Center), Elanchezhian Somasundaram (Cincinnati Children's Hospital Medical Center).

Figure 1. This flowchart illustrates the methodology starting from pediatric CXR report extraction, followed by processing using the six NLP systems. Postprocessing steps include 1) entity standardization, 2) assertion categorization, and 3) aggregation into a consensus ground truth via majority voting. Performance of the models in entity extraction and assertion detection was analyzed on all entities extracted by the commercial NLP systems and on CheXpert-specific disease labels.

Figure 2. This image illustrates the regex (regular expression) used to search for phrases and patterns related to the "enlarged cardiomediastinum" condition. The Chest Anatomy Structures list the base keywords that must be present, followed by Attributes that describe the anatomical structures. This combination forms a valid pattern for identifying "enlarged cardiomediastinum". However, if Chest Anatomy Structures are followed by certain Mediastinal Conditions, the pattern is considered invalid and is excluded from the "enlarged cardiomediastinum" search.

Figure 3. The average number of disease-related entities extracted per CXR report by each clinical NLP system. AWS – Amazon Comprehend Medical, AZ – Azure Text Analytics for Health, GC – Google Healthcare Natural Language API, SP – John Snow Labs’ Spark NLP models for radiology. The whiskers represent the standard deviation from the average value.

Figure 4. Unique number of disease-related entities extracted by each clinical NLP system on 95,008 pediatric CXR reports. AWS – Amazon Comprehend Medical, AZ – Azure Text Analytics for Health, GC – Google Healthcare Natural Language API, SP – John Snow Labs’ Spark NLP models for radiology.

Figure 5. Count of the top 5 most frequently extracted disease entities by the 4 commercial NLP systems on the study dataset from the Impression section of the reports. AWS – Amazon Comprehend Medical, AZ – Azure Text Analytics for Health, GC – Google Healthcare Natural Language API, SP – John Snow Labs’ Spark NLP models for radiology.

Figure 6. Fleiss’ Kappa values for each CheXpert label calculated across the six NLP systems under two conditions: (All) includes all exams regardless of whether the disease was detected; (Excluding Absent) excludes exams where the disease was not detected by any of the six NLP systems.
Exams? A Pre-Purchase Evaluation and Comparative Study of Solutions from AWS, Google, Azure, John Snow Labs, and Open-Source Models on an Independent Pediatric Dataset","fulltext":[{"header":"Introduction","content":"\u003cp\u003ePatient Electronic Health Records (EHRs) have become central to discovery and innovation in data-driven healthcare [\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e]. EHRs encompass structured data such as diagnostic codes, physiological measurements, and medication records, as well as unstructured free-text clinical health records (e.g., office visit, inpatient, and surgical notes and imaging reports), which provide valuable insights to enhance care provision and inform clinical decision-making [\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e]. Natural Language Processing (NLP) has emerged as a crucial tool for extracting insights from these unstructured clinical reports [\u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e], a task that is highly tedious when performed manually. Advancements in NLP have facilitated various clinical and research applications, including automated clinical coding [\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e], clinical decision support systems [\u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e], predictive analytics for patient outcomes [\u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e6\u003c/span\u003e, \u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e], and population health management [\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e]. 
Additionally, NLP enables the extraction of diagnostic information from radiology reports [\u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e], enhances disease surveillance [\u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e10\u003c/span\u003e], supports large-scale clinical research [\u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e], improves patient care through personalized treatment recommendations [\u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e] and enables quality improvement efforts.\u003c/p\u003e \u003cp\u003eNLP systems for processing clinical text can be categorized into rule-based and statistical learning-based approaches. Rule-based systems utilize medical ontology libraries that host expert-curated knowledge bases encompassing medical concepts, diagnostic codes, and medication categories [\u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e12\u003c/span\u003e]. In contrast, machine learning (ML) and hybrid (ML\u0026thinsp;+\u0026thinsp;rule-based) tools have been increasingly adopted for general-purpose clinical NLP applications [\u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e, \u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e]. Notable among these are recurrent neural networks (RNNs) and long short-term memory (LSTM) networks, which have significantly advanced the handling of sequential data [\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e]. However, the introduction of attention mechanisms through the Transformer architecture has surpassed previous benchmarks, establishing Transformers as the industry standard in NLP [\u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e]. 
Transformer-based methods have led to the development of two primary model families: the GPT (Generative Pre-trained Transformer) family [\u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e], excelling in applications such as question answering and summarization through autoregressive generation, and the BERT (Bidirectional Encoder Representations from Transformers) family [\u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e], optimized for tasks like named entity recognition, assertion detection, entity linking, and relationship extraction. Named Entity Recognition (NER) is the task of identifying and classifying specific pieces of information (called entities) in unstructured text into predefined categories (e.g., disease, procedure entities). Assertion detection determines the status or context of a recognized entity in clinical text (e.g., positive/negative/uncertain). Relationship Extraction is the task of identifying and classifying semantic relationships between two or more entities mentioned in unstructured text. In the clinical context, this means detecting how entities are connected, such as recognizing that a medication is prescribed to treat a specific condition.\u003c/p\u003e \u003cp\u003eChest radiography (CXR) is pivotal for diagnosing and monitoring respiratory and cardiovascular conditions such as pneumonia [\u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e19\u003c/span\u003e], tuberculosis [\u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e20\u003c/span\u003e], lung cancer [\u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e], and heart failure [\u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e22\u003c/span\u003e]. CXR reports offer detailed diagnostic information that aids in identifying and tracking these conditions, thereby supporting informed treatment and patient management decisions. 
However, interpreting CXR images is inherently complex, with studies revealing significant variability among physicians and radiologists in their assessments [\u003cspan citationid=\"CR23\" class=\"CitationRef\"\u003e23\u003c/span\u003e, \u003cspan citationid=\"CR24\" class=\"CitationRef\"\u003e24\u003c/span\u003e]. This complexity makes CXR reports an ideal use case for evaluating NLP systems, as they encapsulate nuanced diagnostic information that challenges automated extraction and analysis.\u003c/p\u003e \u003cp\u003eThe availability of several large public CXR datasets with images, reports, and disease labels has spurred the development and assessment of advanced computer vision, NLP, and multimodal AI models tailored specifically to CXR applications [\u003cspan citationid=\"CR25\" class=\"CitationRef\"\u003e25\u003c/span\u003e, \u003cspan citationid=\"CR26\" class=\"CitationRef\"\u003e26\u003c/span\u003e]. Specifically, the CheXpert framework [\u003cspan citationid=\"CR27\" class=\"CitationRef\"\u003e27\u003c/span\u003e], which includes a comprehensive dataset of CXR images, reports, and associated disease labels, provides multiple report labeling models that are used to generate labels for other public datasets, such as the MIMICS CXR dataset [\u003cspan citationid=\"CR28\" class=\"CitationRef\"\u003e28\u003c/span\u003e]. Numerous research studies rely on these datasets to develop and evaluate their model performance [\u003cspan additionalcitationids=\"CR30\" citationid=\"CR29\" class=\"CitationRef\"\u003e29\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR31\" class=\"CitationRef\"\u003e31\u003c/span\u003e]. 
Concurrently, general-purpose clinical NLP systems, ranging from open-source rule-based and Machine Learning models like MetaMap [\u003cspan citationid=\"CR32\" class=\"CitationRef\"\u003e32\u003c/span\u003e] and cTAKES [\u003cspan citationid=\"CR33\" class=\"CitationRef\"\u003e33\u003c/span\u003e] to recent commercial systems employing proprietary algorithms, are widely utilized for various clinical note processing tasks [\u003cspan citationid=\"CR34\" class=\"CitationRef\"\u003e34\u003c/span\u003e]. However, the performance of these more general systems on specific note types and tasks, such as pediatric CXR report labeling, is not well documented. This lack of standardized, independent comparison leaves users unaware of inherent errors in these systems and how such inaccuracies may propagate into downstream applications. A rigorous, standardized comparison of competing NLP systems on independent datasets is crucial to understanding their limitations, assessing their uncertainties, and ensuring more reliable integration into clinical workflows.\u003c/p\u003e \u003cp\u003eThe primary objective of this study is to compare four commercial clinical NLP systems : Amazon Comprehend Medical (AWS) [\u003cspan citationid=\"CR35\" class=\"CitationRef\"\u003e35\u003c/span\u003e], Google Healthcare NLP (GC) [\u003cspan citationid=\"CR36\" class=\"CitationRef\"\u003e36\u003c/span\u003e], Azure Clinical NLP (AZ) [\u003cspan citationid=\"CR37\" class=\"CitationRef\"\u003e37\u003c/span\u003e], and SparkNLP (SP) from John Snow Labs [\u003cspan citationid=\"CR38\" class=\"CitationRef\"\u003e38\u003c/span\u003e], for extracting clinically relevant entities and determining their assertion status from an independent database of pediatric CXR reports. 
The secondary objective is to evaluate the performance of benchmark CXR report labeling models, CheXpert [\u003cspan citationid=\"CR27\" class=\"CitationRef\"\u003e27\u003c/span\u003e] and CheXbert [\u003cspan citationid=\"CR39\" class=\"CitationRef\"\u003e39\u003c/span\u003e], commonly used in CXR research, against an extraction pipeline constructed using general-purpose clinical NLP systems on this independent pediatric dataset.\u003c/p\u003e"},{"header":"Methods","content":"\u003cp\u003eFigure 1 shows the overall methodology of this analysis. Patient consent for this retrospective study was waived by the Institutional Review Board, which approved this study. All exams were de-identified of Protected Health Information (PHI) by the Radiology Informatics team.\u003c/p\u003e\n\u003cdiv id=\"Sec3\" class=\"Section2\"\u003e\n \u003ch2\u003ea) Dataset:\u003c/h2\u003e\n \u003cp\u003ePediatric CXR reports were extracted from the radiology information system at a large pediatric hospital for the study period spanning June 2015 to June 2020. A total of 95,008 examinations constituted the study sample for this study\u0026apos;s evaluation. The mean age of the participants was 7.5\u0026thinsp;\u0026plusmn;\u0026thinsp;7.5 years, with a male-to-female ratio of 1:2 (50503 males, 44286 females, 219 unknown). The CXR reports were typically in a semi-structured format, comprising the following sections: 1. Clinical History, 2. Findings, and 3. Impression. 
Each CXR radiology report was stored as an individual text file, linked via anonymized identifiers to ensure patient confidentiality.\u003c/p\u003e\n\u003c/div\u003e\n\u003ch3\u003eb) General-purpose clinical NLP systems:\u003c/h3\u003e\n\u003cp\u003eThis study compares four commercial clinical NLP systems: AWS [\u003cspan class=\"CitationRef\"\u003e35\u003c/span\u003e], GC [\u003cspan class=\"CitationRef\"\u003e36\u003c/span\u003e], AZ [\u003cspan class=\"CitationRef\"\u003e37\u003c/span\u003e], and SP\u0026rsquo;s NLP models for radiology reports [\u003cspan class=\"CitationRef\"\u003e38\u003c/span\u003e]. The clinical NLP systems for AWS, AZ, and GC were accessed via their respective cloud-based Application Programming Interfaces (APIs), all of which are compliant with the Health Insurance Portability and Accountability Act (HIPAA, USA). De-identified radiology reports were submitted to these APIs, and the results were returned in JSON format. Each of the three cloud-based systems performed four major tasks: entity extraction, assertion detection, entity linking, and relationship extraction [\u003cspan class=\"CitationRef\"\u003e40\u003c/span\u003e, \u003cspan class=\"CitationRef\"\u003e41\u003c/span\u003e].\u003c/p\u003e\n\u003cp\u003eIn contrast, the Spark NLP (SP) radiology models were accessed through an academic license and run on local hardware. The extraction pipeline for SP consisted of BERT-based radiology models specifically designed for entity extraction and assertion detection. Unlike the cloud-based systems, the SP models are tailored to handle the unique nuances of radiology reports.\u003c/p\u003e\n\u003ch3\u003ec) Entity recognition:\u003c/h3\u003e\n\u003cp\u003eThe output data structure and the number of entity categories extracted were unique to each NLP system. A custom postprocessing pipeline was developed to handle the JSON outputs from each system. 
The pipeline focused exclusively on disease-related entity categories and filtered out entities in other categories for the analysis. Table\u0026nbsp;\u003cspan class=\"InternalRef\"\u003e1\u003c/span\u003e provides the total number of named entity categories extracted by each system and the entity categories that were retained for analysis. The categories for analysis were chosen because they correspond to disease symptoms or findings. The extracted entities from each NLP system were also normalized using a lemmatization algorithm so that related terms and phrases appear in a consistent form when calculating descriptive statistics.\u003c/p\u003e\n\u003cdiv class=\"gridtable\"\u003e\u0026nbsp;\u003ctable id=\"Tab1\" border=\"1\"\u003e\n \u003ccaption language=\"En\"\u003e\n \u003cdiv class=\"CaptionNumber\"\u003eTable 1\u003c/div\u003e\n \u003cdiv class=\"CaptionContent\"\u003e\n \u003cp\u003eDetails of the commercial NLP systems used in this study, the number of named entity categories detected by each system, and the selected categories that relate to disease symptoms or findings.\u003c/p\u003e\n \u003c/div\u003e\n \u003c/caption\u003e\n \u003ccolgroup cols=\"3\"\u003e\u003c/colgroup\u003e\n \u003cthead\u003e\n \u003ctr\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eNLP system\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eNumber of Named Entity Categories\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eCategories Selected for Analysis\u003c/p\u003e\n \u003c/th\u003e\n \u003c/tr\u003e\n \u003c/thead\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u003cstrong\u003eAWS\u003c/strong\u003e \u0026ndash; Amazon Comprehend Medical [v1.0.0, DetectEntities]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e7\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u0026lsquo;MEDICAL_CONDITION\u0026rsquo;\u003c/p\u003e\n 
\u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u003cstrong\u003eAZ\u003c/strong\u003e \u0026ndash; Azure Text Analytics for Health\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e36\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u0026lsquo;SYMPTOM_OR_SIGN\u0026rsquo;, \u0026lsquo;DIAGNOSIS\u0026rsquo;\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u003cstrong\u003eGC\u003c/strong\u003e \u0026ndash; Google Healthcare NLP system\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e28\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u0026lsquo;PROBLEM\u0026rsquo;\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u003cstrong\u003eSP\u003c/strong\u003e \u0026ndash; John Snow Labs Spark NLP [ner_radiology model]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e13\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u0026lsquo;Disease_Syndrome_Disorder\u0026rsquo;, \u0026lsquo;Symptom\u0026rsquo;, \u0026lsquo;ImagingFindings\u0026rsquo;\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n \u003c/table\u003e\n\u003c/div\u003e\n\u003ch3\u003ed) Standardization of assertion status:\u003c/h3\u003e\n\u003cp\u003eAssertion refers to the context in which entities are mentioned in a clinical report, which is crucial for effective use of clinical notes in downstream applications. Each of the four commercial NLP systems identified assertions for extracted entities in different ways. Except for AWS, the other systems provided a categorical assertion status with unique definitions for each category. 
The AWS model provided only a numerical confidence score (CS) for the negation attribute associated with each detected entity. For a standardized comparative analysis, assertions for each entity were consolidated into three categories: positive, negative, and uncertain. Table\u0026nbsp;\u003cspan class=\"InternalRef\"\u003e2\u003c/span\u003e outlines the original assertion categories for each NLP system and how they were mapped into these three standardized assertion statuses for analysis.\u003c/p\u003e\n\u003cdiv class=\"gridtable\"\u003e\u0026nbsp;\u003ctable id=\"Tab2\" border=\"1\"\u003e\n \u003ccaption language=\"En\"\u003e\n \u003cdiv class=\"CaptionNumber\"\u003eTable 2\u003c/div\u003e\n \u003cdiv class=\"CaptionContent\"\u003e\n \u003cp\u003e\u003cem\u003eStandardization criteria for each NLP system, mapping assertion outputs to the three standardized assertion statuses used for analysis: Positive, Negative, and Uncertain. CS is the negation confidence score for AWS.\u003c/em\u003e\u003c/p\u003e\n \u003c/div\u003e\n \u003c/caption\u003e\n \u003ccolgroup cols=\"4\"\u003e\u003c/colgroup\u003e\n \u003cthead\u003e\n \u003ctr\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eNLP System\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003ePositive\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eNegative\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eUncertain\u003c/p\u003e\n \u003c/th\u003e\n \u003c/tr\u003e\n \u003c/thead\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u003cstrong\u003eAWS\u003c/strong\u003e \u0026ndash; Amazon Comprehend Medical [v1.0.0, DetectEntities]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e*CS\u0026thinsp;\u0026lt;\u0026thinsp;0.25\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e*CS\u0026thinsp;\u0026gt;\u0026thinsp;0.75\u003c/p\u003e\n 
\u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.25\u0026thinsp;\u0026le;\u0026thinsp;*CS\u0026thinsp;\u0026le;\u0026thinsp;0.75\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u003cstrong\u003eAZ\u003c/strong\u003e \u0026ndash; Azure Text Analytics for Health\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u0026lsquo;positive\u0026rsquo;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u0026lsquo;negative\u0026rsquo;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u0026lsquo;positivepossible\u0026rsquo;, \u0026lsquo;negativepossible\u0026rsquo;, \u0026lsquo;neutralpossible\u0026rsquo;\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u003cstrong\u003eGC\u003c/strong\u003e \u0026ndash; Google Healthcare NLP system\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u0026lsquo;likely\u0026rsquo;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u0026lsquo;unlikely\u0026rsquo;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u0026lsquo;somewhat_likely\u0026rsquo;, \u0026lsquo;somewhat_unlikely\u0026rsquo;, \u0026lsquo;uncertain\u0026rsquo;, \u0026lsquo;conditional\u0026rsquo;\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u003cstrong\u003eSP\u003c/strong\u003e \u0026ndash; John Snow Labs Spark NLP [ner_radiology model]\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u0026lsquo;present\u0026rsquo;, \u0026lsquo;none\u0026rsquo;, \u0026lsquo;past\u0026rsquo;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u0026lsquo;absent\u0026rsquo;, \u0026lsquo;family\u0026rsquo;, \u0026lsquo;someone else\u0026rsquo;, 
\u0026lsquo;planned\u0026rsquo;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u0026lsquo;hypothetical\u0026rsquo;, \u0026lsquo;possible\u0026rsquo;\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n \u003ctfoot\u003e\n \u003ctr\u003e\n \u003ctd colspan=\"4\"\u003e*CS stands for confidence score\u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tfoot\u003e\n \u003c/table\u003e\n\u003c/div\u003e\n\u003ch3\u003ee) CheXpert disease labels and models:\u003c/h3\u003e\n\u003cp\u003eThis study utilized two open-source specialized models, CheXpert and CheXbert, to extract disease labels from the impression section of the chest radiograph (CXR) reports based on a predefined set of 13 categories relevant to chest radiography. These categories include \u003cem\u003eenlarged cardiomediastinum\u003c/em\u003e, \u003cem\u003ecardiomegaly, lung opacity, lung lesion, edema, consolidation, pneumonia, atelectasis, pneumothorax, pleural effusion, pleural other, fracture\u003c/em\u003e, and a \u003cem\u003eNo Findings\u003c/em\u003e category indicating exams without known disease findings. CheXpert is a rule-based model that employs heuristic rules and expert-curated mappings to assign disease labels from the textual content of CXR reports. In contrast, CheXbert is built on a BERT-based architecture that leverages deep contextual embeddings for a more nuanced interpretation of CXR reports, demonstrating improved accuracy on the CheXpert dataset. 
Both models produce labels accompanied by one of three assertion statuses (positive, negative, or uncertain) for each of the 13 output categories.\u003c/p\u003e\n\u003cdiv id=\"Sec8\" class=\"Section2\"\u003e\n \u003ch2\u003ef) Mapping clinical entities to CheXpert disease categories:\u003c/h2\u003e\n \u003cp\u003eLemmatization is a process in natural language processing (NLP) that reduces a word to its base or dictionary form, also known as the lemma (e.g., diagnosed, diagnoses, and diagnosing would all be reduced to the base word diagnose). The extracted named entities from the commercial NLP systems, even after lemmatization, exhibited multiple variations for the same clinical concepts, depending on the system. To map these varied entities to the disease labels identified by the CheXpert models, a comprehensive regular expression (RegEx) algorithm was developed. This algorithm was primarily designed using the dictionary of keywords employed by the CheXpert model and was expanded by reviewing phrases, abbreviations, and linguistic patterns specific to the pediatric dataset that correspond to each disease category. By filtering and grouping relevant entities under the appropriate disease labels, the RegEx algorithm provided a consistent and unified categorization across the dataset.\u003c/p\u003e\n \u003cp\u003eFigure \u003cspan class=\"InternalRef\"\u003e2\u003c/span\u003e illustrates how the algorithm labeled \u003cem\u003e\u0026quot;enlarged cardiomediastinum\u0026quot;\u003c/em\u003e, a condition not often mentioned directly in CXR reports, by identifying related phrases and patterns. The complete details of the RegEx patterns used for each disease label are provided in Appendix A. 
This standardization was critical for enabling reliable aggregation, comparison, and subsequent analysis of disease labels detected by the commercial NLP systems against the CheXpert and CheXbert models.\u003c/p\u003e\n\u003c/div\u003e\n\u003ch3\u003eg) Evaluation metrics:\u003c/h3\u003e\n\u003cp\u003eTo evaluate the performance of commercial NLP systems in identifying disease-related entities from chest X-ray (CXR) reports, we analyzed both the total number of entities extracted by each system within individual report sections and the average number of entities extracted per report. To assess the breadth of information captured, we also counted the number of unique disease entities identified by each system across the dataset. Differences in the number of entities extracted per report between systems were statistically compared using paired t-tests across all model combinations, with Bonferroni correction applied to adjust for multiple comparisons.\u003c/p\u003e\n\u003cp\u003eFor assertion detection, we report the distribution of extracted entities classified as \u003cem\u003epositive, negative, or uncertain\u003c/em\u003e for each section of the CXR report, using the standardization criteria defined in Table \u003cspan class=\"InternalRef\"\u003e2\u003c/span\u003e. To evaluate whether the distribution of assertion types differed significantly between NLP systems, Chi-square tests of independence were performed separately for the Findings and Impression sections.\u003c/p\u003e\n\u003cp\u003eFor the comparison on CheXpert labels, inter-model agreement for assertion classification was assessed using Fleiss\u0026rsquo; Kappa, calculated for each disease category across all six NLP systems, namely the four commercial systems (AWS, AZ, GC, and SP) and the two open-source CXR-specific models (CheXbert and CheXpert). The assertion category \u003cem\u003eabsent\u003c/em\u003e was assigned when an NLP system did not detect a given label. 
Fleiss\u0026rsquo; Kappa values were computed under two conditions. In the first condition (\u003cem\u003eAll\u003c/em\u003e), all exams were included irrespective of whether the disease was detected or marked as \u003cem\u003eabsent\u003c/em\u003e by the models. In the second condition (\u003cem\u003eExcluding Absent\u003c/em\u003e), exams were excluded if all six models predicted the disease as \u003cem\u003eabsent\u003c/em\u003e. This approach allows for a more accurate assessment of inter-model agreement in cases where the disease is actually present or asserted by at least one model, avoiding artificial inflation of agreement due to consistent non-detection.\u003c/p\u003e\n\u003cp\u003eTo estimate model-specific assertion performance, a pseudo\u0026ndash;ground truth was established using a majority voting strategy across outputs from all six NLP systems. For each disease entity, the consensus assertion label was determined by majority vote. If no majority was reached, the entity was assigned to the uncertain category. Assertion accuracy for each model was computed by comparing its predicted assertion category to the consensus label for each disease. A match was counted only when the assertion category exactly aligned with the consensus. 
Mean accuracy was reported separately for each disease, for each model, and as an overall mean, providing a summary of assertion performance.\u003c/p\u003e"},{"header":"Results","content":"\u003cdiv id=\"Sec11\" class=\"Section2\"\u003e\n \u003ch2\u003ea) Named entity recognition by commercial clinical NLP models:\u003c/h2\u003e\n \u003cdiv class=\"gridtable\"\u003e\u0026nbsp;\u003ctable id=\"Tab3\" border=\"1\"\u003e\n \u003ccaption language=\"En\"\u003e\n \u003cdiv class=\"CaptionNumber\"\u003eTable 3\u003c/div\u003e\n \u003cdiv class=\"CaptionContent\"\u003e\n \u003cp\u003eNumber of disease related entities extracted by each clinical NLP system for the Findings and Impressions sections of the CXR reports in the study dataset (n\u0026thinsp;=\u0026thinsp;95,008). Bolded numbers signify the highest count of extracted entities in the report sections.\u003c/p\u003e\n \u003c/div\u003e\n \u003c/caption\u003e\n \u003ccolgroup cols=\"5\"\u003e\u003c/colgroup\u003e\n \u003cthead\u003e\n \u003ctr\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eSection\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eAWS\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eAZ\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eGC\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eSP\u003c/p\u003e\n \u003c/th\u003e\n \u003c/tr\u003e\n \u003c/thead\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u003cstrong\u003eFINDINGS\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e417,630\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e\u003cstrong\u003e689,774\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e333,516\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e457,540\u003c/p\u003e\n 
\u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u003cstrong\u003eIMPRESSION\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e154,397\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e\u003cstrong\u003e196,911\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e140,201\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e170,728\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u003cstrong\u003eAll sections\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e846,137\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e\u003cstrong\u003e1,175,594\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e741,958\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e900,655\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n \u003c/table\u003e\n \u003c/div\u003e\n \u003cp\u003eTable \u003cspan class=\"InternalRef\"\u003e3\u003c/span\u003e shows the total number of disease-related entities extracted by each system and their distribution for the Findings and Impressions sections of the CXR report, along with the overall counts including all sections (Clinical History, Comparison and Procedure Comments). Figure \u003cspan class=\"InternalRef\"\u003e2\u003c/span\u003e shows the average number of disease related entities extracted per report by each clinical NLP system along with the standard deviation, whereas Fig. \u003cspan class=\"InternalRef\"\u003e3\u003c/span\u003e shows the unique number of disease related entities extracted by each system across the study dataset. 
All pairwise comparisons in the number of extracted entities per report were statistically significant (Bonferroni-adjusted p-value\u0026thinsp;\u0026lt;\u0026thinsp;0.001). AZ extracted the most entities per report on average (12.4), followed by SP (9.5), AWS (8.9), and GC (7.8). In terms of uniqueness, however, SP extracted considerably more unique entities (49,688) than the other three systems. AZ and AWS extracted 31,543 and 27,216 unique entities, respectively, while GC had the lowest count with only 16,477 unique entities.\u003c/p\u003e\n \u003cp\u003eFigure 4 illustrates the count of the top five most frequently detected diseases extracted by each of the four NLP systems from the Impression section of the reports, highlighting both detection frequency and inter-system variability.\u003c/p\u003e\n \u003cp\u003e\u003cem\u003ePneumonia\u003c/em\u003e is the most frequently identified disease entity across all systems, with counts ranging from 28,037 (SP) to 32,392 (AZ), a moderate spread of 4,355 entities (\u0026asymp;\u0026thinsp;14.4% variation). \u003cem\u003eViral or reactive airway\u003c/em\u003e disease shows the smallest variation, with counts tightly clustered between 16,391 (SP) and 16,460 (AWS), just a 0.4% variation, indicating strong agreement. \u003cem\u003eAtelectasis\u003c/em\u003e, however, shows the largest absolute disparity, with AZ identifying 15,774 instances and SP only 10,056, a spread of 5,718 entities (\u0026asymp;\u0026thinsp;44.3% variation), suggesting significant disagreement. \u003cem\u003ePleural effusion\u003c/em\u003e has counts between 3,986 (GC) and 4,793 (AZ), a moderate spread of 807 entities (\u0026asymp;\u0026thinsp;18.4% variation). \u003cem\u003ePneumothorax\u003c/em\u003e is the least commonly identified of the top five, ranging between 1,885 (GC) and 3,065 (AZ), a difference of 1,180 entities (\u0026asymp;\u0026thinsp;47.7% variation). 
These statistics suggest that while the systems agree closely in absolute counts on certain entities, such as \u003cem\u003eviral or reactive airway disease and pleural effusion\u003c/em\u003e, others such as \u003cem\u003eatelectasis\u003c/em\u003e and \u003cem\u003epneumonia\u003c/em\u003e show larger absolute spreads, likely due to differences in model sensitivity, training data, and the handling of variable medical terminology.\u003c/p\u003e\n\u003c/div\u003e\n\u003cdiv id=\"Sec12\" class=\"Section2\"\u003e\n \u003ch2\u003eb) Assertion detection by commercial clinical NLP models:\u003c/h2\u003e\n \u003cdiv class=\"gridtable\"\u003e\u0026nbsp;\u003ctable id=\"Tab4\" border=\"1\"\u003e\n \u003ccaption language=\"En\"\u003e\n \u003cdiv class=\"CaptionNumber\"\u003eTable 4\u003c/div\u003e\n \u003cdiv class=\"CaptionContent\"\u003e\n \u003cp\u003eComparison of assertion status distributions across models (AWS, AZ, GC, SP) for the Findings and Impression sections of CXR reports. Percentages are rounded to one decimal place.\u003c/p\u003e\n \u003c/div\u003e\n \u003c/caption\u003e\n \u003ccolgroup cols=\"4\"\u003e\u003c/colgroup\u003e\n \u003cthead\u003e\n \u003ctr\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eAssertion\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eSystem\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eFINDINGS\u003c/p\u003e\n \u003cp\u003ePercentage\u003c/p\u003e\n \u003cp\u003e(Assertion count / Total entities)\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eIMPRESSIONS\u003c/p\u003e\n \u003cp\u003ePercentage\u003c/p\u003e\n \u003cp\u003e(Assertion count / Total entities)\u003c/p\u003e\n \u003c/th\u003e\n \u003c/tr\u003e\n \u003c/thead\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\" rowspan=\"4\"\u003e\n \u003cp\u003e\u003cstrong\u003ePositive\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eAWS\u003c/p\u003e\n 
\u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e51.2% (213,654 / 417,630)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e79.5% (122,652 / 154,397)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eAZ\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e72.3% (499,003 / 689,774)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e67.5% (132,857 / 196,911)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eGC\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e36.8% (122,789 / 333,516)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e61.0% (85,551 / 140,201)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eSP\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e70.5% (322,788 / 457,540)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e58.2% (99,342 / 170,728)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\" rowspan=\"4\"\u003e\n \u003cp\u003e\u003cstrong\u003eNegative\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eAWS\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e48.3% (201,809 / 417,630)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e19.2% (29,617 / 154,397)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eAZ\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e26.1% (180,147 / 689,774)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e4.6% (9,125 / 196,911)\u003c/p\u003e\n \u003c/td\u003e\n 
\u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eGC\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e61.8% (206,052 / 333,516)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e21.8% (30,498 / 140,201)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eSP\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e26.9% (123,421 / 457,540)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e9.2% (15,777 / 170,728)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\" rowspan=\"4\"\u003e\n \u003cp\u003e\u003cstrong\u003eUncertain\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eAWS\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e0.5% (2,167 / 417,630)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e1.4% (2,128 / 154,397)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eAZ\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e1.5% (10,624 / 689,774)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e27.9% (54,929 / 196,911)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eGC\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e1.4% (4,675 / 333,516)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e17.2% (24,152 / 140,201)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eSP\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e2.5% (11,331 / 457,540)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd 
align=\"char\"\u003e\n \u003cp\u003e32.6% (55,609 / 170,728)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n \u003c/table\u003e\n \u003c/div\u003e\n \u003cp\u003eTable \u003cspan class=\"InternalRef\"\u003e4\u003c/span\u003e presents the distribution (Detected assertion entities / Total entities in report section) of assertion classifications for medical entities detected by the four models (AWS, AZ, GC, and SP) for the Findings and Impressions sections of chest X-ray (CXR) reports. In the Findings section, most entities were either \u003cem\u003epositive\u003c/em\u003e or \u003cem\u003enegative\u003c/em\u003e across all models, with \u003cem\u003euncertain\u003c/em\u003e classifications ranging from 0.5% (AWS) to 2.5% (SP). SP classified 70.5% of Findings as \u003cem\u003epositive\u003c/em\u003e, whereas GC classified only 36.8%. For the Impression section, AWS classified 79.5% of entities as \u003cem\u003epositive\u003c/em\u003e, while SP classified only 58.2%. \u003cem\u003eNegative\u003c/em\u003e assertions were largest for GC (21.8%) and smallest for AZ (4.6%). \u003cem\u003eUncertain\u003c/em\u003e classifications were minimal for AWS (1.4%) but reached 32.6% for SP. The Chi-square test of independence showed that the distribution of entity assertions differed significantly between NLP systems in both the Findings and Impression sections (p\u0026thinsp;\u0026lt;\u0026thinsp;0.001). 
These variations highlight differences in how the systems assign assertion status, which may impact clinical decision support applications.\u003c/p\u003e\n\u003c/div\u003e\n\u003cdiv id=\"Sec13\" class=\"Section2\"\u003e\n \u003ch2\u003ec) Comparison with open-source CXR report labelers on CheXpert labels:\u003c/h2\u003e\n \u003cp\u003eFigure 5 shows the Fleiss\u0026rsquo; Kappa values for each of the 12 disease categories and the \u003cem\u003eNo Findings\u003c/em\u003e category, based on the assertion statuses (positive, negative, uncertain, and absent) assigned by all six NLP models. The highest agreement was observed for \u003cem\u003epleural effusion\u003c/em\u003e (Kappa: 0.89 in \u003cem\u003eAll\u003c/em\u003e; 0.59 in \u003cem\u003eExcluding Absent\u003c/em\u003e), while the lowest was for \u003cem\u003eenlarged cardiomediastinum\u003c/em\u003e (Kappa: 0.25 in \u003cem\u003eAll\u003c/em\u003e; 0.05 in \u003cem\u003eExcluding Absent\u003c/em\u003e). The mean Fleiss\u0026rsquo; Kappa across all categories was 0.68\u0026thinsp;\u0026plusmn;\u0026thinsp;0.17 (\u003cem\u003eAll\u003c/em\u003e) and 0.35\u0026thinsp;\u0026plusmn;\u0026thinsp;0.14 (\u003cem\u003eExcluding Absent\u003c/em\u003e).\u003c/p\u003e\n \u003cp\u003eThe individual assertion performance of the six NLP models against the consensus pseudo-ground truth, calculated using majority voting for the CheXpert labels, is shown in Table \u003cspan class=\"InternalRef\"\u003e5\u003c/span\u003e. The results are based only on the Impression section of the CXR reports. The lowest accuracy against the consensus was observed for the\u0026nbsp;\u003cem\u003eConsolidation\u003c/em\u003e category (14\u0026thinsp;\u0026plusmn;\u0026thinsp;3) %, where AWS had the highest accuracy of 20% and GC had the lowest accuracy of 10%. Excluding \u003cem\u003eNo Findings\u003c/em\u003e, the highest mean accuracy was observed for \u003cem\u003ePleural Effusion\u003c/em\u003e (72\u0026thinsp;\u0026plusmn;\u0026thinsp;6) %, with GC having the highest value of 82% and AWS having the lowest value of 62%. 
When considering the aggregate across all disease categories, SP out-performed other NLP systems with an accuracy of (76\u0026thinsp;\u0026plusmn;\u0026thinsp;4) %, while AWS trailed with an accuracy of 50%. The mean overall accuracy was (62\u0026thinsp;\u0026plusmn;\u0026thinsp;9) %. Both CheXbert and CheXpert had an overall accuracy of (56\u0026thinsp;\u0026plusmn;\u0026thinsp;8) %.\u003c/p\u003e\n \u003cdiv class=\"gridtable\"\u003e\u0026nbsp;\u003ctable id=\"Tab5\" border=\"1\"\u003e\n \u003ccaption language=\"En\"\u003e\n \u003cdiv class=\"CaptionNumber\"\u003eTable 5\u003c/div\u003e\n \u003cdiv class=\"CaptionContent\"\u003e\n \u003cp\u003eAssertion accuracy (%) of the six NLP systems on the impression section of the study dataset with 95,008 chest X-ray (CXR) reports, evaluated against the consensus pseudo\u0026ndash;ground truth on the CheXpert labels. Percentages are rounded to the nearest whole number.\u003c/p\u003e\n \u003c/div\u003e\n \u003c/caption\u003e\n \u003ccolgroup cols=\"8\"\u003e\u003c/colgroup\u003e\n \u003cthead\u003e\n \u003ctr\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eDisease Category\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eAWS\u003c/p\u003e\n \u003cp\u003e(%)\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eAZ\u003c/p\u003e\n \u003cp\u003e(%)\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eGC (%)\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eSP\u003c/p\u003e\n \u003cp\u003e(%)\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eCheXbert (%)\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eCheXpert (%)\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eMean (%)\u003c/p\u003e\n \u003c/th\u003e\n \u003c/tr\u003e\n \u003c/thead\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n 
\u003cp\u003e\u003cstrong\u003eAtelectasis\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e35\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e71\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e86\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e57\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e65\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e65\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e63\u0026thinsp;\u0026plusmn;\u0026thinsp;15\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u003cstrong\u003eCardiomegaly\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e60\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e56\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e80\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e63\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e59\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e56\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e62\u0026thinsp;\u0026plusmn;\u0026thinsp;8\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u003cstrong\u003eConsolidation\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e20\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e12\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e10\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e13\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd 
align=\"left\"\u003e\n \u003cp\u003e12\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e17\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e14\u0026thinsp;\u0026plusmn;\u0026thinsp;3\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u003cstrong\u003eEdema\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e44\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e69\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e76\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e82\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e62\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e61\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e66\u0026thinsp;\u0026plusmn;\u0026thinsp;12\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u003cstrong\u003eEnlarged Cardiomediastinum\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e32\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e30\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e51\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e26\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e19\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e13\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e28\u0026thinsp;\u0026plusmn;\u0026thinsp;12\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u003cstrong\u003eFracture\u003c/strong\u003e\u003c/p\u003e\n 
\u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e20\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e36\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e33\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e47\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e29\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e28\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e32\u0026thinsp;\u0026plusmn;\u0026thinsp;8\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u003cstrong\u003eLung Lesion\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e47\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e65\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e61\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e70\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e55\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e55\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e59\u0026thinsp;\u0026plusmn;\u0026thinsp;7\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u003cstrong\u003eLung Opacity\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e54\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e58\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e54\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e65\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e46\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd 
align=\"left\"\u003e\n \u003cp\u003e46\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e54\u0026thinsp;\u0026plusmn;\u0026thinsp;7\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u003cstrong\u003ePleural Effusion\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e62\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e71\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e82\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e76\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e68\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e70\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e72\u0026thinsp;\u0026plusmn;\u0026thinsp;6\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u003cstrong\u003ePleural Other\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e47\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e57\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e47\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e78\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e66\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e66\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e60\u0026thinsp;\u0026plusmn;\u0026thinsp;11\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u003cstrong\u003ePneumonia\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e17\u003c/p\u003e\n \u003c/td\u003e\n 
\u003ctd align=\"left\"\u003e\n \u003cp\u003e38\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e39\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e76\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e34\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e35\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e40\u0026thinsp;\u0026plusmn;\u0026thinsp;18\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u003cstrong\u003ePneumothorax\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e40\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e52\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e56\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e43\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e42\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e41\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e46\u0026thinsp;\u0026plusmn;\u0026thinsp;6\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u003cstrong\u003eNo Findings\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e72\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e98\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e70\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e89\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e67\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e67\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n 
\u003cp\u003e77\u0026thinsp;\u0026plusmn;\u0026thinsp;12\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u003cstrong\u003eOverall Mean\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e50\u0026thinsp;\u0026plusmn;\u0026thinsp;0\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e69\u0026thinsp;\u0026plusmn;\u0026thinsp;5\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e63\u0026thinsp;\u0026plusmn;\u0026thinsp;0\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e76\u0026thinsp;\u0026plusmn;\u0026thinsp;4\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e56\u0026thinsp;\u0026plusmn;\u0026thinsp;8\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e56\u0026thinsp;\u0026plusmn;\u0026thinsp;8\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e62\u0026thinsp;\u0026plusmn;\u0026thinsp;9\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n \u003c/table\u003e\n \u003c/div\u003e\n\u003c/div\u003e\n\u003cdiv id=\"Sec14\" class=\"Section2\"\u003e\n \u003ch2\u003ed) Sample reports with discrepant assertion statuses extracted by the NLP systems for CheXpert labels:\u003c/h2\u003e\n \u003cdiv class=\"gridtable\"\u003e\u0026nbsp;\u003ctable id=\"Tab6\" border=\"1\"\u003e\n \u003ccaption language=\"En\"\u003e\n \u003cdiv class=\"CaptionNumber\"\u003eTable 6\u003c/div\u003e\n \u003cdiv class=\"CaptionContent\"\u003e\n \u003cp\u003eVariability in Assertion Among Commercial NLP Systems - Divergent NLP Interpretations of Radiology Reports Impressions.\u003c/p\u003e\n \u003c/div\u003e\n \u003c/caption\u003e\n \u003ccolgroup cols=\"6\"\u003e\u003c/colgroup\u003e\n \u003cthead\u003e\n \u003ctr\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eSample\u003c/p\u003e\n \u003c/th\u003e\n 
\u003cth align=\"left\"\u003e\n \u003cp\u003eDisease Label\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eReport Impression\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003ePositive\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eUncertain\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eNegative\u003c/p\u003e\n \u003c/th\u003e\n \u003c/tr\u003e\n \u003c/thead\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e1\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003epneumothorax\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e1. Pectus bars in place. 2. No appreciable pneumothorax on the left.\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eSP\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eCheXpert, CheXbert, AWS, GC\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eAZ\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e2\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003epneumonia\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eFindings consistent with viral or reactive airways disease without focal pneumonia.\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eAZ\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eSP\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eCheXpert, CheXbert, AWS, GC\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e3\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003ecardiomegaly\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd 
align=\"left\"\u003e\n \u003cp\u003eNo acute cardiopulmonary abnormality with stable cardiomegaly and fracture of one of the pacemakers leads.\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eCheXpert, CheXbert, AZ, GC\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eAWS\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eSP\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e4\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003econsolidation\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eRight suprahilar opacity may represent developing consolidation, superimposed on findings of viral or reactive airways disease.\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eAWS, AZ, SP\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eCheXpert, CheXbert, GC\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\u0026nbsp;\u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e5\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eatelectasis\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eViral/reactive airway disease, superimposed with right upper and left lower lobe airspace disease such as atelectasis/pneumonia.\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eAWS, AZ, GC\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eCheXpert, CheXbert\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eSP\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e6\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eedema\u003c/p\u003e\n 
\u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eMildly prominent pulmonary vasculature. Cardiac size appears mildly enlarged. No focal airspace disease or overt pulmonary edema is suspected.\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eGC\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eAZ\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eCheXpert, CheXbert, AWS, SP\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e7\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003elung lesion\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eNo consolidation. Small oval lucency in the left upper lobe. It is not clear whether this represents superimposed shadows (mock effect) or a true finding. A short-term follow-up two-view chest x-ray is suggested to evaluate the persistence of this finding. If it does persist, it may represent a tiny bleb, bulla, or pneumatocele. 
The surrounding lung appears normal making a cavitary lesion less likely.\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eCheXpert, AWS, GC\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eCheXbert, AZ, SP\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\u0026nbsp;\u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e8\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003elung opacity\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003ePoorly defined left lower lobe opacity concerning for developing pneumonia.\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eCheXpert, CheXbert, AWS\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eAZ, GC, SP\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\u0026nbsp;\u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n \u003c/table\u003e\n \u003c/div\u003e\n \u003cp\u003eTable \u003cspan class=\"InternalRef\"\u003e6\u003c/span\u003e lists sample impressions for which the six NLP systems assigned different assertion statuses. It illustrates how the systems interpret assertions based on linguistic features in the Impression section. In case 1 for \u003cem\u003epneumothorax\u003c/em\u003e, the impression \u0026quot;No appreciable pneumothorax on the left\u0026quot; led to mixed interpretations: SP classified it as positive, four systems (CheXpert, CheXbert, AWS, and GC) marked it uncertain, and only AZ marked it negative. For \u003cem\u003epneumonia\u003c/em\u003e in case 2, the phrase \u0026quot;without focal pneumonia\u0026quot; resulted in most systems classifying it as negative, with only AZ marking it positive and SP classifying it as uncertain. 
In the \u003cem\u003ecardiomegaly\u003c/em\u003e example (case 3), the phrase \u0026quot;No acute cardiopulmonary abnormality with stable cardiomegaly\u0026quot; led four systems to classify it as positive, while AWS remained uncertain and SP classified it as negative. In case 4, the hedging phrase \u0026quot;may represent developing consolidation\u0026quot; resulted in three systems classifying \u003cem\u003econsolidation\u003c/em\u003e as positive and three as uncertain. Similarly, in case 5 for \u003cem\u003eatelectasis\u003c/em\u003e, the complex description of \u0026quot;airspace disease such as atelectasis/pneumonia\u0026quot; divided the systems, with three positive, two uncertain, and one negative. Notably, in case 8, the phrase \u0026quot;opacity concerning for developing pneumonia\u0026quot; led to a split between positive and uncertain classifications of \u003cem\u003elung opacity\u003c/em\u003e. These findings demonstrate how radiological language, especially uncertainty markers, qualifiers, and alternative explanations, consistently challenges these NLP systems, with each interpreting probabilistic medical language and hedging expressions differently.\u003c/p\u003e\n\u003c/div\u003e"},{"header":"Discussion","content":"\u003cp\u003eThis study is one of the first to quantify and compare the performance of commercial clinical NLP systems on a large independent dataset, and the first to do so on a study sample composed of pediatric chest radiograph reports. Commercial clinical NLP systems from three major cloud providers, namely Amazon, Google, and Microsoft (Azure), along with a radiology-specific model from a well-known vendor, John Snow Labs, were analyzed on the tasks of named entity recognition and assertion detection. A standardization algorithm was developed to map the entities extracted by these systems to disease labels defined by the CheXpert framework. 
The CheXpert and CheXbert models provided a set of disease categories that served as a reference for further comparison. Model outputs were evaluated against a consensus pseudo-ground truth derived from the outputs of all six NLP systems by majority voting.\u003c/p\u003e \u003cdiv id=\"Sec16\" class=\"Section2\"\u003e \u003ch2\u003ea) Variability in entity extraction\u003c/h2\u003e \u003cp\u003eComparison of the disease-related entities extracted by the four commercial NLP systems revealed considerable variability in the mean number of entities per report and the number of unique entities detected. AZ recorded the highest mean number of entities per report, while SP produced the greatest number of unique entities. In addition, agreement between the systems varied across the five most frequent disease or diagnosis entities. Viral or reactive airway disease had near-perfect agreement (\u0026lt;\u0026thinsp;1% difference in counts) between the four systems, while pneumothorax and atelectasis had large differences (\u0026gt;\u0026thinsp;40% difference in counts). These results illustrate the inconsistencies in named entity recognition and underline the importance of applying standardization techniques, such as the use of regular expressions, to achieve meaningful comparisons.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec17\" class=\"Section2\"\u003e \u003ch2\u003eb) Variability in assertion detection\u003c/h2\u003e \u003cp\u003eLarge variability was also observed in assertion detection across the four commercial NLP systems. First, each NLP system reported assertion status in its own way, and the definitions were not consistent. Second, once the assertion statuses were grouped into positive, negative, and uncertain categories, significant differences in their distributions were observed between the systems for both the Findings and Impression sections. 
In particular, for the Findings section, the NLP systems reported 0.5% (AWS) to 2.5% (SP) of the entities as \u003cem\u003euncertain\u003c/em\u003e, whereas for the Impression section, the proportion of uncertain entities ranged from 1.4% (AWS) to 32.6% (SP).\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec18\" class=\"Section2\"\u003e \u003ch2\u003ec) Performance on CheXpert labels\u003c/h2\u003e \u003cp\u003eWhen evaluating disease labels based on the CheXpert framework, inter-model agreement was substantial when \u003cem\u003eabsent\u003c/em\u003e predictions were included (mean Kappa\u0026thinsp;=\u0026thinsp;0.68) but dropped to fair agreement when those cases were excluded (mean Kappa\u0026thinsp;=\u0026thinsp;0.35), highlighting variability in assertion classification when disease presence was detected by at least one model. The overall mean accuracy across all six models was 62% with a standard deviation of 9%. Mean assertion accuracy for individual diseases ranged from 14% for \u003cem\u003econsolidation\u003c/em\u003e to 72% for \u003cem\u003epleural effusion\u003c/em\u003e; the \u003cem\u003eNo Findings\u003c/em\u003e category reached 77%. Poor performance for consolidation may stem from the wide variety of expressions used in reports. As a descriptive imaging finding, it is mentioned in more variable and nuanced ways and can appear in multiple conditions such as pneumonia or pulmonary edema. Higher accuracy for pleural effusion possibly results from its more explicit mentions.\u003c/p\u003e \u003cp\u003eAggregated across the disease categories, SP outperformed the other systems with an accuracy of 76%. CheXpert and CheXbert performed similarly for most categories, with notable differences only for \u003cem\u003econsolidation\u003c/em\u003e and \u003cem\u003eenlarged cardiomediastinum\u003c/em\u003e. 
Assertion accuracy varied by category, with SP performing best in six categories, GC in five, and AWS and AZ each in one category.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec19\" class=\"Section2\"\u003e \u003ch2\u003ed) Limitations and future directions:\u003c/h2\u003e \u003cp\u003eInvestigation of cases with discrepant assertion statuses revealed that linguistic nuances influence each NLP model differently. No single system demonstrated uniform superiority. Instead, performance varied by disease category and the contextual phrasing within report sections. These findings highlight the potential benefits of model ensembling, where the complementary strengths of different models can be leveraged to improve overall robustness and accuracy. This is reinforced by the variability observed across cases in Table\u0026nbsp;\u003cspan refid=\"Tab6\" class=\"InternalRef\"\u003e6\u003c/span\u003e. Each system shows strengths and limitations depending on the linguistic structure, diagnostic ambiguity, and type of condition described. For instance, AZ accurately identified the absence of \u003cem\u003epneumothorax\u003c/em\u003e in sample 1 but failed to detect the absence of \u003cem\u003epneumonia\u003c/em\u003e in sample 2. AWS correctly recognized the absence of \u003cem\u003epneumonia\u003c/em\u003e in sample 2 but was uncertain about \u003cem\u003ecardiomegaly\u003c/em\u003e in sample 3 due to ambiguous language, although clinically this indicates a positive finding. SP successfully detected the absence of edema in sample 6 but missed the absence of \u003cem\u003epneumothorax\u003c/em\u003e in sample 1. Meanwhile, CheXpert, CheXbert, and GC all correctly flagged uncertain consolidation in sample 4 but failed to identify the absence of \u003cem\u003epneumothorax\u003c/em\u003e in sample 1. 
This inconsistency highlights that NLP performance is highly context-dependent, shaped by how each system interprets medical uncertainty, negation, and diagnostic probability.\u003c/p\u003e \u003cp\u003eThis observation underscores the necessity for thorough validation of clinical NLP systems for specific use cases and institutions, as well as careful evaluation of non-standard definitions for uncertainty before these systems are deployed. One limitation of this study is that AWS did not provide a categorical assertion status and offered only a separate negation attribute. Confidence values were used to assign uncertainty in these cases. The latest version of Amazon\u0026rsquo;s Comprehend Medical NLP system includes an attribute for low confidence that may provide a better estimation of uncertainty. In addition, the size of the dataset precluded manual annotation of CheXpert labels. Therefore, a pseudo-ground truth derived from all six NLP systems was used. Cases without a clear majority were assigned to the uncertain category. Although the pseudo-ground truth is useful for comparing systems against each other, performance metrics may differ when compared to labels assigned by radiologists.\u003c/p\u003e \u003cp\u003eFuture research should incorporate manual annotations and investigate the effect of this variability on downstream clinical and research applications to ensure that inherent errors do not propagate and adversely affect patient care. Additionally, this study is limited to pediatric chest X-ray reports and does not include adult cases or a broader range of imaging modalities. While this limits the direct generalizability of the findings, the insights gained are likely applicable across other modalities and populations.\u003c/p\u003e \u003c/div\u003e"},{"header":"Conclusion","content":"\u003cp\u003eSignificant variability exists in named entity recognition and assertion assessment on CXR radiology reports across different NLP systems. 
These differences arise from variations in the clinical concept definitions inherent to each system and the absence of a uniform approach to quantifying uncertainty expressions. In addition, the diversity in how CXR reports describe radiological findings and impressions results in systems performing optimally under different conditions. Consequently, the use of automated NLP systems for labeling imaging exams and for downstream applications such as outcome prediction must be accompanied by careful task-specific evaluation and oversight.\u003c/p\u003e"},{"header":"Declarations","content":"\u003ch2\u003eAuthor Contribution\u003c/h2\u003e\u003cp\u003eShruti Hegde and Elanchezhian Somasundaram wrote the main manuscript text. Shruti Hegde, Mabon Ninan and Elanchezhian Somasundaram worked on the code. All authors reviewed the manuscript.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\u003cli\u003e\u003cspan\u003eSoumya Upadhyay, H.-f.H., \u003cem\u003eA Qualitative Analysis of the Impact of Electronic Health Records (EHR) on Healthcare Quality and Safety: Clinicians\u0026rsquo; Lived Experiences.\u003c/em\u003e 22, Mar 3.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eCampanella, P., et al., \u003cem\u003eThe impact of electronic health records on healthcare quality: a systematic review and meta-analysis\u003c/em\u003e. The European Journal of Public Health, 2016. 26(1): p. 60\u0026ndash;64.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWang, Y., et al., \u003cem\u003eClinical information extraction applications: a literature review\u003c/em\u003e. Journal of biomedical informatics, 2018. 77: p. 

License: CC-BY-4.0