Benchmarking large language models for cardiovascular risk stratification using clinical vignettes

José Ferreira Santos, Regina Brito Duarte, Inês Mota, Rita Carvalheira Santos, and 7 more

This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-8307079/v1
This work is licensed under a CC BY 4.0 License.
Status: Posted (Version 1)

Abstract

Large language models (LLMs) show promise for cardiovascular risk stratification, though their performance compared with clinical guidelines requires validation. We benchmarked eleven contemporary LLMs using 30 bilingual (Portuguese/English) outpatient vignettes, comparing their classifications against expert-adjudicated European Society of Cardiology guidelines using SCORE2.
Models achieved near-perfect extraction of traditional risk factors (micro-F1 0.97–0.99) but only moderate agreement for three-class ESC risk categories (best weighted kappa 0.69, 95% CI 0.44–0.84). Ten of the eleven models showed systematic underestimation of risk. LLMs struggled with SCORE2 numeric computation, with mean absolute error exceeding 5 percentage points in all but one model. Most models correctly identified guideline exceptions requiring alternative assessment, beyond SCORE2, in more than 95% of cases. No significant performance differences between languages were found. While LLMs excel at structured data extraction and eligibility screening, their inconsistent risk stratification and poor numeric accuracy preclude autonomous clinical use, warranting further refinement.

Subject areas: Health sciences/Cardiology; Health sciences/Diseases; Health sciences/Medical research
Keywords: Clinical decision support; Diagnostic accuracy; Artificial intelligence; Large language models; Multilingual evaluation; Cardiovascular prevention; Risk stratification; SCORE2

1. Introduction

Cardiovascular disease (CVD) remains the leading cause of death globally [1]. Because most atherosclerotic events are preventable through targeted management of modifiable risk factors, accurate and systematic risk assessment is a cornerstone of prevention [2, 3]. The 2021 European Society of Cardiology (ESC) Guidelines recommend Systematic Coronary Risk Estimation 2 (SCORE2) and SCORE2-OP to estimate 10-year risk of first-onset cardiovascular events and to guide treatment thresholds and shared decision-making [3]. Despite the availability of validated tools, their implementation in routine care is inconsistent. Surveys indicate that fewer than half of clinicians regularly use formal calculators, with many relying on unaided clinical judgement, an approach linked to systematic underestimation of risk, particularly among higher-risk patients [4–6].
Even when calculators are applied, treatment gaps persist and a substantial proportion of high-risk individuals do not receive guideline-recommended treatments, contributing to avoidable morbidity and mortality [4, 7].

A key contributor to this implementation gap is the structure and usability of electronic health records (EHR) [8–10]. Variables required for risk estimation are often buried in free-text notes or dispersed across poorly integrated sections, creating burden and cognitive load at the point of care. Conventional natural language processing has shown that relevant information can be recovered from unstructured text, pointing to a practical route for workflow simplification [11]. Large language models (LLMs) extend these capabilities with instruction-following and generative functions that support record summarization, clinical-note interpretation, and interactive assistance [12–14]. However, evidence for safe, accurate clinical decision support remains mixed: LLMs encode broad clinical knowledge and perform strongly on benchmarks, yet show variable and inconsistent outcomes on diagnostic reasoning [15–17]. Within cardiovascular prevention specifically, applications are promising but remain insufficiently validated [18].

The convergence of persistent gaps in CVD prevention, namely suboptimal risk stratification and limited EHR utility, with rapid advances in LLMs creates a timely opportunity to evaluate the role of these models as an aid to guideline-based risk assessment. Accordingly, we benchmarked contemporary LLMs using simulated outpatient vignettes to assess (i) extraction of cardiovascular risk factors, including SCORE2 input variables, from routine-style clinical text and (ii) generation of ESC-aligned risk categories, each compared against expert-adjudicated reference standards.
Our main objective was to delineate foundational performance and limitations as a prerequisite for integrating LLMs into clinical decision support systems aimed at optimizing CVD prevention.

2. Results

Clinical Vignettes

Thirty simulated outpatient clinical vignettes were evaluated (Supplementary Appendix S1), generating 60 assessments per model (30 Portuguese and 30 English). The vignettes represented a middle-aged population (mean age 54.6 ± 8.8 years, 50% male) with prevalent cardiovascular risk factors including hypertension (60%), dyslipidemia (70%), and active smoking (37%) (Table 1). More than half (53%) included additional risk modifiers: one (23%), two (20%), or three or more (10%). Twenty vignettes (66.7%) met standard SCORE2 eligibility criteria, while 10 (33.3%) required ESC guideline exceptions for risk stratification: atherosclerotic cardiovascular disease (ASCVD; n = 4), diabetes mellitus (n = 3), chronic kidney disease (CKD; n = 2), and familial hypercholesterolemia (FH; n = 1). Following expert adjudication, the gold-standard risk distribution comprised 5 (16.7%) low-to-moderate risk, 13 (43.3%) high risk, and 12 (40.0%) very-high risk vignettes, representing a higher-risk profile than initially expected (50%, 30%, and 20%, respectively) (Supplementary Figures S1 and S2).
Table 1. Baseline clinical profile of simulated outpatient vignettes

| Characteristic | Value |
|---|---|
| Demographics | |
| Age, years* | 54.6 ± 8.8 [36–67] |
| Male sex | 15 (50.0%) |
| Cardiovascular Risk Profile | |
| Blood Pressure, mmHg | |
| Systolic blood pressure | 135.3 ± 17.5 [107–189] |
| Diastolic blood pressure | 79.9 ± 13.3 [58–107] |
| Lipids, mg/dL | |
| Total cholesterol | 195.7 ± 45.3 [129–336] |
| HDL cholesterol | 47.9 ± 9.1 [32–67] |
| Non-HDL cholesterol | 147.8 ± 46.2 [79–219] |
| LDL cholesterol | 119.1 ± 46.5 [49–258] |
| Triglycerides | 137.8 ± 50.5 [62–256] |
| Risk factors | |
| Hypertension | 18 (60.0%) |
| Dyslipidemia # | 21 (70.0%) |
| Smoking status #: Current | 11 (36.7%) |
| Smoking status #: Former | 7 (23.3%) |
| Smoking status #: Never | 10 (33.3%) |
| Predefined risk modifiers (any occurrence) | |
| Cancer | 2 (6.7%) |
| Chronic inflammatory disease | 1 (3.3%) |
| Chronic obstructive pulmonary disease | 1 (3.3%) |
| Coronary calcium score = 0 | 1 (3.3%) |
| Elevated coronary calcium score | 4 (13.3%) |
| Family history of premature ASCVD | 1 (3.3%) |
| hs-CRP | 1 (3.3%) |
| Increased arterial stiffness | 1 (3.3%) |
| Lp(a) elevated | 1 (3.3%) |
| Obesity | 2 (6.7%) |
| Obstructive sleep apnoea | 3 (10.0%) |
| Pre-diabetes | 11 (36.7%) |
| Number of risk modifiers per vignette | |
| None | 14 (46.7%) |
| One | 7 (23.3%) |
| Two | 6 (20.0%) |
| Three or more | 3 (10.0%) |
| SCORE2 applicability | |
| Eligible (no exception) | 20 (66.7%) |
| SCORE2 exceptions for risk stratification | 10 (33.3%) |
| Prior ASCVD | 4 (13.3%) |
| Diabetes | 3 (10.0%) |
| Chronic kidney disease | 2 (6.7%) |
| Familial hypercholesterolemia | 1 (3.3%) |
| ESC risk category | |
| Low-to-Moderate | 5 (16.7%) |
| High | 13 (43.3%) |
| Very High | 12 (40.0%) |

Legend. Data presentation: Values are reported as mean ± standard deviation [range] for continuous variables and n (%) for categorical variables. Percentages are calculated based on N = 30 vignettes unless otherwise noted. Definitions: "SCORE2 applicability" distinguishes vignettes eligible for standard calculation from those requiring guideline-based exceptions. *Age includes one 36-year-old patient with familial hypercholesterolemia; all other vignettes represent patients aged ≥ 40 years.
The # symbol denotes variables with missing data in the source vignettes (unknown dyslipidemia: n = 2; unknown smoking status: n = 2). Abbreviations: ASCVD, atherosclerotic cardiovascular disease; HDL, high-density lipoprotein; LDL, low-density lipoprotein; Lp(a), lipoprotein(a); hs-CRP, high-sensitivity C-reactive protein.

Cardiovascular Risk Factor Extraction Performance

All eleven evaluated models showed excellent performance in extracting traditional cardiovascular risk factors, with micro-F1 scores ranging from 0.97 to 0.99 across the 60 vignettes evaluated per model (Table 2 and Supplementary Table S1). Claude Opus 4.1, Claude Sonnet 4.5, Gemini 2.5 Pro, and GPT-5 Nano achieved the highest overall micro-F1 scores of 0.99 (95% CI: 0.98–0.99). Micro-precision ranged from 0.97 to 0.99, and micro-recall from 0.95 to 0.99 across models, indicating both high precision and high sensitivity in identifying cardiovascular risk factors (Supplementary Figure S3). The mean Jaccard similarity coefficient was at least 0.93 in every model, reflecting substantial overlap between model-extracted and reference factor sets. There was no substantial difference between micro and macro metrics for cardiovascular risk factor extraction (Supplementary Figure S4).
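The extraction metrics reported in this section can be illustrated with a minimal sketch. It assumes each vignette's reference and model output are represented as sets of factor labels; that representation, the function names, and the example labels are illustrative assumptions and not the authors' evaluation code.

```python
from typing import List, Set, Tuple


def micro_prf(gold: List[Set[str]], pred: List[Set[str]]) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall and F1: counts are pooled over all vignettes."""
    tp = sum(len(g & p) for g, p in zip(gold, pred))  # factors correctly extracted
    fp = sum(len(p - g) for g, p in zip(gold, pred))  # spurious factors
    fn = sum(len(g - p) for g, p in zip(gold, pred))  # missed factors
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def mean_jaccard(gold: List[Set[str]], pred: List[Set[str]]) -> float:
    """Mean per-vignette Jaccard similarity |G ∩ P| / |G ∪ P|."""
    scores = []
    for g, p in zip(gold, pred):
        union = g | p
        # Two empty sets are treated as a perfect match (an assumption).
        scores.append(len(g & p) / len(union) if union else 1.0)
    return sum(scores) / len(scores)


# Hypothetical labels for two vignettes.
gold = [{"hypertension", "smoking"}, {"dyslipidemia"}]
pred = [{"hypertension"}, {"dyslipidemia", "smoking"}]
precision, recall, f1 = micro_prf(gold, pred)
jaccard = mean_jaccard(gold, pred)
```

Micro-averaging pools counts before computing the ratios, so frequent factors dominate the score; a macro average would instead compute F1 per factor and then average.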
Table 2. Extraction performance by model: cardiovascular risk factors, SCORE2 core risk factors, and risk modifiers

| Model | CV Risk Factors Micro-F1 | CV Risk Factors Micro-P | CV Risk Factors Micro-R | CV Risk Factors Macro-F1 | CV Risk Factors Jaccard | SCORE2 Core Micro-F1 | Risk Modifiers Micro-F1 |
|---|---|---|---|---|---|---|---|
| GPT-4o | 0.98 (0.98–0.99) | 0.98 (0.96–0.99) | 0.99 (0.98–0.99) | 0.98 (0.95–1.00) | 0.97 (0.95–0.98) | 0.99 (0.98–1.00) | 0.74 (0.61–0.84) |
| GPT-5 | 0.98 (0.97–0.99) | 0.99 (0.98–1.00) | 0.97 (0.96–0.98) | 0.98 (0.95–0.99) | 0.97 (0.95–0.98) | 0.98 (0.97–0.99) | 0.77 (0.66–0.87) |
| GPT-4.1 | 0.98 (0.97–0.99) | 0.98 (0.97–0.99) | 0.97 (0.96–0.98) | 0.97 (0.94–0.99) | 0.96 (0.94–0.97) | 0.99 (0.98–0.99) | 0.80 (0.68–0.90) |
| Claude Sonnet 4.5 | 0.99 (0.98–0.99) | 0.98 (0.97–0.99) | 0.99 (0.98–1.00) | 0.98 (0.96–1.00) | 0.97 (0.96–0.98) | 0.99 (0.99–1.00) | 0.58 (0.45–0.68) |
| Gemini 2.5 Pro | 0.99 (0.98–0.99) | 0.98 (0.97–0.99) | 0.99 (0.98–1.00) | 0.99 (0.96–1.00) | 0.98 (0.96–0.99) | 0.99 (0.98–0.99) | 0.81 (0.70–0.90) |
| Claude Opus 4.1 | 0.99 (0.98–0.99) | 0.99 (0.97–0.99) | 0.99 (0.98–0.99) | 0.99 (0.96–1.00) | 0.98 (0.96–0.99) | 0.99 (0.98–0.99) | 0.64 (0.47–0.77) |
| DeepSeek V3 | 0.98 (0.97–0.98) | 0.97 (0.95–0.98) | 0.99 (0.98–0.99) | 0.97 (0.94–1.00) | 0.96 (0.94–0.97) | 0.99 (0.98–0.99) | 0.67 (0.52–0.78) |
| Gemini 2.0 Flash | 0.97 (0.95–0.97) | 0.98 (0.96–0.99) | 0.95 (0.93–0.97) | 0.96 (0.93–0.99) | 0.93 (0.92–0.95) | 0.99 (0.98–0.99) | 0.67 (0.54–0.77) |
| Grok-3 | 0.98 (0.97–0.99) | 0.98 (0.97–0.99) | 0.98 (0.97–0.99) | 0.98 (0.96–1.00) | 0.97 (0.95–0.98) | 0.98 (0.97–0.99) | 0.60 (0.43–0.73) |
| Llama 3.3 70B Instruct | 0.98 (0.97–0.99) | 0.98 (0.97–0.99) | 0.98 (0.97–0.99) | 0.97 (0.94–1.00) | 0.96 (0.95–0.98) | 0.98 (0.97–0.99) | 0.67 (0.52–0.78) |
| GPT-5 Nano | 0.99 (0.98–0.99) | 0.98 (0.97–0.99) | 0.99 (0.97–0.99) | 0.98 (0.96–1.00) | 0.97 (0.96–0.98) | 0.99 (0.98–1.00) | 0.82 (0.70–0.90) |

Legend. Metrics: Data represent micro-averaged F1-scores, precision, and recall, macro-averaged F1-scores, and Jaccard similarity coefficients, presented with 95% confidence intervals in parentheses. Sample size: Each model evaluation includes N = 60 predictions (pooled 30 Portuguese and 30 English).
Definitions: "Cardiovascular Risk Factors" comprises the 12-item traditional panel (6 SCORE2 core inputs + 6 additional factors); "Risk Modifiers" comprises the 12 predefined guideline modifiers. Models are ordered by quadratic-weighted kappa (κw) from the primary three-class risk analysis. Abbreviations: CI, confidence interval; SCORE2, Systematic Coronary Risk Estimation 2.

For SCORE2 core risk factors specifically, extraction accuracy was uniformly high across all models. Micro-F1 scores ranged from 0.98 to 0.99, with 8 models achieving scores of 0.99 (Table 2 and Supplementary Table S2). Individual factor analysis revealed near-perfect extraction for age, sex, and lipid parameters, while smoking status showed slightly more variability (F1: 0.94, 95% CI: 0.91–0.96). Among the additional traditional factors beyond SCORE2 requirements, dyslipidemia diagnosis proved the most difficult to extract (F1: 0.88, 95% CI: 0.86–0.90), while continuous variables such as triglycerides and diastolic blood pressure were extracted with near-perfect accuracy (Fig. 1 and Supplementary Table S3).

Risk modifier extraction presented greater challenges, with substantially lower and more variable performance than traditional risk factors (Supplementary Table S4). Micro-F1 scores ranged from 0.58 to 0.82, with GPT-5 Nano (0.82), Gemini 2.5 Pro (0.81), and GPT-4.1 (0.80) achieving the highest scores in Table 2. Claude Sonnet 4.5 demonstrated the lowest performance at 0.58 (95% CI: 0.45–0.68). Coronary calcium score, cancer, elevated high-sensitivity C-reactive protein, chronic obstructive pulmonary disease, and obstructive sleep apnoea were most accurately identified across models (F1 > 0.80), while family history of premature ASCVD (F1: 0.32, 95% CI: 0.21–0.42), elevated lipoprotein(a) (F1: 0.43, 95% CI: 0.26–0.60) and obesity (F1: 0.45, 95% CI: 0.36–0.53) proved most difficult to extract consistently.
The substantial difference in precision (range: 0.19–0.98) versus recall (range: 0.50–1.00) for risk modifiers suggested that models favored sensitivity over precision when identifying these less common cardiovascular risk determinants (Fig. 1 and Supplementary Table S5).

ESC Risk Classification

Eight models completed the three-class cardiovascular risk classification task for all 60 vignettes (30 Portuguese and 30 English), while three models failed to return risk classifications for some vignettes. GPT-5 did not calculate a risk category in four vignettes (two in Portuguese and two in English) because smoking status was unknown; the Cardiovascular Risk Adjudication Committee had assumed no smoking history in these cases. Llama 3.3 70B Instruct and GPT-5 Nano did not provide risk classifications for one vignette (English) and five vignettes (4 Portuguese, 1 English), respectively, and gave no reason for these omissions.

LLMs were able to classify patients into ESC cardiovascular risk categories, though performance varied substantially across models (Table 3 and Supplementary Figure S5). Using quadratic-weighted Cohen's kappa (κw) as the primary agreement metric, GPT-4o achieved the highest concordance with gold-standard classifications (κw = 0.69, 95% CI: 0.44–0.84), followed by GPT-5 (κw = 0.68, 95% CI: 0.48–0.83) and GPT-4.1 (κw = 0.65, 95% CI: 0.44–0.80). These top-performing models demonstrated moderate to substantial agreement with expert adjudication. In contrast, models in the lower performance tier showed only fair to moderate agreement, with GPT-5 Nano exhibiting the lowest concordance (κw = 0.40, 95% CI: 0.15–0.63). Overall classification accuracy ranged from 67% (GPT-4o) to 42% (GPT-5 Nano). The relationship between agreement strength and raw accuracy was not linear: some models had higher accuracy than their agreement metric would suggest, indicating different error patterns across models.
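Quadratic-weighted kappa penalizes a two-category miss four times as heavily as a one-category miss, which is why κw and raw accuracy can rank models differently. The following is a minimal pure-Python sketch of the metric, assuming categories are coded 0 (low-to-moderate), 1 (high) and 2 (very high); the coding and function name are assumptions for illustration, not the authors' implementation.

```python
from typing import List


def quadratic_weighted_kappa(gold: List[int], pred: List[int], n_classes: int = 3) -> float:
    """Cohen's kappa with quadratic disagreement weights w_ij = (i - j)^2 / (k - 1)^2."""
    n = len(gold)
    observed = [[0] * n_classes for _ in range(n_classes)]
    for g, p in zip(gold, pred):
        observed[g][p] += 1
    row_totals = [sum(observed[i]) for i in range(n_classes)]
    col_totals = [sum(observed[i][j] for i in range(n_classes)) for j in range(n_classes)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(n_classes):
        for j in range(n_classes):
            weight = (i - j) ** 2 / (n_classes - 1) ** 2
            weighted_observed += weight * observed[i][j]
            # Expected count under independence of the two raters' marginals.
            weighted_expected += weight * row_totals[i] * col_totals[j] / n
    if weighted_expected == 0:
        return float("nan")  # undefined when all ratings fall in a single category
    return 1.0 - weighted_observed / weighted_expected


# Hypothetical example: six vignettes with two adjacent-category errors.
gold = [0, 0, 1, 1, 2, 2]
pred = [0, 1, 1, 1, 2, 1]
kappa = quadratic_weighted_kappa(gold, pred)
accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
```

In this toy example accuracy is 4/6 and κw is also about 0.67, but swapping one adjacent error for a two-category error would leave accuracy unchanged while lowering κw.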
Table 3. Three-class ESC risk classification: model performance summary (ordered by κw)

| Model | N | Unknown (%) | Agreement κw (95% CI) | Accuracy (95% CI) | Minor errors (%) | Major errors (%) | Over (%) | Under (%) |
|---|---|---|---|---|---|---|---|---|
| GPT-4o | 60 | 0 | 0.69 (0.44–0.84) | 0.67 (0.52–0.82) | 33.3 | 0.0 | 8.3 | 25.0 |
| GPT-5 | 56 | 6.7 | 0.68 (0.48–0.83) | 0.58 (0.42–0.75) | 37.5 | 0.0 | 3.6 | 33.9 |
| GPT-4.1 | 60 | 0 | 0.65 (0.44–0.80) | 0.62 (0.45–0.77) | 38.3 | 0.0 | 3.3 | 35.0 |
| Claude Sonnet 4.5 | 60 | 0 | 0.61 (0.25–0.80) | 0.63 (0.45–0.78) | 35.0 | 1.7 | 26.7 | 10.0 |
| Gemini 2.5 Pro | 60 | 0 | 0.58 (0.28–0.77) | 0.57 (0.42–0.73) | 43.3 | 0.0 | 20.0 | 23.3 |
| Claude Opus 4.1 | 60 | 0 | 0.57 (0.28–0.79) | 0.60 (0.43–0.77) | 36.7 | 3.3 | 10.0 | 30.0 |
| DeepSeek V3 | 60 | 0 | 0.56 (0.25–0.73) | 0.52 (0.35–0.67) | 48.3 | 0.0 | 18.3 | 30.0 |
| Gemini 2.0 Flash | 60 | 0 | 0.52 (0.29–0.70) | 0.55 (0.40–0.70) | 41.7 | 3.3 | 5.0 | 40.0 |
| Grok-3 | 60 | 0 | 0.45 (0.16–0.72) | 0.53 (0.37–0.70) | 40.0 | 6.7 | 10.0 | 36.7 |
| Llama 3.3 70B Instruct | 59 | 1.7 | 0.45 (0.18–0.68) | 0.58 (0.42–0.73) | 32.2 | 8.5 | 3.4 | 37.3 |
| GPT-5 Nano | 55 | 8.3 | 0.40 (0.15–0.63) | 0.42 (0.25–0.58) | 41.8 | 12.7 | 7.3 | 47.3 |

Legend. Outcomes: "Agreement (κw)" and "Errors" are calculated using valid ordinal predictions only; "Accuracy" is calculated against the total N, penalizing unclassifiable responses. Definitions: "N" denotes the total evaluated predictions (30 Portuguese + 30 English); "Unknown (%)" represents the proportion of vignettes where the model failed to return a classification or output "unknown"; "Minor Error" indicates misclassification by one adjacent category; "Major Error" indicates misclassification by two categories (Low-to-Moderate ↔ Very High); "Over/Under" refers to the direction of misclassification (in each row, Over plus Under equals Minor plus Major, up to rounding). Models are ordered by quadratic-weighted kappa (κw) from the primary three-class risk analysis. Abbreviations: CI, confidence interval; ESC, European Society of Cardiology; κw, quadratic-weighted kappa.

Analysis of misclassification patterns revealed important safety considerations for clinical deployment in the current setting (Figs. 2 and 3, Supplementary Figure S6).
Major two-category errors, the most clinically significant misclassifications, in which patients were shifted between the low-to-moderate and very high-risk categories, were rare but present in six models, with GPT-5 Nano showing the highest rate at 13%. Five models (GPT-4o, GPT-5, GPT-4.1, Gemini 2.5 Pro, and DeepSeek V3) made no major errors. Ten of the eleven models tended to underestimate risk; only Claude Sonnet 4.5 overestimated risk more frequently than it underestimated it.

Sensitivity for identifying high and very high-risk patients, a critical metric for ensuring that high-risk individuals receive appropriate intensive interventions, varied markedly across models (Supplementary Table S6). Because 83% of the evaluation cohort comprised high and very high-risk patients according to the gold-standard assessment, sensitivity for these patients was of paramount clinical importance. Claude Sonnet 4.5 achieved perfect sensitivity (100%, 95% CI: 92.9–100%), though with reduced specificity (80%, 95% CI: 49.0–94.3%). Seven models achieved perfect specificity (100%, 95% CI: 72.2–100%), with variable sensitivity (ranging from 52% to 92%). The top-performing model in primary risk stratification, GPT-4o, had a sensitivity of 92% and a specificity of 100% for detecting high and very high-risk categories.

SCORE2 Numeric Agreement

Among vignettes where SCORE2 risk calculation was applicable, numeric agreement between model-predicted 10-year cardiovascular risk and the gold standard varied substantially across models (Table 4, Supplementary Table S7 and Figure S7). Only three models (GPT-4o, GPT-4.1 and Grok-3) computed absolute risk in all 40 eligible cases; the remaining models returned a numeric value in only 18 to 38 cases (Table 4), reflecting variable success in generating numeric outputs.
Table 4. Numeric agreement with SCORE2 (applicable cases only)

| Model | N | MAE (95% CI) | Bias (95% CI) | CCC (95% CI) |
|---|---|---|---|---|
| GPT-4o | 40 | 7.93 (2.04 to 16.28) | 4.98 (-1.35 to 13.80) | -0.13 (-0.18 to 0.63) |
| GPT-5 | 33 | 16.26 (2.23 to 33.85) | 13.14 (-1.57 to 31.29) | -0.08 (-0.12 to 0.52) |
| GPT-4.1 | 40 | 9.04 (2.20 to 19.15) | 5.24 (-2.23 to 15.74) | -0.12 (-0.17 to 0.64) |
| Claude Sonnet 4.5 | 18 | 20.14 (0.88 to 44.86) | 18.28 (-1.40 to 43.57) | -0.08 (-0.13 to 0.91) |
| Gemini 2.5 Pro | 27 | 7.78 (1.10 to 17.28) | 5.55 (-1.40 to 15.59) | -0.08 (-0.12 to 0.83) |
| Claude Opus 4.1 | 29 | 8.43 (1.09 to 22.07) | 5.52 (-2.23 to 19.74) | -0.08 (-0.11 to 0.86) |
| DeepSeek V3 | 36 | 7.81 (2.18 to 15.59) | 3.30 (-2.70 to 11.53) | -0.09 (-0.11 to 0.49) |
| Gemini 2.0 Flash | 38 | 13.28 (5.82 to 22.14) | 7.31 (-1.10 to 17.24) | -0.08 (-0.11 to -0.05) |
| Grok-3 | 40 | 6.81 (1.73 to 13.90) | 3.46 (-2.04 to 11.10) | -0.09 (-0.11 to 0.56) |
| Llama 3.3 70B Instruct | 31 | 8.68 (2.61 to 17.06) | 4.36 (-2.49 to 13.95) | -0.14 (-0.26 to 0.53) |
| GPT-5 Nano | 31 | 3.75 (1.92 to 6.78) | -0.47 (-2.65 to 3.13) | -0.05 (-0.15 to 0.63) |

Legend. Sample determination: Analysis is restricted to the subset of vignettes ("N") where both the Gold Standard was applicable and the model successfully generated a numeric risk value. Metrics: Values represent percentage points of 10-year cardiovascular risk; "Bias" is calculated as Mean(Model) − Mean(Gold Standard), where positive values indicate overestimation; "CCC" denotes Lin's Concordance Correlation Coefficient (−1 to 1). Models are ordered by κw from the primary risk analysis. Abbreviations: CI, confidence interval; MAE, mean absolute error; SCORE2, Systematic Coronary Risk Estimation 2.

Mean absolute error (MAE) ranged from 3.8 percentage points for GPT-5 Nano (95% CI: 1.92–6.78; n = 31) to 20.1 for Claude Sonnet 4.5 (95% CI: 0.88–44.86; n = 18). All models except GPT-5 Nano demonstrated MAE exceeding 5 percentage points, a clinically significant threshold that could lead to misclassification across risk categories.
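The three numeric agreement metrics defined in the Table 4 legend can be sketched as follows. This is a minimal illustration that uses the standard population-moment form of Lin's concordance correlation coefficient; the function name and the example values are hypothetical and the code is not the authors' analysis pipeline.

```python
from typing import Dict, List


def numeric_agreement(model: List[float], gold: List[float]) -> Dict[str, float]:
    """MAE, bias (mean model minus mean gold) and Lin's CCC for paired risk estimates."""
    n = len(model)
    mean_model = sum(model) / n
    mean_gold = sum(gold) / n
    mae = sum(abs(m - g) for m, g in zip(model, gold)) / n
    bias = mean_model - mean_gold  # positive values indicate overestimation

    # Population (1/n) variances and covariance, as in Lin's original definition.
    var_model = sum((m - mean_model) ** 2 for m in model) / n
    var_gold = sum((g - mean_gold) ** 2 for g in gold) / n
    covariance = sum((m - mean_model) * (g - mean_gold) for m, g in zip(model, gold)) / n

    denominator = var_model + var_gold + bias ** 2
    ccc = 2 * covariance / denominator if denominator else float("nan")
    return {"mae": mae, "bias": bias, "ccc": ccc}


# Hypothetical 10-year risk estimates in percentage points.
model_risk = [4.0, 8.0, 12.0]
gold_risk = [2.0, 6.0, 10.0]
result = numeric_agreement(model_risk, gold_risk)
```

Unlike Pearson correlation, the CCC is reduced by a constant offset: here the model tracks the reference perfectly but sits 2 points higher, so the CCC falls below 1.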
In contrast to the categorical risk classification results, Bland-Altman analyses revealed systematic overestimation in 10 of 11 models, with wide limits of agreement. Only GPT-5 Nano showed approximately unbiased predictions (Supplementary Figure S7). Lin's concordance correlation coefficients were uniformly poor (range: -0.14 to -0.05), with all confidence intervals including negative values, indicating weak absolute agreement with reference SCORE2 values.

SCORE2 Applicability

Models demonstrated variable performance in identifying patients where SCORE2 should not be applied due to established CVD, diabetes, CKD, or FH (Table 5, Fig. 4, Supplementary Table S8). GPT-4.1 and Grok-3 achieved perfect classification (F1 1.00, accuracy 1.00), with no false positives or false negatives. Most models maintained missed-override rates below our prespecified 5% safety threshold, ensuring that high-risk patients requiring alternative management were rarely misclassified as SCORE2-eligible. The notable exception was Gemini 2.0 Flash with a 15.0% missed-override rate (95% CI 0.0–38.5%), though the wide confidence interval reflects considerable uncertainty. Over-blocking rates, indicating inappropriate withholding of SCORE2 in eligible patients, showed substantial variation from 0.0% (GPT-4o, GPT-4.1, Grok-3, GPT-5 Nano) to 55.0% (95% CI 34.1–75.0%) for Claude Sonnet 4.5, revealing a critical trade-off between safety and clinical usability.
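The two safety metrics described above can be read as a false-negative rate and a false-positive rate for the override decision. A minimal sketch, assuming each assessment is coded as a boolean where True means "do not use SCORE2"; the function name and the illustrative counts are assumptions, not study data.

```python
from typing import Dict, List


def override_rates(gold: List[bool], model: List[bool]) -> Dict[str, float]:
    """Missed-override and over-blocking rates for the SCORE2 applicability decision.

    True means "do not use SCORE2" (an exception condition applies).
    """
    tp = sum(g and m for g, m in zip(gold, model))          # exception correctly flagged
    fn = sum(g and not m for g, m in zip(gold, model))      # exception missed
    fp = sum((not g) and m for g, m in zip(gold, model))    # eligible case blocked
    tn = sum((not g) and (not m) for g, m in zip(gold, model))
    return {
        # Share of true exceptions that the model treated as SCORE2-eligible.
        "missed_override": fn / (tp + fn) if tp + fn else float("nan"),
        # Share of eligible cases in which the model withheld SCORE2.
        "over_blocking": fp / (fp + tn) if fp + tn else float("nan"),
        "accuracy": (tp + tn) / len(gold),
        "f1": 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else float("nan"),
    }


# Illustrative counts: 20 exception cases, 40 eligible cases, and a model that
# flags every exception but also blocks 22 of the 40 eligible cases.
gold_flags = [True] * 20 + [False] * 40
model_flags = [True] * 20 + [True] * 22 + [False] * 18
rates = override_rates(gold_flags, model_flags)
```

With these counts the model misses no exceptions (0% missed-override) but over-blocks 55% of eligible cases, which illustrates how a model can look safe on one metric while being impractical on the other.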
Table 5. SCORE2 applicability: override usage, rationale validity, and decision performance by model

| Model | Do Not Use (n/N) | Do Not Use (%) | Reason Provided (%) | Valid Reason (%) | F1 (override, 95% CI) | Accuracy (95% CI) |
|---|---|---|---|---|---|---|
| GPT-4o | 19/60 | 31.7 | 95.0 | 100.0 | 0.97 (0.92–1.00) | 0.98 (0.95–1.00) |
| GPT-5 | 22/60 | 36.7 | 100.0 | 100.0 | 0.95 (0.83–1.00) | 0.97 (0.90–1.00) |
| GPT-4.1 | 20/60 | 33.3 | 100.0 | 100.0 | 1.00 (1.00–1.00) | 1.00 (1.00–1.00) |
| Claude Sonnet 4.5 | 42/60 | 70.0 | 100.0 | 85.7 | 0.65 (0.42–0.81) | 0.63 (0.47–0.80) |
| Gemini 2.5 Pro | 33/60 | 55.0 | 100.0 | 87.9 | 0.76 (0.53–0.90) | 0.78 (0.63–0.92) |
| Claude Opus 4.1 | 31/60 | 51.7 | 100.0 | 96.8 | 0.78 (0.57–0.92) | 0.82 (0.68–0.93) |
| DeepSeek V3 | 23/60 | 38.3 | 95.0 | 82.6 | 0.88 (0.72–0.98) | 0.92 (0.82–0.98) |
| Gemini 2.0 Flash | 19/60 | 31.7 | 85.0 | 94.7 | 0.87 (0.68–0.98) | 0.92 (0.83–0.98) |
| Grok-3 | 20/60 | 33.3 | 100.0 | 100.0 | 1.00 (1.00–1.00) | 1.00 (1.00–1.00) |
| Llama 3.3 70B Instruct | 26/60 | 43.3 | 95.0 | 88.5 | 0.83 (0.62–0.95) | 0.87 (0.75–0.97) |
| GPT-5 Nano | 19/60 | 31.7 | 95.0 | 100.0 | 0.97 (0.91–1.00) | 0.98 (0.95–1.00) |

Legend. Definitions: "Do Not Use" represents the proportion of vignettes flagged by the model as SCORE2-ineligible; "Reason Provided" indicates the percentage of override decisions accompanied by a text rationale; "Valid Reason" indicates the percentage of those rationales matching the Gold Standard. Metrics: F1 and Accuracy refer to the binary classification performance for the override decision (Positive Class = "Do not use SCORE2"). Models are ordered by κw from the primary risk analysis. Abbreviations: CI, confidence interval; SCORE2, Systematic Coronary Risk Estimation 2.

Rationale transparency varied considerably across models (Supplementary Table S9). While most systems provided explanations when blocking SCORE2 (≥ 95.0% in ten models according to Table 5), the validity of these reasons differed markedly. Five models (GPT-4o, GPT-5, GPT-4.1, Grok-3, GPT-5 Nano) achieved 100% validity, correctly identifying guideline-specified contraindications.
In contrast, four models showed concerning rates of invalid rationales: DeepSeek V3 (17.4%), Claude Sonnet 4.5 (14.3%), Gemini 2.5 Pro (12.1%), and Llama 3.3 70B Instruct (11.5%). These invalid explanations predominantly cited non-qualifying cardiovascular conditions or non-cardiovascular factors, potentially leading to inappropriate clinical decisions (detailed breakdown in Supplementary Table S9).

Extraction accuracy for specific override conditions correlated strongly with overall applicability performance (Supplementary Table S8). Top-performing models demonstrated near-perfect multilabel extraction: GPT-4.1 and Grok-3 achieved Micro-F1 1.00, while GPT-4o, DeepSeek V3, and GPT-5 Nano reached 0.97. Models with weaker override decisions showed correspondingly poor condition extraction, with Claude Sonnet 4.5 achieving the lowest Micro-F1 of 0.71 (95% CI 0.48–0.88). Among high-performing models, the distribution of identified conditions aligned closely with the vignette case-mix, suggesting robust recognition across diverse clinical presentations rather than systematic bias toward specific conditions (Supplementary Table S9).

Language Performance

Model performance demonstrated robust consistency across Portuguese and English vignettes (Table 6, Supplementary Figure S8). For risk factor extraction, micro-F1 scores exceeded 0.95 in both languages, with language differences ranging from −0.01 to 0.02. No model exhibited clinically meaningful language effects (all FDR-adjusted p > 0.05).
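The text reports FDR-adjusted p-values without naming the procedure. As a hedged illustration only, the sketch below implements the Benjamini-Hochberg step-up adjustment, which is a common choice for FDR control; whether the authors used this exact method is an assumption, and the p-values shown are hypothetical.

```python
from typing import List


def benjamini_hochberg(p_values: List[float]) -> List[float]:
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices from smallest to largest p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity of the adjusted values.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values for four per-model language comparisons.
raw = [0.01, 0.04, 0.03, 0.20]
adjusted = benjamini_hochberg(raw)
significant = [p < 0.05 for p in adjusted]
```

With eleven models and three outcome domains, many comparisons are made at once, so an unadjusted p < 0.05 for a single model would not by itself indicate a language effect.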
Table 6. Language performance across Portuguese vs English

| Model | PT Micro-F1 | EN Micro-F1 | ΔF1 (95% CI) | PT κw | EN κw | Δκw (95% CI) | PT Acc | EN Acc | ΔAcc (95% CI) |
|---|---|---|---|---|---|---|---|---|---|
| GPT-4o | 0.99 | 0.98 | -0.01 (-0.01 to 0.00) | 0.69 | 0.68 | -0.01 (-0.12 to 0.09) | 1.00 | 0.97 | -0.03 (-0.10 to 0.00) |
| GPT-5 | 0.97 | 0.98 | 0.01 (0.00 to 0.02) | 0.68 | 0.69 | 0.02 (-0.10 to 0.13) | 0.97 | 0.97 | +0.00 (0.00 to 0.00) |
| GPT-4.1 | 0.98 | 0.98 | -0.00 (-0.01 to 0.01) | 0.60 | 0.69 | 0.09 (-0.02 to 0.23) | 1.00 | 1.00 | +0.00 (0.00 to 0.00) |
| Claude Sonnet 4.5 | 0.98 | 0.99 | 0.00 (-0.00 to 0.01) | 0.59 | 0.62 | 0.04 (-0.17 to 0.27) | 0.63 | 0.63 | +0.00 (-0.10 to 0.10) |
| Gemini 2.5 Pro | 0.98 | 0.98 | -0.00 (-0.01 to 0.01) | 0.64 | 0.53 | -0.10 (-0.24 to 0.00) | 0.77 | 0.80 | +0.03 (-0.07 to 0.13) |
| Claude Opus 4.1 | 0.98 | 0.98 | -0.00 (-0.01 to 0.00) | 0.63 | 0.52 | -0.11 (-0.28 to 0.07) | 0.80 | 0.83 | +0.03 (-0.07 to 0.13) |
| DeepSeek V3 | 0.98 | 0.98 | 0.00 (-0.00 to 0.01) | 0.60 | 0.53 | -0.07 (-0.21 to 0.06) | 0.93 | 0.90 | -0.03 (-0.13 to 0.07) |
| Gemini 2.0 Flash | 0.96 | 0.98 | 0.02 (0.01 to 0.03) | 0.61 | 0.45 | -0.16 (-0.38 to 0.07) | 0.90 | 0.93 | +0.03 (-0.07 to 0.17) |
| Grok-3 | 0.98 | 0.98 | -0.00 (-0.01 to 0.01) | 0.44 | 0.46 | 0.02 (-0.07 to 0.15) | 1.00 | 1.00 | +0.00 (0.00 to 0.00) |
| Llama 3.3 70B Instruct | 0.98 | 0.97 | -0.01 (-0.02 to 0.00) | 0.34 | 0.56 | 0.22 (-0.04 to 0.49) | 0.83 | 0.90 | +0.07 (-0.07 to 0.20) |
| GPT-5 Nano | 0.99 | 0.98 | -0.01 (-0.02 to 0.00) | 0.33 | 0.46 | 0.13 (-0.08 to 0.36) | 0.97 | 1.00 | +0.03 (0.00 to 0.10) |

Legend. Comparison: Paired performance analysis between Portuguese (PT) and English (EN) vignettes across three domains: traditional risk factor extraction (Micro-F1), ESC risk classification (κw), and SCORE2 applicability (Accuracy). Metrics: Δ represents the difference (EN − PT); values are absolute differences for F1 and κw, and percentage points for Accuracy; 95% confidence intervals are shown in parentheses. No statistically significant differences were found after FDR correction. Models are ordered by κw. Abbreviations: Acc, accuracy; CI, confidence interval; κw, quadratic-weighted kappa.

Risk classification accuracy revealed greater language-dependent variability.
Llama 3.3 70B Instruct exhibited the largest language effect (Δκw = 0.22, 95% CI -0.04 to 0.49), performing better in English, while Gemini 2.0 Flash showed the opposite pattern (Δκw = -0.16, 95% CI -0.38 to 0.07), performing better in Portuguese. Despite individual variations, no model demonstrated statistically significant language effects (all FDR-adjusted p > 0.05). SCORE2 applicability assessment showed consistent accuracy across languages for all models, without significant differences (all FDR-adjusted p > 0.05). Language differences were minor, ranging from 0 to 7 percentage points, with Llama 3.3 70B Instruct showing the largest difference.

Medical Raters Performance

Eight clinicians independently classified the 30 Portuguese vignettes using the ESC three-class system. Individual agreement with the Gold Standard, measured by quadratic-weighted Cohen's kappa, was highly heterogeneous, ranging from slight agreement (evaluator #6: κw = 0.15, 95% CI: -0.10 to 0.40) to almost perfect agreement (evaluator #4: κw = 0.93, 95% CI: 0.85 to 1.00), with the majority achieving moderate agreement comparable to LLM performance (Fig. 5, Supplementary Table S10). Inter-rater reliability among clinicians, assessed using Gwet's AC2 with quadratic weights, was moderate (AC2 = 0.44, 95% CI: 0.27 to 0.55), indicating meaningful but imperfect consensus independent of the Gold Standard. When pooled by majority vote, the clinicians achieved substantial agreement (κw = 0.76, 95% CI: 0.58 to 0.89), with 76.7% accuracy (23/30 vignettes). All misclassifications were limited to adjacent risk categories, with no critical two-level errors observed between low-to-moderate and very-high risk groups (Supplementary Figure S9). This pooled clinician benchmark exceeded the highest single-model performance observed in the primary analysis (GPT-4o: κw = 0.69), establishing a practical performance ceiling for LLM evaluation on these clinical vignettes.

3. Discussion

In this comprehensive evaluation of eleven contemporary LLMs for cardiovascular risk stratification, we demonstrated that models excel at extracting traditional cardiovascular risk factors from clinical summaries but show moderate and variable performance in translating these factors into guideline-concordant risk classifications. The GPT family dominated three-class ESC risk categorization, with GPT-4o achieving the highest agreement with expert adjudication, followed closely by GPT-5 and GPT-4.1, although GPT-5 Nano showed the weakest performance among all models. However, ten of eleven models systematically underestimated cardiovascular risk, representing a critical safety concern for clinical deployment. LLMs also struggled with numeric SCORE2 calculation, producing clinically unacceptable mean absolute errors exceeding 5 percentage points in all but one model, revealing their inability to reliably compute risk stratification formulas. Conversely, most models demonstrated robust capability for identifying patients with conditions that require alternative risk assessment beyond SCORE2, such as established ASCVD, CKD, diabetes, or FH, missing fewer than 5% of these cases, with some models missing none. To our knowledge, this is the first benchmark study evaluating LLMs for cardiovascular risk stratification. Our findings establish that while current models possess strong capabilities for analyzing clinical information and extracting relevant data, substantial refinement in clinical reasoning and risk quantification is required before deployment in cardiovascular prevention workflows.

The striking performance gap between risk factor extraction and ESC risk classification was expected, as the former represents a well-defined labeling task while the latter requires multi-step reasoning under ambiguous clinical definitions.
The near-perfect accuracy in extracting traditional cardiovascular risk factors aligns with prior natural language processing achievements and is unsurprising given their objective, well-documented nature, because age, blood pressure, and cholesterol values are unambiguous data points in clinical text 19 – 22 . However, even within traditional risk factors, we observed a gradient of difficulty: factors requiring interpretive judgment showed incrementally lower performance, with smoking status, hypertension diagnosis, and dyslipidemia diagnosis revealing the challenges of extracting concepts that extend beyond simple numeric values. This pattern amplified dramatically with risk modifiers, likely because factors such as 'family history of premature ASCVD' or 'chronic inflammatory disease' lack discrete definitions and require clinical interpretation beyond text matching 22 , 23 . Most critically, translating extracted data into ESC risk categories demands higher-order medical reasoning, integrating multiple variables, applying guideline exceptions, and weighing modifiers, a complex cognitive synthesis where we observed substantial model performance variability. This aligns with mixed results reported for LLMs in diagnostic reasoning and clinical decision-making, where models consistently excel at information retrieval but struggle with multi-step clinical inference requiring integration of competing factors 16 , 24 , 25 . The predominant tendency toward risk underestimation in ten of eleven models is particularly concerning, paralleling extensive literature documenting that physicians relying on unaided clinical judgment also tend to underestimate cardiovascular risk, especially in high-risk patients 26 , 27 . This convergent bias suggests that LLMs have learned to replicate the underestimation patterns present in training data 5 , 28 , 29 . 
The failure of LLMs to accurately compute numeric SCORE2 values, despite successfully extracting all required variables, reveals fundamental limitations in mathematical reasoning, although we acknowledge that our zero-shot prompting strategy may have contributed to the variable completion rates observed. Only three models computed absolute risk in all eligible vignettes, and the responses systematically overestimated numeric risk. This contrasts paradoxically with the models' conservative categorical classifications and suggests different failure modes for arithmetic computation versus clinical judgment 30. This disconnect reflects transformer architectures' well-documented reliance on pattern recognition rather than true calculation 30. Notably, GPT-5 Nano achieved the lowest numeric error but the poorest categorical performance, highlighting a possible trade-off between computational accuracy and clinical reasoning that has been described with larger models 31. Consequently, these results strongly support a hybrid, safety-oriented workflow for clinical deployment: using LLMs exclusively for information extraction, where performance was near-perfect for all traditional SCORE2 variables (micro-F1 ≥ 0.97), followed by a deterministic implementation of the SCORE2 algorithm to compute absolute risk. This architecture would eliminate the primary source of numerical error, specifically the LLMs' unreliable arithmetic reasoning, while retaining their greatest strengths in unstructured data processing 30, 32. In contrast to their difficulty with risk categorization, most models successfully identified patients requiring guideline exceptions to SCORE2, with missed-override rates below 5% for high-risk conditions (ASCVD, diabetes, CKD, FH), and GPT-4.1 and Grok-3 achieving perfect performance. However, five models over-blocked SCORE2 in more than 10% of eligible cases, reaching 55% in Claude Sonnet 4.5, compromising clinical usability.
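The two applicability error rates discussed above can be made concrete with a short sketch. The Python below is illustrative only and is not the study's R code: the class, function, and field names are hypothetical, and the definitions used (missed overrides counted among vignettes where the reference says SCORE2 should not be used, over-blocking counted among SCORE2-eligible vignettes) are one reading of the text.

```python
from dataclasses import dataclass


@dataclass
class ApplicabilityMetrics:
    missed_override_rate: float  # safety: exceptions the model failed to flag
    over_blocking_rate: float    # usability: eligible cases wrongly blocked
    f1_block: float              # F1 for the positive "Do not use SCORE2" class


def applicability_metrics(gold_block: list[bool],
                          pred_block: list[bool]) -> ApplicabilityMetrics:
    """Each entry is True when SCORE2 should NOT be used for that vignette."""
    if len(gold_block) != len(pred_block):
        raise ValueError("gold and prediction lists must have the same length")
    pairs = list(zip(gold_block, pred_block))
    tp = sum(g and p for g, p in pairs)              # exception correctly flagged
    fn = sum(g and not p for g, p in pairs)          # exception missed
    fp = sum((not g) and p for g, p in pairs)        # eligible case blocked
    tn = sum((not g) and (not p) for g, p in pairs)  # eligible case passed through
    missed = fn / (tp + fn) if (tp + fn) else 0.0
    over = fp / (fp + tn) if (fp + tn) else 0.0
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return ApplicabilityMetrics(missed, over, f1)


# Hypothetical model output on 30 vignettes (10 exceptions, 20 eligible):
# one exception is missed and two eligible vignettes are blocked.
gold = [True] * 10 + [False] * 20
pred = [True] * 9 + [False] + [True] * 2 + [False] * 18
print(applicability_metrics(gold, pred))
```

Reporting the two rates separately keeps the safety failure (a missed exception) distinct from the usability failure (an unnecessary block), which a single accuracy figure would blur.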
While GPT models and Grok-3 consistently provided valid explanations for their decisions, six other models generated hallucinated medical justifications that could mislead users, incorrectly citing conditions like valvular disease, atrial fibrillation, heart failure, chronic inflammatory diseases, or simply the presence of risk factors as contraindications. Models also handled missing data inconsistently: three models failed to provide risk categories in select vignettes, with GPT-5 transparently refusing classification when smoking status was absent, while Llama 3.3 70B and GPT-5 Nano failed silently without explanation, suggesting system failure rather than clinical judgment. This combination of over-blocking, false rationales, and variable responses to incomplete data reveals concerning reliability gaps despite acceptable performance metrics 18 , 23 , 33 , 34 . There are concerns that LLMs might have a worse performance in low-resource languages 35 , 36 . Despite Portuguese being substantially less represented in training corpora, we observed no statistically significant language effects, with risk factor extraction maintaining very high micro-F1 scores and top-performing models showing minimal risk classification differences 36 . This bilingual consistency is an encouraging finding from our study, particularly relevant for non-English speaking healthcare systems, where locally developed artificial intelligence tools remain scarce 37 , 38 . Also, the linguistic robustness indicates that fundamental challenges in cardiovascular risk assessment transcend language barriers and reflect architectural limitations rather than language-specific training gaps. To contextualize LLM performance, our exploratory analysis of eight practicing physicians revealed substantial heterogeneity in cardiovascular risk classification, with individual agreement to the gold standard ranging from poor to near-perfect. 
The wide inter-rater variability underscores that cardiovascular risk assessment remains challenging even for experienced clinicians, reflecting the complexity of integrating multiple risk factors into categorical decisions 3, 5, 26. Taken together, these findings position current LLM performance in a meaningful intermediate range. While top models such as GPT-4o achieved agreement comparable to mid-performing clinicians, they did not reach the accuracy of the best human evaluator. When pooled using majority voting, the clinician consensus outperformed every individual model, reinforcing that collective clinical judgment provides a robustness that current LLMs cannot yet replicate, although the ensemble also outperformed the average individual clinician. However, a critical safety gap persists: physicians avoided two-level misclassifications entirely, whereas several LLMs exhibited these critical errors that could lead to patient under-treatment 41, 42. It is important to note that our study occurred under controlled conditions; in real-world practice, where clinicians are often overworked and fatigued, we hypothesize that LLMs could offer even greater utility than found here by acting as a vigilant "second opinion" against cognitive overload 39, 40. Ultimately, these results suggest LLMs are not yet autonomous decision-makers but could serve as powerful augmentation tools, particularly for standardizing assessment among clinicians performing below the expert median 41, 43. Most physicians achieved moderate agreement, and notably, only one outperformed GPT-4o, with the remaining seven showing lower weighted kappa scores. The substantial overlap between individual physician performance and LLM ranges suggests these models could already augment clinicians performing below the median, though they cannot yet replace expert consensus 39, 40.
When pooled using majority voting, physician consensus achieved substantial agreement, exceeding the best-performing LLM by a narrow margin. Critically, physicians avoided two-level misclassifications entirely, while several LLMs exhibited these critical errors that could lead to patient under-treatment 41, 42. These results position current LLMs as potential clinical support tools rather than autonomous decision-makers, particularly valuable for standardizing risk assessment in settings where specialized cardiovascular expertise is limited 41, 43. Model hierarchies emerging from our benchmark have direct implications for deployment. The GPT family (GPT-4o, GPT-5, GPT-4.1) consistently led overall performance, combining near-perfect extraction of traditional risk factors with top-tier identification of risk modifiers, perfect specificity for high-risk override conditions, absence of two-level misclassifications, and weighted agreement for ESC risk stratification of κw > 0.65 (best: GPT-4o κw = 0.69). GPT-5 and GPT-4.1 also achieved perfect sensitivity for high-risk identification, and GPT-5 displayed appropriately conservative behavior in the presence of missing data: the model refused to assign an ESC risk category in vignettes where smoking status was unknown, a behavior that, while reducing completion rates, arguably reflects higher algorithmic fidelity and safety than forced guessing. In contrast, GPT-5 Nano was the weakest performer, and Gemini Flash underperformed relative to Gemini Pro, reinforcing that model capacity materially affects clinical reasoning 31.
The Claude family exhibited pronounced over-blocking of SCORE2 eligibility, with Sonnet 4.5 flagging 55% of otherwise eligible vignettes, limiting clinical usability. Another concerning finding was the generation of erroneous rationales when blocking SCORE2, with DeepSeek V3 (17.4%), Claude Sonnet 4.5 (14.3%), and Gemini 2.5 Pro (12.1%) incorrectly citing non-qualifying conditions and risking clinician misdirection. While extraction-only tasks showed little separation across models, integrated risk assessment requiring multi-step clinical reasoning clearly discriminated moderate-to-good performers from others. When analyzing open-source (DeepSeek V3 and Llama 3.3 70B Instruct) versus closed-source models, we found that for risk factor extraction all models presented high micro-F1 scores. Regarding risk modifiers, closed-source models presented better performance. Despite this difference in risk modifiers, which might be influenced by the prompting strategy, these findings also demonstrated that open-source models can be of use for cardiovascular risk assessment if a two-step approach is considered (step 1 extraction and step 2 calculation using the formula, not the LLM). However, considering the proposed strategy with a single prompt to extract risk factors and calculate risk, these results support prioritizing the GPT family for prospective clinical evaluation and align with contemporary deployment guidance that emphasizes high sensitivity for safety-critical screening and rigorous verification of explanation faithfulness and hallucination control 44 . Our study used simulated clinical vignettes rather than real clinical notes, which may limit external validity and underrepresent documentation artifacts, like abbreviations, contradictions, and missing-data patterns, found in electronic health records. The vignette case mix was skewed toward high and very high-risk, which could inflate agreement metrics and reduce generalizability to lower-prevalence settings. 
Sample size of 30 vignettes, while adequate for initial benchmarking, provides limited statistical power for subgroup analyses, and the resulting wide confidence intervals constrain the precision with which models can be comparatively ranked. The adjudicated reference standard, although expert-based, introduces inherent subjectivity in defining the ground truth for risk classification. Furthermore, this standard represents an idealized benchmark derived from experts explicitly focused on rigorous calculation, which likely exceeds the implicit, often heuristic-based risk assessment typical of routine daily practice; consequently, our evaluation subjects the models to a stricter performance threshold than that often found in real-world clinical environments. Evaluating a single guideline framework (ESC/SCORE2) restricts generalizability to other calculators and guidelines (e.g., ASCVD, QRISK3). The cross-sectional, single-time-point design may not reflect rapidly evolving models. Each model was run once per language; although English/Portuguese replicates were consistent, robustness to random seeds, temperature settings, and prompt variations remains untested. Finally, all analyses used zero-shot prompting; few-shot prompting, chain-of-thought reasoning, or fine-tuning might yield different results. In this comprehensive benchmark of eleven contemporary LLMs for cardiovascular risk stratification, models achieved near-perfect extraction of traditional risk factors yet demonstrated only moderate accuracy in ESC risk categorization and unreliable SCORE2 calculations, precluding autonomous clinical use. Most models correctly identified guideline exceptions requiring alternative assessment and maintained robust performance across Portuguese and English, supporting near-term applications in structured documentation and eligibility screening under clinical supervision. 
This foundational study establishes the current capabilities and limitations of LLMs in preventive cardiology, providing critical evidence for implementation strategies while underscoring the necessity for real-world validation, enhanced mathematical reasoning, and safeguards against systematic bias before broader clinical adoption.

4. Methods

Study design

We conducted a prespecified, simulation-based evaluation of LLMs for cardiovascular risk stratification in accordance with the 2021 ESC prevention guidelines 3. The study used systematically developed clinical vignettes to benchmark model capabilities against a reference standard. No real patient data were used. The study protocol was previously registered on the Open Science Framework ( https://doi.org/10.17605/OSF.IO/J2ZK9 ).

Clinical vignette development and validation

Thirty clinical vignettes were systematically developed by a senior cardiologist to emulate outpatient clinical notes of patients undergoing cardiovascular risk stratification. Each vignette (100–200 words) was written in free-text format and included demographics, medical history, current medications, physical examination findings, laboratory results, and relevant diagnostic investigations, structured to reflect authentic clinical documentation. All vignettes incorporated the core variables required for SCORE2 risk calculation (age, sex, smoking status, systolic blood pressure, total cholesterol, and HDL cholesterol) together with additional cardiovascular risk modifiers. The set was designed to ensure balanced representation across sex, age strata (40–49, 50–59, and 60–69 years), and ESC risk categories (low/moderate, high, very high). Ten vignettes represented conditions in which SCORE2 is not applicable, including ASCVD (n = 4), diabetes mellitus (n = 3), CKD (n = 2), and FH (n = 1).
Each vignette underwent independent evaluation by a panel of three cardiologists using a structured four-domain rubric assessing clinical relevance, completeness, realism, and clarity. Domains were rated on a four-point Likert scale, and domain-level validity required unanimous ratings of 3–4 (Item-Level Content Validity Index = 1.00). Vignettes not meeting this criterion were revised iteratively until all domains achieved full agreement. All vignettes were produced and validated in Portuguese and subsequently translated into English by the original author. The English versions were reviewed by a native English-speaking cardiologist, and a back-translation into Portuguese was performed to confirm conceptual equivalence. Any discrepancies were resolved by consensus between the author and the reviewer. Models, deployment, and prompting We evaluated a combination of proprietary and open-source LLMs selected by convenience and local availability, ensuring representation of the most widely used and high-performing contemporary systems. The final set comprised eleven models, listed alphabetically: Claude Opus 4.1, Claude Sonnet 4.5, DeepSeek V3, Gemini 2.0 Flash, Gemini 2.5 Pro, GPT-4.1, GPT-4o, GPT-5, GPT-5 Nano, Grok-3, and Llama 3.3 70B Instruct (see Supplementary Table S11 for model specifications). All models were accessed through the Azure platform or their respective cloud-based application programming interfaces, using default temperature settings and inference-only mode without any fine-tuning on the study dataset (Supplementary Table S11). Each vignette was evaluated in a new, independent session to prevent memory effects or cross-vignette information leakage, ensuring analytical independence across cases. A standardized prompt template was iteratively developed to ensure consistent task interpretation across models (see Supplementary Appendix S2 for standardized prompt template). 
The prompt explicitly instructed each model to: (1) extract cardiovascular risk factors in a structured format; (2) determine SCORE2 applicability; (3) calculate the 10-year cardiovascular risk when appropriate; (4) classify the patient into ESC risk categories (Low-to-Moderate, High, or Very High); (5) provide a concise clinical explanation for the assigned category; and (6) generate a JSON file to facilitate structured data extraction. Models were specifically directed to use the official SCORE2 calculator for moderate-risk countries or the corresponding risk tables published in the 2021 ESC Guidelines 3 . The prompt structure was identical in Portuguese and English, with only the vignette text and language-specific formatting adapted. Zero-shot prompting was employed to assess each model’s intrinsic reasoning capabilities. Every model completed all 60 assessments (30 Portuguese and 30 English vignettes). Reference Standard A three-member Cardiovascular Risk Adjudication Committee, composed of senior cardiologists with recognized expertise in cardiovascular prevention and not involved in vignette development or model evaluation, independently extracted all relevant cardiovascular risk factors and modifiers from each vignette. Using the 2021 ESC guidelines and assuming the moderate-risk European SCORE2 calibration, the committee calculated the 10-year cardiovascular risk and assigned the corresponding ESC risk category 3 . For conditions in which SCORE2 was not applicable, the committee applied guideline-based categorical classification overriding SCORE2 values. Discrepancies were discussed and resolved by consensus, and the final adjudicated outputs constituted the reference (Gold Standard) against which LLMs were compared. 
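Agreement with this Gold Standard is summarized in the study with quadratic-weighted Cohen's κ. As a minimal sketch (not the study's R implementation), the statistic can be computed from two lists of ordinal labels. The integer coding 0 = Low-to-Moderate, 1 = High, 2 = Very High and the function name are assumptions made for illustration.

```python
def quadratic_weighted_kappa(gold: list[int], pred: list[int],
                             n_classes: int = 3) -> float:
    """Cohen's kappa with quadratic disagreement weights.

    Labels are ordinal integers in 0..n_classes-1 (assumed coding:
    0 = Low-to-Moderate, 1 = High, 2 = Very High).
    """
    if len(gold) != len(pred) or not gold:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(gold)

    # Observed confusion counts and their marginals.
    observed = [[0] * n_classes for _ in range(n_classes)]
    for g, p in zip(gold, pred):
        observed[g][p] += 1
    row = [sum(observed[i]) for i in range(n_classes)]
    col = [sum(observed[i][j] for i in range(n_classes))
           for j in range(n_classes)]

    weighted_observed = 0.0  # weighted disagreement actually seen
    weighted_expected = 0.0  # weighted disagreement expected by chance
    for i in range(n_classes):
        for j in range(n_classes):
            # Quadratic weight: 0 on the diagonal, 1 for the largest gap.
            w = (i - j) ** 2 / (n_classes - 1) ** 2
            weighted_observed += w * observed[i][j] / n
            weighted_expected += w * row[i] * col[j] / (n * n)

    if weighted_expected == 0:
        raise ValueError("kappa is undefined when both raters use one class")
    return 1.0 - weighted_observed / weighted_expected


# Hypothetical labels for six vignettes: two adjacent-category errors.
gold = [0, 0, 1, 1, 2, 2]
pred = [0, 1, 1, 1, 2, 1]
print(round(quadratic_weighted_kappa(gold, pred), 3))
```

Because the weight grows with the square of the distance between categories, a Low-to-Moderate versus Very High confusion is penalized four times as heavily as an adjacent-category error, which matches the paper's emphasis on two-level misclassifications.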
Outcomes The study outcomes were categorized into primary and secondary endpoints: Primary Outcomes Risk-factor extraction accuracy: Assessed for twelve traditional cardiovascular risk factors, including the six SCORE2 core variables (age, sex, smoking status, systolic blood pressure, total cholesterol, and HDL cholesterol) and six additional factors (diastolic blood pressure, LDL cholesterol, non-HDL cholesterol, triglycerides, hypertension diagnosis, and dyslipidemia diagnosis). Model performance was quantified using micro- and macro-averaged precision, recall, and F1-scores, and agreement was summarized using the per-vignette Jaccard similarity coefficient. Three-class cardiovascular risk classification: Agreement between model-predicted and reference ESC categories (Low-to-Moderate, High, Very High) was measured using quadratic-weighted Cohen’s κ (κw) as the primary metric. Supplementary measures included overall accuracy and the rate of major error (defined as two-class misclassification). Secondary Outcomes Risk-modifier extraction: Evaluated for twelve predefined factors (elevated coronary calcium score, calcium score equal to zero, pre-diabetes, obesity, family history of premature ASCVD, elevated lipoprotein(a), increased arterial stiffness, elevated high-sensitivity C-reactive protein, chronic inflammatory disease, obstructive sleep apnoea, chronic obstructive pulmonary disease, and cancer) using the same extraction metrics. Numeric SCORE2 agreement: For cases in which SCORE2 was applicable, numeric agreement between model-predicted and reference 10-year cardiovascular risk values (%) was quantified using mean absolute error (MAE), root mean square error (RMSE), Bland–Altman bias and limits of agreement, and Lin’s concordance correlation coefficient (CCC). SCORE2 applicability: Assessed as a binary decision on whether the SCORE2 algorithm should be applied (identifying exceptions). 
Key metrics included the missed-override rate (safety), over-blocking rate (usability), F1-score for the positive (“Do not use SCORE2”) class, overall accuracy, and correctness of the stated override reason. Language robustness: Evaluated by paired comparison of Portuguese and English outputs for both extraction and risk classification endpoints.

Human Benchmark Analysis

To contextualize model performance, an exploratory analysis was conducted to establish a human benchmark for cardiovascular risk classification. Eight physicians (three family medicine specialists, three internal medicine specialists, and two cardiologists), each with more than three years of clinical experience, independently classified the thirty Portuguese vignettes according to ESC risk categories (Low-to-Moderate, High, Very High). All raters were blinded to model outputs and to each other’s assessments. Agreement with the Gold Standard was quantified using quadratic-weighted Cohen’s κ (κw), and inter-rater consistency was measured using Gwet’s AC2 coefficient with quadratic weights. An ensemble classification was generated through majority voting with median tie-breaking to represent the collective physician consensus.

Statistical Analysis

All analyses were conducted in R (version 4.4.1) using RStudio (v2025.09.1+401; Posit Software, Boston, MA, USA). The Portuguese Gold Standard served as the reference dataset for both language versions. Each model generated 60 predictions (30 Portuguese and 30 English), paired by vignette ID. For continuous variables, a tolerance of ± 0.2 units was applied to account for minor transcription or rounding variations. Predefined rules governed the handling of incomplete outputs: for risk factor extraction, absent model detections of present factors were penalized as false negatives, whereas missing Gold Standard values resulted in exclusion.
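The extraction scoring rules just described (± 0.2 tolerance, absent detections as false negatives, missing Gold Standard values excluded) could be sketched as follows. This is an illustrative Python reading of those rules rather than the study's R code: the record layout and function names are hypothetical, and counting an out-of-tolerance value as both a false positive and a false negative is an assumption the text does not state.

```python
TOLERANCE = 0.2  # ± units for continuous variables, as stated in the text


def values_match(gold, pred, tol: float = TOLERANCE) -> bool:
    """Numeric values match within the tolerance; other values must be equal."""
    def is_number(x) -> bool:
        return isinstance(x, (int, float)) and not isinstance(x, bool)

    if is_number(gold) and is_number(pred):
        # Small epsilon so that e.g. 5.2 vs 5.0 is not lost to float rounding.
        return abs(gold - pred) <= tol + 1e-9
    return gold == pred


def micro_f1(gold_records: list[dict], pred_records: list[dict]) -> float:
    """Micro-averaged F1 over all (vignette, factor) pairs."""
    tp = fp = fn = 0
    for gold, pred in zip(gold_records, pred_records):
        for factor, gold_value in gold.items():
            if gold_value is None:
                continue  # missing Gold Standard value: excluded
            pred_value = pred.get(factor)
            if pred_value is None:
                fn += 1  # present factor not detected by the model
            elif values_match(gold_value, pred_value):
                tp += 1
            else:
                # Assumption: a wrong value is one false positive
                # plus one false negative.
                fp += 1
                fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical single-vignette example.
gold = [{"age": 55, "sbp": 140, "total_chol": 5.2, "smoker": True, "hdl": None}]
pred = [{"age": 55, "sbp": 141, "total_chol": 5.0, "smoker": None, "hdl": 1.1}]
print(round(micro_f1(gold, pred), 3))
```

In this example the age and total cholesterol count as matches, the systolic pressure falls outside the tolerance, the smoking status is an absent detection, and HDL is excluded because the reference value is missing.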
For ESC risk classification, unclassifiable responses were counted as incorrect for overall accuracy but excluded from agreement metrics to ensure ordinal validity. Numeric comparisons were restricted to vignettes with valid calculations from both the model and reference. Confidence intervals (95%) were calculated using non-parametric bootstrap resampling (2,000 vignette-level replicates) for paired metrics, including κw, accuracy, and safety-related error rates. Wilson score intervals were used for single-proportion estimates (e.g., sensitivity and specificity). Language comparisons employed paired t-tests or Wilcoxon signed-rank tests for continuous metrics, and McNemar’s test for categorical outcomes, with false discovery rate (FDR) correction for multiple comparisons using the Benjamini–Hochberg method. Predefined interpretation thresholds were as follows: F1-score ≥ 0.90 (excellent), 0.80–0.89 (good), 0.70–0.79 (fair), and [...]; κw > 0.80 (excellent), 0.61–0.80 (substantial), 0.41–0.60 (moderate); and missed-override rate [...] > 10% (concerning).

Ethics

This study used synthetic clinical vignettes only and contained no real-patient data or identifiable personal information. Vignettes were generated de novo and reviewed to ensure non-identifiability. For the clinician-benchmarking component, eight practicing clinicians voluntarily rated synthetic vignettes; no identifiable personal data or human biological material were collected. All clinicians provided informed consent prior to participation. Under the policy of the Luz Saúde Research and Ethics Committee (Lisbon, Portugal), the activity was determined to be exempt/not human participants research, and formal ethics review was waived (decision issued on 4 June 2025). The project was coordinated by Hospital da Luz Learning Health in collaboration with the participating institutions.

Declarations

Acknowledgements.
We thank Duarte Espregueira Mendes, Rita Marinheiro, and Victor Gil for clinical vignette validation; Inês Rosa, João Pereira, José Nuno Raposo, Hugo Viegas, Rita Gomes, Sérgio Madeira, Maria Madalena Rodrigues, and Vanessa Carvalho for clinician benchmarking; and Maria José Loureiro for English translation and validation.

Data Availability. The complete set of synthetic clinical vignettes, adjudicated labels, and model outputs that support the findings of this study are available on the Open Science Framework (OSF) at https://doi.org/10.17605/OSF.IO/J2ZK9.

Code availability. All R code to reproduce data processing, model evaluation, and figures is available at the same OSF project (https://doi.org/10.17605/OSF.IO/J2ZK9). Versioned releases will be archived at this DOI.

Author contributions. All conceived the study. JFS, RD, IM, JMM, JC, NAS, FL and BN designed the methodology. RCS designed the clinical vignettes. RBD, IM, JMM, FL and NAS implemented the LLMs evaluations. JFS, RLL, and HD provided expert adjudication. JFS and RBD performed the statistical analyses. JFS drafted the manuscript. All authors interpreted the results, revised the manuscript critically for important intellectual content, and approved the final version.

Competing interests. The authors declare no competing interests.

Funding. This research received no specific grant from any funding agency, commercial or not-for-profit sectors.

Ethics declaration. Ethical approval and consent procedures are described in Methods (Ethics). Briefly, the dataset comprised synthetic vignettes; the clinician-benchmarking task was determined exempt/not human participants research by Luz Saúde Research and Ethics Committee (Lisbon, Portugal), with a waiver of review (issued on 4 June 2025). All clinicians provided informed consent.

References

1. Mendis, S., Graham, I. & Narula, J. Addressing the Global Burden of Cardiovascular Diseases; Need for Scalable and Sustainable Frameworks. Glob. Heart 17, (2022).
2. Yusuf, S. et al. Modifiable risk factors, cardiovascular disease, and mortality in 155 722 individuals from 21 high-income, middle-income, and low-income countries (PURE): a prospective cohort study. Lancet Lond. Engl. 395, 795–808 (2020).
3. Visseren, F. L. J. et al. 2021 ESC Guidelines on cardiovascular disease prevention in clinical practice. Eur. Heart J. 42, 3227–3337 (2021).
4. Law, T. K. et al. Primary prevention of cardiovascular disease: global cardiovascular risk assessment and management in clinical practice. Eur. Heart J. - Qual. Care Clin. Outcomes 1, 31–36 (2015).
5. Sposito, A. C. et al. Physicians’ attitudes and adherence to use of risk scores for primary prevention of cardiovascular disease: cross-sectional survey in three world regions. Curr. Med. Res. Opin. 25, 1171–1178 (2009).
6. Liew, S. M. et al. Can doctors and patients correctly estimate cardiovascular risk? A cross-sectional study in primary care. BMJ Open 8, e017711 (2018).
7. Graham, I. M., Stewart, M., Hertog, M. G. L., & Cardiovascular Round Table Task Force. Factors impeding the implementation of cardiovascular prevention guidelines: findings from a survey conducted by the European Society of Cardiology. Eur. J. Cardiovasc. Prev. Rehabil. Off. J. Eur. Soc. Cardiol. Work. Groups Epidemiol. Prev. Card. Rehabil. Exerc. Physiol. 13, 839–845 (2006).
8. Persell, S. D., Dunne, A. P., Lloyd-Jones, D. M. & Baker, D. W. Electronic health record-based cardiac risk assessment and identification of unmet preventive needs. Med. Care 47, 418–424 (2009).
9. Sedlakova, J. et al. Challenges and best practices for digital unstructured data enrichment in health research: A systematic narrative review. PLOS Digit. Health 2, e0000347 (2023).
10. Asgari, E. et al. Impact of Electronic Health Record Use on Cognitive Load and Burnout Among Clinicians: Narrative Review. JMIR Med. Inform. 12, e55499 (2024).
11. Houssein, E. H., Mohamed, R. E. & Ali, A. A. Heart disease risk factors detection from electronic health records using advanced NLP and deep learning techniques. Sci. Rep. 13, 7173 (2023).
12. Boonstra, M. J., Weissenbacher, D., Moore, J. H., Gonzalez-Hernandez, G. & Asselbergs, F. W. Artificial intelligence: revolutionizing cardiology with large language models. Eur. Heart J. 45, 332–345 (2024).
13. Quer, G. & Topol, E. J. The potential for large language models to transform cardiovascular medicine. Lancet Digit. Health S2589-7500(24)00151-1 (2024). doi:10.1016/S2589-7500(24)00151-1.
14. Nolin-Lapalme, A. et al. Maximising Large Language Model Utility in Cardiovascular Care: A Practical Guide. Can. J. Cardiol. (2024). doi:10.1016/j.cjca.2024.05.024.
15. Eriksen, A. V., Möller, S. & Ryg, J. Use of GPT-4 to Diagnose Complex Clinical Cases. NEJM AI 1, AIp2300031 (2024).
16. Goh, E. et al. Large Language Model Influence on Diagnostic Reasoning: A Randomized Clinical Trial. JAMA Netw. Open 7, e2440969 (2024).
17. Skalidis, I. et al. ChatGPT takes on the European Exam in Core Cardiology: an artificial intelligence success story? Eur. Heart J. - Digit. Health 4, 279–281 (2023).
18. Ferreira Santos, J., Ladeiras-Lopes, R., Leite, F. & Dores, H. Applications of large language models in cardiovascular disease: a systematic review. Eur. Heart J. Digit. Health 6, 540–553 (2025).
19. Abdellaoui, C., Redjdal, A. & Seroussi, B. Generative-AI-Based Approaches for Information Extraction from Clinical Notes: A Scoping Review. Stud. Health Technol. Inform. 328, 193–197 (2025).
20. Houssein, E. H., Mohamed, R. E. & Ali, A. A. Heart disease risk factors detection from electronic health records using advanced NLP and deep learning techniques. Sci. Rep. 13, 7173 (2023).
21. Zhang, Z., Qiu, Y., Yang, X. & Zhang, M. Enhanced character-level deep convolutional neural networks for cardiovascular disease prediction. BMC Med. Inform. Decis. Mak. 20, 123 (2020).
22. Ntinopoulos, V. et al. Large language models for data extraction from unstructured and semi-structured electronic health records: a multiple model performance evaluation. BMJ Health Care Inform. 32, e101139 (2025).
23. Shah, S. V. Accuracy, Consistency, and Hallucination of Large Language Models When Analyzing Unstructured Clinical Notes in Electronic Medical Records. JAMA Netw. Open 7, e2425953 (2024).
24. Hager, P. et al. Evaluation and mitigation of the limitations of large language models in clinical decision-making. Nat. Med. 30, 2613–2622 (2024).
25. Gaber, F. et al. Evaluating large language model workflows in clinical decision support for triage and referral and diagnosis. NPJ Digit. Med. 8, 263 (2025).
26. Liew, S. M. et al. Can doctors and patients correctly estimate cardiovascular risk? A cross-sectional study in primary care. BMJ Open 8, e017711 (2018).
27. Webster, R. & Heeley, E. Perceptions of risk: understanding cardiovascular disease. Risk Manag. Healthc. Policy 3, 49–60 (2010).
28. Mihan, A., Pandey, A. & Van Spall, H. G. Mitigating the risk of artificial intelligence bias in cardiovascular care. Lancet Digit. Health 6, e749–e754 (2024).
29. Mihan, A., Pandey, A. & Van Spall, H. G. C. Artificial intelligence bias in the prediction and detection of cardiovascular disease. Npj Cardiovasc. Health 1, 31 (2024).
30. Khandekar, N. et al. MedCalc-Bench: Evaluating Large Language Models for Medical Calculations. Preprint at https://doi.org/10.48550/arXiv.2406.12036 (2024).
31. Small Language Models (SLMs) Can Still Pack a Punch: A survey. https://arxiv.org/html/2501.05465v1.
32. Roeschl, T. et al. Development of an LLM Pipeline Surpassing Physicians in Cardiovascular Risk Score Calculation. 2025.11.11.25340002 Preprint at https://doi.org/10.1101/2025.11.11.25340002 (2025).
33. Kim, Y. et al. Medical Hallucinations in Foundation Models and Their Impact on Healthcare. Preprint at https://doi.org/10.48550/arXiv.2503.05777 (2025).
34. Asgari, E. et al. A framework to assess clinical safety and hallucination rates of LLMs for medical text summarisation. Npj Digit. Med. 8, 274 (2025).
35. Qiu, P. et al. Towards building multilingual language model for medicine. Nat. Commun. 15, 8384 (2024).
36. Nunes, M., Boné, J., Ferreira, J. C., Chaves, P. & Elvas, L. B. MediAlbertina: An European Portuguese medical language model. Comput. Biol. Med. 182, 109233 (2024).
37. Chen, H. et al. Large language models and global health equity: a roadmap for equitable adoption in LMICs. Lancet Reg. Health – West. Pac. 63, (2025).
38. Garcia, G. L. et al. A Step Forward for Medical LLMs in Brazilian Portuguese: Establishing a Benchmark and a Strong Baseline. in 2025 IEEE 38th International Symposium on Computer-Based Medical Systems (CBMS) 214–219 (2025). doi:10.1109/CBMS65348.2025.00052.
39. Everett, S. S. et al. From Tool to Teammate: A Randomized Controlled Trial of Clinician-AI Collaborative Workflows for Diagnosis. MedRxiv Prepr. Serv. Health Sci. 2025.06.07.25329176 (2025). doi:10.1101/2025.06.07.25329176.
40. Goh, E. et al. Large Language Model Influence on Diagnostic Reasoning: A Randomized Clinical Trial. JAMA Netw. Open 7, e2440969 (2024).
41. Hager, P. et al. Evaluation and mitigation of the limitations of large language models in clinical decision-making. Nat. Med. 30, 2613–2622 (2024).
42. Shan, G. et al. Comparing Diagnostic Accuracy of Clinical Professionals and Large Language Models: Systematic Review and Meta-Analysis. JMIR Med. Inform. 13, e64963 (2025).
43. Griot, M., Hemptinne, C., Vanderdonckt, J. & Yuksel, D. Large Language Models lack essential metacognition for reliable medical reasoning. Nat. Commun. 16, 1–10 (2025).
44. Vasey, B. et al. Reporting guideline for the early-stage clinical evaluation of decision support systems driven by artificial intelligence: DECIDE-AI. Nat. Med. 28, 924–933 (2022).

Additional Declarations

No competing interests reported.
Supplementary Files: CLARITY1npjSupplements.docx
{"props":{"pageProps":{"initialData":{"identity":"rs-8307079","acceptedTermsAndConditions":true,"allowDirectSubmit":true,"archivedVersions":[],"articleType":"Article","associatedPublications":[],"authors":[{"id":559927758,"identity":"d4ffa864-08fd-44a3-8299-8d74876896d7","order_by":0,"name":"José Ferreira Santos","email":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAZAAAAAyAQMAAABI0h/eAAAABlBMVEX///8AAABVwtN+AAAACXBIWXMAAA7EAAAOxAGVKw4bAAAA3UlEQVRIiWNgGAWjYHCCBAYGGyjzA/Fa0oAUGwMD4wywADMxuqBamHmI0SLfwPDwc0FCnby5fPPhz7ZtdnkM0v0H8GoxOMCQLD0j4bDhzja2NOnctuRiBpnD+G0xkH+QIM374wDjhmM8Zsy5bQcSGySSCTos+TdPQp09UIvxZ0titDAcYEiT5klgTgRqMZBmJEYL0C9p1kC/JG84lpYm2XMuuZhN5rABAYfxJN8GhpjthsOHD3/4UWaXxy/d+ICAy4CuQuYmsEkQ0MDAwH4AVQsDYS2jYBSMglEwwgAAVt5CZ856w58AAAAASUVORK5CYII=","orcid":"","institution":"Católica Medical
School","correspondingAuthor":true,"prefix":"","firstName":"José","middleName":"Ferreira","lastName":"Santos","suffix":""},{"id":559927759,"identity":"30e84f53-125b-4daa-8ccb-8038e8bf599d","order_by":1,"name":"Regina Brito Duarte","email":"","orcid":"","institution":"Universidade de Lisboa","correspondingAuthor":false,"prefix":"","firstName":"Regina","middleName":"Brito","lastName":"Duarte","suffix":""},{"id":559927760,"identity":"8e7b0844-5186-4106-ab17-40838699cdc9","order_by":2,"name":"Inês Mota","email":"","orcid":"","institution":"Hospital da Luz Learning Health","correspondingAuthor":false,"prefix":"","firstName":"Inês","middleName":"","lastName":"Mota","suffix":""},{"id":559927761,"identity":"ee55281b-0df2-4b8a-8626-63548381da74","order_by":3,"name":"Rita Carvalheira Santos","email":"","orcid":"","institution":"Hospital da Luz Setúbal","correspondingAuthor":false,"prefix":"","firstName":"Rita","middleName":"Carvalheira","lastName":"Santos","suffix":""},{"id":559927762,"identity":"2a6274ae-c767-4c7f-b6f6-163273769b09","order_by":4,"name":"José Maria Moreira","email":"","orcid":"","institution":"Hospital da Luz Learning Health","correspondingAuthor":false,"prefix":"","firstName":"José","middleName":"Maria","lastName":"Moreira","suffix":""},{"id":559927763,"identity":"054f05a0-2f50-44f1-8444-71eb42efc3f2","order_by":5,"name":"Joana Campos","email":"","orcid":"","institution":"Hospital da Luz Setúbal","correspondingAuthor":false,"prefix":"","firstName":"Joana","middleName":"","lastName":"Campos","suffix":""},{"id":559927764,"identity":"5fff7f3f-c4ff-4fef-9232-7d811dc487c5","order_by":6,"name":"Nuno André Silva","email":"","orcid":"","institution":"Hospital da Luz Learning Health","correspondingAuthor":false,"prefix":"","firstName":"Nuno","middleName":"André","lastName":"Silva","suffix":""},{"id":559927765,"identity":"19f89243-5a71-43f7-855d-cedd062c472e","order_by":7,"name":"Bernardo Neves","email":"","orcid":"","institution":"Católica Medical 
School","correspondingAuthor":false,"prefix":"","firstName":"Bernardo","middleName":"","lastName":"Neves","suffix":""},{"id":559927766,"identity":"1e7b41b1-6c9e-4c41-a1ed-e0d35794302d","order_by":8,"name":"Ricardo Ladeiras-Lopes","email":"","orcid":"","institution":"University of Porto","correspondingAuthor":false,"prefix":"","firstName":"Ricardo","middleName":"","lastName":"Ladeiras-Lopes","suffix":""},{"id":559927767,"identity":"7f18be81-f59f-444f-8147-79c5fe347f3c","order_by":9,"name":"Francisca Leite","email":"","orcid":"","institution":"Católica Medical School","correspondingAuthor":false,"prefix":"","firstName":"Francisca","middleName":"","lastName":"Leite","suffix":""},{"id":559927768,"identity":"796cb68e-7700-4024-96d7-9da31395ab61","order_by":10,"name":"Helder Dores","email":"","orcid":"","institution":"Hospital da Luz Lisboa","correspondingAuthor":false,"prefix":"","firstName":"Helder","middleName":"","lastName":"Dores","suffix":""}],"badges":[],"createdAt":"2025-12-08 11:23:18","currentVersionCode":1,"declarations":"","doi":"10.21203/rs.3.rs-8307079/v1","doiUrl":"https://doi.org/10.21203/rs.3.rs-8307079/v1","draftVersion":[],"editorialEvents":[],"editorialNote":"","failedWorkflow":false,"files":[{"id":99316795,"identity":"c8031b57-29fc-4e1d-8d0d-ae0e8a871b1b","added_by":"auto","created_at":"2025-12-31 16:29:13","extension":"docx","order_by":0,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":936330,"visible":true,"origin":"","legend":"","description":"","filename":"CLARITY1npjManuscritpFile.docx","url":"https://assets-eu.researchsquare.com/files/rs-8307079/v1/3c7fa76f9fd8c0f1db1fb9ec.docx"},{"id":99194063,"identity":"a1d0c0c4-b284-4b82-9569-531e8817242e","added_by":"auto","created_at":"2025-12-30 
01:29:28","extension":"json","order_by":1,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":11423,"visible":true,"origin":"","legend":"","description":"","filename":"5256cb7228f743f189fc4fe3120fcff6.json","url":"https://assets-eu.researchsquare.com/files/rs-8307079/v1/97c6df5fb45713407db3d466.json"},{"id":99194067,"identity":"a2af0f21-0978-4ea5-8030-c0131c23b172","added_by":"auto","created_at":"2025-12-30 01:29:28","extension":"docx","order_by":2,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":1758580,"visible":true,"origin":"","legend":"","description":"","filename":"CLARITY1npjSupplements.docx","url":"https://assets-eu.researchsquare.com/files/rs-8307079/v1/956acfa91c03574c9be4ae58.docx"},{"id":99317437,"identity":"4c170333-1e17-484b-a388-a3b807ab9877","added_by":"auto","created_at":"2025-12-31 16:30:13","extension":"xml","order_by":3,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":184226,"visible":true,"origin":"","legend":"","description":"","filename":"5256cb7228f743f189fc4fe3120fcff61enriched.xml","url":"https://assets-eu.researchsquare.com/files/rs-8307079/v1/efa23ac2338fd5991a04df46.xml"},{"id":99194081,"identity":"4a94f382-f16d-401a-959f-7ab6435a147d","added_by":"auto","created_at":"2025-12-30 01:29:29","extension":"png","order_by":4,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":259981,"visible":true,"origin":"","legend":"","description":"","filename":"floatimage1.png","url":"https://assets-eu.researchsquare.com/files/rs-8307079/v1/e3cd5e491a25df13d5ab1157.png"},{"id":99317776,"identity":"3f0f21a2-ea81-4150-ae9e-c4a4008cc482","added_by":"auto","created_at":"2025-12-31 
16:30:42","extension":"png","order_by":5,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":179740,"visible":true,"origin":"","legend":"","description":"","filename":"floatimage2.png","url":"https://assets-eu.researchsquare.com/files/rs-8307079/v1/4568f103b12faed3341f6ed8.png"},{"id":99317451,"identity":"147081ae-ea68-4c5d-994e-dbce10cf2f50","added_by":"auto","created_at":"2025-12-31 16:30:14","extension":"png","order_by":6,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":190571,"visible":true,"origin":"","legend":"","description":"","filename":"floatimage3.png","url":"https://assets-eu.researchsquare.com/files/rs-8307079/v1/4318ec6fc1cddcaccf3030d0.png"},{"id":99316381,"identity":"22f61e0a-a308-427d-8db9-c28d6dd60698","added_by":"auto","created_at":"2025-12-31 16:28:21","extension":"png","order_by":7,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":189580,"visible":true,"origin":"","legend":"","description":"","filename":"floatimage4.png","url":"https://assets-eu.researchsquare.com/files/rs-8307079/v1/bdd01d85f19b89cf5578f539.png"},{"id":99194080,"identity":"e0027771-3b53-47bf-8da7-9d37001432a7","added_by":"auto","created_at":"2025-12-30 01:29:29","extension":"png","order_by":8,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":113281,"visible":true,"origin":"","legend":"","description":"","filename":"floatimage5.png","url":"https://assets-eu.researchsquare.com/files/rs-8307079/v1/1f00171894cf3b6310446245.png"},{"id":99316449,"identity":"563d74cc-c4fb-4ab2-947f-d8269d86fb82","added_by":"auto","created_at":"2025-12-31 
16:28:29","extension":"png","order_by":9,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":67707,"visible":true,"origin":"","legend":"","description":"","filename":"Onlinefloatimage1.png","url":"https://assets-eu.researchsquare.com/files/rs-8307079/v1/7478f37baacb9105cc443573.png"},{"id":99194070,"identity":"81045a74-98be-4e73-b0b4-0e9f204efbdc","added_by":"auto","created_at":"2025-12-30 01:29:28","extension":"png","order_by":10,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":32218,"visible":true,"origin":"","legend":"","description":"","filename":"Onlinefloatimage2.png","url":"https://assets-eu.researchsquare.com/files/rs-8307079/v1/0e21d4c352d02af942cbbc23.png"},{"id":99194084,"identity":"e5a8000f-ab6c-48fd-ba02-f7f7d01e04ad","added_by":"auto","created_at":"2025-12-30 01:29:29","extension":"png","order_by":11,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":37350,"visible":true,"origin":"","legend":"","description":"","filename":"Onlinefloatimage3.png","url":"https://assets-eu.researchsquare.com/files/rs-8307079/v1/82e77928b523929e91bef872.png"},{"id":99194073,"identity":"5e73569e-5584-4791-ae6f-7aa35a8806e0","added_by":"auto","created_at":"2025-12-30 01:29:28","extension":"png","order_by":12,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":75283,"visible":true,"origin":"","legend":"","description":"","filename":"Onlinefloatimage4.png","url":"https://assets-eu.researchsquare.com/files/rs-8307079/v1/c1c89baae0e1bfa48b1f741e.png"},{"id":99194079,"identity":"11c2efa3-2274-4e08-8031-c53aa096d15b","added_by":"auto","created_at":"2025-12-30 
01:29:29","extension":"png","order_by":13,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":28060,"visible":true,"origin":"","legend":"","description":"","filename":"Onlinefloatimage5.png","url":"https://assets-eu.researchsquare.com/files/rs-8307079/v1/d528e97fb03fac2e37863e4f.png"},{"id":99194075,"identity":"73e1e7c2-442a-4537-8916-34b6b40dcb71","added_by":"auto","created_at":"2025-12-30 01:29:28","extension":"xml","order_by":14,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":180475,"visible":true,"origin":"","legend":"","description":"","filename":"5256cb7228f743f189fc4fe3120fcff61structuring.xml","url":"https://assets-eu.researchsquare.com/files/rs-8307079/v1/dd8f68aebca86bdf00fa948f.xml"},{"id":99194077,"identity":"dd56ca06-b039-4b40-9b38-a168e467c2d2","added_by":"auto","created_at":"2025-12-30 01:29:28","extension":"html","order_by":15,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":198504,"visible":true,"origin":"","legend":"","description":"","filename":"earlyproof.html","url":"https://assets-eu.researchsquare.com/files/rs-8307079/v1/8ac3005239edb95f22d01bb3.html"},{"id":99194062,"identity":"76a934cc-e7c0-4f6f-900e-b137b36d274d","added_by":"auto","created_at":"2025-12-30 01:29:28","extension":"png","order_by":1,"title":"Figure 1","display":"","copyAsset":false,"role":"figure","size":275239,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eExtraction accuracy for traditional cardiovascular risk factors and risk modifiers by model\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eLegend. Extraction performance heatmaps.\u003c/strong\u003e (A) Per-factor F1-scores for the 12 traditional cardiovascular risk factors. (B) Per-factor F1-scores for the 12 predefined risk modifiers; the prevalence of each modifier in the reference dataset (n) is provided in parentheses. 
\u003cstrong\u003eData:\u003c/strong\u003e Values represent pooled F1-scores (harmonic mean of precision and recall) across Portuguese and English vignettes (N=60 per model); models are ordered from left to right by decreasing κw performance. \u003cstrong\u003eAbbreviations:\u003c/strong\u003e BP, blood pressure; COPD, chronic obstructive pulmonary disease; Lp(a), lipoprotein(a); hs-CRP, high-sensitivity C-reactive protein; ASCVD, atherosclerotic cardiovascular disease.\u003c/p\u003e","description":"","filename":"floatimage1.png","url":"https://assets-eu.researchsquare.com/files/rs-8307079/v1/4173c03cb939de9999d20005.png"},{"id":99316532,"identity":"e79c446c-9562-4bd6-96ba-1cce4521064b","added_by":"auto","created_at":"2025-12-31 16:28:34","extension":"png","order_by":2,"title":"Figure 2","display":"","copyAsset":false,"role":"figure","size":229177,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eModel-level accuracy and error direction in three-class ESC risk classification\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eLegend. Distribution of risk classification errors.\u003c/strong\u003e Stacked bars display the proportion of predictions classified as Correct (green), Minor Overestimation (1-level, orange), Minor Underestimation (1-level, blue), or Major Error (2-level, red). \u003cstrong\u003eData:\u003c/strong\u003e Percentages are calculated relative to valid predictions; unclassifiable (\"unknown\") outputs are excluded from the visualization. Models are ordered by quadratic-weighted kappa (κw) values, shown in parentheses. Results are pooled across languages.
\u003cstrong\u003eAbbreviations:\u003c/strong\u003e ESC, European Society of Cardiology.\u003c/p\u003e","description":"","filename":"floatimage2.png","url":"https://assets-eu.researchsquare.com/files/rs-8307079/v1/ea5ba5b99744b1944519416b.png"},{"id":99194065,"identity":"887b5ead-5378-46de-856c-75cd4c32ae77","added_by":"auto","created_at":"2025-12-30 01:29:28","extension":"png","order_by":3,"title":"Figure 3","display":"","copyAsset":false,"role":"figure","size":56036,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eConfusion matrices for GPT-family models (three-class ESC risk)\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eLegend:\u003c/strong\u003e \u003cstrong\u003eConfusion matrices for the GPT family.\u003c/strong\u003e Comparison of model predictions (columns) versus gold-standard categories (rows). \u003cstrong\u003eData:\u003c/strong\u003e Cell values represent counts pooled from Portuguese and English vignettes; color intensity is proportional to the count magnitude; sample sizes vary due to the exclusion of unclassifiable responses: GPT-4o (N=60), GPT-4.1 (N=60), GPT-5 (N=56), GPT-5 Nano (N=55).
\u003cstrong\u003eAbbreviations:\u003c/strong\u003e ESC, European Society of Cardiology.\u003c/p\u003e","description":"","filename":"floatimage3.png","url":"https://assets-eu.researchsquare.com/files/rs-8307079/v1/9b8b1b16876da83e3f06c42d.png"},{"id":99194069,"identity":"6e904b55-9527-4ba3-bbc6-216c606248d4","added_by":"auto","created_at":"2025-12-30 01:29:28","extension":"png","order_by":4,"title":"Figure 4","display":"","copyAsset":false,"role":"figure","size":88559,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eSCORE2 applicability: override safety (missed overrides) and usability (over-blocking) by model\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eLegend.\u003c/strong\u003e \u003cstrong\u003eSafety and usability in SCORE2 applicability.\u003c/strong\u003e Left Panel: Missed Overrides (Safety), defined as the false negative rate among vignettes requiring guideline exceptions (lower is better). Right Panel: Over-blocking (Usability), defined as the false positive rate among vignettes eligible for SCORE2 (lower is better). \u003cstrong\u003eData:\u003c/strong\u003e Black squares represent point estimates (%); horizontal lines represent 95% confidence intervals; Vertical dashed lines indicate the 5% (excellent safety) and 10% (concerning) thresholds. Models are ordered by κw. 
\u003cstrong\u003eAbbreviations: \u003c/strong\u003eSCORE2, Systematic Coronary Risk Estimation 2.\u003c/p\u003e","description":"","filename":"floatimage4.png","url":"https://assets-eu.researchsquare.com/files/rs-8307079/v1/23dd38c5138a17f90b1549ae.png"},{"id":99194076,"identity":"90348051-abfd-4b84-9590-4e457cf6372a","added_by":"auto","created_at":"2025-12-30 01:29:28","extension":"png","order_by":5,"title":"Figure 5","display":"","copyAsset":false,"role":"figure","size":48024,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eAgreement between clinicians and the gold standard (κw) for ESC three-class risk\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eLegend. Clinician benchmark performance.\u003c/strong\u003e Agreement between human evaluators and the Gold Standard for 3-class ESC risk categorization. \u003cstrong\u003eData:\u003c/strong\u003e Blue dots represent point estimates for individual physicians (N=30 Portuguese vignettes); horizontal bars represent 95% bootstrap confidence intervals. The purple triangle represents the majority-vote ensemble consensus. 
\u003cstrong\u003eAbbreviations:\u003c/strong\u003e κw, quadratic-weighted Cohen’s kappa; ESC, European Society of Cardiology.\u003c/p\u003e","description":"","filename":"floatimage5.png","url":"https://assets-eu.researchsquare.com/files/rs-8307079/v1/56c0d03d1904454a56741c42.png"},{"id":99788408,"identity":"3f87265d-18ca-41b2-a6ca-fc34bd784f20","added_by":"auto","created_at":"2026-01-08 12:46:37","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":2619101,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-8307079/v1/480af567-1d83-4434-be2e-aa072c6e2239.pdf"},{"id":99317601,"identity":"684809cd-224c-43c3-9e77-ec006e4ec672","added_by":"auto","created_at":"2025-12-31 16:30:28","extension":"docx","order_by":0,"title":"","display":"","copyAsset":false,"role":"supplement","size":1758580,"visible":true,"origin":"","legend":"","description":"","filename":"CLARITY1npjSupplements.docx","url":"https://assets-eu.researchsquare.com/files/rs-8307079/v1/aec69fc9992f8bedf3d9fa16.docx"}],"financialInterests":"No competing interests reported.","formattedTitle":"Benchmarking large language models for cardiovascular risk stratification using clinical vignettes","fulltext":[{"header":"1. Introduction","content":"\u003cp\u003eCardiovascular disease (CVD) remains the leading cause of death globally\u003csup\u003e\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e\u003c/sup\u003e. Because most atherosclerotic events are preventable through targeted management of modifiable risk factors, accurate and systematic risk assessment is a cornerstone of prevention\u003csup\u003e\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e,\u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e\u003c/sup\u003e. 
The 2021 European Society of Cardiology (ESC) Guidelines recommend Systematic Coronary Risk Estimation 2 (SCORE2) and SCORE2-OP to estimate 10-year risk of first-onset cardiovascular events and to guide treatment thresholds and shared decision-making\u003csup\u003e\u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e\u003c/sup\u003e.\u003c/p\u003e \u003cp\u003eDespite the availability of validated tools, their implementation in routine care is inconsistent. Surveys indicate that fewer than half of clinicians regularly use formal calculators, with many relying on unaided clinical judgement, an approach linked to systematic underestimation of risk, particularly among higher-risk patients\u003csup\u003e\u003cspan additionalcitationids=\"CR5\" citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e6\u003c/span\u003e\u003c/sup\u003e. Even when calculators are applied, treatment gaps persist, and a substantial proportion of high-risk individuals do not receive guideline-recommended treatments, contributing to avoidable morbidity and mortality\u003csup\u003e\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e,\u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e\u003c/sup\u003e.\u003c/p\u003e \u003cp\u003eA key contributor to this implementation gap is the structure and usability of electronic health records (EHR)\u003csup\u003e\u003cspan additionalcitationids=\"CR9\" citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e10\u003c/span\u003e\u003c/sup\u003e. Variables required for risk estimation are often buried in free-text notes or dispersed across poorly integrated sections, creating burden and cognitive load at the point of care.
Conventional natural language processing has shown that relevant information can be recovered from unstructured text, pointing to a practical route for workflow simplification\u003csup\u003e\u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e\u003c/sup\u003e.\u003c/p\u003e \u003cp\u003eLarge language models (LLMs) extend these capabilities with instruction-following and generative functions that support record summarization, clinical-note interpretation, and interactive assistance\u003csup\u003e\u003cspan additionalcitationids=\"CR13\" citationid=\"CR12\" class=\"CitationRef\"\u003e12\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e\u003c/sup\u003e. However, evidence for safe, accurate clinical decision support remains mixed: LLMs encode broad clinical knowledge and perform strongly on benchmarks, but show variable and inconsistent results on diagnostic reasoning\u003csup\u003e\u003cspan additionalcitationids=\"CR16\" citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e\u003c/sup\u003e. Within cardiovascular prevention specifically, applications are promising but remain insufficiently validated\u003csup\u003e\u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e\u003c/sup\u003e.\u003c/p\u003e \u003cp\u003e The convergence of persistent gaps in CVD prevention, namely suboptimal risk stratification and limited EHR utility, with rapid advances in LLMs creates a timely opportunity to evaluate the role of these systems as an aid to guideline-based risk assessment.
Accordingly, we benchmarked contemporary LLMs using simulated outpatient vignettes to assess (i) extraction of cardiovascular risk factors, including SCORE2 input variables, from routine-style clinical text and (ii) generation of ESC-aligned risk categories, each compared against expert-adjudicated reference standards. Our main objective was to delineate foundational performance and limitations as a prerequisite for integrating LLMs into clinical decision support systems aimed at optimizing CVD prevention.\u003c/p\u003e"},{"header":"2. Results","content":"\u003cp\u003e \u003cb\u003eClinical Vignettes\u003c/b\u003e \u003c/p\u003e \u003cp\u003eThirty simulated outpatient clinical vignettes were evaluated (Supplementary Appendix S1), generating 60 assessments per model (30 Portuguese and 30 English). The vignettes represented a middle-aged population (mean age 54.6\u0026thinsp;\u0026plusmn;\u0026thinsp;8.8 years, 50% male) with prevalent cardiovascular risk factors including hypertension (60%), dyslipidemia (70%), and active smoking (37%) (Table\u0026nbsp;\u003cspan refid=\"Tab1\" class=\"InternalRef\"\u003e1\u003c/span\u003e). More than half (53%) included additional risk modifiers: one (23%), two (20%), or three or more (10%). Twenty vignettes (66.7%) met standard SCORE2 eligibility criteria, while 10 (33.3%) required ESC guideline exceptions for risk stratification, including atherosclerotic cardiovascular disease (ASCVD; n\u0026thinsp;=\u0026thinsp;4), diabetes mellitus (n\u0026thinsp;=\u0026thinsp;3), chronic kidney disease (CKD; n\u0026thinsp;=\u0026thinsp;2), and familial hypercholesterolemia (FH; n\u0026thinsp;=\u0026thinsp;1).
Following expert adjudication, the gold-standard risk distribution comprised 5 (16.7%) low-to-moderate risk, 13 (43.3%) high risk, and 12 (40.0%) very-high risk vignettes, representing a higher-risk profile than initially expected (50%, 30%, and 20%, respectively) (Supplementary Figures \u003cspan refid=\"MOESM1\" class=\"InternalRef\"\u003eS1\u003c/span\u003e and S2).\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab1\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 1\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eBaseline clinical profile of simulated outpatient vignettes\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"2\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colspan=\"2\" nameend=\"c2\" namest=\"c1\"\u003e \u003cp\u003eDemographics\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eAge, years*\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e54.6\u0026thinsp;\u0026plusmn;\u0026thinsp;8.8 [36\u0026ndash;67]\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eMale sex\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e15 (50.0%)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colspan=\"2\" nameend=\"c2\" namest=\"c1\"\u003e \u003cp\u003e\u003cb\u003eCardiovascular Risk Profile\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eBlood Pressure, 
mmHg\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e\u0026nbsp;\u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eSystolic blood pressure\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e135.3\u0026thinsp;\u0026plusmn;\u0026thinsp;17.5 [107\u0026ndash;189]\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eDiastolic blood pressure\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e79.9\u0026thinsp;\u0026plusmn;\u0026thinsp;13.3 [58\u0026ndash;107]\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colspan=\"2\" nameend=\"c2\" namest=\"c1\"\u003e \u003cp\u003eLipids, mg/dL\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eTotal cholesterol\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e195.7\u0026thinsp;\u0026plusmn;\u0026thinsp;45.3 [129\u0026ndash;336]\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eHDL cholesterol\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e47.9\u0026thinsp;\u0026plusmn;\u0026thinsp;9.1 [32\u0026ndash;67]\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eNon-HDL cholesterol\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e147.8\u0026thinsp;\u0026plusmn;\u0026thinsp;46.2 [79\u0026ndash;219]\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eLDL cholesterol\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e119.1\u0026thinsp;\u0026plusmn;\u0026thinsp;46.5 
[49\u0026ndash;258]\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eTriglycerides\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e137.8\u0026thinsp;\u0026plusmn;\u0026thinsp;50.5 [62\u0026ndash;256]\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colspan=\"2\" nameend=\"c2\" namest=\"c1\"\u003e \u003cp\u003eRisk factors\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eHypertension\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e18 (60.0%)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eDyslipidemia\u003csup\u003e#\u003c/sup\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e21 (70.0%)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colspan=\"2\" nameend=\"c2\" namest=\"c1\"\u003e \u003cp\u003eSmoking status\u003csup\u003e#\u003c/sup\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eCurrent\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e11 (36.7%)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eFormer\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e7 (23.3%)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eNever\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e10 (33.3%)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colspan=\"2\" nameend=\"c2\" namest=\"c1\"\u003e \u003cp\u003ePredefined risk 
modifiers (any occurrence)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eCancer\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e2 (6.7%)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eChronic inflammatory disease\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e1 (3.3%)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eChronic obstructive pulmonary disease\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e1 (3.3%)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eCoronary calcium score\u0026thinsp;=\u0026thinsp;0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e1 (3.3%)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eElevated coronary calcium score\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e4 (13.3%)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eFamily history of premature ASCVD\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e1 (3.3%)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003ehs-CRP\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e1 (3.3%)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eIncreased arterial stiffness\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e1 (3.3%)\u003c/p\u003e 
\u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eLp(a) elevated\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e1 (3.3%)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eObesity\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e2 (6.7%)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eObstructive sleep apnoea\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e3 (10.0%)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003ePre-diabetes\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e11 (36.7%)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colspan=\"2\" nameend=\"c2\" namest=\"c1\"\u003e \u003cp\u003eNumber of risk modifiers per vignette\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eNone\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e14 (46.7%)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eOne\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e7 (23.3%)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eTwo\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e6 (20.0%)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eThree or more\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e3 
(10.0%)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colspan=\"2\" nameend=\"c2\" namest=\"c1\"\u003e \u003cp\u003e\u003cb\u003eSCORE2 applicability\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eEligible (no exception)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e20 (66.7%)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eSCORE2 exceptions for risk stratification\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e10 (33.3%)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003ePrior ASCVD\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e4 (13.3%)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eDiabetes\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e3 (10.0%)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eChronic kidney disease\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e2 (6.7%)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eFamilial hypercholesterolemia\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e1 (3.3%)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003eESC risk category\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e\u0026nbsp;\u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e 
\u003cp\u003eLow-to-Moderate\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e5 (16.7%)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eHigh\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e13 (43.3%)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eVery High\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e12 (40.0%)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003ctfoot\u003e \u003ctr\u003e\u003ctd colspan=\"2\"\u003e\u003cb\u003eLegend. Data presentation\u003c/b\u003e: Values are reported as mean\u0026thinsp;\u0026plusmn;\u0026thinsp;standard deviation [range] for continuous variables and n (%) for categorical variables. Percentages are calculated based on N\u0026thinsp;=\u0026thinsp;30 vignettes unless otherwise noted. \u003cb\u003eDefinitions\u003c/b\u003e: \u0026ldquo;SCORE2 applicability\u0026rdquo; distinguishes vignettes eligible for standard calculation from those requiring guideline-based exceptions. *Age includes one 36-year-old patient with familial hypercholesterolemia; all other vignettes represent patients aged\u0026thinsp;\u0026ge;\u0026thinsp;40 years. \u003csup\u003e#\u003c/sup\u003eDenotes variables with missing data in the source vignettes (unknown dyslipidemia: n\u0026thinsp;=\u0026thinsp;2; unknown smoking status: n\u0026thinsp;=\u0026thinsp;2). 
\u003cb\u003eAbbreviations\u003c/b\u003e: ASCVD, atherosclerotic cardiovascular disease; HDL, high-density lipoprotein; LDL, low-density lipoprotein; Lp(a), lipoprotein(a); hs-CRP, high-sensitivity C-reactive protein.\u003c/td\u003e\u003c/tr\u003e \u003c/tfoot\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003e \u003cb\u003eCardiovascular Risk Factor Extraction Performance\u003c/b\u003e \u003c/p\u003e \u003cp\u003eAll eleven evaluated models showed excellent performance in extracting traditional cardiovascular risk factors, with micro-F1 scores ranging from 0.97 to 0.99 across the 60 vignette evaluations per model (30 Portuguese and 30 English; Table\u0026nbsp;\u003cspan refid=\"Tab2\" class=\"InternalRef\"\u003e2\u003c/span\u003e and Supplementary Table \u003cspan refid=\"MOESM1\" class=\"InternalRef\"\u003eS1\u003c/span\u003e). Claude Opus 4.1, Claude Sonnet 4.5, Gemini 2.5 Pro, and GPT-5 Nano achieved the highest overall micro-F1 score of 0.99 (95% CI: 0.98\u0026ndash;0.99). Micro-precision ranged from 0.97 to 0.99, and micro-recall from 0.95 to 0.99 across models, indicating both high positive predictive value and high sensitivity in identifying cardiovascular risk factors (Supplementary Figure S3). The mean Jaccard similarity coefficient ranged from 0.93 to 0.98 across models, reflecting substantial overlap between model-extracted and reference factor sets. 
Micro- and macro-averaged metrics did not differ substantially for cardiovascular risk factor extraction (Supplementary Figure S4).\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab2\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 2\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eExtraction performance by model: cardiovascular risk factors, SCORE2 core risk factors, and risk modifiers\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"9\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c6\" colnum=\"6\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c7\" colnum=\"7\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c8\" colnum=\"8\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c9\" colnum=\"9\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\" morerows=\"1\" rowspan=\"2\"\u003e \u003cp\u003eModel\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colspan=\"5\" nameend=\"c6\" namest=\"c2\"\u003e \u003cp\u003eCardiovascular Risk Factors\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colspan=\"2\" nameend=\"c8\" namest=\"c7\"\u003e \u003cp\u003eSCORE2 Core\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c9\"\u003e \u003cp\u003eRisk Modifiers\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e 
\u003ctr\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eMicro-F1\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eMicro-P\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eMicro-R\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c5\"\u003e \u003cp\u003eMacro-F1\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colspan=\"2\" nameend=\"c7\" namest=\"c6\"\u003e \u003cp\u003eJaccard\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c8\"\u003e \u003cp\u003eMicro-F1\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c9\"\u003e \u003cp\u003eMicro-F1\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eGPT-4o\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e0.98 (0.98\u0026ndash;0.99)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e0.98 (0.96\u0026ndash;0.99)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e0.99 (0.98\u0026ndash;0.99)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e0.98 (0.95\u0026ndash;1.00)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colspan=\"2\" nameend=\"c7\" namest=\"c6\"\u003e \u003cp\u003e0.97 (0.95\u0026ndash;0.98)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003e0.99 (0.98\u0026ndash;1.00)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c9\"\u003e \u003cp\u003e0.74 (0.61\u0026ndash;0.84)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eGPT-5\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e0.98 (0.97\u0026ndash;0.99)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" 
colname=\"c3\"\u003e \u003cp\u003e0.99 (0.98\u0026ndash;1.00)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e0.97 (0.96\u0026ndash;0.98)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e0.98 (0.95\u0026ndash;0.99)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colspan=\"2\" nameend=\"c7\" namest=\"c6\"\u003e \u003cp\u003e0.97 (0.95\u0026ndash;0.98)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003e0.98 (0.97\u0026ndash;0.99)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c9\"\u003e \u003cp\u003e0.77 (0.66\u0026ndash;0.87)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eGPT-4.1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e0.98 (0.97\u0026ndash;0.99)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e0.98 (0.97\u0026ndash;0.99)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e0.97 (0.96\u0026ndash;0.98)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e0.97 (0.94\u0026ndash;0.99)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colspan=\"2\" nameend=\"c7\" namest=\"c6\"\u003e \u003cp\u003e0.96 (0.94\u0026ndash;0.97)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003e0.99 (0.98\u0026ndash;0.99)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c9\"\u003e \u003cp\u003e0.80 (0.68\u0026ndash;0.90)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eClaude Sonnet 4.5\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e0.99 (0.98\u0026ndash;0.99)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e0.98 
(0.97\u0026ndash;0.99)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e0.99 (0.98\u0026ndash;1.00)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e0.98 (0.96\u0026ndash;1.00)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colspan=\"2\" nameend=\"c7\" namest=\"c6\"\u003e \u003cp\u003e0.97 (0.96\u0026ndash;0.98)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003e0.99 (0.99\u0026ndash;1.00)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c9\"\u003e \u003cp\u003e0.58 (0.45\u0026ndash;0.68)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eGemini 2.5 Pro\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e0.99 (0.98\u0026ndash;0.99)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e0.98 (0.97\u0026ndash;0.99)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e0.99 (0.98\u0026ndash;1.00)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e0.99 (0.96\u0026ndash;1.00)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colspan=\"2\" nameend=\"c7\" namest=\"c6\"\u003e \u003cp\u003e0.98 (0.96\u0026ndash;0.99)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003e0.99 (0.98\u0026ndash;0.99)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c9\"\u003e \u003cp\u003e0.81 (0.70\u0026ndash;0.90)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eClaude Opus 4.1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e0.99 (0.98\u0026ndash;0.99)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e0.99 (0.97\u0026ndash;0.99)\u003c/p\u003e 
\u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e0.99 (0.98\u0026ndash;0.99)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e0.99 (0.96\u0026ndash;1.00)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colspan=\"2\" nameend=\"c7\" namest=\"c6\"\u003e \u003cp\u003e0.98 (0.96\u0026ndash;0.99)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003e0.99 (0.98\u0026ndash;0.99)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c9\"\u003e \u003cp\u003e0.64 (0.47\u0026ndash;0.77)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eDeepSeek V3\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e0.98 (0.97\u0026ndash;0.98)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e0.97 (0.95\u0026ndash;0.98)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e0.99 (0.98\u0026ndash;0.99)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e0.97 (0.94\u0026ndash;1.00)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colspan=\"2\" nameend=\"c7\" namest=\"c6\"\u003e \u003cp\u003e0.96 (0.94\u0026ndash;0.97)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003e0.99 (0.98\u0026ndash;0.99)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c9\"\u003e \u003cp\u003e0.67 (0.52\u0026ndash;0.78)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eGemini 2.0 Flash\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e0.97 (0.95\u0026ndash;0.97)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e0.98 (0.96\u0026ndash;0.99)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" 
colname=\"c4\"\u003e \u003cp\u003e0.95 (0.93\u0026ndash;0.97)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e0.96 (0.93\u0026ndash;0.99)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colspan=\"2\" nameend=\"c7\" namest=\"c6\"\u003e \u003cp\u003e0.93 (0.92\u0026ndash;0.95)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003e0.99 (0.98\u0026ndash;0.99)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c9\"\u003e \u003cp\u003e0.67 (0.54\u0026ndash;0.77)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eGrok-3\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e0.98 (0.97\u0026ndash;0.99)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e0.98 (0.97\u0026ndash;0.99)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e0.98 (0.97\u0026ndash;0.99)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e0.98 (0.96\u0026ndash;1.00)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colspan=\"2\" nameend=\"c7\" namest=\"c6\"\u003e \u003cp\u003e0.97 (0.95\u0026ndash;0.98)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003e0.98 (0.97\u0026ndash;0.99)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c9\"\u003e \u003cp\u003e0.60 (0.43\u0026ndash;0.73)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eLlama 3.3 70B Instruct\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e0.98 (0.97\u0026ndash;0.99)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e0.98 (0.97\u0026ndash;0.99)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e0.98 
(0.97\u0026ndash;0.99)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e0.97 (0.94\u0026ndash;1.00)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colspan=\"2\" nameend=\"c7\" namest=\"c6\"\u003e \u003cp\u003e0.96 (0.95\u0026ndash;0.98)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003e0.98 (0.97\u0026ndash;0.99)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c9\"\u003e \u003cp\u003e0.67 (0.52\u0026ndash;0.78)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eGPT-5 Nano\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e0.99 (0.98\u0026ndash;0.99)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e0.98 (0.97\u0026ndash;0.99)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e0.99 (0.97\u0026ndash;0.99)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e0.98 (0.96\u0026ndash;1.00)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colspan=\"2\" nameend=\"c7\" namest=\"c6\"\u003e \u003cp\u003e0.97 (0.96\u0026ndash;0.98)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003e0.99 (0.98\u0026ndash;1.00)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c9\"\u003e \u003cp\u003e0.82 (0.70\u0026ndash;0.90)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003ctfoot\u003e \u003ctr\u003e\u003ctd colspan=\"9\"\u003e\u003cb\u003eLegend. Metrics\u003c/b\u003e: Data represent micro-averaged F1-scores, precision, and recall, macro-averaged F1-scores, and Jaccard similarity coefficients, presented with 95% confidence intervals in parentheses. 
\u003cb\u003eSample size\u003c/b\u003e: Each model evaluation includes N\u0026thinsp;=\u0026thinsp;60 predictions (pooled 30 Portuguese and 30 English). \u003cb\u003eDefinitions\u003c/b\u003e: \"Cardiovascular Risk Factors\" comprises the 12-item traditional panel (6 SCORE2 core inputs\u0026thinsp;+\u0026thinsp;6 additional factors); \"Risk Modifiers\" comprises the 12 predefined guideline modifiers. Models are ordered by quadratic-weighted kappa (κw) from the primary three-class risk analysis. \u003cb\u003eAbbreviations\u003c/b\u003e: CI, confidence interval; SCORE2, Systematic Coronary Risk Estimation 2.\u003c/td\u003e\u003c/tr\u003e \u003c/tfoot\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003eFor SCORE2 core risk factors specifically, extraction accuracy was uniformly high across all models. Micro-F1 scores ranged from 0.98 to 0.99, with eight of the eleven models achieving 0.99 (Table\u0026nbsp;\u003cspan refid=\"Tab2\" class=\"InternalRef\"\u003e2\u003c/span\u003e and Supplementary Table S2). Individual factor analysis revealed near-perfect extraction for age, sex, and lipid parameters, while smoking status showed slightly more variability (F1: 0.94, 95% CI: 0.91\u0026ndash;0.96). Among the additional traditional factors beyond SCORE2 requirements, dyslipidemia diagnosis proved the most challenging to extract (F1: 0.88, 95% CI: 0.86\u0026ndash;0.90), while continuous variables such as triglycerides and diastolic blood pressure were extracted with near-perfect accuracy (Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e and Supplementary Table S3).\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eRisk modifier extraction presented greater challenges, with substantially lower and more variable performance than traditional risk factors (Supplementary Table S4). Micro-F1 scores ranged from 0.58 to 0.82, with GPT-5 Nano (0.82), Gemini 2.5 Pro (0.81), and GPT-4.1 (0.80) achieving the highest scores (Table\u0026nbsp;\u003cspan refid=\"Tab2\" class=\"InternalRef\"\u003e2\u003c/span\u003e). 
Claude Sonnet 4.5 demonstrated the lowest performance at 0.58 (95% CI: 0.45\u0026ndash;0.68). Coronary calcium score, cancer, elevated high-sensitivity C-reactive protein, chronic obstructive pulmonary disease, and obstructive sleep apnoea were most accurately identified across models (F1\u0026thinsp;\u0026gt;\u0026thinsp;0.80), while family history of premature ASCVD (F1: 0.32, 95% CI: 0.21\u0026ndash;0.42), elevated lipoprotein(a) (F1: 0.43, 95% CI: 0.26\u0026ndash;0.60) and obesity (F1: 0.45, 95% CI: 0.36\u0026ndash;0.53) proved most difficult to extract consistently. The substantial difference in precision (range: 0.19\u0026ndash;0.98) versus recall (range: 0.50\u0026ndash;1.00) for risk modifiers suggested that models tended toward higher sensitivity at the expense of precision when identifying these less common cardiovascular risk determinants (Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e and Supplementary Table S5).\u003c/p\u003e \u003cp\u003e \u003cb\u003eESC risk classification\u003c/b\u003e \u003c/p\u003e \u003cp\u003eEight models completed the three-class cardiovascular risk classification task for all 60 vignettes (30 Portuguese and 30 English), while three models failed to return risk classifications for some vignettes. GPT-5 did not calculate a risk category for four vignettes (two in Portuguese and two in English) because smoking status was unknown; the Cardiovascular Risk Adjudication Committee had assumed no smoking history for these cases. Llama 3.3 70B Instruct and GPT-5 Nano did not provide risk classifications for one vignette (English) and five vignettes (four Portuguese, one English), respectively, and gave no reason for these omissions.\u003c/p\u003e \u003cp\u003eLLMs were able to classify patients into ESC cardiovascular risk categories, though performance varied substantially across models (Table\u0026nbsp;\u003cspan refid=\"Tab3\" class=\"InternalRef\"\u003e3\u003c/span\u003e and Supplementary Figure S5). 
Using quadratic-weighted Cohen's kappa (κw) as the primary agreement metric, GPT-4o achieved the highest concordance with gold-standard classifications (κw\u0026thinsp;=\u0026thinsp;0.69, 95% CI: 0.44\u0026ndash;0.84), followed by GPT-5 (κw\u0026thinsp;=\u0026thinsp;0.68, 95% CI: 0.48\u0026ndash;0.83) and GPT-4.1 (κw\u0026thinsp;=\u0026thinsp;0.65, 95% CI: 0.44\u0026ndash;0.80). These top-performing models demonstrated moderate to substantial agreement with expert adjudication. In contrast, models at the lower performance tier showed only fair to moderate agreement, with GPT-5 Nano exhibiting the lowest concordance (κw\u0026thinsp;=\u0026thinsp;0.40, 95% CI: 0.15\u0026ndash;0.63). Overall classification accuracy ranged from 42% (GPT-5 Nano) to 67% (GPT-4o). Accuracy did not track agreement strength consistently: for example, Claude Sonnet 4.5 had higher accuracy than GPT-5 (63% vs 58%) despite a lower κw (0.61 vs 0.68), indicating different error patterns across models.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab3\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 3\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eThree-class ESC risk classification: model performance summary (ordered by κw)\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"8\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c6\" colnum=\"6\"\u003e\u003c/div\u003e \u003cdiv 
align=\"left\" class=\"colspec\" colname=\"c7\" colnum=\"7\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c8\" colnum=\"8\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\" morerows=\"1\" rowspan=\"2\"\u003e \u003cp\u003eModel\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\" morerows=\"1\" rowspan=\"2\"\u003e \u003cp\u003eN / Unknown (%)\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\" morerows=\"1\" rowspan=\"2\"\u003e \u003cp\u003eAgreement κw (95% CI)\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\" morerows=\"1\" rowspan=\"2\"\u003e \u003cp\u003eAccuracy (95% CI)\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colspan=\"4\" nameend=\"c8\" namest=\"c5\"\u003e \u003cp\u003eErrors\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c5\"\u003e \u003cp\u003e\u003cb\u003eMinor (%)\u003c/b\u003e\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c6\"\u003e \u003cp\u003e\u003cb\u003eMajor (%)\u003c/b\u003e\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c7\"\u003e \u003cp\u003e\u003cb\u003eOver (%)\u003c/b\u003e\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c8\"\u003e \u003cp\u003e\u003cb\u003eUnder (%)\u003c/b\u003e\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eGPT-4o\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e60 / 0%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e0.69 (0.44\u0026ndash;0.84)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e0.67 (0.52\u0026ndash;0.82)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e33.3\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" 
colname=\"c6\"\u003e \u003cp\u003e0.0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e8.3\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003e25.0\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eGPT-5\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e56 / 6.7%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e0.68 (0.48\u0026ndash;0.83)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e0.58 (0.42\u0026ndash;0.75)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e37.5\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e0.0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e3.6\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003e33.9\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eGPT-4.1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e60 / 0%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e0.65 (0.44\u0026ndash;0.80)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e0.62 (0.45\u0026ndash;0.77)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e38.3\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e0.0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e3.3\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003e35.0\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eClaude 
Sonnet 4.5\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e60 / 0%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e0.61 (0.25\u0026ndash;0.80)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e0.63 (0.45\u0026ndash;0.78)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e35.0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e1.7\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e26.7\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003e10.0\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eGemini 2.5 Pro\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e60 / 0%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e0.58 (0.28\u0026ndash;0.77)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e0.57 (0.42\u0026ndash;0.73)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e43.3\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e0.0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e20.0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003e23.3\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eClaude Opus 4.1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e60 / 0%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e0.57 (0.28\u0026ndash;0.79)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e0.60 
(0.43\u0026ndash;0.77)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e36.7\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e3.3\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e10.0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003e30.0\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eDeepSeek V3\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e60 / 0%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e0.56 (0.25\u0026ndash;0.73)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e0.52 (0.35\u0026ndash;0.67)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e48.3\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e0.0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e18.3\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003e30.0\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eGemini 2.0 Flash\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e60 / 0%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e0.52 (0.29\u0026ndash;0.70)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e0.55 (0.40\u0026ndash;0.70)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e41.7\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e3.3\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e5.0\u003c/p\u003e \u003c/td\u003e 
\u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003e40.0\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eGrok-3\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e60 / 0%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e0.45 (0.16\u0026ndash;0.72)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e0.53 (0.37\u0026ndash;0.70)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e40.0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e6.7\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e10.0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003e36.7\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eLlama 3.3 70B Instruct\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e59 / 1.7%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e0.45 (0.18\u0026ndash;0.68)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e0.58 (0.42\u0026ndash;0.73)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e32.2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e8.5\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e3.4\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003e37.3\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eGPT-5 Nano\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e55 / 8.3%\u003c/p\u003e \u003c/td\u003e \u003ctd 
align=\"left\" colname=\"c3\"\u003e \u003cp\u003e0.40 (0.15\u0026ndash;0.63)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e0.42 (0.25\u0026ndash;0.58)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e41.8\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e12.7\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e7.3\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003e47.3\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003ctfoot\u003e \u003ctr\u003e\u003ctd colspan=\"8\"\u003e\u003cb\u003eLegend. Outcomes\u003c/b\u003e: \"Agreement (κw)\" and \"Errors\" are calculated using valid ordinal predictions only; \"Accuracy\" is calculated against the total N, penalizing unclassifiable responses. \u003cb\u003eDefinitions\u003c/b\u003e: \"N\" denotes the total evaluated predictions (30 Portuguese\u0026thinsp;+\u0026thinsp;30 English); \"Unknown (%)\" represents the proportion of vignettes where the model failed to return a classification or output \"unknown.\"; \"Minor Error\" indicates misclassification by one adjacent category; \"Major Error\" indicates misclassification by two categories (Low-to-Moderate \u0026harr; Very High); \"Over/Under\" refers to the direction of minor errors; Models are ordered by quadratic-weighted kappa (κw) from the primary three-class risk analysis. 
\u003cb\u003eAbbreviations\u003c/b\u003e: CI, confidence interval; ESC, European Society of Cardiology; κw, quadratic-weighted kappa.\u003c/td\u003e\u003c/tr\u003e \u003c/tfoot\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003eAnalysis of misclassification patterns revealed important safety considerations for clinical deployment in the current setting (Figs.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003e and \u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003e, Supplementary Figure S6). Major two-category errors, the most clinically significant misclassifications, in which patients were shifted between the low-to-moderate and very high-risk categories, were rare but occurred in six models, with GPT-5 nano showing the highest rate (12.7%). Five models (GPT-4o, GPT-5, GPT-4.1, Gemini 2.5 Pro, and DeepSeek V3) made no major errors. Ten of the eleven models underestimated risk more often than they overestimated it; only Claude Sonnet 4.5 showed the opposite pattern.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eSensitivity for identifying high and very high-risk patients, a critical metric for ensuring that these individuals receive appropriate intensive interventions, varied markedly across models (Supplementary Table S6). Because 83% of the evaluation cohort was high or very high risk according to the gold-standard assessment, sensitivity was of paramount clinical importance. Claude Sonnet 4.5 achieved perfect sensitivity (100%, 95% CI: 92.9\u0026ndash;100%), though with reduced specificity (80%, 95% CI: 49.0\u0026ndash;94.3%). Seven models achieved perfect specificity (100%, 95% CI: 72.2\u0026ndash;100%), with variable sensitivity (ranging from 52% to 92%). 
The top-performing model in primary risk stratification, GPT-4o, had a sensitivity of 92% and a specificity of 100% for detecting high and very high-risk categories.\u003c/p\u003e \u003cp\u003e \u003cb\u003eSCORE2 Numeric Agreement\u003c/b\u003e \u003c/p\u003e \u003cp\u003eAmong vignettes where SCORE2 risk calculation was applicable, numeric agreement between model-predicted 10-year cardiovascular risk and the gold standard varied substantially across models (Table\u0026nbsp;\u003cspan refid=\"Tab4\" class=\"InternalRef\"\u003e4\u003c/span\u003e, Supplementary Table S7 and Figure S7). Only three models (GPT-4o, GPT-4.1, and Grok-3) computed absolute risk in all 40 eligible vignettes; the remaining models returned a numeric risk in only 18 to 38 cases, reflecting variable success in generating numeric outputs.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab4\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 4\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eNumeric agreement with SCORE2 (applicable cases only)\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"5\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eModel\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eN\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eMAE (95% CI)\u003c/p\u003e 
\u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eBias (95% CI)\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c5\"\u003e \u003cp\u003eCCC (95% CI)\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eGPT-4o\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e40\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e7.93 (2.04\u0026ndash;16.28)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e4.98 (-1.35\u0026ndash;13.80)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e-0.13 (-0.18\u0026ndash;0.63)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eGPT-5\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e33\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e16.26 (2.23\u0026ndash;33.85)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e13.14 (-1.57\u0026ndash;31.29)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e-0.08 (-0.12\u0026ndash;0.52)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eGPT-4.1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e40\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e9.04 (2.20\u0026ndash;19.15)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e5.24 (-2.23\u0026ndash;15.74)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e-0.12 (-0.17\u0026ndash;0.64)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e 
\u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eClaude Sonnet 4.5\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e18\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e20.14 (0.88\u0026ndash;44.86)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e18.28 (-1.40\u0026ndash;43.57)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e-0.08 (-0.13\u0026ndash;0.91)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eGemini 2.5 Pro\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e27\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e7.78 (1.10\u0026ndash;17.28)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e5.55 (-1.40\u0026ndash;15.59)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e-0.08 (-0.12\u0026ndash;0.83)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eClaude Opus 4.1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e29\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e8.43 (1.09\u0026ndash;22.07)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e5.52 (-2.23\u0026ndash;19.74)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e-0.08 (-0.11\u0026ndash;0.86)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eDeepSeek V3\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e36\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e7.81 
(2.18\u0026ndash;15.59)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e3.30 (-2.70\u0026ndash;11.53)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e-0.09 (-0.11\u0026ndash;0.49)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eGemini 2.0 Flash\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e38\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e13.28 (5.82\u0026ndash;22.14)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e7.31 (-1.10\u0026ndash;17.24)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e-0.08 (-0.11\u0026ndash;-0.05)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eGrok-3\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e40\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e6.81 (1.73\u0026ndash;13.90)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e3.46 (-2.04\u0026ndash;11.10)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e-0.09 (-0.11\u0026ndash;0.56)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eLlama 3.3 70B Instruct\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e31\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e8.68 (2.61\u0026ndash;17.06)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e4.36 (-2.49\u0026ndash;13.95)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e-0.14 
(-0.26\u0026ndash;0.53)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eGPT-5 Nano\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e31\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e3.75 (1.92\u0026ndash;6.78)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e-0.47 (-2.65\u0026ndash;3.13)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e-0.05 (-0.15\u0026ndash;0.63)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003ctfoot\u003e \u003ctr\u003e\u003ctd colspan=\"5\"\u003e\u003cb\u003eLegend. Sample determination\u003c/b\u003e: Analysis is restricted to the subset of vignettes (\"N\") where both the Gold Standard was applicable, and the model successfully generated a numeric risk value. \u003cb\u003eMetrics\u003c/b\u003e: Values represent percentage points of 10-year cardiovascular risk; \"Bias\" is calculated as Mean(Model)\u0026thinsp;\u0026minus;\u0026thinsp;Mean(Gold Standard), where positive values indicate overestimation; \"CCC\" denotes Lin\u0026rsquo;s Concordance Correlation Coefficient (\u0026minus;\u0026thinsp;1 to 1); Models are ordered by κw from the primary risk analysis. \u003cb\u003eAbbreviations\u003c/b\u003e: CI, confidence interval; MAE, mean absolute error; SCORE2, Systematic Coronary Risk Estimation 2.\u003c/td\u003e\u003c/tr\u003e \u003c/tfoot\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003eMean absolute error (MAE) ranged from 3.8 percentage points for GPT-5 nano (95% CI: 1.92\u0026ndash;6.78; n\u0026thinsp;=\u0026thinsp;31) to 20.1 for Claude Sonnet 4.5 (95% CI: 0.88\u0026ndash;44.86; n\u0026thinsp;=\u0026thinsp;18). 
All models except GPT-5 nano had an MAE exceeding 5 percentage points, a clinically significant threshold at which patients could be misclassified across risk categories. In contrast to the categorical risk classification results, where most models tended to underestimate risk, Bland-Altman analyses revealed systematic overestimation in 10 of 11 models, with wide limits of agreement. Only GPT-5 nano showed approximately unbiased predictions (Supplementary Figure S7).\u003c/p\u003e \u003cp\u003eLin's concordance correlation coefficients were uniformly poor (point estimates ranging from -0.14 to -0.05), with every confidence interval including negative values, indicating weak absolute agreement with the reference SCORE2 values.\u003c/p\u003e \u003cp\u003e \u003cb\u003eSCORE2 Applicability\u003c/b\u003e \u003c/p\u003e \u003cp\u003eModels demonstrated variable performance in identifying patients in whom SCORE2 should not be applied because of established CVD, diabetes, CKD, or FH (Table\u0026nbsp;\u003cspan refid=\"Tab5\" class=\"InternalRef\"\u003e5\u003c/span\u003e, Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003e, Supplementary Table S8). GPT-4.1 and Grok-3 achieved perfect classification (F1 1.00, accuracy 1.00), with no false positives or false negatives. Most models maintained missed-override rates below our prespecified 5% safety threshold, so high-risk patients requiring alternative management were rarely misclassified as SCORE2-eligible. The notable exception was Gemini 2.0 Flash, with a 15.0% missed-override rate (95% CI 0.0\u0026ndash;38.5%), although the wide confidence interval reflects considerable uncertainty. 
Over-blocking rates, indicating inappropriate withholding of SCORE2 in eligible patients, showed substantial variation from 0.0% (GPT-4o, GPT-4.1, Grok-3, GPT-5 nano) to 55.0% (95% CI 34.1\u0026ndash;75.0) for Claude Sonnet 4.5, revealing a critical trade-off between safety and clinical usability.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab5\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 5\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eSCORE2 Applicability: override usage, rationale validity, and decision performance by model\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"6\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c6\" colnum=\"6\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eModel\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eDo Not Use\u003c/p\u003e \u003cp\u003e(n/N; %)\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eReason Provided %\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eValid\u003c/p\u003e \u003cp\u003eReason %\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c5\"\u003e \u003cp\u003eF1\u003c/p\u003e \u003cp\u003e(override, 95% CI)\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c6\"\u003e 
\u003cp\u003eAccuracy\u003c/p\u003e \u003cp\u003e(95% CI)\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eGPT-4o\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e19/60; 31.7\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e95.0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e100.0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e0.97 (0.92\u0026ndash;1.00)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e0.98 (0.95\u0026ndash;1.00)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eGPT-5\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e22/60; 36.7\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e100.0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e100.0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e0.95 (0.83\u0026ndash;1.00)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e0.97 (0.90\u0026ndash;1.00)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eGPT-4.1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e20/60; 33.3\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e100.0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e100.0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e1.00 (1.00\u0026ndash;1.00)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e 
\u003cp\u003e1.00 (1.00\u0026ndash;1.00)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eClaude Sonnet 4.5\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e42/60; 70.0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e100.0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e85.7\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e0.65 (0.42\u0026ndash;0.81)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e0.63 (0.47\u0026ndash;0.80)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eGemini 2.5 Pro\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e33/60; 55.0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e100.0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e87.9\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e0.76 (0.53\u0026ndash;0.90)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e0.78 (0.63\u0026ndash;0.92)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eClaude Opus 4.1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e31/60; 51.7\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e100.0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e96.8\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e0.78 (0.57\u0026ndash;0.92)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e0.82 
(0.68\u0026ndash;0.93)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eDeepSeek V3\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e23/60; 38.3\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e95.0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e82.6\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e0.88 (0.72\u0026ndash;0.98)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e0.92 (0.82\u0026ndash;0.98)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eGemini 2.0 Flash\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e19/60; 31.7\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e85.0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e94.7\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e0.87 (0.68\u0026ndash;0.98)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e0.92 (0.83\u0026ndash;0.98)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eGrok-3\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e20/60; 33.3\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e100.0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e100.0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e1.00 (1.00\u0026ndash;1.00)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e1.00 (1.00\u0026ndash;1.00)\u003c/p\u003e 
\u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eLlama 3.3 70B Instruct\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e26/60; 43.3\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e95.0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e88.5\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e0.83 (0.62\u0026ndash;0.95)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e0.87 (0.75\u0026ndash;0.97)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eGPT-5 Nano\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e19/60; 31.7\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e95.0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e100.0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e0.97 (0.91\u0026ndash;1.00)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e0.98 (0.95\u0026ndash;1.00)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003ctfoot\u003e \u003ctr\u003e\u003ctd colspan=\"6\"\u003e\u003cb\u003eLegend. Definitions\u003c/b\u003e: \"Do not use\" represents the proportion of vignettes flagged by the model as SCORE2-ineligible; \"Reason Provided\" indicates the percentage of override decisions accompanied by a text rationale; \"Valid\" indicates the percentage of those rationales matching the Gold Standard. \u003cb\u003eMetrics\u003c/b\u003e: F1 and Accuracy refer to the binary classification performance for the override decision (Positive Class = \"Do not use SCORE2\"). Models are ordered by κw from the primary risk analysis. 
\u003cb\u003eAbbreviations\u003c/b\u003e: CI, confidence interval; SCORE2, Systematic Coronary Risk Estimation 2.\u003c/td\u003e\u003c/tr\u003e \u003c/tfoot\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eRationale transparency varied considerably across models (Supplementary Table S9). Although most systems provided explanations when blocking SCORE2 (\u0026ge;\u0026thinsp;95.0% in nine models), the validity of these reasons differed markedly. Five models (GPT-4o, GPT-5, GPT-4.1, Grok-3, and GPT-5 nano) achieved 100% validity, correctly identifying guideline-specified contraindications. In contrast, four models showed concerning rates of invalid rationales: DeepSeek V3 (17.4%), Claude Sonnet 4.5 (14.3%), Gemini 2.5 Pro (12.1%), and Llama 3.3 70B Instruct (11.5%). These invalid explanations predominantly cited non-qualifying cardiovascular conditions or non-cardiovascular factors, which could lead to inappropriate clinical decisions (detailed breakdown in Supplementary Table S9).\u003c/p\u003e \u003cp\u003eExtraction accuracy for specific override conditions correlated strongly with overall applicability performance (Supplementary Table S8). Top-performing models demonstrated near-perfect multilabel extraction: GPT-4.1 and Grok-3 achieved a Micro-F1 of 1.00, while GPT-4o, DeepSeek V3, and GPT-5 nano reached 0.97. Models with weaker override decisions showed correspondingly poorer condition extraction, with Claude Sonnet 4.5 achieving the lowest Micro-F1 of 0.71 (95% CI 0.48\u0026ndash;0.88). 
Among high-performing models, the distribution of identified conditions aligned closely with the vignette case-mix, suggesting robust recognition across diverse clinical presentations rather than systematic bias toward specific conditions (Supplementary Table S9).\u003c/p\u003e \u003cp\u003e \u003cb\u003eLanguage Performance\u003c/b\u003e \u003c/p\u003e \u003cp\u003eModel performance demonstrated robust consistency across Portuguese and English vignettes (Table\u0026nbsp;\u003cspan refid=\"Tab6\" class=\"InternalRef\"\u003e6\u003c/span\u003e, Supplementary Figure S8). For risk factor extraction, micro-F1 scores exceeded 0.95 in both languages, with language differences ranging from \u0026minus;\u0026thinsp;0.01 to 0.02. No model exhibited clinically meaningful language effects (all FDR-adjusted p\u0026thinsp;\u0026gt;\u0026thinsp;0.05).\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab6\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 6\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eLanguage performance across Portuguese vs English\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"10\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c6\" colnum=\"6\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c7\" colnum=\"7\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c8\" 
colnum=\"8\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c9\" colnum=\"9\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c10\" colnum=\"10\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eModel\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003ePT Micro‑F1\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eEN Micro‑F1\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eΔF1 [95% CI]\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c5\"\u003e \u003cp\u003ePT κw\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c6\"\u003e \u003cp\u003eEN κw\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c7\"\u003e \u003cp\u003eΔκw [95% CI]\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c8\"\u003e \u003cp\u003ePT Acc\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c9\"\u003e \u003cp\u003eEN Acc\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c10\"\u003e \u003cp\u003eΔAcc [95% CI]\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003eGPT-4o\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e0.99\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e0.98\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e-0.01 (-0.01\u0026ndash;0.00)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e0.69\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e0.68\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e-0.01 
(-0.12\u0026ndash;0.09)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003e1.00\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c9\"\u003e \u003cp\u003e0.97\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c10\"\u003e \u003cp\u003e-0.03 (-0.10\u0026ndash;0.00)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003eGPT-5\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e0.97\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e0.98\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e0.01 (0.00\u0026ndash;0.02)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e0.68\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e0.69\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e0.02 (-0.10\u0026ndash;0.13)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003e0.97\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c9\"\u003e \u003cp\u003e0.97\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c10\"\u003e \u003cp\u003e+\u0026thinsp;0.00 (0.00\u0026ndash;0.00)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003eGPT-4.1\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e0.98\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e0.98\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e-0.00 (-0.01\u0026ndash;0.01)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e0.60\u003c/p\u003e \u003c/td\u003e \u003ctd 
align=\"left\" colname=\"c6\"\u003e \u003cp\u003e0.69\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e0.09 (-0.02\u0026ndash;0.23)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003e1.00\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c9\"\u003e \u003cp\u003e1.00\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c10\"\u003e \u003cp\u003e+\u0026thinsp;0.00 (0.00\u0026ndash;0.00)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003eClaude Sonnet 4.5\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e0.98\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e0.99\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e0.00 (-0.00\u0026ndash;0.01)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e0.59\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e0.62\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e0.04 (-0.17\u0026ndash;0.27)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003e0.63\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c9\"\u003e \u003cp\u003e0.63\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c10\"\u003e \u003cp\u003e+\u0026thinsp;0.00 (-0.10\u0026ndash;0.10)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003eGemini 2.5 Pro\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e0.98\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e0.98\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e 
\u003cp\u003e-0.00 (-0.01\u0026ndash;0.01)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e0.64\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e0.53\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e-0.10 (-0.24\u0026ndash;0.00)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003e0.77\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c9\"\u003e \u003cp\u003e0.80\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c10\"\u003e \u003cp\u003e+\u0026thinsp;0.03 (-0.07\u0026ndash;0.13)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003eClaude Opus 4.1\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e0.98\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e0.98\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e-0.00 (-0.01\u0026ndash;0.00)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e0.63\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e0.52\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e-0.11 (-0.28\u0026ndash;0.07)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003e0.80\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c9\"\u003e \u003cp\u003e0.83\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c10\"\u003e \u003cp\u003e+\u0026thinsp;0.03 (-0.07\u0026ndash;0.13)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003eDeepSeek V3\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e 
\u003cp\u003e0.98\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e0.98\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e0.00 (-0.00\u0026ndash;0.01)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e0.60\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e0.53\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e-0.07 (-0.21\u0026ndash;0.06)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003e0.93\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c9\"\u003e \u003cp\u003e0.90\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c10\"\u003e \u003cp\u003e-0.03 (-0.13\u0026ndash;0.07)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003eGemini 2.0 Flash\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e0.96\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e0.98\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e0.02 (0.01\u0026ndash;0.03)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e0.61\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e0.45\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e-0.16 (-0.38\u0026ndash;0.07)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003e0.90\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c9\"\u003e \u003cp\u003e0.93\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c10\"\u003e \u003cp\u003e+\u0026thinsp;0.03 (-0.07\u0026ndash;0.17)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd 
align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003eGrok-3\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e0.98\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e0.98\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e-0.00 (-0.01\u0026ndash;0.01)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e0.44\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e0.46\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e0.02 (-0.07\u0026ndash;0.15)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003e1.00\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c9\"\u003e \u003cp\u003e1.00\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c10\"\u003e \u003cp\u003e+\u0026thinsp;0.00 (0.00\u0026ndash;0.00)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003eLlama 3.3 70B Instruct\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e0.98\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e0.97\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e-0.01 (-0.02\u0026ndash;0.00)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e0.34\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e0.56\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e0.22 (-0.04\u0026ndash;0.49)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003e0.83\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c9\"\u003e \u003cp\u003e0.90\u003c/p\u003e \u003c/td\u003e 
\u003ctd align=\"left\" colname=\"c10\"\u003e \u003cp\u003e+\u0026thinsp;0.07 (-0.07\u0026ndash;0.20)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003eGPT-5 Nano\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e0.99\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e0.98\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e-0.01 (-0.02\u0026ndash;0.00)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e0.33\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e0.46\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e0.13 (-0.08\u0026ndash;0.36)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003e0.97\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c9\"\u003e \u003cp\u003e1.00\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c10\"\u003e \u003cp\u003e+\u0026thinsp;0.03 (0.00\u0026ndash;0.10)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003ctfoot\u003e \u003ctr\u003e\u003ctd colspan=\"10\"\u003e\u003cb\u003eLegend. Comparison\u003c/b\u003e: Paired performance analysis between Portuguese (PT) and English (EN) vignettes across three domains: traditional risk factor extraction (Micro-F1), ESC risk classification (κw), and SCORE2 applicability (Accuracy). \u003cb\u003eMetrics\u003c/b\u003e: Δ represents the difference (EN\u0026thinsp;\u0026minus;\u0026thinsp;PT); values are absolute differences for F1 and κw, and percentage points for Accuracy; 95% confidence intervals are shown in parentheses. No statistically significant differences were found after FDR correction. Models are ordered by κw. 
\u003cb\u003eAbbreviations\u003c/b\u003e: Acc, accuracy; CI, confidence interval; κw, quadratic-weighted kappa.\u003c/td\u003e\u003c/tr\u003e \u003c/tfoot\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003eRisk classification agreement revealed greater language-dependent variability. Llama 3.3 70B Instruct exhibited the largest language effect (Δκw\u0026thinsp;=\u0026thinsp;0.22, 95% CI -0.04 to 0.49), performing better in English, while Gemini 2.0 Flash showed the opposite pattern (Δκw\u0026thinsp;=\u0026thinsp;-0.16, 95% CI -0.38 to 0.07), performing better in Portuguese. Despite individual variations, no model demonstrated statistically significant language effects (all FDR-adjusted p\u0026thinsp;\u0026gt;\u0026thinsp;0.05).\u003c/p\u003e \u003cp\u003eSCORE2 applicability assessment showed consistent accuracy across languages for all models, without significant differences (all FDR-adjusted p\u0026thinsp;\u0026gt;\u0026thinsp;0.05). Language differences were minor and ranged from 0 to 7 percentage points, with Llama 3.3 70B Instruct showing the largest difference.\u003c/p\u003e \u003cp\u003e \u003cb\u003eMedical Raters Performance\u003c/b\u003e \u003c/p\u003e \u003cp\u003eEight clinicians independently classified the 30 Portuguese vignettes using the ESC 3-class system. Individual agreement with the Gold Standard, measured by quadratic-weighted Cohen's kappa, demonstrated substantial heterogeneity, ranging from slight agreement (evaluator #6: κw\u0026thinsp;=\u0026thinsp;0.15, 95% CI: -0.10 to 0.40) to almost perfect agreement (evaluator #4: κw\u0026thinsp;=\u0026thinsp;0.93, 95% CI: 0.85 to 1.00), with the majority achieving moderate agreement comparable to LLM performance (Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003e, Supplementary Table S10). 
Inter-rater reliability among clinicians, assessed using Gwet's AC2 with quadratic weights, was moderate (AC2\u0026thinsp;=\u0026thinsp;0.44, 95% CI: 0.27 to 0.55), indicating meaningful but imperfect consensus independent of the Gold Standard.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eWhen pooled using majority-vote ensemble methodology, the clinicians achieved substantial agreement (κw\u0026thinsp;=\u0026thinsp;0.76, 95% CI: 0.58 to 0.89), with 76.67% accuracy (23/30 vignettes). All misclassifications were limited to adjacent risk categories, with no critical two-level errors observed between low/moderate and very-high risk groups (Supplementary Figure S9). This pooled clinician benchmark exceeded the highest single-model performance observed in the primary analysis (GPT-4o: κw\u0026thinsp;=\u0026thinsp;0.69), establishing a practical performance ceiling for LLM evaluation on these clinical vignettes.\u003c/p\u003e"},{"header":"3. Discussion","content":"\u003cp\u003e In this comprehensive evaluation of eleven contemporary LLMs for cardiovascular risk stratification, we demonstrated that models excel at extracting traditional cardiovascular risk factors from clinical summaries but show moderate and variable performance in translating these factors into guideline-concordant risk classifications. The GPT family dominated three-class ESC risk categorization, with GPT-4o achieving the highest agreement with expert adjudication, followed closely by GPT-5 and GPT-4.1, though notably GPT-5 Nano showed the weakest performance among all models. However, ten of eleven models systematically underestimated cardiovascular risk, representing a critical safety concern for clinical deployment. LLMs also struggled with numeric SCORE2 calculation, producing clinically unacceptable mean absolute errors exceeding 5 percentage points in all but one model, revealing their inability to reliably compute risk stratification formulas. 
Conversely, most models demonstrated robust capability for identifying patients with conditions that require alternative risk assessment beyond SCORE2, such as established ASCVD, CKD, diabetes, or FH, missing fewer than 5% of these cases, with some models missing none. To our knowledge, this is the first benchmark study evaluating LLMs for cardiovascular risk prevention. Our findings establish that, while current models possess strong capabilities for analyzing clinical information and extracting relevant data, substantial refinement in clinical reasoning and risk quantification is required before deployment in cardiovascular prevention workflows.\u003c/p\u003e \u003cp\u003eThe striking performance gap between risk factor extraction and ESC risk classification was expected, as the former represents a well-defined labeling task while the latter requires multi-step reasoning under ambiguous clinical definitions. The near-perfect accuracy in extracting traditional cardiovascular risk factors aligns with prior natural language processing achievements and is unsurprising given their objective, well-documented nature: age, blood pressure, and cholesterol values are unambiguous data points in clinical text\u003csup\u003e\u003cspan additionalcitationids=\"CR20 CR21\" citationid=\"CR19\" class=\"CitationRef\"\u003e19\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e22\u003c/span\u003e\u003c/sup\u003e. However, even within traditional risk factors, we observed a gradient of difficulty: factors requiring interpretive judgment showed incrementally lower performance, with smoking status, hypertension diagnosis, and dyslipidemia diagnosis revealing the challenges of extracting concepts that extend beyond simple numeric values. 
This pattern amplified dramatically with risk modifiers, likely because factors such as 'family history of premature ASCVD' or 'chronic inflammatory disease' lack discrete definitions and require clinical interpretation beyond text matching\u003csup\u003e\u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e22\u003c/span\u003e,\u003cspan citationid=\"CR23\" class=\"CitationRef\"\u003e23\u003c/span\u003e\u003c/sup\u003e. Most critically, translating extracted data into ESC risk categories demands higher-order medical reasoning, integrating multiple variables, applying guideline exceptions, and weighing modifiers, a complex cognitive synthesis where we observed substantial model performance variability. This aligns with mixed results reported for LLMs in diagnostic reasoning and clinical decision-making, where models consistently excel at information retrieval but struggle with multi-step clinical inference requiring integration of competing factors\u003csup\u003e\u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e,\u003cspan citationid=\"CR24\" class=\"CitationRef\"\u003e24\u003c/span\u003e,\u003cspan citationid=\"CR25\" class=\"CitationRef\"\u003e25\u003c/span\u003e\u003c/sup\u003e. The predominant tendency toward risk underestimation in ten of eleven models is particularly concerning, paralleling extensive literature documenting that physicians relying on unaided clinical judgment also tend to underestimate cardiovascular risk, especially in high-risk patients\u003csup\u003e\u003cspan citationid=\"CR26\" class=\"CitationRef\"\u003e26\u003c/span\u003e,\u003cspan citationid=\"CR27\" class=\"CitationRef\"\u003e27\u003c/span\u003e\u003c/sup\u003e. 
This convergent bias suggests that LLMs may have learned to replicate the underestimation patterns present in training data\u003csup\u003e\u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e,\u003cspan citationid=\"CR28\" class=\"CitationRef\"\u003e28\u003c/span\u003e,\u003cspan citationid=\"CR29\" class=\"CitationRef\"\u003e29\u003c/span\u003e\u003c/sup\u003e.\u003c/p\u003e \u003cp\u003eThe failure of LLMs to accurately compute numeric SCORE2 values, despite successfully extracting all required variables, reveals fundamental limitations in mathematical reasoning, although we acknowledge that our zero-shot prompting strategy may have contributed to the variable completion rates observed. Only three models computed absolute risk in all eligible vignettes, and the responses systematically overestimated numeric risk, in paradoxical contrast with the tendency to underestimate risk in categorical classification, suggesting different failure modes for arithmetic computation versus clinical judgment\u003csup\u003e\u003cspan citationid=\"CR30\" class=\"CitationRef\"\u003e30\u003c/span\u003e\u003c/sup\u003e. This disconnect reflects transformer architectures' well-documented reliance on pattern recognition rather than true calculation\u003csup\u003e\u003cspan citationid=\"CR30\" class=\"CitationRef\"\u003e30\u003c/span\u003e\u003c/sup\u003e. Notably, GPT-5 Nano achieved the lowest numeric error but the poorest categorical performance, highlighting a possible trade-off between computational accuracy and clinical reasoning that has been described with larger models\u003csup\u003e\u003cspan citationid=\"CR31\" class=\"CitationRef\"\u003e31\u003c/span\u003e\u003c/sup\u003e. 
Consequently, these results strongly support a hybrid, safety-oriented workflow for clinical deployment: using LLMs exclusively for information extraction, where performance was near-perfect for all traditional SCORE2 variables (micro-F1\u0026thinsp;\u0026ge;\u0026thinsp;0.97), followed by a deterministic implementation of the SCORE2 algorithm to compute absolute risk. This architecture would eliminate the primary source of numerical error, specifically the LLMs' unreliable arithmetic reasoning, while retaining their greatest strengths in unstructured data processing\u003csup\u003e\u003cspan citationid=\"CR30\" class=\"CitationRef\"\u003e30\u003c/span\u003e,\u003cspan citationid=\"CR32\" class=\"CitationRef\"\u003e32\u003c/span\u003e\u003c/sup\u003e.\u003c/p\u003e \u003cp\u003eIn contrast to their struggles with risk categorization, most models successfully identified patients requiring guideline exceptions to SCORE2, with missed-override rates below 5% for high-risk conditions (ASCVD, diabetes, CKD, FH), and GPT-4.1 and Grok-3 achieving perfect performance. However, five models over-blocked SCORE2 in more than 10% of eligible cases, reaching 55% in Claude Sonnet 4.5, compromising clinical usability. While GPT models and Grok-3 consistently provided valid explanations for their decisions, six other models generated hallucinated medical justifications that could mislead users, incorrectly citing conditions such as valvular disease, atrial fibrillation, heart failure, chronic inflammatory diseases, or simply the presence of risk factors as contraindications. Models also handled missing data inconsistently: three models failed to provide risk categories in select vignettes, with GPT-5 transparently refusing classification when smoking status was absent, while Llama 3.3 70B Instruct and GPT-5 Nano failed silently without explanation, suggesting system failure rather than clinical judgment. 
This combination of over-blocking, false rationales, and variable responses to incomplete data reveals concerning reliability gaps despite acceptable performance metrics\u003csup\u003e\u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e,\u003cspan citationid=\"CR23\" class=\"CitationRef\"\u003e23\u003c/span\u003e,\u003cspan citationid=\"CR33\" class=\"CitationRef\"\u003e33\u003c/span\u003e,\u003cspan citationid=\"CR34\" class=\"CitationRef\"\u003e34\u003c/span\u003e\u003c/sup\u003e.\u003c/p\u003e \u003cp\u003eThere are concerns that LLMs might perform worse in low-resource languages\u003csup\u003e\u003cspan citationid=\"CR35\" class=\"CitationRef\"\u003e35\u003c/span\u003e,\u003cspan citationid=\"CR36\" class=\"CitationRef\"\u003e36\u003c/span\u003e\u003c/sup\u003e. Although Portuguese is substantially less represented than English in training corpora, we observed no statistically significant language effects, with risk factor extraction maintaining very high micro-F1 scores and top-performing models showing minimal risk classification differences\u003csup\u003e\u003cspan citationid=\"CR36\" class=\"CitationRef\"\u003e36\u003c/span\u003e\u003c/sup\u003e. This bilingual consistency is an encouraging finding from our study, particularly relevant for non-English-speaking healthcare systems, where locally developed artificial intelligence tools remain scarce\u003csup\u003e\u003cspan citationid=\"CR37\" class=\"CitationRef\"\u003e37\u003c/span\u003e,\u003cspan citationid=\"CR38\" class=\"CitationRef\"\u003e38\u003c/span\u003e\u003c/sup\u003e. 
This linguistic robustness also suggests that fundamental challenges in cardiovascular risk assessment transcend language barriers and may reflect architectural limitations rather than language-specific training gaps.\u003c/p\u003e \u003cp\u003eTo contextualize LLM performance, we performed an exploratory analysis of eight practicing physicians, which revealed substantial heterogeneity in cardiovascular risk classification, with individual agreement with the gold standard ranging from slight to almost perfect. The wide inter-rater variability underscores that cardiovascular risk assessment remains challenging even for experienced clinicians, reflecting the complexity of integrating multiple risk factors into categorical decisions\u003csup\u003e\u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e,\u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e,\u003cspan citationid=\"CR26\" class=\"CitationRef\"\u003e26\u003c/span\u003e\u003c/sup\u003e. Taken together, these findings position current LLM performance in a meaningful intermediate range. While top models such as GPT-4o achieved agreement comparable to mid-performing clinicians, they did not reach the accuracy of the best human evaluator. When pooled using majority voting, the clinician consensus outperformed every individual model, reinforcing that collective clinical judgment provides a robustness that current LLMs cannot yet replicate, although the ensemble also outperformed the average individual clinician. However, a critical safety gap persists: physicians avoided two-level misclassifications entirely, whereas several LLMs exhibited these critical errors that could lead to patient under-treatment\u003csup\u003e\u003cspan citationid=\"CR41\" class=\"CitationRef\"\u003e41\u003c/span\u003e,\u003cspan citationid=\"CR42\" class=\"CitationRef\"\u003e42\u003c/span\u003e\u003c/sup\u003e. 
It is important to note that our study was conducted under controlled conditions; in real-world practice, where clinicians are often overworked and fatigued, we hypothesize that LLMs could offer even greater utility than found here by acting as a vigilant \"second opinion\" against cognitive overload\u003csup\u003e\u003cspan citationid=\"CR39\" class=\"CitationRef\"\u003e39\u003c/span\u003e,\u003cspan citationid=\"CR40\" class=\"CitationRef\"\u003e40\u003c/span\u003e\u003c/sup\u003e. Ultimately, these results suggest LLMs are not yet autonomous decision-makers but could serve as powerful augmentation tools, particularly for standardizing assessment among clinicians performing below the expert median\u003csup\u003e\u003cspan citationid=\"CR41\" class=\"CitationRef\"\u003e41\u003c/span\u003e,\u003cspan citationid=\"CR43\" class=\"CitationRef\"\u003e43\u003c/span\u003e\u003c/sup\u003e.\u003c/p\u003e \u003cp\u003eMost physicians achieved moderate agreement, and notably, only one outperformed GPT-4o, with the remaining seven showing lower weighted kappa scores. This finding that individual physician performance overlapped substantially with LLM ranges suggests these models could already augment clinicians performing below the median, though they cannot yet replace expert consensus\u003csup\u003e\u003cspan citationid=\"CR39\" class=\"CitationRef\"\u003e39\u003c/span\u003e,\u003cspan citationid=\"CR40\" class=\"CitationRef\"\u003e40\u003c/span\u003e\u003c/sup\u003e. When pooled using majority voting, physician consensus achieved substantial agreement, exceeding the best-performing LLM by a small margin (κw\u0026thinsp;=\u0026thinsp;0.76 vs 0.69). Critically, physicians avoided two-level misclassifications entirely, while several LLMs exhibited these critical errors that could lead to patient under-treatment\u003csup\u003e\u003cspan citationid=\"CR41\" class=\"CitationRef\"\u003e41\u003c/span\u003e,\u003cspan citationid=\"CR42\" class=\"CitationRef\"\u003e42\u003c/span\u003e\u003c/sup\u003e. 
The wide inter-rater variability underscores that cardiovascular risk assessment remains challenging even for experienced clinicians, reflecting the complexity of integrating multiple risk factors into categorical decisions\u003csup\u003e\u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e,\u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e,\u003cspan citationid=\"CR26\" class=\"CitationRef\"\u003e26\u003c/span\u003e\u003c/sup\u003e. These results position current LLMs as potential clinical support tools rather than autonomous decision-makers, particularly valuable for standardizing risk assessment in settings where specialized cardiovascular expertise is limited\u003csup\u003e\u003cspan citationid=\"CR41\" class=\"CitationRef\"\u003e41\u003c/span\u003e,\u003cspan citationid=\"CR43\" class=\"CitationRef\"\u003e43\u003c/span\u003e\u003c/sup\u003e.\u003c/p\u003e \u003cp\u003eModel hierarchies emerging from our benchmark have direct implications for deployment. The GPT family (GPT-4o, GPT-5, GPT-4.1) consistently led overall performance, combining near-perfect extraction of traditional risk factors with top-tier identification of risk modifiers, perfect specificity for high-risk override conditions, absence of two-level misclassifications, and weighted agreement for ESC risk stratification of κw\u0026thinsp;\u0026gt;\u0026thinsp;0.65 (best: GPT-4o κw\u0026thinsp;=\u0026thinsp;0.69). GPT-5 and GPT-4.1 also achieved perfect sensitivity for high-risk identification, and GPT-5 displayed appropriately conservative behavior in the presence of missing data. Notably, the model refused to assign an ESC risk category in vignettes where smoking status was unknown, a behavior that, while reducing completion rates, arguably reflects higher algorithmic fidelity and safety than forced guessing.
In contrast, GPT-5 Nano was the weakest performer, and Gemini Flash underperformed relative to Gemini Pro, reinforcing that model capacity materially affects clinical reasoning\u003csup\u003e\u003cspan citationid=\"CR31\" class=\"CitationRef\"\u003e31\u003c/span\u003e\u003c/sup\u003e. The Claude family exhibited pronounced over-blocking of SCORE2 eligibility, with Sonnet 4.5 flagging 55% of otherwise eligible vignettes, limiting clinical usability. Another concerning finding was the generation of erroneous rationales when blocking SCORE2, with DeepSeek V3 (17.4%), Claude Sonnet 4.5 (14.3%), and Gemini 2.5 Pro (12.1%) incorrectly citing non-qualifying conditions and risking clinician misdirection. While extraction-only tasks showed little separation across models, integrated risk assessment requiring multi-step clinical reasoning clearly discriminated moderate-to-good performers from others. When analyzing open-source (DeepSeek V3 and Llama 3.3 70B Instruct) versus closed-source models, we found that for risk factor extraction all models presented high micro-F1 scores. Regarding risk modifiers, closed-source models presented better performance. Despite this difference in risk modifiers, which might be influenced by the prompting strategy, these findings also demonstrated that open-source models can be of use for cardiovascular risk assessment if a two-step approach is considered (step 1 extraction and step 2 calculation using the formula, not the LLM). 
However, considering the proposed strategy with a single prompt to extract risk factors and calculate risk, these results support prioritizing the GPT family for prospective clinical evaluation and align with contemporary deployment guidance that emphasizes high sensitivity for safety-critical screening and rigorous verification of explanation faithfulness and hallucination control\u003csup\u003e\u003cspan citationid=\"CR44\" class=\"CitationRef\"\u003e44\u003c/span\u003e\u003c/sup\u003e.\u003c/p\u003e \u003cp\u003eOur study used simulated clinical vignettes rather than real clinical notes, which may limit external validity and underrepresent documentation artifacts, such as abbreviations, contradictions, and missing-data patterns, found in electronic health records. The vignette case mix was skewed toward the high and very high risk categories, which could inflate agreement metrics and reduce generalizability to lower-prevalence settings. The sample size of 30 vignettes, while adequate for initial benchmarking, provides limited statistical power for subgroup analyses, and the resulting wide confidence intervals constrain the precision with which models can be comparatively ranked. The adjudicated reference standard, although expert-based, introduces inherent subjectivity in defining the ground truth for risk classification. Furthermore, this standard represents an idealized benchmark derived from experts explicitly focused on rigorous calculation, which likely exceeds the implicit, often heuristic-based risk assessment typical of routine daily practice; consequently, our evaluation subjects the models to a stricter performance threshold than that often found in real-world clinical environments. Evaluating a single guideline framework (ESC/SCORE2) restricts generalizability to other calculators and guidelines (e.g., ASCVD, QRISK3). The cross-sectional, single-time-point design may not reflect rapidly evolving models.
Each model was run once per language; although English/Portuguese replicates were consistent, robustness to random seeds, temperature settings, and prompt variations remains untested. Finally, all analyses used zero-shot prompting; few-shot prompting, chain-of-thought reasoning, or fine-tuning might yield different results.\u003c/p\u003e \u003cp\u003eIn this comprehensive benchmark of eleven contemporary LLMs for cardiovascular risk stratification, models achieved near-perfect extraction of traditional risk factors yet demonstrated only moderate accuracy in ESC risk categorization and unreliable SCORE2 calculations, precluding autonomous clinical use. Most models correctly identified guideline exceptions requiring alternative assessment and maintained robust performance across Portuguese and English, supporting near-term applications in structured documentation and eligibility screening under clinical supervision. This foundational study establishes the current capabilities and limitations of LLMs in preventive cardiology, providing critical evidence for implementation strategies while underscoring the necessity for real-world validation, enhanced mathematical reasoning, and safeguards against systematic bias before broader clinical adoption.\u003c/p\u003e"},{"header":"4. Methods","content":"\u003cp\u003e \u003cb\u003eStudy design\u003c/b\u003e \u003c/p\u003e \u003cp\u003eWe conducted a prespecified, simulation-based evaluation of LLMs for cardiovascular risk stratification in accordance with the 2021 ESC prevention guidelines\u003csup\u003e\u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e\u003c/sup\u003e. The study used systematically developed clinical vignettes to benchmark model capabilities against a reference standard. No real patient data were used.
The study protocol was previously registered on the Open Science Framework (\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.17605/OSF.IO/J2ZK9\u003c/span\u003e\u003cspan address=\"10.17605/OSF.IO/J2ZK9\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e).\u003c/p\u003e \u003cp\u003e \u003cb\u003eClinical vignette development and validation\u003c/b\u003e \u003c/p\u003e \u003cp\u003eThirty clinical vignettes were systematically developed by a senior cardiologist to emulate outpatient clinical notes of patients undergoing cardiovascular risk stratification. Each vignette (100\u0026ndash;200 words) was written in free-text format and included demographics, medical history, current medications, physical examination findings, laboratory results, and relevant diagnostic investigations, structured to reflect authentic clinical documentation. All vignettes incorporated the core variables required for SCORE2 risk calculation (age, sex, smoking status, systolic blood pressure, total cholesterol, and HDL cholesterol) together with additional cardiovascular risk modifiers.\u003c/p\u003e \u003cp\u003eThe set was designed to ensure balanced representation across sex, age strata (40\u0026ndash;49, 50\u0026ndash;59, and 60\u0026ndash;69 years), and ESC risk categories (low/moderate, high, very high). Ten vignettes represented conditions in which SCORE2 is not applicable, including ASCVD (n\u0026thinsp;=\u0026thinsp;4), diabetes mellitus (n\u0026thinsp;=\u0026thinsp;3), CKD (n\u0026thinsp;=\u0026thinsp;2), and FH (n\u0026thinsp;=\u0026thinsp;1).\u003c/p\u003e \u003cp\u003eEach vignette underwent independent evaluation by a panel of three cardiologists using a structured four-domain rubric assessing clinical relevance, completeness, realism, and clarity. 
Domains were rated on a four-point Likert scale, and domain-level validity required unanimous ratings of 3\u0026ndash;4 (Item-Level Content Validity Index\u0026thinsp;=\u0026thinsp;1.00). Vignettes not meeting this criterion were revised iteratively until all domains achieved full agreement.\u003c/p\u003e \u003cp\u003eAll vignettes were produced and validated in Portuguese and subsequently translated into English by the original author. The English versions were reviewed by a native English-speaking cardiologist, and a back-translation into Portuguese was performed to confirm conceptual equivalence. Any discrepancies were resolved by consensus between the author and the reviewer.\u003c/p\u003e \u003cp\u003e \u003cb\u003eModels, deployment, and prompting\u003c/b\u003e \u003c/p\u003e \u003cp\u003e We evaluated a combination of proprietary and open-source LLMs selected by convenience and local availability, ensuring representation of the most widely used and high-performing contemporary systems. The final set comprised eleven models, listed alphabetically: Claude Opus 4.1, Claude Sonnet 4.5, DeepSeek V3, Gemini 2.0 Flash, Gemini 2.5 Pro, GPT-4.1, GPT-4o, GPT-5, GPT-5 Nano, Grok-3, and Llama 3.3 70B Instruct (see Supplementary Table S11 for model specifications).\u003c/p\u003e \u003cp\u003eAll models were accessed through the Azure platform or their respective cloud-based application programming interfaces, using default temperature settings and inference-only mode without any fine-tuning on the study dataset (Supplementary Table S11). Each vignette was evaluated in a new, independent session to prevent memory effects or cross-vignette information leakage, ensuring analytical independence across cases.\u003c/p\u003e \u003cp\u003eA standardized prompt template was iteratively developed to ensure consistent task interpretation across models (see Supplementary Appendix S2 for standardized prompt template). 
The prompt explicitly instructed each model to: (1) extract cardiovascular risk factors in a structured format; (2) determine SCORE2 applicability; (3) calculate the 10-year cardiovascular risk when appropriate; (4) classify the patient into ESC risk categories (Low-to-Moderate, High, or Very High); (5) provide a concise clinical explanation for the assigned category; and (6) generate a JSON file to facilitate structured data extraction. Models were specifically directed to use the official SCORE2 calculator for moderate-risk countries or the corresponding risk tables published in the 2021 ESC Guidelines\u003csup\u003e\u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e\u003c/sup\u003e.\u003c/p\u003e \u003cp\u003eThe prompt structure was identical in Portuguese and English, with only the vignette text and language-specific formatting adapted. Zero-shot prompting was employed to assess each model\u0026rsquo;s intrinsic reasoning capabilities. Every model completed all 60 assessments (30 Portuguese and 30 English vignettes).\u003c/p\u003e \u003cp\u003e \u003cb\u003eReference Standard\u003c/b\u003e \u003c/p\u003e \u003cp\u003eA three-member Cardiovascular Risk Adjudication Committee, composed of senior cardiologists with recognized expertise in cardiovascular prevention and not involved in vignette development or model evaluation, independently extracted all relevant cardiovascular risk factors and modifiers from each vignette. Using the 2021 ESC guidelines and assuming the moderate-risk European SCORE2 calibration, the committee calculated the 10-year cardiovascular risk and assigned the corresponding ESC risk category\u003csup\u003e\u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e\u003c/sup\u003e. For conditions in which SCORE2 was not applicable, the committee applied guideline-based categorical classification overriding SCORE2 values. 
Discrepancies were discussed and resolved by consensus, and the final adjudicated outputs constituted the reference (Gold Standard) against which LLMs were compared.\u003c/p\u003e \u003cp\u003e \u003cb\u003eOutcomes\u003c/b\u003e \u003c/p\u003e \u003cp\u003eThe study outcomes were categorized into primary and secondary endpoints:\u003c/p\u003e \u003cp\u003e \u003cspan type=\"Underline\" class=\"Underline\" name=\"Emphasis\"\u003ePrimary Outcomes\u003c/span\u003e \u003c/p\u003e \u003cp\u003e \u003cul\u003e \u003cli\u003e \u003cp\u003eRisk-factor extraction accuracy: Assessed for twelve traditional cardiovascular risk factors, including the six SCORE2 core variables (age, sex, smoking status, systolic blood pressure, total cholesterol, and HDL cholesterol) and six additional factors (diastolic blood pressure, LDL cholesterol, non-HDL cholesterol, triglycerides, hypertension diagnosis, and dyslipidemia diagnosis). Model performance was quantified using micro- and macro-averaged precision, recall, and F1-scores, and agreement was summarized using the per-vignette Jaccard similarity coefficient.\u003c/p\u003e \u003c/li\u003e \u003cli\u003e \u003cp\u003eThree-class cardiovascular risk classification: Agreement between model-predicted and reference ESC categories (Low-to-Moderate, High, Very High) was measured using quadratic-weighted Cohen\u0026rsquo;s κ (κw) as the primary metric. 
Supplementary measures included overall accuracy and the rate of major errors (defined as a two-level misclassification, i.e., Low-to-Moderate classified as Very High or vice versa).\u003c/p\u003e \u003c/li\u003e \u003c/ul\u003e \u003c/p\u003e \u003cp\u003e \u003cspan type=\"Underline\" class=\"Underline\" name=\"Emphasis\"\u003eSecondary Outcomes\u003c/span\u003e \u003c/p\u003e \u003cp\u003e \u003cul\u003e \u003cli\u003e \u003cp\u003eRisk-modifier extraction: Evaluated for twelve predefined factors (elevated coronary calcium score, calcium score equal to zero, pre-diabetes, obesity, family history of premature ASCVD, elevated lipoprotein(a), increased arterial stiffness, elevated high-sensitivity C-reactive protein, chronic inflammatory disease, obstructive sleep apnoea, chronic obstructive pulmonary disease, and cancer) using the same extraction metrics.\u003c/p\u003e \u003c/li\u003e \u003cli\u003e \u003cp\u003eNumeric SCORE2 agreement: For cases in which SCORE2 was applicable, numeric agreement between model-predicted and reference 10-year cardiovascular risk values (%) was quantified using mean absolute error (MAE), root mean square error (RMSE), Bland\u0026ndash;Altman bias and limits of agreement, and Lin\u0026rsquo;s concordance correlation coefficient (CCC).\u003c/p\u003e \u003c/li\u003e \u003cli\u003e \u003cp\u003eSCORE2 applicability: Assessed as a binary decision on whether the SCORE2 algorithm should be applied (identifying exceptions).
Key metrics included the missed-override rate (safety), over-blocking rate (usability), F1-score for the positive (\u0026ldquo;Do not use SCORE2\u0026rdquo;) class, overall accuracy, and correctness of the stated override reason.\u003c/p\u003e \u003c/li\u003e \u003cli\u003e \u003cp\u003eLanguage robustness: Evaluated by paired comparison of Portuguese and English outputs for both extraction and risk classification endpoints.\u003c/p\u003e \u003c/li\u003e \u003c/ul\u003e \u003c/p\u003e \u003cp\u003e \u003cb\u003eHuman Benchmark Analysis\u003c/b\u003e \u003c/p\u003e \u003cp\u003eTo contextualize model performance, an exploratory analysis was conducted to establish a human benchmark for cardiovascular risk classification. Eight physicians (three family medicine specialists, three internal medicine specialists, and two cardiologists), each with more than three years of clinical experience, independently classified the thirty Portuguese vignettes according to ESC risk categories (Low-to-Moderate, High, Very High). All raters were blinded to model outputs and to each other\u0026rsquo;s assessments. Agreement with the Gold Standard was quantified using quadratic-weighted Cohen\u0026rsquo;s κ (κw), and inter-rater consistency was measured using Gwet\u0026rsquo;s AC2 coefficient with quadratic weights. An ensemble classification was generated through majority voting with median tie-breaking to represent the collective physician consensus.\u003c/p\u003e \u003cp\u003e \u003cb\u003eStatistical Analysis\u003c/b\u003e \u003c/p\u003e \u003cp\u003eAll analyses were conducted in R (version 4.4.1) using RStudio (v2025.09.1+401; Posit Software, Boston, MA, USA). The Portuguese Gold Standard served as the reference dataset for both language versions. Each model generated 60 predictions (30 Portuguese and 30 English), paired by vignette ID.
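As an illustration of the agreement metric and ensemble rule described above, the following Python sketch computes quadratic-weighted Cohen's κ for the three ordinal ESC categories and a majority vote across raters. It is not the study's R code. The integer coding (0 = Low-to-Moderate, 1 = High, 2 = Very High), the function names, and the tie-breaking rule shown (falling back to the lower median of all ratings when no single category has the most votes) are assumptions introduced for illustration; the latter is only one plausible reading of "median tie-breaking".

```python
from collections import Counter
from statistics import median_low

# Assumed ordinal coding of ESC categories (illustrative, not from the paper):
# 0 = Low-to-Moderate, 1 = High, 2 = Very High


def quadratic_weighted_kappa(reference, predicted, n_classes=3):
    """Cohen's kappa with quadratic disagreement weights for ordinal labels."""
    if len(reference) != len(predicted) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(reference)

    # Observed joint proportions (rows: reference, columns: prediction).
    observed = [[0.0] * n_classes for _ in range(n_classes)]
    for r, p in zip(reference, predicted):
        observed[r][p] += 1.0 / n

    row = [sum(observed[i]) for i in range(n_classes)]
    col = [sum(observed[i][j] for i in range(n_classes)) for j in range(n_classes)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(n_classes):
        for j in range(n_classes):
            weight = ((i - j) ** 2) / ((n_classes - 1) ** 2)
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * row[i] * col[j]

    if weighted_expected == 0.0:
        raise ValueError("kappa is undefined when all labels fall in one category")
    return 1.0 - weighted_observed / weighted_expected


def majority_vote(ratings):
    """Most frequent category; ties fall back to the lower median (assumed rule)."""
    counts = Counter(ratings)
    top = max(counts.values())
    modes = [category for category, k in counts.items() if k == top]
    if len(modes) == 1:
        return modes[0]
    return median_low(ratings)
```

For example, `majority_vote` can be applied to the eight physician ratings of each vignette to build the consensus labels, and `quadratic_weighted_kappa` can then compare those labels, or any single rater or model, against the adjudicated reference. Because the weights grow with the square of the category distance, a Low-to-Moderate versus Very High error is penalized four times as heavily as an adjacent-category error.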
For continuous variables, a tolerance of \u0026plusmn;\u0026thinsp;0.2 units was applied to account for minor transcription or rounding variations. Predefined rules governed the handling of incomplete outputs: for risk factor extraction, absent model detections of present factors were penalized as false negatives, whereas missing Gold Standard values resulted in exclusion. For ESC risk classification, unclassifiable responses were counted as incorrect for overall accuracy but excluded from agreement metrics to ensure ordinal validity. Numeric comparisons were restricted to vignettes with valid calculations from both the model and reference.\u003c/p\u003e \u003cp\u003eConfidence intervals (95%) were calculated using non-parametric bootstrap resampling (2,000 vignette-level replicates) for paired metrics, including κw, accuracy, and safety-related error rates. Wilson score intervals were used for single-proportion estimates (e.g., sensitivity and specificity). Language comparisons employed paired t-tests or Wilcoxon signed-rank tests for continuous metrics, and McNemar\u0026rsquo;s test for categorical outcomes, with false discovery rate (FDR) correction for multiple comparisons using the Benjamini\u0026ndash;Hochberg method.\u003c/p\u003e \u003cp\u003ePredefined interpretation thresholds were as follows: F1-score\u0026thinsp;\u0026ge;\u0026thinsp;0.90 (excellent), 0.80\u0026ndash;0.89 (good), 0.70\u0026ndash;0.79 (fair), and \u0026lt;\u0026thinsp;0.70 (poor); κw\u0026thinsp;\u0026gt;\u0026thinsp;0.80 (excellent), 0.61\u0026ndash;0.80 (substantial), 0.41\u0026ndash;0.60 (moderate); and missed-override rate\u0026thinsp;\u0026lt;\u0026thinsp;5% (safety threshold), 5\u0026ndash;10% (moderate), \u0026gt;\u0026thinsp;10% (concerning).\u003c/p\u003e \u003cp\u003e \u003cb\u003eEthics\u003c/b\u003e \u003c/p\u003e \u003cp\u003eThis study used synthetic clinical vignettes only and contained no real-patient data or identifiable personal information. 
Vignettes were generated de novo and reviewed to ensure non-identifiability.\u003c/p\u003e \u003cp\u003eFor the clinician-benchmarking component, eight practicing clinicians voluntarily rated synthetic vignettes; no identifiable personal data or human biological material were collected. All clinicians provided informed consent prior to participation.\u003c/p\u003e \u003cp\u003e Under the policy of the Luz Sa\u0026uacute;de Research and Ethics Committee (Lisbon, Portugal), the activity was determined to be exempt/not human participants research, and formal ethics review was waived (decision issued on 4 June 2025).\u003c/p\u003e \u003cp\u003e The project was coordinated by Hospital da Luz Learning Health in collaboration with the participating institutions.\u003c/p\u003e"},{"header":"Declarations","content":"\u003cp\u003e\u003cstrong\u003eAcknowledgements.\u0026nbsp;\u003c/strong\u003eWe thank Duarte Espregueira Mendes, Rita Marinheiro, and Victor Gil for clinical vignette validation; Inês Rosa, João Pereira, José Nuno Raposo, Hugo Viegas, Rita Gomes, Sérgio Madeira, Maria Madalena Rodrigues, and Vanessa Carvalho for clinician benchmarking; and Maria José Loureiro for English translation and validation.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eData Availability.\u0026nbsp;\u003c/strong\u003eThe complete set of synthetic clinical vignettes, adjudicated labels, and model outputs that support the findings of this study are available on the Open Science Framework (OSF) at https://doi.org/10.17605/OSF.IO/J2ZK9\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eCode availability.\u0026nbsp;\u003c/strong\u003eAll R code to reproduce data processing, model evaluation, and figures is available at the same OSF project (https://doi.org/10.17605/OSF.IO/J2ZK9). Versioned releases will be archived at this DOI.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAuthor contributions.\u0026nbsp;\u003c/strong\u003eAll authors conceived the study.
JFS, RD, IM, JMM, JC, NAS, FL and BN designed the methodology. RCS designed the clinical vignettes. RBD, IM, JMM, FL and NAS implemented the LLM evaluations. JFS, RLL, and HD provided expert adjudication. JFS and RBD performed the statistical analyses. JFS drafted the manuscript. All authors interpreted the results, revised the manuscript critically for important intellectual content, and approved the final version.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eCompeting interests.\u0026nbsp;\u003c/strong\u003eThe authors declare no competing interests.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eFunding.\u003c/strong\u003e This research received no specific grant from any funding agency, commercial or not-for-profit sectors.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eEthics declaration.\u003c/strong\u003e Ethical approval and consent procedures are described in Methods (Ethics). Briefly, the dataset comprised synthetic vignettes; the clinician-benchmarking task was determined exempt/not human participants research by Luz Saúde Research and Ethics Committee (Lisbon, Portugal), with a waiver of review (issued on 4 June 2025). All clinicians provided informed consent.\u0026nbsp;\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\u003cli\u003e\u003cspan\u003eMendis, S., Graham, I. \u0026amp; Narula, J. Addressing the Global Burden of Cardiovascular Diseases; Need for Scalable and Sustainable Frameworks. \u003cem\u003eGlob. Heart\u003c/em\u003e 17, (2022).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eYusuf, S. \u003cem\u003eet al.\u003c/em\u003e Modifiable risk factors, cardiovascular disease, and mortality in 155 722 individuals from 21 high-income, middle-income, and low-income countries (PURE): a prospective cohort study. \u003cem\u003eLancet Lond. Engl.\u003c/em\u003e 395, 795\u0026ndash;808 (2020).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eVisseren, F. L. J.
\u003cem\u003eet al.\u003c/em\u003e 2021 ESC Guidelines on cardiovascular disease prevention in clinical practice. \u003cem\u003eEur. Heart J.\u003c/em\u003e 42, 3227\u0026ndash;3337 (2021).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLaw, T. K. \u003cem\u003eet al.\u003c/em\u003e Primary prevention of cardiovascular disease: global cardiovascular risk assessment and management in clinical practice. \u003cem\u003eEur. Heart J. - Qual. Care Clin. Outcomes\u003c/em\u003e 1, 31\u0026ndash;36 (2015).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSposito, A. C. \u003cem\u003eet al.\u003c/em\u003e Physicians\u0026rsquo; attitudes and adherence to use of risk scores for primary prevention of cardiovascular disease: cross-sectional survey in three world regions. \u003cem\u003eCurr. Med. Res. Opin.\u003c/em\u003e 25, 1171\u0026ndash;1178 (2009).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLiew, S. M. \u003cem\u003eet al.\u003c/em\u003e Can doctors and patients correctly estimate cardiovascular risk? A cross-sectional study in primary care. \u003cem\u003eBMJ Open\u003c/em\u003e 8, e017711 (2018).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGraham, I. M., Stewart, M., Hertog, M. G. L., \u0026amp; Cardiovascular Round Table Task Force. Factors impeding the implementation of cardiovascular prevention guidelines: findings from a survey conducted by the European Society of Cardiology. \u003cem\u003eEur. J. Cardiovasc. Prev. Rehabil. Off. J. Eur. Soc. Cardiol. Work. Groups Epidemiol. Prev. Card. Rehabil. Exerc. Physiol.\u003c/em\u003e 13, 839\u0026ndash;845 (2006).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003ePersell, S. D., Dunne, A. P., Lloyd-Jones, D. M. \u0026amp; Baker, D. W. Electronic health record-based cardiac risk assessment and identification of unmet preventive needs. \u003cem\u003eMed. 
Care\u003c/em\u003e 47, 418\u0026ndash;424 (2009).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSedlakova, J. \u003cem\u003eet al.\u003c/em\u003e Challenges and best practices for digital unstructured data enrichment in health research: A systematic narrative review. \u003cem\u003ePLOS Digit. Health\u003c/em\u003e 2, e0000347 (2023).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eAsgari, E. \u003cem\u003eet al.\u003c/em\u003e Impact of Electronic Health Record Use on Cognitive Load and Burnout Among Clinicians: Narrative Review. \u003cem\u003eJMIR Med. Inform.\u003c/em\u003e 12, e55499 (2024).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHoussein, E. H., Mohamed, R. E. \u0026amp; Ali, A. A. Heart disease risk factors detection from electronic health records using advanced NLP and deep learning techniques. \u003cem\u003eSci. Rep.\u003c/em\u003e 13, 7173 (2023).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBoonstra, M. J., Weissenbacher, D., Moore, J. H., Gonzalez-Hernandez, G. \u0026amp; Asselbergs, F. W. Artificial intelligence: revolutionizing cardiology with large language models. \u003cem\u003eEur. Heart J.\u003c/em\u003e 45, 332\u0026ndash;345 (2024).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eQuer, G. \u0026amp; Topol, E. J. The potential for large language models to transform cardiovascular medicine. \u003cem\u003eLancet Digit. Health\u003c/em\u003e S2589-7500(24)00151\u0026ndash;1 (2024) doi:\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1016/S2589-7500(24)00151-1\u003c/span\u003e\u003cspan address=\"10.1016/S2589-7500(24)00151-1\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eNolin-Lapalme, A. \u003cem\u003eet al.\u003c/em\u003e Maximising Large Language Model Utility in Cardiovascular Care: A Practical Guide. \u003cem\u003eCan. J. 
Cardiol.\u003c/em\u003e \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1016/j.cjca.2024.05.024\u003c/span\u003e\u003cspan address=\"10.1016/j.cjca.2024.05.024\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e (2024) doi:10.1016/j.cjca.2024.05.024.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eEriksen, A. V., M\u0026ouml;ller, S. \u0026amp; Ryg, J. Use of GPT-4 to Diagnose Complex Clinical Cases. \u003cem\u003eNEJM AI\u003c/em\u003e 1, AIp2300031 (2024).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGoh, E. \u003cem\u003eet al.\u003c/em\u003e Large Language Model Influence on Diagnostic Reasoning: A Randomized Clinical Trial. \u003cem\u003eJAMA Netw. Open\u003c/em\u003e 7, e2440969 (2024).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSkalidis, I. \u003cem\u003eet al.\u003c/em\u003e ChatGPT takes on the European Exam in Core Cardiology: an artificial intelligence success story? \u003cem\u003eEur. Heart J. - Digit. Health\u003c/em\u003e 4, 279\u0026ndash;281 (2023).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eFerreira Santos, J., Ladeiras-Lopes, R., Leite, F. \u0026amp; Dores, H. Applications of large language models in cardiovascular disease: a systematic review. \u003cem\u003eEur. Heart J. Digit. Health\u003c/em\u003e 6, 540\u0026ndash;553 (2025).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eAbdellaoui, C., Redjdal, A. \u0026amp; Seroussi, B. Generative-AI-Based Approaches for Information Extraction from Clinical Notes: A Scoping Review. \u003cem\u003eStud. Health Technol. Inform.\u003c/em\u003e 328, 193\u0026ndash;197 (2025).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHoussein, E. H., Mohamed, R. E. \u0026amp; Ali, A. A. Heart disease risk factors detection from electronic health records using advanced NLP and deep learning techniques. \u003cem\u003eSci. 
Rep. 13, 7173 (2023).
Zhang, Z., Qiu, Y., Yang, X. & Zhang, M. Enhanced character-level deep convolutional neural networks for cardiovascular disease prediction. BMC Med. Inform. Decis. Mak. 20, 123 (2020).
Ntinopoulos, V. et al. Large language models for data extraction from unstructured and semi-structured electronic health records: a multiple model performance evaluation. BMJ Health Care Inform. 32, e101139 (2025).
Shah, S. V. Accuracy, Consistency, and Hallucination of Large Language Models When Analyzing Unstructured Clinical Notes in Electronic Medical Records. JAMA Netw. Open 7, e2425953 (2024).
Hager, P. et al. Evaluation and mitigation of the limitations of large language models in clinical decision-making. Nat. Med. 30, 2613–2622 (2024).
Gaber, F. et al. Evaluating large language model workflows in clinical decision support for triage and referral and diagnosis. npj Digit. Med. 8, 263 (2025).
Liew, S. M. et al. Can doctors and patients correctly estimate cardiovascular risk? A cross-sectional study in primary care. BMJ Open 8, e017711 (2018).
Webster, R. & Heeley, E. Perceptions of risk: understanding cardiovascular disease. Risk Manag. Healthc. Policy 3, 49–60 (2010).
Mihan, A., Pandey, A. & Van Spall, H. G. Mitigating the risk of artificial intelligence bias in cardiovascular care. Lancet Digit. Health 6, e749–e754 (2024).
Mihan, A., Pandey, A. & Van Spall, H. G. C. Artificial intelligence bias in the prediction and detection of cardiovascular disease. npj Cardiovasc. Health 1, 31 (2024).
Khandekar, N. et al. MedCalc-Bench: Evaluating Large Language Models for Medical Calculations. Preprint at https://doi.org/10.48550/arXiv.2406.12036 (2024).
Small Language Models (SLMs) Can Still Pack a Punch: A survey. https://arxiv.org/html/2501.05465v1.
Roeschl, T. et al. Development of an LLM Pipeline Surpassing Physicians in Cardiovascular Risk Score Calculation. 2025.11.11.25340002 Preprint at https://doi.org/10.1101/2025.11.11.25340002 (2025).
Kim, Y. et al. Medical Hallucinations in Foundation Models and Their Impact on Healthcare. Preprint at https://doi.org/10.48550/arXiv.2503.05777 (2025).
Asgari, E. et al. A framework to assess clinical safety and hallucination rates of LLMs for medical text summarisation. npj Digit. Med. 8, 274 (2025).
Qiu, P. et al. Towards building multilingual language model for medicine. Nat. Commun. 15, 8384 (2024).
Nunes, M., Boné, J., Ferreira, J. C., Chaves, P. & Elvas, L. B. MediAlbertina: An European Portuguese medical language model. Comput. Biol. Med. 182, 109233 (2024).
Chen, H. et al. Large language models and global health equity: a roadmap for equitable adoption in LMICs. Lancet Reg. Health – West. Pac. 63, (2025).
Garcia, G. L. et al. A Step Forward for Medical LLMs in Brazilian Portuguese: Establishing a Benchmark and a Strong Baseline. in 2025 IEEE 38th International Symposium on Computer-Based Medical Systems (CBMS) 214–219 (2025). doi:10.1109/CBMS65348.2025.00052.
Everett, S. S. et al. From Tool to Teammate: A Randomized Controlled Trial of Clinician-AI Collaborative Workflows for Diagnosis. MedRxiv Prepr. Serv. Health Sci. 2025.06.07.25329176 (2025) doi:10.1101/2025.06.07.25329176.
Goh, E. et al. Large Language Model Influence on Diagnostic Reasoning: A Randomized Clinical Trial. JAMA Netw. Open 7, e2440969 (2024).
Hager, P. et al. Evaluation and mitigation of the limitations of large language models in clinical decision-making. Nat. Med. 30, 2613–2622 (2024).
Shan, G. et al. Comparing Diagnostic Accuracy of Clinical Professionals and Large Language Models: Systematic Review and Meta-Analysis. JMIR Med. Inform. 13, e64963 (2025).
Griot, M., Hemptinne, C., Vanderdonckt, J. & Yuksel, D. Large Language Models lack essential metacognition for reliable medical reasoning. Nat. Commun. 16, 1–10 (2025).
Vasey, B. et al. Reporting guideline for the early-stage clinical evaluation of decision support systems driven by artificial intelligence: DECIDE-AI. Nat. Med. 28, 924–933 (2022).

Keywords: Clinical decision support, Diagnostic accuracy, Artificial intelligence, Large language models, Multilingual evaluation, Cardiovascular prevention, Risk stratification, SCORE2
Subject areas: Health sciences/Cardiology, Health sciences/Diseases, Health sciences/Medical research
Posted: December 30th, 2025 (version 1)

