Clinical utility of EHR-based predictive models for identifying high-risk individuals in early cancer detection

doi:10.21203/rs.3.rs-7426088/v1

2025 · doi:10.21203/rs.3.rs-7426088/v1

preprint OA: closed


Jiheum Park, Chao Pang, Tristan Lee, Jacob Berkowitz, Alexander Wei, and 2 more

This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-7426088/v1
This work is licensed under a CC BY 4.0 License.
Status: Posted. Version 1 posted.

Abstract

Electronic health records (EHRs) offer a promising, scalable approach for identifying individuals at high risk for targeted cancer screening, but the absence of clinical benchmarks has limited their adoption.
We evaluated the clinical utility of EHR-based predictive models for 12-month cancer risk across eight major cancers (breast, lung, colorectal, prostate, ovarian, liver, pancreatic, and stomach) using longitudinal data from over 865,000 participants in the All of Us Research Program, which uniquely integrates EHR, genomic, and survey data. Compared to traditional risk factors (e.g., age, family history, genetic variants), EHR-based models significantly improved identification of high-risk groups. The models achieved a 3- to 6-fold increase in risk enrichment for breast, colorectal, pancreatic, and stomach cancers relative to traditional risk factors alone. For liver cancer, the model achieved the highest absolute lift (27.6-fold compared to the general population), although the relative improvement over known risk factors was more modest (1.68-fold). These findings establish practical benchmarks for EHR-based cancer risk prediction and provide insights for integrating such models into clinical workflows to enable more precise and scalable early detection strategies.

Biological sciences/Cancer/Cancer prevention
Health sciences/Medical research/Translational research

Introduction

Early detection of cancer, before it progresses to advanced stages, can substantially improve survival and reduce cancer-related mortality. 1 However, effective screening guidelines currently exist for only a few cancer types 2, such as colorectal cancer (starting at age 45) 3 and lung cancer (based on smoking history) 4. Many cancers with high case-fatality rates, such as pancreatic, liver, ovarian, and stomach cancers, lack evidence-based screening strategies. These cancers are often diagnosed at advanced stages because of their insidious onset, low symptom specificity, and the absence of effective early detection strategies.
Notably, these are also among the leading causes of cancer-related mortality, underscoring the urgent need for scalable risk-tailored approaches to identify individuals at high risk who may benefit from early diagnostic interventions. A tool that improves early identification for these cancers could have a direct impact on patient outcomes and survival. Electronic health records (EHRs) offer a promising, non-invasive, and cost-effective data source for identifying high-risk individuals across diverse cancer types. 5 EHRs capture longitudinal patient trajectories that may reveal prediagnostic symptom clusters or healthcare utilization patterns, spanning diagnoses, medications, procedures, and more, enabling personalized risk profiling. 6 Recent advances in artificial intelligence (AI), particularly the emergence of large language models capable of synthesizing large-scale clinical data, have further amplified the potential of EHR-based predictive models. 7 – 9 Many studies have demonstrated that advanced machine learning algorithms such as gradient-boosted trees, deep neural networks, and transformer-based architectures outperform traditional models in predicting cancer risk from EHR data. 10 However, prior work has primarily focused on evaluating such methods for different cancer types in terms of predictive performance with limited assessment of clinical utility. 11 , 12 In particular, there has been less attention on how these models perform in stratifying high-risk populations for targeted screening or how they compare to traditional risk factors such as family history, genetic carrier status, or relevant comorbidities. To bridge this gap, we conducted a multi‑cancer assessment of risk enrichment (i.e., the concentration of incident cancers among high-risk individuals, also referred to as lift values) using EHR‑based predictive models. 
We leveraged data from the All of Us Research Program, a national research cohort of more than 860,000 participants that uniquely integrates longitudinal EHR data, genomic information, and patient-reported survey responses. This resource enabled us to directly compare EHR-based model predictions against well-established cancer risk factors, including age, family history, and known genetic mutations (e.g., BRCA, Lynch syndrome). We evaluated model performance across eight common cancers, examining both predictive accuracy and clinical utility in stratifying high-risk groups for early detection. This large-scale, multi-cancer evaluation provides a framework for assessing the real-world impact of EHR-based risk prediction on targeted screening and earlier diagnosis.

Results

Study workflow and cohort characteristics

The overall study workflow, including cohort identification, predictive modeling, and clinical utility evaluation against established risk factors, is summarized in Fig. 1. We used structured EHR data in the Observational Medical Outcomes Partnership (OMOP) common data model. 13, 14 To classify malignancy-related condition names, we developed a prompt-based approach using a generative pretrained transformer (GPT) model, 15 mapping them to 52 predefined cancer categories. The prompt was iteratively refined through clinical review, achieving 94.4% accuracy. Using the All of Us Research Program database 16, we identified 62,597 individuals with confirmed cancer diagnoses and 173,386 controls with no malignancy-related diagnoses. Cancer cases from the All of Us database were classified into 49 distinct cancer types.

Predictive performance of EHR-based models across cancer types

We developed cancer-specific predictive models using XGBoost 17, leveraging medical conditions documented at least 12 months prior to diagnosis. Each model incorporated approximately 26,000 features derived from the EHR.
Predictive performance, measured by the area under the receiver operating characteristic curve (AUROC), is shown in Fig. 1 for cancer types with at least 200 diagnosed individuals. To ensure comparability, we evaluated each model against the same control group described in the cohort definition. We used five-fold stratified cross-validation, with test sets representing hypothetical general populations for evaluating model performance against established risk factors. Training sets in each fold were used to build the EHR models. Performance varied across cancer types, with the highest AUROC for prostate cancer (0.90; 95% CI: 0.90–0.91) and the lowest for stomach cancer (0.61; 95% CI: 0.59–0.64). Clinical utility of EHR-based models in comparison with traditional risk factors To evaluate the clinical utility of the EHR-based model for stratified cancer screening, we primarily used risk enrichment, defined as the ratio of cancer incidence in the top k% of predicted risk scores to the overall population incidence. This measure—also referred to as lift 18 (Lift = Prevalence in high-risk group / Prevalence in population)—quantifies the model’s ability to concentrate true cases within high-risk strata, supporting practical decisions about screening thresholds and resource allocation. We focused our analyses on eight cancer types: breast, prostate, colorectal, lung, ovarian, liver, pancreatic, and stomach cancer. For each, we assessed whether the EHR-based model could identify high-risk cohorts with greater cancer prevalence than those defined by established clinical risk factors (e.g., age, genetic predisposition, or comorbid conditions such as new-onset diabetes; Table 1 ). We evaluated the clinical utility in two settings: (1) when the EHR model was used independently (Fig. 2 A), and (2) when it was combined with current clinical guidelines or known risk factors (Fig. 2 B). 
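Written out, the lift definition above takes the following form. The symbols are introduced here for illustration only and are not the paper's notation: N is the number of individuals in the evaluation population, P is the number of incident cancers among them, n_k is the number of individuals in the top k% of predicted risk, and c_k is the number of incident cancers among those n_k individuals.

```latex
\[
\mathrm{Lift}(k) \;=\; \frac{c_k / n_k}{P / N},
\qquad
\mathrm{Lift}(k) \;\le\; \min\!\left(\frac{N}{P},\ \frac{N}{n_k}\right)
\]

% Worked example with hypothetical numbers (not study data):
% N = 10000, P = 100 (1% prevalence), k = 2% so n_k = 200, and c_k = 20.
\[
\mathrm{Lift}(2\%) \;=\; \frac{20 / 200}{100 / 10000} \;=\; \frac{0.10}{0.01} \;=\; 10
\]
```

The bound follows from c_k being at most min(n_k, P). It shows why the attainable lift depends on prevalence and on the chosen coverage as well as on the model, which is one reason lift and AUROC need not rank cancers in the same order.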
For example, the Cancer of the Pancreas Screening Study (CAPS) trial for early pancreatic cancer detection defines high-risk individuals based on the presence of inherited mutations (e.g., ATM, BRCA1/2, PALB2, STK11, PRSS1/2, CTRC, and Lynch syndrome genes) or a family history of pancreatic cancer (two or more close relatives on the same side of the family). In addition to these established criteria, we included age and new-onset diabetes (NOD) as risk factors for pancreatic cancer in our analysis. Depending on the specific risk factor, the proportion of individuals identified as high-risk, referred to as coverage, ranged from 0.8% to 20% (Table 1). The associated lift values ranged from 0.77 to 4.10, with the highest lift observed for a family history of pancreatic cancer and the lowest (below 1.0) for a family history of any cancer, meaning that this group was less enriched for pancreatic cancer than the general population. In contrast, the EHR model's coverage and lift depend on the risk threshold used to define high-risk status, which can be tuned to target specific intervention levels. At matched coverage levels, the EHR model consistently outperformed most individual risk factors. For instance, in pancreatic cancer, the EHR-based model identified high-risk individuals with more than three times the precision of genetic testing alone (lift 14.2 vs 3.87; 95% CI: 8.70–19.7 vs 1.11–6.63), highlighting its potential to detect high-risk cases currently missed by CAPS criteria. When combined with carrier status and tuned to target 3–5% coverage, the EHR model achieved significantly higher lift values (5.36–9.83) than carrier status alone (3.87), supporting its potential role in risk-stratified early detection. Beyond pancreatic cancer, we evaluated the model's clinical utility for breast, prostate, colorectal, lung, ovarian, liver, and stomach cancers (Table 1).
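The matched-coverage comparison described above can be illustrated with a minimal sketch. It is not the authors' code: the function names, the toy population of 20 individuals, and the tie handling (plain sort order) are all assumptions made for illustration. A binary risk factor fixes a coverage, and the model is then asked to flag the same fraction of the population by predicted risk.

```python
from typing import List, Sequence


def lift(selected: Sequence[bool], is_case: Sequence[bool]) -> float:
    """Prevalence among selected individuals divided by overall prevalence."""
    n_selected = sum(selected)
    n_cases = sum(is_case)
    if n_selected == 0 or n_cases == 0:
        raise ValueError("need at least one selected individual and one case")
    cases_selected = sum(1 for s, c in zip(selected, is_case) if s and c)
    return (cases_selected / n_selected) / (n_cases / len(is_case))


def coverage(selected: Sequence[bool]) -> float:
    """Fraction of the population flagged as high-risk."""
    return sum(selected) / len(selected)


def top_fraction(scores: Sequence[float], fraction: float) -> List[bool]:
    """Flag the highest-scoring `fraction` of individuals (at least one)."""
    n_top = max(1, round(fraction * len(scores)))
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    flagged = [False] * len(scores)
    for i in order[:n_top]:
        flagged[i] = True
    return flagged


# Hypothetical toy population: 20 individuals, the first 4 are cases.
is_case = [True] * 4 + [False] * 16
# Hypothetical binary risk factor carried by one case and three controls.
has_rf = [i in (0, 4, 5, 6) for i in range(20)]
# Hypothetical model scores: three cases and one control score highest.
scores = [0.9, 0.8, 0.7, 0.1, 0.6] + [0.05] * 15

rf_coverage = coverage(has_rf)
rf_lift = lift(has_rf, is_case)
# Evaluate the model at the coverage implied by the risk factor.
ehr_lift = lift(top_fraction(scores, rf_coverage), is_case)
print(f"coverage={rf_coverage:.2f}  RF lift={rf_lift:.2f}  EHR lift={ehr_lift:.2f}")
```

Fixing the coverage before comparing keeps the two approaches on an equal footing, since a smaller high-risk group can reach a higher lift simply by being smaller.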
In most cases, the EHR model significantly improved lift, both as a standalone tool and in combination with established risk factors. Importantly, the EHR model may identify early-onset cancers not captured by age-based screening programs, a growing concern particularly for colorectal and gastric cancers, where early-onset incidence is rising. 19 Among the eight cancer types evaluated, liver cancer showed the highest lift achieved by the EHR model, while stomach cancer showed the lowest (Fig. 3). Notably, lift was not necessarily correlated with AUROC-based predictive performance; for example, although prostate cancer had the highest AUROC, the greatest lift was observed for liver cancer.

Table 1. Clinical utility evaluation across various cancer types. Lift by EHR model is evaluated at coverage matched to each risk factor (RF) and shows the model's value as an independent tool for identifying high-risk individuals, while lift by RF + EHR model shows its value as a complementary tool used together with the risk factor. Values are lift [95% CI]. FH, family history; NOD, new-onset diabetes. * indicates statistical significance (p value less than 0.05).

Cancer      Risk factor (RF)             Coverage  Lift by RF         Lift by EHR model   Max lift by RF + EHR model
Breast      BRCA                         1%        4.87 [4.00–5.75]   14.8 [13.5–16.1]*   12.0 [11.3–12.7]*
            Age 40–74                    64%       1.32 [1.30–1.33]   1.45 [1.43–1.47]*   1.36 [1.34–1.38]*
            FH_cancer                    20%       1.26 [1.17–1.34]   3.30 [3.17–3.42]*   2.25 [2.16–2.34]*
            FH_breast                    8%        2.58 [2.41–2.74]   5.49 [5.06–5.91]*   4.08 [3.82–4.34]*
Prostate    Age 45                       65%       1.51 [1.52–1.53]   1.53 [1.52–1.55]    1.52 [1.52–1.53]
            FH_cancer                    20%       1.25 [1.19–1.30]   4.07 [3.97–4.17]*   2.71 [2.64–2.78]*
            FH_prostate                  4%        4.42 [4.15–4.69]   11.2 [10.4–11.9]*   7.57 [7.31–7.84]*
Colorectal  Age 45                       64%       1.38 [1.33–1.44]   1.38 [1.36–1.39]    1.41 [1.36–1.45]
            FH_cancer                    20%       0.97 [0.88–1.06]   3.10 [2.96–3.24]*   2.19 [2.10–2.28]*
            FH_colorectal                5%        2.68 [2.43–2.93]   7.55 [7.24–7.88]*   5.71 [5.25–6.16]*
Lung        Smoking history              39%       1.79 [1.72–1.86]   2.06 [1.93–2.19]*   1.89 [1.83–1.95]*
            Smoking history & Age 50–80  23%       2.59 [2.43–2.74]   2.95 [2.81–3.09]*   2.74 [2.63–2.86]*
            FH_cancer                    20%       1.04 [0.94–1.13]   3.24 [3.09–3.40]*   2.20 [2.11–2.29]*
            FH_lung                      5%        2.79 [2.27–3.31]   7.04 [6.35–7.72]*   4.94 [4.51–5.37]*
Ovarian     BRCA                         0.5%      7.24 [2.61–11.9]   20.6 [16.1–25.1]*   15.3 [11.4–19.2]*
            FH_cancer                    20%       1.25 [1.06–1.43]   3.09 [2.65–3.53]*   2.28 [2.18–2.37]*
            FH_ovarian                   2%        7.37 [6.53–8.20]   10.9 [9.98–11.7]*   10.4 [9.41–11.4]*
Liver       Hepatitis B/C or Cirrhosis   2%        16.9 [15.2–18.6]   27.6 [25.9–29.3]*   19.3 [18.7–19.8]*
            FH_cancer                    20%       0.70 [0.55–0.84]   4.19 [3.60–4.78]*   2.93 [2.68–3.18]*
Pancreas    carrier                      1.0%      3.87 [1.11–6.63]   14.2 [8.70–19.7]*   9.83 [5.92–13.7]*
            FH_cancer                    20%       0.77 [0.66–0.88]   2.53 [2.36–2.71]*   1.63 [1.54–1.73]*
            FH_pancreas                  2%        4.10 [1.74–7.46]   8.54 [5.31–11.8]*   7.86 [4.94–10.8]*
            NOD                          13%       0.78 [0.55–1.07]   2.53 [2.36–2.71]*   2.08 [1.94–2.23]*
            NOD60                        6%        1.18 [0.74–1.63]   4.06 [3.86–4.25]*   3.13 [2.75–3.51]*
Stomach     Helicobacter pylori          2%        1.41 [0.00–4.22]   4.71 [0.29–9.13]    3.58 [1.96–5.20]*
            FH_cancer                    20%       0.82 [0.33–1.32]   1.82 [1.54–2.10]*   1.39 [1.12–1.65]*
            FH_stomach                   1%        6.56 [3.51–9.61]   6.58 [0.00–14.2]    6.95 [3.23–10.7]

Features contributing to model performance

Across cancer types, top-ranked features varied and often included known risk factors or related clinical conditions (Figure S1). For example, cirrhosis and hepatitis were among the most influential features for liver cancer, while pancreatic cysts and other pancreatic disorders were highly ranked for pancreatic cancer. In the breast cancer model, the most influential features included carcinoma in situ of the breast and acquired absence of breast, both of which appeared in a subset of individuals prior to their index dates. Specifically, carcinoma in situ was observed in 135 cases and 181 controls, and acquired absence of breast in 62 cases and 128 controls, together representing 4.7% of the breast cancer case-control cohort. Although these features were not included in the case definition for invasive breast cancer and were present in both cases and controls, they were consistently ranked among the top contributors across cross-validation folds.

Discussion

Early diagnosis of cancer remains a major clinical challenge, particularly for cancers lacking effective screening guidelines. Existing diagnostic modalities are often limited by availability, cost, or invasiveness, making widespread implementation difficult. 20 Identifying high-risk individuals who would benefit most from targeted screening could significantly improve early detection and reduce cancer-related mortality. In this study, we evaluated the potential of an EHR-based predictive model to identify high-risk individuals across multiple cancer types, comparing its performance to known risk factors and current clinical guidelines. Our results show that the model can stratify high-risk populations, either as a standalone tool or in combination with existing approaches. A model with this capability could enable identification of high-lift subgroups who could be triaged to more intensive screening, even in the absence of traditional risk factors.
These findings establish an important benchmark for translating future predictive modeling efforts into population-level cancer prevention and early detection strategies, highlighting the potential utility of EHR-based models in the era of AI-driven precision medicine. Our study used lift as a primary metric alongside AUROC to evaluate clinical utility. While AUROC is a widely used measure of model discrimination, it does not necessarily reflect a model’s ability to identify actionable high-risk groups for targeted screening. In our analysis, lift and AUROC were not always aligned—for example, prostate cancer had the highest AUROC, yet liver cancer showed the greatest lift. This illustrates that higher discrimination does not always translate into better risk enrichment across different cancers, as lift depends not only on model performance but also on disease prevalence and the distribution of predicted risk. However, when comparing alternative models for the same cancer, improvements in AUROC may be more likely to yield gains in lift. These findings highlight the importance of incorporating risk enrichment metrics such as lift alongside traditional accuracy measures when evaluating predictive models for stratified prevention strategies. However, a key challenge in applying advanced predictive models is interpretability. 21 , 22 While machine learning algorithms such as XGBoost used in this study are well known for capturing complex, nonlinear interactions among features and often provide significantly improved performance compared to linear models such as logistic regression (Figure S2), their complexity can make model outputs difficult to interpret and trust. Our SHAP analysis illustrated this complexity by highlighting how multiple interacting features contributed to model predictions. For example, in liver cancer, where the EHR model achieved the greatest lift among all cancer types evaluated (Fig. 
3 ), the top five features were primarily related to hepatitis and cirrhosis, which were also used as established risk factors in our comparisons. However, the superior performance of the EHR model suggests it improves risk stratification not only by capturing these known associations but also by integrating additional, non-obvious features identified through SHAP analysis. In the case of breast cancer, carcinoma in situ of the breast and acquired absence of breast were consistently among the top-ranked predictive features. These features were present in both cases and controls (e.g., 135 cases and 181 controls had carcinoma in situ ; 62 cases and 128 controls had acquired absence of breast ), collectively accounting for 4.7% of the breast cancer case-control cohort. Given their substantial presence in controls as well as cases, their predictive contribution likely arises not from their presence alone, but from their interaction with other clinical variables captured by the model. These findings demonstrate how EHR-based models can leverage a broad spectrum of structured clinical data to identify meaningful predictors of cancer risk, many of which may be overlooked by traditional rule-based or linear approaches. While model interpretability has traditionally relied on feature attribution methods 23 such as SHAP, LIME, Integrated Gradients, and DeepLIFT, recent advances in AI, particularly in generative models capable of human-like reasoning, open new possibilities for more intuitive and interactive forms of explainability. 24 , 25 These emerging approaches may enable richer insights into model behavior by aligning explanations more closely with how clinicians reason through decisions. As predictive models are increasingly applied in medicine, developing advanced explainable AI techniques that enhance human understanding will be critical for fostering trust, supporting clinical adoption, and ensuring responsible deployment. 
Despite the promise of EHR-based predictive models, there are inherent limitations to using EHR data for cancer prediction or for any clinical modeling task. 26 , 27 These include issues of data completeness, misclassification, and potential false positives due to inaccuracies in documentation or coding. While manual chart review is often necessary to ensure phenotypic fidelity, emerging HIPAA-compliant large language model-based tools, such as those used in this study, may soon enable scalable, semi-automated validation pipelines to support clinical data curation at scale. In this study, we used prompt-based GPT classification to categorize cancer-related concepts, a strategy that could be extended to broader EHR validation efforts. It is important to emphasize that this study focused on baseline EHR model performance using only structured condition occurrence data within the OMOP framework. There remains tremendous opportunity to enhance model accuracy and clinical relevance by incorporating additional structured data, such as laboratory results, demographics, procedures, and observations as well as emerging multi-omics data, including genomics, proteomics, methylation, and exposomics to account for environmental exposures. As these data sources become more available and integrable, future EHR-based models can achieve greater predictive power and biological insight. While our study demonstrates the potential of EHR-based models to improve early cancer diagnosis, any predictive tool used to guide screening decisions must be evaluated in the context of overdiagnosis and overtreatment. 1 Identifying individuals at elevated risk does not necessarily translate to clinical benefit, particularly for cancers with indolent or slow-progressing courses. Future work should incorporate downstream clinical outcomes to assess whether EHR-based risk stratification leads to net benefit in terms of survival, quality of life, and healthcare resource utilization. 
In conclusion, EHR-based predictive models offer a non-invasive, scalable approach to identifying individuals at elevated cancer risk, with the potential for meaningful clinical impact in early detection. However, this promise must be balanced with comprehensive evaluation of downstream effects to ensure a net benefit in real-world applications. With continued improvements in data quality, model interpretability, and integration of diverse data types, these tools can complement existing screening guidelines and inform personalized surveillance strategies. Future work should explore how high-risk individuals identified by such models can be enrolled in risk-adapted screening protocols, including decisions around eligibility, screening frequency, and modality. These efforts represent a critical step toward realizing precision prevention at a population scale. Methods All of Us Research Program database The All of Us Research Program 16 provides a nationally representative dataset designed to reflect the diversity of the U.S. population. Any individual aged 18 or older is eligible to participate. As of August 2025, more than 865,000 participants have enrolled in the program, with over 595,000 having completed the full consent process, which includes agreeing to share EHR, completing surveys, providing physical measurements, and donating biospecimens to the All of Us biobank. For this study, we included participants with available whole genome sequencing (WGS), survey responses, and linked EHR data. Cancer type classification EHR data from the All of Us Research Program follow the OMOP Common Data Model, where clinical concepts are encoded using standardized concept IDs with defined hierarchical relationships (i.e., ancestors and descendants). 
To identify potential cancer cases, we extracted all descendant concept names under the OMOP concept ID 443392, which represents “malignant neoplastic disease.” We then developed a GPT-based classification approach to map these medical concept names to predefined cancer types (e.g., colorectal cancer, liver cancer, skin cancer). Classification was performed using the OpenAI API with the gpt-4o model. Starting from an initial prompt, we iteratively refined both the prompt and the cancer type categories based on clinician review of the model’s outputs, as illustrated in Fig. 1 A. The final prompt and full codebase are publicly available at https://github.com/jp4147/aou_EHRmodel . Case and control group We identified individuals with any of the concept IDs related to malignant neoplastic disease (i.e., descendants of concept ID 443392) recorded in their medical history. The earliest occurrence of any of these concept IDs was used as the individual’s first cancer diagnosis date. Individuals without any of these cancer-related concept IDs were considered potential controls. We then restricted both cases and controls to participants with available whole genome sequencing data. Cancer cases were further filtered to exclude individuals whose first diagnosis was labeled as a secondary malignancy or classified as “not cancer” by the GPT-based prompt. For cases, the prediction index date was defined as 12 months prior to the first cancer diagnosis. For controls, the index date was set to 24 months before the last recorded medical condition, to reduce the likelihood of including individuals with undiagnosed cancer. To ensure adequate longitudinal medical history, we included only individuals (both cases and controls) with at least five documented medical conditions prior to their respective index dates. 
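The index-date rules in this section can be sketched as follows. This is an illustrative reading rather than the study's code: the helper names are hypothetical, "months" are treated as calendar months with the day clamped to the end of the target month, "prior to" is taken as strictly earlier, and each condition record is counted once. The source does not specify any of these details.

```python
import calendar
from datetime import date
from typing import Optional, Sequence


def shift_months(d: date, months: int) -> date:
    """Shift a date by whole calendar months, clamping to the target month's last day."""
    year, month0 = divmod(d.year * 12 + (d.month - 1) + months, 12)
    last_day = calendar.monthrange(year, month0 + 1)[1]
    return date(year, month0 + 1, min(d.day, last_day))


def index_date(condition_dates: Sequence[date],
               first_cancer_date: Optional[date] = None) -> date:
    """Cases: 12 months before first cancer diagnosis.
    Controls: 24 months before the last recorded condition."""
    if first_cancer_date is not None:
        return shift_months(first_cancer_date, -12)
    if not condition_dates:
        raise ValueError("a control needs at least one recorded condition")
    return shift_months(max(condition_dates), -24)


def has_enough_history(condition_dates: Sequence[date],
                       first_cancer_date: Optional[date] = None,
                       min_conditions: int = 5) -> bool:
    """True if at least `min_conditions` records fall strictly before the index date."""
    cutoff = index_date(condition_dates, first_cancer_date)
    return sum(1 for d in condition_dates if d < cutoff) >= min_conditions


# Hypothetical control: last recorded condition on 2022-06-30.
control_conditions = [
    date(2018, 1, 10), date(2018, 6, 1), date(2019, 1, 5),
    date(2019, 6, 20), date(2020, 1, 15), date(2022, 6, 30),
]
control_index = index_date(control_conditions)
# Hypothetical case diagnosed on a leap day, to show the day clamping.
case_index = index_date([], first_cancer_date=date(2020, 2, 29))
print(control_index, case_index, has_enough_history(control_conditions))
```

Only records before the index date would then feed the feature matrix, so nothing recorded in the 12 months leading up to a diagnosis contributes to the prediction.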
Risk factor identification

We evaluated a set of established risk factors to compare against the EHR model: age, family history of cancer, smoking history, genetic carrier status, chronic hepatitis B/C, cirrhosis, new-onset diabetes, and Helicobacter pylori infection. Age was calculated relative to each individual's index date using their recorded date of birth. For conditions such as chronic hepatitis B/C, cirrhosis, new-onset diabetes, and H. pylori infection, we queried the All of Us Controlled Tier Dataset (v8) using curated sets of high-level OMOP concept IDs associated with each condition. We retrieved all descendant concepts via the cb_criteria and cb_search_all_events tables and flagged individuals with any matching concept recorded in their medical history. Individuals were classified as having each risk factor if they had a first recorded diagnosis prior to their index date. For new-onset diabetes, individuals were classified as positive if they had a first diagnosis of type 2 diabetes before the index date and no record of prior diabetes-related medications. Family history of cancer and smoking history were derived from participant survey data. Genetic carriers were identified using the ClinVar database. We restricted the analysis to variants classified as pathogenic or likely pathogenic that were either submitted by multiple submitters with no conflicts or reviewed by an expert panel. A total of 4,738 individuals were identified as carriers of pathogenic variants in genes associated with cancer predisposition syndromes, including ATM, BRCA1, BRCA2, PALB2, STK11 (Peutz-Jeghers), CDKN2A (FAMMM), and Lynch syndrome genes (EPCAM, MLH1, MSH2, MSH6, PMS2), as well as hereditary pancreatitis genes (PRSS1/2, CTRC).

Model training

We conducted five-fold stratified cross-validation separately for each cancer type. While the number of cases varied by cancer type, we used a consistent control group across all models.
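A stratified five-fold split of the kind described here can be sketched in plain Python. The function names, the fixed random seed, and the toy labels are assumptions for illustration; the study's own implementation is in the linked repository and may differ.

```python
import random
from typing import Iterator, List, Sequence, Tuple


def stratified_folds(labels: Sequence[int], n_folds: int = 5,
                     seed: int = 0) -> List[List[int]]:
    """Assign indices to folds so each fold keeps roughly the overall class ratio."""
    rng = random.Random(seed)
    folds: List[List[int]] = [[] for _ in range(n_folds)]
    for label in sorted(set(labels)):
        members = [i for i, y in enumerate(labels) if y == label]
        rng.shuffle(members)
        # Deal each class out round-robin so every fold gets its share.
        for position, idx in enumerate(members):
            folds[position % n_folds].append(idx)
    return folds


def train_test_splits(labels: Sequence[int], n_folds: int = 5,
                      seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices); each fold is the test set exactly once."""
    folds = stratified_folds(labels, n_folds, seed)
    for k, test_idx in enumerate(folds):
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield sorted(train_idx), sorted(test_idx)


# Hypothetical toy cohort: 10 cases (label 1) and 40 controls (label 0).
labels = [1] * 10 + [0] * 40
splits = list(train_test_splits(labels))
for train_idx, test_idx in splits:
    n_test_cases = sum(labels[i] for i in test_idx)
    print(len(train_idx), len(test_idx), n_test_cases)
```

With five folds, each test set holds one fifth of the cases and one fifth of the controls, so the held-out prevalence matches the prevalence of the full case-control set for that cancer type.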
In each fold, the data were split into a training set (80%) and a test set (20%), maintaining the case-control ratio. For each individual, medical condition data were converted into a binary sparse matrix, where each feature represented the presence or absence of a specific condition prior to the index date. Using these features, we trained gradient-boosted decision tree models with XGBoost for each fold. Evaluation setup While the training sets were used to develop the EHR models, the test sets served as a proxy for a hypothetical general population to evaluate the clinical utility of each model. We compared the effectiveness of the EHR model and known risk factors in identifying high-risk individuals. To quantify clinical utility, we used lift, a metric that measures the enrichment of true cases within the high-risk group relative to the general population. A higher lift indicates that the high-risk group contains a greater concentration of true cancer cases, suggesting greater potential effectiveness if this group were targeted for confirmatory screening, such as biopsy or imaging. Feature importance extraction To identify features contributing to model predictions, we computed SHAP (SHapley Additive exPlanations) values for each individual in the test set across all five cross-validation folds. We hypothesized that features whose SHAP values show inconsistent directionality across patients may reflect unreliable contributions to risk prediction. Accordingly, for each fold, we aggregated SHAP values by summing the signed contributions of each feature across individuals. Features were then ranked within each fold based on their aggregate contributions, and final rankings were obtained by averaging ranks across folds to identify consistently important predictors. Statistical analysis We reported 95% confidence intervals to represent variability in performance metrics across the five cross-validation folds. 
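The text does not say how the 95% confidence intervals were derived from the five fold-level values. One common choice, shown here purely as an assumption, is a Student-t interval around the mean across folds. The fold values below are hypothetical and the function name is illustrative.

```python
from math import sqrt
from statistics import mean, stdev
from typing import Sequence, Tuple

# Two-sided 95% Student-t critical value for 4 degrees of freedom (five folds).
T_CRIT_95_DF4 = 2.776


def fold_interval(values: Sequence[float]) -> Tuple[float, float, float]:
    """Return (mean, lower, upper) of a t-based 95% interval over five fold values."""
    if len(values) != 5:
        raise ValueError("the critical value used here assumes exactly five folds")
    centre = mean(values)
    # Standard error of the mean from the sample standard deviation.
    half_width = T_CRIT_95_DF4 * stdev(values) / sqrt(len(values))
    return centre, centre - half_width, centre + half_width


# Hypothetical lift values from five test folds (not study data).
fold_lifts = [13.0, 14.0, 14.5, 15.0, 14.5]
centre, lower, upper = fold_interval(fold_lifts)
print(f"lift {centre:.1f} [{lower:.2f}, {upper:.2f}]")
```

An interval built this way reflects fold-to-fold variability only. Because the folds share most of their training data, it should be read as a descriptive spread rather than a formal interval for out-of-sample performance.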
To assess statistical significance between groups (e.g., comparing model performance or lift values), we used the Mann–Whitney U test. P-values less than 0.05 were considered statistically significant.

References

1. Crosby D et al (2022) Early detection of cancer. Science 375:eaay9040
2. Smith RA et al (2016) Cancer screening in the United States, 2016: A review of current American Cancer Society guidelines and current issues in cancer screening. CA Cancer J Clin 66:96–114
3. Patel SG et al (2022) Updates on Age to Start and Stop Colorectal Cancer Screening: Recommendations From the U.S. Multi-Society Task Force on Colorectal Cancer. Am J Gastroenterol 117:57–69
4. Parekh A et al (2022) The 50-Year Journey of Lung Cancer Screening: A Narrative Review. Cureus 14:e29381
5. Jung AW et al (2024) Multi-cancer risk stratification based on national health data: a retrospective modelling and validation study. Lancet Digit Health 6:e396–e406
6. Huguet N et al (2020) Using Electronic Health Records in Longitudinal Studies: Estimating Patient Attrition. Med Care 58(Suppl 1):S46–S52
7. Ye J, Woods D, Jordan N, Starren J (2024) The role of artificial intelligence for the application of integrating electronic health records and patient-generated data in clinical decision support. AMIA Jt Summits Transl Sci Proc 2024:459–467
8. Rose C, Chen JH (2024) Learning from the EHR to implement AI in healthcare. NPJ Digit Med 7:330
9. Yang X et al (2022) A large language model for electronic health records. NPJ Digit Med 5:194
10. Zhang B, Shi H, Wang H (2023) Machine Learning and AI in Cancer Prognosis, Prediction, and Treatment Selection: A Critical Approach. J Multidiscip Healthc 16:1779–1791
11. Goldstein BA, Navar AM, Pencina MJ, Ioannidis JP (2017) Opportunities and challenges in developing risk prediction models with electronic health records data: a systematic review. J Am Med Inf Assoc 24:198–208
12. Mishra AK, Chong B, Arunachalam SP, Oberg AL, Majumder S (2024) Machine Learning Models for Pancreatic Cancer Risk Prediction Using Electronic Health Record Data-A Systematic Review and Assessment. Am J Gastroenterol 119:1466–1482
13. Klann JG, Joss MA, Embree K, Murphy SN (2019) Data model harmonization for the All Of Us Research Program: Transforming i2b2 data into the OMOP common data model. PLoS ONE 14:e0212463
14. Wang L et al (2025) A scoping review of OMOP CDM adoption for cancer research using real world data. NPJ Digit Med 8:189
15. Elnashar A, White J, Schmidt DC (2025) Enhancing structured data generation with GPT-4o evaluating prompt efficiency across prompt styles. Front Artif Intell 8:1558938
16. Ramirez AH et al (2022) The All of Us Research Program: data quality, utility, and diversity. Patterns 3
17. Chen T, Guestrin C (2016) XGBoost: A scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 785–794
18. Lau K, Hart GR, Deng J (2025) Predicting time-to-first cancer diagnosis across multiple cancer types. Sci Rep 15:24790
19. Ben-Aharon I et al (2023) Early-onset cancer in the gastrointestinal tract is on the rise—evidence and implications. Cancer Discov 13:538–551
20. Ballard DH et al (2021) The Role of Imaging in Health Screening: Overview, Rationale of Screening, and Screening Economics. Acad Radiol 28:540–547
21. Stiglic G et al (2020) Interpretability of machine learning-based prediction models in healthcare. Wires Data Min Knowl 10
22. Linardatos P, Papastefanopoulos V, Kotsiantis S (2020) Explainable AI: A Review of Machine Learning Interpretability Methods. Entropy (Basel) 23
23. Shobeiri S (2024) Enhancing Transparency in Healthcare Machine Learning Models Using Shap and Deeplift a Methodological Approach. Iraqi J Inform Communication Technol 7:56–72
24. Krause S, Stolzenburg F (2024) From data to commonsense reasoning: the use of large language models for explainable AI. arXiv preprint arXiv:2407.03778
25. Pal NR (2020) In Search of Trustworthy and Transparent Intelligent Systems With Human-Like Cognitive and Reasoning Capabilities. Front Robot Ai 7
26. Carrington JM, Effken JA (2011) Strengths and limitations of the electronic health record for documenting clinical events. Comput Inf Nurs 29:360–367
27. Kim MK, Rouphael C, McMichael J, Welch N, Dasarathy S (2024) Challenges in and Opportunities for Electronic Health Record-Based Data Analysis and Interpretation. Gut Liver 18:201–208

Additional Declarations

There is NO Competing Interest.
Also discoverable on Platform About Our Team In Review Editorial Policies Advisory Board Help Center Resources Author Services Accessibility API Access RSS feed Manage Cookie Preferences © Research Square 2026 | ISSN 2693-5015 (online) Privacy Policy Terms of Service Do Not Sell My Personal Information {"props":{"pageProps":{"initialData":{"identity":"rs-7426088","acceptedTermsAndConditions":true,"allowDirectSubmit":true,"archivedVersions":[],"articleType":"Article","associatedPublications":[],"authors":[{"id":504210845,"identity":"d9ed1824-d5da-4dd4-8471-5fe4f61c7b8c","order_by":0,"name":"Jiheum Park","email":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAZAAAAAyAQMAAABI0h/eAAAABlBMVEX///8AAABVwtN+AAAACXBIWXMAAA7EAAAOxAGVKw4bAAABAklEQVRIiWNgGAWjYBACxgaGhAOS/2xgfAsGBgmgGAMDMz4tDw9YsKXB+BKEtQA1PT5QwXYYWQuYgVsL8+zmhAM3eM7LG1w7/IDhR4VE4trZzY0fGCqsExtw2THnWMLBGRK3DTfcTjNg7DkjkbjtzsFmCYYz6bi1zMhJOCxhcDvB4HYOAzNjG1DLjcQGCca2w3i05H84/CfhHIqW5h+M//BpSUg4IHHgAIqWNmCg4dEy5wAwXhqSDWcC/XIQ6BdjoF/aLBKOpRvj0mI4uyH5g2SDnTzf7eSHD35U2Mhuu93++MaHGmtZnFpmIHEOwFkJOJSDgLwEHslRMApGwSgYBWAAAM5tZvBPUzn1AAAAAElFTkSuQmCC","orcid":"","institution":"The Trustees of Columbia University in the City of New York","correspondingAuthor":true,"prefix":"","firstName":"Jiheum","middleName":"","lastName":"Park","suffix":""},{"id":504210846,"identity":"6a2a340d-f3be-4f47-a695-e78f63e53d38","order_by":1,"name":"Chao Pang","email":"","orcid":"","institution":"The Trustees of Columbia University in the City of New York","correspondingAuthor":false,"prefix":"","firstName":"Chao","middleName":"","lastName":"Pang","suffix":""},{"id":504210847,"identity":"165cb2ca-1539-439b-a3eb-153e61b047c6","order_by":2,"name":"Tristan Lee","email":"","orcid":"","institution":"The Trustees of Columbia University in the City of New 
York","correspondingAuthor":false,"prefix":"","firstName":"Tristan","middleName":"","lastName":"Lee","suffix":""},{"id":504210848,"identity":"938390be-a6ef-410a-9fea-76fc6640fe5d","order_by":3,"name":"Jacob Berkowitz","email":"","orcid":"","institution":"Cedars-Sinai Medical Center","correspondingAuthor":false,"prefix":"","firstName":"Jacob","middleName":"","lastName":"Berkowitz","suffix":""},{"id":504210849,"identity":"d1efab68-a9b2-46f6-bf3a-56ae65902e6c","order_by":4,"name":"Alexander Wei","email":"","orcid":"https://orcid.org/0000-0003-3290-9959","institution":"Columbia University Irving Medical Center","correspondingAuthor":false,"prefix":"","firstName":"Alexander","middleName":"","lastName":"Wei","suffix":""},{"id":504210850,"identity":"6cff6af9-874d-4fcf-a021-f29573cf2074","order_by":5,"name":"Chin Hur","email":"","orcid":"","institution":"Columbia University Irving Medical Center","correspondingAuthor":false,"prefix":"","firstName":"Chin","middleName":"","lastName":"Hur","suffix":""},{"id":504210851,"identity":"8b0dfd21-e9e3-425f-aeac-b221bc49d50a","order_by":6,"name":"Nicholas Tatonetti","email":"","orcid":"https://orcid.org/0000-0002-2700-2597","institution":"Cedars-Sinai Medical Center","correspondingAuthor":false,"prefix":"","firstName":"Nicholas","middleName":"","lastName":"Tatonetti","suffix":""}],"badges":[],"createdAt":"2025-08-21 12:20:39","currentVersionCode":1,"declarations":"","doi":"10.21203/rs.3.rs-7426088/v1","doiUrl":"https://doi.org/10.21203/rs.3.rs-7426088/v1","draftVersion":[],"editorialEvents":[],"editorialNote":"","failedWorkflow":false,"files":[{"id":90544414,"identity":"8b9283af-f6cb-46b7-b8d4-4d6321bc4f0f","added_by":"auto","created_at":"2025-09-04 00:18:32","extension":"png","order_by":1,"title":"Figure 1","display":"","copyAsset":false,"role":"figure","size":189316,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eStudy workflow for evaluating clinical utility of EHR model compared to established risk 
factors.\u003c/strong\u003e Among 49 cancer types identified in the All of Us data, AUROC distributions are shown for cancer types with at least 200 cases. Cancer types in red (e.g., pancreas, colorectal, breast) were selected for clinical utility comparison. For example, comparison of EHR-based model and traditional risk factor (e.g., carrier status) for pancreatic cancer, measured using lift at equal population coverage (~0.8%).\u003c/p\u003e","description":"","filename":"1.png","url":"https://assets-eu.researchsquare.com/files/rs-7426088/v1/20bfd06f97455c8e6aee7bc9.png"},{"id":90544415,"identity":"9a7e844b-0d3d-4c12-8276-fab92785ed46","added_by":"auto","created_at":"2025-09-04 00:18:32","extension":"png","order_by":2,"title":"Figure 2","display":"","copyAsset":false,"role":"figure","size":88315,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eClinical utility evaluation of the EHR model for pancreatic cancer.\u003c/strong\u003e (A) Clinical utility of the EHR model as an independent tool: lift achieved by the EHR model compared to traditional risk factors at matched coverage levels. (B) Clinical utility of the EHR model as a complementary tool: lift when combining the EHR model with a risk factor to identify high-risk individuals, evaluated across a range of targeted coverage thresholds. The added value of the EHR model varies depending on the coverage used to define high-risk groups. Thresholds yielding statistically significant lift (EHR + risk factor vs. 
risk factor alone) are highlighted in red.\u003c/p\u003e","description":"","filename":"2.png","url":"https://assets-eu.researchsquare.com/files/rs-7426088/v1/c472a6889fb529529799d73e.png"},{"id":90544413,"identity":"861a81fa-f375-4002-ab49-ffe850ea3839","added_by":"auto","created_at":"2025-09-04 00:18:32","extension":"png","order_by":3,"title":"Figure 3","display":"","copyAsset":false,"role":"figure","size":94657,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eDifferent trends between benefits from EHR model and prediction performance measured by AUROC\u003c/strong\u003e. While EHR model for prostate cancer showed the highest AUROC, EHR model for liver cancer showed the highest lift values\u003c/p\u003e","description":"","filename":"3.png","url":"https://assets-eu.researchsquare.com/files/rs-7426088/v1/44303c2c65870c7e6ed281a8.png"},{"id":100372742,"identity":"5d9b16da-1d21-4c97-a8f3-3484e6fc0e5a","added_by":"auto","created_at":"2026-01-16 08:13:05","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":1193158,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-7426088/v1/31882dde-fb26-4453-be22-278f33e552d1.pdf"}],"financialInterests":"There is \u003cb\u003eNO\u003c/b\u003e Competing Interest.","formattedTitle":"Clinical utility of EHR-based predictive models for identifying high-risk individuals in early cancer detection","fulltext":[{"header":"Introduction","content":"\u003cp\u003eEarly detection of cancer, before it progresses to advanced stages, can substantially improve survival and reduce cancer-related mortality.\u003csup\u003e\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e\u003c/sup\u003e However, effective screening guidelines currently exist for only a few cancer types\u003csup\u003e\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e\u003c/sup\u003e, 
such as colorectal cancer (starting at age 45)\u003csup\u003e\u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e\u003c/sup\u003e and lung cancer (based on smoking history)\u003csup\u003e\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e\u003c/sup\u003e. Many cancers with high case-fatality rates, such as pancreatic, liver, ovarian, and stomach cancers, lack evidence-based screening strategies. These cancers are often diagnosed at advanced stages, due to their insidious onset, low symptom specificity, and absence of effective early detection strategies. Notably, these are also among the leading causes of cancer-related mortality, underscoring the urgent need for scalable risk-tailored approaches to identify individuals at high risk who may benefit from early diagnostic interventions. A tool that improves early identification for these cancers could have a direct impact on patient outcomes and survival.\u003c/p\u003e\u003cp\u003eElectronic health records (EHRs) offer a promising, non-invasive, and cost-effective data source for identifying high-risk individuals across diverse cancer types.\u003csup\u003e\u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e\u003c/sup\u003e EHRs capture longitudinal patient trajectories that may reveal prediagnostic symptom clusters or healthcare utilization patterns, spanning diagnoses, medications, procedures, and more, enabling personalized risk profiling.\u003csup\u003e\u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e6\u003c/span\u003e\u003c/sup\u003e Recent advances in artificial intelligence (AI), particularly the emergence of large language models capable of synthesizing large-scale clinical data, have further amplified the potential of EHR-based predictive models.\u003csup\u003e\u003cspan additionalcitationids=\"CR8\" citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR9\" 
class=\"CitationRef\"\u003e9\u003c/span\u003e\u003c/sup\u003e Many studies have demonstrated that advanced machine learning algorithms such as gradient-boosted trees, deep neural networks, and transformer-based architectures outperform traditional models in predicting cancer risk from EHR data.\u003csup\u003e\u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e10\u003c/span\u003e\u003c/sup\u003e\u003c/p\u003e\u003cp\u003eHowever, prior work has primarily focused on evaluating such methods for different cancer types in terms of predictive performance with limited assessment of clinical utility.\u003csup\u003e\u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e,\u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e12\u003c/span\u003e\u003c/sup\u003e In particular, there has been less attention on how these models perform in stratifying high-risk populations for targeted screening or how they compare to traditional risk factors such as family history, genetic carrier status, or relevant comorbidities.\u003c/p\u003e\u003cp\u003eTo bridge this gap, we conducted a multi‑cancer assessment of risk enrichment (i.e., the concentration of incident cancers among high-risk individuals, also referred to as lift values) using EHR‑based predictive models. We leveraged data from the All of Us Research Program, a national research cohort of more than 860,000 participants that uniquely integrates longitudinal EHR data, genomic information, and patient-reported survey responses. This resource enabled us to directly compare EHR-based model predictions against well-established cancer risk factors, including age, family history, and known genetic mutations (e.g., BRCA, Lynch syndrome). We evaluated model performance across eight common cancers, examining both predictive accuracy and clinical utility in stratifying high-risk groups for early detection. 
This large-scale, multi-cancer evaluation provides a framework for assessing the real-world impact of EHR-based risk prediction on targeted screening and earlier diagnosis.\u003c/p\u003e"},{"header":"Results","content":"\u003cdiv id=\"Sec3\" class=\"Section2\"\u003e\u003ch2\u003eStudy workflow and cohort characteristics\u003c/h2\u003e\u003cp\u003eThe overall study workflow, including cohort identification, predictive modeling, and clinical utility evaluation against established risk factors, is summarized in Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e. We used structured EHR data in the Observational Medical Outcomes Partnership (OMOP) common data model.\u003csup\u003e\u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e,\u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e\u003c/sup\u003e To classify malignancy-related condition names, we developed a prompt-based approach using a generative pretrained transformer (GPT) model,\u003csup\u003e\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e\u003c/sup\u003e mapping them to 52 predefined cancer categories. The prompt was iteratively refined through clinical review, achieving 94.4% accuracy. Using the All of Us Research Program database\u003csup\u003e\u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e\u003c/sup\u003e, we identified 62,597 individuals with confirmed cancer diagnoses and 173,386 controls with no malignancy-related diagnoses. 
Cancer cases from the All of Us database were classified into 49 distinct cancer types.\u003c/p\u003e\u003c/div\u003e\n\u003ch3\u003ePredictive performance of EHR-based models across cancer types\u003c/h3\u003e\n\u003cp\u003eWe developed cancer-specific predictive models using XGBoost\u003csup\u003e\u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e\u003c/sup\u003e, leveraging medical conditions documented at least 12 months prior to diagnosis. Each model incorporated approximately 26,000 features derived from the EHR. Predictive performance, measured by the area under the receiver operating characteristic curve (AUROC), is shown in Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e for cancer types with at least 200 diagnosed individuals. To ensure comparability, we evaluated each model against the same control group described in the cohort definition. We used five-fold stratified cross-validation, with test sets representing hypothetical general populations for evaluating model performance against established risk factors. Training sets in each fold were used to build the EHR models. Performance varied across cancer types, with the highest AUROC for prostate cancer (0.90; 95% CI: 0.90\u0026ndash;0.91) and the lowest for stomach cancer (0.61; 95% CI: 0.59\u0026ndash;0.64).\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\n\u003ch3\u003eClinical utility of EHR-based models in comparison with traditional risk factors\u003c/h3\u003e\n\u003cp\u003eTo evaluate the clinical utility of the EHR-based model for stratified cancer screening, we primarily used risk enrichment, defined as the ratio of cancer incidence among individuals in the top k% of predicted risk scores to the overall population incidence. 
This measure\u0026mdash;also referred to as lift\u003csup\u003e\u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e\u003c/sup\u003e (Lift\u0026thinsp;=\u0026thinsp;Prevalence in high-risk group / Prevalence in population)\u0026mdash;quantifies the model\u0026rsquo;s ability to concentrate true cases within high-risk strata, supporting practical decisions about screening thresholds and resource allocation.\u003c/p\u003e\u003cp\u003eWe focused our analyses on eight cancer types: breast, prostate, colorectal, lung, ovarian, liver, pancreatic, and stomach cancer. For each, we assessed whether the EHR-based model could identify high-risk cohorts with greater cancer prevalence than those defined by established clinical risk factors (e.g., age, genetic predisposition, or comorbid conditions such as new-onset diabetes; Table\u0026nbsp;\u003cspan refid=\"Tab1\" class=\"InternalRef\"\u003e1\u003c/span\u003e).\u003c/p\u003e\u003cp\u003eWe evaluated the clinical utility in two settings: (1) when the EHR model was used independently (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003eA), and (2) when it was combined with current clinical guidelines or known risk factors (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003eB). For example, the Cancer of the Pancreas Screening Study (CAPS) trial for early pancreatic cancer detection defines high-risk individuals based on the presence of inherited mutations (e.g., \u003cem\u003eATM\u003c/em\u003e, \u003cem\u003eBRCA1/2\u003c/em\u003e, \u003cem\u003ePALB2\u003c/em\u003e, \u003cem\u003eSTK11\u003c/em\u003e, \u003cem\u003ePRSS1/2\u003c/em\u003e, \u003cem\u003eCTRC\u003c/em\u003e, and Lynch syndrome genes) or a family history of pancreatic cancer (two or more close relatives on the same side of the family). 
In addition to these established criteria, we included age and new-onset diabetes (NOD) as risk factors for pancreatic cancer in our analysis. Depending on the specific risk factor, the proportion of individuals identified as high-risk, referred to as \u003cem\u003ecoverage\u003c/em\u003e, ranged from 0.8% to 20% (Table\u0026nbsp;\u003cspan refid=\"Tab1\" class=\"InternalRef\"\u003e1\u003c/span\u003e). The associated lift values ranged from 0.77 to 4.10, with the highest lift observed for a family history of pancreatic cancer and the lowest (below 1.0) for a family history of any cancer; a lift below 1.0 means the flagged group had a lower cancer prevalence than the general population. In contrast, the EHR model\u0026rsquo;s coverage and lift depend on the risk threshold used to define high-risk status, which can be tuned to target specific intervention levels. At matched coverage levels, the EHR model outperformed most individual risk factors. For instance, in pancreatic cancer, the EHR-based model achieved more than three times the lift of genetic testing alone (14.2 [95% CI: 8.70\u0026ndash;19.7] vs 3.87 [95% CI: 1.11\u0026ndash;6.63]), highlighting its potential to detect high-risk cases currently missed by CAPS criteria. When combined with carrier status and tuned to target 3\u0026ndash;5% coverage, the EHR model achieved significantly higher lift values (5.36\u0026ndash;9.83) than carrier status alone (3.87), supporting its potential role in risk-stratified early detection.\u003c/p\u003e\u003cp\u003eBeyond pancreatic cancer, we evaluated the model\u0026rsquo;s clinical utility for breast, prostate, colorectal, lung, ovarian, liver, and stomach cancers (Table\u0026nbsp;\u003cspan refid=\"Tab1\" class=\"InternalRef\"\u003e1\u003c/span\u003e). In most cases, the EHR model significantly improved lift, both as a standalone tool and in combination with established risk factors. 
Importantly, the EHR model may identify early-onset cancers not captured by age-based screening programs, a growing concern particularly in colorectal and gastric cancers, where early-onset incidence is rising.\u003csup\u003e\u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e19\u003c/span\u003e\u003c/sup\u003e Among the eight cancer types evaluated, liver cancer showed the highest lift achieved by the EHR model, while stomach cancer showed the lowest (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003e). Notably, lift did not necessarily track AUROC-based predictive performance; for example, although prostate cancer had the highest AUROC, the greatest lift was observed for liver cancer.\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003cp\u003e\u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab1\" border=\"1\"\u003e\u003ccaption language=\"En\"\u003e\u003cdiv class=\"CaptionNumber\"\u003eTable 1\u003c/div\u003e\u003cdiv class=\"CaptionContent\"\u003e\u003cp\u003e\u003cb\u003eClinical utility evaluation across various cancer types.\u003c/b\u003e Lift by the EHR model is evaluated at the coverage matched to each risk factor and shows its value as an independent tool for identifying high-risk individuals, while lift by the RF\u0026thinsp;+\u0026thinsp;EHR model demonstrates its value as a tool complementary to the risk factor (RF). 
* indicates statistical significance (P-value less than 0.05).\u003c/p\u003e\u003c/div\u003e\u003c/caption\u003e\u003ccolgroup cols=\"6\"\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c6\" colnum=\"6\"\u003e\u003c/div\u003e\u003cthead\u003e\u003ctr\u003e\u003cth align=\"left\" colname=\"c1\"\u003e\u0026nbsp;\u003c/th\u003e\u003cth align=\"left\" colname=\"c2\"\u003e\u003cp\u003eRisk Factor (RF)\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c3\"\u003e\u003cp\u003eCoverage\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c4\"\u003e\u003cp\u003eLift by RF\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c5\"\u003e\u003cp\u003eLift by EHR model\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c6\"\u003e\u003cp\u003eMax lift by RF\u0026thinsp;+\u0026thinsp;EHR model\u003c/p\u003e\u003c/th\u003e\u003c/tr\u003e\u003c/thead\u003e\u003ctbody\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e\u003cb\u003eBreast\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eBRCA\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e1%\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e4.87 [4.00\u0026ndash;5.75]\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e14.8 [13.5\u0026ndash;16.1]*\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" 
colname=\"c6\"\u003e\u003cp\u003e12.0 [11.3\u0026ndash;12.7]*\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u0026nbsp;\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eAge 40\u0026ndash;74\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e64%\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e1.32 [1.30\u0026ndash;1.33]\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e1.45 [1.43\u0026ndash;1.47]*\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e1.36 [1.34\u0026ndash;1.38]*\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u0026nbsp;\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eFH_cancer\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e20%\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e1.26 [1.17\u0026ndash;1.34]\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e3.30 [3.17\u0026ndash;3.42]*\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e2.25 [2.16\u0026ndash;2.34]*\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u0026nbsp;\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eFH_breast\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e8%\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e2.58 [2.41\u0026ndash;2.74]\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e5.49 [5.06\u0026ndash;5.91]*\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e4.08 
[3.82\u0026ndash;4.34]*\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e\u003cb\u003eProstate\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eAge 45\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e65%\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e1.51 [1.52\u0026ndash;1.53]\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e1.53 [1.52\u0026ndash;1.55]\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e1.52 [1.52\u0026ndash;1.53]\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u0026nbsp;\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eFH_cancer\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e20%\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e1.25 [1.19\u0026ndash;1.30]\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e4.07 [3.97\u0026ndash;4.17]*\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e2.71 [2.64\u0026ndash;2.78]*\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u0026nbsp;\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eFH_prostate\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e4%\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e4.42 [4.15\u0026ndash;4.69]\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e11.2 [10.4\u0026ndash;11.9]*\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e7.57 
[7.31\u0026ndash;7.84]*\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e\u003cb\u003eColorectal\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eAge 45\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e64%\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e1.38 [1.33\u0026ndash;1.44]\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e1.38 [1.36\u0026ndash;1.39]\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e1.41 [1.36\u0026ndash;1.45]\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u0026nbsp;\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eFH_cancer\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e20%\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e0.97 [0.88\u0026ndash;1.06]\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e3.10 [2.96\u0026ndash;3.24]*\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e2.19 [2.10\u0026ndash;2.28]*\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u0026nbsp;\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eFH_colorectal\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e5%\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e2.68 [2.43\u0026ndash;2.93]\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e7.55 [7.24\u0026ndash;7.88]*\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e5.71 
[5.25\u0026ndash;6.16]*\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e\u003cb\u003eLung\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eSmoking history\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e39%\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e1.79 [1.72\u0026ndash;1.86]\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e2.06 [1.93\u0026ndash;2.19]*\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e1.89 [1.83\u0026ndash;1.95]*\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u0026nbsp;\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eSmoking history \u0026amp;\u003c/p\u003e\u003cp\u003eAge 50\u0026ndash;80\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e23%\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e2.59 [2.43\u0026ndash;2.74]\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e2.95 [2.81\u0026ndash;3.09]*\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e2.74 [2.63\u0026ndash;2.86]*\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u0026nbsp;\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eFH_cancer\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e20%\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e1.04 [0.94\u0026ndash;1.13]\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e3.24 [3.09\u0026ndash;3.40]*\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" 
colname=\"c6\"\u003e\u003cp\u003e2.20 [2.11\u0026ndash;2.29]*\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u0026nbsp;\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eFH_lung\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e5%\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e2.79 [2.27\u0026ndash;3.31]\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e7.04 [6.35\u0026ndash;7.72]*\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e4.94 [4.51\u0026ndash;5.37]*\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e\u003cb\u003eOvarian\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eBRCA\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e0.5%\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e7.24 [2.61\u0026ndash;11.9]\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e20.6 [16.1\u0026ndash;25.1]*\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e15.3 [11.4\u0026ndash;19.2]*\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u0026nbsp;\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eFH_cancer\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e20%\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e1.25 [1.06\u0026ndash;1.43]\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e3.09 [2.65\u0026ndash;3.53]*\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" 
colname=\"c6\"\u003e\u003cp\u003e2.28 [2.18\u0026ndash;2.37]*\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u0026nbsp;\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eFH_ovarian\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e2%\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e7.37 [6.53\u0026ndash;8.20]\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e10.9 [9.98\u0026ndash;11.7]*\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e10.4 [9.41\u0026ndash;11.4]*\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e\u003cb\u003eLiver\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eHepatitis B/C or Cirrhosis\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e2%\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e16.9 [15.2\u0026ndash;18.6]\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e27.6 [25.9\u0026ndash;29.3]*\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e19.3 [18.7\u0026ndash;19.8]*\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u0026nbsp;\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eFH_cancer\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e20%\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e0.70 [0.55\u0026ndash;0.84]\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e4.19 [3.60\u0026ndash;4.78]*\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" 
char=\".\" colname=\"c6\"\u003e\u003cp\u003e2.93 [2.68\u0026ndash;3.18]*\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\" morerows=\"4\" rowspan=\"5\"\u003e\u003cp\u003e\u003cb\u003ePancreas\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003ecarrier\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e1.0%\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e3.87 [1.11\u0026ndash;6.63]\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e14.2 [8.70\u0026ndash;19.7]*\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e9.83 [5.92\u0026ndash;13.7]*\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eFH_cancer\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e20%\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e0.77 [0.66\u0026ndash;0.88]\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e2.53 [2.36\u0026ndash;2.71]*\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e1.63 [1.54\u0026ndash;1.73]*\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eFH_pancreas\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e2%\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e4.10 [1.74\u0026ndash;7.46]\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e8.54 [5.31\u0026ndash;11.8]*\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e7.86 
[4.94\u0026ndash;10.8]*\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eNOD\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e13%\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e0.78 [0.55\u0026ndash;1.07]\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e2.53 [2.36\u0026ndash;2.71]*\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e2.08 [1.94\u0026ndash;2.23]*\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eNOD60\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e6%\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e1.18 [0.74\u0026ndash;1.63]\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e4.06 [3.86\u0026ndash;4.25]*\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e3.13 [2.75\u0026ndash;3.51]*\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e\u003cb\u003eStomach\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e\u003cem\u003eHelicobacter pylori\u003c/em\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e2%\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e1.41 [0.00\u0026ndash;4.22]\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e4.71 [0.29\u0026ndash;9.13]\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e3.58 [1.96\u0026ndash;5.20]*\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u0026nbsp;\u003c/td\u003e\u003ctd 
align=\"left\" colname=\"c2\"\u003e\u003cp\u003eFH_cancer\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e20%\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e0.82 [0.33\u0026ndash;1.32]\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e1.82 [1.54\u0026ndash;2.10]*\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e1.39 [1.12\u0026ndash;1.65]*\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u0026nbsp;\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eFH_stomach\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e1%\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e6.56 [3.51\u0026ndash;9.61]\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e6.58 [0.00\u0026ndash;14.2]\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e6.95 [3.23\u0026ndash;10.7]\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003c/tbody\u003e\u003c/colgroup\u003e\u003c/table\u003e\u003c/div\u003e\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\n\u003ch3\u003eFeatures contributing to model performance\u003c/h3\u003e\n\u003cp\u003eAcross cancer types, top-ranked features varied and often included known risk factors or related clinical conditions (Figure S1). 
For example, cirrhosis and hepatitis were among the most influential features for liver cancer, while pancreatic cysts and other pancreatic disorders were highly ranked for pancreatic cancer.\u003c/p\u003e\u003cp\u003eIn the breast cancer model, the most influential features included \u003cem\u003ecarcinoma in situ of the breast\u003c/em\u003e and \u003cem\u003eacquired absence of breast\u003c/em\u003e, both of which appeared in a subset of individuals prior to their index dates. Specifically, \u003cem\u003ecarcinoma in situ\u003c/em\u003e was observed in 135 cases and 181 controls, and \u003cem\u003eacquired absence of breast\u003c/em\u003e in 62 cases and 128 controls\u0026mdash;together representing 4.7% of the breast cancer case-control cohort. Although these features were not included in the case definition for invasive breast cancer and were present in both cases and controls, they were consistently ranked among the top contributors across cross-validation folds.\u003c/p\u003e"},{"header":"Discussion","content":"\u003cp\u003e Early diagnosis of cancer remains a major clinical challenge, particularly for cancers lacking effective screening guidelines. Existing diagnostic modalities are often limited by availability, cost, or invasiveness, making widespread implementation difficult.\u003csup\u003e\u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e20\u003c/span\u003e\u003c/sup\u003e Identifying high-risk individuals who would benefit most from targeted screening could significantly improve early detection and reduce cancer-related mortality. In this study, we evaluated the potential of an EHR-based predictive model to identify high-risk individuals across multiple cancer types, comparing its performance to known risk factors and current clinical guidelines. Our results show that the model stratifies high-risk populations, either as a standalone tool or in combination with existing approaches. 
A model with this capability can identify high-lift subgroups who could be triaged to more intensive screening, even in the absence of traditional risk factors. These findings establish an important benchmark for translating future predictive modeling efforts into population-level cancer prevention and early detection strategies, highlighting the potential utility of EHR-based models in the era of AI-driven precision medicine.\u003c/p\u003e\u003cp\u003eOur study used lift as a primary metric alongside AUROC to evaluate clinical utility. While AUROC is a widely used measure of model discrimination, it does not necessarily reflect a model’s ability to identify actionable high-risk groups for targeted screening. In our analysis, lift and AUROC were not always aligned: for example, prostate cancer had the highest AUROC, yet liver cancer showed the greatest lift. This illustrates that higher discrimination does not always translate into better risk enrichment across different cancers, because lift depends not only on model performance but also on disease prevalence and the distribution of predicted risk. However, when comparing alternative models for the same cancer, improvements in AUROC may be more likely to yield gains in lift. 
These findings highlight the importance of incorporating risk enrichment metrics such as lift alongside traditional accuracy measures when evaluating predictive models for stratified prevention strategies.\u003c/p\u003e\u003cp\u003eHowever, a key challenge in applying advanced predictive models is interpretability.\u003csup\u003e\u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e,\u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e22\u003c/span\u003e\u003c/sup\u003e While machine learning algorithms such as XGBoost used in this study are well known for capturing complex, nonlinear interactions among features and often provide significantly improved performance compared to linear models such as logistic regression (Figure S2), their complexity can make model outputs difficult to interpret and trust. Our SHAP analysis illustrated this complexity by highlighting how multiple interacting features contributed to model predictions. For example, in liver cancer, where the EHR model achieved the greatest lift among all cancer types evaluated (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003e), the top five features were primarily related to hepatitis and cirrhosis, which were also used as established risk factors in our comparisons. However, the superior performance of the EHR model suggests it improves risk stratification not only by capturing these known associations but also by integrating additional, non-obvious features identified through SHAP analysis. In the case of breast cancer, \u003cem\u003ecarcinoma in situ of the breast\u003c/em\u003e and \u003cem\u003eacquired absence of breast\u003c/em\u003e were consistently among the top-ranked predictive features. 
These features were present in both cases and controls (e.g., 135 cases and 181 controls had \u003cem\u003ecarcinoma in situ\u003c/em\u003e; 62 cases and 128 controls had \u003cem\u003eacquired absence of breast\u003c/em\u003e), collectively accounting for 4.7% of the breast cancer case-control cohort. Given their substantial presence in controls as well as cases, their predictive contribution likely arises not from their presence alone, but from their interaction with other clinical variables captured by the model. These findings demonstrate how EHR-based models can leverage a broad spectrum of structured clinical data to identify meaningful predictors of cancer risk, many of which may be overlooked by traditional rule-based or linear approaches.\u003c/p\u003e\u003cp\u003eWhile model interpretability has traditionally relied on feature attribution methods\u003csup\u003e\u003cspan citationid=\"CR23\" class=\"CitationRef\"\u003e23\u003c/span\u003e\u003c/sup\u003e such as SHAP, LIME, Integrated Gradients, and DeepLIFT, recent advances in AI, particularly in generative models capable of human-like reasoning, open new possibilities for more intuitive and interactive forms of explainability.\u003csup\u003e\u003cspan citationid=\"CR24\" class=\"CitationRef\"\u003e24\u003c/span\u003e,\u003cspan citationid=\"CR25\" class=\"CitationRef\"\u003e25\u003c/span\u003e\u003c/sup\u003e These emerging approaches may enable richer insights into model behavior by aligning explanations more closely with how clinicians reason through decisions. 
As predictive models are increasingly applied in medicine, developing advanced explainable AI techniques that enhance human understanding will be critical for fostering trust, supporting clinical adoption, and ensuring responsible deployment.\u003c/p\u003e\u003cp\u003eDespite the promise of EHR-based predictive models, there are inherent limitations to using EHR data for cancer prediction or for any clinical modeling task.\u003csup\u003e\u003cspan citationid=\"CR26\" class=\"CitationRef\"\u003e26\u003c/span\u003e,\u003cspan citationid=\"CR27\" class=\"CitationRef\"\u003e27\u003c/span\u003e\u003c/sup\u003e These include issues of data completeness, misclassification, and potential false positives due to inaccuracies in documentation or coding. While manual chart review is often necessary to ensure phenotypic fidelity, emerging HIPAA-compliant large language model-based tools, such as those used in this study, may soon enable semi-automated validation pipelines that support clinical data curation at scale. In this study, we used prompt-based GPT classification to categorize cancer-related concepts, a strategy that could be extended to broader EHR validation efforts.\u003c/p\u003e\u003cp\u003eIt is important to emphasize that this study focused on baseline EHR model performance using only structured condition occurrence data within the OMOP framework. There remains substantial opportunity to enhance model accuracy and clinical relevance by incorporating additional structured data, such as laboratory results, demographics, procedures, and observations, as well as emerging multi-omics data, including genomics, proteomics, methylation, and exposomics, the last of which accounts for environmental exposures. 
As these data sources become more available and integrable, future EHR-based models can achieve greater predictive power and biological insight.\u003c/p\u003e\u003cp\u003eWhile our study demonstrates the potential of EHR-based models to improve early cancer diagnosis, any predictive tool used to guide screening decisions must be evaluated in the context of overdiagnosis and overtreatment.\u003csup\u003e\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e\u003c/sup\u003e Identifying individuals at elevated risk does not necessarily translate to clinical benefit, particularly for cancers with indolent or slow-progressing courses. Future work should incorporate downstream clinical outcomes to assess whether EHR-based risk stratification leads to net benefit in terms of survival, quality of life, and healthcare resource utilization.\u003c/p\u003e\u003cp\u003eIn conclusion, EHR-based predictive models offer a non-invasive, scalable approach to identifying individuals at elevated cancer risk, with the potential for meaningful clinical impact in early detection. However, this promise must be balanced with comprehensive evaluation of downstream effects to ensure a net benefit in real-world applications. With continued improvements in data quality, model interpretability, and integration of diverse data types, these tools can complement existing screening guidelines and inform personalized surveillance strategies. Future work should explore how high-risk individuals identified by such models can be enrolled in risk-adapted screening protocols, including decisions around eligibility, screening frequency, and modality. 
These efforts represent a critical step toward realizing precision prevention at population scale.\u003c/p\u003e"},{"header":"Methods","content":"\u003ch2\u003eAll of Us Research Program database\u003c/h2\u003e\u003cp\u003eThe All of Us Research Program\u003csup\u003e\u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e\u003c/sup\u003e provides a nationally representative dataset designed to reflect the diversity of the U.S. population. Any individual aged 18 or older is eligible to participate. As of August 2025, more than 865,000 participants had enrolled in the program, with over 595,000 having completed the full consent process, which includes agreeing to share EHR data, completing surveys, providing physical measurements, and donating biospecimens to the All of Us biobank.\u003c/p\u003e\u003cp\u003eFor this study, we included participants with available whole genome sequencing (WGS), survey responses, and linked EHR data.\u003c/p\u003e\n\u003ch3\u003eCancer type classification\u003c/h3\u003e\n\u003cp\u003eEHR data from the All of Us Research Program follow the OMOP Common Data Model, where clinical concepts are encoded using standardized concept IDs with defined hierarchical relationships (i.e., ancestors and descendants). To identify potential cancer cases, we extracted all descendant concept names under the OMOP concept ID 443392, which represents \u0026ldquo;malignant neoplastic disease.\u0026rdquo;\u003c/p\u003e\u003cp\u003eWe then developed a GPT-based classification approach to map these medical concept names to predefined cancer types (e.g., colorectal cancer, liver cancer, skin cancer). Classification was performed using the OpenAI API with the gpt-4o model. Starting from an initial prompt, we iteratively refined both the prompt and the cancer type categories based on clinician review of the model\u0026rsquo;s outputs, as illustrated in Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003eA. 
The final prompt and full codebase are publicly available at \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://github.com/jp4147/aou_EHRmodel\u003c/span\u003e\u003cspan address=\"https://github.com/jp4147/aou_EHRmodel\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/p\u003e\u003cdiv id=\"Sec11\" class=\"Section2\"\u003e\u003ch2\u003eCase and control group\u003c/h2\u003e\u003cp\u003eWe identified individuals with any of the concept IDs related to malignant neoplastic disease (i.e., descendants of concept ID 443392) recorded in their medical history. The earliest occurrence of any of these concept IDs was used as the individual\u0026rsquo;s first cancer diagnosis date. Individuals without any of these cancer-related concept IDs were considered potential controls.\u003c/p\u003e\u003cp\u003eWe then restricted both cases and controls to participants with available whole genome sequencing data. Cancer cases were further filtered to exclude individuals whose first diagnosis was labeled as a secondary malignancy or classified as \u0026ldquo;not cancer\u0026rdquo; by the GPT-based prompt.\u003c/p\u003e\u003cp\u003eFor cases, the prediction index date was defined as 12 months prior to the first cancer diagnosis. For controls, the index date was set to 24 months before the last recorded medical condition, to reduce the likelihood of including individuals with undiagnosed cancer. 
To ensure adequate longitudinal medical history, we included only individuals (both cases and controls) with at least five documented medical conditions prior to their respective index dates.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec12\" class=\"Section2\"\u003e\u003ch2\u003eRisk factor identification\u003c/h2\u003e\u003cp\u003eWe evaluated a set of established risk factors to compare against the EHR model, including: age, family history of cancer, smoking history, genetic carrier status, chronic hepatitis B/C, cirrhosis, new-onset diabetes, and \u003cem\u003eHelicobacter pylori\u003c/em\u003e infection.\u003c/p\u003e\u003cp\u003eAge was calculated relative to each individual\u0026rsquo;s index date using their recorded date of birth. For conditions such as chronic hepatitis B/C, cirrhosis, new-onset diabetes, and \u003cem\u003eH. pylori\u003c/em\u003e infection, we queried the All of Us Controlled Tier Dataset (v8) using curated sets of high-level OMOP concept IDs associated with each condition. We retrieved all descendant concepts via the cb_criteria and cb_search_all_events tables and flagged individuals with any matching concept recorded in their medical history. Individuals were classified as having each risk factor if they had a first recorded diagnosis prior to their index date.\u003c/p\u003e\u003cp\u003eFor new-onset diabetes, individuals were classified as positive if they had a first diagnosis of type 2 diabetes before the index date and no record of prior diabetes-related medications. Family history of cancer and smoking history were derived from participant survey data.\u003c/p\u003e\u003cp\u003eGenetic carriers were identified using the ClinVar database. We restricted the analysis to variants classified as \u003cem\u003epathogenic\u003c/em\u003e or \u003cem\u003elikely pathogenic\u003c/em\u003e, submitted by multiple submitters with no conflicts, or reviewed by an expert panel. 
A total of 4,738 individuals were identified as carriers of pathogenic variants in genes associated with cancer predisposition syndromes, including \u003cem\u003eATM\u003c/em\u003e, \u003cem\u003eBRCA1\u003c/em\u003e, \u003cem\u003eBRCA2\u003c/em\u003e, \u003cem\u003ePALB2\u003c/em\u003e, \u003cem\u003eSTK11\u003c/em\u003e (Peutz-Jeghers), \u003cem\u003eCDKN2A\u003c/em\u003e (FAMMM), and Lynch syndrome genes (\u003cem\u003eEPCAM\u003c/em\u003e, \u003cem\u003eMLH1\u003c/em\u003e, \u003cem\u003eMSH2\u003c/em\u003e, \u003cem\u003eMSH6\u003c/em\u003e, \u003cem\u003ePMS2\u003c/em\u003e), as well as hereditary pancreatitis genes (\u003cem\u003ePRSS1/2\u003c/em\u003e, \u003cem\u003eCTRC\u003c/em\u003e).\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec13\" class=\"Section2\"\u003e\u003ch2\u003eModel training\u003c/h2\u003e\u003cp\u003eWe conducted five-fold stratified cross-validation separately for each cancer type. While the number of cases varied by cancer type, we used a consistent control group across all models. In each fold, the data were split into a training set (80%) and a test set (20%), maintaining the case-control ratio.\u003c/p\u003e\u003cp\u003eFor each individual, medical condition data were converted into a binary sparse matrix, where each feature represented the presence or absence of a specific condition prior to the index date. Using these features, we trained gradient-boosted decision tree models with XGBoost for each fold.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec14\" class=\"Section2\"\u003e\u003ch2\u003eEvaluation setup\u003c/h2\u003e\u003cp\u003eWhile the training sets were used to develop the EHR models, the test sets served as a proxy for a hypothetical general population to evaluate the clinical utility of each model. 
We compared the effectiveness of the EHR model and known risk factors in identifying high-risk individuals.\u003c/p\u003e\u003cp\u003eTo quantify clinical utility, we used lift, a metric that measures the enrichment of true cases within the high-risk group relative to the general population, that is, the proportion of true cases in the high-risk group divided by the proportion of true cases in the whole evaluation population. A higher lift indicates that the high-risk group contains a greater concentration of true cancer cases, suggesting greater potential effectiveness if this group were targeted for confirmatory screening, such as biopsy or imaging.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec15\" class=\"Section2\"\u003e\u003ch2\u003eFeature importance extraction\u003c/h2\u003e\u003cp\u003eTo identify features contributing to model predictions, we computed SHAP (SHapley Additive exPlanations) values for each individual in the test set across all five cross-validation folds. We hypothesized that features whose SHAP values show inconsistent directionality across patients may reflect unreliable contributions to risk prediction. Accordingly, for each fold, we aggregated SHAP values by summing the signed contributions of each feature across individuals, so that contributions of opposite sign cancel and features with inconsistent directionality receive a smaller aggregate. Features were then ranked within each fold based on their aggregate contributions, and final rankings were obtained by averaging ranks across folds to identify consistently important predictors.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec16\" class=\"Section2\"\u003e\u003ch2\u003eStatistical analysis\u003c/h2\u003e\u003cp\u003eWe reported 95% confidence intervals to represent variability in performance metrics across the five cross-validation folds. To assess statistical significance between groups (e.g., comparing model performance or lift values), we used the Mann\u0026ndash;Whitney U test. P-values less than 0.05 were considered statistically significant.\u003c/p\u003e\u003c/div\u003e"},{"header":"References","content":"\u003col\u003e\u003cli\u003e\u003cspan\u003eCrosby D et al (2022) Early detection of cancer. 
Science 375:eaay9040\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eSmith RA et al (2016) Cancer screening in the United States, 2016: A review of current American Cancer Society guidelines and current issues in cancer screening. CA Cancer J Clin 66:96\u0026ndash;114\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003ePatel SG et al (2022) Updates on Age to Start and Stop Colorectal Cancer Screening: Recommendations From the U.S. Multi-Society Task Force on Colorectal Cancer. Am J Gastroenterol 117:57\u0026ndash;69\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eParekh A et al (2022) The 50-Year Journey of Lung Cancer Screening: A Narrative Review. Cureus 14:e29381\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eJung AW et al (2024) Multi-cancer risk stratification based on national health data: a retrospective modelling and validation study. Lancet Digit Health 6:e396\u0026ndash;e406\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eHuguet N et al (2020) Using Electronic Health Records in Longitudinal Studies: Estimating Patient Attrition. Med Care 58(Suppl 1):S46\u0026ndash;S52\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eYe J, Woods D, Jordan N, Starren J (2024) The role of artificial intelligence for the application of integrating electronic health records and patient-generated data in clinical decision support. AMIA Jt Summits Transl Sci Proc 2024:459\u0026ndash;467\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eRose C, Chen JH (2024) Learning from the EHR to implement AI in healthcare. NPJ Digit Med 7:330\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eYang X et al (2022) A large language model for electronic health records. NPJ Digit Med 5:194\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eZhang B, Shi H, Wang H (2023) Machine Learning and AI in Cancer Prognosis, Prediction, and Treatment Selection: A Critical Approach. 
J Multidiscip Healthc 16:1779\u0026ndash;1791\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eGoldstein BA, Navar AM, Pencina MJ, Ioannidis JP (2017) Opportunities and challenges in developing risk prediction models with electronic health records data: a systematic review. J Am Med Inf Assoc 24:198\u0026ndash;208\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eMishra AK, Chong B, Arunachalam SP, Oberg AL, Majumder S (2024) Machine Learning Models for Pancreatic Cancer Risk Prediction Using Electronic Health Record Data-A Systematic Review and Assessment. Am J Gastroenterol 119:1466\u0026ndash;1482\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eKlann JG, Joss MA, Embree K, Murphy SN (2019) Data model harmonization for the All Of Us Research Program: Transforming i2b2 data into the OMOP common data model. PLoS ONE 14:e0212463\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eWang L et al (2025) A scoping review of OMOP CDM adoption for cancer research using real world data. NPJ Digit Med 8:189\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eElnashar A, White J, Schmidt DC (2025) Enhancing structured data generation with GPT-4o evaluating prompt efficiency across prompt styles. Front Artif Intell 8:1558938\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eRamirez AH et al (2022) The All of Us Research Program: data quality, utility, and diversity. Patterns 3\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eChen T, Guestrin C (2016) XGBoost: A scalable tree boosting system. In \u003cem\u003eProceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining\u003c/em\u003e 785\u0026ndash;794\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eLau K, Hart GR, Deng J (2025) Predicting time-to-first cancer diagnosis across multiple cancer types. 
Sci Rep 15:24790\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eBen-Aharon I et al (2023) Early-onset cancer in the gastrointestinal tract is on the rise\u0026mdash;evidence and implications. Cancer Discov 13:538\u0026ndash;551\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eBallard DH et al (2021) The Role of Imaging in Health Screening: Overview, Rationale of Screening, and Screening Economics. Acad Radiol 28:540\u0026ndash;547\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eStiglic G et al (2020) Interpretability of machine learning-based prediction models in healthcare. WIREs Data Min Knowl Discov 10\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eLinardatos P, Papastefanopoulos V, Kotsiantis S (2020) Explainable AI: A Review of Machine Learning Interpretability Methods. Entropy (Basel) 23\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eShobeiri S (2024) Enhancing Transparency in Healthcare Machine Learning Models Using SHAP and DeepLIFT: A Methodological Approach. Iraqi J Inform Communication Technol 7:56\u0026ndash;72\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eKrause S, Stolzenburg F (2024) From data to commonsense reasoning: the use of large language models for explainable AI. \u003cem\u003earXiv preprint arXiv:2407.03778\u003c/em\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003ePal NR (2020) In Search of Trustworthy and Transparent Intelligent Systems With Human-Like Cognitive and Reasoning Capabilities. Front Robot AI 7\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eCarrington JM, Effken JA (2011) Strengths and limitations of the electronic health record for documenting clinical events. Comput Inf Nurs 29:360\u0026ndash;367\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eKim MK, Rouphael C, McMichael J, Welch N, Dasarathy S (2024) Challenges in and Opportunities for Electronic Health Record-Based Data Analysis and Interpretation. 
Gut Liver 18:201\u0026ndash;208\u003c/span\u003e\u003c/li\u003e\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":true,"hideJournal":true,"highlight":"","institution":"","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true},"keywords":"","lastPublishedDoi":"10.21203/rs.3.rs-7426088/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-7426088/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003eElectronic health records (EHRs) offer a promising, scalable approach for identifying individuals at high risk for targeted cancer screening, but the absence of clinical benchmarks has limited their adoption. We evaluated the clinical utility of EHR-based predictive models for 12-month cancer risk across eight major cancers\u0026mdash;breast, lung, colorectal, prostate, ovarian, liver, pancreatic, and stomach\u0026mdash;using longitudinal data from over 865,000 participants in the All of Us Research Program, which uniquely integrates EHR, genomic, and survey data. Compared to traditional risk factors (e.g., age, family history, genetic variants), EHR-based models significantly improved identification of high-risk groups. 
The models achieved a 3- to 6-fold increase in risk enrichment for breast, colorectal, pancreatic, and stomach cancers relative to traditional risk factors alone. For liver cancer, the model achieved the highest absolute lift (27.6-fold compared to the general population), although the relative improvement over known risk factors was more modest (1.68-fold). These findings establish practical benchmarks for EHR-based cancer risk prediction and provide insights for integrating such models into clinical workflows to enable more precise and scalable early detection strategies.\u003c/p\u003e","manuscriptTitle":"Clinical utility of EHR-based predictive models for identifying high-risk individuals in early cancer detection","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2025-09-04 00:18:27","doi":"10.21203/rs.3.rs-7426088/v1","editorialEvents":[{"type":"communityComments","content":0}],"status":"published","journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"282c85ec-bcea-497d-a729-36979c9e6c8c","owner":[],"postedDate":"September 4th, 2025","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"posted","subjectAreas":[{"id":53572027,"name":"Biological sciences/Cancer/Cancer prevention"},{"id":53572028,"name":"Health sciences/Medical research/Translational research"}],"tags":[],"updatedAt":"2026-01-14T19:30:27+00:00","versionOfRecord":[],"versionCreatedAt":"2025-09-04 
00:18:27","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-7426088","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-7426088","identity":"rs-7426088","version":["v1"]},"buildId":"8U1c8b4HqxoKbykW_rLl7","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}

