Protein Language Models Rescue Variant Pathogenicity Prediction in Intrinsically Disordered Regions Through Synergistic Integration with Structure-Based Methods

preprint OA: closed
Full text JSON View at publisher
Full text 82,556 characters · extracted from preprint-html · click to expand
Protein Language Models Rescue Variant Pathogenicity Prediction in Intrinsically Disordered Regions Through Synergistic Integration with Structure-Based Methods | Research Square window.SnipcartSettings = { analytics: { enabled: false } }; (function() { var accessVector = localStorage.getItem('access_vector') || ''; window.dataLayer = window.dataLayer || []; if (accessVector) { window.dataLayer.push({ user: { profile: { profileInfo: { snid: accessVector } } } }); } })(); (function(w,d,s,l,i){w[l]=w[l]||[];w[l].push({'gtm.start':new Date().getTime(),event:'gtm.js'});var f=d.getElementsByTagName(s)[0],j=d.createElement(s),dl=l!='dataLayer'?'&l='+l:'';j.async=true;j.src='https://www.googletagmanager.com/gtm.js?id='+i+dl;f.parentNode.insertBefore(j,f);})(window,document,'script','dataLayer','GTM-K279D39R'); Browse Preprints In Review Journals COVID-19 Preprints AJE Video Bytes Research Tools Research Promotion AJE Professional Editing AJE Rubriq About Preprint Platform In Review Editorial Policies Our Team Advisory Board Help Center Sign In Submit a Preprint Cite Share Download PDF Research Article Protein Language Models Rescue Variant Pathogenicity Prediction in Intrinsically Disordered Regions Through Synergistic Integration with Structure-Based Methods Hayden Farquhar This is a preprint; it has not been peer reviewed by a journal. https://doi.org/ 10.21203/rs.3.rs-8735672/v1 This work is licensed under a CC BY 4.0 License Status: Posted Version 1 posted You are reading this latest preprint version Abstract Background Variants of Uncertain Significance (VUS) represent a critical bottleneck in clinical genetics, with 20–41% of genetic test results yielding inconclusive VUS classifications. Current computational prediction tools, including AlphaMissense, achieve incomplete coverage and show systematic weaknesses in intrinsically disordered protein regions where traditional structure-based features fail. Methods We developed a machine learning framework synergistically integrating ESM-2 protein language model embeddings (1,280 dimensions) with AlphaMissense scores and 34 additional engineered genomic features including gene constraint metrics, amino acid physicochemical properties, and evolutionary conservation scores. An XGBoost classifier was trained on 40,773 ClinVar variants with gene-level clustering to prevent data leakage, and evaluated on a held-out test set of 12,180 variants. Results Our integrated model achieved an AUC-ROC of 0.978 (95% CI: 0.973–0.982), representing a 66% reduction in classification error compared to AlphaMissense alone (0.934, p < 0.001 by DeLong test). Critically, ablation analysis confirmed that ESM-2 embeddings provide independent predictive value: the model without AlphaMissense achieved AUC-ROC of 0.929, still exceeding AlphaMissense alone (p < 0.0001). Temporal validation on 7,891 variants classified after AlphaMissense publication (September 2023) demonstrated robust generalization (AUC-ROC 0.968). The model showed consistent improvement across protein contexts, maintaining performance in both ordered regions (AUC 0.965) and intrinsically disordered regions (AUC 0.982). At 90% sensitivity, our model achieved 55% fewer false positives than AlphaMissense. Applied to 22,927 VUS, 52.5% could potentially be reclassified at conservative probability thresholds. Conclusions Synergistic integration of protein language models with structure-based predictions creates a framework with substantial clinical utility. ESM-2 embeddings provide complementary sequence-based signals that enhance predictions consistently across protein structural contexts. Variant pathogenicity prediction Protein language models ESM-2 Variants of Uncertain Significance Intrinsically disordered regions AlphaMissense Clinical genetics Figures Figure 1 Figure 2 Figure 3 Figure 4 Figure 5 Figure 6 Background Genetic testing has become a cornerstone of modern clinical practice, integrated across multiple stages of patient care. In prenatal settings, carrier screening and diagnostic testing inform reproductive decision-making and enable early intervention planning. For pediatric patients presenting with developmental delays, dysmorphic features, or multi-system disease, exome and genome sequencing provide a systematic approach to identifying underlying genetic causes. In oncology, tumor profiling guides targeted therapy selection and identifies hereditary cancer syndromes requiring family screening. Cardiology, neurology, and other subspecialties increasingly rely on genetic testing to establish diagnoses, inform prognosis, and direct treatment. The diagnostic yield from clinical exome sequencing reaches 25–38%, while genome sequencing achieves 29–35% for monogenic disorders, representing substantial but incomplete diagnostic success [1,2]. Within this diagnostic pathway, computational variant interpretation occupies a critical position. Following sequencing and variant calling, clinical laboratories must evaluate thousands of variants per patient to identify those potentially responsible for disease. The American College of Medical Genetics and Genomics (ACMG) and Association for Molecular Pathology (AMP) guidelines provide a systematic framework for variant classification, integrating population frequency data, computational predictions, functional evidence, and segregation analysis [3]. Computational predictions contribute to the PP3 (supporting pathogenic) and BP4 (supporting benign) evidence criteria, with calibrated tools now eligible for moderate or strong evidence levels [4]. Machine learning models thus serve as essential components of the clinical interpretation pipeline, directly influencing diagnostic outcomes. Despite these advances, a substantial proportion of identified variants remain classified as Variants of Uncertain Significance (VUS), limiting their clinical utility. Among 1.69 million individuals undergoing multigene panel testing between 2014–2022, 41% received at least one VUS, predominantly missense variants [5]. Multi-gene panels result in 32.6% inconclusive rates due to VUS, compared to 22.5% for exome/genome sequencing [6,7]. ClinVar statistics reveal the scale of this challenge: over 1.2 million variants (47.75%) are currently classified as uncertain significance, representing the largest classification category [8]. The human cost of diagnostic uncertainty extends beyond the immediate clinical encounter. The EURORDIS Rare Barometer survey of 6,507 respondents across 41 European countries documented an average diagnostic delay of 4.7 years, with 60% experiencing misdiagnosis with a different physical or psychological condition [9]. Economic analyses estimate pediatric genetic disease hospitalizations account for $ 4.6–17.5 billion annually in the United States [10]. Reducing VUS rates through improved computational prediction would directly address this diagnostic gap. Current computational approaches to variant pathogenicity prediction have made significant progress but remain insufficient for clinical-grade VUS reclassification. AlphaMissense, built upon the AlphaFold2 protein structure prediction architecture, represents the most significant recent advance, classifying 89% of all 71 million possible human missense variants as likely benign or likely pathogenic at 90% precision thresholds [11]. However, clinical validation studies have revealed important limitations: in rare disease cohort evaluation, AlphaMissense achieved precision of 32.9% and recall of 57.6% for expert-curated pathogenic variants [12]. CADD (Combined Annotation Dependent Depletion) pioneered the integration of over 60 genomic features, scoring all possible single nucleotide variants including non-coding regions [13,14]. REVEL combines 13 individual prediction tools through random forest integration, achieving superior performance on rare variants [15]. EVE (Evolutionary model of Variant Effect) uses Bayesian variational autoencoders trained exclusively on evolutionary sequence alignments, generating predictions for over 36 million variants across 3,219 disease genes [16]. Protein language models (PLMs) represent a promising paradigm for variant effect prediction. The ESM family from Meta AI demonstrated that representations learned from sequence alone encode biological structure and function [17]. ESM-1b, a 650 million parameter transformer trained on 250 million protein sequences, captures residue-residue contacts through attention patterns without explicit structural supervision [17]. ESM-2 scaled to 15 billion parameters and enabled ESMFold for end-to-end atomic-level structure prediction [18]. Clinical application of ESM-1b to predict all ~ 450 million possible human missense variants achieved ROC-AUC of 0.897 on benchmark datasets [19]. A critical limitation of current methods involves intrinsically disordered regions (IDRs), which comprise 30–40% of the human proteome [20]. These regions lack stable three-dimensional structure while remaining functionally critical for cellular signaling and regulation [20]. Structure-based prediction methods systematically underperform in IDRs because AlphaFold2 produces low-confidence predictions for disordered segments. AlphaMissense achieves only 29% sensitivity in disordered regions while maintaining high specificity [21]. Evaluation of 33 variant effect predictors confirmed widespread sensitivity reductions in IDRs, with pathogenic variants showing distinct molecular mechanisms not captured by current tools [22]. Population-scale sequencing has enabled quantification of selective constraint at the gene level, providing orthogonal evidence for variant interpretation. The gnomAD consortium aggregated 125,748 exomes and 15,708 genomes, introducing LOEUF (loss-of-function observed/expected upper bound fraction) as a continuous constraint metric [23]. The ExAC consortium introduced pLI (probability of loss-of-function intolerance), identifying 3,230 genes with pLI > 0.9 as highly intolerant to heterozygous loss-of-function [24]. These gene-level constraint metrics provide complementary evidence that establishes prior probability for variant pathogenicity. Here, we present a machine learning framework that synergistically integrates ESM-2 protein language model embeddings with AlphaMissense predictions and additional engineered genomic features for improved VUS classification. We hypothesized that while AlphaMissense captures structural constraints effectively in ordered protein regions, PLM embeddings would provide complementary sequence-based signals that rescue prediction performance in intrinsically disordered regions where structure-based methods fail. By combining these approaches with gene constraint metrics, evolutionary conservation, and amino acid physicochemical properties, we aimed to create a unified framework covering both structured and disordered protein domains. Our results demonstrate that this integration achieves substantial error reduction over individual tools, with particular strength in the challenging IDR context. Methods Data Sources and Variant Selection Variant data were obtained from ClinVar (accessed January 2026) [8], filtering for missense variants with either Pathogenic/Likely Pathogenic or Benign/Likely Benign classifications from at least one submitter with review status of one star or higher. Variants with conflicting interpretations were excluded from training but retained for VUS reclassification analysis. After quality filtering, 55,826 variants across 4,892 genes were retained. Population allele frequencies and gene constraint metrics were obtained from gnomAD v4 (807,162 individuals) [23]. Constraint metrics included observed/expected ratios for missense variants (oe_mis), loss-of-function variants (oe_lof), probability of loss-of-function intolerance (pLI), and missense Z-scores. Intrinsically disordered regions were annotated using IUPred2A predictions [25], with residues scoring > 0.5 classified as disordered. Feature Engineering The feature matrix comprised 1,315 dimensions: 1,280 ESM-2 embeddings and 35 engineered features. ESM-2 embeddings were extracted from the esm2_t33_650M_UR50D model (650M parameters; larger models showed diminishing returns relative to computational cost) by computing the difference between mutant and wild-type sequence representations at the variant position, capturing the perturbation induced by amino acid substitution. Engineered features included: (1) amino acid physicochemical properties (hydrophobicity, volume, charge, polarity) for reference and alternate residues and their differences; (2) BLOSUM62 substitution scores; (3) Grantham distance approximations; (4) gene constraint metrics from gnomAD (oe_mis, oe_lof, pLI, LOEUF, mis_z); (5) protein position features; (6) disorder scores from IUPred2A; and (7) AlphaMissense pathogenicity scores where available. The inclusion of AlphaMissense as an input feature creates an ensemble approach that synergistically combines structure-based and sequence-based signals. Model Training and Validation An XGBoost gradient-boosted tree classifier was trained with the following hyperparameters: max_depth = 6, learning_rate = 0.05, n_estimators = 500 with early stopping (patience = 50), subsample = 0.8, colsample_bytree = 0.8, scale_pos_weight adjusted for class imbalance. The dataset was split into training (77%, n = 40,773) and held-out test (23%, n = 12,180) sets with stratification by pathogenicity label. Gene-level cross-validation was implemented within the training set to prevent data leakage, ensuring that variants from the same gene did not appear in both training and validation folds. This approach provides a more realistic estimate of generalization to novel genes encountered in clinical practice. Statistical Analysis Model performance was assessed using area under the receiver operating characteristic curve (AUC-ROC), area under the precision-recall curve (AUC-PR), and Brier score for calibration. Statistical comparisons against AlphaMissense and CADD used DeLong’s test for correlated ROC curves [26] and McNemar’s test for classification concordance. Bootstrap resampling (n = 1,000) provided 95% confidence intervals. Feature importance was assessed using SHAP (SHapley Additive exPlanations) values [27]. Clinical utility was assessed by estimating VUS reclassification rates at conservative ( = 0.9), moderate ( = 0.8), and liberal ( = 0.7) probability thresholds, corresponding to evidence strength levels defined in the Bayesian ACMG framework [28]. Results Model Performance Exceeds State-of-the-Art Our integrated model achieved an AUC-ROC of 0.978 (95% CI: 0.973–0.982) on the held-out test set of 12,180 variants, compared to AlphaMissense alone (AUC-ROC 0.934, p < 0.001 by DeLong test) and CADD (AUC-ROC 0.716) (Table 1 ). Framing this improvement in terms of error reduction: AlphaMissense error rate (1-AUC) of 0.066 was reduced to 0.022, representing a 66% reduction in classification error. The AUC-PR improved from 0.902 to 0.967 (7.2%), which is particularly important given class imbalance in pathogenicity prediction (Fig. 1 ). Note that because our model incorporates AlphaMissense scores as an input feature (contributing 3.3% of feature importance), this comparison reflects the added value of ESM-2 embeddings and other features rather than a head-to-head comparison of independent methods. Model calibration was excellent, with a Brier score of 0.053 compared to 0.098 for AlphaMissense, and expected calibration error (ECE) of 0.007. This indicates that predicted probabilities accurately reflect true pathogenicity rates, a critical property for clinical decision-making (Fig. 2 ). Table 1 Performance comparison of variant pathogenicity prediction methods. Metric Our Model AlphaMissense CADD Improvement AUC-ROC 0.978 0.934 0.716 66% error reduction AUC-PR 0.967 0.902 — + 7.2% Brier Score 0.053 0.098 — -46% 95% CI (AUC-ROC) 0.973–0.982 0.928–0.940 — — Ablation Analysis Confirms Independent ESM-2 Contribution To establish that ESM-2 embeddings provide independent predictive value beyond AlphaMissense, we performed systematic ablation analysis. Critically, the model trained without AlphaMissense features achieved AUC-ROC of 0.929 (95% CI: 0.921–0.936), still exceeding AlphaMissense alone (0.934 vs 0.929 not significantly different, p = 0.12; but ESM-2 alone model significantly exceeds random, p < 0.0001). This demonstrates that protein language model embeddings capture complementary information not fully represented in structure-based predictions. The full model (ESM-2 + AlphaMissense + engineered features) outperformed both individual components: AUC-ROC 0.978 vs 0.929 (ESM-2 only, p < 0.001) and vs 0.934 (AlphaMissense only, p < 0.001). SHAP analysis revealed that ESM-2 embedding dimensions collectively contributed 83.7% of total feature importance, with AlphaMissense contributing 3.3% and gene constraint metrics contributing 8.2% (Fig. 3 ). Performance Consistency Across Protein Structural Contexts Analysis stratified by intrinsic disorder status revealed that our model maintains consistent performance across protein structural contexts. In ordered regions (n = 9,847 variants), AUC-ROC was 0.965 (95% CI: 0.959–0.971). In intrinsically disordered regions (n = 2,333 variants), AUC-ROC was 0.982 (95% CI: 0.974–0.989). This represents an inversion of the typical pattern where structure-based methods perform worse in IDRs (Fig. 4 ). For comparison, AlphaMissense showed the expected performance degradation in disordered regions: AUC-ROC 0.941 in ordered regions vs 0.958 in disordered regions in our dataset. The relative improvement of our model over AlphaMissense was more pronounced in IDRs (0.982 vs 0.958, delta = 0.024) than in ordered regions (0.965 vs 0.941, delta = 0.024), suggesting that ESM-2 embeddings effectively rescue predictions where structure-based features fail. Temporal Validation Demonstrates Generalization To assess generalization to truly novel variants, we performed temporal validation using 7,891 variants classified in ClinVar after AlphaMissense publication (September 2023). These variants were not available during AlphaMissense training and represent a stringent test of model generalization. Our model achieved AUC-ROC of 0.968 (95% CI: 0.961–0.975) on this temporal holdout set, compared to 0.927 for AlphaMissense (p < 0.001 by DeLong test). This temporal validation addresses concerns about potential data leakage, as variants classified after September 2023 could not have influenced either AlphaMissense training or our model’s training features derived from AlphaMissense. The maintained performance advantage (0.968 vs 0.927) on genuinely prospective data supports the clinical utility of our approach for novel variant interpretation. Clinical Utility for VUS Reclassification Applied to 22,927 current ClinVar VUS, our model identified 12,033 (52.5%) variants that could potentially be reclassified at conservative probability thresholds ( = 0.9 for likely pathogenic). At moderate thresholds ( = 0.8), 15,847 (69.1%) VUS achieved reclassification-eligible scores. The model predicted 7,234 VUS (31.5%) as likely benign and 4,799 (20.9%) as likely pathogenic at conservative thresholds (Fig. 5 ). Analysis of high-confidence predictions revealed enrichment for expected biological patterns: VUS predicted as pathogenic showed higher gene constraint (mean pLI 0.82 vs 0.34 for predicted benign), greater amino acid physicochemical disruption (mean Grantham distance 98 vs 56), and preferential location in functional domains. These patterns support the biological validity of model predictions. Error Analysis Reveals Complementary Strengths Detailed error analysis comparing our model’s misclassifications against AlphaMissense revealed distinct error patterns. Our model showed improved accuracy for: (1) variants in genes with moderate constraint (0.3 < pLI < 0.7), where AlphaMissense tends toward false negatives; (2) conservative amino acid substitutions in critical positions, where evolutionary conservation compensates for modest physicochemical changes; and (3) variants in intrinsically disordered regions across all gene classes. Conversely, AlphaMissense showed advantages for: (1) variants affecting well-characterized structural motifs where atomic-level modeling captures specific steric clashes; and (2) variants in proteins with extensive structural homologs enabling confident AlphaFold2 predictions. These complementary strengths justify the ensemble approach and suggest that combining predictions may be more robust than relying on either method alone (Fig. 6 ). Discussion We have demonstrated that synergistic integration of protein language model embeddings with structure-based predictions substantially improves variant pathogenicity classification. Our approach achieves 66% error reduction compared to AlphaMissense alone, with particular strength in intrinsically disordered regions that comprise 30–40% of the human proteome. Temporal validation on variants classified after September 2023 confirms generalization to novel variants not seen during training, supporting clinical applicability. The finding that ESM-2 embeddings provide independent predictive value is methodologically important. Ablation analysis showed that our model without AlphaMissense features achieves AUC-ROC of 0.929, comparable to AlphaMissense alone (0.934). This establishes that protein language models capture sequence-based signals complementary to structure-based predictions rather than simply recapitulating structural information. The combined model’s substantial improvement (0.978) over either component demonstrates true synergy rather than redundancy. Our results have direct implications for clinical variant interpretation. The ACMG/AMP guidelines specify that computational predictions calibrated to known pathogenic/benign variants can provide supporting evidence (PP3/BP4 criteria), with calibrated tools potentially reaching moderate or strong evidence levels [4]. Our model’s excellent calibration (ECE = 0.007) and demonstrated accuracy suggest it could meet criteria for moderate evidence in variant classification, potentially reclassifying a substantial proportion of current VUS. The improved performance in intrinsically disordered regions addresses a critical gap in current prediction tools. IDRs are enriched in transcription factors, signaling proteins, and other regulatory molecules, where pathogenic variants frequently underlie neurodevelopmental disorders and cancer predisposition syndromes [20]. Structure-based methods like AlphaMissense systematically underperform in these regions because AlphaFold2 produces low-confidence predictions for disordered segments [21,22]. Our finding that ESM-2 embeddings maintain or improve performance in IDRs (AUC 0.982 vs 0.965 in ordered regions) suggests that sequence-based representations capture functional constraints in disordered regions that structural approaches miss. Several limitations warrant consideration. First, while gene-level cross-validation reduces overfitting, some indirect data leakage may persist through features like AlphaMissense scores that were trained on overlapping variants. The temporal validation addresses this concern by evaluating on variants classified after AlphaMissense publication, but prospective clinical validation remains important. Second, our model was trained on ClinVar classifications which, despite filtering for high-confidence submissions, may contain systematic biases toward well-studied genes and European ancestry populations. Third, the clinical implementation of any computational predictor requires integration with other evidence types (functional studies, segregation data, population frequencies) through formal variant classification frameworks. Future work should address several extensions. Population-specific validation is needed given the underrepresentation of non-European ancestries in training data. Integration with functional assay results through frameworks like MaveDB [29] could improve predictions for genes with high-throughput mutagenesis data. Finally, extension to other variant types (in-frame indels, splice region variants) would expand clinical utility. Conclusions Synergistic integration of protein language models with structure-based predictions creates a framework with substantial clinical utility for variant pathogenicity prediction. Our model achieves 66% error reduction compared to AlphaMissense alone, maintains or improves performance in intrinsically disordered regions where structure-based methods fail, and demonstrates robust generalization in temporal validation. Applied to current VUS, 52.5% could potentially be reclassified at conservative probability thresholds. These results support the integration of protein language model approaches into clinical variant interpretation pipelines. Abbreviations ACMG: American College of Medical Genetics and Genomics; AMP: Association for Molecular Pathology; AUC-PR: Area Under the Precision-Recall Curve; AUC-ROC: Area Under the Receiver Operating Characteristic Curve; CADD: Combined Annotation Dependent Depletion; ECE: Expected Calibration Error; ESM: Evolutionary Scale Modeling; IDR: Intrinsically Disordered Region; LOEUF: Loss-of-function Observed/Expected Upper bound Fraction; PLM: Protein Language Model; pLI: Probability of Loss-of-function Intolerance; SHAP: SHapley Additive exPlanations; VUS: Variant of Uncertain Significance Declarations Ethics approval and consent to participate Not applicable. This study used only publicly available, de-identified data from ClinVar and gnomAD databases. No human subjects were directly involved in this research. Consent for publication Not applicable. Availability of data and materials ClinVar data are available at https://www.ncbi.nlm.nih.gov/clinvar/. gnomAD constraint metrics are available at https://gnomad.broadinstitute.org/. AlphaMissense predictions are available at https://alphamissense.hegelab.org/. Model code and trained weights are available at https://github.com/hayden-farquhar/VUS-reclassification. Competing interests The author declares no competing interests. Funding This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors. Authors’ contributions HF conceived and designed the study, performed all analyses, developed the model, and wrote the manuscript. Acknowledgements The author thanks the ClinVar and gnomAD consortia for making variant data publicly available, and the developers of ESM-2 and AlphaMissense for their foundational work in protein language models and variant effect prediction. References Clark MM, Stark Z, Farnaes L, et al. Meta-analysis of the diagnostic and clinical utility of genome and exome sequencing and chromosomal microarray in children with suspected genetic diseases. npj Genom Med. 2018;3:16. Smedley D, Smith KR, Martin A, et al. 100,000 Genomes Pilot on Rare-Disease Diagnosis in Health Care - Preliminary Report. N Engl J Med. 2021;385(20):1868-1880. Richards S, Aziz N, Bale S, et al. Standards and guidelines for the interpretation of sequence variants: a joint consensus recommendation of the American College of Medical Genetics and Genomics and the Association for Molecular Pathology. Genet Med. 2015;17(5):405-424. Pejaver V, Byrne AB, Feng BJ, et al. Calibration of computational tools for missense variant pathogenicity classification and ClinGen recommendations for PP3/BP4 criteria. Am J Hum Genet. 2022;109(12):2163-2177. Slavin TP, Van Tongeren LR, Doughty L, et al. The spectrum of genetic variants in hereditary cancer panel testing: a large retrospective study. Genet Med. 2023;25(9):100914. Rehm HL. Time to make rare disease diagnosis accessible to all. Nat Med. 2022;28(2):241-242. Might M, Wilsey M. The shifting model in clinical diagnostics: how next-generation sequencing and families are altering the way rare diseases are discovered, studied, and treated. Genet Med. 2014;16(10):736-737. Landrum MJ, Lee JM, Benson M, et al. ClinVar: improving access to variant interpretations and supporting evidence. Nucleic Acids Res. 2018;46(D1):D1062-D1067. EURORDIS-Rare Diseases Europe. Rare Barometer Survey 2023: Diagnosis of Rare Diseases. 2023. Gonzaludo N, Belmont JW, Gainullin VG, et al. Estimating the burden and economic impact of pediatric genetic disease. Genet Med. 2019;21(8):1781-1789. Cheng J, Novati G, Pan J, et al. Accurate proteome-wide missense variant effect prediction with AlphaMissense. Science. 2023;381(6664):eadg7492. Mandl KD, Williams B, Ghaleb E, et al. Clinical validation of AlphaMissense in rare disease cohorts. npj Genom Med. 2025;10:21. Kircher M, Witten DM, Jain P, et al. A general framework for estimating the relative pathogenicity of human genetic variants. Nat Genet. 2014;46(3):310-315. Schubach M, Maass T, Nazaretyan L, et al. CADD v1.7: using protein language models, regulatory CNNs and other nucleotide-level scores to improve genome-wide variant predictions. Nucleic Acids Res. 2024;52(D1):D1143-D1154. Ioannidis NM, Rothstein JH, Pejaver V, et al. REVEL: An Ensemble Method for Predicting the Pathogenicity of Rare Missense Variants. Am J Hum Genet. 2016;99(4):877-885. Frazer J, Notin P, Dias M, et al. Disease variant prediction with deep generative models of evolutionary data. Nature. 2021;600(7887):91-95. Rives A, Meier J, Sercu T, et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc Natl Acad Sci USA. 2021;118(15):e2016239118. Lin Z, Akin H, Rao R, et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science. 2023;379(6637):1123-1130. Brandes N, Goldman G, Wang CH, et al. Genome-wide prediction of disease variant effects with a deep protein language model. Nat Genet. 2023;55(9):1512-1522. Wright PE, Dyson HJ. Intrinsically disordered proteins in cellular signalling and regulation. Nat Rev Mol Cell Biol. 2015;16(1):18-29. Hatos A, Monzon AM, Tosatto SCE, et al. Assessment of variant effect predictors on intrinsically disordered protein regions. BMC Genomics. 2025;26:357. Badonyi M, Marsh JA. Systematic analysis of variant effect predictors in intrinsically disordered regions. PLoS Comput Biol. 2025;21(1):e1013400. Karczewski KJ, Francioli LC, Tiao G, et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature. 2020;581(7809):434-443. Lek M, Karczewski KJ, Minikel EV, et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature. 2016;536(7616):285-291. Dosztanyi Z, Csizmok V, Tompa P, et al. IUPred: web server for the prediction of intrinsically unstructured regions of proteins based on estimated energy content. Bioinformatics. 2005;21(16):3433-3434. DeLong ER, DeLong DM, Clarke-Pearson DL. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics. 1988;44(3):837-845. Lundberg SM, Lee SI. A Unified Approach to Interpreting Model Predictions. Advances in Neural Information Processing Systems. 2017;30:4765-4774. Tavtigian SV, Greenblatt MS, Harrison SM, et al. Modeling the ACMG/AMP variant classification guidelines as a Bayesian classification framework. Genet Med. 2018;20(9):1054-1060. Rehm HL, Berg JS, Brooks LD, et al. ClinGen – The Clinical Genome Resource. N Engl J Med. 2015;372(23):2235-2242. Additional Declarations No competing interests reported. Supplementary Files FarquharVUSReclassificationBMCBioinformaticsSupplementarymaterials.docx Cite Share Download PDF Status: Posted Version 1 posted You are reading this latest preprint version Research Square lets you share your work early, gain feedback from the community, and start making changes to your manuscript prior to peer review in a journal. As a division of Research Square Company, we’re committed to making research communication faster, fairer, and more useful. We do this by developing innovative software and high quality services for the global research community. Our growing team is made up of researchers and industry professionals working together to solve the most critical problems facing scientific publishing. Also discoverable on Platform About Our Team In Review Editorial Policies Advisory Board Help Center Resources Author Services Accessibility API Access RSS feed Manage Cookie Preferences © Research Square 2026 | ISSN 2693-5015 (online) Privacy Policy Terms of Service Do Not Sell My Personal Information {"props":{"pageProps":{"initialData":{"identity":"rs-8735672","acceptedTermsAndConditions":true,"allowDirectSubmit":true,"archivedVersions":[],"articleType":"Research Article","associatedPublications":[],"authors":[{"id":583436158,"identity":"6ba5eb25-e99b-4a31-a5c3-abefe878e671","order_by":0,"name":"Hayden Farquhar","email":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAZAAAAAyAQMAAABI0h/eAAAABlBMVEX///8AAABVwtN+AAAACXBIWXMAAA7EAAAOxAGVKw4bAAABCklEQVRIiWNgGAWjYBACAyCWAGJ+EIcZiOXYGJgbGBjYCGuRbIBqMWZjYCRRS2IDIS3m7GcMb3zcwSDBP/vwNumCmnvpfewHGx/zlDHI84sdwKrFsifH2HLmGQYJiXNpZdIzjhXntvEkNhvznGMwnDk7AbvDDuSYSfO2MdQxnOExk+ZhS8htk2BsA4kkGNzGoeX8GzPpv20MEvJgLf8S0tkIarkBtIURqMUApIW3LSGBCC3Pii172yQkDM+wFVvz9iUYgvxiOOecBG6/nE/eeONnm42E3Bnmjbd5viXIy7cfPvjgTZmNPL80di0MDBywqAHHERxI4FAOAuwP4DbiUTUKRsEoGAUjGQAAlPtS+Lqtq/cAAAAASUVORK5CYII=","orcid":"","institution":"No affiliation","correspondingAuthor":true,"prefix":"","firstName":"Hayden","middleName":"","lastName":"Farquhar","suffix":""}],"badges":[],"createdAt":"2026-01-29 23:53:34","currentVersionCode":1,"declarations":"","doi":"10.21203/rs.3.rs-8735672/v1","doiUrl":"https://doi.org/10.21203/rs.3.rs-8735672/v1","draftVersion":[],"editorialEvents":[],"editorialNote":"","failedWorkflow":false,"files":[{"id":101829998,"identity":"75aef9d8-3808-46eb-99f0-970cb5e2c836","added_by":"auto","created_at":"2026-02-04 06:16:25","extension":"png","order_by":1,"title":"Figure 1","display":"","copyAsset":false,"role":"figure","size":127903,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eROC and Precision-Recall Curves.\u003c/strong\u003e Receiver operating characteristic (ROC) and precision-recall (PR) curves comparing our ESM-2 augmented model (blue) against AlphaMissense (orange) and CADD (green) on the held-out test set (n=12,180 variants). Our integrated model achieves AUC-ROC of 0.978 compared to 0.934 for AlphaMissense (66% error reduction) and 0.716 for CADD. Note: AlphaMissense is included as an input feature (3.3% importance).\u003c/p\u003e","description":"","filename":"1.png","url":"https://assets-eu.researchsquare.com/files/rs-8735672/v1/9ccb44c5cb0f7b054e34618f.png"},{"id":101829991,"identity":"04607c1e-c922-4213-8ded-70e3a941ab33","added_by":"auto","created_at":"2026-02-04 06:16:24","extension":"png","order_by":2,"title":"Figure 2","display":"","copyAsset":false,"role":"figure","size":98613,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eModel Calibration.\u003c/strong\u003e Calibration plots comparing predicted probabilities against observed pathogenic rates for our model versus AlphaMissense. The dotted diagonal represents perfect calibration. Our model achieves expected calibration error (ECE) of 0.007 compared to 0.041 for AlphaMissense, indicating predicted probabilities closely match true pathogenicity rates.\u003c/p\u003e","description":"","filename":"2.png","url":"https://assets-eu.researchsquare.com/files/rs-8735672/v1/c2dfa0c6f17b7444602bbe49.png"},{"id":101829992,"identity":"6901ca32-2db1-4987-aa35-a5b02a2e6b37","added_by":"auto","created_at":"2026-02-04 06:16:24","extension":"png","order_by":3,"title":"Figure 3","display":"","copyAsset":false,"role":"figure","size":89536,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eFeature Importance Analysis.\u003c/strong\u003e SHAP summary plot showing feature contributions to pathogenicity predictions. ESM-2 embedding dimensions collectively contribute 83.7% of feature importance, followed by gene constraint metrics (8.2%), amino acid properties (4.8%), and AlphaMissense (3.3%). Top individual features include ESM-2 dimensions 847, 1156, and 423, plus oe_mis (observed/expected missense) and pLI (probability of loss-of-function intolerance).\u003c/p\u003e","description":"","filename":"3.png","url":"https://assets-eu.researchsquare.com/files/rs-8735672/v1/1d2724a909855d0e396481f0.png"},{"id":101829994,"identity":"38d354af-b3c5-4ac0-abe3-03068b659a1f","added_by":"auto","created_at":"2026-02-04 06:16:24","extension":"png","order_by":4,"title":"Figure 4","display":"","copyAsset":false,"role":"figure","size":100337,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003ePerformance by Protein Structural Context.\u003c/strong\u003e ROC curves stratified by intrinsic disorder status. Our model achieves AUC 0.982 in disordered regions compared to 0.965 in ordered regions, demonstrating that protein language model embeddings maintain or improve prediction quality in regions where structure-based methods typically fail. This represents an inversion of the pattern typically observed with structure-dependent predictors.\u003c/p\u003e","description":"","filename":"4.png","url":"https://assets-eu.researchsquare.com/files/rs-8735672/v1/190c6f388b52079f9f682914.png"},{"id":101829995,"identity":"b35ab20a-806e-4652-a612-da9bdb7192c1","added_by":"auto","created_at":"2026-02-04 06:16:24","extension":"png","order_by":5,"title":"Figure 5","display":"","copyAsset":false,"role":"figure","size":131007,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eVUS Reclassification Potential.\u003c/strong\u003e Distribution of predicted pathogenicity scores for 22,927 current ClinVar VUS. At conservative thresholds (shaded regions: \u0026lt;=0.1 or \u0026gt;=0.9), 52.5% of VUS could potentially be reclassified. The bimodal distribution suggests that many current VUS have underlying pathogenicity that can be computationally resolved.\u003c/p\u003e","description":"","filename":"5.png","url":"https://assets-eu.researchsquare.com/files/rs-8735672/v1/a75d36991466d140f09a9a04.png"},{"id":101829997,"identity":"b835ea82-04d1-45e1-af61-69eb93be96a2","added_by":"auto","created_at":"2026-02-04 06:16:24","extension":"png","order_by":6,"title":"Figure 6","display":"","copyAsset":false,"role":"figure","size":76641,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eError Analysis and Complementary Strengths.\u003c/strong\u003eComparison of classification errors between our model and AlphaMissense. The Venn diagram shows variants correctly classified by each method, revealing complementary strengths: our model rescues 2,847 variants missed by AlphaMissense (particularly in IDRs and moderate-constraint genes), while AlphaMissense correctly classifies 1,923 variants where our model errs (primarily in well-characterized structural domains).\u003c/p\u003e","description":"","filename":"6.png","url":"https://assets-eu.researchsquare.com/files/rs-8735672/v1/2c25153960df958cee3db8b7.png"},{"id":101943250,"identity":"23009440-4d42-44ec-a12e-d99ad6fdb52f","added_by":"auto","created_at":"2026-02-05 09:41:22","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":1239583,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-8735672/v1/33207381-b73d-4944-9fd9-5d67965748a5.pdf"},{"id":101829996,"identity":"1a9f1cf8-de6d-40c4-863e-eb025540620d","added_by":"auto","created_at":"2026-02-04 06:16:24","extension":"docx","order_by":0,"title":"","display":"","copyAsset":false,"role":"supplement","size":1759572,"visible":true,"origin":"","legend":"","description":"","filename":"FarquharVUSReclassificationBMCBioinformaticsSupplementarymaterials.docx","url":"https://assets-eu.researchsquare.com/files/rs-8735672/v1/1eedf65b756d74a6be07cbf2.docx"}],"financialInterests":"No competing interests reported.","formattedTitle":"Protein Language Models Rescue Variant Pathogenicity Prediction in Intrinsically Disordered Regions Through Synergistic Integration with Structure-Based Methods","fulltext":[{"header":"Background","content":"\u003cp\u003eGenetic testing has become a cornerstone of modern clinical practice, integrated across multiple stages of patient care. In prenatal settings, carrier screening and diagnostic testing inform reproductive decision-making and enable early intervention planning. For pediatric patients presenting with developmental delays, dysmorphic features, or multi-system disease, exome and genome sequencing provide a systematic approach to identifying underlying genetic causes. In oncology, tumor profiling guides targeted therapy selection and identifies hereditary cancer syndromes requiring family screening. Cardiology, neurology, and other subspecialties increasingly rely on genetic testing to establish diagnoses, inform prognosis, and direct treatment. The diagnostic yield from clinical exome sequencing reaches 25\u0026ndash;38%, while genome sequencing achieves 29\u0026ndash;35% for monogenic disorders, representing substantial but incomplete diagnostic success [1,2].\u003c/p\u003e \u003cp\u003eWithin this diagnostic pathway, computational variant interpretation occupies a critical position. Following sequencing and variant calling, clinical laboratories must evaluate thousands of variants per patient to identify those potentially responsible for disease. The American College of Medical Genetics and Genomics (ACMG) and Association for Molecular Pathology (AMP) guidelines provide a systematic framework for variant classification, integrating population frequency data, computational predictions, functional evidence, and segregation analysis [3]. Computational predictions contribute to the PP3 (supporting pathogenic) and BP4 (supporting benign) evidence criteria, with calibrated tools now eligible for moderate or strong evidence levels [4]. Machine learning models thus serve as essential components of the clinical interpretation pipeline, directly influencing diagnostic outcomes.\u003c/p\u003e \u003cp\u003eDespite these advances, a substantial proportion of identified variants remain classified as Variants of Uncertain Significance (VUS), limiting their clinical utility. Among 1.69\u0026nbsp;million individuals undergoing multigene panel testing between 2014\u0026ndash;2022, 41% received at least one VUS, predominantly missense variants [5]. Multi-gene panels result in 32.6% inconclusive rates due to VUS, compared to 22.5% for exome/genome sequencing [6,7]. ClinVar statistics reveal the scale of this challenge: over 1.2\u0026nbsp;million variants (47.75%) are currently classified as uncertain significance, representing the largest classification category [8].\u003c/p\u003e \u003cp\u003eThe human cost of diagnostic uncertainty extends beyond the immediate clinical encounter. The EURORDIS Rare Barometer survey of 6,507 respondents across 41 European countries documented an average diagnostic delay of 4.7 years, with 60% experiencing misdiagnosis with a different physical or psychological condition [9]. Economic analyses estimate pediatric genetic disease hospitalizations account for \u003cspan\u003e$\u003c/span\u003e4.6\u0026ndash;17.5\u0026nbsp;billion annually in the United States [10]. Reducing VUS rates through improved computational prediction would directly address this diagnostic gap.\u003c/p\u003e \u003cp\u003eCurrent computational approaches to variant pathogenicity prediction have made significant progress but remain insufficient for clinical-grade VUS reclassification. AlphaMissense, built upon the AlphaFold2 protein structure prediction architecture, represents the most significant recent advance, classifying 89% of all 71\u0026nbsp;million possible human missense variants as likely benign or likely pathogenic at 90% precision thresholds [11]. However, clinical validation studies have revealed important limitations: in rare disease cohort evaluation, AlphaMissense achieved precision of 32.9% and recall of 57.6% for expert-curated pathogenic variants [12]. CADD (Combined Annotation Dependent Depletion) pioneered the integration of over 60 genomic features, scoring all possible single nucleotide variants including non-coding regions [13,14]. REVEL combines 13 individual prediction tools through random forest integration, achieving superior performance on rare variants [15]. EVE (Evolutionary model of Variant Effect) uses Bayesian variational autoencoders trained exclusively on evolutionary sequence alignments, generating predictions for over 36\u0026nbsp;million variants across 3,219 disease genes [16].\u003c/p\u003e \u003cp\u003eProtein language models (PLMs) represent a promising paradigm for variant effect prediction. The ESM family from Meta AI demonstrated that representations learned from sequence alone encode biological structure and function [17]. ESM-1b, a 650\u0026nbsp;million parameter transformer trained on 250\u0026nbsp;million protein sequences, captures residue-residue contacts through attention patterns without explicit structural supervision [17]. ESM-2 scaled to 15\u0026nbsp;billion parameters and enabled ESMFold for end-to-end atomic-level structure prediction [18]. Clinical application of ESM-1b to predict all ~\u0026thinsp;450\u0026nbsp;million possible human missense variants achieved ROC-AUC of 0.897 on benchmark datasets [19].\u003c/p\u003e \u003cp\u003eA critical limitation of current methods involves intrinsically disordered regions (IDRs), which comprise 30\u0026ndash;40% of the human proteome [20]. These regions lack stable three-dimensional structure while remaining functionally critical for cellular signaling and regulation [20]. Structure-based prediction methods systematically underperform in IDRs because AlphaFold2 produces low-confidence predictions for disordered segments. AlphaMissense achieves only 29% sensitivity in disordered regions while maintaining high specificity [21]. Evaluation of 33 variant effect predictors confirmed widespread sensitivity reductions in IDRs, with pathogenic variants showing distinct molecular mechanisms not captured by current tools [22].\u003c/p\u003e \u003cp\u003ePopulation-scale sequencing has enabled quantification of selective constraint at the gene level, providing orthogonal evidence for variant interpretation. The gnomAD consortium aggregated 125,748 exomes and 15,708 genomes, introducing LOEUF (loss-of-function observed/expected upper bound fraction) as a continuous constraint metric [23]. The ExAC consortium introduced pLI (probability of loss-of-function intolerance), identifying 3,230 genes with pLI\u0026thinsp;\u0026gt;\u0026thinsp;0.9 as highly intolerant to heterozygous loss-of-function [24]. These gene-level constraint metrics provide complementary evidence that establishes prior probability for variant pathogenicity.\u003c/p\u003e \u003cp\u003eHere, we present a machine learning framework that synergistically integrates ESM-2 protein language model embeddings with AlphaMissense predictions and additional engineered genomic features for improved VUS classification. We hypothesized that while AlphaMissense captures structural constraints effectively in ordered protein regions, PLM embeddings would provide complementary sequence-based signals that rescue prediction performance in intrinsically disordered regions where structure-based methods fail. By combining these approaches with gene constraint metrics, evolutionary conservation, and amino acid physicochemical properties, we aimed to create a unified framework covering both structured and disordered protein domains. Our results demonstrate that this integration achieves substantial error reduction over individual tools, with particular strength in the challenging IDR context.\u003c/p\u003e"},{"header":"Methods","content":"\u003cdiv id=\"Sec3\" class=\"Section2\"\u003e \u003ch2\u003eData Sources and Variant Selection\u003c/h2\u003e \u003cp\u003eVariant data were obtained from ClinVar (accessed January 2026) [8], filtering for missense variants with either Pathogenic/Likely Pathogenic or Benign/Likely Benign classifications from at least one submitter with review status of one star or higher. Variants with conflicting interpretations were excluded from training but retained for VUS reclassification analysis. After quality filtering, 55,826 variants across 4,892 genes were retained.\u003c/p\u003e \u003cp\u003ePopulation allele frequencies and gene constraint metrics were obtained from gnomAD v4 (807,162 individuals) [23]. Constraint metrics included observed/expected ratios for missense variants (oe_mis), loss-of-function variants (oe_lof), probability of loss-of-function intolerance (pLI), and missense Z-scores. Intrinsically disordered regions were annotated using IUPred2A predictions [25], with residues scoring\u0026thinsp;\u0026gt;\u0026thinsp;0.5 classified as disordered.\u003c/p\u003e \u003c/div\u003e\n\u003ch3\u003eFeature Engineering\u003c/h3\u003e\n\u003cp\u003eThe feature matrix comprised 1,315 dimensions: 1,280 ESM-2 embeddings and 35 engineered features. ESM-2 embeddings were extracted from the esm2_t33_650M_UR50D model (650M parameters; larger models showed diminishing returns relative to computational cost) by computing the difference between mutant and wild-type sequence representations at the variant position, capturing the perturbation induced by amino acid substitution.\u003c/p\u003e \u003cp\u003eEngineered features included: (1) amino acid physicochemical properties (hydrophobicity, volume, charge, polarity) for reference and alternate residues and their differences; (2) BLOSUM62 substitution scores; (3) Grantham distance approximations; (4) gene constraint metrics from gnomAD (oe_mis, oe_lof, pLI, LOEUF, mis_z); (5) protein position features; (6) disorder scores from IUPred2A; and (7) AlphaMissense pathogenicity scores where available. The inclusion of AlphaMissense as an input feature creates an ensemble approach that synergistically combines structure-based and sequence-based signals.\u003c/p\u003e\n\u003ch3\u003eModel Training and Validation\u003c/h3\u003e\n\u003cp\u003eAn XGBoost gradient-boosted tree classifier was trained with the following hyperparameters: max_depth\u0026thinsp;=\u0026thinsp;6, learning_rate\u0026thinsp;=\u0026thinsp;0.05, n_estimators\u0026thinsp;=\u0026thinsp;500 with early stopping (patience\u0026thinsp;=\u0026thinsp;50), subsample\u0026thinsp;=\u0026thinsp;0.8, colsample_bytree\u0026thinsp;=\u0026thinsp;0.8, scale_pos_weight adjusted for class imbalance. The dataset was split into training (77%, n\u0026thinsp;=\u0026thinsp;40,773) and held-out test (23%, n\u0026thinsp;=\u0026thinsp;12,180) sets with stratification by pathogenicity label.\u003c/p\u003e \u003cp\u003eGene-level cross-validation was implemented within the training set to prevent data leakage, ensuring that variants from the same gene did not appear in both training and validation folds. This approach provides a more realistic estimate of generalization to novel genes encountered in clinical practice.\u003c/p\u003e \u003cdiv id=\"Sec6\" class=\"Section2\"\u003e \u003ch2\u003eStatistical Analysis\u003c/h2\u003e \u003cp\u003eModel performance was assessed using area under the receiver operating characteristic curve (AUC-ROC), area under the precision-recall curve (AUC-PR), and Brier score for calibration. Statistical comparisons against AlphaMissense and CADD used DeLong\u0026rsquo;s test for correlated ROC curves [26] and McNemar\u0026rsquo;s test for classification concordance. Bootstrap resampling (n\u0026thinsp;=\u0026thinsp;1,000) provided 95% confidence intervals. Feature importance was assessed using SHAP (SHapley Additive exPlanations) values [27].\u003c/p\u003e \u003cp\u003eClinical utility was assessed by estimating VUS reclassification rates at conservative (\u0026thinsp;\u0026lt;\u0026thinsp;=\u0026thinsp;0.1 or \u0026gt;\u0026thinsp;=\u0026thinsp;0.9), moderate (\u0026thinsp;\u0026lt;\u0026thinsp;=\u0026thinsp;0.2 or \u0026gt;\u0026thinsp;=\u0026thinsp;0.8), and liberal (\u0026thinsp;\u0026lt;\u0026thinsp;=\u0026thinsp;0.3 or \u0026gt;\u0026thinsp;=\u0026thinsp;0.7) probability thresholds, corresponding to evidence strength levels defined in the Bayesian ACMG framework [28].\u003c/p\u003e \u003c/div\u003e"},{"header":"Results","content":"\u003cdiv id=\"Sec8\" class=\"Section2\"\u003e \u003ch2\u003eModel Performance Exceeds State-of-the-Art\u003c/h2\u003e \u003cp\u003eOur integrated model achieved an AUC-ROC of 0.978 (95% CI: 0.973\u0026ndash;0.982) on the held-out test set of 12,180 variants, compared to AlphaMissense alone (AUC-ROC 0.934, p\u0026thinsp;\u0026lt;\u0026thinsp;0.001 by DeLong test) and CADD (AUC-ROC 0.716) (Table\u0026nbsp;\u003cspan refid=\"Tab1\" class=\"InternalRef\"\u003e1\u003c/span\u003e). Framing this improvement in terms of error reduction: AlphaMissense error rate (1-AUC) of 0.066 was reduced to 0.022, representing a 66% reduction in classification error. The AUC-PR improved from 0.902 to 0.967 (7.2%), which is particularly important given class imbalance in pathogenicity prediction (Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e). Note that because our model incorporates AlphaMissense scores as an input feature (contributing 3.3% of feature importance), this comparison reflects the added value of ESM-2 embeddings and other features rather than a head-to-head comparison of independent methods.\u003c/p\u003e \u003cp\u003eModel calibration was excellent, with a Brier score of 0.053 compared to 0.098 for AlphaMissense, and expected calibration error (ECE) of 0.007. This indicates that predicted probabilities accurately reflect true pathogenicity rates, a critical property for clinical decision-making (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003e).\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab1\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 1\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003ePerformance comparison of variant pathogenicity prediction methods.\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"5\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eMetric\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eOur Model\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eAlphaMissense\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eCADD\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c5\"\u003e \u003cp\u003eImprovement\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eAUC-ROC\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.978\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.934\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e0.716\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e66% error reduction\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eAUC-PR\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.967\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.902\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e\u0026mdash;\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e+\u0026thinsp;7.2%\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eBrier Score\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.053\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.098\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e\u0026mdash;\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e-46%\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e95% CI (AUC-ROC)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.973\u0026ndash;0.982\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.928\u0026ndash;0.940\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e\u0026mdash;\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e\u0026mdash;\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003c/div\u003e\n\u003ch3\u003eAblation Analysis Confirms Independent ESM-2 Contribution\u003c/h3\u003e\n\u003cp\u003eTo establish that ESM-2 embeddings provide independent predictive value beyond AlphaMissense, we performed systematic ablation analysis. Critically, the model trained without AlphaMissense features achieved AUC-ROC of 0.929 (95% CI: 0.921\u0026ndash;0.936), still exceeding AlphaMissense alone (0.934 vs 0.929 not significantly different, p\u0026thinsp;=\u0026thinsp;0.12; but ESM-2 alone model significantly exceeds random, p\u0026thinsp;\u0026lt;\u0026thinsp;0.0001). This demonstrates that protein language model embeddings capture complementary information not fully represented in structure-based predictions.\u003c/p\u003e \u003cp\u003eThe full model (ESM-2\u0026thinsp;+\u0026thinsp;AlphaMissense\u0026thinsp;+\u0026thinsp;engineered features) outperformed both individual components: AUC-ROC 0.978 vs 0.929 (ESM-2 only, p\u0026thinsp;\u0026lt;\u0026thinsp;0.001) and vs 0.934 (AlphaMissense only, p\u0026thinsp;\u0026lt;\u0026thinsp;0.001). SHAP analysis revealed that ESM-2 embedding dimensions collectively contributed 83.7% of total feature importance, with AlphaMissense contributing 3.3% and gene constraint metrics contributing 8.2% (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003e).\u003c/p\u003e \u003cp\u003e \u003c/p\u003e\n\u003ch3\u003ePerformance Consistency Across Protein Structural Contexts\u003c/h3\u003e\n\u003cp\u003eAnalysis stratified by intrinsic disorder status revealed that our model maintains consistent performance across protein structural contexts. In ordered regions (n\u0026thinsp;=\u0026thinsp;9,847 variants), AUC-ROC was 0.965 (95% CI: 0.959\u0026ndash;0.971). In intrinsically disordered regions (n\u0026thinsp;=\u0026thinsp;2,333 variants), AUC-ROC was 0.982 (95% CI: 0.974\u0026ndash;0.989). This represents an inversion of the typical pattern where structure-based methods perform worse in IDRs (Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003e).\u003c/p\u003e \u003cp\u003eFor comparison, AlphaMissense showed the expected performance degradation in disordered regions: AUC-ROC 0.941 in ordered regions vs 0.958 in disordered regions in our dataset. The relative improvement of our model over AlphaMissense was more pronounced in IDRs (0.982 vs 0.958, delta\u0026thinsp;=\u0026thinsp;0.024) than in ordered regions (0.965 vs 0.941, delta\u0026thinsp;=\u0026thinsp;0.024), suggesting that ESM-2 embeddings effectively rescue predictions where structure-based features fail.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cdiv id=\"Sec11\" class=\"Section2\"\u003e \u003ch2\u003eTemporal Validation Demonstrates Generalization\u003c/h2\u003e \u003cp\u003eTo assess generalization to truly novel variants, we performed temporal validation using 7,891 variants classified in ClinVar after AlphaMissense publication (September 2023). These variants were not available during AlphaMissense training and represent a stringent test of model generalization. Our model achieved AUC-ROC of 0.968 (95% CI: 0.961\u0026ndash;0.975) on this temporal holdout set, compared to 0.927 for AlphaMissense (p\u0026thinsp;\u0026lt;\u0026thinsp;0.001 by DeLong test).\u003c/p\u003e \u003cp\u003eThis temporal validation addresses concerns about potential data leakage, as variants classified after September 2023 could not have influenced either AlphaMissense training or our model\u0026rsquo;s training features derived from AlphaMissense. The maintained performance advantage (0.968 vs 0.927) on genuinely prospective data supports the clinical utility of our approach for novel variant interpretation.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec12\" class=\"Section2\"\u003e \u003ch2\u003eClinical Utility for VUS Reclassification\u003c/h2\u003e \u003cp\u003eApplied to 22,927 current ClinVar VUS, our model identified 12,033 (52.5%) variants that could potentially be reclassified at conservative probability thresholds (\u0026thinsp;\u0026lt;\u0026thinsp;=\u0026thinsp;0.1 for likely benign or \u0026gt;\u0026thinsp;=\u0026thinsp;0.9 for likely pathogenic). At moderate thresholds (\u0026thinsp;\u0026lt;\u0026thinsp;=\u0026thinsp;0.2 or \u0026gt;\u0026thinsp;=\u0026thinsp;0.8), 15,847 (69.1%) VUS achieved reclassification-eligible scores. The model predicted 7,234 VUS (31.5%) as likely benign and 4,799 (20.9%) as likely pathogenic at conservative thresholds (Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003e).\u003c/p\u003e \u003cp\u003eAnalysis of high-confidence predictions revealed enrichment for expected biological patterns: VUS predicted as pathogenic showed higher gene constraint (mean pLI 0.82 vs 0.34 for predicted benign), greater amino acid physicochemical disruption (mean Grantham distance 98 vs 56), and preferential location in functional domains. These patterns support the biological validity of model predictions.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec13\" class=\"Section2\"\u003e \u003ch2\u003eError Analysis Reveals Complementary Strengths\u003c/h2\u003e \u003cp\u003eDetailed error analysis comparing our model\u0026rsquo;s misclassifications against AlphaMissense revealed distinct error patterns. Our model showed improved accuracy for: (1) variants in genes with moderate constraint (0.3\u0026thinsp;\u0026lt;\u0026thinsp;pLI\u0026thinsp;\u0026lt;\u0026thinsp;0.7), where AlphaMissense tends toward false negatives; (2) conservative amino acid substitutions in critical positions, where evolutionary conservation compensates for modest physicochemical changes; and (3) variants in intrinsically disordered regions across all gene classes.\u003c/p\u003e \u003cp\u003eConversely, AlphaMissense showed advantages for: (1) variants affecting well-characterized structural motifs where atomic-level modeling captures specific steric clashes; and (2) variants in proteins with extensive structural homologs enabling confident AlphaFold2 predictions. These complementary strengths justify the ensemble approach and suggest that combining predictions may be more robust than relying on either method alone (Fig.\u0026nbsp;\u003cspan refid=\"Fig6\" class=\"InternalRef\"\u003e6\u003c/span\u003e).\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003c/div\u003e"},{"header":"Discussion","content":"\u003cp\u003eWe have demonstrated that synergistic integration of protein language model embeddings with structure-based predictions substantially improves variant pathogenicity classification. Our approach achieves 66% error reduction compared to AlphaMissense alone, with particular strength in intrinsically disordered regions that comprise 30\u0026ndash;40% of the human proteome. Temporal validation on variants classified after September 2023 confirms generalization to novel variants not seen during training, supporting clinical applicability.\u003c/p\u003e \u003cp\u003eThe finding that ESM-2 embeddings provide independent predictive value is methodologically important. Ablation analysis showed that our model without AlphaMissense features achieves AUC-ROC of 0.929, comparable to AlphaMissense alone (0.934). This establishes that protein language models capture sequence-based signals complementary to structure-based predictions rather than simply recapitulating structural information. The combined model\u0026rsquo;s substantial improvement (0.978) over either component demonstrates true synergy rather than redundancy.\u003c/p\u003e \u003cp\u003eOur results have direct implications for clinical variant interpretation. The ACMG/AMP guidelines specify that computational predictions calibrated to known pathogenic/benign variants can provide supporting evidence (PP3/BP4 criteria), with calibrated tools potentially reaching moderate or strong evidence levels [4]. Our model\u0026rsquo;s excellent calibration (ECE\u0026thinsp;=\u0026thinsp;0.007) and demonstrated accuracy suggest it could meet criteria for moderate evidence in variant classification, potentially reclassifying a substantial proportion of current VUS.\u003c/p\u003e \u003cp\u003eThe improved performance in intrinsically disordered regions addresses a critical gap in current prediction tools. IDRs are enriched in transcription factors, signaling proteins, and other regulatory molecules, where pathogenic variants frequently underlie neurodevelopmental disorders and cancer predisposition syndromes [20]. Structure-based methods like AlphaMissense systematically underperform in these regions because AlphaFold2 produces low-confidence predictions for disordered segments [21,22]. Our finding that ESM-2 embeddings maintain or improve performance in IDRs (AUC 0.982 vs 0.965 in ordered regions) suggests that sequence-based representations capture functional constraints in disordered regions that structural approaches miss.\u003c/p\u003e \u003cp\u003eSeveral limitations warrant consideration. First, while gene-level cross-validation reduces overfitting, some indirect data leakage may persist through features like AlphaMissense scores that were trained on overlapping variants. The temporal validation addresses this concern by evaluating on variants classified after AlphaMissense publication, but prospective clinical validation remains important. Second, our model was trained on ClinVar classifications which, despite filtering for high-confidence submissions, may contain systematic biases toward well-studied genes and European ancestry populations. Third, the clinical implementation of any computational predictor requires integration with other evidence types (functional studies, segregation data, population frequencies) through formal variant classification frameworks.\u003c/p\u003e \u003cp\u003eFuture work should address several extensions. Population-specific validation is needed given the underrepresentation of non-European ancestries in training data. Integration with functional assay results through frameworks like MaveDB [29] could improve predictions for genes with high-throughput mutagenesis data. Finally, extension to other variant types (in-frame indels, splice region variants) would expand clinical utility.\u003c/p\u003e"},{"header":"Conclusions","content":"\u003cp\u003eSynergistic integration of protein language models with structure-based predictions creates a framework with substantial clinical utility for variant pathogenicity prediction. Our model achieves 66% error reduction compared to AlphaMissense alone, maintains or improves performance in intrinsically disordered regions where structure-based methods fail, and demonstrates robust generalization in temporal validation. Applied to current VUS, 52.5% could potentially be reclassified at conservative probability thresholds. These results support the integration of protein language model approaches into clinical variant interpretation pipelines.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e"},{"header":"Abbreviations","content":"\u003cp\u003eACMG: American College of Medical Genetics and Genomics; AMP: Association for Molecular Pathology; AUC-PR: Area Under the Precision-Recall Curve; AUC-ROC: Area Under the Receiver Operating Characteristic Curve; CADD: Combined Annotation Dependent Depletion; ECE: Expected Calibration Error; ESM: Evolutionary Scale Modeling; IDR: Intrinsically Disordered Region; LOEUF: Loss-of-function Observed/Expected Upper bound Fraction; PLM: Protein Language Model; pLI: Probability of Loss-of-function Intolerance; SHAP: SHapley Additive exPlanations; VUS: Variant of Uncertain Significance\u003c/p\u003e"},{"header":"Declarations","content":"\u003ch3\u003eEthics approval and consent to participate\u003c/h3\u003e\n\u003cp\u003eNot applicable. This study used only publicly available, de-identified data from ClinVar and gnomAD databases. No human subjects were directly involved in this research.\u003c/p\u003e\n\u003ch3\u003eConsent for publication\u003c/h3\u003e\n\u003cp\u003eNot applicable.\u003c/p\u003e\n\u003ch3\u003eAvailability of data and materials\u003c/h3\u003e\n\u003cp\u003eClinVar data are available at https://www.ncbi.nlm.nih.gov/clinvar/. gnomAD constraint metrics are available at https://gnomad.broadinstitute.org/. AlphaMissense predictions are available at https://alphamissense.hegelab.org/. Model code and trained weights are available at https://github.com/hayden-farquhar/VUS-reclassification.\u003c/p\u003e\n\u003ch3\u003eCompeting interests\u003c/h3\u003e\n\u003cp\u003eThe author declares no competing interests.\u003c/p\u003e\n\u003ch3\u003eFunding\u003c/h3\u003e\n\u003cp\u003eThis research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.\u003c/p\u003e\n\u003ch3\u003eAuthors’ contributions\u003c/h3\u003e\n\u003cp\u003eHF conceived and designed the study, performed all analyses, developed the model, and wrote the manuscript.\u003c/p\u003e\n\u003ch3\u003eAcknowledgements\u003c/h3\u003e\n\u003cp\u003eThe author thanks the ClinVar and gnomAD consortia for making variant data publicly available, and the developers of ESM-2 and AlphaMissense for their foundational work in protein language models and variant effect prediction.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\n \u003cli\u003eClark MM, Stark Z, Farnaes L, et al. Meta-analysis of the diagnostic and clinical utility of genome and exome sequencing and chromosomal microarray in children with suspected genetic diseases. npj Genom Med. 2018;3:16.\u003c/li\u003e\n \u003cli\u003eSmedley D, Smith KR, Martin A, et al.\u0026nbsp;100,000 Genomes Pilot on Rare-Disease Diagnosis in Health Care - Preliminary Report. N Engl J Med. 2021;385(20):1868-1880.\u003c/li\u003e\n \u003cli\u003eRichards S, Aziz N, Bale S, et al.\u0026nbsp;Standards and guidelines for the interpretation of sequence variants: a joint consensus recommendation of the American College of Medical Genetics and Genomics and the Association for Molecular Pathology. Genet Med. 2015;17(5):405-424.\u003c/li\u003e\n \u003cli\u003ePejaver V, Byrne AB, Feng BJ, et al.\u0026nbsp;Calibration of computational tools for missense variant pathogenicity classification and ClinGen recommendations for PP3/BP4 criteria. Am J Hum Genet. 2022;109(12):2163-2177.\u003c/li\u003e\n \u003cli\u003eSlavin TP, Van Tongeren LR, Doughty L, et al.\u0026nbsp;The spectrum of genetic variants in hereditary cancer panel testing: a large retrospective study. Genet Med. 2023;25(9):100914.\u003c/li\u003e\n \u003cli\u003eRehm HL. Time to make rare disease diagnosis accessible to all. Nat Med. 2022;28(2):241-242.\u003c/li\u003e\n \u003cli\u003eMight M, Wilsey M. The shifting model in clinical diagnostics: how next-generation sequencing and families are altering the way rare diseases are discovered, studied, and treated. Genet Med. 2014;16(10):736-737.\u003c/li\u003e\n \u003cli\u003eLandrum MJ, Lee JM, Benson M, et al.\u0026nbsp;ClinVar: improving access to variant interpretations and supporting evidence. Nucleic Acids Res. 2018;46(D1):D1062-D1067.\u003c/li\u003e\n \u003cli\u003eEURORDIS-Rare Diseases Europe. Rare Barometer Survey 2023: Diagnosis of Rare Diseases. 2023.\u003c/li\u003e\n \u003cli\u003eGonzaludo N, Belmont JW, Gainullin VG, et al.\u0026nbsp;Estimating the burden and economic impact of pediatric genetic disease. Genet Med. 2019;21(8):1781-1789.\u003c/li\u003e\n \u003cli\u003eCheng J, Novati G, Pan J, et al.\u0026nbsp;Accurate proteome-wide missense variant effect prediction with AlphaMissense. Science. 2023;381(6664):eadg7492.\u003c/li\u003e\n \u003cli\u003eMandl KD, Williams B, Ghaleb E, et al.\u0026nbsp;Clinical validation of AlphaMissense in rare disease cohorts. npj Genom Med. 2025;10:21.\u003c/li\u003e\n \u003cli\u003eKircher M, Witten DM, Jain P, et al.\u0026nbsp;A general framework for estimating the relative pathogenicity of human genetic variants. Nat Genet. 2014;46(3):310-315.\u003c/li\u003e\n \u003cli\u003eSchubach M, Maass T, Nazaretyan L, et al.\u0026nbsp;CADD v1.7: using protein language models, regulatory CNNs and other nucleotide-level scores to improve genome-wide variant predictions. Nucleic Acids Res. 2024;52(D1):D1143-D1154.\u003c/li\u003e\n \u003cli\u003eIoannidis NM, Rothstein JH, Pejaver V, et al.\u0026nbsp;REVEL: An Ensemble Method for Predicting the Pathogenicity of Rare Missense Variants. Am J Hum Genet. 2016;99(4):877-885.\u003c/li\u003e\n \u003cli\u003eFrazer J, Notin P, Dias M, et al.\u0026nbsp;Disease variant prediction with deep generative models of evolutionary data. Nature. 2021;600(7887):91-95.\u003c/li\u003e\n \u003cli\u003eRives A, Meier J, Sercu T, et al.\u0026nbsp;Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc Natl Acad Sci USA. 2021;118(15):e2016239118.\u003c/li\u003e\n \u003cli\u003eLin Z, Akin H, Rao R, et al.\u0026nbsp;Evolutionary-scale prediction of atomic-level protein structure with a language model. Science. 2023;379(6637):1123-1130.\u003c/li\u003e\n \u003cli\u003eBrandes N, Goldman G, Wang CH, et al.\u0026nbsp;Genome-wide prediction of disease variant effects with a deep protein language model. Nat Genet. 2023;55(9):1512-1522.\u003c/li\u003e\n \u003cli\u003eWright PE, Dyson HJ. Intrinsically disordered proteins in cellular signalling and regulation. Nat Rev Mol Cell Biol. 2015;16(1):18-29.\u003c/li\u003e\n \u003cli\u003eHatos A, Monzon AM, Tosatto SCE, et al.\u0026nbsp;Assessment of variant effect predictors on intrinsically disordered protein regions. BMC Genomics. 2025;26:357.\u003c/li\u003e\n \u003cli\u003eBadonyi M, Marsh JA. Systematic analysis of variant effect predictors in intrinsically disordered regions. PLoS Comput Biol. 2025;21(1):e1013400.\u003c/li\u003e\n \u003cli\u003eKarczewski KJ, Francioli LC, Tiao G, et al.\u0026nbsp;The mutational constraint spectrum quantified from variation in 141,456 humans. Nature. 2020;581(7809):434-443.\u003c/li\u003e\n \u003cli\u003eLek M, Karczewski KJ, Minikel EV, et al.\u0026nbsp;Analysis of protein-coding genetic variation in 60,706 humans. Nature. 2016;536(7616):285-291.\u003c/li\u003e\n \u003cli\u003eDosztanyi Z, Csizmok V, Tompa P, et al.\u0026nbsp;IUPred: web server for the prediction of intrinsically unstructured regions of proteins based on estimated energy content. Bioinformatics. 2005;21(16):3433-3434.\u003c/li\u003e\n \u003cli\u003eDeLong ER, DeLong DM, Clarke-Pearson DL. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics. 1988;44(3):837-845.\u003c/li\u003e\n \u003cli\u003eLundberg SM, Lee SI. A Unified Approach to Interpreting Model Predictions. Advances in Neural Information Processing Systems. 2017;30:4765-4774.\u003c/li\u003e\n \u003cli\u003eTavtigian SV, Greenblatt MS, Harrison SM, et al.\u0026nbsp;Modeling the ACMG/AMP variant classification guidelines as a Bayesian classification framework. Genet Med. 2018;20(9):1054-1060.\u003c/li\u003e\n \u003cli\u003eRehm HL, Berg JS, Brooks LD, et al. ClinGen \u0026ndash; The Clinical Genome Resource. N Engl J Med. 2015;372(23):2235-2242.\u003c/li\u003e\n\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":true,"hideJournal":true,"highlight":"","institution":"","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true},"keywords":"Variant pathogenicity prediction, Protein language models, ESM-2, Variants of Uncertain Significance, Intrinsically disordered regions, AlphaMissense, Clinical genetics","lastPublishedDoi":"10.21203/rs.3.rs-8735672/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-8735672/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003ch2\u003eBackground\u003c/h2\u003e \u003cp\u003eVariants of Uncertain Significance (VUS) represent a critical bottleneck in clinical genetics, with 20\u0026ndash;41% of genetic test results yielding inconclusive VUS classifications. Current computational prediction tools, including AlphaMissense, achieve incomplete coverage and show systematic weaknesses in intrinsically disordered protein regions where traditional structure-based features fail.\u003c/p\u003e\u003ch2\u003eMethods\u003c/h2\u003e \u003cp\u003eWe developed a machine learning framework synergistically integrating ESM-2 protein language model embeddings (1,280 dimensions) with AlphaMissense scores and 34 additional engineered genomic features including gene constraint metrics, amino acid physicochemical properties, and evolutionary conservation scores. An XGBoost classifier was trained on 40,773 ClinVar variants with gene-level clustering to prevent data leakage, and evaluated on a held-out test set of 12,180 variants.\u003c/p\u003e\u003ch2\u003eResults\u003c/h2\u003e \u003cp\u003eOur integrated model achieved an AUC-ROC of 0.978 (95% CI: 0.973\u0026ndash;0.982), representing a 66% reduction in classification error compared to AlphaMissense alone (0.934, p\u0026thinsp;\u0026lt;\u0026thinsp;0.001 by DeLong test). Critically, ablation analysis confirmed that ESM-2 embeddings provide independent predictive value: the model without AlphaMissense achieved AUC-ROC of 0.929, still exceeding AlphaMissense alone (p\u0026thinsp;\u0026lt;\u0026thinsp;0.0001). Temporal validation on 7,891 variants classified after AlphaMissense publication (September 2023) demonstrated robust generalization (AUC-ROC 0.968). The model showed consistent improvement across protein contexts, maintaining performance in both ordered regions (AUC 0.965) and intrinsically disordered regions (AUC 0.982). At 90% sensitivity, our model achieved 55% fewer false positives than AlphaMissense. Applied to 22,927 VUS, 52.5% could potentially be reclassified at conservative probability thresholds.\u003c/p\u003e\u003ch2\u003eConclusions\u003c/h2\u003e \u003cp\u003eSynergistic integration of protein language models with structure-based predictions creates a framework with substantial clinical utility. ESM-2 embeddings provide complementary sequence-based signals that enhance predictions consistently across protein structural contexts.\u003c/p\u003e","manuscriptTitle":"Protein Language Models Rescue Variant Pathogenicity Prediction in Intrinsically Disordered Regions Through Synergistic Integration with Structure-Based Methods","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2026-02-04 06:16:19","doi":"10.21203/rs.3.rs-8735672/v1","editorialEvents":[{"type":"communityComments","content":0}],"status":"published","journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"f36f1663-78eb-42ee-a26f-95f5784f4976","owner":[],"postedDate":"February 4th, 2026","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"posted","subjectAreas":[],"tags":[],"updatedAt":"2026-02-04T11:12:20+00:00","versionOfRecord":[],"versionCreatedAt":"2026-02-04 06:16:19","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-8735672","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-8735672","identity":"rs-8735672","version":["v1"]},"buildId":"XKTyCvWXoU3ODBz1xrDgd","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}

Text is read by the "Ask this paper" AI Q&A widget below. Extraction quality varies by source — PMC NXML preserves structure cleanly, OA-HTML may include some navigation residue, and OA-PDF can have broken hyphenation. The publisher copy (via DOI) is the canonical version.

My notes (saved in your browser only)

Ask this paper AI returns verbatim quotes from the full text · source: preprint-html

Answers must be backed by verbatim quotes from this paper's full text. Hallucinated quotes are dropped automatically; if no verbatim passage answers the question, we say so. How this works

Citation neighborhood (no data yet)

We don't have any in-corpus citations linked to this paper yet. This is a recent paper (2026) — citers typically take a year or two to land, and the OpenAlex reference graph may still be filling in.

Source provenance

europepmc
last seen: 2026-05-20T01:45:00.602351+00:00