VUS. Life: Leveraging Vector Embeddings for Rapid and Accurate Pathogenicity Prediction of Genetic Variants

doi:10.21203/rs.3.rs-8605164/v1

2026 · doi:10.21203/rs.3.rs-8605164/v1

preprint OA: closed CC-BY-4.0

Jiawei Wu, Marissa Stutzman, Michael Muriello, Joy Lincoln, Donald G. Basel, and 1 more

Research Article. This is a preprint; it has not been peer reviewed by a journal. This work is licensed under a CC BY 4.0 License. Status: Version 1 posted.

Abstract

Background

Interpreting the pathogenicity of genetic variants remains a critical bottleneck in genomic medicine. Millions of variants of uncertain significance (VUS) hinder the clinical application of genetic findings. Traditional computational approaches often rely on hand-engineered features and fail to capture the complexity of multidimensional genomic annotations fully.

Methods

We developed VUS.Life, a multi-modal framework that combines semantic text embeddings of biological and clinical annotations with protein language modeling. We transformed variant annotations from Variant Effect Predictor (VEP) into natural language descriptions, which are then converted into vector embeddings via established Large Language Models (LLMs), namely all-mpnet-base-v2, MedEmbed-large-v0.1, and text-embedding-004. The pathogenicity of a variant of interest is predicted from its proximity, in the vector embedding space, to variants of known pathogenicity. We further extended VUS.Life by employing residue-level delta embeddings from the ESMC-600M model to capture both clinical context and biophysical constraints.

Results

We evaluated the framework on >10,000 variants across the BRCA1/2, FBN1, ATM, and PALB2 genes. VUS.Life achieved greater than 96% accuracy using VEP annotations alone across all variant types and disease genes evaluated. Additionally, our unsupervised FBN1 structural analysis using ESMC-600M revealed that delta embeddings disentangled distinct pathogenic mechanisms, topologically separating disulfide bond disruptions from calcium-binding defects. These structural clusters correlated strongly with Zero-Shot Log-Likelihood Ratio (LLR) scores, validating evolutionary fitness as a proxy for pathogenicity.

Conclusions

This semantic embedding framework, VUS.Life, accurately captures pathogenicity-relevant features from complex variant annotations, enabling high-accuracy (>96%) automated classification across multiple genes and models. The approach generalizes beyond well-curated genes and supports scalable, interpretable, and representation-based classification of VUS. It holds significant promise for alleviating the variant interpretation bottleneck in clinical genomics.

Keywords: Pathogenicity Prediction, Vector Embedding, Large Language Model, ESMC-600M, Protein Structure, BRCA1/2, FBN1, ATM, PALB2

Background

In just two decades, genomic medicine has transformed from a futuristic vision into clinical reality. Sequencing a human genome, which once required years and millions of dollars, can now be accomplished in hours for under a thousand dollars [1]. This revolution has enabled genetic testing to guide cancer treatment decisions, predict disease risk, and enable precision therapies. Yet paradoxically, as our capacity to identify genetic variants has increased, our ability to interpret their clinical significance has become the rate-limiting step in realizing the potential of genomic medicine.

The magnitude of this challenge is staggering. Current databases contain hundreds of millions of human genetic variants, yet only a fraction carry established clinical classifications [2–4]. For genes like BRCA1 and BRCA2, where accurate interpretation can mean the difference between prophylactic surgery and routine screening, this uncertainty carries profound implications.

Numerous computational tools like SIFT [5], PolyPhen-2 [6], and CADD [7], often referred to as in silico predictors, have been developed to help classify the pathogenicity of genetic variants by leveraging evolutionary conservation, protein structural predictions, and statistical modeling. Table 1 lists some of the most popular and commonly used general-purpose predictors. These tools are widely used as a form of evidence in variant interpretation and are supported by clinical guidelines from the American College of Medical Genetics and Genomics (ACMG) and the Association for Molecular Pathology (AMP) [8]. Beyond these general-purpose predictors, significant research efforts have been invested in addressing challenges specific to a variant type or disease.
For example, VEST-indel, focusing on insertions and deletions (indels), achieved sufficient accuracy to aid in clinical classification [9, 10]. PathoPredictor, on the other hand, is a disease-specific ensemble classifier trained on data from distinct patient cohorts, demonstrating the value of tailored models over general-purpose predictors [11].

Table 1. Summary of conventional methods of pathogenicity prediction.

| Tool | Category | Brief Description |
|---|---|---|
| SIFT [5] | Foundational (Sequence Conservation) | Predicts if an amino acid substitution affects protein function based on sequence homology. It is an early, probabilistic-based method. |
| PolyPhen-2 [6] | Foundational (Sequence & Structure) | Predicts the effect of an amino acid substitution using sequence homology, Pfam annotations, and 3D structures. It is also a probabilistic-based method. |
| CADD [7] | Meta-predictor | Integrates multiple annotations, including conservation and functional information, into a single score for deleteriousness. |
| GERP [32] | Foundational (Sequence Conservation) | Identifies evolutionarily constrained regions to score how tolerant a position is to substitution. Its scores are often used as features in other tools. |
| MutationTaster [33] | Foundational | A general-purpose tool for predicting the impact of various variant types. |
| REVEL [22] | Meta-predictor / Ensemble | An ensemble method that combines scores from multiple tools to predict the pathogenicity of rare missense variants. |
| MetaLR & MetaSVM [34] | Meta-predictor / Ensemble | Ensemble methods that integrate multiple deleteriousness scores and allele frequency data using logistic regression or a support vector machine. |
| ClinPred [35] | Meta-predictor / Ensemble | A prediction tool that incorporates conservation scores, other prediction scores, and allele frequencies to identify disease-relevant variants. |

PathoPredictor is also one of many recent predictors that have leveraged machine learning (ML) and deep learning (DL) to capture complex genomic patterns and improve predictive power. Another prominent example of the group is AlphaMissense, which adapts the state-of-the-art AlphaFold protein structure model to predict the functional impact of missense variants [12]. GNN-MAP, on the other hand, uses graph neural networks (GNNs) to model the relationships between different variants [13]. This approach enables GNN-MAP to predict pathogenicity for multiple variant types (missense, stop-gain, frameshift) and has shown strong performance on rare variants. However, the aforementioned methods face inherent limitations: they typically focus on individual types of variants or features, require extensive domain expertise to engineer appropriate features, and struggle to integrate the increasingly complex landscape of genomic annotations into unified predictions.

In the last few years, Large Language Models (LLMs) have demonstrated unprecedented ability to capture semantic relationships within textual data, leading to breakthroughs in machine translation, question answering, and document understanding [14]. They have also been successfully applied in the field of genomic medicine. One application is to apply language models to the clinical narratives associated with variants. For instance, ClinVar-BERT is a model fine-tuned on millions of free-text clinical summaries from the ClinVar database [15]. By training a biomedical-specific BERT model to discern patterns in the evidence described in these reports, its authors successfully re-classified VUS in a way that correlated with experimental functional data. This work demonstrated that unstructured clinical text contains rich, learnable signals of pathogenicity that can be captured by language models. Another powerful approach treats biological sequences themselves as a language.
Protein language models (pLMs) like ESM-1b are pre-trained on hundreds of millions of protein sequences and learn the fundamental "grammar" of protein structure and function [16]. Building on this, Lin et al. developed VariPred, which uses embeddings from a pLM to predict the impact of missense mutations [17]. By comparing the embeddings of the wild-type and mutant protein sequences, VariPred achieves state-of-the-art performance using only sequence information as input, effectively bypassing the need for explicit structural or evolutionary features. Similarly, Brandes et al. showed that the ESM-1b model could be used directly, without fine-tuning, to predict the effects of all possible missense variants, in-frame indels, and even isoform-specific effects, showcasing the remarkable generalizability of these models [16].

A New Paradigm: Semantic Embeddings for Variant Interpretation

This study introduces a fundamentally different approach using LLMs, one that treats comprehensive genomic annotations as a structured narrative rather than discrete features. We name the framework VUS.Life. Our framework transforms variant annotations into natural language descriptions, then leverages pre-trained language models to generate semantic embeddings that capture nuanced relationships between different types of genomic evidence. With VUS.Life, we conceptualize each genetic variant as existing within a vast semantic space, where variants with similar functional consequences and clinical implications naturally cluster together. In this space, the pathogenicity of a newly discovered variant can be inferred from its semantic similarity to variants with established clinical significance, mirroring how expert clinical geneticists integrate multiple lines of evidence and recognize patterns from experience.
Unlike most, if not all, of the aforementioned tools, which predict the deleteriousness of a variant (which is not equivalent to pathogenicity), VUS.Life is a framework that predicts pathogenicity directly.

However, biological and clinical narratives alone cannot fully capture the thermodynamic consequences of genetic variants. To bridge the gap between clinical annotation and molecular reality, we integrated evolutionary scale modeling (ESM). By calculating the vector difference between wild-type and mutant embeddings (delta embeddings), we hypothesize that AI can isolate the specific "structural insult" of a variant, distinguishing between mechanisms like protein unfolding and functional domain perturbation.

Methods

VUS.Life Framework Development and Validation

Variant Datasets

To develop and validate the VUS.Life framework, we analyzed more than 10,000 variants with established pathogenicity classifications in BRCA1 (Breast Cancer gene 1), BRCA2 (Breast Cancer gene 2), FBN1 (Fibrillin 1; Marfan syndrome), ATM (Ataxia Telangiectasia Mutated), and PALB2 (Partner And Localizer Of BRCA2), sourced from BRCA Exchange [18] and ClinVar [19] (Table 2; Supplementary_Data.xls). BRCA1 and BRCA2 variants were sourced from BRCA Exchange, while variants for FBN1, ATM, and PALB2 were sourced from ClinVar. For ClinVar entries, we filtered for "pathogenic/likely pathogenic" and "benign/likely benign" classifications with at least a two-star review status (criteria provided, multiple submitters, no conflicts).

Table 2. Summary of variant datasets used for model training.

| Gene | Total Variants | Pathogenic | Benign | Pathogenic % |
|---|---|---|---|---|
| BRCA1 | 3,323 | 2,242 | 1,081 | 67.47% |
| BRCA2 | 4,081 | 2,645 | 1,436 | 64.81% |
| FBN1 | 1,532 | 768 | 764 | 50.13% |
| ATM | 4,346 | 1,671 | 2,675 | 38.45% |
| PALB2 | 1,598 | 752 | 846 | 47.06% |

Annotation Fetch and Processing

Variants were annotated using the Variant Effect Predictor (VEP) with an extensive suite of annotation sources and prediction tools [20]. The VEP configuration was designed to capture a wide range of functional, evolutionary, and predictive information relevant to variant pathogenicity. Key annotation categories included:

- Core Annotations: Canonical transcript identification, predicted consequence (e.g., missense, frameshift), protein-level effects, and splicing predictions (e.g., SpliceAI [21]).
- Pathogenicity Prediction Tools: Scores from established predictors such as CADD [7], REVEL [22], AlphaMissense [12], SIFT [5], and PolyPhen-2 [6].
- Regulatory and Functional Annotations: Gene Ontology (GO) terms, pathway information, protein domain annotations, tissue-specific expression data, and loss-of-function (LoF) predictions.

To prepare the data for language model processing, we systematically converted the structured VEP JSON annotations into a consistent, semi-structured textual format. This was achieved using a custom template that serializes the nested JSON data into a human-readable, key-value report. The template organizes information into logical sections, such as variant identifiers, transcript-level consequences, population frequencies, and scores from various prediction tools (e.g., CADD, SpliceAI, AlphaMissense), which ensures that every annotation field is represented uniformly. By converting numerical scores, flags, and identifiers into a unified textual string, we create a comprehensive input format that enables the embedding model to learn the semantic relationships between disparate types of genomic evidence.
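
The serialization step described above can be illustrated with a minimal sketch. It is not the authors' template: the function names `clauses` and `serialize_variant`, and the simplified dictionary layout (a top-level `frequencies` mapping, a short `transcript_consequences` list), are assumptions made for illustration and only loosely follow the shape of VEP JSON output.

```python
def clauses(record, prefix=""):
    """Render a flat dict as 'The <key> is <value>' clauses joined by '; '."""
    parts = [f"The {prefix}{key.replace('_', ' ')} is {value}" for key, value in record.items()]
    return "; ".join(parts) + "."


def serialize_variant(annotation):
    """Serialize a nested, VEP-like annotation dict into one key-value text report.

    The field names used here are a simplified, hypothetical layout.
    """
    lines = [
        f"Variant ID: {annotation['id']}.",
        f"Variant class: {annotation['variant_class']}.",
        f"Most severe consequence: {annotation['most_severe_consequence']}.",
    ]
    # One section per transcript consequence, numbered from 1.
    for i, tc in enumerate(annotation.get("transcript_consequences", []), start=1):
        lines.append(f"Transcript consequence #{i}: {clauses(tc)}")
    # One section per allele with its population frequencies.
    for allele, freqs in annotation.get("frequencies", {}).items():
        freq_text = "; ".join(f"The frequency in {pop} is {f}" for pop, f in freqs.items())
        lines.append(f"Population frequencies: Allele: {allele}. {freq_text}.")
    return " ".join(lines)


example = {
    "id": "NC_000017.11:g.43046253G>A",
    "variant_class": "SNV",
    "most_severe_consequence": "intron_variant",
    "transcript_consequences": [
        {"transcript_id": "NM_007294.4", "gene_symbol": "BRCA1", "biotype": "protein_coding"}
    ],
    "frequencies": {"A": {"gnomadg_asj": 0.02244, "amr": 0.0043}},
}
report = serialize_variant(example)
print(report)
```

Because every field passes through the same sentence pattern, numeric scores and categorical flags end up in one string that a text embedding model can consume without per-field feature engineering.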

An abbreviated example of the generated text for an intronic variant is shown below:

Variant ID: NC_000017.11:g.43046253G>A. Variant class: SNV. Most severe consequence: intron_variant. Transcript consequence #1: The transcript id is NM_007294.4; The gene symbol is BRCA1; The gene id is 672; The gene symbol source is EntrezGene; The biotype is protein_coding. SpliceAI scores: The SpliceAI SYMBOL is BRCA1; The SpliceAI DS AG is 0.0; The SpliceAI DS AL is 0.0. Population frequencies: Allele: A. The frequency in gnomadg_asj is 0.02244; The frequency in amr is 0.0043; The frequency in gnomadg_ami is 0.008216; The frequency in gnomadg_afr is 0.002173.

Semantic Embedding Models and Generation

Three distinct embedding models were selected to capture different aspects of semantic representation:

- all-mpnet-base-v2 (MPNet) [23]: A Transformer-based model pre-trained using masked and permuted language modeling. With 768 dimensions and a 512-token limit, it excels at semantic similarity tasks, balancing sentence structure understanding.
- MedEmbed-large-v0.1 (MedEmbed) [24]: A domain-specific Transformer trained on biomedical literature, offering 1,024 dimensions and a 512-token limit. Its specialization provides a nuanced understanding of medical and genomic terminology.
- text-embedding-004 (Google Embeddings) [25]: Google's state-of-the-art general-purpose embedding model, generating 768-dimensional embeddings with a 2,048-token limit, performing exceptionally well on diverse semantic tasks.

The embedding generation process transforms each variant's description into vector representations using these models. These embeddings are stored in ChromaDB, a vector database optimized for similarity search and retrieval. Accompanying metadata includes variant identifiers, genomic coordinates, pathogenicity classifications, confidence scores, gene annotations, functional consequences, and timestamps of processing.
Vector indices are tuned for cosine similarity search, with parameters adjusted to each model's dimensionality.

Integrating Structural and Evolutionary Fitness Landscapes

To augment semantic annotation embeddings, we performed a deep-learning-based structural analysis of FBN1 variants using the ESMC-600M (600 million parameters) protein language model [26]. While full-sequence embeddings provide a global overview, our analysis demonstrated that residue-level delta embeddings, which isolate the specific vector difference at the mutation site, offered the most robust discrimination between functional categories of mutations.

Structural and Evolutionary Modeling

To capture the thermodynamic and evolutionary constraints of missense variants, we integrated the ESMC-600M protein language model [26]. We focused this analysis on FBN1, implementing a specific pipeline to handle its large protein size (2,871 aa) and distinct pathogenic mechanisms.

Residue-Level Delta Embeddings

We hypothesized that the pathogenicity signal is best captured by the vector difference between the mutant and wild-type states, rather than the absolute representation of the mutant alone. We defined the Residue-Level Delta Embedding ($\Delta E$) at the mutation site $\text{target\_idx}$ as:

$$\Delta E = E_{\text{mutant}}[\text{target\_idx}] - E_{\text{wildtype}}[\text{target\_idx}] \tag{1}$$

This subtraction isolates the "structural insult" induced by the variant, removing the background noise of the protein's sequence identity. Because the sequence length of Fibrillin-1 exceeds standard context windows, we utilized a sliding window tiling strategy (overlapping windows of ≈1000 aa) to generate these embeddings while preserving local structural context.
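
Eq. (1) combined with the sliding-window tiling can be sketched as follows. This is a hedged illustration, not the authors' pipeline: the 500-residue overlap, the rule of choosing the window whose centre is nearest the mutation site, and the `toy_embed` function (a stand-in for a real protein language model such as ESMC-600M) are all assumptions.

```python
def tile_windows(seq_len, window=1000, overlap=500):
    """Return half-open (start, end) windows that cover [0, seq_len) with overlap."""
    if overlap >= window:
        raise ValueError("overlap must be smaller than window")
    step = window - overlap
    windows, start = [], 0
    while True:
        end = min(start + window, seq_len)
        windows.append((start, end))
        if end == seq_len:
            return windows
        start += step


def best_window(windows, target_idx):
    """Pick the window containing target_idx whose centre is closest to it (assumed rule)."""
    containing = [w for w in windows if w[0] <= target_idx < w[1]]
    return min(containing, key=lambda w: abs((w[0] + w[1]) / 2 - target_idx))


def delta_embedding(wild_type, mutant, target_idx, embed, window=1000, overlap=500):
    """Eq. (1): E_mutant[target_idx] - E_wildtype[target_idx], computed inside one window."""
    if len(wild_type) != len(mutant):
        raise ValueError("a missense substitution must preserve sequence length")
    start, end = best_window(tile_windows(len(wild_type), window, overlap), target_idx)
    local = target_idx - start  # position of the mutation inside the window
    e_wt = embed(wild_type[start:end])[local]
    e_mut = embed(mutant[start:end])[local]
    return [m - w for m, w in zip(e_mut, e_wt)]


def toy_embed(seq):
    """Hypothetical stand-in for a protein language model: one 2-d vector per residue."""
    return [[float(ord(aa) - ord("A")), float(len(seq))] for aa in seq]


wild = "ACDE" * 10                    # 40-residue toy protein
mut = wild[:5] + "Y" + wild[6:]       # C -> Y substitution at index 5
delta = delta_embedding(wild, mut, 5, toy_embed, window=16, overlap=8)
print(tile_windows(2871))             # windows for a 2,871-aa protein such as FBN1
print(delta)
```

Embedding the same window for both wild-type and mutant keeps the surrounding context identical, so the difference at `target_idx` reflects only the substitution.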

Mechanistic Categorization via Structural Interaction Type

To investigate whether the embedding space captures specific biophysical mechanisms, variants were stratified by the structural role of the wild-type residue within the protein fold, using their annotations in the UMD-FBN1 mutations database ( http://www.umd.be/FBN1/ ):

- Disulfide Bond Disrupting: Mutations affecting cysteine residues that form covalent disulfide bridges, which are critical for stabilizing the tertiary structure of FBN1 domains.
- Calcium Binding: Mutations located within high-affinity Ca2+ binding motifs (e.g., within cbEGF domains), which facilitate domain rigidity and proper inter-domain alignment.

Evolutionary Fitness Scoring (Zero-Shot LLR)

To quantify evolutionary constraint without supervision, we calculated the Log-Likelihood Ratio (LLR) using the model's 'sequence_head'. The LLR compares the probability of the mutant amino acid ($x_{\text{mut}}$) against the wild-type ($x_{\text{wt}}$) given the surrounding context:

$$\text{LLR} = \log P(x_{\text{mut}} \mid \text{context}) - \log P(x_{\text{wt}} \mid \text{context}) \tag{2}$$

Negative LLR values indicate evolutionary intolerance (potential pathogenicity), while values near zero imply neutrality.

Evaluation Metrics and Implementation

Dimensionality Reduction and Visualization

To analyze the high-dimensional embeddings, we applied three complementary dimensionality reduction techniques: Principal Component Analysis (PCA) [27], t-Distributed Stochastic Neighbor Embedding (t-SNE) [28], and Uniform Manifold Approximation and Projection (UMAP) [29]. PCA projects data into a visualizable space by preserving major variance components. t-SNE emphasizes local neighborhood relationships, while UMAP balances local and global structure preservation. Visualizations revealed distinct clusters of variants with similar pathogenicity, suggesting that the embedding models effectively capture pathogenicity-related features.
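
Returning to the zero-shot LLR of Eq. (2), the calculation at a single residue position can be sketched as below. This is a minimal illustration under assumptions: `logits_at_site` stands for the per-amino-acid logits that a model head would emit at the mutation site, and the four-letter `vocab` is a toy vocabulary rather than the real tokenizer.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def zero_shot_llr(logits_at_site, vocab, wt_aa, mut_aa):
    """Eq. (2): log P(mutant | context) - log P(wild-type | context)."""
    log_probs = log_softmax(logits_at_site)
    return log_probs[vocab.index(mut_aa)] - log_probs[vocab.index(wt_aa)]


vocab = ["A", "C", "G", "Y"]          # toy vocabulary (assumption)
logits = [0.1, 3.0, 0.2, -1.0]        # hypothetical logits at the mutation site
llr = zero_shot_llr(logits, vocab, wt_aa="C", mut_aa="Y")
print(llr)                            # negative: the model prefers the wild-type cysteine
```

Since both terms share the same normalizing constant, the LLR reduces to the difference between the two raw logits; the log-softmax is kept only to make the link to Eq. (2) explicit.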

The discriminative power was quantified by correlating embedding distances with pathogenicity differences, confirming tighter clustering of variants with similar clinical impacts. This structure supports the k-NN classification approach for predicting pathogenicity in variants of uncertain significance.

k-NN Pathogenicity Prediction

A k-NN classification approach was developed to predict variant pathogenicity using semantic embeddings. Performance was evaluated across multiple neighborhood sizes (k = 1, 5, 10, 15, 20) to identify gene-specific optimal k values. Cosine similarity was employed as the similarity measure due to its effectiveness in high-dimensional spaces and its robustness to vector magnitude variations. For each query variant, the algorithm identifies the k most similar variants with known pathogenicity, excluding the query variant from the training set to prevent data leakage. Predictions are based on majority voting, with confidence scores reflecting the proportion of votes supporting the predicted class.

The embedding space's ability to separate variants by pathogenicity was assessed by computing a clustering accuracy metric for known variants. For each variant with established pathogenicity, its k nearest neighbors were identified, and the percentage of neighbors sharing the same classification was calculated. The classification criteria are outlined in Table 3.

Table 3. Criteria for classifying pathogenicity based on the proportion of pathogenic and benign neighbors among the top-k nearest neighbors.

| Classification | Criteria (among k neighbors) | Confidence Score |
|---|---|---|
| Potentially Pathogenic | ≥ 90% pathogenic neighbors | Pathogenic ratio |
| Likely Pathogenic | ≥ 60% pathogenic neighbors (but < 90%) | Pathogenic ratio |
| Potentially Benign | ≥ 90% benign neighbors | Benign ratio |
| Likely Benign | ≥ 60% benign neighbors (but < 90%) | Benign ratio |
| Uncertain | Mixed or no clear majority | max(benign, pathogenic) ratio |

Distance Ratio Analysis

To quantitatively validate the clustering structure, we introduced the Distance Ratio metric. For any given variant, we calculate the median cosine distance to its k nearest neighbors of the same class ($D_{\text{same}}$) versus the opposite class ($D_{\text{opposite}}$):

$$\text{Distance Ratio} = \frac{D_{\text{same}}}{D_{\text{opposite}}} \tag{3}$$

By this definition, a ratio below 1.0 means that a variant lies closer to neighbors of its own class than to those of the opposite class, whereas a ratio of 1.0 or higher indicates separation failure.

Implementation Workflow

The pathogenicity prediction pipeline targets novel variants and variants of uncertain significance (VUS) using a modular, efficient workflow. Unknown variants are retrieved from ChromaDB using pathogenicity status filters, processed in batches for computational efficiency, and analyzed for their k (default: 20) nearest known variants (benign or pathogenic) via cosine similarity. Predictions are generated using a confidence-weighted voting system, with confidence scores reflecting prediction reliability. Visualizations employing PCA, UMAP, and t-SNE illustrate spatial relationships between known and unknown variants, aiding genetic counselors and clinicians in interpreting results (see the workflow in Fig. 1).

Results

Embedding Results

The embedding analysis establishes a robust framework for clinical variant interpretation via confidence-based stratification derived from neighborhood consistency patterns. Variants can be categorized as high-confidence (> 95% agreement among nearest neighbors), moderate-confidence (60–95% agreement), or boundary cases requiring expert review (< 60% agreement).
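
The k-NN vote, the Table 3 thresholds, and the Distance Ratio of Eq. (3) from the Methods can be sketched in a few lines. This is an illustrative sketch on toy 2-d vectors, not the authors' ChromaDB-backed implementation; the function names and the in-memory `known` dictionary are assumptions.

```python
import math
from statistics import median


def cosine_distance(u, v):
    """1 - cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def nearest(query, known, k, exclude=None):
    """k closest (distance, label) pairs; known maps variant id -> (vector, label)."""
    scored = [(cosine_distance(query, vec), label)
              for vid, (vec, label) in known.items() if vid != exclude]
    return sorted(scored)[:k]


def classify(labels):
    """Apply the Table 3 thresholds; integer arithmetic avoids float edge cases."""
    n, p, b = len(labels), labels.count("pathogenic"), labels.count("benign")
    if 10 * p >= 9 * n:
        return "Potentially Pathogenic", p / n
    if 10 * p >= 6 * n:
        return "Likely Pathogenic", p / n
    if 10 * b >= 9 * n:
        return "Potentially Benign", b / n
    if 10 * b >= 6 * n:
        return "Likely Benign", b / n
    return "Uncertain", max(b, p) / n


def distance_ratio(vid, known, k):
    """Eq. (3): median distance to same-class neighbours over opposite-class neighbours."""
    vec, label = known[vid]
    same = {i: v for i, v in known.items() if v[1] == label}
    other = {i: v for i, v in known.items() if v[1] != label}
    d_same = median(d for d, _ in nearest(vec, same, k, exclude=vid))
    d_opposite = median(d for d, _ in nearest(vec, other, k))
    return d_same / d_opposite


known = {  # toy embeddings (assumption), two well-separated classes
    "b1": ([1.0, 0.0], "benign"), "b2": ([0.9, 0.1], "benign"), "b3": ([0.8, 0.2], "benign"),
    "p1": ([0.0, 1.0], "pathogenic"), "p2": ([0.1, 0.9], "pathogenic"), "p3": ([0.2, 0.8], "pathogenic"),
}
votes = [label for _, label in nearest([0.95, 0.05], known, k=3)]
print(classify(votes))                      # query sits inside the benign cluster
print(distance_ratio("b1", known, k=2))     # below 1.0 for a well-separated variant
```

Passing `exclude=vid` reproduces the leave-one-out rule described in the Methods, so a known variant never votes for itself.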
BRCA1 analysis revealed that 2–5% of variants fell into boundary cases across models, highlighting cases that may benefit from additional clinical review or functional studies. BRCA2 analysis showed fewer boundary cases (0.4–2.6%), indicating stronger semantic clustering and higher prediction confidence. The gradient of neighborhood agreement serves as a natural quality control mechanism for clinical workflows, enabling automated processing of high-confidence predictions while flagging uncertain cases for genetic counseling and expert interpretation. This approach addresses the critical need for reliable variant classification in hereditary cancer predisposition testing, where accurate interpretation directly influences clinical management decisions. Consistent performance across neighbor counts (k = 1, 5, 20) supports the methodology's robustness; the optimal k is likely gene-specific, balancing local precision against noise reduction (Fig. 1).

BRCA1 Semantic Embedding Performance

The k-nearest neighbors (k-NN) evaluation of BRCA1 variants (n = 3,311) demonstrated robust classification performance across all three embedding models, with systematic analysis confirming the effectiveness of semantic representations in capturing pathogenicity-relevant features (Fig. 2). The MPNet model achieved the highest overall accuracy at 97.9% with top 20 neighbors, with strong performance for both benign/likely benign variants (96.9%) and pathogenic/likely pathogenic variants (98.4%). MedEmbed showed comparable performance at 97.7% overall accuracy, while Google embeddings achieved 97.3% overall accuracy. Neighborhood consistency analysis revealed that 94.6% of benign variants and 97.4% of pathogenic variants had 80–100% agreement among their top 20 nearest neighbors when using MPNet, indicating robust semantic clustering.
The high neighborhood consensus across all models suggests that BRCA1 variant annotations contain sufficient semantic information to enable reliable automated classification. MedEmbed exhibited 92.5% consistency for benign variants achieving 80–100% neighborhood agreement, while Google embeddings achieved 92.6% consistency for benign variants. This consistent performance across different embedding approaches validates the robustness of the semantic representation framework for BRCA1 variant interpretation.

BRCA2 Semantic Embedding Performance

Our method demonstrated exceptional classification performance on BRCA2 variants (n = 4,074) across all embedding models, with MedEmbed achieving outstanding results of 99.1% overall accuracy (99.2% benign/likely benign, 99.1% pathogenic/likely pathogenic) (Fig. 3). The domain-specific medical embedding model's superior performance highlights the value of specialized pre-training for genetic variant analysis. MPNet achieved comparable excellence with 99.0% overall accuracy (98.3% benign/likely benign, 99.4% pathogenic/likely pathogenic), while Google embeddings maintained strong performance at 97.9% overall accuracy. Neighborhood distribution analysis revealed remarkable consistency, with MedEmbed showing 99.6% of benign variants and 98.8% of pathogenic variants achieving 80–100% agreement among top 20 neighbors. MPNet demonstrated equally impressive consistency with 97.4% of benign variants and 99.3% of pathogenic variants showing high neighborhood agreement. The larger sample size for BRCA2 variants appears to contribute to enhanced model training and more stable semantic representations, enabling the development of highly reliable prediction models for clinical applications. Google embeddings, while achieving the lowest performance among the three models, still maintained clinically relevant accuracy with 94.9% of benign variants showing strong neighborhood consensus.

FBN1 Semantic Embedding Performance

The FBN1 validation results demonstrate robust performance across all three embedding models, achieving consistently high accuracy rates exceeding 96% (MPNet: 96.7%, Google: 96.5%, MedEmbed: 96.6% at k = 5) (Fig. 4). These results are particularly significant as they validate the generalizability of our semantic embedding approach beyond the extensively characterized BRCA cancer predisposition genes to a gene associated with connective tissue disorders whose pathogenic variants are mostly missense rather than loss-of-function. The dimensional reduction visualizations reveal distinct clustering patterns for FBN1 variants, with Google embeddings showing the most compact and well-separated clusters in t-SNE and UMAP projections, while MPNet and MedEmbed demonstrated more distributed but still discriminative patterns. Notably, the FBN1 performance closely mirrors that observed for BRCA1 (97.3–97.9% across the three models at k = 20; Table 4), suggesting that the embedding approach maintains effectiveness across different genetic contexts and disease mechanisms, even for genes that may have less comprehensive functional annotation compared to the intensively studied BRCA1/2 genes. This validation supports the broader clinical utility of semantic embeddings for variant interpretation across diverse genetic conditions, which is particularly important for rare disease contexts where sparse training data may limit traditional computational approaches.

Generalization to ATM and PALB2

To further stress-test the framework's generalizability, we evaluated PALB2 (n = 1,598) (Supplementary Fig. 1) and ATM (n = 4,346) (Supplementary Fig. 2) (Supplementary_Data.xls). Despite differing biological functions and mutation spectra, VUS.Life maintained high predictive accuracy. As shown in Table 4, the ATM gene achieved 98.0–98.4% accuracy across all models at k = 5, while PALB2 achieved 97.2–98.4%.
The consistency of these results, which are comparable to those for the highly curated BRCA datasets, suggests that the semantic embedding approach captures general pathogenicity signals rather than artifacts of specific gene curation histories.

Table 4. Clustering accuracy of semantic embedding models for predicting variant pathogenicity across genes. Accuracy is measured by the percentage of k-nearest neighbors sharing the same known pathogenicity classification. Higher accuracy indicates better separation of pathogenic and benign variants.

| Gene | Model | k = 5 | k = 10 | k = 15 | k = 20 |
|---|---|---|---|---|---|
| BRCA1 | MedEmbed-large-v0.1 | 0.981 | 0.980 | 0.978 | 0.977 |
| BRCA1 | google-embedding | 0.980 | 0.977 | 0.974 | 0.973 |
| BRCA1 | all-mpnet-base-v2 | 0.984 | 0.983 | 0.981 | 0.979 |
| BRCA2 | MedEmbed-large-v0.1 | 0.993 | 0.993 | 0.992 | 0.991 |
| BRCA2 | google-embedding | 0.985 | 0.983 | 0.981 | 0.979 |
| BRCA2 | all-mpnet-base-v2 | 0.992 | 0.991 | 0.990 | 0.990 |
| FBN1 | MedEmbed-large-v0.1 | 0.966 | 0.965 | 0.964 | 0.963 |
| FBN1 | google-embedding | 0.965 | 0.964 | 0.961 | 0.959 |
| FBN1 | all-mpnet-base-v2 | 0.967 | 0.962 | 0.961 | 0.959 |
| ATM | MedEmbed-large-v0.1 | 0.984 | 0.982 | 0.979 | 0.975 |
| ATM | google-embedding | 0.980 | 0.978 | 0.975 | 0.975 |
| ATM | all-mpnet-base-v2 | 0.983 | 0.981 | 0.981 | 0.978 |
| PALB2 | MedEmbed-large-v0.1 | 0.980 | 0.979 | 0.979 | 0.979 |
| PALB2 | google-embedding | 0.972 | 0.960 | 0.952 | 0.946 |
| PALB2 | all-mpnet-base-v2 | 0.984 | 0.984 | 0.983 | 0.981 |

Pathogenicity Prediction using Annotation Embeddings

The prediction results for 3,000 "not-yet-reviewed" BRCA1 variants and 3,000 "not-yet-reviewed" BRCA2 variants (Supplementary_Data.xls) demonstrate the promising potential of semantic embeddings for variant pathogenicity classification (see Figs. 5 & 6). Across all three embedding models (MPNet, Google embeddings, and MedEmbed), the unknown variants (green diamonds) exhibit spatial distributions that closely mirror the established clustering patterns of known benign (blue) and pathogenic (red) variants. To illustrate the practical application of this approach, we examined prediction results from 10 randomly selected variants each from the BRCA1 and BRCA2 datasets (Tables 5 & 6).
The results demonstrate strong model concordance and biologically coherent predictions across different variant types. For BRCA2 variants (Table 6), all three embedding models achieved perfect agreement on clearly deleterious variants such as a frameshift (NC_000013.11:g.32396971_32396972dup) and a stop-gained mutation (NC_000013.11:g.32339409C>G), both receiving unanimous pathogenic predictions with 1.00 confidence scores based on k-nearest neighbor counts of B:0, P:20. Intronic variants received benign classifications across all models, in most cases with maximum confidence, while missense variants showed more nuanced predictions with generally high confidence but occasional inter-model variation in neighbor compositions.

Table 5. Predicted pathogenicity of randomly selected unknown variants in BRCA1 (NC_000017.11) using different semantic embedding models. For each model, the columns give the benign (B) and pathogenic (P) neighbor counts, the confidence score, and the prediction.

| Genomic HGVS | Consequence | MedEmbed Count | MedEmbed Conf. Score | MedEmbed Pred | Google Count | Google Conf. Score | Google Pred | MPNet Count | MPNet Conf. Score | MPNet Pred |
|---|---|---|---|---|---|---|---|---|---|---|
| g.43094629_43095075del | splice_acceptor_variant | B:16, P:4 | 0.800 | benign | B:2, P:18 | 0.900 | pathogenic | B:10, P:10 | 0.500 | uncertain |
| g.43102853C>G | intron_variant | B:20, P:0 | 1.000 | benign | B:20, P:0 | 1.000 | benign | B:20, P:0 | 1.000 | benign |
| g.43067120C>T | intron_variant | B:20, P:0 | 1.000 | benign | B:20, P:0 | 1.000 | benign | B:20, P:0 | 1.000 | benign |
| g.43093771_43093774delinsTC | frameshift_variant | B:0, P:20 | 1.000 | pathogenic | B:0, P:20 | 1.000 | pathogenic | B:0, P:20 | 1.000 | pathogenic |
| g.43061638A>G | intron_variant | B:20, P:0 | 1.000 | benign | B:20, P:0 | 1.000 | benign | B:20, P:0 | 1.000 | benign |
| g.43065397G>A | intron_variant | B:20, P:0 | 1.000 | benign | B:19, P:1 | 0.950 | benign | B:20, P:0 | 1.000 | benign |
| g.43067430del | intron_variant | B:20, P:0 | 1.000 | benign | B:8, P:12 | 0.600 | pathogenic | B:20, P:0 | 1.000 | benign |
| g.43074502G>T | missense_variant | B:17, P:3 | 0.850 | benign | B:13, P:7 | 0.650 | benign | B:16, P:4 | 0.800 | benign |
| g.43094690T>C | missense_variant | B:20, P:0 | 1.000 | benign | B:18, P:2 | 0.900 | benign | B:20, P:0 | 1.000 | benign |
| g.43115774T>G | missense_variant | B:12, P:8 | 0.600 | benign | B:5, P:15 | 0.750 | pathogenic | B:16, P:4 | 0.800 | benign |

Table 6. Predicted pathogenicity of randomly selected unknown variants in BRCA2 (NC_000013.11) using different semantic embedding models. Columns are as in Table 5.

| Genomic HGVS | Consequence | MedEmbed Count | MedEmbed Conf. Score | MedEmbed Pred | Google Count | Google Conf. Score | Google Pred | MPNet Count | MPNet Conf. Score | MPNet Pred |
|---|---|---|---|---|---|---|---|---|---|---|
| g.32353730A>G | intron_variant | B:20, P:0 | 1.000 | benign | B:20, P:0 | 1.000 | benign | B:20, P:0 | 1.000 | benign |
| g.32340247G>T | missense_variant | B:18, P:2 | 0.900 | benign | B:20, P:0 | 1.000 | benign | B:20, P:0 | 1.000 | benign |
| g.32363243A>G | missense_variant | B:19, P:1 | 0.950 | benign | B:15, P:5 | 0.750 | benign | B:20, P:0 | 1.000 | benign |
| g.32396971_32396972dup | frameshift_variant | B:0, P:20 | 1.000 | pathogenic | B:0, P:20 | 1.000 | pathogenic | B:0, P:20 | 1.000 | pathogenic |
| g.32370904A>T | intron_variant | B:20, P:0 | 1.000 | benign | B:20, P:0 | 1.000 | benign | B:20, P:0 | 1.000 | benign |
| g.32322227A>G | intron_variant | B:20, P:0 | 1.000 | benign | B:20, P:0 | 1.000 | benign | B:20, P:0 | 1.000 | benign |
| g.32385434_32385445dup | intron_variant | B:20, P:0 | 1.000 | benign | B:15, P:5 | 0.750 | benign | B:20, P:0 | 1.000 | benign |
| g.32369761A>T | intron_variant | B:20, P:0 | 1.000 | benign | B:20, P:0 | 1.000 | benign | B:20, P:0 | 1.000 | benign |
| g.32380102G>A | synonymous_variant | B:20, P:0 | 1.000 | benign | B:18, P:2 | 0.900 | benign | B:20, P:0 | 1.000 | benign |
| g.32339409C>G | stop_gained | B:0, P:20 | 1.000 | pathogenic | B:0, P:20 | 1.000 | pathogenic | B:0, P:20 | 1.000 | pathogenic |

The BRCA1 results (Table 5) reveal similar patterns but with some instructive disagreements that highlight the method's sensitivity to semantic context. Most notably, a splice acceptor variant (NC_000017.11:g.43094629_43095075del) demonstrated model disagreement, with MedEmbed predicting benign (B:16, P:4, confidence 0.800), Google embeddings predicting pathogenic (B:2, P:18, confidence 0.900), and MPNet remaining uncertain (B:10, P:10, confidence 0.500).
This variant exemplifies cases where the semantic interpretation of clinical annotations may vary, potentially reflecting different emphasis on splice site disruption versus other contextual factors in the training literature.

BRCA1 Results

The unknown variants distribute throughout the embedding space in a manner consistent with the trained pathogenicity classes. In the MPNet and Google embedding visualizations, unknown variants segregate into regions predominantly occupied by either benign or pathogenic training variants, particularly evident in the t-SNE projections, where distinct clusters emerge. The MedEmbed results show a more dispersed but still meaningful distribution, with unknown variants positioning themselves in semantically appropriate regions of the embedding space.

BRCA2 Results

Similar patterns are observed for BRCA2, where unknown variants demonstrate clear spatial alignment with their predicted pathogenicity classes. The Google embeddings show particularly strong clustering behavior, with unknown variants forming coherent groups that align with either the benign or pathogenic regions. The UMAP projections across all models reveal that unknown variants maintain the same topological relationships as the training data, suggesting robust semantic capture of pathogenicity-relevant features.

FBN1 Results

The 2015 ACMG/AMP guidelines for variant interpretation do not fully account for gene-specific characteristics. Since 2019, the ClinGen FBN1 Variant Curation Expert Panel (VCEP) has invested substantial effort in developing consensus recommendations tailored to the unique features of this disease gene and its pathogenic variants [30]. To evaluate these new guidelines, the panel conducted a pilot study on 60 representative, challenging variants. This effort involved collaboration among three core sites and six non-core institutions, demonstrating improved interpretive consistency compared to the 2015 ACMG/AMP framework.
Of these 60 variants, 30 overlapped with the 1,532 variants used to construct the FBN1 vector embedding space. Pathogenicity prediction was therefore performed only on the remaining 30 variants. These 30 unresolved variants segregated into established pathogenic or benign clusters, particularly evident in the UMAP projection using MPNet embeddings (Fig. 7). All 30 variants received either pathogenic or benign classifications (FBN1VCEP Tab of Supplementary_Data.xls). For 10 variants, the FBN1 VCEP (at core sites, non-core sites, or both) had assigned a VUS classification under the updated guidelines. Our algorithm, using the best-performing MPNet model for FBN1, classified 7 of these as pathogenic and 3 as benign. For the remaining 20 variants, MPNet predictions agreed with VCEP classifications for 17 (85%). We examined the three discordant calls, all of which are missense variants predicted as pathogenic by our algorithm with the MPNet model but classified as benign by the FBN1 VCEP. Notably, ClinVar lists all three as VUS, with submissions from more than four independent laboratories each, underscoring the uncertainty surrounding their true pathogenicity.

We further validated the framework using a subset of FBN1 variants linked to specific cardiovascular phenotypes (e.g., aortic dilation, mitral valve prolapse) retrieved from the UDM database. From an initial collection of approximately 600 variants, we identified 308 that were already present in our ClinVar training set with high-quality review status; these were excluded from the validation set to prevent data leakage. The remaining 177 variants served as an independent test set (the FBN1 with cardiovascular tab of Supplementary Data.xls). Pathogenicity predictions were generated using k = 5 nearest neighbors (Table 7). The all-mpnet-base-v2 model achieved the highest sensitivity, correctly identifying 98.31% (n = 174) of these pathogenic variants.
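The sensitivities in Table 7 are the number of test variants called pathogenic divided by the 177 test variants, all of which are treated as pathogenic. A short check of that arithmetic follows; the helper name `sensitivity` is hypothetical, and the counts are those listed in Table 7.

```python
def sensitivity(called_pathogenic, total_pathogenic):
    """Fraction of truly pathogenic variants that were called pathogenic."""
    if total_pathogenic <= 0:
        raise ValueError("total_pathogenic must be positive")
    return called_pathogenic / total_pathogenic


# Counts of variants predicted pathogenic, as listed in Table 7 (n = 177).
table7_counts = {
    "all-mpnet-base-v2": 174,
    "google-embedding": 159,
    "MedEmbed-large-v0.1": 156,
}

for model, n_called in table7_counts.items():
    print(f"{model}: {100 * sensitivity(n_called, 177):.2f}%")
```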
The all-mpnet-base-v2 sensitivity was higher than that of Google Embedding (89.83%) and MedEmbed-large-v0.1 (88.14%). The superior performance of MPNet suggests that general-purpose semantic models may be more robust than domain-specific models in capturing the nuanced textual descriptions associated with complex phenotypic presentations. The spatial distribution of these test variants is visualized in Fig. 8, and detailed results are presented in Supplementary Table X.

Table 7 Sensitivity of pathogenicity prediction for FBN1 variants with variable cardiovascular involvement (UDM dataset, n = 177, k = 5).

| | all-mpnet-base-v2 | google-embedding | MedEmbed-large-v0.1 |
|---|---|---|---|
| Prediction of pathogenic (n) | 174 | 159 | 156 |
| Prediction of pathogenic (%) | 98.31% | 89.83% | 88.14% |

Unsupervised Disentanglement of Pathogenic Mechanisms via Protein Language Models

To determine whether semantic embeddings capture biologically meaningful structural disruptions rather than mere sequence heuristics, we analyzed the latent space of the ESMC-600M protein language model applied to FBN1 (Fibrillin-1). We hypothesized that a "residue-level delta embedding" (the vector difference between the mutant and wild-type representations at the specific amino acid site) would isolate the pathogenicity signal from the background noise of protein sequence identity.

Topological Separation of Mutation Subtypes

Dimensionality reduction (UMAP) of these delta embeddings revealed a striking, unsupervised segregation of variants based on their underlying molecular mechanism (HCD plot in Fig. 9).

Disulfide Bond Disruptions (Cysteine loss/gain): These variants formed a tightly compacted cluster, distinct from other pathogenic variants. This topological tightness suggests that the model recognizes a consistent "structural collapse" signature associated with the breaking of covalent cysteine bridges, a primary driver of protein instability in Marfan syndrome.
Calcium-Binding Alterations: Mutations affecting calcium-binding epidermal growth factor-like (cbEGF) domains formed a separate, more diffuse cluster. This spatial distinction implies that the model differentiates between the global destabilization caused by disulfide loss and the local rigidity/flexibility defects caused by calcium-binding disruptions.

Crucially, this mechanistic separation was achieved without providing the model with 3D structural coordinates or functional labels, suggesting that the protein language model has implicitly learned aspects of protein folding physics solely from evolutionary sequence patterns.

Convergence of Structural Impact and Evolutionary Fitness

To validate these structural embeddings against evolutionary constraints, we calculated Zero-Shot Log-Likelihood Ratios (LLR) for 216 missense variants of FBN1 associated with cardiovascular phenotypes. The LLR serves as a proxy for evolutionary fitness, quantifying how "unexpected" a mutation is given the protein’s context. We observed a high degree of concordance between the structural embedding space and evolutionary fitness scores (Evolutionary fitness landscape plot in Fig. 9):

The Pathogenicity Gradient: LLR scores ranged from −10.43 (highly deleterious) to +0.45 (neutral).

Structural-Evolutionary Coupling: Variants within the "disulfide disruption" structural cluster exhibited the most severe negative LLR scores (mean LLR: −3.77), with extreme outliers (< −7.0) exclusively confined to this region.

Clinical Implications of "Delta" Representations

Our comparison of embedding strategies revealed a critical methodological insight for clinical AI. Raw mutant embeddings were dominated by sequence identity, obscuring the pathogenic signal. By utilizing residue-level delta embeddings, we successfully subtracted the "background" biological context, leaving only the vector of the pathogenic insult.
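The two quantities discussed above, the residue-level delta embedding and the zero-shot LLR, can be written down compactly. The sketch below is illustrative only: it assumes that per-residue embeddings of shape (L, d) and per-position amino acid log-probabilities of shape (L, 20) have already been obtained from a protein language model, and the function names and array layouts are hypothetical. How the authors extract these arrays from ESMC-600M is not specified in this section.

```python
import numpy as np


def residue_delta_embedding(wt_embeddings, mut_embeddings, position):
    """Mutant minus wild-type embedding at one residue (0-based position).

    Both inputs are (L, d) arrays for sequences of equal length, which holds
    for missense substitutions.
    """
    wt = np.asarray(wt_embeddings, dtype=float)
    mut = np.asarray(mut_embeddings, dtype=float)
    if wt.shape != mut.shape:
        raise ValueError("wild-type and mutant embeddings must share a shape")
    return mut[position] - wt[position]


def zero_shot_llr(log_probs, position, wt_index, mut_index):
    """log P(mutant residue) - log P(wild-type residue) at one position.

    log_probs is an (L, 20) array of log-probabilities over amino acids.
    Negative values mean the model finds the mutant residue less likely
    than the wild-type residue in that sequence context.
    """
    lp = np.asarray(log_probs, dtype=float)
    return float(lp[position, mut_index] - lp[position, wt_index])
```

Taking the difference at the mutated site cancels whatever the two representations share, which is the stated motivation for preferring delta embeddings over raw mutant embeddings.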
This approach transforms the "black box" of the LLM into an interpretable tool capable of distinguishing between varying mechanisms of disease, such as haploinsufficiency (protein instability) versus dominant-negative effects, in silico, before experimental functional validation.

Distance Ratio Analysis Confirms Robust Class Separation

To quantitatively validate the class separation observed in the dimensionality reduction plots, we applied our Distance Ratio metric to the embeddings of BRCA1 and BRCA2 variants. The results confirm that all three embedding models produce a highly structured and discriminative semantic space. As shown in Fig. 10, the distributions of the Distance Ratio for both pathogenic (red) and benign (blue) variants are overwhelmingly concentrated to the left of the 1.0 threshold. This visually demonstrates that for most variants, the distance to same-class neighbors is smaller than the distance to different-class neighbors. This finding is supported by quantitative analysis (Table 8), which shows that at least 97.3% of variants in every model and class have a Distance Ratio less than 1.0, with overall averages above 98%, indicating effective clustering. The all-mpnet-base-v2 model had the highest overall average, with 98.7% of variants correctly clustered. While all models performed well, we observed that BRCA2 variants, which have more extensive annotation data, generally showed slightly better class separation than BRCA1 variants. This quantitative validation supports using the semantic embedding space for downstream classification and similarity search tasks.

Table 8 Percentage of variants with a Distance Ratio < 1.0, calculated with k = 20 nearest neighbors. Values represent the proportion of variants that are closer to their same-class neighbors than to opposite-class neighbors, indicating effective separation.

| Model | BRCA1 Pathogenic | BRCA1 Benign | BRCA2 Pathogenic | BRCA2 Benign | Overall Average |
|---|---|---|---|---|---|
| all-mpnet-base-v2 | 98.2% | 97.5% | 99.3% | 99.4% | 98.7% |
| google-embedding | 97.7% | 99.4% | 98.5% | 98.5% | 98.4% |
| MedEmbed-large-v0.1 | 97.3% | 99.5% | 98.8% | 99.7% | 98.6% |

Clinical Implications

The consistent distribution patterns across both genes and all embedding models strongly support the hypothesis that semantic embeddings can effectively capture the linguistic and clinical nuances present in variant annotations that distinguish pathogenic from benign variants. This spatial coherence between known and unknown variants indicates that the learned representations encode biologically meaningful features rather than spurious correlations, providing confidence in the method’s ability to generalize to variants of uncertain significance (VUS). The preservation of clustering structure suggests that clinicians could potentially leverage these embeddings not only for binary classification but also for confidence estimation and identification of variants requiring additional functional studies.

Discussion

In this study, we present VUS.Life, a unified framework designed to resolve the interpretation bottleneck for Variants of Uncertain Significance (VUS). By synergizing semantic text embeddings with protein language modeling, we achieved >96% clustering accuracy at k = 5 (Table 4) across five clinically distinct genes (BRCA1, BRCA2, ATM, PALB2, and FBN1). Crucially, we demonstrate that this performance stems not merely from memorizing clinical pathogenicity assertions or predictions, but from the model’s unsupervised ability to disentangle the mechanistic and evolutionary roots of pathogenicity.

Universality and Clinical Utility

Current clinical workflows often necessitate a "patchwork" of calculators, deploying structure-based models for missense changes, SpliceAI for intronic regions, and separate heuristics for truncations. A defining advantage of VUS.Life is its consequence-agnostic capability.
While state-of-the-art tools like AlphaMissense are restricted to missense variants, our framework processes the full spectrum of genomic variation. By transforming diverse annotations into a unified narrative, VUS.Life captures the "long tail" of non-coding and complex variants, from high-impact frameshifts to subtle regulatory changes, offering a scalable, single-model solution.

We validated this versatility using FBN1, a gene dominated by missense mutations where mechanism matters. In a pilot study of unresolved variants from the ClinGen FBN1 Variant Curation Expert Panel (VCEP), our model segregated difficult cases into established clusters, achieving 85% concordance with expert consensus (17 of the 20 variants that had a definitive VCEP classification). Furthermore, in an independent validation using phenotype-specific data (UDM database), the general-purpose all-mpnet-base-v2 model outperformed the domain-specific MedEmbed-large-v0.1 model (98.31% vs. 88.14% sensitivity). This suggests that the broad semantic understanding of a general-purpose language model captures complex phenotypic nuance better than models trained on narrower biomedical corpora.

Mechanistic Disentanglement and Evolutionary Calibration

Beyond binary classification, the future of genomic medicine requires a quantitative understanding of variant impact. As highlighted by recent work on the popEVE model, the breakthrough in rare disease diagnosis lies in the in-depth calibration of evolutionary records against human genetics [31]. We validated this paradigm within VUS.Life by integrating Zero-Shot Log-Likelihood Ratio (LLR) scoring via the ESMC-600M protein language model. By utilizing residue-level delta embeddings (which isolate the specific "structural insult" of a variant), the model spontaneously disentangled distinct pathogenic mechanisms without supervision. In FBN1, variants disrupting disulfide bonds (critical for protein stability) formed a tight topological cluster distinct from calcium-binding defects.
These structural clusters correlated strongly with evolutionary fitness scores (mean LLR: −3.77 for disulfide disruptions), confirming that the model captures the thermodynamic severity of mutations. This moves VUS interpretation from a static label to a calibrated, mechanism-aware risk assessment.

Limitations

Our approach has limitations. First, reliance on text descriptions makes the model sensitive to annotation quality and LLM token limits, potentially truncating extensive evidence for complex variants. Second, while our Distance Ratio analysis confirmed robust clustering (98% of variants < 1.0), boundary cases remain where semantic similarity alone may not capture subtle clinical distinctions. Finally, as with all supervised learning, the model inherits biases present in the training repositories (ClinVar/BRCA Exchange).

Conclusions

VUS.Life represents a significant step toward a "semantic genotype-phenotype map." By combining the breadth of clinical text with the depth of evolutionary protein modeling, we provide a robust, interpretable, extensible, and scalable framework for pathogenicity prediction. As we integrate emerging methods, such as ESMC-600M for evolutionary calibration, this tool holds the promise of transforming the interpretation of VUS from a source of uncertainty into a dynamic, quantitative understanding of human genetic disease.

Declarations

Ethics approval and consent to participate
Not applicable

Consent for publication
Not applicable

Availability of data and materials
A patent application was filed on June 10, 2025 - Xiaowu Gai, Jiawei Wu. Predicting Genetic Variant Pathogenicity Using Vector Embeddings (U.S. Patent Application No. 63/821.249). GitHub repository for code and demonstrations: https://github.com/MayaMua/vus-life.

Competing interests
The authors declare that they have no competing interests.

Funding
This work is primarily funded by the Medical College of Wisconsin.

Authors’ contributions
X.G.
acquired study funding, conceptualized the project, supervised the work, and oversaw manuscript preparation. J.W. contributed to the study conceptualization, designed the software architecture, implemented the VUS.Life framework, performed formal data analysis, and drafted the original manuscript. All other authors assisted with data curation and manuscript review. All authors have read and approved the final manuscript.

Acknowledgements
We acknowledge the essential contributions of public repositories, specifically ClinVar, BRCA Exchange, and gnomAD. We are also grateful to the open-source community and the developers of the MPNet, MedEmbed, ESMC, and Google embedding models, as well as the Ensembl VEP team, whose tools were integral to the VUS.Life framework.

References

1. Manolio, T.A., et al., Genomic medicine year in review: 2023. Am J Hum Genet, 2023. 110(12): p. 1992-1995.
2. Sayers, E.W., et al., Database resources of the National Center for Biotechnology Information. Nucleic Acids Res, 2021. 49(D1): p. D10-D17.
3. Karczewski, K.J., et al., The mutational constraint spectrum quantified from variation in 141,456 humans. Nature, 2020. 581(7809): p. 434-443.
4. Taliun, D., et al., Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program. Nature, 2021. 590(7845): p. 290-299.
5. Ng, P.C. and S. Henikoff, SIFT: Predicting amino acid changes that affect protein function. Nucleic Acids Res, 2003. 31(13): p. 3812-4.
6. Adzhubei, I., D.M. Jordan, and S.R. Sunyaev, Predicting functional effect of human missense mutations using PolyPhen-2. Curr Protoc Hum Genet, 2013. Chapter 7: p. Unit7.20.
7. Rentzsch, P., et al., CADD: predicting the deleteriousness of variants throughout the human genome. Nucleic Acids Res, 2019. 47(D1): p. D886-D894.
8. Richards, S., et al., Standards and guidelines for the interpretation of sequence variants: a joint consensus recommendation of the American College of Medical Genetics and Genomics and the Association for Molecular Pathology. Genet Med, 2015. 17(5): p. 405-24.
9. Douville, C., et al., Assessing the Pathogenicity of Insertion and Deletion Variants with the Variant Effect Scoring Tool (VEST-Indel). Hum Mutat, 2016. 37(1): p. 28-35.
10. Cannon, S., et al., Evaluation of in silico pathogenicity prediction tools for the classification of small in-frame indels. BMC Med Genomics, 2023. 16(1): p. 36.
11. Evans, P., et al., Genetic variant pathogenicity prediction trained using disease-specific clinical sequencing data sets. Genome Res, 2019. 29(7): p. 1144-1151.
12. Cheng, J., et al., Accurate proteome-wide missense variant effect prediction with AlphaMissense. Science, 2023. 381(6664): p. eadg7492.
13. Yu, H., et al., A graph neural network approach for accurate prediction of pathogenicity in multi-type variants. Brief Bioinform, 2025. 26(2).
14. Devlin, J., et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. in North American Chapter of the Association for Computational Linguistics. 2019.
15. Li, W., et al., From Text to Translation: Using Language Models to Prioritize Variants for Clinical Review. medRxiv, 2025.
16. Brandes, N., et al., Genome-wide prediction of disease variant effects with a deep protein language model. Nat Genet, 2023. 55(9): p. 1512-1522.
17. Lin, W., et al., Enhancing missense variant pathogenicity prediction with protein language models using VariPred. Sci Rep, 2024. 14(1): p. 8136.
18. Cline, M.S., et al., BRCA Challenge: BRCA Exchange as a global resource for variants in BRCA1 and BRCA2. PLoS Genet, 2018. 14(12): p. e1007752.
19. Landrum, M.J., et al., ClinVar: public archive of relationships among sequence variation and human phenotype. Nucleic Acids Res, 2014. 42(Database issue): p. D980-5.
20. Hunt, S.E., et al., Annotating and prioritizing genomic variants using the Ensembl Variant Effect Predictor-A tutorial. Hum Mutat, 2022. 43(8): p. 986-997.
21. Jaganathan, K., et al., Predicting Splicing from Primary Sequence with Deep Learning. Cell, 2019. 176(3): p. 535-548.e24.
22. Ioannidis, N.M., et al., REVEL: An Ensemble Method for Predicting the Pathogenicity of Rare Missense Variants. Am J Hum Genet, 2016. 99(4): p. 877-885.
23. Song, K., et al., MPNet: Masked and Permuted Pre-training for Language Understanding. arXiv preprint arXiv:2004.09297, 2020.
24. Balachandran, A., Medembed: Medical-focused embedding models. 2024.
25. Google, C., Generative AI on Vertex AI.
26. Team, E.S.M., ESM Cambrian: Revealing the mysteries of proteins with unsupervised learning. 2024, EvolutionaryScale.
27. Jolliffe, I.T. and J. Cadima, Principal component analysis: a review and recent developments. Philos Trans A Math Phys Eng Sci, 2016. 374(2065): p. 20150202.
28. van der Maaten, L. and G. Hinton, Visualizing Data using t-SNE. Journal of Machine Learning Research, 2008. 9: p. 2579-2605.
29. McInnes, L. and J. Healy, UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. ArXiv, 2018. abs/1802.03426.
30. Drackley, A., et al., Interpretation and classification of FBN1 variants associated with Marfan syndrome: consensus recommendations from the Clinical Genome Resource's FBN1 variant curation expert panel. Genome Med, 2024. 16(1): p. 154.
31. Orenbuch, R., et al., Proteome-wide model for human disease genetics. Nat Genet, 2025. 57(12): p. 3165-3174.
32. Huber, C.D., B.Y. Kim, and K.E. Lohmueller, Population genetic models of GERP scores suggest pervasive turnover of constrained sites across mammalian evolution. PLoS Genet, 2020. 16(5): p. e1008827.
33. Schwarz, J.M., et al., MutationTaster2: mutation prediction for the deep-sequencing age. Nat Methods, 2014. 11(4): p. 361-2.
34. Dong, C., et al., Comparison and integration of deleteriousness prediction methods for nonsynonymous SNVs in whole exome sequencing studies. Hum Mol Genet, 2015. 24(8): p. 2125-37.
35. Alirezaie, N., et al., ClinPred: Prediction Tool to Identify Disease-Relevant Nonsynonymous Single-Nucleotide Variants. Am J Hum Genet, 2018. 103(4): p. 474-483.
Additional Declarations
No competing interests reported.

Supplementary Files
SupplementaryData.xlsx
SupplementaryFigures.docx
16:13:56","extension":"png","order_by":20,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":174028,"visible":true,"origin":"","legend":"","description":"","filename":"Onlinefloatimage5.png","url":"https://assets-eu.researchsquare.com/files/rs-8605164/v1/cea173b29d3e127cb42b1273.png"},{"id":100814276,"identity":"29b79ac6-a5d6-4edb-8ab9-0f8b725ade2b","added_by":"auto","created_at":"2026-01-21 16:13:56","extension":"png","order_by":21,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":193895,"visible":true,"origin":"","legend":"","description":"","filename":"Onlinefloatimage6.png","url":"https://assets-eu.researchsquare.com/files/rs-8605164/v1/109fcb1cf569e8f92601a115.png"},{"id":100814282,"identity":"22cca42b-3803-4059-9d21-48fa45fcae11","added_by":"auto","created_at":"2026-01-21 16:13:56","extension":"png","order_by":22,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":110760,"visible":true,"origin":"","legend":"","description":"","filename":"Onlinefloatimage7.png","url":"https://assets-eu.researchsquare.com/files/rs-8605164/v1/4332d3a3e907a200dc0b15c0.png"},{"id":100814284,"identity":"27457e96-3b97-47d1-9100-d6b2863b61db","added_by":"auto","created_at":"2026-01-21 16:13:57","extension":"png","order_by":23,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":129961,"visible":true,"origin":"","legend":"","description":"","filename":"Onlinefloatimage8.png","url":"https://assets-eu.researchsquare.com/files/rs-8605164/v1/995ce796a846308d4d4491ea.png"},{"id":100814286,"identity":"6291ed7e-78c4-4b0c-aa29-4dc7f6dbe0ed","added_by":"auto","created_at":"2026-01-21 
16:13:58","extension":"png","order_by":24,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":78732,"visible":true,"origin":"","legend":"","description":"","filename":"Onlinefloatimage9.png","url":"https://assets-eu.researchsquare.com/files/rs-8605164/v1/126c17213e8138305785c145.png"},{"id":100814283,"identity":"14d08875-99a7-4a01-b3d0-ad1236676ef8","added_by":"auto","created_at":"2026-01-21 16:13:57","extension":"xml","order_by":25,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":149463,"visible":true,"origin":"","legend":"","description":"","filename":"24196cc7272b4c03b79eeda54c13a1f01structuring.xml","url":"https://assets-eu.researchsquare.com/files/rs-8605164/v1/5b5e65430c1a460afe690fa3.xml"},{"id":100858457,"identity":"f086df8d-958e-42d6-aafd-7fd5e2ec5143","added_by":"auto","created_at":"2026-01-22 07:24:19","extension":"html","order_by":26,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":163462,"visible":true,"origin":"","legend":"","description":"","filename":"earlyproof.html","url":"https://assets-eu.researchsquare.com/files/rs-8605164/v1/6670d16ae7839ec8488612dc.html"},{"id":100857875,"identity":"102eeb9a-9642-4989-88e7-906ef36bdeb3","added_by":"auto","created_at":"2026-01-22 07:23:18","extension":"jpg","order_by":1,"title":"Figure 1","display":"","copyAsset":false,"role":"figure","size":76537,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eEnd-to-End Architecture of VUS.Life.\u003c/strong\u003e The pipeline separates offline indexing from online inference. In Phase 1, genomic data is transformed from structured JSON (VEP) into natural language, vectorized by Large Language Models, and stored in a vector database. In Phase 2, the system performs a semantic search for an unknown variant against this index. 
The diagram highlights the critical \"Masking\" step (C), which ensures that the model learns from biological context rather than explicit labels, and the k-NN consensus mechanism (H), which provides the final confidence score.\u003c/p\u003e","description":"","filename":"1.jpg","url":"https://assets-eu.researchsquare.com/files/rs-8605164/v1/96ada30ad1fe8f3d22dfc130.jpg"},{"id":100814254,"identity":"4b264e7c-6991-4db1-821d-fc3a05056313","added_by":"auto","created_at":"2026-01-21 16:13:55","extension":"jpg","order_by":2,"title":"Figure 2","display":"","copyAsset":false,"role":"figure","size":156579,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eDimensionality reduction of semantic embeddings for 3,323 \u003c/strong\u003e\u003cem\u003e\u003cstrong\u003eBRCA1\u003c/strong\u003e\u003c/em\u003e\u003cstrong\u003e variants.\u003c/strong\u003e Each panel displays a 2D projection of the high-dimensional variant embeddings using three different models (rows: all-mpnet-base-v2, google-embedding, MedEmbed-large-v0.1) and three dimensionality reduction techniques (columns: PCA, t-SNE, UMAP). The distinct separation between benign (blue) and pathogenic (red) variants across all models demonstrates that the embeddings effectively capture pathogenicity-relevant features. 
Non-linear methods like t-SNE and UMAP reveal more sharply defined clusters than the linear PCA projection, confirming that the semantic space is structured by clinical significance.\u003c/p\u003e","description":"","filename":"2.jpg","url":"https://assets-eu.researchsquare.com/files/rs-8605164/v1/323538947a71fb6253e9a7a8.jpg"},{"id":100814262,"identity":"7d8d88c1-71ac-44fa-bb68-d22238b91500","added_by":"auto","created_at":"2026-01-21 16:13:55","extension":"jpg","order_by":3,"title":"Figure 3","display":"","copyAsset":false,"role":"figure","size":153739,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eVisualization of semantic embeddings for 4,081 \u003c/strong\u003e\u003cem\u003e\u003cstrong\u003eBRCA2\u003c/strong\u003e\u003c/em\u003e\u003cstrong\u003e variants\u003c/strong\u003e. Following the same structure as Figure 2, these plots show the embedding space for \u003cem\u003eBRCA2\u003c/em\u003e variants. The separation between benign (blue) and pathogenic (red) classes is even more pronounced than for \u003cem\u003eBRCA1\u003c/em\u003e, with t-SNE and UMAP revealing highly compact and well-separated clusters, suggesting a very robust semantic representation.\u003c/p\u003e","description":"","filename":"3.jpg","url":"https://assets-eu.researchsquare.com/files/rs-8605164/v1/2ffc82c3db312cb46c538c61.jpg"},{"id":100814255,"identity":"49e5f6f0-8cf9-47ce-b37e-bf0ae1821cf0","added_by":"auto","created_at":"2026-01-21 16:13:55","extension":"jpg","order_by":4,"title":"Figure 4","display":"","copyAsset":false,"role":"figure","size":139338,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eValidation of the semantic embedding framework’s generalizability using 1,532 \u003c/strong\u003e\u003cem\u003e\u003cstrong\u003eFBN1 \u003c/strong\u003e\u003c/em\u003e\u003cstrong\u003evariants.\u003c/strong\u003e This figure demonstrates that the methodology is not limited to \u003cem\u003eBRCA1/2\u003c/em\u003e genes. 
Despite the different genetic context and disease mechanism (Marfan syndrome), the embeddings for \u003cem\u003eFBN1\u003c/em\u003e variants maintain a strong separation between benign (blue) and pathogenic (red) classes. This result validates the robustness and broad utility of using semantic embeddings for variant interpretation across diverse, clinically important genes.\u003c/p\u003e","description":"","filename":"4.jpg","url":"https://assets-eu.researchsquare.com/files/rs-8605164/v1/76c2591dc06a867e3a2e5473.jpg"},{"id":100814271,"identity":"051d4065-2647-4f47-8a6b-8d85e5bbb708","added_by":"auto","created_at":"2026-01-21 16:13:56","extension":"jpg","order_by":5,"title":"Figure 5","display":"","copyAsset":false,"role":"figure","size":206879,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eVisualization of pathogenicity of 3,000 not-yet-reviewed variants (highlighted as green diamonds) on \u003c/strong\u003e\u003cem\u003e\u003cstrong\u003eBRCA1\u003c/strong\u003e\u003c/em\u003e. The pathogenicity is simply mapped into binary labels: benign and pathogenic. Each panel displays dimensionality reduction (PCA, UMAP, and t-SNE) of variant embeddings. 
The consistent spatial distribution of unknown variants relative to known benign and pathogenic variants across models and genes suggests that embeddings effectively capture pathogenicity-related semantics.\u003c/p\u003e","description":"","filename":"5.jpg","url":"https://assets-eu.researchsquare.com/files/rs-8605164/v1/a8b5f70ef5ad54a65b642472.jpg"},{"id":100858332,"identity":"35b31ab4-7805-4523-baff-06b65d8e6d9a","added_by":"auto","created_at":"2026-01-22 07:24:12","extension":"jpg","order_by":6,"title":"Figure 6","display":"","copyAsset":false,"role":"figure","size":232156,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eVisualization of pathogenicity of 3,000 not-yet-reviewed variants on \u003c/strong\u003e\u003cem\u003e\u003cstrong\u003eBRCA2\u003c/strong\u003e\u003c/em\u003e\u003cstrong\u003e.\u003c/strong\u003e The same pattern is observed as in the \u003cem\u003eBRCA1\u003c/em\u003e prediction.\u003c/p\u003e","description":"","filename":"6.jpg","url":"https://assets-eu.researchsquare.com/files/rs-8605164/v1/9fffb997e1f3e9f6be40d1f6.jpg"},{"id":100858013,"identity":"fca64cb6-6b2f-4856-9491-fcdfabba637a","added_by":"auto","created_at":"2026-01-22 07:23:41","extension":"jpg","order_by":7,"title":"Figure 7","display":"","copyAsset":false,"role":"figure","size":152453,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eVisualization of predicted pathogenicity of 30 variants (highlighted as green diamonds) on \u003c/strong\u003e\u003cem\u003e\u003cstrong\u003eFBN1\u003c/strong\u003e\u003c/em\u003e\u003cstrong\u003e, piloted by the ClinGen \u003c/strong\u003e\u003cem\u003e\u003cstrong\u003eFBN1\u003c/strong\u003e\u003c/em\u003e\u003cstrong\u003e variant curation expert panel (VCEP)\u003c/strong\u003e. The pathogenicity is simply mapped into binary labels: benign and pathogenic. Each panel displays dimensionality reduction (PCA, UMAP, and t-SNE) of variant embeddings. 
The consistent spatial distribution of unknown variants relative to known benign and pathogenic variants across models and genes suggests that embeddings effectively capture pathogenicity-related semantics.\u003c/p\u003e","description":"","filename":"7.jpg","url":"https://assets-eu.researchsquare.com/files/rs-8605164/v1/5d8032df5f7b3938b5821434.jpg"},{"id":100814267,"identity":"a665010f-e223-442d-a3e1-7740ddee12fe","added_by":"auto","created_at":"2026-01-21 16:13:56","extension":"jpg","order_by":8,"title":"Figure 8","display":"","copyAsset":false,"role":"figure","size":160672,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eVisualization of 177 external \u003c/strong\u003e\u003cem\u003e\u003cstrong\u003eFBN1\u003c/strong\u003e\u003c/em\u003e\u003cstrong\u003e variants from the UDM database.\u003c/strong\u003e The green diamonds represent the query variants (associated with cardiovascular phenotypes) projected into the semantic space of 1,532 known ClinVar variants. The query variants predominantly cluster within the established pathogenic regions (red/orange) across all three models, with the tightest clustering observed in the MPNet PCA and UMAP projections.\u003c/p\u003e","description":"","filename":"8.jpg","url":"https://assets-eu.researchsquare.com/files/rs-8605164/v1/aafe9cb9324b36ee54fe8805.jpg"},{"id":100814265,"identity":"422961a4-c4ba-4d61-9218-28830464c7db","added_by":"auto","created_at":"2026-01-21 16:13:56","extension":"jpg","order_by":9,"title":"Figure 9","display":"","copyAsset":false,"role":"figure","size":105447,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eUnsupervised Disentanglement of Pathogenic Mechanisms in \u003c/strong\u003e\u003cem\u003e\u003cstrong\u003eFBN1\u003c/strong\u003e\u003c/em\u003e\u003cstrong\u003e. \u003c/strong\u003eDimensionality reduction of residue-level delta embeddings of missense FBN1 variants (n=216). 
(Top) Variants segregate by molecular mechanism without supervision: mutations disrupting disulfide bonds (red) form a tight ‘structural collapse’ cluster, distinct from the diffuse cluster of calcium-binding site mutations (blue). (Bottom) Evolutionary fitness landscape: The structural clusters align with Zero-Shot LLR scores, where disulfide disruptions exhibit the most severe evolutionary penalties (mean LLR -3.77), confirming that the model captures the biological severity of specific structural insults.\u003c/p\u003e","description":"","filename":"9.jpg","url":"https://assets-eu.researchsquare.com/files/rs-8605164/v1/ffa11c675fd3a4ef806a4fc6.jpg"},{"id":100857941,"identity":"2837fb9d-e272-4b31-8695-daedc68b0eda","added_by":"auto","created_at":"2026-01-22 07:23:27","extension":"jpg","order_by":10,"title":"Figure 10","display":"","copyAsset":false,"role":"figure","size":85711,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eDistance Ratio distributions confirm strong separation of benign and pathogenic variants.\u003c/strong\u003e The figure shows histograms of the Distance Ratio for \u003cem\u003eBRCA1\u003c/em\u003e (top row of each panel) and \u003cem\u003eBRCA2\u003c/em\u003e (bottom row) variants, evaluated on three different embedding models. The ratio compares a variant’s distance to same-class neighbors versus different-class neighbors. The dashed vertical line marks the theoretical boundary of 1.0; ratios below this line indicate effective clustering. 
The overwhelming concentration of both pathogenic (red) and benign (blue) distributions at ratios significantly less than 1.0 provides strong quantitative evidence that the semantic embedding space effectively separates variants based on their clinical significance.\u003c/p\u003e","description":"","filename":"10.jpg","url":"https://assets-eu.researchsquare.com/files/rs-8605164/v1/626a8a4e778244c5e90a28ae.jpg"},{"id":103443576,"identity":"b3983931-bf7c-43ed-8813-96dcc463ad15","added_by":"auto","created_at":"2026-02-25 17:55:17","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":3760706,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-8605164/v1/77252137-26de-4a8b-9544-a9817aab979e.pdf"},{"id":100814253,"identity":"ce06ea6a-62ce-4ee0-84d2-6352de3d76f7","added_by":"auto","created_at":"2026-01-21 16:13:55","extension":"xlsx","order_by":0,"title":"","display":"","copyAsset":false,"role":"supplement","size":1928232,"visible":true,"origin":"","legend":"","description":"","filename":"SupplementaryData.xlsx","url":"https://assets-eu.researchsquare.com/files/rs-8605164/v1/a800f6e3ad99a3774faa695a.xlsx"},{"id":100858291,"identity":"5077ea5b-74b8-411d-b4a6-2f14f96dbad7","added_by":"auto","created_at":"2026-01-22 07:24:09","extension":"docx","order_by":1,"title":"","display":"","copyAsset":false,"role":"supplement","size":1752896,"visible":true,"origin":"","legend":"","description":"","filename":"SupplementaryFigures.docx","url":"https://assets-eu.researchsquare.com/files/rs-8605164/v1/b600c6aa8ad178969180f84a.docx"}],"financialInterests":"No competing interests reported.","formattedTitle":"VUS. 
Life: Leveraging Vector Embeddings for Rapid and Accurate Pathogenicity Prediction of Genetic Variants","fulltext":[{"header":"Background","content":"\u003cp\u003eIn just two decades, genomic medicine has transformed from a futuristic vision into clinical reality. What once required years and millions of dollars—sequencing a human genome—can now be accomplished in hours for under a thousand dollars [\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e]. This revolution has enabled genetic testing to guide cancer treatment decisions, predict disease risk, and enable precision therapies. Yet paradoxically, as our capacity to identify genetic variants has increased, our ability to interpret their clinical significance has become the rate-limiting step in realizing the potential of genomic medicine.\u003c/p\u003e \u003cp\u003eThe magnitude of this challenge is staggering. Current databases contain hundreds of millions of human genetic variants, yet only a fraction carry established clinical classifications [\u003cspan additionalcitationids=\"CR3\" citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e–\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e]. For genes like \u003cem\u003eBRCA1\u003c/em\u003e and \u003cem\u003eBRCA2\u003c/em\u003e, where accurate interpretation can mean the difference between prophylactic surgery and routine screening, this uncertainty carries profound implications. 
Numerous computational tools like SIFT [\u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e], PolyPhen-2 [\u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e6\u003c/span\u003e], and CADD [\u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e], often referred to as \u003cem\u003ein silico\u003c/em\u003e predictors, have been developed to help classify the pathogenicity of genetic variants by leveraging evolutionary conservation, protein structural predictions, and statistical modeling. Table\u0026nbsp;\u003cspan refid=\"Tab1\" class=\"InternalRef\"\u003e1\u003c/span\u003e lists some of the most popular and commonly used general-purpose predictors. These tools are widely used as a form of evidence in variant interpretation and are supported by clinical guidelines from the American College of Medical Genetics and Genomics (ACMG) and the Association for Molecular Pathology (AMP) [\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e]. Beyond these general-purpose predictors, significant research efforts have been invested in addressing challenges specific to a variant type or disease. For example, VEST-indel, which focuses on insertions and deletions (indels), achieved sufficient accuracy to aid in clinical classification [\u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e, \u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e10\u003c/span\u003e]. 
PathoPredictor, on the other hand, is a disease-specific ensemble classifier trained on data from distinct patient cohorts, demonstrating the value of tailored models over general-purpose predictors [\u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e].\u003c/p\u003e \u003cp\u003e \u003c/p\u003e\u003cdiv class=\"gridtable\"\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e\u003ctable float=\"Yes\" id=\"Tab1\" border=\"1\"\u003e\u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 1\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003e\u003cb\u003eThe summary of conventional methods of pathogenicity prediction.\u003c/b\u003e\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e\u003ccolgroup cols=\"3\"\u003e\u003c/colgroup\u003e\u003cthead\u003e\u003ctr\u003e\u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eTool\u003c/p\u003e \u003c/th\u003e\u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eCategory\u003c/p\u003e \u003c/th\u003e\u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eBrief Description\u003c/p\u003e \u003c/th\u003e\u003c/tr\u003e\u003c/thead\u003e\u003ctbody\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eSIFT [\u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e]\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eFoundational\u003c/p\u003e \u003cp\u003e(Sequence Conservation)\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003ePredicts if an amino acid substitution affects protein function based on sequence homology. 
It is an early, probabilistic-based method.\u003c/p\u003e \u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003ePolyPhen-2 [\u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e6\u003c/span\u003e]\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eFoundational (Sequence \u0026amp; Structure)\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003ePredicts the effect of an amino acid substitution using sequence homology, Pfam annotations, and 3D structures. It is also a probabilistic-based method.\u003c/p\u003e \u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eCADD [\u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e]\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eMeta-predictor\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eIntegrates multiple annotations, including conservation and functional information, into a single score for deleteriousness.\u003c/p\u003e \u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eGERP [\u003cspan citationid=\"CR32\" class=\"CitationRef\"\u003e32\u003c/span\u003e]\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eFoundational\u003c/p\u003e \u003cp\u003e(Sequence Conservation)\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eIdentifies evolutionarily constrained regions to score how tolerant a position is to substitution. 
Its scores are often used as features in other tools.\u003c/p\u003e \u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eMutationTaster [\u003cspan citationid=\"CR33\" class=\"CitationRef\"\u003e33\u003c/span\u003e]\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eFoundational\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eA general-purpose tool for predicting the impact of various variant types.\u003c/p\u003e \u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eREVEL [\u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e22\u003c/span\u003e]\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eMeta-predictor / Ensemble\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eAn ensemble method that combines scores from multiple tools to predict the pathogenicity of rare missense variants.\u003c/p\u003e \u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eMetaLR \u0026amp; MetaSVM [\u003cspan citationid=\"CR34\" class=\"CitationRef\"\u003e34\u003c/span\u003e]\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eMeta-predictor / Ensemble\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eEnsemble methods that integrate multiple deleteriousness scores and allele frequency data using logistic regression or a support vector machine.\u003c/p\u003e \u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eClinPred [\u003cspan citationid=\"CR35\" class=\"CitationRef\"\u003e35\u003c/span\u003e]\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eMeta-predictor / Ensemble\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"left\" 
colname=\"c3\"\u003e \u003cp\u003eA prediction tool that incorporates conservation scores, other prediction scores, and allele frequencies to identify disease-relevant variants.\u003c/p\u003e \u003c/td\u003e\u003c/tr\u003e\u003c/tbody\u003e\u003c/table\u003e\u003c/div\u003e \u003cp\u003e\u003c/p\u003e \u003cp\u003ePathoPredictor is also one of many recent predictors that have leveraged machine learning (ML) and deep learning (DL) to capture complex genomic patterns and improve predictive power. Another prominent example of this group is AlphaMissense, which adapts the state-of-the-art AlphaFold protein structure model to predict the functional impact of missense variants [\u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e12\u003c/span\u003e]. GNN-MAP, on the other hand, uses graph neural networks (GNNs) to model the relationships between different variants [\u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e]. This approach enables GNN-MAP to predict pathogenicity for multiple variant types (missense, stop-gain, frameshift) and has shown strong performance on rare variants.\u003c/p\u003e \u003cp\u003eHowever, these aforementioned methods face inherent limitations: they typically focus on individual types of variants or features, require extensive domain expertise to engineer appropriate features, and struggle to integrate the increasingly complex landscape of genomic annotations into unified predictions.\u003c/p\u003e \u003cp\u003eIn the last few years, Large Language Models (LLMs) have demonstrated unprecedented ability to capture semantic relationships within textual data, leading to breakthroughs in machine translation, question answering, and document understanding [\u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e]. They have also been successfully applied in the field of genomic medicine. One application is to apply language models to the clinical narratives associated with variants. 
For instance, ClinVar-BERT is a model fine-tuned on millions of free-text clinical summaries from the ClinVar database [\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e]. By training a biomedical-specific BERT model to discern patterns in the evidence described in these reports, the authors successfully reclassified VUS in a way that correlated with experimental functional data. This work demonstrated that unstructured clinical text contains rich, learnable signals of pathogenicity that can be captured by language models. Another powerful approach treats biological sequences themselves as a language. Protein language models (pLMs) like ESM-1b are pre-trained on hundreds of millions of protein sequences and learn the fundamental \"grammar\" of protein structure and function [\u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e]. Building on this, Lin et al. developed VariPred, which uses embeddings from a pLM to predict the impact of missense mutations [\u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e]. By comparing the embeddings of the wild-type and mutant protein sequences, VariPred achieves state-of-the-art performance using only sequence information as input, effectively bypassing the need for explicit structural or evolutionary features. Similarly, Brandes et al. showed that the ESM-1b model could be used directly, without fine-tuning, to predict the effects of all possible missense variants, in-frame indels, and even isoform-specific effects, showcasing the remarkable generalizability of these models [\u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e].\u003c/p\u003e\n\u003ch3\u003eA New Paradigm: Semantic Embeddings for Variant Interpretation\u003c/h3\u003e\n\u003cp\u003eThis study introduces a fundamentally different approach using LLMs, one that treats comprehensive genomic annotations as a structured narrative rather than discrete features. 
We name the framework VUS.Life. Our framework transforms variant annotations into natural language descriptions, then leverages pre-trained language models to generate semantic embeddings that capture nuanced relationships between different types of genomic evidence.\u003c/p\u003e \u003cp\u003eWith VUS.Life, we conceptualize each genetic variant as existing within a vast semantic space, where variants with similar functional consequences and clinical implications naturally cluster together. In this space, the pathogenicity of a newly discovered variant can be inferred from its semantic similarity to variants with established clinical significance, mirroring how expert clinical geneticists integrate multiple lines of evidence and recognize patterns from experience. Unlike most, if not all, of the aforementioned tools, which predict the deleteriousness of a variant (a property that is not equivalent to pathogenicity), VUS.Life is a framework for predicting pathogenicity directly.\u003c/p\u003e \u003cp\u003eHowever, biological and clinical narratives alone cannot fully capture the thermodynamic consequences of genetic variants. To bridge the gap between clinical annotation and molecular reality, we integrated evolutionary scale modeling (ESM). 
By calculating the vector difference between wild-type and mutant embeddings (delta embeddings), we hypothesize that the model can isolate the specific \"structural insult\" of a variant, distinguishing between mechanisms like protein unfolding and functional domain perturbation.\u003c/p\u003e"},{"header":"Methods","content":"\u003ch2\u003eVUS.Life Framework Development and Validation\u003c/h2\u003e\u003cp\u003e \u003cb\u003eVariant Datasets\u003c/b\u003e To develop and validate the VUS.Life framework, we analyzed more than 10,000 variants with established pathogenicity classifications in the hereditary breast and ovarian cancer genes \u003cem\u003eBRCA1\u003c/em\u003e (Breast Cancer gene 1) and \u003cem\u003eBRCA2\u003c/em\u003e (Breast Cancer gene 2), together with \u003cem\u003eFBN1\u003c/em\u003e (Fibrillin 1; Marfan syndrome), \u003cem\u003eATM\u003c/em\u003e (Ataxia Telangiectasia Mutated), and \u003cem\u003ePALB2\u003c/em\u003e (Partner And Localizer Of BRCA2), sourced from BRCA Exchange [\u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e] and ClinVar [\u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e19\u003c/span\u003e] (Table\u0026nbsp;\u003cspan refid=\"Tab2\" class=\"InternalRef\"\u003e2\u003c/span\u003e) (Supplementary_Data.xls). 
\u003cem\u003eBRCA1\u003c/em\u003e and \u003cem\u003eBRCA2\u003c/em\u003e variants were sourced from BRCA Exchange, while variants for \u003cem\u003eFBN1\u003c/em\u003e, \u003cem\u003eATM\u003c/em\u003e, and \u003cem\u003ePALB2\u003c/em\u003e were sourced from ClinVar. For ClinVar entries, we filtered for \"pathogenic/likely pathogenic\" and \"benign/likely benign\" classifications with at least a two-star review status (criteria provided, multiple submitters, no conflicts).\u003c/p\u003e\u003cp\u003e \u003c/p\u003e\u003cdiv class=\"gridtable\"\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e\u003ctable float=\"Yes\" id=\"Tab2\" border=\"1\"\u003e\u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 2\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eSummary of variant datasets used for model training.\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e\u003ccolgroup cols=\"5\"\u003e\u003c/colgroup\u003e\u003cthead\u003e\u003ctr\u003e\u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eGene\u003c/p\u003e \u003c/th\u003e\u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eTotal Variants\u003c/p\u003e \u003c/th\u003e\u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003ePathogenic\u003c/p\u003e \u003c/th\u003e\u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eBenign\u003c/p\u003e \u003c/th\u003e\u003cth align=\"left\" colname=\"c5\"\u003e \u003cp\u003ePathogenic %\u003c/p\u003e 
\u003c/th\u003e\u003c/tr\u003e\u003c/thead\u003e\u003ctbody\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cem\u003eBRCA1\u003c/em\u003e\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e3,323\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e2,242\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e1,081\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e67.47%\u003c/p\u003e \u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cem\u003eBRCA2\u003c/em\u003e\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e4,081\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e2,645\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e1,436\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e64.81%\u003c/p\u003e \u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cem\u003eFBN1\u003c/em\u003e\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e1,532\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e768\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e764\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e50.13%\u003c/p\u003e \u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cem\u003eATM\u003c/em\u003e\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e4,346\u003c/p\u003e 
\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e1,671\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e2,675\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e38.45%\u003c/p\u003e \u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cem\u003ePALB2\u003c/em\u003e\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e1,598\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e752\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e846\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e47.06%\u003c/p\u003e \u003c/td\u003e\u003c/tr\u003e\u003c/tbody\u003e\u003c/table\u003e\u003c/div\u003e\u003cp\u003e\u003c/p\u003e\u003cp\u003e \u003cb\u003eAnnotation Fetch and Processing\u003c/b\u003e Variants were annotated using the Variant Effect Predictor (VEP) with an extensive suite of annotation sources and prediction tools [\u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e20\u003c/span\u003e]. The VEP configuration was designed to capture a wide range of functional, evolutionary, and predictive information relevant to variant pathogenicity. 
Key annotation categories included:\u003c/p\u003e\u003cp\u003e \u003c/p\u003e\u003cul\u003e \u003cli\u003e \u003cp\u003e \u003cb\u003eCore Annotations\u003c/b\u003e: Canonical transcript identification, predicted consequence (e.g., missense, frameshift), protein-level effects, and splicing predictions (e.g., SpliceAI [\u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e]).\u003c/p\u003e \u003c/li\u003e \u003cli\u003e \u003cp\u003e \u003cb\u003ePathogenicity Prediction Tools\u003c/b\u003e: Scores from established predictors such as CADD [\u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e], REVEL [\u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e22\u003c/span\u003e], AlphaMissense [\u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e12\u003c/span\u003e], SIFT [\u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e], and PolyPhen-2 [\u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e6\u003c/span\u003e].\u003c/p\u003e \u003c/li\u003e \u003cli\u003e \u003cp\u003e \u003cb\u003eRegulatory and Functional Annotations\u003c/b\u003e: Gene Ontology (GO) terms, pathway information, protein domain annotations, tissue-specific expression data, and loss-of-function (LoF) predictions.\u003c/p\u003e \u003c/li\u003e \u003c/ul\u003e\u003cp\u003eTo prepare the data for language model processing, we systematically converted the structured VEP JSON annotations into a consistent, semi-structured textual format. This was achieved using a custom template that serializes the nested JSON data into a human-readable, key-value report. The template organizes information into logical sections, such as variant identifiers, transcript-level consequences, population frequencies, and scores from various prediction tools (e.g., CADD, SpliceAI, AlphaMissense). This ensures uniform representation of all annotation fields. 
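\u003c/p\u003e\u003cp\u003eThe template-based serialization described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' template: the function names describe_fields and serialize_variant are hypothetical, and the input keys (id, variant_class, most_severe_consequence, transcript_consequences) are assumed to follow VEP-style JSON output.\u003c/p\u003e

```python
def describe_fields(record):
    """Render a flat dict as 'The <key> is <value>' clauses joined by '; '."""
    clauses = []
    for key, value in record.items():
        label = key.replace("_", " ")  # e.g. 'transcript_id' -> 'transcript id'
        clauses.append(f"The {label} is {value}")
    return "; ".join(clauses) + "."


def serialize_variant(annotation):
    """Turn one annotation dict into a key-value text report.

    The key names used here are assumptions for illustration only.
    """
    parts = [
        f"Variant ID: {annotation['id']}.",
        f"Variant class: {annotation['variant_class']}.",
        f"Most severe consequence: {annotation['most_severe_consequence']}.",
    ]
    consequences = annotation.get("transcript_consequences", [])
    for i, consequence in enumerate(consequences, start=1):
        parts.append(f"Transcript consequence #{i}: " + describe_fields(consequence))
    return " ".join(parts)
```

\u003cp\u003e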
By converting numerical scores, flags, and identifiers into a unified textual string, we create a comprehensive and uniform input format that enables the embedding model to learn the semantic relationships between disparate types of genomic evidence. An abbreviated example of the generated text for an intronic variant is shown below:\u003c/p\u003e\u003cp\u003eVariant ID: NC_000017.11:g.43046253G\u0026gt;A. Variant class: SNV. Most severe consequence: intron_variant. Transcript consequence #1: The transcript id is NM_007294.4; The gene symbol is \u003cem\u003eBRCA1\u003c/em\u003e; The gene id is 672; The gene symbol source is EntrezGene; The biotype is protein_coding. SpliceAI scores: The SpliceAI SYMBOL is \u003cem\u003eBRCA1\u003c/em\u003e; The SpliceAI DS AG is 0.0; The SpliceAI DS AL is 0.0. Population frequencies: Allele: A. The frequency in gnomadg_asj is 0.02244; The frequency in amr is 0.0043; The frequency in gnomadg_ami is 0.008216; The frequency in gnomadg_afr is 0.002173.\u003c/p\u003e\u003ch3\u003eSemantic Embedding Models and Generation\u003c/h3\u003e\u003cp\u003eThree distinct embedding models were selected to capture different aspects of semantic representation:\u003c/p\u003e\u003cul\u003e \u003cli\u003e \u003cp\u003e \u003cb\u003eall-mpnet-base-v2\u003c/b\u003e (MPNet) [\u003cspan citationid=\"CR23\" class=\"CitationRef\"\u003e23\u003c/span\u003e]: A Transformer-based model pre-trained using masked and permuted language modeling. With 768 dimensions and a 512-token limit, it performs well on sentence-level semantic similarity tasks.\u003c/p\u003e \u003c/li\u003e \u003cli\u003e \u003cp\u003e \u003cb\u003eMedEmbed-large-v0.1\u003c/b\u003e (MedEmbed) [\u003cspan citationid=\"CR24\" class=\"CitationRef\"\u003e24\u003c/span\u003e]: A domain-specific Transformer trained on biomedical literature, offering 1,024 dimensions and a 512-token limit. 
Its specialization provides a nuanced understanding of medical and genomic terminology.\u003c/p\u003e \u003c/li\u003e \u003cli\u003e \u003cp\u003e \u003cb\u003etext-embedding-004\u003c/b\u003e (Google Embeddings) [\u003cspan citationid=\"CR25\" class=\"CitationRef\"\u003e25\u003c/span\u003e]: Google’s state-of-the-art general-purpose embedding model, generating 768-dimensional embeddings with a 2,048-token limit, performing exceptionally well on diverse semantic tasks.\u003c/p\u003e \u003c/li\u003e \u003c/ul\u003e\u003cp\u003eThe embedding generation process transforms each variant’s description into vector representations using these models. These embeddings are stored in ChromaDB, a vector database optimized for similarity search and retrieval. Accompanying metadata includes variant identifiers, genomic coordinates, pathogenicity classifications, confidence scores, gene annotations, functional consequences, and timestamps of processing. Vector indices are tuned for cosine similarity search, with parameters adjusted to each model’s dimensionality.\u003c/p\u003e\u003ch3\u003eIntegrating Structural and Evolutionary Fitness Landscapes\u003c/h3\u003e\u003cp\u003eTo augment semantic annotation embeddings, we performed a deep-learning-based structural analysis of \u003cem\u003eFBN1\u003c/em\u003e variants using the ESMC 600M (600\u0026nbsp;million parameters) protein language model [\u003cspan citationid=\"CR26\" class=\"CitationRef\"\u003e26\u003c/span\u003e]. 
While full-sequence embeddings provide a global overview, our analysis demonstrated that residue-level delta embeddings—which isolate the specific vector difference at the mutation site—offered the most robust discrimination between functional categories of mutations.\u003c/p\u003e\u003ch3\u003eStructural and Evolutionary Modeling\u003c/h3\u003e\u003cp\u003eTo capture the thermodynamic and evolutionary constraints of missense variants, we integrated the \u003cb\u003eESMC-600M\u003c/b\u003e protein language model [\u003cspan citationid=\"CR26\" class=\"CitationRef\"\u003e26\u003c/span\u003e]. We focused this analysis on \u003cem\u003eFBN1\u003c/em\u003e, implementing a specific pipeline to handle its large protein size (2,871 aa) and distinct pathogenic mechanisms.\u003c/p\u003e\u003ch2\u003eResidue-Level Delta Embeddings\u003c/h2\u003e\u003cp\u003eWe hypothesized that the pathogenicity signal is best captured by the vector difference between the mutant and wild-type states, rather than the absolute representation of the mutant alone. We defined the \u003cb\u003eResidue-Level Delta Embedding\u003c/b\u003e (∆\u003cem\u003eE\u003c/em\u003e) at the mutation site \u003cem\u003etarget\u003c/em\u003e_\u003cem\u003eidx\u003c/em\u003e as:\u003c/p\u003e\u003cp\u003e∆\u003cem\u003eE\u003c/em\u003e = \u003cem\u003eE\u003c/em\u003e\u003csub\u003emutant\u003c/sub\u003e[\u003cem\u003etarget\u003c/em\u003e_\u003cem\u003eidx\u003c/em\u003e] \u003cem\u003e− E\u003c/em\u003e\u003csub\u003ewildtype\u003c/sub\u003e[\u003cem\u003etarget\u003c/em\u003e_\u003cem\u003eidx\u003c/em\u003e] (1)\u003c/p\u003e\u003cp\u003eThis subtraction isolates the \"structural insult\" induced by the variant, removing the background noise of the protein’s sequence identity. 
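\u003c/p\u003e\u003cp\u003eEquation (1) can be illustrated with a short sketch. It assumes that per-residue embeddings for the wild-type and mutant sequences are already available as arrays of shape (sequence length, embedding dimension); the function name residue_delta_embedding is hypothetical, and the call that would produce the embeddings from the protein language model is deliberately left out.\u003c/p\u003e

```python
import numpy as np


def residue_delta_embedding(wt_emb, mut_emb, target_idx):
    """Return E_mutant[target_idx] - E_wildtype[target_idx], as in Eq. (1).

    wt_emb, mut_emb: arrays of shape (L, D) holding per-residue embeddings.
    A missense substitution keeps the sequence length, so shapes must match.
    """
    wt_emb = np.asarray(wt_emb, dtype=float)
    mut_emb = np.asarray(mut_emb, dtype=float)
    if wt_emb.shape != mut_emb.shape:
        raise ValueError("wild-type and mutant embeddings must have the same shape")
    return mut_emb[target_idx] - wt_emb[target_idx]
```

\u003cp\u003e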
Because the Fibrillin-1 sequence exceeds standard context windows, we used a sliding-window tiling strategy (overlapping windows of \u003cem\u003e≈\u003c/em\u003e 1,000 aa) to generate these embeddings while preserving local structural context.\u003c/p\u003e\u003ch3\u003eMechanistic Categorization via Structural Interaction Type\u003c/h3\u003e\u003cp\u003eTo investigate whether the embedding space captures specific biophysical mechanisms, variants were stratified by the structural role of the wild-type residue within the protein fold, according to their annotations in the UMD-FBN1 mutation database (\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttp://www.umd.be/FBN1/\u003c/span\u003e\u003cspan address=\"http://www.umd.be/FBN1/\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e):\u003c/p\u003e\u003cul\u003e \u003cli\u003e \u003cp\u003e \u003cb\u003eDisulfide Bond Disrupting\u003c/b\u003e: Mutations affecting cysteine residues that form covalent disulfide bridges, which are critical for stabilizing the tertiary structure of FBN1 domains.\u003c/p\u003e \u003c/li\u003e \u003cli\u003e \u003cp\u003e \u003cb\u003eCalcium Binding\u003c/b\u003e: Mutations located within high-affinity Ca\u003csup\u003e2+\u003c/sup\u003e binding motifs (e.g., within cbEGF domains), which facilitate domain rigidity and proper inter-domain alignment.\u003c/p\u003e \u003c/li\u003e \u003c/ul\u003e\u003ch3\u003eEvolutionary Fitness Scoring (Zero-Shot LLR)\u003c/h3\u003e\u003cp\u003eTo quantify evolutionary constraint without supervision, we calculated the Log-Likelihood Ratio (LLR) using the model’s \"sequence_head\". 
The LLR compares the probability of the mutant amino acid (\u003cem\u003ex\u003c/em\u003e\u003csub\u003e\u003cem\u003emut\u003c/em\u003e\u003c/sub\u003e) against the wild-type (\u003cem\u003ex\u003c/em\u003e\u003csub\u003e\u003cem\u003ewt\u003c/em\u003e\u003c/sub\u003e) given the surrounding context:\u003c/p\u003e\u003cp\u003e \u003cem\u003eLLR\u003c/em\u003e = log \u003cem\u003eP\u003c/em\u003e (\u003cem\u003ex\u003c/em\u003e\u003csub\u003e\u003cem\u003emut\u003c/em\u003e\u003c/sub\u003e \u003cem\u003e|\u003c/em\u003e context) \u003cem\u003e−\u003c/em\u003e log \u003cem\u003eP\u003c/em\u003e (\u003cem\u003ex\u003c/em\u003e\u003csub\u003e\u003cem\u003ewt\u003c/em\u003e\u003c/sub\u003e \u003cem\u003e|\u003c/em\u003e context) (2)\u003c/p\u003e\u003cp\u003eNegative LLR values indicate evolutionary intolerance (potential pathogenicity), while values near zero imply neutrality.\u003c/p\u003e\u003ch2\u003eEvaluation Metrics and Implementation\u003c/h2\u003e\u003ch2\u003eDimensionality Reduction and Visualization\u003c/h2\u003e\u003cp\u003eTo analyze the high-dimensional embeddings, we applied three complementary dimensionality reduction techniques: Principal Component Analysis (PCA) [\u003cspan citationid=\"CR27\" class=\"CitationRef\"\u003e27\u003c/span\u003e], t-Distributed Stochastic Neighbor Embedding (t-SNE) [\u003cspan citationid=\"CR28\" class=\"CitationRef\"\u003e28\u003c/span\u003e], and Uniform Manifold Approximation and Projection (UMAP) [\u003cspan citationid=\"CR29\" class=\"CitationRef\"\u003e29\u003c/span\u003e]. PCA projects data into a visualizable space by preserving major variance components. t-SNE emphasizes local neighborhood relationships, while UMAP balances local and global structure preservation. Visualizations revealed distinct clusters of variants with similar pathogenicity, suggesting that the embedding models effectively capture pathogenicity-related features. 
The discriminative power was quantified by correlating embedding distances with pathogenicity differences, confirming tighter clustering of variants with similar clinical impacts. This structure supports the k-NN classification approach for predicting pathogenicity in variants of uncertain significance.\u003c/p\u003e\u003ch2\u003ek-NN Pathogenicity Prediction\u003c/h2\u003e\u003cp\u003eA k-NN classification approach was developed to predict variant pathogenicity using semantic embeddings. Performance was evaluated across multiple neighborhood sizes (k = 1, 5, 10, 15, 20) to identify gene-specific optimal k values. Cosine similarity was employed as the distance metric due to its effectiveness in high-dimensional spaces and its robustness to vector magnitude variations. For each query variant, the algorithm identifies the k most similar variants with known pathogenicity, excluding the query variant from the training set to prevent data leakage. Predictions are based on majority voting, with confidence scores reflecting the proportion of votes supporting the predicted class. The embedding space’s ability to separate variants by pathogenicity was assessed by computing a clustering accuracy metric for known variants. For each variant with established pathogenicity, its k nearest neighbors were identified, and the percentage of neighbors sharing the same classification was calculated. 
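\u003c/p\u003e\u003cp\u003eThe leave-one-out procedure described above can be sketched as follows. This is an illustrative sketch rather than the authors' implementation: the function name leave_one_out_knn is hypothetical, labels are assumed to be coded 1 (pathogenic) and 0 (benign), ties at exactly 50% are assigned to the pathogenic class as an arbitrary choice, and all embedding vectors are assumed to be non-zero.\u003c/p\u003e

```python
import numpy as np


def leave_one_out_knn(embeddings, labels, k=5):
    """Cosine k-NN over labeled variants, excluding each query from its own neighbors.

    Returns (predicted_label, confidence, neighbor_agreement), one entry per variant.
    Assumes labels are 1 = pathogenic, 0 = benign, and that k < number of variants.
    """
    X = np.asarray(embeddings, dtype=float)
    y = np.asarray(labels, dtype=int)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)   # unit vectors -> dot = cosine
    sim = X @ X.T
    np.fill_diagonal(sim, -np.inf)                     # a variant is never its own neighbor
    neighbors = np.argsort(-sim, axis=1)[:, :k]        # indices of the k most similar variants
    votes = y[neighbors]                               # shape (n_variants, k)
    pathogenic_ratio = votes.mean(axis=1)
    predicted = (pathogenic_ratio >= 0.5).astype(int)  # majority vote (ties -> pathogenic)
    confidence = np.maximum(pathogenic_ratio, 1.0 - pathogenic_ratio)
    agreement = (votes == y[:, None]).mean(axis=1)     # share of neighbors with the same label
    return predicted, confidence, agreement
```

\u003cp\u003eAveraging the agreement values over all labeled variants gives one possible reading of the clustering accuracy metric described here.\u003c/p\u003e\u003cp\u003e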
The classification criteria are outlined in Table\u0026nbsp;\u003cspan refid=\"Tab3\" class=\"InternalRef\"\u003e3\u003c/span\u003e.\u003c/p\u003e\u003cp\u003e \u003c/p\u003e\u003cdiv class=\"gridtable\"\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e\u003ctable float=\"Yes\" id=\"Tab3\" border=\"1\"\u003e\u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 3\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eCriteria for classifying pathogenicity based on the proportion of pathogenic and benign neighbors among the top-k nearest neighbors.\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e\u003ccolgroup cols=\"3\"\u003e\u003c/colgroup\u003e\u003cthead\u003e\u003ctr\u003e\u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eClassification\u003c/p\u003e \u003c/th\u003e\u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eCriteria (among \u003cem\u003ek\u003c/em\u003e neighbors)\u003c/p\u003e \u003c/th\u003e\u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eConfidence Score\u003c/p\u003e \u003c/th\u003e\u003c/tr\u003e\u003c/thead\u003e\u003ctbody\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003ePotentially Pathogenic\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e\u003cem\u003e≥\u003c/em\u003e 90% pathogenic neighbors\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003ePathogenic ratio\u003c/p\u003e \u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eLikely Pathogenic\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e\u003cem\u003e≥\u003c/em\u003e 60% pathogenic neighbors (but 
\u003cem\u003e\u0026lt;\u003c/em\u003e 90%)\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003ePathogenic ratio\u003c/p\u003e \u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003ePotentially Benign\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e\u003cem\u003e≥\u003c/em\u003e 90% benign neighbors\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eBenign ratio\u003c/p\u003e \u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eLikely Benign\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e\u003cem\u003e≥\u003c/em\u003e 60% benign neighbors (but \u003cem\u003e\u0026lt;\u003c/em\u003e 90%)\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eBenign ratio\u003c/p\u003e \u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eUncertain\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eMixed or no clear majority\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003emax (benign, pathogenic) ratio\u003c/p\u003e \u003c/td\u003e\u003c/tr\u003e\u003c/tbody\u003e\u003c/table\u003e\u003c/div\u003e\u003ch2\u003eDistance Ratio Analysis\u003c/h2\u003e\u003cp\u003eTo quantitatively validate the clustering structure, we introduced the \u003cb\u003eDistance Ratio\u003c/b\u003e metric. 
For any given variant, we calculate the median cosine distance to its \u003cem\u003ek\u003c/em\u003e nearest neighbors of the same class (\u003cem\u003eD\u003c/em\u003e\u003csub\u003e\u003cem\u003esame\u003c/em\u003e\u003c/sub\u003e) and to its \u003cem\u003ek\u003c/em\u003e nearest neighbors of the opposite class (\u003cem\u003eD\u003c/em\u003e\u003csub\u003e\u003cem\u003eopposite\u003c/em\u003e\u003c/sub\u003e).\u003c/p\u003e\u003cp\u003eDistance Ratio = \u003cem\u003eD\u003c/em\u003e\u003csub\u003e\u003cem\u003esame\u003c/em\u003e\u003c/sub\u003e / \u003cem\u003eD\u003c/em\u003e\u003csub\u003e\u003cem\u003eopposite\u003c/em\u003e\u003c/sub\u003e (3)\u003c/p\u003e\u003cp\u003eA ratio \u003cem\u003e\u0026lt;\u003c/em\u003e 1.0 indicates that the variant resides within a semantically consistent neighborhood (correct clustering), while a ratio \u003cem\u003e\u0026gt;\u003c/em\u003e 1.0 indicates separation failure.\u003c/p\u003e\u003ch2\u003eImplementation Workflow\u003c/h2\u003e\u003cp\u003eThe pathogenicity prediction pipeline targets novel variants and variants of uncertain significance (VUS) using a modular, efficient workflow. Unknown variants are retrieved from ChromaDB using pathogenicity status filters, processed in batches for computational efficiency, and analyzed for their k (default: 20) nearest known variants (benign or pathogenic) via cosine similarity. Predictions are generated using a confidence-weighted voting system, with confidence scores reflecting prediction reliability. 
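\u003c/p\u003e\u003cp\u003eThe mapping from neighbor votes to the categories of Table 3 can be sketched as follows. The exact weighting used in the pipeline is not specified here, so this sketch assumes plain neighbor proportions with the thresholds from Table 3; the function name classify_from_neighbors is hypothetical.\u003c/p\u003e

```python
def classify_from_neighbors(n_pathogenic, n_benign):
    """Map neighbor counts to a Table 3 category and a confidence score.

    Assumes every neighbor is labeled either pathogenic or benign and that
    the confidence score is the plain proportion of supporting neighbors.
    """
    k = n_pathogenic + n_benign
    if k == 0:
        raise ValueError("at least one labeled neighbor is required")
    pathogenic_ratio = n_pathogenic / k
    benign_ratio = n_benign / k
    if pathogenic_ratio >= 0.9:
        return "Potentially Pathogenic", pathogenic_ratio
    if pathogenic_ratio >= 0.6:
        return "Likely Pathogenic", pathogenic_ratio
    if benign_ratio >= 0.9:
        return "Potentially Benign", benign_ratio
    if benign_ratio >= 0.6:
        return "Likely Benign", benign_ratio
    # Neither class reaches 60% of the neighbors.
    return "Uncertain", max(pathogenic_ratio, benign_ratio)
```

\u003cp\u003e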
Visualizations employing PCA, UMAP, and t-SNE illustrate spatial relationships between known and unknown variants, aiding genetic counselors and clinicians in interpreting results (see the workflow in Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e).\u003c/p\u003e"},{"header":"Results","content":"\u003cdiv id=\"Sec17\" class=\"Section2\"\u003e \u003ch2\u003eEmbedding Results\u003c/h2\u003e \u003cp\u003eThe embedding analysis establishes a robust framework for clinical variant interpretation via confidence-based stratification derived from neighborhood consistency patterns. Variants can be categorized as high-confidence (\u0026gt;\u0026thinsp;95% agreement among nearest neighbors), moderate-confidence (60\u0026ndash;95% agreement), or boundary cases requiring expert review (\u0026lt;\u0026thinsp;60% agreement). \u003cem\u003eBRCA1\u003c/em\u003e analysis revealed that 2\u0026ndash;5% of variants fell into boundary cases across models, highlighting cases that may benefit from additional clinical review or functional studies. \u003cem\u003eBRCA2\u003c/em\u003e analysis showed fewer boundary cases (0.4\u0026ndash;2.6%), indicating stronger semantic clustering and higher prediction confidence. The gradient of neighborhood agreement serves as a natural quality control mechanism for clinical workflows, enabling automated processing of high-confidence predictions while flagging uncertain cases for genetic counseling and expert interpretation. This approach addresses the critical need for reliable variant classification in hereditary cancer predisposition testing, where accurate interpretation directly influences clinical management decisions. 
Consistent performance across neighbor counts (k\u0026thinsp;=\u0026thinsp;1, 5, 20) validates the methodology\u0026rsquo;s robustness, with optimal k likely being gene-specific to balance local precision and noise reduction for immediate clinical implementation (Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e).\u003c/p\u003e \u003cp\u003e \u003cb\u003eBRCA1\u003c/b\u003e \u003cb\u003eSemantic Embedding Performance\u003c/b\u003e The k-nearest neighbors (k-NN) evaluation of \u003cem\u003eBRCA1\u003c/em\u003e variants (n\u0026thinsp;=\u0026thinsp;3,311) demonstrated robust classification performance across all three embedding models, with systematic analysis confirming the effectiveness of semantic representations in capturing pathogenicity-relevant features (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003e). The MPNet model achieved the highest overall accuracy at 97.9% with top 20 neighbors, with strong performance for both benign/likely benign variants (96.9%) and pathogenic/likely pathogenic variants (98.4%). MedEmbed showed comparable performance at 97.7% overall accuracy, while Google embeddings achieved 97.3% overall accuracy. Neighborhood consistency analysis revealed that 94.6% of benign variants and 97.4% of pathogenic variants had 80\u0026ndash;100% agreement among their top 20 nearest neighbors when using MPNet, indicating robust semantic clustering. The high neighborhood consensus across all models suggests that \u003cem\u003eBRCA1\u003c/em\u003e variant annotations contain sufficient semantic information to enable reliable automated classification. With MedEmbed, 92.5% of benign variants reached 80\u0026ndash;100% neighborhood agreement; with Google embeddings, 92.6% did. 
This consistent performance across different embedding approaches validates the robustness of the semantic representation framework for \u003cem\u003eBRCA1\u003c/em\u003e variant interpretation.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003e \u003cb\u003eBRCA2\u003c/b\u003e \u003cb\u003eSemantic Embedding Performance\u003c/b\u003e Our method demonstrated exceptional classification performance on \u003cem\u003eBRCA2\u003c/em\u003e variants (n\u0026thinsp;=\u0026thinsp;4,074) across all embedding models, with MedEmbed achieving outstanding results of 99.1% overall accuracy (99.2% benign/likely benign, 99.1% pathogenic/likely pathogenic) (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003e). The domain-specific medical embedding model\u0026rsquo;s superior performance highlights the value of specialized pre-training for genetic variant analysis. MPNet achieved comparable excellence with 99.0% overall accuracy (98.3% benign/likely benign, 99.4% pathogenic/likely pathogenic), while Google embeddings maintained strong performance at 97.9% overall accuracy. Neighborhood distribution analysis revealed remarkable consistency, with MedEmbed showing 99.6% of benign variants and 98.8% of pathogenic variants achieving 80\u0026ndash;100% agreement among top 20 neighbors. MPNet demonstrated equally impressive consistency with 97.4% of benign variants and 99.3% of pathogenic variants showing high neighborhood agreement. The larger sample size for \u003cem\u003eBRCA2\u003c/em\u003e variants appears to contribute to enhanced model training and more stable semantic representations, enabling the development of highly reliable prediction models for clinical applications. 
Google embeddings, while achieving the lowest performance among the three models, still maintained clinically relevant accuracy with 94.9% of benign variants showing strong neighborhood consensus.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003e \u003cb\u003eFBN1\u003c/b\u003e \u003cb\u003eSemantic Embedding Performance\u003c/b\u003e The \u003cem\u003eFBN1\u003c/em\u003e validation results demonstrate robust performance across all three embedding models, achieving consistently high accuracy rates exceeding 96% (MPNet: 96.7%, Google: 96.5%, MedEmbed: 96.6% at k\u0026thinsp;=\u0026thinsp;5) (Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003e). These results are particularly significant as they validate the generalizability of our semantic embedding approach beyond the extensively characterized \u003cem\u003eBRCA\u003c/em\u003e cancer predisposition genes to a gene associated with connective tissue disorders, in which pathogenic variants are mostly missense rather than loss-of-function. The dimensionality reduction visualizations reveal distinct clustering patterns for \u003cem\u003eFBN1\u003c/em\u003e variants, with Google embeddings showing the most compact and well-separated clusters in t-SNE and UMAP projections, while MPNet and MedEmbed demonstrated more distributed but still discriminative patterns. Notably, the \u003cem\u003eFBN1\u003c/em\u003e performance closely mirrors that observed for \u003cem\u003eBRCA1\u003c/em\u003e (97.9% with MPNet at k\u0026thinsp;=\u0026thinsp;20), suggesting that the embedding approach maintains effectiveness across different genetic contexts and disease mechanisms, even for genes that may have less comprehensive functional annotation compared to the intensively studied \u003cem\u003eBRCA1/2\u003c/em\u003e genes. 
This validation supports the broader clinical utility of semantic embeddings for variant interpretation across diverse genetic conditions, which is particularly important in rare disease contexts where sparse training data may limit traditional computational approaches.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003e \u003cb\u003eGeneralization to ATM and PALB2\u003c/b\u003e To further stress-test the framework\u0026rsquo;s generalizability, we evaluated \u003cem\u003ePALB2\u003c/em\u003e (\u003cem\u003en\u003c/em\u003e\u0026thinsp;=\u0026thinsp;1,598) (Supplementary Fig.\u0026nbsp;1) and \u003cem\u003eATM\u003c/em\u003e (\u003cem\u003en\u003c/em\u003e\u0026thinsp;=\u0026thinsp;4,346) (Supplementary Fig.\u0026nbsp;2) (Supplementary_Data.xls). Despite differing biological functions and mutation spectra, VUS.Life maintained high predictive accuracy. As shown in Table\u0026nbsp;\u003cspan refid=\"Tab4\" class=\"InternalRef\"\u003e4\u003c/span\u003e, the \u003cem\u003eATM\u003c/em\u003e gene achieved 98.0\u0026ndash;98.4% accuracy across all models at \u003cem\u003ek\u003c/em\u003e\u0026thinsp;=\u0026thinsp;5, while \u003cem\u003ePALB2\u003c/em\u003e achieved 97.2\u0026ndash;98.4%. The consistency of these results, comparable to the highly curated BRCA datasets, suggests that the semantic embedding approach captures pathogenicity signals that generalize across genes rather than artifacts of specific gene curation histories.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab4\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 4\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003e\u003cb\u003eClustering accuracy of semantic embedding models for predicting variant pathogenicity across genes.\u003c/b\u003e Accuracy is measured by the percentage of k-nearest neighbors sharing the same known pathogenicity classification. 
Higher accuracy indicates better separation of pathogenic and benign variants.\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"6\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c6\" colnum=\"6\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eGene\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eModel\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003ek\u0026thinsp;=\u0026thinsp;5\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003ek\u0026thinsp;=\u0026thinsp;10\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c5\"\u003e \u003cp\u003ek\u0026thinsp;=\u0026thinsp;15\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c6\"\u003e \u003cp\u003ek\u0026thinsp;=\u0026thinsp;20\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\" morerows=\"2\" rowspan=\"3\"\u003e \u003cp\u003e\u003cem\u003eBRCA1\u003c/em\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eMedEmbed-large-v0.1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.981\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.980\u003c/p\u003e \u003c/td\u003e 
\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.978\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.977\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003egoogle-embedding\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.980\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.977\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.974\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.973\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eall-mpnet-base-v2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.984\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.983\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.981\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.979\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\" morerows=\"2\" rowspan=\"3\"\u003e \u003cp\u003e\u003cem\u003eBRCA2\u003c/em\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eMedEmbed-large-v0.1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.993\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.993\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.992\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e 
\u003cp\u003e0.991\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003egoogle-embedding\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.985\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.983\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.981\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.979\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eall-mpnet-base-v2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.992\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.991\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.990\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.990\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\" morerows=\"2\" rowspan=\"3\"\u003e \u003cp\u003e\u003cem\u003eFBN1\u003c/em\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eMedEmbed-large-v0.1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.966\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.965\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.964\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.963\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003egoogle-embedding\u003c/p\u003e 
\u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.965\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.964\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.961\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.959\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eall-mpnet-base-v2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.967\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.962\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.961\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.959\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\" morerows=\"2\" rowspan=\"3\"\u003e \u003cp\u003e\u003cem\u003eATM\u003c/em\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eMedEmbed-large-v0.1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.984\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.982\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.979\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.975\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003egoogle-embedding\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.980\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e 
\u003cp\u003e0.978\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.975\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.975\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eall-mpnet-base-v2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.983\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.981\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.981\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.978\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\" morerows=\"2\" rowspan=\"3\"\u003e \u003cp\u003e\u003cem\u003ePALB2\u003c/em\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eMedEmbed-large-v0.1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.980\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.979\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.979\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.979\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003egoogle-embedding\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.972\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.960\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.952\u003c/p\u003e \u003c/td\u003e \u003ctd 
align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.946\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eall-mpnet-base-v2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.984\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.984\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.983\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.981\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec18\" class=\"Section2\"\u003e \u003ch2\u003ePathogenicity Prediction using Annotation Embeddings\u003c/h2\u003e \u003cp\u003eThe prediction results for 3,000 \u0026lsquo;not-yet-reviewed\u0026rsquo; variants in \u003cem\u003eBRCA1\u003c/em\u003e and 3,000 in \u003cem\u003eBRCA2\u003c/em\u003e (Supplementary_Data.xls) illustrate the potential of semantic embeddings for variant pathogenicity classification (see Figs.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003e and \u003cspan refid=\"Fig6\" class=\"InternalRef\"\u003e6\u003c/span\u003e). 
Across all three embedding models (MPNet, Google embeddings, and MedEmbed), the unknown variants (green diamonds) exhibit spatial distributions that closely mirror the established clustering patterns of known benign (blue) and pathogenic (red) variants.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eTo illustrate the practical application of this approach, we examined prediction results for 10 randomly selected variants from each of the \u003cem\u003eBRCA1\u003c/em\u003e and \u003cem\u003eBRCA2\u003c/em\u003e datasets (Tables\u0026nbsp;\u003cspan refid=\"Tab5\" class=\"InternalRef\"\u003e5\u003c/span\u003e \u0026amp; \u003cspan refid=\"Tab6\" class=\"InternalRef\"\u003e6\u003c/span\u003e). The models agree on most variants and produce biologically coherent predictions across variant types. For \u003cem\u003eBRCA2\u003c/em\u003e variants, all three embedding models achieved perfect agreement on clearly deleterious variants such as frameshift (NC_000013.11:g.32396971_32396972dup) and stop-gained mutations (NC_000013.11:g.32339409C\u0026thinsp;\u0026gt;\u0026thinsp;G), both receiving unanimous pathogenic predictions with confidence scores of 1.00, with all 20 nearest neighbors pathogenic (B:0, P:20). 
Intronic variants were classified as benign with maximum confidence in most cases; the exception in Table\u0026nbsp;\u003cspan refid=\"Tab5\" class=\"InternalRef\"\u003e5\u003c/span\u003e is the \u003cem\u003eBRCA1\u003c/em\u003e intronic deletion g.43067430del, which Google embeddings classified as pathogenic with a confidence of 0.600. Missense variants showed more varied predictions, with generally high confidence but occasional inter-model differences in neighbor composition.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab5\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 5\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003ePredicted pathogenicity of randomly selected unknown variants in \u003cem\u003eBRCA1\u003c/em\u003e (NC_000017.11) using different semantic embedding models.\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"11\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c6\" colnum=\"6\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c7\" colnum=\"7\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c8\" colnum=\"8\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c9\" colnum=\"9\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c10\" colnum=\"10\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c11\" colnum=\"11\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\" morerows=\"1\" rowspan=\"2\"\u003e \u003cp\u003eGenomic HGVS\u003c/p\u003e \u003c/th\u003e 
\u003cth align=\"left\" colname=\"c2\" morerows=\"1\" rowspan=\"2\"\u003e \u003cp\u003eConsequence\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colspan=\"3\" nameend=\"c5\" namest=\"c3\"\u003e \u003cp\u003eMedEmbed\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colspan=\"3\" nameend=\"c8\" namest=\"c6\"\u003e \u003cp\u003egoogle\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colspan=\"3\" nameend=\"c11\" namest=\"c9\"\u003e \u003cp\u003eMPNet\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eCount\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eConf. Score\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c5\"\u003e \u003cp\u003ePred\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c6\"\u003e \u003cp\u003eCount\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c7\"\u003e \u003cp\u003eConf. Score\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c8\"\u003e \u003cp\u003ePred\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c9\"\u003e \u003cp\u003eCount\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c10\"\u003e \u003cp\u003eConf. 
Score\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c11\"\u003e \u003cp\u003ePred\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eg.43094629_43095075del\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003esplice_acceptor_variant\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eB:16, P:4\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.800\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003ebenign\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003eB:2, P:18\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e0.900\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003epathogenic\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c9\"\u003e \u003cp\u003eB:10, P:10\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c10\"\u003e \u003cp\u003e0.500\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c11\"\u003e \u003cp\u003euncertain\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eg.43102853C\u0026thinsp;\u0026gt;\u0026thinsp;G\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eintron_variant\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eB:20, P:0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e1.000\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003ebenign\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003eB:20, P:0\u003c/p\u003e \u003c/td\u003e \u003ctd 
align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e1.000\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003ebenign\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c9\"\u003e \u003cp\u003eB:20, P:0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c10\"\u003e \u003cp\u003e1.000\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c11\"\u003e \u003cp\u003ebenign\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eg.43067120C\u0026thinsp;\u0026gt;\u0026thinsp;T\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eintron_variant\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eB:20, P:0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e1.000\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003ebenign\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003eB:20, P:0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e1.000\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003ebenign\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c9\"\u003e \u003cp\u003eB:20, P:0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c10\"\u003e \u003cp\u003e1.000\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c11\"\u003e \u003cp\u003ebenign\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eg.43093771_43093774delinsTC\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eframeshift_variant\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eB:0, P:20\u003c/p\u003e 
\u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e1.000\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003epathogenic\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003eB:0, P:20\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e1.000\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003epathogenic\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c9\"\u003e \u003cp\u003eB:0, P:20\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c10\"\u003e \u003cp\u003e1.000\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c11\"\u003e \u003cp\u003epathogenic\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eg.43061638A\u0026thinsp;\u0026gt;\u0026thinsp;G\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eintron_variant\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eB:20, P:0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e1.000\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003ebenign\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003eB:20, P:0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e1.000\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003ebenign\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c9\"\u003e \u003cp\u003eB:20, P:0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c10\"\u003e \u003cp\u003e1.000\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c11\"\u003e \u003cp\u003ebenign\u003c/p\u003e \u003c/td\u003e 
\u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eg.43065397G\u0026thinsp;\u0026gt;\u0026thinsp;A\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eintron_variant\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eB:20, P:0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e1.000\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003ebenign\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003eB:19, P:1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e0.950\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003ebenign\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c9\"\u003e \u003cp\u003eB:20, P:0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c10\"\u003e \u003cp\u003e1.000\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c11\"\u003e \u003cp\u003ebenign\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eg.43067430del\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eintron_variant\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eB:20, P:0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e1.000\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003ebenign\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003eB:8, P:12\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e0.600\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003epathogenic\u003c/p\u003e 
\u003c/td\u003e \u003ctd align=\"left\" colname=\"c9\"\u003e \u003cp\u003eB:20, P:0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c10\"\u003e \u003cp\u003e1.000\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c11\"\u003e \u003cp\u003ebenign\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eg.43074502G\u0026thinsp;\u0026gt;\u0026thinsp;T\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003emissense_variant\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eB:17, P:3\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.850\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003ebenign\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003eB:13, P:7\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e0.650\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003ebenign\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c9\"\u003e \u003cp\u003eB:16, P:4\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c10\"\u003e \u003cp\u003e0.800\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c11\"\u003e \u003cp\u003ebenign\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eg.43094690T\u0026thinsp;\u0026gt;\u0026thinsp;C\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003emissense_variant\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eB:20, P:0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e1.000\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" 
colname=\"c5\"\u003e \u003cp\u003ebenign\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003eB:18, P:2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e0.900\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003ebenign\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c9\"\u003e \u003cp\u003eB:20, P:0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c10\"\u003e \u003cp\u003e1.000\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c11\"\u003e \u003cp\u003ebenign\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eg.43115774T\u0026thinsp;\u0026gt;\u0026thinsp;G\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003emissense_variant\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eB:12, P:8\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.600\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003ebenign\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003eB:5, P:15\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e0.750\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003epathogenic\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c9\"\u003e \u003cp\u003eB:16, P:4\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c10\"\u003e \u003cp\u003e0.800\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c11\"\u003e \u003cp\u003ebenign\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003e \u003cdiv 
class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab6\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 6\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003ePredicted pathogenicity of randomly selected unknown variants in \u003cem\u003eBRCA2\u003c/em\u003e (NC_000013.11) using different semantic embedding models.\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"11\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c6\" colnum=\"6\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c7\" colnum=\"7\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c8\" colnum=\"8\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c9\" colnum=\"9\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c10\" colnum=\"10\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c11\" colnum=\"11\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\" morerows=\"1\" rowspan=\"2\"\u003e \u003cp\u003eGenomic HGVS\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\" morerows=\"1\" rowspan=\"2\"\u003e \u003cp\u003eConsequence\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colspan=\"3\" nameend=\"c5\" namest=\"c3\"\u003e \u003cp\u003eMedEmbed\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colspan=\"3\" 
nameend=\"c8\" namest=\"c6\"\u003e \u003cp\u003egoogle\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colspan=\"3\" nameend=\"c11\" namest=\"c9\"\u003e \u003cp\u003eMPNet\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eCount\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eConf. Score\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c5\"\u003e \u003cp\u003ePred\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c6\"\u003e \u003cp\u003eCount\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c7\"\u003e \u003cp\u003eConf. Score\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c8\"\u003e \u003cp\u003ePred\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c9\"\u003e \u003cp\u003eCount\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c10\"\u003e \u003cp\u003eConf. Score\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c11\"\u003e \u003cp\u003ePred\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eg.32353730A\u0026thinsp;\u0026gt;\u0026thinsp;G\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eintron_variant\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eB:20, P:0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e1.000\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003ebenign\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003eB:20, P:0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e1.000\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003ebenign\u003c/p\u003e 
\u003c/td\u003e \u003ctd align=\"left\" colname=\"c9\"\u003e \u003cp\u003eB:20, P:0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c10\"\u003e \u003cp\u003e1.000\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c11\"\u003e \u003cp\u003ebenign\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eg.32340247G\u0026thinsp;\u0026gt;\u0026thinsp;T\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003emissense_variant\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eB:18, P:2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.900\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003ebenign\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003eB:20, P:0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e1.000\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003ebenign\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c9\"\u003e \u003cp\u003eB:20, P:0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c10\"\u003e \u003cp\u003e1.000\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c11\"\u003e \u003cp\u003ebenign\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eg.32363243A\u0026thinsp;\u0026gt;\u0026thinsp;G\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003emissense_variant\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eB:19, P:1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.950\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" 
colname=\"c5\"\u003e \u003cp\u003ebenign\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003eB:15, P:5\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e0.750\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003ebenign\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c9\"\u003e \u003cp\u003eB:20, P:0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c10\"\u003e \u003cp\u003e1.000\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c11\"\u003e \u003cp\u003ebenign\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eg.32396971_32396972dup\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eframeshift_variant\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eB:0, P:20\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e1.000\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003epathogenic\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003eB:0, P:20\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e1.000\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003epathogenic\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c9\"\u003e \u003cp\u003eB:0, P:20\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c10\"\u003e \u003cp\u003e1.000\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c11\"\u003e \u003cp\u003epathogenic\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eg.32370904A\u0026thinsp;\u0026gt;\u0026thinsp;T\u003c/p\u003e \u003c/td\u003e 
\u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eintron_variant\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eB:20, P:0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e1.000\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003ebenign\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003eB:20, P:0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e1.000\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003ebenign\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c9\"\u003e \u003cp\u003eB:20, P:0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c10\"\u003e \u003cp\u003e1.000\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c11\"\u003e \u003cp\u003ebenign\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eg.32322227A\u0026thinsp;\u0026gt;\u0026thinsp;G\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eintron_variant\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eB:20, P:0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e1.000\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003ebenign\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003eB:20, P:0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e1.000\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003ebenign\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c9\"\u003e \u003cp\u003eB:20, P:0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" 
colname=\"c10\"\u003e \u003cp\u003e1.000\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c11\"\u003e \u003cp\u003ebenign\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eg.32385434_32385445dup\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eintron_variant\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eB:20, P:0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e1.000\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003ebenign\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003eB:15, P:5\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e0.750\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003ebenign\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c9\"\u003e \u003cp\u003eB:20, P:0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c10\"\u003e \u003cp\u003e1.000\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c11\"\u003e \u003cp\u003ebenign\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eg.32369761A\u0026thinsp;\u0026gt;\u0026thinsp;T\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eintron_variant\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eB:20, P:0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e1.000\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003ebenign\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003eB:20, P:0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" 
char=\".\" colname=\"c7\"\u003e \u003cp\u003e1.000\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003ebenign\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c9\"\u003e \u003cp\u003eB:20, P:0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c10\"\u003e \u003cp\u003e1.000\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c11\"\u003e \u003cp\u003ebenign\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eg.32380102G\u0026thinsp;\u0026gt;\u0026thinsp;A\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003esynonymous_variant\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eB:20, P:0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e1.000\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003ebenign\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003eB:18, P:2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e0.900\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003ebenign\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c9\"\u003e \u003cp\u003eB:20, P:0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c10\"\u003e \u003cp\u003e1.000\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c11\"\u003e \u003cp\u003ebenign\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eg.32339409C\u0026thinsp;\u0026gt;\u0026thinsp;G\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003estop_gained\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eB:0, P:20\u003c/p\u003e 
\u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e1.000\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003epathogenic\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003eB:0, P:20\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e1.000\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003epathogenic\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c9\"\u003e \u003cp\u003eB:0, P:20\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c10\"\u003e \u003cp\u003e1.000\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c11\"\u003e \u003cp\u003epathogenic\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003eThe \u003cem\u003eBRCA2\u003c/em\u003e results reveal similar patterns but with some instructive disagreements that highlight the method\u0026rsquo;s sensitivity to semantic context. Most notably, a splice acceptor variant (NC_000017.11:g.43094629_43095075del) demonstrated model disagreement, with MedEmbed predicting benign (B:16, P:4, confidence 0.800), Google embeddings predicting pathogenic (B:2, P:18, confidence 0.900), and MPNet remaining uncertain (B:10, P:10, confidence 0.500). This variant exemplifies cases where the semantic interpretation of clinical annotations may vary, potentially reflecting different emphasis on splice site disruption versus other contextual factors in the training literature.\u003c/p\u003e \u003cp\u003e \u003cstrong\u003e\u003cem\u003eBRCA1\u003c/em\u003e Results\u003c/strong\u003e \u003cp\u003eThe unknown variants distribute throughout the embedding space in a manner consistent with the trained pathogenicity classes. 
In the MPNet and Google embedding visualizations, unknown variants segregate into regions predominantly occupied by either benign or pathogenic training variants, particularly evident in the t-SNE projections, where distinct clusters emerge. The MedEmbed results show a more dispersed but still meaningful distribution, with unknown variants positioning themselves in semantically appropriate regions of the embedding space.\u003c/p\u003e \u003c/p\u003e \u003cp\u003e \u003cstrong\u003e\u003cem\u003eBRCA2\u003c/em\u003e Results\u003c/strong\u003e \u003cp\u003eSimilar patterns are observed for \u003cem\u003eBRCA2\u003c/em\u003e, where unknown variants demonstrate clear spatial alignment with their predicted pathogenicity classes. The Google embeddings show particularly strong clustering behavior, with unknown variants forming coherent groups that align with either the benign or pathogenic regions. The UMAP projections across all models reveal that unknown variants maintain the same topological relationships as the training data, suggesting robust semantic capture of pathogenicity-relevant features.\u003c/p\u003e \u003c/p\u003e \u003cp\u003e \u003cstrong\u003e\u003cem\u003eFBN1\u003c/em\u003e Results\u003c/strong\u003e \u003cp\u003eThe 2015 ACMG/AMP guidelines for variant interpretation do not fully account for gene-specific characteristics. Since 2019, the ClinGen FBN1 Variant Curation Expert Panel (VCEP) has invested substantial effort in developing consensus recommendations tailored to the unique features of this disease gene and its pathogenic variants [\u003cspan citationid=\"CR30\" class=\"CitationRef\"\u003e30\u003c/span\u003e]. To evaluate these new guidelines, the panel conducted a pilot study on 60 representative, challenging variants. This effort involved collaboration among three core sites and six non-core institutions, demonstrating improved interpretive consistency compared to the 2015 ACMG/AMP framework. 
Of these 60 variants, 30 overlapped with the 1,532 variants used to construct the \u003cem\u003eFBN1\u003c/em\u003e vector embedding space. Pathogenicity prediction was therefore performed only on the remaining 30 variants. These 30 unresolved variants segregated into established pathogenic or benign clusters, particularly evident in the UMAP projection using MPNet embeddings (Fig.\u0026nbsp;\u003cspan refid=\"Fig7\" class=\"InternalRef\"\u003e7\u003c/span\u003e). All 30 variants received either pathogenic or benign classifications (FBN1VCEP Tab of Supplementary_Data.xls). For 10 variants, the FBN1 VCEP\u0026mdash;either at core sites, non-core sites, or both\u0026mdash;had assigned a VUS classification under the updated guidelines. Our algorithm, using the best-performing MPNet model for \u003cem\u003eFBN1\u003c/em\u003e, classified 7 of these as pathogenic and 3 as benign. For the remaining 20 variants, MPNet predictions agreed with VCEP classifications for 17 (85%). We examined the three discordant calls, all of which are missense variants predicted as pathogenic by our algorithm with the MPNet model but classified as benign by the \u003cem\u003eFBN1\u003c/em\u003e VCEP. Notably, ClinVar lists all three as VUS, with submissions from more than four independent laboratories each, underscoring the uncertainty surrounding their true pathogenicity.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003c/p\u003e \u003cp\u003eWe further validated the framework using a subset of \u003cem\u003eFBN1\u003c/em\u003e variants linked to specific cardiovascular phenotypes (e.g., aortic dilation, mitral valve prolapse) retrieved from the UDM database. From an initial collection of approximately 600 variants, we identified 308 that were already present in our ClinVar training set with high-quality review status; these were excluded from the validation set to prevent data leakage. 
The remaining 177 variants served as an independent test set (the FBN1 with cardiovascular tab of Supplementary Data.xls). Pathogenicity predictions were generated using \u003cem\u003ek\u003c/em\u003e\u0026thinsp;=\u0026thinsp;5 nearest neighbors (Table\u0026nbsp;\u003cspan refid=\"Tab7\" class=\"InternalRef\"\u003e7\u003c/span\u003e). The all-mpnet-base-v2 model achieved the highest sensitivity, correctly identifying 98.31% (\u003cem\u003en\u003c/em\u003e\u0026thinsp;=\u0026thinsp;174) of these pathogenic variants. This was substantially higher than the sensitivity of Google Embedding (89.83%) and MedEmbed-large-v0.1 (88.14%). The superior performance of MPNet suggests that general-purpose semantic models may be more robust than domain-specific models in capturing the nuanced textual descriptions associated with complex phenotypic presentations. The spatial distribution of these test variants is visualized in Fig.\u0026nbsp;\u003cspan refid=\"Fig8\" class=\"InternalRef\"\u003e8\u003c/span\u003e, and detailed results are presented in Supplementary Table X.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab7\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 7\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eSensitivity of pathogenicity prediction for \u003cem\u003eFBN1\u003c/em\u003e variants with variable cardiovascular involvement (UDM dataset, \u003cem\u003en\u003c/em\u003e\u0026thinsp;=\u0026thinsp;177, \u003cem\u003ek\u003c/em\u003e\u0026thinsp;=\u0026thinsp;5).\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"4\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" 
colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e\u0026nbsp;\u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eall-mpnet-base-v2\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003egoogle-embedding\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eMedEmbed-large-v0.1\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003ePrediction of pathogenic (n)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e174\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e159\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e156\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003ePrediction of pathogenic (%)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e98.31%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e89.83%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e88.14%\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec19\" class=\"Section2\"\u003e \u003ch2\u003eUnsupervised Disentanglement of Pathogenic Mechanisms via Protein Language Models\u003c/h2\u003e \u003cp\u003eTo determine whether semantic embeddings capture biologically meaningful structural disruptions rather than mere sequence heuristics, we analyzed the latent space of the ESMC-600M protein language model applied to FBN1 (Fibrillin-1). 
We hypothesized that a \"residue-level delta embedding\"\u0026mdash;the vector difference between the mutant and wild-type representations at the specific amino acid site\u0026mdash;would isolate the pathogenicity signal from the background noise of protein sequence identity.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec20\" class=\"Section2\"\u003e \u003ch2\u003eTopological Separation of Mutation Subtypes\u003c/h2\u003e \u003cp\u003eDimensionality reduction (UMAP) of these delta embeddings revealed a striking, unsupervised segregation of variants based on their underlying molecular mechanism (HCD plot in Fig.\u0026nbsp;\u003cspan refid=\"Fig9\" class=\"InternalRef\"\u003e9\u003c/span\u003e).\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003e \u003cul\u003e \u003cli\u003e \u003cp\u003e \u003cb\u003eDisulfide Bond Disruptions (Cysteine loss/gain)\u003c/b\u003e: These variants formed a tightly compacted cluster, distinct from other pathogenic variants. This topological tightness suggests that the model recognizes a consistent \"structural collapse\" signature associated with the breaking of covalent cysteine bridges\u0026mdash;a primary driver of protein instability in Marfan syndrome.\u003c/p\u003e \u003c/li\u003e \u003cli\u003e \u003cp\u003e \u003cb\u003eCalcium-Binding Alterations\u003c/b\u003e: Mutations affecting calcium-binding epidermal growth factor-like (cbEGF) domains formed a separate, more diffuse cluster. 
This spatial distinction implies that the model differentiates between the global destabilization caused by disulfide loss and the local rigidity/flexibility defects caused by calcium-binding disruptions.\u003c/p\u003e \u003c/li\u003e \u003c/ul\u003e \u003c/p\u003e \u003cp\u003eCrucially, this mechanistic separation was achieved without providing the model with 3D structural coordinates or functional labels, demonstrating that the protein language model has implicitly learned the physics of protein folding solely through evolutionary sequence patterns.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec21\" class=\"Section2\"\u003e \u003ch2\u003eConvergence of Structural Impact and Evolutionary Fitness\u003c/h2\u003e \u003cp\u003eTo validate these structural embeddings against evolutionary constraints, we calculated Zero-Shot Log-Likelihood Ratios (LLR) for 216 missense variants of \u003cem\u003eFBN1\u003c/em\u003e associated with cardiovascular phenotypes. The LLR serves as a proxy for evolutionary fitness, quantifying how \"unexpected\" a mutation is given the protein\u0026rsquo;s context. 
We observed a high degree of concordance between the structural embedding space and evolutionary fitness scores (Evolutionary fitness landscape plot in Fig.\u0026nbsp;\u003cspan refid=\"Fig9\" class=\"InternalRef\"\u003e9\u003c/span\u003e):\u003c/p\u003e \u003cp\u003e \u003cul\u003e \u003cli\u003e \u003cp\u003eThe Pathogenicity Gradient: LLR scores ranged from \u0026minus;10.43 (highly deleterious) to +0.45 (neutral).\u003c/p\u003e \u003c/li\u003e \u003cli\u003e \u003cp\u003eStructural-Evolutionary Coupling: Variants within the \"disulfide disruption\" structural cluster exhibited the most severe negative LLR scores (mean LLR: \u0026minus;3.77), with extreme outliers (\u0026lt;\u0026thinsp;\u0026minus;7.0) exclusively confined to this region.\u003c/p\u003e \u003c/li\u003e \u003c/ul\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec22\" class=\"Section2\"\u003e \u003ch2\u003eClinical Implications of \"Delta\" Representations\u003c/h2\u003e \u003cp\u003eOur comparison of embedding strategies revealed a critical methodological insight for clinical AI. Raw mutant embeddings were dominated by sequence identity, obscuring the pathogenic signal. By utilizing residue-level delta embeddings, we successfully subtracted the \"background\" biological context, leaving only the vector of the pathogenic insult. 
This approach transforms the \"black box\" of the LLM into an interpretable tool capable of distinguishing between varying mechanisms of disease\u0026mdash;such as haploinsufficiency (protein instability) versus dominant-negative effects\u0026mdash;\u003cem\u003ein silico\u003c/em\u003e, before experimental functional validation.\u003c/p\u003e \u003cdiv id=\"Sec23\" class=\"Section3\"\u003e \u003ch2\u003eDistance Ratio Analysis Confirms Robust Class Separation\u003c/h2\u003e \u003cp\u003eTo quantitatively validate the class separation observed in the dimensionality reduction plots, we applied our Distance Ratio metric to the embeddings of \u003cem\u003eBRCA1\u003c/em\u003e and \u003cem\u003eBRCA2\u003c/em\u003e variants. The results confirm that all three embedding models produce a highly structured and discriminative semantic space. As shown in Fig.\u0026nbsp;\u003cspan refid=\"Fig10\" class=\"InternalRef\"\u003e10\u003c/span\u003e, the distributions of the Distance Ratio for both pathogenic (red) and benign (blue) variants are overwhelmingly concentrated to the left of the 1.0 threshold. This visually demonstrates that for most variants, the distance to same-class neighbors is smaller than the distance to different-class neighbors.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eThis finding is supported by quantitative analysis (Table\u0026nbsp;\u003cspan refid=\"Tab8\" class=\"InternalRef\"\u003e8\u003c/span\u003e), which shows that at least 97.3% of variants in every model, gene, and class combination, and more than 98% on average for each model, have a Distance Ratio less than 1.0, indicating effective clustering. The all-mpnet-base-v2 model had the highest overall average, with 98.7% of variants correctly clustered. 
While all models performed exceptionally well, we observed that \u003cem\u003eBRCA2\u003c/em\u003e variants, which have more extensive annotation data, showed slightly better class separation than \u003cem\u003eBRCA1\u003c/em\u003e variants. This robust quantitative validation provides high confidence in using the semantic embedding space for downstream classification and similarity search tasks.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab8\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 8\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003e\u003cb\u003ePercentage of variants with a Distance Ratio\u0026thinsp;\u0026lt;\u0026thinsp;1.0, calculated with k\u0026thinsp;=\u0026thinsp;20 nearest neighbors.\u003c/b\u003e Values represent the proportion of variants that are closer to their same-class neighbors than to opposite-class neighbors, indicating effective separation.\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"6\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c6\" colnum=\"6\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eModel\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003e\u003cem\u003eBRCA1\u003c/em\u003e Pathogenic\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" 
colname=\"c3\"\u003e \u003cp\u003e\u003cem\u003eBRCA1\u003c/em\u003e Benign\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003e\u003cem\u003eBRCA2\u003c/em\u003e Pathogenic\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c5\"\u003e \u003cp\u003e\u003cem\u003eBRCA2\u003c/em\u003e Benign\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c6\"\u003e \u003cp\u003eOverall Average\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eall-mpnet-base-v2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e98.2%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e97.5%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e99.3%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e99.4%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e98.7%\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003egoogle-embedding\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e97.7%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e99.4%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e98.5%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e98.5%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e98.4%\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eMedEmbed-large-v0.1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" 
char=\".\" colname=\"c2\"\u003e \u003cp\u003e97.3%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e99.5%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e98.8%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e99.7%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e98.6%\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003e \u003cstrong\u003eClinical Implications\u003c/strong\u003e \u003cp\u003eThe consistent distribution patterns across both genes and all embedding models strongly support the hypothesis that semantic embeddings can effectively capture the linguistic and clinical nuances present in variant annotations that distinguish pathogenic from benign variants. This spatial coherence between known and unknown variants indicates that the learned representations encode biologically meaningful features rather than spurious correlations, providing confidence in the method\u0026rsquo;s ability to generalize to variants of uncertain significance (VUS). The preservation of clustering structure suggests that clinicians could potentially leverage these embeddings not only for binary classification but also for confidence estimation and identification of variants requiring additional functional studies.\u003c/p\u003e \u003c/p\u003e \u003c/div\u003e \u003c/div\u003e"},{"header":"Discussion","content":"\u003cp\u003eIn this study, we present VUS.Life, a unified framework designed to resolve the interpretation bottleneck for Variants of Uncertain Significance (VUS). 
By synergizing semantic text embeddings with protein language modeling, we achieved\u0026thinsp;\u0026gt;\u0026thinsp;96% classification accuracy across five clinically distinct genes (\u003cem\u003eBRCA1\u003c/em\u003e, \u003cem\u003eBRCA2\u003c/em\u003e, \u003cem\u003eATM\u003c/em\u003e, \u003cem\u003ePALB2\u003c/em\u003e, and \u003cem\u003eFBN1\u003c/em\u003e). Crucially, we demonstrate that this performance stems not merely from memorizing clinical pathogenicity assertions or predictions, but from the model\u0026rsquo;s unsupervised ability to disentangle the mechanistic and evolutionary roots of pathogenicity.\u003c/p\u003e \u003cdiv id=\"Sec25\" class=\"Section2\"\u003e \u003ch2\u003eUniversality and Clinical Utility\u003c/h2\u003e \u003cp\u003eCurrent clinical workflows often necessitate a \"patchwork\" of calculators\u0026mdash;deploying structure-based models for missense changes, SpliceAI for intronic regions, and separate heuristics for truncations. A defining advantage of VUS.Life is its consequence-agnostic capability. While state-of-the-art tools like AlphaMissense are restricted to missense variants, our framework processes the full spectrum of genomic variation. By transforming diverse annotations into a unified narrative, VUS.Life captures the \"long tail\" of non-coding and complex variants\u0026mdash;from high-impact frameshifts to subtle regulatory changes\u0026mdash;offering a scalable, single-model solution. We validated this versatility using \u003cem\u003eFBN1\u003c/em\u003e, a gene dominated by missense mutations where mechanism matters. In a pilot study of unresolved variants from the ClinGen FBN1 Variant Curation Expert Panel (VCEP), our model successfully segregated difficult cases into established clusters, achieving 85% concordance with expert consensus on previously manually classified variants. 
Furthermore, in an independent validation using phenotype-specific data (UDM database), our general-purpose model (all-mpnet-base-v2) outperformed the domain-specific MedEmbed-large-v0.1 model (98.31% versus 88.14% sensitivity). This suggests that the broad semantic understanding of a large language model captures complex phenotypic nuance more effectively than models trained on narrower biomedical corpora.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec26\" class=\"Section2\"\u003e \u003ch2\u003eMechanistic Disentanglement and Evolutionary Calibration\u003c/h2\u003e \u003cp\u003eBeyond binary classification, the future of genomic medicine requires a quantitative understanding of variant impact. As highlighted by recent work on the popEVE model, the breakthrough in rare disease diagnosis lies in the in-depth calibration of evolutionary records against human genetics [\u003cspan citationid=\"CR31\" class=\"CitationRef\"\u003e31\u003c/span\u003e]. We validated this paradigm within VUS.Life by integrating Zero-Shot Log-Likelihood Ratio (LLR) scoring via the ESMC-600M protein language model. By utilizing residue-level delta embeddings\u0026mdash;which isolate the specific \"structural insult\" of a variant\u0026mdash;the model spontaneously disentangled distinct pathogenic mechanisms without supervision. In \u003cem\u003eFBN1\u003c/em\u003e, variants disrupting disulfide bonds (critical for protein stability) formed a tight topological cluster distinct from calcium-binding defects. These structural clusters correlated strongly with evolutionary fitness scores (mean LLR: \u0026minus;3.77 for disulfide disruptions), confirming that the model captures the thermodynamic severity of mutations. This moves VUS interpretation from a static label to a calibrated, mechanism-aware risk assessment.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec27\" class=\"Section2\"\u003e \u003ch2\u003eLimitations\u003c/h2\u003e \u003cp\u003eOur approach has limitations. 
First, reliance on text descriptions makes the model sensitive to annotation quality and LLM token limits, potentially truncating extensive evidence for complex variants. Second, while our Distance Ratio analysis confirmed robust clustering (98% of variants\u0026thinsp;\u0026lt;\u0026thinsp;1.0), boundary cases remain where semantic similarity alone may not capture subtle clinical distinctions. Finally, as with all supervised learning, the model inherits biases present in the training repositories (ClinVar/BRCA Exchange).\u003c/p\u003e \u003c/div\u003e"},{"header":"Conclusions","content":"\u003cp\u003eVUS.Life represents a significant step toward a \"semantic genotype-phenotype map.\" By combining the breadth of clinical text with the depth of evolutionary protein modeling, we provide a robust, interpretable, extensible, and scalable framework for pathogenicity prediction. As we integrate emerging methods, such as ESMC-600M for evolutionary calibration, this tool holds the promise of transforming the interpretation of VUS from a source of uncertainty into a dynamic, quantitative understanding of human genetic disease.\u003c/p\u003e"},{"header":"Declarations","content":"\u003cp\u003e\u003cstrong\u003eEthics approval and consent to participate.\u0026nbsp;\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eNot applicable\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eConsent for publication\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eNot applicable\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAvailability of data and materials\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eA patent application was filed on June 10, 2025 - Xiaowu Gai, Jiawei Wu. Predicting Genetic Variant Pathogenicity Using Vector Embeddings (U.S. Patent Application No. 63/821.249). 
GitHub repository for code and demonstrations: https://github.com/MayaMua/vus-life.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eCompeting interests\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe authors declare that they have no competing interests.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eFunding\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThis work is primarily funded by the Medical College of Wisconsin.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAuthors\u0026rsquo; contributions\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eX.G. acquired study funding, conceptualized the project, supervised the work, and oversaw manuscript preparation. J.W. contributed to the study conceptualization, designed the software architecture, implemented the VUS.Life framework, performed formal data analysis, and drafted the original manuscript. All other authors assisted with data curation and manuscript review. All authors have read and approved the final manuscript.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAcknowledgements\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eWe acknowledge the essential contributions of public repositories, specifically ClinVar, BRCA Exchange, and gnomAD. We are also grateful to the open-source community and the developers of the MPNet, MedEmbed, ESMC, and Google embedding models, as well as the Ensembl VEP team, whose tools were integral to the VUS.Life framework.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\n\u003cli\u003eManolio, T.A., et al., \u003cem\u003eGenomic medicine year in review: 2023.\u003c/em\u003e Am J Hum Genet, 2023. \u003cstrong\u003e110\u003c/strong\u003e(12): p. 1992-1995.\u003c/li\u003e\n\u003cli\u003eSayers, E.W., et al., \u003cem\u003eDatabase resources of the National Center for Biotechnology Information.\u003c/em\u003e Nucleic Acids Res, 2021. \u003cstrong\u003e49\u003c/strong\u003e(D1): p. 
D10-D17.\u003c/li\u003e\n\u003cli\u003eKarczewski, K.J., et al., \u003cem\u003eThe mutational constraint spectrum quantified from variation in 141,456 humans.\u003c/em\u003e Nature, 2020. \u003cstrong\u003e581\u003c/strong\u003e(7809): p. 434-443.\u003c/li\u003e\n\u003cli\u003eTaliun, D., et al., \u003cem\u003eSequencing of 53,831 diverse genomes from the NHLBI TOPMed Program.\u003c/em\u003e Nature, 2021. \u003cstrong\u003e590\u003c/strong\u003e(7845): p. 290-299.\u003c/li\u003e\n\u003cli\u003eNg, P.C. and S. Henikoff, \u003cem\u003eSIFT: Predicting amino acid changes that affect protein function.\u003c/em\u003e Nucleic Acids Res, 2003. \u003cstrong\u003e31\u003c/strong\u003e(13): p. 3812-4.\u003c/li\u003e\n\u003cli\u003eAdzhubei, I., D.M. Jordan, and S.R. Sunyaev, \u003cem\u003ePredicting functional effect of human missense mutations using PolyPhen-2.\u003c/em\u003e Curr Protoc Hum Genet, 2013. \u003cstrong\u003eChapter 7\u003c/strong\u003e: p. Unit7.20.\u003c/li\u003e\n\u003cli\u003eRentzsch, P., et al., \u003cem\u003eCADD: predicting the deleteriousness of variants throughout the human genome.\u003c/em\u003e Nucleic Acids Res, 2019. \u003cstrong\u003e47\u003c/strong\u003e(D1): p. D886-D894.\u003c/li\u003e\n\u003cli\u003eRichards, S., et al., \u003cem\u003eStandards and guidelines for the interpretation of sequence variants: a joint consensus recommendation of the American College of Medical Genetics and Genomics and the Association for Molecular Pathology.\u003c/em\u003e Genet Med, 2015. \u003cstrong\u003e17\u003c/strong\u003e(5): p. 405-24.\u003c/li\u003e\n\u003cli\u003eDouville, C., et al., \u003cem\u003eAssessing the Pathogenicity of Insertion and Deletion Variants with the Variant Effect Scoring Tool (VEST-Indel).\u003c/em\u003e Hum Mutat, 2016. \u003cstrong\u003e37\u003c/strong\u003e(1): p. 
28-35.\u003c/li\u003e\n\u003cli\u003eCannon, S., et al., \u003cem\u003eEvaluation of in silico pathogenicity prediction tools for the classification of small in-frame indels.\u003c/em\u003e BMC Med Genomics, 2023. \u003cstrong\u003e16\u003c/strong\u003e(1): p. 36.\u003c/li\u003e\n\u003cli\u003eEvans, P., et al., \u003cem\u003eGenetic variant pathogenicity prediction trained using disease-specific clinical sequencing data sets.\u003c/em\u003e Genome Res, 2019. \u003cstrong\u003e29\u003c/strong\u003e(7): p. 1144-1151.\u003c/li\u003e\n\u003cli\u003eCheng, J., et al., \u003cem\u003eAccurate proteome-wide missense variant effect prediction with AlphaMissense.\u003c/em\u003e Science, 2023. \u003cstrong\u003e381\u003c/strong\u003e(6664): p. eadg7492.\u003c/li\u003e\n\u003cli\u003eYu, H., et al., \u003cem\u003eA graph neural network approach for accurate prediction of pathogenicity in multi-type variants.\u003c/em\u003e Brief Bioinform, 2025. \u003cstrong\u003e26\u003c/strong\u003e(2).\u003c/li\u003e\n\u003cli\u003eDevlin, J., et al. \u003cem\u003eBERT: Pre-training of Deep Bidirectional Transformers for Language Understanding\u003c/em\u003e. in \u003cem\u003eNorth American Chapter of the Association for Computational Linguistics\u003c/em\u003e. 2019.\u003c/li\u003e\n\u003cli\u003eLi, W., et al., \u003cem\u003eFrom Text to Translation: Using Language Models to Prioritize Variants for Clinical Review.\u003c/em\u003e medRxiv, 2025.\u003c/li\u003e\n\u003cli\u003eBrandes, N., et al., \u003cem\u003eGenome-wide prediction of disease variant effects with a deep protein language model.\u003c/em\u003e Nat Genet, 2023. \u003cstrong\u003e55\u003c/strong\u003e(9): p. 1512-1522.\u003c/li\u003e\n\u003cli\u003eLin, W., et al., \u003cem\u003eEnhancing missense variant pathogenicity prediction with protein language models using VariPred.\u003c/em\u003e Sci Rep, 2024. \u003cstrong\u003e14\u003c/strong\u003e(1): p. 
8136.\u003c/li\u003e\n\u003cli\u003eCline, M.S., et al., \u003cem\u003eBRCA Challenge: BRCA Exchange as a global resource for variants in BRCA1 and BRCA2.\u003c/em\u003e PLoS Genet, 2018. \u003cstrong\u003e14\u003c/strong\u003e(12): p. e1007752.\u003c/li\u003e\n\u003cli\u003eLandrum, M.J., et al., \u003cem\u003eClinVar: public archive of relationships among sequence variation and human phenotype.\u003c/em\u003e Nucleic Acids Res, 2014. \u003cstrong\u003e42\u003c/strong\u003e(Database issue): p. D980-5.\u003c/li\u003e\n\u003cli\u003eHunt, S.E., et al., \u003cem\u003eAnnotating and prioritizing genomic variants using the Ensembl Variant Effect Predictor-A tutorial.\u003c/em\u003e Hum Mutat, 2022. \u003cstrong\u003e43\u003c/strong\u003e(8): p. 986-997.\u003c/li\u003e\n\u003cli\u003eJaganathan, K., et al., \u003cem\u003ePredicting Splicing from Primary Sequence with Deep Learning.\u003c/em\u003e Cell, 2019. \u003cstrong\u003e176\u003c/strong\u003e(3): p. 535-548.e24.\u003c/li\u003e\n\u003cli\u003eIoannidis, N.M., et al., \u003cem\u003eREVEL: An Ensemble Method for Predicting the Pathogenicity of Rare Missense Variants.\u003c/em\u003e Am J Hum Genet, 2016. \u003cstrong\u003e99\u003c/strong\u003e(4): p. 877-885.\u003c/li\u003e\n\u003cli\u003eSong, K., et al., \u003cem\u003eMPNet: Masked and Permuted Pre-training for Language Understanding.\u003c/em\u003e arXiv preprint arXiv:2004.09297, 2020.\u003c/li\u003e\n\u003cli\u003eBalachandran, A., \u003cem\u003eMedembed: Medical-focused embedding models\u003c/em\u003e. 2024.\u003c/li\u003e\n\u003cli\u003eGoogle, C., \u003cem\u003eGenerative AI on Vertex AI.\u003c/em\u003e\u003c/li\u003e\n\u003cli\u003eTeam, E.S.M., \u003cem\u003eESM Cambrian: Revealing the mysteries of proteins with unsupervised learning\u003c/em\u003e. 2024, EvolutionaryScale.\u003c/li\u003e\n\u003cli\u003eJolliffe, I.T. and J. 
Cadima, \u003cem\u003ePrincipal component analysis: a review and recent developments.\u003c/em\u003e Philos Trans A Math Phys Eng Sci, 2016. \u003cstrong\u003e374\u003c/strong\u003e(2065): p. 20150202.\u003c/li\u003e\n\u003cli\u003evan der Maaten, L. and G. Hinton, \u003cem\u003eVisualizing Data using t-SNE.\u003c/em\u003e Journal of Machine Learning Research, 2008. \u003cstrong\u003e9\u003c/strong\u003e: p. 2579-2605.\u003c/li\u003e\n\u003cli\u003eMcInnes, L. and J. Healy, \u003cem\u003eUMAP: Uniform Manifold Approximation and Projection for Dimension Reduction.\u003c/em\u003e ArXiv, 2018. \u003cstrong\u003eabs/1802.03426\u003c/strong\u003e.\u003c/li\u003e\n\u003cli\u003eDrackley, A., et al., \u003cem\u003eInterpretation and classification of FBN1 variants associated with Marfan syndrome: consensus recommendations from the Clinical Genome Resource\u0026apos;s FBN1 variant curation expert panel.\u003c/em\u003e Genome Med, 2024. \u003cstrong\u003e16\u003c/strong\u003e(1): p. 154.\u003c/li\u003e\n\u003cli\u003eOrenbuch, R., et al., \u003cem\u003eProteome-wide model for human disease genetics.\u003c/em\u003e Nat Genet, 2025. \u003cstrong\u003e57\u003c/strong\u003e(12): p. 3165-3174.\u003c/li\u003e\n\u003cli\u003eHuber, C.D., B.Y. Kim, and K.E. Lohmueller, \u003cem\u003ePopulation genetic models of GERP scores suggest pervasive turnover of constrained sites across mammalian evolution.\u003c/em\u003e PLoS Genet, 2020. \u003cstrong\u003e16\u003c/strong\u003e(5): p. e1008827.\u003c/li\u003e\n\u003cli\u003eSchwarz, J.M., et al., \u003cem\u003eMutationTaster2: mutation prediction for the deep-sequencing age.\u003c/em\u003e Nat Methods, 2014. \u003cstrong\u003e11\u003c/strong\u003e(4): p. 361-2.\u003c/li\u003e\n\u003cli\u003eDong, C., et al., \u003cem\u003eComparison and integration of deleteriousness prediction methods for nonsynonymous SNVs in whole exome sequencing studies.\u003c/em\u003e Hum Mol Genet, 2015. \u003cstrong\u003e24\u003c/strong\u003e(8): p. 
2125-37.\u003c/li\u003e\n\u003cli\u003eAlirezaie, N., et al., \u003cem\u003eClinPred: Prediction Tool to Identify Disease-Relevant Nonsynonymous Single-Nucleotide Variants.\u003c/em\u003e Am J Hum Genet, 2018. \u003cstrong\u003e103\u003c/strong\u003e(4): p. 474-483.\u003c/li\u003e\n\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":true,"hideJournal":true,"highlight":"","institution":"","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true},"keywords":"Pathogenicity Prediction, Vector Embedding, Large Language Model, ESMC-600M, Protein Structure, BRCA1/2, FBN1, ATM, PALB2","lastPublishedDoi":"10.21203/rs.3.rs-8605164/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-8605164/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003ch2\u003eBackground\u003c/h2\u003e \u003cp\u003eInterpreting the pathogenicity of genetic variants remains a critical bottleneck in genomic medicine. Millions of variants of uncertain significance (VUS) hinder the clinical application of genetic findings. 
Traditional computational approaches often rely on hand-engineered features and fail to capture the complexity of multidimensional genomic annotations fully.\u003c/p\u003e\u003ch2\u003eMethods\u003c/h2\u003e \u003cp\u003eWe developed VUS.Life, a multi-modal framework that synergizes semantic text embeddings of biological and clinical annotations with protein language modeling. We transformed variant annotations from Variant Effect Predictor (VEP) into natural language descriptions which are then converted into vector embeddings via established Large Language Models (LLMs), namely all-mpnet-base-v2, MedEmbed-large-v0.1, and text-embedding-004. Pathogenicity of a variant of interest is predicted by its proximity in the vector embedding space with variants of known pathogenicity. We further extended VUS.Life by employing residue-level delta embeddings from the ESMC-600M model to capture both clinical context and biophysical constraints.\u003c/p\u003e\u003ch2\u003eResults\u003c/h2\u003e \u003cp\u003eWe evaluated the framework on \u0026gt;\u0026thinsp;10,000 variants across \u003cem\u003eBRCA1/2\u003c/em\u003e, \u003cem\u003eFBN1\u003c/em\u003e, \u003cem\u003eATM\u003c/em\u003e, and \u003cem\u003ePALB2\u003c/em\u003e genes. VUS.Life achieved greater than 96% accuracy from using VEP annotations alone across all variant types and disease genes evaluated. Additionally, our unsupervised FBN1 structural analysis using ESMC-600M revealed that delta embeddings disentangled distinct pathogenic mechanisms, topologically separating disulfide bond disruptions from calcium-binding defects. 
These structural clusters correlated strongly with Zero-Shot Log-Likelihood Ratio (LLR) scores, validating evolutionary fitness as a proxy for pathogenicity.\u003c/p\u003e\u003ch2\u003eConclusions\u003c/h2\u003e \u003cp\u003eThis semantic embedding framework, VUS.Life, accurately captures pathogenicity-relevant features from complex variant annotations, enabling high-accuracy (\u0026gt;\u0026thinsp;96%) automated classification across multiple genes and models. The approach generalizes beyond well-curated genes and supports scalable, interpretable, and representation-based classification of VUS. It holds significant promise for alleviating the variant interpretation bottleneck in clinical genomics.\u003c/p\u003e","manuscriptTitle":"VUS. Life: Leveraging Vector Embeddings for Rapid and Accurate Pathogenicity Prediction of Genetic Variants","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2026-01-21 16:13:50","doi":"10.21203/rs.3.rs-8605164/v1","editorialEvents":[{"type":"communityComments","content":0}],"status":"published","journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"5eeff10a-d30e-4976-8375-7281d07cff89","owner":[],"postedDate":"January 21st, 2026","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"posted","subjectAreas":[],"tags":[],"updatedAt":"2026-03-06T08:11:14+00:00","versionOfRecord":[],"versionCreatedAt":"2026-01-21 
16:13:50","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-8605164","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-8605164","identity":"rs-8605164","version":["v1"]},"buildId":"XKTyCvWXoU3ODBz1xrDgd","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}
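The Methods summary states that the pathogenicity of a variant is predicted from its proximity, in the vector embedding space, to variants of known pathogenicity. A minimal sketch of that idea follows. It assumes cosine similarity and a majority vote over the k nearest labeled neighbors; the similarity measure, the value of k, the function name `predict_by_neighbors`, and the toy vectors are all illustrative assumptions, not details taken from the paper or its repository.

```python
import numpy as np
from collections import Counter


def predict_by_neighbors(query, ref_embeddings, ref_labels, k=3):
    """Label a query embedding by majority vote of its k most cosine-similar references.

    Hypothetical helper: the paper does not specify this exact procedure.
    Returns (label, fraction of the k neighbors that share that label).
    """
    ref = np.asarray(ref_embeddings, dtype=float)
    q = np.asarray(query, dtype=float)
    # Normalize so that a dot product equals cosine similarity.
    ref_unit = ref / np.linalg.norm(ref, axis=1, keepdims=True)
    q_unit = q / np.linalg.norm(q)
    sims = ref_unit @ q_unit
    # Indices of the k most similar reference variants, most similar first.
    top = np.argsort(-sims)[:k]
    votes = Counter(ref_labels[i] for i in top)
    label, count = votes.most_common(1)[0]
    return label, count / len(top)


# Toy 2-D vectors standing in for real annotation embeddings (made-up data).
ref_embeddings = [[1.0, 0.1], [0.9, 0.2], [0.8, 0.3],
                  [0.1, 1.0], [0.2, 0.9], [0.3, 0.8]]
ref_labels = ["pathogenic"] * 3 + ["benign"] * 3

label, support = predict_by_neighbors([0.95, 0.15], ref_embeddings, ref_labels, k=3)
print(label, support)
```

The returned neighbor fraction is only a rough indication of how unanimous the local neighborhood is; it is not a calibrated probability, and boundary cases of the kind noted under Limitations would show up as split votes.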



License: CC-BY-4.0