Conserved missense variant pathogenicity and correlated phenotypes across paralogous genes

doi:10.21203/rs.3.rs-5434140/v1

2024 · doi:10.21203/rs.3.rs-5434140/v1

Research Article

Tobias Bruenger, Alina Ivanuk, Eduardo Pérez-Palma, Ludovica Montanucci, and 7 more

This is a preprint; it has not been peer reviewed by a journal. https://doi.org/10.21203/rs.3.rs-5434140/v1 This work is licensed under a CC BY 4.0 License. Status: Under Revision. Version 1 posted 13

Abstract

Background

The majority of missense variants in clinical genetic tests are classified as variants of uncertain significance. Prior research has shown that the deleterious effects and the subsequent molecular consequences of variants are often conserved among paralogous protein sequences within a gene family. Here, we systematically quantified on an exome-wide scale whether the existence of pathogenic variants in paralogous genes at a conserved position could serve as evidence for the pathogenicity of a new variant.
For the gene family of voltage-gated sodium channels, where variants and expert-curated clinical phenotypes were available, we also assessed whether phenotype patterns of multiple disorders for each gene were conserved across variant positions within the gene family.

Methods

We developed a framework that assesses the presence of pathogenic missense variants located in conserved residues across paralogous genes. We systematically mapped 2.5 million pathogenic and general population variants from the ClinVar, HGMD, and gnomAD databases onto a total of 9,990 genes and aligned them by gene families. We evaluated the number of classifiable amino acids when using only the pathogenic variants reported in databases for the gene of interest, and then compared this number to the one obtained when paralogous pathogenic variants were included. We validated and quantified the evidence that conserved pathogenic paralogous variants provide for variant pathogenicity classification.

Results

Considering conserved pathogenic variants in paralogous genes increased the number of classifiable variants 2.8-fold across the exome, compared to pathogenic variants in the gene of interest alone. The presence of a pathogenic variant in a paralogous gene is associated with a positive likelihood ratio of 8.32 for variant pathogenicity. The likelihood ratio was gene family-specific. Across ten genes encoding voltage-gated sodium channels and 22 expert-curated disorders, we identified cross-paralog correlated phenotypes based on 3D structure spatial position. For example, the established loss-of-function disorders SCN1A-associated Dravet syndrome, SCN2A-associated autism, SCN5A-associated Brugada syndrome, and SCN8A-associated neurodevelopmental disorder without seizures were correlated in the spatial position of their variants on the protein structure. Finally, we show that phenotype integration in paralog variant selection improves variant classification.
Conclusion

Our results show that paralogous variants, in particular when combined with phenotype information, can enhance our understanding of variant effects.

Keywords: Variant classification, Paralogs, Genetics, Missense variants

Background

Large gene panels, exome, and genome sequencing have led to the identification of novel variants at an exponential rate( 1 ). Up to 80% of pathogenic variants are located within protein-coding regions of the gene( 2 ), with missense variants being particularly challenging to interpret due to the variety of molecular mechanisms through which they can cause disease. Furthermore, several disease-associated genes are pleiotropic, further complicating variant interpretation( 3 – 5 ). Despite these challenges, variant classification is necessary for diagnosing rare and genetically heterogeneous disorders, and for the development of personalized medicine. About 80% of genes associated with monogenic disorders are paralogs( 6 ). These paralogous genes can be grouped into 2,871 gene families as defined by the HUGO Gene Nomenclature Committee (HGNC)( 7 ) with > 80% sequence similarity( 8 ). Genes within a gene family arise from duplication events of common ancestral genes and can share > 90% amino acid sequence similarity at functionally essential protein domains( 9 ). We and others have shown that quantifying conservation across these paralogous genes and homologous domains is an effective strategy to distinguish between pathogenic and benign variants( 8 , 10 – 12 ). Molecular studies further indicate that the biophysical function of domains is conserved within a gene family. As a result, a single amino acid substitution at the same position of a homologous domain often leads to similar molecular effects across members of the same gene family( 11 , 13 , 14 ).
This suggests that a comprehensive understanding of variants in one gene can provide, through a form of knowledge "transfer", insights into the pathogenicity and the biological disease mechanisms of unstudied variants in its paralogs when these variants are located at identical positions. Within the same gene family, proteins show similar patterns of population-variant constraint and pathogenic-variant clustering. In addition to identifying conservation patterns within gene families, previous research has highlighted that the differential distribution of general population missense variants and pathogenic missense variants is consistent across a subset of paralogous genes( 11 , 14 ). Furthermore, our previous findings indicate that this regional clustering is prevalent across paralogous genes and enables a systematic identification of regions enriched with pathogenic variants, termed Pathogenic Variant Enriched Regions (PERs)( 15 ). Our study showed that novel missense variants located within PERs have a higher likelihood of being pathogenic compared to those in non-PER regions of the same gene( 15 ). However, this method currently has limited sensitivity, since many newly discovered variants are located outside of PERs. Moreover, as PERs typically define a larger protein region, interpretations regarding disease mechanisms are constrained to a regional context, preventing insights at the individual amino acid level. To standardize variant interpretation, the American College of Medical Genetics and Genomics (ACMG) published recommendations for evaluating the pathogenicity of variants( 16 ). However, > 45% of single nucleotide variants reported in the ClinVar database( 17 ) (accessed March 2023) are classified as variants of uncertain significance (VUS), due to the absence of sufficient evidence for or against variant pathogenicity.
The guidelines include criteria that utilize information from previous variant classifications e.g., the presence of an established pathogenic variant with the same amino acid exchange (PS1) or a different amino acid exchange (PM5) at the same position in the same gene that can provide strong to moderate evidence for pathogenicity( 16 ). However, since the vast majority of rare monogenic disorders are genetically heterogeneous and about half of the identified pathogenic variants have not yet been observed in other individuals( 18 , 19 ), the application of these evidence criteria is limited. In the present study, we extend prior work on gene family conservation to provide access to a paralog-based annotation that could improve the assessment of variant pathogenicity. We postulate that variants previously classified in conserved residues of paralogous genes can provide evidence for the pathogenicity of novel variants located at corresponding amino acid positions in these genes. The use of pre-classified variants in paralogs as evidence of pathogenicity has been previously suggested for a select group of genes e.g., by the RASopathy ClinGen Expert Panel( 20 ). However, the broad applicability of this approach across the entire protein-coding exome - particularly, the potential of single missense variants from paralogs as a feature to inform variant pathogenicity - remains unquantified and untested. In this proof of concept study, our findings reveal that for 519 gene families (comprising 1,459 genes) with high sequence similarity, the presence of a pathogenic variant in one gene family member at an equivalent protein position is associated with a significant increase in the likelihood of pathogenicity for a novel variant at a conserved paralogous site in the target gene. 
Additionally, we illustrate in a case study that integrating expert-curated clinical data across sodium channels can refine variant selection, which not only enhances variant pathogenicity classification but also identifies disorders across paralogs that likely share similar disease mechanisms.

Methods

Annotation of missense variants from public repositories

Missense variants from patients

Missense variants associated with disease were collected from the ClinVar database( 17 ) (ClinVar, release October 2019) and the Human Gene Mutation Database( 21 ) (HGMD®) Professional release 2019.2. Similarly, we gathered an updated version of the variants from ClinVar (released December 2022) and HGMD (Professional release 2023.1), processed them in the same way, and extracted all pathogenic variants not observed in the previous dataset to obtain an independent set of variants. The ClinVar missense variants were obtained in a tabular format from the FTP site ( ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/ ), and only those variants exclusively classified as "Pathogenic" and/or "Likely Pathogenic" in their final consensus interpretation were considered to ensure high stringency. The HGMD dataset was filtered for "missense variants," "High Confidence" calls (hgmd_confidence = "HIGH" flag), and "Disease causing" state (hgmd_variantType = "DM" flag). All annotations refer to the human reference genome version GRCh37.p13/hg19, and variants belonging to non-canonical transcripts as defined by Ensembl were excluded( 22 ). Since ClinVar and HGMD are not mutually exclusive, we used the union of both resources and removed duplicate entries by comparing HGVS annotations. We further refer to the combined set of variants classified as likely pathogenic, pathogenic, or "Disease-causing" as 'pathogenic variants'.
Missense variants from the population

Missense variants present in the Genome Aggregation Database( 23 ) (gnomAD, public release 2.0.2) were obtained in the Variant Call Format( 24 ) (VCF). We extracted the high-quality missense variants by filtering the VCF files on the "CSQ" field and the "PASS" flag. The annotations were based on the human reference genome version GRCh37.p13/hg19. We extracted only entries annotated to the canonical gene transcripts, as defined by Ensembl( 22 ). The aggregated population variants serve as control variants in our study and are further referred to as controls. Similarly, we gathered an updated version of the variants from gnomAD (public release 2.1.1), processed them as described above, and extracted all novel variants not observed in the previous set of gnomAD variants to obtain an independent set of control variants.

Annotation of missense variants and associated phenotypes for the voltage-gated sodium channels

Brain-related phenotypes

We aggregated published patient missense variants in voltage-gated sodium channel (VGSC) genes from the literature. All patient variants for SCN1A were obtained from two studies by Brunklaus et al., 2022( 25 , 26 ). Variants for SCN2A were obtained from Wolff et al., 2017( 27 ) and Crawford et al., 2021( 28 ). Variants for SCN3A were obtained from Zaman et al., 2018( 29 ). All SCN8A variants were taken from Johannesen et al., 2021( 4 ). Affected individuals were recruited through a network of collaborating clinicians, as well as GeneMatcher( 30 ), using a standardized phenotyping sheet to assess clinical characteristics, cognition, EEG, neuroimaging, and retrospective data on antiepileptic treatment.

Non-brain phenotypes

SCN5A variants were obtained from the studies conducted by Milman et al., 2021( 3 ), and Walsh et al., 2021( 31 ). Data for SCN4A, SCN9A, SCN10A, and SCN11A variants were collected from various publications listed in Supplementary Table 2.
Variants in VGSC-encoding genes that were not missense-constrained were filtered by maximum population allele frequency (MAF). We inferred the MAF thresholds using the approach described by Whiffin et al., 2017( 32 ), via the authors' app ( https://www.cardiodb.org/allelefrequencyapp ), based on each phenotype's estimated prevalence, mode of inheritance, and penetrance. After applying the MAF filter, we merged SCN4A variants related to myotonia congenita and paramyotonia congenita into one category (Relaxation Impairment Disorders), and SCN9A variants related to primary erythromelalgia and paroxysmal episodic pain disorder into another (Paroxysmal Pain Disorders), based on their shared molecular pathology and pathophysiology. We mapped all variants to their Ensembl canonical transcript( 24 ) ( SCN1A : ENST00000303395, SCN2A : ENST00000283256, SCN3A : ENST00000283254, SCN4A : ENST00000435607, SCN5A : ENST00000423572, SCN8A : ENST00000283254, SCN9A : ENST00000409672, SCN10A : ENST00000449082, SCN11A : ENST00000302328). Only phenotypes associated with variants at more than five different protein positions were considered. The original and harmonized phenotype annotations for each phenotype are listed in Supplementary Table 2.

Gene family definition

We obtained the paralogous genes that belong to a gene family from Pérez-Palma et al., 2020( 15 ), as originally described in Lal et al., 2020( 8 ). Briefly, the human paralog definitions were taken from Ensembl BioMart( 33 ) and filtered for those with an HGNC symbol( 7 ). For each gene, the canonical transcript as defined by Ensembl was considered. To avoid aligning highly diverged sequences, families with less than 80% similarity over the full protein sequence were removed.

Definition of paralogous variants

For all the protein sequences within the same gene family, we performed a multiple sequence alignment using the MUSCLE( 34 ) software.
We then mapped pathogenic and general population variants onto these multiple sequence alignments. Given two variants in two different genes of the same gene family, we considered them paralogous variants if they satisfied the following two conditions: 1) they are located at the same position in the multiple protein sequence alignment of the gene family, and 2) the reference amino acid in the target gene and the paralogous gene is the same (Supplementary Fig. 1). We further established an expanded set of criteria, termed para-PS1 and para-PM5, defined as follows:

para-PS1: a pathogenic paralogous variant that exhibits the same amino acid substitution as the investigated variant.

para-PM5: a pathogenic paralogous variant that exhibits a different amino acid substitution compared to the investigated variant.

Calculation of the positive likelihood ratio when a pathogenic paralogous variant is found

For each gene, we calculated the positive likelihood ratio for the para-PS1/PM5 criteria using our aggregated set of pathogenic and general population variants (Supplementary Fig. 1). Following the definition of the criteria (see above), we counted for each gene i) the number of pathogenic variants for which at least one pathogenic paralogous variant was observed and ii) the number of pathogenic variants for which no pathogenic paralogous variant was observed. For the same gene we also counted i) the number of control variants for which at least one pathogenic paralogous variant was observed and ii) the number of control variants for which no pathogenic paralogous variant was observed. To determine the level of evidence each criterion can provide, we calculated the positive likelihood ratios for two cases: A) presence of a pathogenic paralogous variant with the same amino acid substitution (para-PS1) and B) presence of a pathogenic paralogous variant with a different amino acid substitution (para-PM5).
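The two conditions for a paralogous variant and the para-PS1/para-PM5 distinction can be illustrated with a minimal Python sketch. It is not the authors' implementation (the analyses were performed in R); the `Variant` class, the function names, the toy two-gene alignment, and the choice to let para-PS1 take precedence over para-PM5 when both apply are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    gene: str
    position: int  # 1-based residue position in the gene's own protein sequence
    ref: str       # reference amino acid (one-letter code)
    alt: str       # substituted amino acid (one-letter code)


def alignment_columns(aligned_seq):
    """Map 1-based residue positions to 0-based alignment columns, skipping gaps."""
    mapping = {}
    residue = 0
    for column, char in enumerate(aligned_seq):
        if char != "-":
            residue += 1
            mapping[residue] = column
    return mapping


def paralog_evidence(query, pathogenic, alignment):
    """Return 'para-PS1', 'para-PM5', or None for a query variant.

    A pathogenic variant in another gene counts as paralogous if it sits in
    the same alignment column AND has the same reference amino acid.
    Assumption: para-PS1 wins when both kinds of evidence exist.
    """
    columns = {gene: alignment_columns(seq) for gene, seq in alignment.items()}
    query_column = columns[query.gene].get(query.position)
    if query_column is None:
        return None
    label = None
    for other in pathogenic:
        if other.gene == query.gene or other.gene not in columns:
            continue
        same_column = columns[other.gene].get(other.position) == query_column
        if not (same_column and other.ref == query.ref):
            continue
        if other.alt == query.alt:
            return "para-PS1"
        label = "para-PM5"
    return label


# Hypothetical toy alignment: GENE_A has a gap where GENE_B carries an extra residue.
TOY_ALIGNMENT = {"GENE_A": "MK-RLT", "GENE_B": "MKARLS"}
TOY_PATHOGENIC = [Variant("GENE_B", 4, "R", "H"), Variant("GENE_B", 6, "S", "A")]

for query in (Variant("GENE_A", 3, "R", "H"),
              Variant("GENE_A", 3, "R", "C"),
              Variant("GENE_A", 5, "T", "A")):
    print(query, paralog_evidence(query, TOY_PATHOGENIC, TOY_ALIGNMENT))
```

In the toy data, the third query shares an alignment column with a pathogenic GENE_B variant but has a different reference amino acid, so condition 2 fails and no evidence is assigned.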
The positive likelihood ratio was computed using the sensitivity and specificity of the test:

Equation 1: $$\mathrm{LR}^{+} = \frac{\mathrm{Sensitivity}}{1-\mathrm{Specificity}} = \frac{\frac{TP}{TP+FN}}{1-\frac{TN}{TN+FP}}$$

where LR+ represents the positive likelihood ratio; TP (true positives) denotes the number of pathogenic variants for which a pathogenic variant is observed at a conserved corresponding paralogous residue position; TN (true negatives) indicates the number of variants from the general population for which no pathogenic variant is observed at a corresponding paralogous residue position; FP (false positives) represents the number of population variants for which a pathogenic variant is observed at a conserved corresponding paralogous residue position; and FN (false negatives) denotes the number of pathogenic variants for which no pathogenic variant is observed at a corresponding paralogous residue position. We calculated the LR+ both individually for each gene and combined across all genes. For the gene-wise metric, we counted the variants denoting TP, FP, TN, and FN for each gene separately. For the combined metric, we pooled the numbers for TP, FP, TN, and FN across all disease-associated genes within a gene family to obtain a single LR+. All analyses were performed using R v.4.2.1.

Comparison to established gene-family-based approaches

To compare our results to an established gene-family-based approach that identified pathogenic enriched regions (PERs) across paralogous genes on an exome-wide scale( 14 ), we gathered an independent set of variants (see Annotation of missense variants from public repositories) that had been used neither in the PER approach nor in the enrichment analysis of this study, and we repeated the calculation outlined above.
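As an illustration of Equation 1 and of the difference between the gene-wise and the combined metric, the following Python sketch computes LR+ from the four counts. The analyses themselves were performed in R; the `Counts` type, the function names, and the example counts are hypothetical and chosen only to make the arithmetic easy to follow.

```python
from typing import NamedTuple


class Counts(NamedTuple):
    tp: int  # pathogenic variants with a pathogenic paralogous variant
    fp: int  # control variants with a pathogenic paralogous variant
    tn: int  # control variants without a pathogenic paralogous variant
    fn: int  # pathogenic variants without a pathogenic paralogous variant


def positive_likelihood_ratio(c):
    """LR+ = sensitivity / (1 - specificity), following Equation 1."""
    sensitivity = c.tp / (c.tp + c.fn)
    # 1 - TN/(TN+FP) is algebraically FP/(FP+TN); computed directly here.
    false_positive_rate = c.fp / (c.fp + c.tn)
    if false_positive_rate == 0:
        raise ValueError("LR+ is undefined when no control variant is flagged")
    return sensitivity / false_positive_rate


def pooled_counts(per_gene):
    """Combined metric: sum TP, FP, TN, FN over genes before computing LR+."""
    return Counts(*(sum(values) for values in zip(*per_gene)))


# Hypothetical counts for two genes of one gene family.
gene_a = Counts(tp=30, fp=5, tn=95, fn=70)
gene_b = Counts(tp=10, fp=5, tn=195, fn=90)

print(positive_likelihood_ratio(gene_a))
print(positive_likelihood_ratio(gene_b))
print(positive_likelihood_ratio(pooled_counts([gene_a, gene_b])))
```

Note that the pooled LR+ is computed from summed counts and is therefore not the average of the gene-wise values; genes with more variants contribute more weight.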
To estimate LR+ values that are not mediated by paralog conservation, we repeated the analysis described above for three paralog conservation subgroups using the Parazscore( 8 ). The groups we considered are alignment positions with a gene-family-wise 1) maximum Parazscore, indicating full paralog conservation across the gene family at the alignment position; 2) Parazscore > 0 but not the maximum Parazscore, indicating high but not full paralog conservation at the alignment position; and 3) Parazscore < 0, indicating low levels of conservation between paralogous genes at the alignment position.

Identification of phenotype correlation based on 3D-variant positions

To identify phenotypes associated with variants located at corresponding positions across voltage-gated sodium channels (VGSCs), we evaluated, for each phenotype, the spatial distribution of its associated variants on the sodium channel protein structure. We tallied the number of patients reported for each variant in every phenotype. Since not all sodium channels had available protein structures, we mapped the patient variants and their corresponding number of patients onto the Nav1.2 protein structure (PDB-ID: 6j8e) using the multiple protein sequence alignment. We only considered patient variants that could be mapped to the protein structure for downstream analysis. For every residue in the Nav1.2 protein structure, we counted the number of patients with a variant in the residue or its local 3D neighborhood using a 5-angstrom radius cutoff, as previously introduced in Iqbal et al., 2022( 35 ). The number of patients with variants at a certain residue position was evaluated independently for each phenotype. To identify phenotypes associated with variants at similar 3D-variant positions, we calculated the Pearson correlation between the 3D-variant distributions associated with each pair of phenotypes.
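The neighborhood counting and correlation steps described above can be sketched in Python as follows. This is a simplified illustration rather than the authors' code: it assumes one representative coordinate per residue (which atoms define the 5-angstrom neighborhood is not specified here), and the coordinates, patient counts, and function names are hypothetical.

```python
import math


def neighborhood_counts(coords, patients_per_residue, radius=5.0):
    """For each residue, sum the patients at residues within `radius` angstroms.

    `coords` maps residue index -> (x, y, z); the residue itself is included
    because its distance to itself is zero.
    """
    counts = {}
    for res_i, xyz_i in coords.items():
        total = 0
        for res_j, xyz_j in coords.items():
            if math.dist(xyz_i, xyz_j) <= radius:
                total += patients_per_residue.get(res_j, 0)
        counts[res_i] = total
    return counts


def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical structure: four residues on a line, the last one far away.
toy_coords = {1: (0.0, 0.0, 0.0), 2: (4.0, 0.0, 0.0),
              3: (8.0, 0.0, 0.0), 4: (20.0, 0.0, 0.0)}
phenotype_a = {1: 2, 3: 1}  # residue -> number of patients
phenotype_b = {2: 4}

counts_a = neighborhood_counts(toy_coords, phenotype_a)
counts_b = neighborhood_counts(toy_coords, phenotype_b)
residues = sorted(toy_coords)
r = pearson([counts_a[i] for i in residues], [counts_b[i] for i in residues])
print(counts_a, counts_b, r)
```

Because each phenotype is reduced to a per-residue vector on the same reference structure, phenotypes from different genes become directly comparable once their variants have been mapped through the alignment.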
Integrating variant similarity between phenotypes for the assessment of paralogous variant-based pathogenicity

We next explored whether utilizing phenotype correlation could refine the selection of variants for our paralogous patient variant approach. To test this hypothesis, we first extracted the variants of the most common phenotype in each sodium channel with > 40 different variants ( SCN1A : Dravet syndrome, SCN2A : early-onset developmental and epileptic encephalopathy (DEE), SCN5A : Brugada syndrome, SCN8A : DEE). We divided these cohorts randomly into four subsets of patient variants, each containing 25% of the variants. We then combined three of the four subsets (representing 75% of variants for each phenotype) with our remaining patient cohort containing all variants associated with other phenotypes. Following the approach outlined in the previous section, we then identified 3D-variant position-based phenotype correlations. Finally, using the independent test cohort (the fourth subset), we calculated the LR+ of patient vs control variants a) using paralogous pathogenic variants associated with non-correlated phenotypes and b) using paralogous pathogenic variants with significant (Bonferroni-adjusted P < 0.05) 3D-position-based phenotype correlation. We repeated this approach three times, such that each subset of variants was used as part of the training set three times and once as the test set, and calculated the LR+ by summing up the individual TP, FP, TN, and FN values of each iteration.
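The four-subset rotation and the pooling of counts across iterations can be sketched as below. This Python example is an assumption-laden illustration, not the original R code: the random seed, the function names, and the per-iteration counts are hypothetical, and the step that recomputes phenotype correlations on each training set is omitted.

```python
import random


def four_fold_splits(variants, k=4, seed=0):
    """Yield (train, test) pairs so that each variant is in the test set exactly once."""
    shuffled = list(variants)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [v for j, fold in enumerate(folds) if j != i for v in fold]
        yield train, test


def pooled_lr_plus(per_iteration_counts):
    """Sum TP, FP, TN, FN over all iterations, then compute a single LR+."""
    tp = sum(c["TP"] for c in per_iteration_counts)
    fp = sum(c["FP"] for c in per_iteration_counts)
    tn = sum(c["TN"] for c in per_iteration_counts)
    fn = sum(c["FN"] for c in per_iteration_counts)
    sensitivity = tp / (tp + fn)
    false_positive_rate = fp / (fp + tn)
    return sensitivity / false_positive_rate


# Hypothetical variant identifiers and per-iteration confusion counts.
toy_variants = [f"variant_{i}" for i in range(8)]
splits = list(four_fold_splits(toy_variants))
toy_counts = [{"TP": 5, "FP": 1, "TN": 19, "FN": 5} for _ in splits]
print(len(splits), pooled_lr_plus(toy_counts))
```

Summing the counts before taking the ratio avoids unstable per-iteration LR+ values when a single test subset contains very few flagged control variants.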
Results

Incorporating pathogenic paralogous variants triples classifiable amino acid residues

The ACMG guidelines suggest that two scenarios can be considered when determining the pathogenicity of novel variants: ( 1 ) the presence of a pathogenic variant in the same gene with an identical amino acid change, irrespective of the nucleotide alteration, and ( 2 ) a novel amino acid substitution at a position where another substitution had previously been considered pathogenic, named the PS1 and PM5 criteria, respectively( 16 ). In this study, our objective was to explore whether this principle could be extended to pathogenic variants in paralogous genes. We specifically assessed whether the existence of pathogenic variants in paralogous genes at a conserved, corresponding position could serve as evidence for the pathogenicity of a new variant. For our study, we defined a 'paralogous variant' as a variant that meets two conditions: ( 1 ) it is positioned in a paralogous gene at the analogous residue index position, as delineated by multiple sequence alignment (refer to Methods for details), and ( 2 ) it shares the same reference amino acid as the target gene. First, we assessed the number of amino acid residues that do not overlap with pathogenic variants within the same gene but do overlap with pathogenic variants in paralogous genes at equivalent amino acid positions. We aggregated a total of 60,486 pathogenic variants from ClinVar( 17 ) and HGMD( 21 ) and mapped them to 2,871 different gene family alignments, consisting of 9,990 genes (Fig. 1 ). Our paralog variant analysis integrates pathogenic variants from multiple genes in the same gene family (see Methods for details). We therefore restricted the dataset to gene families harboring pathogenic variants in at least two genes and identified 1,459 genes from 519 gene families.
Within these genes, 41,223 pathogenic missense variants and 171,690 pathogenic paralogous variants were found that covered 32,137 and 91,259 amino acid residues, respectively (Supplementary Table 1). Of the 91,259 residues covered by a paralogous pathogenic variant, 92.6% (N = 84,553 residues) were not covered by a pathogenic variant in the same gene. Therefore, the integration of paralogous pathogenic variants would increase the number of amino acids in these gene families where the criteria can be applied by about 3.6-fold (N = 116,690 residues, Fig. 2 A). The increase in the number of classifiable amino acids in each gene family is highly correlated with the number of disease-associated genes in the gene family (R = 0.97, P < 1e-300, Supplementary Fig. 2).

Presence of single pathogenic paralogous variants can be used to assess variant pathogenicity

Next, we quantified the value of incorporating pathogenic variants at paralogous positions for assessing the pathogenicity of novel variants. To this end, in addition to the aforementioned pathogenic variants, we included 2,478,899 variants from the gnomAD database( 23 ), which served as controls in our study. When a pathogenic paralogous variant with the same amino acid exchange was present at a corresponding alignment index position (termed the para-PS1 criterion; for details on the approach, see Methods), we observed across 519 gene families an average LR+ of 8.32 (8.02–8.62, 95% confidence interval (CI), Fig. 2 B). Restricting the analysis to missense variant-constrained genes (missense z-score > 3.09( 1 )) increased the LR+ to 8.91 (8.03–9.88, 95% CI, Fig. 2 C). Notably, even for paralogous variants with a different substitution at the same alignment index position (termed the para-PM5 criterion), we observed an increased LR+ (all genes: LR+ = 4.32 (4.24–4.48, 95% CI); missense-constrained genes: LR+ = 6.48 (6.05–6.94, 95% CI)). Overall, we observed a wide range of LR+ across different genes (Fig. 2 B, C).
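As a consistency check on the residue counts reported earlier in this section, the roughly 3.6-fold increase follows directly from adding the 84,553 residues covered only by paralogous pathogenic variants to the 32,137 residues already covered by pathogenic variants in the same gene:

$$32{,}137 + 84{,}553 = 116{,}690, \qquad \frac{116{,}690}{32{,}137} \approx 3.63, \qquad \frac{84{,}553}{91{,}259} \approx 0.9265$$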
The presence of pathogenic paralogous variants provides evidence for pathogenicity beyond evolutionary conservation

Variant mapping across paralogous residues requires residue conservation. Next, we investigated the added value of mapping beyond conservation. Previously, we developed the Parazscore( 8 ) to measure conservation across paralogous genes, showing that amino acids conserved within a gene family are significantly enriched for pathogenic variants. Notably, a fundamental prerequisite for incorporating pathogenic paralogous variants into variant assessment is the conservation of amino acid residues between the target gene and its paralogous gene. Hence, whenever the pathogenic paralogous variant criteria are applied, a certain degree of conservation within the genes of the same gene family is inevitable. This conservation likely explains a portion of the elevated LR+ we observed. While many methods( 36 – 38 ) employ evolutionary conservation as a predictor of variant pathogenicity, it is crucial to discern the added value our approach provides beyond conservation-based evidence alone. To achieve this, we repeated our previous analysis after grouping amino acid residues by their degree of conservation across paralogs (see Methods for details). Interestingly, within these subgroups, the highest LR+ values were observed for residues exhibiting the least paralog conservation, for both the para-PS1 criterion (Parazscore < 0; LR+ para-PS1 = 10.49, 95% CI = 9.60–11.45, Fig. 3 A) and the para-PM5 criterion (Parazscore < 0; LR+ para-PM5 = 5.21, 95% CI = 4.87–5.59, Fig. 3 B). Yet, even within the subgroup with the smallest increase in LR+, where conservation was maximal across all paralogous genes of the same gene family, we still detected an increased LR+ of 4.88 and 2.69 for the para-PS1 and para-PM5 criteria, respectively.
This observation suggests that the existence of pathogenic paralogous variants provides additional information beyond the level of conservation between paralogous genes.

Integrating single pathogenic paralogous variants improves a previous family-based variant interpretation approach

We compared our approach, which uses paralogous pathogenic variants located at corresponding amino acids, to a previously published method( 15 ). In contrast to our new approach, the published approach identifies 'pathogenic variant enriched regions' ('PERs', on average 33 consecutive amino acids( 15 )) across a gene family that are consistently enriched for pathogenic variants while depleted of control variants. Because PERs are identified with a sliding-window approach, they can span amino acid residues without an established pathogenic variant in any paralog, with the regional association derived from adjacent variants. However, identifying PERs within a gene family alignment requires a large number of pathogenic variants, limiting its applicability. First, we compared the number of exome-wide classifiable variants using single paralogous pathogenic variants with the PER approach. We used an independent set of pathogenic and control variants that were not utilized in the PER generation or the application of the para-PS1/PM5 criteria (see Methods for details). We found that the approach based on single paralogous pathogenic variants captured 2.2 times more residues than PERs (Fig. 3 C). Second, we compared the LR+ of each approach and observed similar values for the PER approach and the para-PS1 approach (LR+ PER = 5.28, LR+ para-PS1 = 5.63, Fig. 3 D).

Leveraging phenotype correlations across paralogs can enhance pathogenicity assessment

A single gene can be associated with different disorders.
The number of disorders associated with variants in the same gene frequently correlates with the number of different molecular functional defects. Given that structure determines function, the molecular consequences of variants often relate to their specific position within the protein structure( 39 ). Thus, pinpointing phenotype correlations based on analogous variant distributions might reveal paralogous variants with consistent molecular effects. In the context of voltage-gated sodium channels (VGSCs), past research has underscored not only the conservation of pathogenicity but also the consistent functional effects among paralogous variants( 13 ). Building on this, we hypothesized that uncovering phenotype correlations across VGSCs could fine-tune the application of pathogenic paralogous variants for variant pathogenicity assessment. Specifically, we hypothesized that within-gene-family analysis could identify phenotypes that are correlated in their substitution positions, and that selecting paralogous variants from such correlated phenotypes would increase the likelihood of conserved pathogenicity for variants at equivalent positions. To test this hypothesis, we curated a comprehensive dataset of 1,346 affected individuals, associated with 22 diverse phenotypes and carrying 886 unique missense variants in VGSC-encoding genes (detailed in Supplementary Table 2). By mapping variants onto the same structure via their alignment positions and performing a spatial proximity-based phenotype correlation analysis (see Methods for details), we identified position-correlated phenotypes within the gene family (Fig. 4 A). For example, SCN1A-associated Dravet syndrome variants exhibited 3D positional correlations with SCN2A variants associated with autism (R = 0.31, P = 2.1e-35) and with Brugada syndrome variants in SCN5A (R = 0.29, P = 2.8e-40). For genes associated with several related disorders, such as the VGSC genes, variant classification is challenging since phenotype specificity is not high.
Therefore, not all variants classified as pathogenic may be correctly classified. Next, we tested whether variants from spatially correlated phenotypes across different paralogous genes could increase the accuracy of variant pathogenicity classification. We selected the most frequently reported phenotypes for VGSC genes with at least 40 patients. The four genes SCN1A, SCN2A, SCN5A, and SCN8A fulfilled this criterion. We divided the associated variants into four subsets and calculated the evidence for variant pathogenicity (see Methods for details). We observed a 3- to 8-fold higher positive likelihood ratio for paralogous variants associated with 3D-position correlated phenotypes than for paralogous variants without a significant 3D-position correlation (Fig. 4B). For example, for SCN8A DEE cases, pathogenic paralogous variants whose phenotypes correlate with DEE in SCN8A (LR+ = 34.7, CI 16.3) showed 8.6-fold greater strength for assessing variant pathogenicity than pathogenic paralogous variants found in cases with non-correlating phenotypes (LR+ = 4.0, CI 1.8–8.9).

Discussion

Many paralogs are highly conserved in sequence and have similar biophysical molecular functions. Current variant interpretation guidelines only consider previously classified pathogenic missense variants in the gene of interest as evidence for pathogenicity. Here, we developed and validated a bioinformatic framework to integrate pathogenic missense variants in paralogous genes at corresponding alignment index positions as evidence for the pathogenicity of novel variants. We demonstrated that integrating paralogous pathogenic variants located at a corresponding protein position can provide evidence for pathogenicity even if the amino acid exchange is not conserved.
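The evidence strengths reported in this work are positive likelihood ratios (LR+). As a minimal sketch of how such a value can be obtained from a 2x2 table of pathogenic and control variants, the following Python function computes LR+ together with a confidence interval. The function name, the example counts, and the choice of the log-method interval are illustrative assumptions; the source does not state which interval method the authors used.

```python
import math
from typing import Tuple


def positive_likelihood_ratio(tp: int, fn: int, fp: int, tn: int,
                              z: float = 1.96) -> Tuple[float, float, float]:
    """Return (LR+, lower, upper) for a 2x2 table.

    tp: pathogenic variants at residues that meet the criterion
    fn: pathogenic variants at residues that do not meet it
    fp: control variants at residues that meet the criterion
    tn: control variants at residues that do not meet it
    The interval uses the log method (an assumption, see lead-in).
    """
    sensitivity = tp / (tp + fn)
    false_positive_rate = fp / (fp + tn)
    lr_plus = sensitivity / false_positive_rate

    # Standard error of log(LR+) for two independent proportions.
    se_log = math.sqrt(1 / tp - 1 / (tp + fn) + 1 / fp - 1 / (fp + tn))
    lower = math.exp(math.log(lr_plus) - z * se_log)
    upper = math.exp(math.log(lr_plus) + z * se_log)
    return lr_plus, lower, upper


# Hypothetical counts, chosen only to illustrate the call.
lr, lo, hi = positive_likelihood_ratio(tp=50, fn=150, fp=10, tn=190)
print(f"LR+ = {lr:.2f} (95% CI {lo:.2f} to {hi:.2f})")
```

Dividing two such LR+ values gives the fold difference in evidence strength between two sets of paralogous variants. The function assumes non-zero counts in every cell; a table with empty cells would need a continuity correction.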
Compared to approaches such as the PS1 and PM5 criteria of the ACMG guidelines (16), which consider pathogenic variants in the same gene at the same position as evidence, our approach can be applied to 3.6-fold more protein residues where novel variants of unknown pathogenicity could be observed. Pathogenic missense variants in paralogous genes can serve as a proxy for pathogenicity. Within a protein sequence, pathogenic variants are unevenly distributed and tend to accumulate in certain regions that are critical for protein function (40). These pathogenic variant-enriched regions have proven valuable for variant classification through established guidelines for variant interpretation (16) and the use of in silico prediction algorithms (41). Moreover, the observation that critical protein regions tend to be evolutionarily conserved between paralogous genes can be harnessed to enhance statistical robustness by incorporating pathogenic variants across these paralogous genes (15). Still, about 70% of pathogenic variants are located outside the regions identified as essential. As a result, individual pathogenic variants in paralogous genes outside these regions were not considered for variant interpretation. In a study examining long QT syndrome, it was observed that individual pathogenic variants in paralogous genes are often located at paralogous positions as determined from multiple sequence alignments (11), suggesting that the presence of a pathogenic variant at a particular position may serve as a proxy for pathogenicity at that alignment position in other paralogs. Our data test this hypothesis across a wide range of gene families and suggest that individual pathogenic paralogous variants can indeed serve as proxies for pathogenicity on a broad scale, thereby extending the use of established pathogenic variants in variant interpretation frameworks.
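The idea of treating a pathogenic variant at the corresponding alignment position in a paralog as evidence can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: the gene names and sequences are toy data, the function names are hypothetical, and the rule used here (identical reference residue in the paralog; same amino acid exchange labeled para-PS1, different exchange labeled para-PM5) is an assumed reading of the criteria by analogy with ACMG PS1 and PM5.

```python
from typing import Dict, Optional, Set, Tuple

# (gene, 1-based protein position, alternate amino acid)
Variant = Tuple[str, int, str]


def residue_to_column(aligned_seq: str) -> Dict[int, int]:
    """Map 1-based residue positions to 0-based alignment columns, skipping gaps."""
    mapping: Dict[int, int] = {}
    position = 0
    for column, symbol in enumerate(aligned_seq):
        if symbol != "-":
            position += 1
            mapping[position] = column
    return mapping


def para_criterion(alignment: Dict[str, str], gene: str, position: int,
                   alt: str, pathogenic: Set[Variant]) -> Optional[str]:
    """Return 'para-PS1', 'para-PM5', or None for a query missense variant."""
    column = residue_to_column(alignment[gene]).get(position)
    if column is None:
        return None
    ref = alignment[gene][column]
    result = None
    for paralog, aligned_seq in alignment.items():
        # Skip the query gene itself, gaps, and non-conserved residues.
        if paralog == gene or aligned_seq[column] != ref:
            continue
        # 1-based residue position of the paralog at this alignment column.
        paralog_pos = len(aligned_seq[:column + 1].replace("-", ""))
        for p_gene, p_pos, p_alt in pathogenic:
            if p_gene == paralog and p_pos == paralog_pos:
                if p_alt == alt:
                    return "para-PS1"  # same exchange at the paralogous residue
                result = "para-PM5"    # different exchange at the same residue
    return result


# Toy alignment of two hypothetical paralogs (gap in GENE_A at column 2).
alignment = {"GENE_A": "MK-LRT", "GENE_B": "MKALRS"}
known_pathogenic = {("GENE_B", 5, "H")}  # hypothetical pathogenic variant R5H
print(para_criterion(alignment, "GENE_A", 4, "H", known_pathogenic))
print(para_criterion(alignment, "GENE_A", 4, "C", known_pathogenic))
```

Working in alignment columns rather than raw residue numbers is what makes positions comparable across paralogs of different length, since a gap in one sequence shifts all downstream residue numbers.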
Pathogenic variants in voltage-gated sodium channel (VGSC) genes are associated with a broad spectrum of clinical phenotypes, even within the same gene (4, 25, 27, 29). Prior research demonstrated a strong correlation between different molecular variant effects, such as the gain or loss of a protein function, and the clinical phenotype (42). We identified phenotypes across VGSC genes expressed in different organs or cell types that are caused by corresponding paralogous variants located at the same alignment index position. The location of a variant in the VGSC protein structure, particularly in critical regions like the selectivity filter or the inactivation gate, is often associated with conserved molecular function (13). Our findings of 3D-position-based phenotype correlations across VGSC genes likely identify phenotypes caused by variants in paralogous genes with similar molecular effects. The framework we developed assumes that both pathogenicity and the molecular impact of a variant are generally conserved. We confirmed that pathogenicity is often preserved across paralogous genes at conserved residues. Nonetheless, our results suggest that applying correlations derived from the 3D positioning of these variants can potentially identify cases where this conservation does not hold or where variants previously classified as pathogenic were misclassified. Despite efforts to standardize criteria for pathogenicity assignment (16) and many improvements in variant interpretation, about 75% of missense variants in ClinVar (17) (accessed 12/2022) are annotated as variants of uncertain significance (VUS). Extending or modifying existing ACMG criteria has been demonstrated as a promising approach to reclassifying VUSs (20, 38, 43–45).
We demonstrated that the PS1 and PM5 criteria of the ACMG guidelines could, in principle, be extended by considering already classified pathogenic variants at corresponding amino acid positions as evidence for pathogenicity. This approach was previously suggested in ClinGen expert-curated guidelines for a small set of genes associated with RASopathies (20). Here, we have demonstrated the generalizability of the approach across a large set of 519 gene families and quantified the evidence gained from it. Our proposed inclusion of paralogous variants as biologically interpretable evidence of variant pathogenicity has several limitations. First, incorporating pathogenic variants at paralogous positions into the established ACMG/AMP variant classification guidelines requires careful evaluation. This is due to the potential overlap between the underlying data supporting an extension of the PS1/PM5 criteria to paralogous genes and the data already covered by the existing guidelines. Notably, the in silico criteria PP3/BP4 overlap with this evidence, given that many predictive models, such as REVEL (36) or BayesDel (46), incorporate evolutionary conservation as a fundamental training feature. The para-PS1/PM5 criteria we defined likewise require conservation across paralogous genes at the specified position and thus also draw on evidence derived from evolutionary conservation. We demonstrated that the orthologous conservation commonly harnessed in most in silico predictive scores differs from the evolutionary insights acquired from paralogous gene analyses, although the two are correlated (15). Furthermore, we demonstrated in this study that even for residues similarly conserved across paralogs, the presence of a pathogenic variant at a conserved paralogous residue provides additional evidence supporting pathogenicity.
Nevertheless, implementing criteria based on pathogenic variants at paralogous positions alongside PP3/BP4 requires a rigorous analysis of the distinct evidence that these variants provide beyond the selected PP3 score, to ensure that information is not counted twice. Incorporating evidence from pathogenic variants at paralogous positions, especially when other related criteria are considered concurrently for the final classification, therefore carries a risk of inadvertently over-representing shared underlying information. This could lead to an inflated assessment of evidence either supporting or contesting pathogenicity. Second, the set of pathogenic variants at paralogous positions integrated in our framework could be inflated by spliceogenic exonic variants (47). Although previous results suggest that their impact on our approach might be minor, excluding variants with a predicted high splicing impact could resolve this concern. Third, our control variants were aggregated from the gnomAD database, and some of them may be pathogenic despite their presence in the general population. Where these control variants are indeed pathogenic, the likelihood ratios calculated in our study may be underestimates, which keeps our findings conservative.

Conclusion

In conclusion, our findings suggest that utilizing pathogenic paralogous variants has significant potential to improve variant interpretation and aid the identification of pathogenic variants in clinical practice. Reference databases continue to grow and include well-classified pathogenic variants.
While we have demonstrated that pathogenic variants in paralogous genes at the same alignment position provide evidence for pathogenicity across all disease-associated gene families, the potential integration of these criteria into the ACMG classification framework would require a careful approach to avoid double counting due to correlation with other criteria that use evolutionary conservation as a feature (e.g., in silico prediction scores). Future iterations of variant interpretation guidelines that consider the presence of paralogous pathogenic variants as evidence of pathogenicity could thus significantly increase the application of criteria based on already established pathogenic variants.

Abbreviations

HGNC: HUGO Gene Nomenclature Committee; PER: Pathogenic Variant Enriched Regions; ACMG: American College of Medical Genetics and Genomics; VUS: Variants of Uncertain Significance; HGMD: Human Gene Mutation Database; gnomAD: Genome Aggregation Database; MAF: Maximum Population Frequency; VGSC: Voltage-Gated Sodium Channels; LR+: Positive Likelihood Ratio; FN: False Negative; FP: False Positive; TP: True Positive; TN: True Negative; DEE: Developmental Epileptic Encephalopathy

Declarations

Ethics approval and consent to participate: Not applicable

Consent for publication: Not applicable

Availability of data and materials: Data are available in the Supplementary Tables. Code to annotate para-PS1/PM5 criteria with a custom variant dataset for any gene family can be obtained from https://github.com/TobiasBruenger/paraPS1-PM5-annotation.

Competing interests: The authors report no conflicts of interest.
Funding: Funding for this work was provided by the German Federal Ministry for Education and Research (BMBF, Treat-ION, 01GM1907D) to D.L., T.B., and P.M., by the BMBF (Treat-Ion2, 01GM2210B) to P.M., the Fonds National de la Recherche in Luxembourg (FNR, Research Unit FOR-2715, INTER/DFG/21/16394868 MechEPI2) to P.M., the Agencia Nacional de Investigación y Desarrollo de Chile (ANID, Fondecyt 1221464 grant) to E.P., the Familie SCN2A foundation 2020 Action Potential Grant to E.P., the Dravet Syndrome Foundation (grant number 272016) to D.L., and the NIH NINDS (Channelopathy-Associated Epilepsy Research Center, 5-U54-NS108874) to D.L.

Authors' Contributions: Conceptualization: T.B., A.I., D.L., I.H.; Data curation: T.B., A.I., E.P.; Analysis: T.B., A.I.; Supervision: D.L., P.M., M.N.; Writing, original draft: T.B., A.I., L.M.; Writing, editing: D.L., P.M., S.C., L.S., S.P., L.M.

Acknowledgement: Not applicable.

References

Lek M, Karczewski KJ, Minikel EV, Samocha KE, Banks E, Fennell T, et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature. 2016 Aug 18;536(7616):285–91. Choi M, Scholl UI, Ji W, Liu T, Tikhonova IR, Zumbo P, et al. Genetic diagnosis by whole exome capture and massively parallel DNA sequencing. Proc Natl Acad Sci U S A. 2009 Nov 10;106(45):19096–101. Milman A, Behr ER, Gray B, Johnson DC, Andorin A, Hochstadt A, et al. Genotype-Phenotype Correlation of SCN5A Genotype in Patients With Brugada Syndrome and Arrhythmic Events: Insights From the SABRUS in 392 Probands. Circ Genom Precis Med. 2021 Oct;14(5):e003222. Johannesen KM, Liu Y, Koko M, Gjerulfsen CE, Sonnenberg L, Schubert J, et al. Genotype-phenotype correlations in SCN8A-related disorders reveal prognostic and therapeutic implications. Brain. 2022 Sep 14;145(9):2991–3009. Kamada F, Kure S, Kudo T, Suzuki Y, Oshima T, Ichinohe A, et al. A novel KCNQ4 one-base deletion in a large pedigree with hearing loss: implication for the genotype-phenotype correlation. J Hum Genet.
2006;51(5):455–60. Dickerson JE, Robertson DL. On the Origins of Mendelian Disease Genes in Man: The Impact of Gene Duplication. Mol Biol Evol. 2012 Jan;29(1):61–9. Yates B, Gray KA, Jones TEM, Bruford EA. Updates to HCOP: the HGNC comparison of orthology predictions tool. Briefings in Bioinformatics [Internet]. 2021 May 6 [cited 2021 Jul 23];(bbab155). Available from: https://doi.org/10.1093/bib/bbab155 Lal D, May P, Perez-Palma E, Samocha KE, Kosmicki JA, Robinson EB, et al. Gene family information facilitates variant interpretation and identification of disease-associated genes in neurodevelopmental disorders. Genome Med. 2020 17;12(1):28. Chen WH, Zhao XM, van Noort V, Bork P. Human Monogenic Disease Genes Have Frequently Functionally Redundant Paralogs. PLoS Comput Biol. 2013 May 16;9(5):e1003073. Wiel L, Venselaar H, Veltman JA, Vriend G, Gilissen C. Aggregation of population-based genetic variation over protein domain homologues and its potential use in genetic diagnostics. Hum Mutat. 2017 Nov;38(11):1454–63. Ware JS, Walsh R, Cunningham F, Birney E, Cook SA. Paralogous annotation of disease-causing variants in long QT syndrome genes. Hum Mutat. 2012 Aug;33(8):1188–91. Zhang X, Theotokis PI, Li N, Investigators the Sh, Wright CF, Samocha KE, et al. Genetic constraint at single amino acid resolution improves missense variant prioritisation and gene discovery [Internet]. medRxiv; 2022 [cited 2023 Oct 19]. p. 2022.02.16.22271023. Available from: https://www.medrxiv.org/content/10.1101/2022.02.16.22271023v1 Brunklaus A, Feng T, Brünger T, Perez-Palma E, Heyne H, Matthews E, et al. Gene variant effects across sodium channelopathies predict function and guide precision therapy. Brain. 2022 Jan 17;awac006. Walsh R, Peters NS, Cook SA, Ware JS. Paralogue annotation identifies novel pathogenic variants in patients with Brugada syndrome and catecholaminergic polymorphic ventricular tachycardia. J Med Genet. 2014 Jan;51(1):35–44. 
Pérez-Palma E, May P, Iqbal S, Niestroj LM, Du J, Heyne HO, et al. Identification of pathogenic variant enriched regions across genes and gene families. Genome Res. 2020;30(1):62–71. Richards S, Aziz N, Bale S, Bick D, Das S, Gastier-Foster J, et al. Standards and guidelines for the interpretation of sequence variants: a joint consensus recommendation of the American College of Medical Genetics and Genomics and the Association for Molecular Pathology. Genet Med. 2015 May;17(5):405–24. Landrum MJ, Lee JM, Benson M, Brown GR, Chao C, Chitipiralla S, et al. ClinVar: improving access to variant interpretations and supporting evidence. Nucleic Acids Res. 2018 04;46(D1):D1062–7. Marinakis NM, Svingou M, Veltra D, Kekou K, Sofocleous C, Tilemis FN, et al. Phenotype-driven variant filtration strategy in exome sequencing toward a high diagnostic yield and identification of 85 novel variants in 400 patients with rare Mendelian disorders. Am J Med Genet A. 2021 Aug;185(8):2561–71. Zech M, Jech R, Boesch S, Škorvánek M, Weber S, Wagner M, et al. Monogenic variants in dystonia: an exome-wide sequencing study. Lancet Neurol. 2020 Nov;19(11):908–18. Gelb BD, Cavé H, Dillon MW, Gripp KW, Lee JA, Mason-Suares H, et al. ClinGen’s RASopathy Expert Panel consensus methods for variant interpretation. Genet Med. 2018 Nov;20(11):1334–45. Stenson PD, Ball EV, Mort M, Phillips AD, Shiel JA, Thomas NST, et al. Human Gene Mutation Database (HGMD): 2003 update. Hum Mutat. 2003 Jun;21(6):577–81. Zerbino DR, Achuthan P, Akanni W, Amode MR, Barrell D, Bhai J, et al. Ensembl 2018. Nucleic Acids Res. 2018 Jan 4;46(Database issue):D754–61. Karczewski KJ, Francioli LC, Tiao G, Cummings BB, Alföldi J, Wang Q, et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature. 2020;581(7809):434–43. Danecek P, Auton A, Abecasis G, Albers CA, Banks E, DePristo MA, et al. The variant call format and VCFtools. Bioinformatics. 2011 Aug 1;27(15):2156–8. 
Brunklaus A, Brünger T, Feng T, Fons C, Lehikoinen A, Panagiotakaki E, et al. The gain of function SCN1A disorder spectrum: novel epilepsy phenotypes and therapeutic implications. Brain. 2022 Jun 13;awac210. Brunklaus A, Pérez-Palma E, Ghanty I, Xinge J, Brilstra E, Ceulemans B, et al. Development and Validation of a Prediction Model for Early Diagnosis of SCN1A-Related Epilepsies. Neurology. 2022 Mar 15;98(11):e1163–74. Wolff M, Johannesen KM, Hedrich UBS, Masnada S, Rubboli G, Gardella E, et al. Genetic and phenotypic heterogeneity suggest therapeutic implications in SCN2A-related disorders. Brain. 2017 May 1;140(5):1316–36. Crawford K, Xian J, Helbig KL, Galer PD, Parthasarathy S, Lewis-Smith D, et al. Computational analysis of 10,860 phenotypic annotations in individuals with SCN2A-related disorders. Genet Med. 2021 Jul;23(7):1263–72. Zaman T, Helbig KL, Clatot J, Thompson CH, Kang SK, Stouffs K, et al. SCN3A-related neurodevelopmental disorder: A spectrum of epilepsy and brain malformation. Ann Neurol. 2020 Aug;88(2):348–62. Sobreira N, Schiettecatte F, Valle D, Hamosh A. GeneMatcher: a matching tool for connecting investigators with an interest in the same gene. Hum Mutat. 2015 Oct;36(10):928–30. Walsh R, Lahrouchi N, Tadros R, Kyndt F, Glinge C, Postema PG, et al. Enhancing rare variant interpretation in inherited arrhythmias through quantitative analysis of consortium disease cohorts and population controls. Genet Med. 2021 Jan;23(1):47–58. Whiffin N, Minikel E, Walsh R, O’Donnell-Luria AH, Karczewski K, Ing AY, et al. Using high-resolution variant frequencies to empower clinical genome interpretation. Genet Med. 2017 Oct;19(10):1151–8. Kinsella RJ, Kähäri A, Haider S, Zamora J, Proctor G, Spudich G, et al. Ensembl BioMarts: a hub for data retrieval across taxonomic space. Database (Oxford). 2011;2011:bar030. Edgar RC. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Research. 2004 Mar 8;32(5):1792–7. 
Iqbal S, Brünger T, Pérez-Palma E, Macnee M, Brunklaus A, Daly MJ, et al. Delineation of functionally essential protein regions for 242 neurodevelopmental disorders. Brain. 2022 Oct 18;awac381. Ioannidis NM, Rothstein JH, Pejaver V, Middha S, McDonnell SK, Baheti S, et al. REVEL: An Ensemble Method for Predicting the Pathogenicity of Rare Missense Variants. Am J Hum Genet. 2016 Oct 6;99(4):877–85. Davydov EV, Goode DL, Sirota M, Cooper GM, Sidow A, Batzoglou S. Identifying a High Fraction of the Human Genome to be under Selective Constraint Using GERP++. PLoS Comput Biol [Internet]. 2010 Dec 2 [cited 2019 Dec 29];6(12). Available from: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2996323/ Frazer J, Notin P, Dias M, Gomez A, Min JK, Brock K, et al. Disease variant prediction with deep generative models of evolutionary data. Nature. 2021 Nov;599(7883):91–5. Heyne HO, Baez-Nieto D, Iqbal S, Palmer DS, Brunklaus A, May P, et al. Predicting functional effects of missense variants in voltage-gated sodium and calcium channels. Sci Transl Med. 2020 Aug 12;12(556). Tokheim C, Bhattacharya R, Niknafs N, Gygax DM, Kim R, Ryan M, et al. Exome-scale discovery of hotspot mutation regions in human cancer using 3D protein structure. Cancer Res. 2016 Jul 1;76(13):3719–31. Quinodoz M, Peter VG, Cisarova K, Royer-Bertrand B, Stenson PD, Cooper DN, et al. Analysis of missense variants in the human genome reveals widespread gene-specific clustering and improves prediction of pathogenicity. Am J Hum Genet. 2022 Mar 3;109(3):457–70. Brunklaus A, Du J, Steckler F, Ghanty II, Johannesen KM, Fenger CD, et al. Biological concepts in human sodium channel epilepsies and their relevance in clinical practice. Epilepsia. 2020;61(3):387–99. Kelly MA, Caleshu C, Morales A, Buchan J, Wolf Z, Harrison SM, et al. Adaptation and validation of the ACMG/AMP variant classification framework for MYH7-associated inherited cardiomyopathies: recommendations by ClinGen’s Inherited Cardiomyopathy Expert Panel. 
Genet Med. 2018 Mar;20(3):351–9. Patel MJ, DiStefano MT, Oza AM, Hughes MY, Wilcox EH, Hemphill SE, et al. Disease-specific ACMG/AMP guidelines improve sequence variant interpretation for hearing loss. Genet Med. 2021 Nov;23(11):2208–12. Pejaver V, Byrne AB, Feng BJ, Pagel KA, Mooney SD, Karchin R, et al. Calibration of computational tools for missense variant pathogenicity classification and ClinGen recommendations for PP3/BP4 criteria. Am J Hum Genet. 2022 Dec 1;109(12):2163–77. Feng BJ. PERCH: A Unified Framework for Disease Gene Prioritization. Hum Mutat. 2017 Mar;38(3):243–51. Loong L, Cubuk C, Choi S, Allen S, Torr B, Garrett A, et al. Quantifying prediction of pathogenicity for within-codon concordance (PM5) using 7541 functional classifications of BRCA1 and MSH2 missense variants. Genet Med. 2022 Mar;24(3):552–63. Additional Declarations No competing interests reported. Supplementary Files SupplementarymaterialGenomemedicine.docx SupplementaryTable1.xlsx SupplementaryTable2.xlsx Editorial decision: Revision requested 21 Feb, 2025 Reviews received at journal 19 Feb, 2025 Reviews received at journal 16 Feb, 2025 Reviewers agreed at journal 26 Jan, 2025 Reviewers agreed at journal 25 Jan, 2025 Reviews received at journal 16 Dec, 2024 Reviewers agreed at journal 09 Dec, 2024 Reviewers agreed at journal 09 Dec, 2024 Reviewers invited by journal 09 Dec, 2024 Editor invited by journal 15 Nov, 2024 Editor assigned by journal 14 Nov, 2024 Submission checks completed at journal 12 Nov, 2024 First submitted to journal 11 Nov, 2024
Authors and affiliations

Tobias Bruenger: Department of Neurology, The University of Texas Health Science Center at Houston, Houston, TX
Alina Ivanuk: Department of Neurology, Mayo Clinic Florida, Jacksonville, FL
Eduardo Pérez-Palma: Universidad del Desarrollo, Centro de Genética y Genómica, Facultad de Medicina Clínica Alemana, Santiago
Ludovica Montanucci: Department of Neurology, The University of Texas Health Science Center at Houston, Houston, TX
Stacey Cohen: Division of Neurology, Children's Hospital of Philadelphia, Philadelphia, PA
Lacey Smith: Epilepsy Genetics Program, Division of Epilepsy and Clinical Neurophysiology, Department of Neurology, Boston Children's Hospital, Boston, MA
Shridhar Parthasarathy: Division of Neurology, Children's Hospital of Philadelphia, Philadelphia, PA
Ingo Helbig: Division of Neurology, Children's Hospital of Philadelphia, Philadelphia, PA
Michael Nothnagel: Cologne Center for Genomics (CCG), University of Cologne, Cologne
Patrick May: Luxembourg Centre for Systems Biomedicine, University of Luxembourg, Esch-sur-Alzette
Dennis Lal (corresponding author): Department of Neurology, The University of Texas Health Science Center at Houston, Houston, TX

Figure legends

Figure 1. Graphical summary of the study.

Figure 2. Individual pathogenic paralogous variants can serve as a proxy for variant pathogenicity. A) Number of amino acid residues in 519 gene families that have a pathogenic variant (ClinVar/HGMD) at the same protein position in the same gene or a corresponding protein residue in a paralogous gene. B) Amino acids with a paralogous pathogenic variant at a paralogous alignment position have an increased positive likelihood ratio (LR+ > 1). In contrast, amino acids with a paralogous control variant (gnomAD) at a paralogous alignment position are not enriched for pathogenic variants. Each data point represents the gene-wise LR+. The gene-wise LR+ was calculated for genes where 10 or more pathogenic variants (ClinVar/HGMD) and control variants (gnomAD) could be mapped. C) As in (B), but limited to missense-constrained genes (missense z-score > 3.09).

Figure 3. Comparison to established gene family-based methods. A) The forest plot illustrates the enrichment of pathogenic versus control variants applying the para-PS1 criterion for residues with similar paralog conservation levels, as defined in Lal et al., 2020 (8). B) Similar to (A), but for the para-PM5 criterion. C) The bar plot shows the number (N) of amino acid residues across all genes where a previously established approach (Pathogenic Enriched Region, PER; Perez-Palma et al., 2019 (15)) and/or our para-PS1/para-PM5 ACMG criteria extension can be applied. D) The forest plot compares the likelihood ratios (LR+) for amino acid residues within a PER and amino acid residues where para-PS1/para-PM5 criteria can be applied (see Methods for details).

Figure 4. Leveraging phenotype correlations to enhance the application of paralogous pathogenic variants. A) Correlation matrix of the relationships between the 3D variant distributions across phenotypes. Phenotypes that share significantly (after Bonferroni adjustment) similar 3D variant distributions are color-coded in purple, whereas those with significantly distinct distributions are in orange. Statistically significant correlations are marked with stars (* for Padj < 0.05 and ** for P < 0.001). B) Forest plot of the positive likelihood ratio for four phenotypes, derived from a comparison of variants from affected individuals and control variants sourced from gnomAD. These ratios were computed by 1) employing paralogous variants from affected individuals that exhibited a significantly positive correlation based on 3D position (purple), 2) utilizing paralogous variants from affected individuals displaying a 3D position-based negative correlation (orange), and 3) considering paralogous control variants (grey). Abbreviations: DEE: Developmental Epileptic Encephalopathy; FHM3: Familial Hemiplegic Migraine Type 3; PP: Periodic Paralysis; NDDwoE: Neurodevelopmental Disorders Without Epilepsy; Dravet: Dravet Syndrome.
These paralogous genes can be grouped into 2871 gene families as defined by the Human Gene Nomenclature Consortium (HGNC)(\u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e) with \u0026gt;\u0026thinsp;80% sequence similarity(\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e). Genes within a gene family arise from gene duplication events of common ancestral genes and can share\u0026thinsp;\u0026gt;\u0026thinsp;90% amino acid sequence similarity at functionally essential protein domains(\u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e). We and others have shown that quantifying conservation across these paralogous genes and homologous domains is an effective strategy to distinguish between pathogenic and benign variants(\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e, \u003cspan additionalcitationids=\"CR11\" citationid=\"CR10\" class=\"CitationRef\"\u003e10\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e12\u003c/span\u003e). Molecular studies further indicate that the biophysical function of domains is conserved within a gene family. As a result, a single amino acid substitution in the same position of a homologous domain often leads to similar molecular effects across members of the same gene family(\u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e, \u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e, \u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e). 
This suggests that a comprehensive understanding of variants in one gene can provide, through a form of knowledge \"transfer\", insights into the pathogenicity and also into the biological disease mechanisms of unstudied variants in its paralogs when these variants are located at identical positions.\u003c/p\u003e \u003cp\u003eWithin the same gene family, proteins show similar patterns of population variant constraint and pathogenic variant clustering. In addition to identifying conservation patterns within gene families, previous research has highlighted a differential distribution between missense variants from the general population and pathogenic missense variants, which was consistent across a subset of paralogous genes(\u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e, \u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e). Furthermore, our previous findings indicate that this regional clustering is prevalent across paralogous genes and enables a systematic identification of regions enriched with pathogenic variants, termed Pathogenic Variant Enriched Regions (PERs)(\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e). Our study showed that novel missense variants located within PERs have a higher likelihood of being pathogenic compared to those in non-PER regions of the same gene(\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e). However, this method currently has limited sensitivity, since many newly discovered variants are located outside of PERs. 
Moreover, as PERs typically define a larger protein region, interpretations regarding disease mechanisms are constrained to a regional context, preventing insights at the individual amino acid level.\u003c/p\u003e \u003cp\u003eTo standardize variant interpretation, the American College of Medical Genetics and Genomics (ACMG) published recommendations for evaluating the pathogenicity of variants(\u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e). However, \u0026gt;\u0026thinsp;45% of single nucleotide variants reported in the ClinVar database(\u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e) (accessed March 2023) are classified as variants of uncertain significance (VUS), due to the absence of sufficient evidence for or against variant pathogenicity. The guidelines include criteria that utilize information from previous variant classifications e.g., the presence of an established pathogenic variant with the same amino acid exchange (PS1) or a different amino acid exchange (PM5) at the same position in the same gene that can provide strong to moderate evidence for pathogenicity(\u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e). However, since the vast majority of rare monogenic disorders are genetically heterogeneous and about half of the identified pathogenic variants have not yet been observed in other individuals(\u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e, \u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e19\u003c/span\u003e), the application of these evidence criteria is limited.\u003c/p\u003e \u003cp\u003eIn the present study, we extend prior work on gene family conservation to provide access to a paralog-based annotation that could improve the assessment of variant pathogenicity. 
We postulate that variants previously classified in conserved residues of paralogous genes can provide evidence for the pathogenicity of novel variants located at corresponding amino acid positions in these genes. The use of pre-classified variants in paralogs as evidence of pathogenicity has been previously suggested for a select group of genes e.g., by the RASopathy ClinGen Expert Panel(\u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e20\u003c/span\u003e). However, the broad applicability of this approach across the entire protein-coding exome - particularly, the potential of single missense variants from paralogs as a feature to inform variant pathogenicity - remains unquantified and untested.\u003c/p\u003e \u003cp\u003eIn this proof of concept study, our findings reveal that for 519 gene families (comprising 1,459 genes) with high sequence similarity, the presence of a pathogenic variant in one gene family member at an equivalent protein position is associated with a significant increase in the likelihood of pathogenicity for a novel variant at a conserved paralogous site in the target gene. 
Additionally, we illustrate in a case study that integrating expert-curated clinical data across sodium channels can refine variant selection, which not only enhances variant pathogenicity classification but also identifies disorders across paralogs that likely share similar disease mechanisms.\u003c/p\u003e"},{"header":"Methods","content":"\u003cdiv id=\"Sec3\" class=\"Section2\"\u003e \u003ch2\u003eAnnotation of missense variants from public repositories\u003c/h2\u003e \u003cdiv id=\"Sec4\" class=\"Section3\"\u003e \u003ch2\u003eMissense variants from patients\u003c/h2\u003e \u003cp\u003eMissense variants associated with the disease were collected from the ClinVar database(\u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e) (ClinVar, release October 2019) and the Human Gene Mutation Database(\u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e) (HGMD\u0026reg;) Professional release 2019.2. Similarly, we gathered an updated version of the variants from ClinVar (released December 2022) and HGMD (Professional release 2023.1), processed them as described before, and extracted all previously unreported pathogenic variants not observed in the previous dataset to obtain an independent set of variants. The ClinVar missense variants were obtained in a tabular format from the FTP site (\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003eftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/\u003c/span\u003e\u003cspan address=\"http://ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e) and only those variants exclusively classified as \"Pathogenic\" and/or \"Likely Pathogenic\" in their final consensus interpretation were considered to ensure high stringency. The HGMD dataset was filtered for \"missense variants,\" \"High Confidence\" calls (hgmd_confidence = \"HIGH\" flag), and \"Disease causing\" state (hgmd_variantType = \"DM\" flag). 
All annotations refer to the human reference genome version GRCh37.p13/hg19, and variants belonging to non-canonical transcripts as defined by Ensembl were excluded(\u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e22\u003c/span\u003e). Since ClinVar and HGMD are not mutually exclusive, we used the union of both resources and removed duplicate entries by comparing HGVS annotations. We further refer to the combined set of variants classified as likely-pathogenic, pathogenic, or \"Disease-causing\" as \u0026lsquo;pathogenic variants\u0026rsquo;.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec5\" class=\"Section3\"\u003e \u003ch2\u003eMissense variants from the population\u003c/h2\u003e \u003cp\u003eMissense variants present in the Genome Aggregation Database(\u003cspan citationid=\"CR23\" class=\"CitationRef\"\u003e23\u003c/span\u003e) (gnomAD, public release 2.0.2) were obtained in the Variant Call Format(\u003cspan citationid=\"CR24\" class=\"CitationRef\"\u003e24\u003c/span\u003e) (VCFs). We extracted the high-quality missense variants by filtering the VCF files to the \"CSQ\" field and \u0026ldquo;PASS\u0026rdquo; flag. The annotations were based on the human reference genome version GRCh37.p13/hg19. We extracted only entries annotated to the canonical gene transcripts, as defined by Ensembl(\u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e22\u003c/span\u003e). 
The aggregated population variants serve as control variants in our study and are further referred to as controls.\u003c/p\u003e \u003cp\u003eSimilarly, we gathered an updated version of the variants from gnomAD (public release 2.1.1), processed them as described above, and extracted all novel variants not observed in the previous set of gnomAD variants to obtain an independent set of control variants.\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv id=\"Sec6\" class=\"Section2\"\u003e \u003ch2\u003eAnnotation of missense variants and associated phenotypes for the voltage-gated sodium channels\u003c/h2\u003e \u003cp\u003eBrain-related phenotypes\u003c/p\u003e \u003cp\u003eWe aggregated published patient missense variants in voltage-gated sodium channel (VGSC) genes from the literature. All patient variants for \u003cem\u003eSCN1A\u003c/em\u003e were obtained from Brunklaus et al., 2022 and Brunklaus et al., 2022(\u003cspan citationid=\"CR25\" class=\"CitationRef\"\u003e25\u003c/span\u003e, \u003cspan citationid=\"CR26\" class=\"CitationRef\"\u003e26\u003c/span\u003e). Variants for \u003cem\u003eSCN2A\u003c/em\u003e were obtained from Wolff et al., 2017(\u003cspan citationid=\"CR27\" class=\"CitationRef\"\u003e27\u003c/span\u003e) and Crawford et al., 2021(\u003cspan citationid=\"CR28\" class=\"CitationRef\"\u003e28\u003c/span\u003e). Variants for \u003cem\u003eSCN3A\u003c/em\u003e were obtained from Zaman et al., 2018(\u003cspan citationid=\"CR29\" class=\"CitationRef\"\u003e29\u003c/span\u003e). All \u003cem\u003eSCN8A\u003c/em\u003e variants were taken from Johannesen et al., 2021(\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e). 
Affected individuals were recruited through a network of collaborating clinicians, as well as GeneMatcher(\u003cspan citationid=\"CR30\" class=\"CitationRef\"\u003e30\u003c/span\u003e), using a standardized phenotyping sheet to assess clinical characteristics cognition), EEG, neuroimaging, and retrospective data on antiepileptic treatment.\u003c/p\u003e \u003cp\u003eNon-brain phenotypes\u003c/p\u003e \u003cp\u003eSCN5A variants were obtained from the studies conducted by Milman et al., 2021(\u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e), and Walsh et al., 2021(\u003cspan citationid=\"CR31\" class=\"CitationRef\"\u003e31\u003c/span\u003e). Data from SCN4A, SCN9A, SCN10A, and SCN11A variants were collected from various publications listed in Supplementary Table\u0026nbsp;2. Variants in the voltage-gated sodium channels (VGSC) encoding genes that were not missense-constrained were filtered for the maximum population frequency (MAF). We inferred the MAF thresholds by using the approach described by Whiffin et al., 2017(\u003cspan citationid=\"CR32\" class=\"CitationRef\"\u003e32\u003c/span\u003e), via the authors' app (\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://www.cardiodb.org/allelefrequencyapp\u003c/span\u003e\u003cspan address=\"https://www.cardiodb.org/allelefrequencyapp\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e), based on the phenotype's estimated prevalence, mode of inheritance, and penetrance of the phenotype. 
We categorized SCN4A variants related to myotonia congenita and paramyotonia congenita and SCN9A variants related to primary erythromelalgia and paroxysmal episodic pain disorder into single categories (Relaxation Impairment Disorders and Paroxysmal Pain Disorders, respectively) based on their shared molecular pathology and pathophysiology after applying the MAF filter.\u003c/p\u003e \u003cp\u003eWe mapped all variants to their Ensembl canonical transcript\u003csup\u003e24\u003c/sup\u003e (\u003cem\u003eSCN1A\u003c/em\u003e: ENST00000303395, \u003cem\u003eSCN2A\u003c/em\u003e: ENST00000283256, \u003cem\u003eSCN3A\u003c/em\u003e: ENST00000283254, \u003cem\u003eSCN4A\u003c/em\u003e: ENST00000435607, \u003cem\u003eSCN5A\u003c/em\u003e: ENST00000423572, \u003cem\u003eSCN8A\u003c/em\u003e: ENST00000283254, \u003cem\u003eSCN9A\u003c/em\u003e: ENST00000409672, \u003cem\u003eSCN10A\u003c/em\u003e: ENST00000449082, \u003cem\u003eSCN11A\u003c/em\u003e: ENST00000302328). Only phenotypes associated with variants at more than five different protein positions were considered. The original and harmonized phenotype annotations for each phenotype are listed in Supplementary Table\u0026nbsp;2.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec7\" class=\"Section2\"\u003e \u003ch2\u003eGene family definition\u003c/h2\u003e \u003cp\u003eWe obtained the paralogous genes that belong to a gene family from P\u0026eacute;rez-Palma et al. 2020(\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e), as originally described in Lal et al., 2020(\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e). Briefly, the human paralog definitions were taken from Ensembl BioMart (\u003cspan citationid=\"CR33\" class=\"CitationRef\"\u003e33\u003c/span\u003e) and filtered for those with an HGNC symbol(\u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e). For each gene, the canonical transcript as defined by Ensembl was considered. 
To avoid aligning highly diverged sequences, families with less than 80% similarity on the full protein sequence were removed.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec8\" class=\"Section2\"\u003e \u003ch2\u003eDefinition of paralogous variants\u003c/h2\u003e \u003cp\u003eFor all the protein sequences within the same gene family, we performed a multiple sequence alignment using the MUSCLE(\u003cspan citationid=\"CR34\" class=\"CitationRef\"\u003e34\u003c/span\u003e) software. We then mapped pathogenic and general population variants onto these multiple sequence alignments. Given two variants in two different genes of the same gene family, we considered them as paralogous variants if they satisfied the following two conditions: \u003cem\u003e1)\u003c/em\u003e they are located at the same position in the multiple protein sequence alignment of the gene family, and \u003cem\u003e2)\u003c/em\u003e the reference amino acid in the target gene and the paralogous gene is the same (Supplementary Fig.\u0026nbsp;1).\u003c/p\u003e \u003cp\u003eWe further establish an expanded set of criteria, termed para-PS1 and para-PM5, which are defined as follows:\u003c/p\u003e \u003cp\u003e \u003cstrong\u003epara-PS1\u003c/strong\u003e \u003cp\u003eThis refers to a pathogenic paralogous variant that exhibits the same amino acid substitution as the investigated variant.\u003c/p\u003e \u003c/p\u003e \u003cp\u003e \u003cstrong\u003epara-PM5\u003c/strong\u003e \u003cp\u003eThis denotes a pathogenic paralogous variant that exhibits a different amino acid substitution compared to the investigated variant.\u003c/p\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec9\" class=\"Section2\"\u003e \u003ch2\u003eCalculation of the positive likelihood ratio when a pathogenic paralogous variant is found\u003c/h2\u003e \u003cp\u003eFor each gene, we calculated the positive likelihood ratio using our aggregated set of pathogenic and general population variants for the para-PS1/PM5 criteria 
(Supplementary Fig.\u0026nbsp;1). While considering the definition of the criteria (see above) we counted for each gene \u003cem\u003ei)\u003c/em\u003e the number of pathogenic variants for which at least one pathogenic paralogous variant was observed and \u003cem\u003eii)\u003c/em\u003e the number of pathogenic variants for which no pathogenic paralogous variant was observed. For the same gene we also counted \u003cem\u003ei)\u003c/em\u003e the number of control variants for which at least one pathogenic paralogous variant was observed and \u003cem\u003eii)\u003c/em\u003e the number of control variants for which no pathogenic paralogous variant was observed. To determine the level of evidence each criterion can provide, we calculated the positive likelihood ratios for two cases: A) presence of a pathogenic paralogous variant with the same amino acid substitution (para-PS1) and B) presence of a pathogenic paralogous variant with a different amino acid substitution (para-PM5). The positive likelihood ratio was computed using the sensitivity and specificity of the test:\u003c/p\u003e \u003cp\u003eEquation 1:\u003cdiv id=\"Equa\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equa\" name=\"EquationSource\"\u003e\n$$\\:Positive\\:Likelihood\\:ratio\\:\\left(LR+\\right)=\\frac{Sensitivity}{\\left(1-Specificity\\right)}\\:=\\:\\frac{\\left(\\frac{TP}{TP+FN}\\:\\right)}{1-\\:\\left(\\frac{TN}{TN+FP}\\right)}$$\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e \u003cp\u003ewhere LR\u0026thinsp;+\u0026thinsp;represents the positive likelihood ratio, TP (true positives) denotes the number of pathogenic variants, for which a pathogenic variant is observed at a conserved corresponding paralogous residue position, TN (true negatives) indicates the number of variants from the general population, for which no pathogenic variant is observed at a corresponding paralogous residue position, FP (false positives) represents the number of 
population variants, for which a pathogenic variant is observed at a conserved corresponding paralogous residue position, and FN (false negatives) denotes the number of pathogenic variants, for which no pathogenic variant is observed at a corresponding paralogous residue position. We calculated the LR\u0026thinsp;+\u0026thinsp;both individually for each gene as well as combined across all genes. For the gene-wise metric, we counted the variants denoting TP, FP, TN, and FN for each gene separately. For the combined metric we assessed the numbers for TP, FP, TN, and FN across all disease-associated genes within a gene family together to end up with a single LR+. All analyses were performed using R v.4.2.1.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec10\" class=\"Section2\"\u003e \u003ch2\u003eComparison to established gene-family-based approaches\u003c/h2\u003e \u003cp\u003eTo compare our results to an established gene-family-based approach which identified pathogenic enriched regions (PERs) across paralogous genes on an exome-wide scale(14), we gathered an independent set of variants (see Annotation of missense variants from public repositories) which had been used neither in the PER approach nor in the enrichment analysis of this study, and we repeated the calculation outlined above.\u003c/p\u003e \u003cp\u003eTo estimate LR\u0026thinsp;+\u0026thinsp;that are not mediated by paralog conservation we repeated the analysis described above for three paralog conservation sub-groups using the Parazscore(\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e). 
The groups we considered are alignment positions with gene family wise 1) maximum Parazscore, indicating full paralog conservation across the gene family at the alignment position 2) Parazscore\u0026thinsp;\u0026gt;\u0026thinsp;0 \u0026amp; not maximum Parazscore, indicating high paralog conservation at this alignment position but not full conservation and 3) Parazscore\u0026thinsp;\u0026lt;\u0026thinsp;0, indicating low levels of conservation between paralogous genes at the alignment position.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec11\" class=\"Section2\"\u003e \u003ch2\u003eIdentification of phenotype correlation based on 3D-variant positions\u003c/h2\u003e \u003cp\u003eTo identify phenotypes associated with variants located at corresponding positions across voltage-gated sodium channels (VGSCs), we evaluated the spatial distribution of sodium channel protein structures for variants associated with each phenotype. We tallied the number of patients reported for each variant in every phenotype. Since not all sodium channels had available protein structures, we mapped the patient variants and their corresponding number of patients on the Nav1.2 protein structure (PDB-ID: 6j8e) using the multiple protein sequence alignment. We only considered patient variants that could be mapped to the protein structure for downstream analysis. For every residue in the Nav1.2 protein structure, we counted the number of patients with a variant in the residue or its local 3D neighborhood using a 5-angstrom radius cutoff, as previously introduced in Iqbal et al., 2022(\u003cspan citationid=\"CR35\" class=\"CitationRef\"\u003e35\u003c/span\u003e). The number of patients with variants at a certain residue position was evaluated independently for each phenotype. 
To identify phenotypes associated with variants at similar 3D-variant positions we calculated the Pearson correlation between the 3D-variant distribution associated with each phenotype.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec12\" class=\"Section2\"\u003e \u003ch2\u003eIntegrating variant similarity between phenotypes for the assessment of paralogous variant-based pathogenicity\u003c/h2\u003e \u003cp\u003eWe next explored whether utilizing phenotype correlation could refine the selection of variants for our paralogous patient variant approach. To test our hypothesis we first extracted the variants of the most common phenotypes in each sodium channel with \u0026gt;\u0026thinsp;40 different variants (\u003cem\u003eSCN1A\u003c/em\u003e: Dravet Syndrome, \u003cem\u003eSCN2A\u003c/em\u003e: Early onset developmental epileptic encephalopathy (DEE), \u003cem\u003eSCN5A\u003c/em\u003e: Brugada Syndrome, \u003cem\u003eSCN8A\u003c/em\u003e: DEE). We divided these cohorts randomly into four subsets of patient variants, each containing 25% of the variants. We then combined three of the four subsets (representing 75% of variants for each phenotype) with our remaining patient cohort containing all variants associated with other phenotypes. Following the approach outlined in the previous section we then identified 3D-variant position-based phenotype correlations. Finally, using the independent test cohort (the fourth subset), we calculated the LR\u0026thinsp;+\u0026thinsp;of patient vs control variants a) using paralogous pathogenic variants associated with non-correlated phenotypes and b) using paralogous pathogenic variants with significant (Bonferroni adjusted P\u0026thinsp;\u0026lt;\u0026thinsp;0.05) 3D-position-based phenotype correlation. 
We repeated this approach three times, such that each set of variants was used as part of the training set three times and once as the test set, and calculated the LR\u0026thinsp;+\u0026thinsp;by summing up the individual TP, FP, TN, and FN values of each iteration.\u003c/p\u003e \u003c/div\u003e"},{"header":"Results","content":"\u003cdiv id=\"Sec14\" class=\"Section2\"\u003e \u003ch2\u003eIncorporating pathogenic paralogous variants triples classifiable amino acid residues\u003c/h2\u003e \u003cp\u003eThe guidelines of the ACMG suggest that for determining the pathogenicity of novel variants, two scenarios can be considered: (\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e) the presence of a variant in the same gene with an identical amino acid change, irrespective of the nucleotide alteration and (\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e) a novel amino acid substitution at a position where another substitution had previously been considered pathogenic, named PS1 and PM5 criteria respectively(\u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e). In this study, our objective was to explore whether this principle could be extrapolated to encompass pathogenic variants in paralogous genes. We specifically assessed if the existence of pathogenic variants in paralogous genes at a conserved, corresponding position could serve as evidence for the pathogenicity of a new variant. 
For our study, we termed a 'paralogous variant' as a variant that meets two conditions: (\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e) it's positioned in a paralogous gene at the analogous residue index position, as delineated by multiple sequence alignment (refer to Methods for details), and (\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e) it shares the same reference amino acid as the target gene.\u003c/p\u003e \u003cp\u003eFirst, we assessed the number of amino acid residues not overlapping with pathogenic variants within the same gene at equivalent paralogous amino acid positions, but yet overlapping with pathogenic variants in paralogous genes. We aggregated a total of 60,486 pathogenic variants from ClinVar(\u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e) and HGMD(\u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e) and mapped them to 2,871 different gene family alignments, consisting of 9,990 genes (Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e). Our paralog variant analysis integrates pathogenic variants from multiple genes in the same gene family (see Methods for details). We, therefore, restricted the dataset to gene families harboring pathogenic variants in at least two genes and identified 1,459 genes from 519 gene families. Within these genes, 41,223 pathogenic missense variants and 171,690 pathogenic paralogous variants were found that covered 32,137 and 91,259 amino acid residues respectively (Supplementary Table\u0026nbsp;1). Of these 91,259 residues that are covered by a paralogous pathogenic variant 92.6% (N\u0026thinsp;=\u0026thinsp;84,553 residues) were not covered by a pathogenic variant in the same gene. 
Therefore, the integration of paralogous pathogenic variants would increase the number of amino acids in these gene families where the criteria can be applied by about 3.6-fold (N\u0026thinsp;=\u0026thinsp;116,690 residues, Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003eA). The increase in the number of classifiable amino acids in each gene family is highly correlated with the number of disease-associated genes in a gene family (R\u0026thinsp;=\u0026thinsp;0.97, P\u0026thinsp;\u0026lt;\u0026thinsp;1e-300, Supplementary Fig.\u0026nbsp;2).\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec15\" class=\"Section2\"\u003e \u003ch2\u003ePresence of single pathogenic paralogous variants can be used to assess variant pathogenicity\u003c/h2\u003e \u003cp\u003eNext, we quantified the value of incorporating pathogenic variants at paralogous positions to assess the pathogenicity of novel variants. Therefore, in addition to the aforementioned pathogenic variants, we included 2,478,899 variants from the gnomAD database(\u003cspan citationid=\"CR23\" class=\"CitationRef\"\u003e23\u003c/span\u003e) which served as controls in our study. When a pathogenic paralogous variant with the same amino acid exchange was present at a corresponding alignment index position (termed para-PS1 criterium, for details on the approach, see Methods) we observed across 519 gene families an average LR\u0026thinsp;+\u0026thinsp;of 8.32 (8.02\u0026ndash;8.62, 95% confidence interval (CI), Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003eB). 
Restricting the analysis to missense variant-constrained genes (Missense-z score\u0026thinsp;\u0026gt;\u0026thinsp;3.09(\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e)) increased the LR\u0026thinsp;+\u0026thinsp;to 8.91 (8.03\u0026ndash;9.88, 95% CI, Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003eC). Notably, even for paralogous variants with a different substitution at the same alignment index position (termed the para-PM5 criterion), we observed an increased LR\u0026thinsp;+\u0026thinsp;(all genes: LR\u0026thinsp;+\u0026thinsp;=\u0026thinsp;4.32 (4.24\u0026ndash;4.48, 95% CI); missense-constrained genes: LR\u0026thinsp;+\u0026thinsp;=\u0026thinsp;6.48 (6.05\u0026ndash;6.94, 95% CI)). Overall, we observed a wide range of LR\u0026thinsp;+\u0026thinsp;across different genes (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003eB, C).\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec16\" class=\"Section2\"\u003e \u003ch2\u003eThe presence of pathogenic paralogous variants provides evidence for pathogenicity beyond evolutionary conservation\u003c/h2\u003e \u003cp\u003eVariant mapping across paralogous residues requires residue conservation. Next, we investigated the added value of mapping beyond conservation. Previously, we developed a 'parazscore(\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e)' to measure the conservation across paralogous genes, showing that amino acids conserved within a gene family are significantly enriched for pathogenic variants. Notably, a fundamental prerequisite for the incorporation of pathogenic paralogous variants into variant assessment is the conservation of amino acid residues between the target gene and its paralogous gene. Hence, whenever criteria based on pathogenic paralogous variants are incorporated, a certain degree of conservation within the genes of the same gene family becomes inevitable. 
This conservation likely explains a portion of the elevated LR\u0026thinsp;+\u0026thinsp;we observed. Notably, while many methods(\u003cspan additionalcitationids=\"CR37\" citationid=\"CR36\" class=\"CitationRef\"\u003e36\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR38\" class=\"CitationRef\"\u003e38\u003c/span\u003e) employ evolutionary conservation as a predictor of variant pathogenicity, it is crucial to discern the added value our approach provides beyond solely relying on conservation-based evidence. To achieve this, we repeated our previous analysis, grouping amino acid residues with similar conservation across paralogs together (see Methods for details). Interestingly, within these subgroups, the highest LR\u0026thinsp;+\u0026thinsp;values were observed for residues exhibiting the least paralog conservation for both the para-PS1 criterion (Parazscore\u0026thinsp;\u0026lt;\u0026thinsp;0; LR\u0026thinsp;+\u0026thinsp;\u003csub\u003epara\u0026minus;PS1\u003c/sub\u003e = 10.49, 95% CI\u0026thinsp;=\u0026thinsp;9.60\u0026ndash;11.45, Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003eA) and the para-PM5 criterion (Parazscore\u0026thinsp;\u0026lt;\u0026thinsp;0; LR\u0026thinsp;+\u0026thinsp;\u003csub\u003epara\u0026minus;PM5\u003c/sub\u003e = 5.21, 95% CI\u0026thinsp;=\u0026thinsp;4.87\u0026ndash;5.59, Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003eB). Yet, even within the subgroup demonstrating the least increase in LR\u0026thinsp;+, where maximum conservation across all paralogous genes of the same gene family was noted, we still detected an increased LR\u0026thinsp;+\u0026thinsp;of 4.88 and 2.69 for the para-PS1 and para-PM5 criteria, respectively. 
This observation suggests that the existence of pathogenic paralogous variants provides additional information beyond the level of conservation between paralogous genes.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec17\" class=\"Section2\"\u003e \u003ch2\u003eIntegrating single pathogenic paralogous variants improves a previous family-based variant interpretation approach\u003c/h2\u003e \u003cp\u003eWe compared our approach, which uses paralogous pathogenic variants located at corresponding amino acids, to a previously published method(\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e). In contrast to our new approach, the published approach identifies \u0026lsquo;pathogenic variant enriched regions\u0026rsquo; (\u0026lsquo;PERs\u0026rsquo;, on average 33 consecutive amino acids(\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e)) across a gene family that are consistently enriched for pathogenic variants while depleted for control variants. Because PERs are identified with a sliding window approach, they can span amino acid residues without an established pathogenic variant across paralogs, and the regional association is derived from adjacent variants. However, identifying PERs within a gene family alignment requires a large number of pathogenic variants, limiting its applicability. First, we compared the number of exome-wide classifiable variants using single paralogous pathogenic variants with the PER approach. We used an independent set of pathogenic and control variants that were not utilized in the PER generation or the application of the para-PS1/PM5 criteria (see Methods for details). We found that the approach based on single paralogous pathogenic variants captured 2.2 times more residues compared to PERs (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003eC). 
Second, we compared the LR\u0026thinsp;+\u0026thinsp;for each approach and observed similar LR\u0026thinsp;+\u0026thinsp;for the PER approach and for the para-PS1 approach (LR\u0026thinsp;+\u0026thinsp;\u003csub\u003ePER\u003c/sub\u003e = 5.28, LR\u0026thinsp;+\u0026thinsp;\u003csub\u003epara\u0026minus;PS1\u003c/sub\u003e = 5.63, Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003eD).\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec18\" class=\"Section2\"\u003e \u003ch2\u003eLeveraging phenotype correlations across paralogs can enhance pathogenicity assessment\u003c/h2\u003e \u003cp\u003eA single gene can be associated with different disorders. The number of disorders associated with variants in the same gene frequently correlates with the number of different molecular functional defects. Given that structure determines function, the molecular consequences of variants often relate to their specific position within the protein structure(39). Thus, pinpointing phenotype correlations based on analogous variant distributions might reveal paralogous variants with consistent molecular effects. In the context of voltage-gated sodium channels (VGSCs), past research has underscored not only the conservation of pathogenicity but also the consistent functional effects among paralogous variants(\u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e). Building on this, we hypothesized that uncovering phenotype correlations across VGSCs could fine-tune the application of pathogenic paralogous variants for variant pathogenicity assessment. Specifically, we hypothesized that phenotype correlations within a gene family could identify correlated phenotypes based on substitution position, subsequently enhancing the likelihood of conserved pathogenicity for variants at equivalent positions. 
To test this hypothesis, we curated a comprehensive dataset featuring 1,346 affected individuals, associated with 22 diverse phenotypes and possessing 886 unique missense variants in VGSC-encoding genes (detailed in Supplementary Table\u0026nbsp;2). By mapping variants by alignment position onto the same structure and combining this with a spatial phenotype proximity correlation analysis (see Methods for details), we identified position-correlated phenotypes within the gene family (Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003eA). For example, \u003cem\u003eSCN1A\u003c/em\u003e-associated Dravet syndrome variants exhibited 3D positional correlations with \u003cem\u003eSCN2A\u003c/em\u003e variants associated with autism (R\u0026thinsp;=\u0026thinsp;0.31, P\u0026thinsp;=\u0026thinsp;2.1e-35), and Brugada syndrome variants in \u003cem\u003eSCN5A\u003c/em\u003e (R\u0026thinsp;=\u0026thinsp;0.29, P\u0026thinsp;=\u0026thinsp;2.8e-40).\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eFor genes associated with several related disorders, such as the VGSC genes, variant classification is challenging since phenotype specificity is not high. Therefore, some variants classified as pathogenic might be misclassified. Next, we tested whether variants from spatially correlated phenotypes across different paralogous genes could increase variant pathogenicity classification accuracy. We selected the most frequently reported phenotypes for VGSC genes with at least 40 patients. The four genes \u003cem\u003eSCN1A\u003c/em\u003e, \u003cem\u003eSCN2A\u003c/em\u003e, \u003cem\u003eSCN5A\u003c/em\u003e, and \u003cem\u003eSCN8A\u003c/em\u003e fulfilled this criterion. We divided the associated variants into four subsets and calculated the evidence for variant pathogenicity (see Methods for details). 
We observed an increased positive likelihood ratio by a factor of 3\u0026ndash;8 for paralogous variants associated with 3D-position correlated phenotypes, in contrast to those paralogous variants without a significant 3D-position correlation (Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003eB). For example, for SCN8A DEE cases, pathogenic paralogous variants whose phenotypes correlate with the DEE in SCN8A (LR\u0026thinsp;+\u0026thinsp;=\u0026thinsp;34.7, CI 16.3) showed an 8.6-fold higher strength to assess variant pathogenicity compared to pathogenic paralogous variants found in cases with non-correlating phenotypes (LR\u0026thinsp;+\u0026thinsp;=\u0026thinsp;4.0, CI 1.8\u0026ndash;8.9).\u003c/p\u003e \u003c/div\u003e"},{"header":"Discussion","content":"\u003cp\u003eMany paralogs are highly conserved in sequence and have similar biophysical molecular functions. Current variant interpretation guidelines only consider previously classified pathogenic missense variants in the gene of interest as evidence for pathogenicity. Here, we developed and validated a bioinformatic framework to integrate pathogenic missense variants in paralogous genes at corresponding alignment index positions as evidence for the pathogenicity of novel variants. We demonstrated that integrating paralogous pathogenic variants located at a corresponding protein position can provide evidence for pathogenicity even if the amino acid exchange is not conserved. Compared to approaches such as the PS1 and PM5 criteria of the ACMG guidelines(\u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e), which consider pathogenic variants in the same gene at the same position as evidence, our approach can be applied to 3.6-fold more protein residues where novel variants of unknown pathogenicity could be observed.\u003c/p\u003e \u003cp\u003ePathogenic missense variants in paralogous genes can serve as a proxy for pathogenicity. 
Within a protein sequence, pathogenic variants are unevenly distributed and tend to accumulate in certain regions that are critical for protein function(\u003cspan citationid=\"CR40\" class=\"CitationRef\"\u003e40\u003c/span\u003e). These pathogenic variant-enriched regions have proven valuable for variant classification through established guidelines for variant interpretation(\u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e) and the use of \u003cem\u003ein silico\u003c/em\u003e prediction algorithms(\u003cspan citationid=\"CR41\" class=\"CitationRef\"\u003e41\u003c/span\u003e). Moreover, the observation that critical protein regions tend to be evolutionarily conserved between paralogous genes can be harnessed to enhance statistical robustness by incorporating pathogenic variants across these paralogous genes(\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e). Still, about 70% of pathogenic variants are located outside the regions identified as essential. As a result, individual pathogenic variants in paralogous genes outside these regions were not considered for variant interpretation. In a study examining long QT syndrome, it was observed that individual pathogenic variants in paralogous genes are often located at paralogous positions as determined from multiple sequence alignments(\u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e), suggesting that the presence of a pathogenic variant at a particular position may serve as a proxy of pathogenicity at that alignment position in other paralogs. 
Our data test this hypothesis across a wide range of gene families and suggest that individual pathogenic paralogous variants can indeed serve as proxies for pathogenicity on a broad scale, thereby augmenting the efficacy of established variants in variant interpretation frameworks.\u003c/p\u003e \u003cp\u003ePathogenic variants in voltage-gated sodium channel (VGSC) genes are associated with a broad spectrum of clinical phenotypes, even within the same gene(\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e, \u003cspan citationid=\"CR25\" class=\"CitationRef\"\u003e25\u003c/span\u003e, \u003cspan citationid=\"CR27\" class=\"CitationRef\"\u003e27\u003c/span\u003e, \u003cspan citationid=\"CR29\" class=\"CitationRef\"\u003e29\u003c/span\u003e). Prior research demonstrated a strong correlation between different molecular variant effects, such as the gain or loss of a protein function, and the clinical phenotype(\u003cspan citationid=\"CR42\" class=\"CitationRef\"\u003e42\u003c/span\u003e). We identified phenotypes across VGSC genes with different organ or cellular gene expressions that are caused by corresponding paralogous variants located at the same alignment index position. The location of a variant in the VGSC protein structure, particularly in critical regions like the selectivity filter or the inactivation gate, is often associated with conserved molecular function(\u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e). Our findings of 3D-position-based phenotype correlations across VGSC genes likely identify phenotypes caused by variants in paralogous genes with similar molecular effects. The framework we developed assumes that both pathogenicity and the molecular impact of a variant are generally conserved. We confirmed that pathogenicity is often preserved across paralogous genes at conserved residues. 
Nonetheless, our results suggest that applying correlations derived from the 3D positioning of these variants can potentially identify cases where this conservation does not hold or where variants previously classified as pathogenic were misclassified.\u003c/p\u003e \u003cp\u003eDespite efforts to standardize criteria for pathogenicity assignment(\u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e) and many improvements in variant interpretation, about 75% of missense variants in ClinVar(\u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e) (accessed 12/2022) are annotated as variants of uncertain significance (VUS). Extending or modifying existing ACMG criteria has been demonstrated as a promising approach to reclassifying VUSs(\u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e20\u003c/span\u003e, \u003cspan citationid=\"CR38\" class=\"CitationRef\"\u003e38\u003c/span\u003e, \u003cspan additionalcitationids=\"CR44\" citationid=\"CR43\" class=\"CitationRef\"\u003e43\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR45\" class=\"CitationRef\"\u003e45\u003c/span\u003e). We demonstrated that the PS1 and PM5 criteria of the ACMG guidelines could, in principle, be extended by considering already classified pathogenic variants at corresponding amino acid positions as evidence for pathogenicity. This approach was previously suggested by ClinGen expert-curated guidelines for a small set of genes associated with RASopathies(\u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e20\u003c/span\u003e). Here, however, we have demonstrated the generalizability of the approach across a large set of 519 gene families and quantified the evidence gained from this approach.\u003c/p\u003e \u003cp\u003eOur proposed inclusion of the paralogous variants as biologically interpretable evidence of variant pathogenicity has several limitations. 
First, incorporating pathogenic variants at paralogous positions into the established ACMG/AMP variant classification guidelines requires careful evaluation. This is due to the potential overlap between the basic data supporting an extension of PS1/PM5 criteria to paralogous genes and those already covered by the existing guidelines. Notably, the \u003cem\u003ein silico\u003c/em\u003e score criteria, PP3/BP4, overlap, given that many predictive models, such as REVEL(\u003cspan citationid=\"CR36\" class=\"CitationRef\"\u003e36\u003c/span\u003e) or BayesDel(\u003cspan citationid=\"CR46\" class=\"CitationRef\"\u003e46\u003c/span\u003e), incorporate evolutionary conservation as a fundamental training feature. Likewise, the para-PS1/PM5 criteria we defined require conservation across paralogous genes at the specified position, thus also considering evidence derived from evolutionary conservation across paralogous genes. We previously demonstrated that orthologous conservation, commonly harnessed in most \u003cem\u003ein silico\u003c/em\u003e predictive scores, differs from the evolutionary insights acquired from paralogous gene analyses, although they are correlated(\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e). Furthermore, we demonstrated in this study that even for residues similarly conserved across paralogs, the presence of a pathogenic variant at a conserved paralogous residue provides additional evidence supporting pathogenicity. Nevertheless, implementing criteria based on pathogenic variants at paralogous positions along with PP3/BP4 requires a rigorous analysis to determine the discrete evidence provided by integrating pathogenic variants at paralogous positions beyond that provided by the selected PP3 score, to ensure that information is not considered redundantly. 
Therefore, incorporating evidence from pathogenic variants at paralogous positions, especially when concurrently considering other related criteria for final classification, introduces a potential risk of inadvertently over-representing shared basic elements. This could lead to an inflated assessment of evidence either supporting or contesting pathogenicity. Second, variants integrated in our framework of pathogenic variants at paralogous positions could be inflated by spliceogenic exonic variants(\u003cspan citationid=\"CR47\" class=\"CitationRef\"\u003e47\u003c/span\u003e). Although previous results suggest that their impact on our approach might be minor, excluding variants with a predicted high splicing impact could resolve this concern. Third, a limitation of our study is the inclusion of control variants aggregated from the gnomAD database, some of which may be pathogenic despite their presence in the general population. In instances where these control variants are indeed pathogenic, the likelihood ratios calculated in our study may represent underestimations, maintaining the conservative nature of our findings.\u003c/p\u003e"},{"header":"Conclusion","content":"\u003cp\u003eIn conclusion, our findings suggest that utilizing pathogenic paralogous variants provides significant potential to improve variant interpretation and aid in the diagnosis of pathogenic variants in clinical practice. Reference databases continue to grow and include well-classified pathogenic variants. 
While we have demonstrated that pathogenic variants in paralogous genes at the same alignment position provide evidence for pathogenicity across all disease-associated gene families, the potential integration of these criteria into the ACMG classification framework would require a careful approach to avoid double counting due to correlation with other criteria that use evolutionary conservation as a feature (e.g., \u003cem\u003ein silico\u003c/em\u003e prediction scores). Future iterations of variant interpretation guidelines that consider the presence of paralogous pathogenic variants as evidence of pathogenicity could thus significantly increase the application of criteria based on already established pathogenic variants.\u003c/p\u003e"},{"header":"Abbreviations","content":"\u003cp\u003eHGNC: HUGO Gene Nomenclature Committee; PER: Pathogenic Variant Enriched Regions; ACMG: American College of Medical Genetics and Genomics; VUS: Variants of Uncertain Significance; HGMD: Human Gene Mutation Database; gnomAD: Genome Aggregation Database; MAF: Maximum Population Frequency; VGSC: Voltage-Gated Sodium Channels; LR+: Positive Likelihood Ratio; FN: False Negative; FP: False Positive; TP: True Positive; TN: True Negative; DEE: Developmental Epileptic Encephalopathy\u0026nbsp;\u003c/p\u003e"},{"header":"Declarations","content":"\u003cp\u003eEthics approval and consent to participate\u003c/p\u003e\n\u003cp\u003eNot applicable\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eConsent for publication\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eNot applicable\u003c/p\u003e\n\u003cp\u003eAvailability of data and materials\u003c/p\u003e\n\u003cp\u003eData are available in the Supplementary Tables. 
Code to annotate para-PS1/PM5 criteria with a custom variant dataset for any gene family can be obtained from https://github.com/TobiasBruenger/paraPS1-PM5-annotation.\u003c/p\u003e\n\u003cp\u003eCompeting interests\u003c/p\u003e\n\u003cp\u003eThe authors report no conflicts of interest.\u003c/p\u003e\n\u003cp\u003eFunding\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eFunding for this work was provided by the German Federal Ministry for Education and Research (BMBF, Treat-ION, 01GM1907D) to D.L., T.B., and P.M., by the BMBF (Treat-Ion2, 01GM2210B) to P.M., the Fonds Nationale de la Recherche in Luxembourg (FNR, Research Unit FOR-2715, INTER/DFG/21/16394868 MechEPI2) to P.M., the Agencia Nacional de Investigaci\u0026oacute;n y Desarrollo de Chile (ANID, Fondecyt 1221464 grant) to E.P., the Familie SCN2A foundation 2020 Action Potential Grant to E.P., the Dravet Syndrome Foundation (grant number, 272016) to D.L., and the NIH NINDS (Channelopathy-Associated Epilepsy Research Center, 5-U54-NS108874) to D.L.\u003c/p\u003e\n\u003cp\u003eAuthor\u0026rsquo;s Contributions\u003c/p\u003e\n\u003cp\u003eConceptualization: T.B., A.I., D.L., I.H.; Data curation: T.B., A.I., E.P.; Analysis: T.B., A.I.; Supervision: D.L., P.M., M.N.; Writing-original draft: T.B., A.I., L.M.; Writing-editing: D.L., P.M., S.C., L.S., S.P., L.M.\u003c/p\u003e\n\u003cp\u003eAcknowledgement\u003c/p\u003e\n\u003cp\u003eNot applicable. \u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\n\u003cli\u003eLek M, Karczewski KJ, Minikel EV, Samocha KE, Banks E, Fennell T, et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature. 2016 Aug 18;536(7616):285\u0026ndash;91. \u003c/li\u003e\n\u003cli\u003eChoi M, Scholl UI, Ji W, Liu T, Tikhonova IR, Zumbo P, et al. Genetic diagnosis by whole exome capture and massively parallel DNA sequencing. Proc Natl Acad Sci U S A. 2009 Nov 10;106(45):19096\u0026ndash;101. 
\u003c/li\u003e\n\u003cli\u003eMilman A, Behr ER, Gray B, Johnson DC, Andorin A, Hochstadt A, et al. Genotype-Phenotype Correlation of SCN5A Genotype in Patients With Brugada Syndrome and Arrhythmic Events: Insights From the SABRUS in 392 Probands. Circ Genom Precis Med. 2021 Oct;14(5):e003222. \u003c/li\u003e\n\u003cli\u003eJohannesen KM, Liu Y, Koko M, Gjerulfsen CE, Sonnenberg L, Schubert J, et al. Genotype-phenotype correlations in SCN8A-related disorders reveal prognostic and therapeutic implications. Brain. 2022 Sep 14;145(9):2991\u0026ndash;3009. \u003c/li\u003e\n\u003cli\u003eKamada F, Kure S, Kudo T, Suzuki Y, Oshima T, Ichinohe A, et al. A novel KCNQ4 one-base deletion in a large pedigree with hearing loss: implication for the genotype-phenotype correlation. J Hum Genet. 2006;51(5):455\u0026ndash;60. \u003c/li\u003e\n\u003cli\u003eDickerson JE, Robertson DL. On the Origins of Mendelian Disease Genes in Man: The Impact of Gene Duplication. Mol Biol Evol. 2012 Jan;29(1):61\u0026ndash;9. \u003c/li\u003e\n\u003cli\u003eYates B, Gray KA, Jones TEM, Bruford EA. Updates to HCOP: the HGNC comparison of orthology predictions tool. Briefings in Bioinformatics [Internet]. 2021 May 6 [cited 2021 Jul 23];(bbab155). Available from: https://doi.org/10.1093/bib/bbab155\u003c/li\u003e\n\u003cli\u003eLal D, May P, Perez-Palma E, Samocha KE, Kosmicki JA, Robinson EB, et al. Gene family information facilitates variant interpretation and identification of disease-associated genes in neurodevelopmental disorders. Genome Med. 2020 17;12(1):28. \u003c/li\u003e\n\u003cli\u003eChen WH, Zhao XM, van Noort V, Bork P. Human Monogenic Disease Genes Have Frequently Functionally Redundant Paralogs. PLoS Comput Biol. 2013 May 16;9(5):e1003073. \u003c/li\u003e\n\u003cli\u003eWiel L, Venselaar H, Veltman JA, Vriend G, Gilissen C. Aggregation of population-based genetic variation over protein domain homologues and its potential use in genetic diagnostics. Hum Mutat. 
2017 Nov;38(11):1454\u0026ndash;63. \u003c/li\u003e\n\u003cli\u003eWare JS, Walsh R, Cunningham F, Birney E, Cook SA. Paralogous annotation of disease-causing variants in long QT syndrome genes. Hum Mutat. 2012 Aug;33(8):1188\u0026ndash;91. \u003c/li\u003e\n\u003cli\u003eZhang X, Theotokis PI, Li N, Investigators the Sh, Wright CF, Samocha KE, et al. Genetic constraint at single amino acid resolution improves missense variant prioritisation and gene discovery [Internet]. medRxiv; 2022 [cited 2023 Oct 19]. p. 2022.02.16.22271023. Available from: https://www.medrxiv.org/content/10.1101/2022.02.16.22271023v1\u003c/li\u003e\n\u003cli\u003eBrunklaus A, Feng T, Br\u0026uuml;nger T, Perez-Palma E, Heyne H, Matthews E, et al. Gene variant effects across sodium channelopathies predict function and guide precision therapy. Brain. 2022 Jan 17;awac006. \u003c/li\u003e\n\u003cli\u003eWalsh R, Peters NS, Cook SA, Ware JS. Paralogue annotation identifies novel pathogenic variants in patients with Brugada syndrome and catecholaminergic polymorphic ventricular tachycardia. J Med Genet. 2014 Jan;51(1):35\u0026ndash;44. \u003c/li\u003e\n\u003cli\u003eP\u0026eacute;rez-Palma E, May P, Iqbal S, Niestroj LM, Du J, Heyne HO, et al. Identification of pathogenic variant enriched regions across genes and gene families. Genome Res. 2020;30(1):62\u0026ndash;71. \u003c/li\u003e\n\u003cli\u003eRichards S, Aziz N, Bale S, Bick D, Das S, Gastier-Foster J, et al. Standards and guidelines for the interpretation of sequence variants: a joint consensus recommendation of the American College of Medical Genetics and Genomics and the Association for Molecular Pathology. Genet Med. 2015 May;17(5):405\u0026ndash;24. \u003c/li\u003e\n\u003cli\u003eLandrum MJ, Lee JM, Benson M, Brown GR, Chao C, Chitipiralla S, et al. ClinVar: improving access to variant interpretations and supporting evidence. Nucleic Acids Res. 2018 04;46(D1):D1062\u0026ndash;7. 
\u003c/li\u003e\n\u003cli\u003eMarinakis NM, Svingou M, Veltra D, Kekou K, Sofocleous C, Tilemis FN, et al. Phenotype-driven variant filtration strategy in exome sequencing toward a high diagnostic yield and identification of 85 novel variants in 400 patients with rare Mendelian disorders. Am J Med Genet A. 2021 Aug;185(8):2561\u0026ndash;71. \u003c/li\u003e\n\u003cli\u003eZech M, Jech R, Boesch S, \u0026Scaron;korv\u0026aacute;nek M, Weber S, Wagner M, et al. Monogenic variants in dystonia: an exome-wide sequencing study. Lancet Neurol. 2020 Nov;19(11):908\u0026ndash;18. \u003c/li\u003e\n\u003cli\u003eGelb BD, Cav\u0026eacute; H, Dillon MW, Gripp KW, Lee JA, Mason-Suares H, et al. ClinGen\u0026rsquo;s RASopathy Expert Panel consensus methods for variant interpretation. Genet Med. 2018 Nov;20(11):1334\u0026ndash;45. \u003c/li\u003e\n\u003cli\u003eStenson PD, Ball EV, Mort M, Phillips AD, Shiel JA, Thomas NST, et al. Human Gene Mutation Database (HGMD): 2003 update. Hum Mutat. 2003 Jun;21(6):577\u0026ndash;81. \u003c/li\u003e\n\u003cli\u003eZerbino DR, Achuthan P, Akanni W, Amode MR, Barrell D, Bhai J, et al. Ensembl 2018. Nucleic Acids Res. 2018 Jan 4;46(Database issue):D754\u0026ndash;61. \u003c/li\u003e\n\u003cli\u003eKarczewski KJ, Francioli LC, Tiao G, Cummings BB, Alf\u0026ouml;ldi J, Wang Q, et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature. 2020;581(7809):434\u0026ndash;43. \u003c/li\u003e\n\u003cli\u003eDanecek P, Auton A, Abecasis G, Albers CA, Banks E, DePristo MA, et al. The variant call format and VCFtools. Bioinformatics. 2011 Aug 1;27(15):2156\u0026ndash;8. \u003c/li\u003e\n\u003cli\u003eBrunklaus A, Br\u0026uuml;nger T, Feng T, Fons C, Lehikoinen A, Panagiotakaki E, et al. The gain of function SCN1A disorder spectrum: novel epilepsy phenotypes and therapeutic implications. Brain. 2022 Jun 13;awac210. 
\u003c/li\u003e\n\u003cli\u003eBrunklaus A, P\u0026eacute;rez-Palma E, Ghanty I, Xinge J, Brilstra E, Ceulemans B, et al. Development and Validation of a Prediction Model for Early Diagnosis of SCN1A-Related Epilepsies. Neurology. 2022 Mar 15;98(11):e1163\u0026ndash;74. \u003c/li\u003e\n\u003cli\u003eWolff M, Johannesen KM, Hedrich UBS, Masnada S, Rubboli G, Gardella E, et al. Genetic and phenotypic heterogeneity suggest therapeutic implications in SCN2A-related disorders. Brain. 2017 May 1;140(5):1316\u0026ndash;36. \u003c/li\u003e\n\u003cli\u003eCrawford K, Xian J, Helbig KL, Galer PD, Parthasarathy S, Lewis-Smith D, et al. Computational analysis of 10,860 phenotypic annotations in individuals with SCN2A-related disorders. Genet Med. 2021 Jul;23(7):1263\u0026ndash;72. \u003c/li\u003e\n\u003cli\u003eZaman T, Helbig KL, Clatot J, Thompson CH, Kang SK, Stouffs K, et al. SCN3A-related neurodevelopmental disorder: A spectrum of epilepsy and brain malformation. Ann Neurol. 2020 Aug;88(2):348\u0026ndash;62. \u003c/li\u003e\n\u003cli\u003eSobreira N, Schiettecatte F, Valle D, Hamosh A. GeneMatcher: a matching tool for connecting investigators with an interest in the same gene. Hum Mutat. 2015 Oct;36(10):928\u0026ndash;30. \u003c/li\u003e\n\u003cli\u003eWalsh R, Lahrouchi N, Tadros R, Kyndt F, Glinge C, Postema PG, et al. Enhancing rare variant interpretation in inherited arrhythmias through quantitative analysis of consortium disease cohorts and population controls. Genet Med. 2021 Jan;23(1):47\u0026ndash;58. \u003c/li\u003e\n\u003cli\u003eWhiffin N, Minikel E, Walsh R, O\u0026rsquo;Donnell-Luria AH, Karczewski K, Ing AY, et al. Using high-resolution variant frequencies to empower clinical genome interpretation. Genet Med. 2017 Oct;19(10):1151\u0026ndash;8. \u003c/li\u003e\n\u003cli\u003eKinsella RJ, K\u0026auml;h\u0026auml;ri A, Haider S, Zamora J, Proctor G, Spudich G, et al. Ensembl BioMarts: a hub for data retrieval across taxonomic space. Database (Oxford). 
2011;2011:bar030. \u003c/li\u003e\n\u003cli\u003eEdgar RC. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Research. 2004 Mar 8;32(5):1792\u0026ndash;7. \u003c/li\u003e\n\u003cli\u003eIqbal S, Br\u0026uuml;nger T, P\u0026eacute;rez-Palma E, Macnee M, Brunklaus A, Daly MJ, et al. Delineation of functionally essential protein regions for 242 neurodevelopmental disorders. Brain. 2022 Oct 18;awac381. \u003c/li\u003e\n\u003cli\u003eIoannidis NM, Rothstein JH, Pejaver V, Middha S, McDonnell SK, Baheti S, et al. REVEL: An Ensemble Method for Predicting the Pathogenicity of Rare Missense Variants. Am J Hum Genet. 2016 Oct 6;99(4):877\u0026ndash;85. \u003c/li\u003e\n\u003cli\u003eDavydov EV, Goode DL, Sirota M, Cooper GM, Sidow A, Batzoglou S. Identifying a High Fraction of the Human Genome to be under Selective Constraint Using GERP++. PLoS Comput Biol [Internet]. 2010 Dec 2 [cited 2019 Dec 29];6(12). Available from: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2996323/\u003c/li\u003e\n\u003cli\u003eFrazer J, Notin P, Dias M, Gomez A, Min JK, Brock K, et al. Disease variant prediction with deep generative models of evolutionary data. Nature. 2021 Nov;599(7883):91\u0026ndash;5. \u003c/li\u003e\n\u003cli\u003eHeyne HO, Baez-Nieto D, Iqbal S, Palmer DS, Brunklaus A, May P, et al. Predicting functional effects of missense variants in voltage-gated sodium and calcium channels. Sci Transl Med. 2020 Aug 12;12(556). \u003c/li\u003e\n\u003cli\u003eTokheim C, Bhattacharya R, Niknafs N, Gygax DM, Kim R, Ryan M, et al. Exome-scale discovery of hotspot mutation regions in human cancer using 3D protein structure. Cancer Res. 2016 Jul 1;76(13):3719\u0026ndash;31. \u003c/li\u003e\n\u003cli\u003eQuinodoz M, Peter VG, Cisarova K, Royer-Bertrand B, Stenson PD, Cooper DN, et al. Analysis of missense variants in the human genome reveals widespread gene-specific clustering and improves prediction of pathogenicity. Am J Hum Genet. 
2022 Mar 3;109(3):457\u0026ndash;70. \u003c/li\u003e\n\u003cli\u003eBrunklaus A, Du J, Steckler F, Ghanty II, Johannesen KM, Fenger CD, et al. Biological concepts in human sodium channel epilepsies and their relevance in clinical practice. Epilepsia. 2020;61(3):387\u0026ndash;99. \u003c/li\u003e\n\u003cli\u003eKelly MA, Caleshu C, Morales A, Buchan J, Wolf Z, Harrison SM, et al. Adaptation and validation of the ACMG/AMP variant classification framework for MYH7-associated inherited cardiomyopathies: recommendations by ClinGen\u0026rsquo;s Inherited Cardiomyopathy Expert Panel. Genet Med. 2018 Mar;20(3):351\u0026ndash;9. \u003c/li\u003e\n\u003cli\u003ePatel MJ, DiStefano MT, Oza AM, Hughes MY, Wilcox EH, Hemphill SE, et al. Disease-specific ACMG/AMP guidelines improve sequence variant interpretation for hearing loss. Genet Med. 2021 Nov;23(11):2208\u0026ndash;12. \u003c/li\u003e\n\u003cli\u003ePejaver V, Byrne AB, Feng BJ, Pagel KA, Mooney SD, Karchin R, et al. Calibration of computational tools for missense variant pathogenicity classification and ClinGen recommendations for PP3/BP4 criteria. Am J Hum Genet. 2022 Dec 1;109(12):2163\u0026ndash;77. \u003c/li\u003e\n\u003cli\u003eFeng BJ. PERCH: A Unified Framework for Disease Gene Prioritization. Hum Mutat. 2017 Mar;38(3):243\u0026ndash;51. \u003c/li\u003e\n\u003cli\u003eLoong L, Cubuk C, Choi S, Allen S, Torr B, Garrett A, et al. Quantifying prediction of pathogenicity for within-codon concordance (PM5) using 7541 functional classifications of BRCA1 and MSH2 missense variants. Genet Med. 2022 Mar;24(3):552\u0026ndash;63. 
\u003c/li\u003e\n\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":false,"hideJournal":false,"highlight":"","institution":"","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"[email protected]","identity":"genome-medicine","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"","sideBox":"Learn more about [Genome Medicine](https://genomemedicine.biomedcentral.com/)","snPcode":"13073","submissionUrl":"https://submission.springernature.com/new-submission/13073/3","title":"Genome Medicine","twitterHandle":"","acdcEnabled":true,"dfaEnabled":true,"editorialSystem":"stoa","reportingPortfolio":"BMC/SO AJ","inReviewEnabled":true,"inReviewRevisionsEnabled":true},"keywords":"Variant classification, Paralogs, Genetics, Missense variants","lastPublishedDoi":"10.21203/rs.3.rs-5434140/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-5434140/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003ch2\u003eBackground\u003c/h2\u003e \u003cp\u003eThe majority of missense variants in clinical genetic tests are classified as variants of uncertain significance. Prior research has shown that the deleterious effects and the subsequent molecular consequence of variants are often conserved among paralogous protein sequences within a gene family. Here, we systematically quantified on an exome-wide scale if the existence of pathogenic variants in paralogous genes at a conserved position could serve as evidence for the pathogenicity of a new variant. 
For the gene family of voltage-gated sodium channels where variants and expert-curated clinical phenotypes were available, we also assessed whether phenotype patterns of multiple disorders for each gene were conserved across variant positions within the gene family.\u003c/p\u003e\u003ch2\u003eMethods\u003c/h2\u003e \u003cp\u003eWe developed a framework that assesses the presence of pathogenic missense variants located in conserved residues across paralogous genes. We systematically mapped 2.5\u0026nbsp;million pathogenic and general population variants from the ClinVar, HGMD, and gnomAD databases onto a total of 9,990 genes and aligned them by gene families. We evaluated the quantity of classifiable amino acids by utilizing pathogenic variants identified in databases alone and then compared this assessment to the inclusion of paralogous pathogenic variants. We validated and quantified the evidence of conserved pathogenic paralogous variants in variant pathogenicity classification.\u003c/p\u003e\u003ch2\u003eResults\u003c/h2\u003e \u003cp\u003eConsidering conserved pathogenic variants in paralogous genes increased the number of classifiable variants 2.8-fold across the exome, compared to pathogenic variants in the gene of interest alone. The presence of a pathogenic variant in a paralogous gene is associated with a positive likelihood ratio of 8.32 for variant pathogenicity. The likelihood ratio was gene family-specific. Across ten genes encoding voltage-gated sodium channels and 22 expert-curated disorders, we identified cross-paralog correlated phenotypes based on 3D structure spatial position. 
For example, the established loss-of-function disorders \u003cem\u003eSCN1A\u003c/em\u003e-associated Dravet syndrome, \u003cem\u003eSCN2A\u003c/em\u003e-associated autism, \u003cem\u003eSCN5A\u003c/em\u003e-associated Brugada syndrome, and \u003cem\u003eSCN8A\u003c/em\u003e-associated neurodevelopmental disorder without seizures were correlated in their spatial variant position on structure. Finally, we show that phenotype integration in paralog variant selection improves variant classification.\u003c/p\u003e\u003ch2\u003eConclusion\u003c/h2\u003e \u003cp\u003eOur results show that paralogous variants, in particular when combined with phenotype information, can enhance our understanding of variant effects.\u003c/p\u003e","manuscriptTitle":"Conserved missense variant pathogenicity and correlated phenotypes across paralogous genes","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2024-11-29 11:10:40","doi":"10.21203/rs.3.rs-5434140/v1","editorialEvents":[{"type":"communityComments","content":0},{"type":"decision","content":"Revision 
requested","date":"2025-02-21T11:37:35+00:00","index":"","fulltext":""},{"type":"editorInvitedReview","content":"","date":"2025-02-20T00:50:03+00:00","index":"hide","fulltext":""},{"type":"editorInvitedReview","content":"","date":"2025-02-16T18:20:25+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"87904573219590936593237626770192528534","date":"2025-01-26T14:55:23+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"164982636973021459592959222769400686997","date":"2025-01-25T18:47:06+00:00","index":"hide","fulltext":""},{"type":"editorInvitedReview","content":"","date":"2024-12-16T14:35:15+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"77179675111709364821955865596686553211","date":"2024-12-09T17:20:05+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"191034642557923130259929200859286259718","date":"2024-12-09T16:40:39+00:00","index":"hide","fulltext":""},{"type":"reviewersInvited","content":"","date":"2024-12-09T14:52:33+00:00","index":"","fulltext":""},{"type":"editorInvited","content":"","date":"2024-11-15T08:38:54+00:00","index":"","fulltext":""},{"type":"editorAssigned","content":"","date":"2024-11-14T15:03:46+00:00","index":"","fulltext":""},{"type":"checksComplete","content":"","date":"2024-11-12T10:52:50+00:00","index":"","fulltext":""},{"type":"submitted","content":"Genome Medicine","date":"2024-11-11T18:31:04+00:00","index":"","fulltext":""}],"status":"published","journal":{"display":true,"email":"[email protected]","identity":"genome-medicine","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"","sideBox":"Learn more about [Genome Medicine](https://genomemedicine.biomedcentral.com/)","snPcode":"13073","submissionUrl":"https://submission.springernature.com/new-submission/13073/3","title":"Genome Medicine","twitterHandle":"","acdcEnabled":true,"dfaEnabled":true,"editorialSystem":"stoa","reportingPortfolio":"BMC/SO 
AJ","inReviewEnabled":true,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"a864701e-9daf-4cb8-8595-e695ec58fd30","owner":[],"postedDate":"November 29th, 2024","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"in-revision","subjectAreas":[],"tags":[],"updatedAt":"2025-02-21T11:53:11+00:00","versionOfRecord":[],"versionCreatedAt":"2024-11-29 11:10:40","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-5434140","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-5434140","identity":"rs-5434140","version":["v1"]},"buildId":"qtupq5eGEP_6zYnWcrvyt","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}

