Tackling the role of rare functional variation in inflammatory bowel disease through application of GenePy2 as a potential DNA biomarker

2024 · doi:10.21203/rs.3.rs-4415057/v1

preprint OA: gold CC-BY-4.0


Sarah Ennis, Guo Cheng, James Ashton, R. Mark Beattie, Andrew Collins

This is a Research Square preprint; it has not been peer reviewed by a journal. This work is licensed under a CC BY 4.0 License. Status: Version 1 posted.

Abstract

Rare and common variants often converge in the pathogenic pathway of inflammatory bowel disease (IBD), a heterogeneous autoimmune condition with genomic and environmental influences.
We identified 794 functionally targeted genes and linkage-disequilibrium-mapped blocks (LDBs) implicated by genome-wide association studies (GWAS), then developed GenePy2, a burden score that integrates the functional impact of rare variants for each gene/LDB, using exome data from the UK Biobank phase 2 IBD cohort. Through case/control 2-way Mann-Whitney U tests tuned on subpopulations with extreme GenePy2 scores, 34 genes/LDBs in Crohn’s disease (CD) and 25 in ulcerative colitis (UC) survived significance testing, confirming roles for rare functional variants. The optimal GenePy2 threshold was then pinpointed for each gene/LDB based on the tests’ maximum effect size. Further itemset association mining of the binarised GenePy2 scores detected an intriguing co-occurrence of extreme scores of the risk gene NOD2 and the protective gene IL23R in controls, whereas the two are mutually exclusive in CD patients, implicating a ‘rescue’ from disease by protective rare variants.

Subject areas: Health sciences/Risk factors; Health sciences/Gastroenterology/Gastrointestinal diseases/Inflammatory bowel disease

Keywords: IBD, Crohn’s disease, Ulcerative colitis, GWAS, GenePy2, pathogenic burden, genetics

INTRODUCTION

Inflammatory bowel disease (IBD) is a chronic, highly heterogeneous inflammatory condition resulting from an aberrant immune response to environmental triggers in genetically susceptible individuals [1]. The disease is commonly classified into two subtypes, Crohn’s disease (CD) and ulcerative colitis (UC), based on clinical findings, yet the clinical phenotype is much more varied, which hinders effective treatment. Genome-wide association studies (GWASs) of common variants (minor allele frequency (MAF) > 3~5% in the general population) in IBD, driven by large cohorts including the UK Biobank [2], have identified over 300 IBD-associated loci, which both shed light on the IBD genetic landscape and implicate pathways [1].
However, with modest effect sizes, common variants altogether explain only a minor fraction of the observed IBD heritability, and most GWAS variants are not causal but rather proxies for the causative variation through linkage disequilibrium (LD) [3]. This limits the clinical translation of GWAS findings. Rare variants (RVs) with major effect sizes represent key genomic drivers with known associations with IBD [4, 5]. Although methodologically challenging [4], a growing number of RVs have been statistically identified in complex IBD and functionally validated in monogenic IBD, the latter manifesting as a rare Mendelian subtype of IBD with familial clustering, often with specific additional features such as immunodeficiency [6, 7]. Associations of RVs often overlap with those of common variants [8]. In complex IBD, evidence suggests that RVs may be associated with disease through a second-hit mechanism, in which RVs in the pathogenic pathway/gene act in addition to a GWAS association, or via synthetic association, in which RVs underpin the GWAS signal in the LD region; both mechanisms may hold the key to deciphering the role of RVs in disease [3, 9–11]. This means that both the genes targeted by the GWAS association and the region delineated by the LD block (LDB) that encompasses the association signal can host the disease-causal RVs. LDBs can vary among ethnic groups, whilst targeted genes, whether under the influence of expression quantitative trait loci (eQTLs) or physically adjacent to the variant, cannot always be clearly defined because of pleiotropic effects and epigenetic modifications [12–15]. Within a targeted genomic region, a burden-based test of selected RVs, e.g. missense or loss-of-function variants, is the norm for checking case versus control associations, for example with the well-established sequence kernel association test (SKAT/SKAT-O) [16].
However, refined pathogenicity weighting of the variants in burden tests can be essential to elucidate the role of RVs in GWAS loci and in disease. Take, for instance, the ‘mendelian-complex’ genes, the causal genes that overlap between complex IBD and monogenic IBD: while damaging mutations of these genes cause a severe phenotypic presentation as monogenic IBD, variants of modest effect can predispose to a milder polygenic form of the phenotype, as identified from GWASs [6]. In this case the variance in pathogenicity among variants of the same gene underlies the vastly different phenotypic presentations, which implies that integrating a deleteriousness score into burden tests is essential. Methods based on this idea, represented by the GenePy score, which integrates mutation load, allele frequency and the pathogenicity score of individual variants, have been successfully applied in both clinical genetics and machine learning models based on small cohorts [17, 18]. In this study we developed GenePy2 to handle rare variant data from large cohorts and tested it as a prototype DNA biomarker for IBD. This was followed by investigations of disease association and a personalized examination of patients’ genetic landscape of disease. Analyses were carried out on the UK Biobank cohort and tested on GWAS association regions of IBD.

MATERIALS AND METHODS

The UK BioBank IBD cohort

The analysis is based on the UK Biobank phase 2 dataset (project 72911), encompassing exome sequencing and detailed phenotype information from approximately 200,000 participants, which was publicly released in October 2020 [19, 20]. Participants who have withdrawn were excluded from the analysis. The exomes were captured using the IDT xGen Exome research panel V1.0, designed to target 39 Mbp of the human genome. To ensure data quality, additional quality control (QC) metrics were applied to the project-VCF (pVCF).
A detailed workflow of this process, along with a list of immune-related diseases curated by the clinic and informatics team, is presented in Fig. 1a and in the supplementary methods (Table S1). Patients or the public were not involved in the design, conduct, reporting, or dissemination plans of our research.

Curation of the IBD-associated genomic variants

UC-, CD-, or IBD-associated single-nucleotide polymorphisms (SNPs) with an association p-value of at most 5 × 10⁻⁸ were retrieved from the GWAS Catalogue v1.09 [21]. Through a literature review conducted in June 2023, we refined the dataset by excluding associations derived from case-case studies, associations related to disease subtypes other than UC and CD, and those identified in non-European populations. The PubMed search query used for the literature review was: (((((("Crohn's disease") OR "inflammatory bowel disease") OR "Ulcerative Colitis")) AND (("genome-wide association"[Title/Abstract]) OR "genome-wide association"[Title/Abstract]))) AND ("1000"[Date - Create] : "2023/06/07"[Date - Create]). For each association SNP, we first examined its physically mapped genes (mappedGenes), using the same approach as the GWAS Catalogue [21]. We identified cis-regulated genes (eGenes) using data from the GTEx database V8 (https://www.gtexportal.org/), by extracting those association SNPs that function as expression quantitative trait loci (eQTLs) in tissues including transverse colon, sigmoid colon, small intestine (terminal ileum), EBV-transformed lymphocytes, fibroblasts, and whole blood [22]. To delineate the linkage disequilibrium (LD) blocks (LDBs) associated with IBD, we projected the locus of each association SNP onto the LD unit map of the European population. We then applied a sliding window of 1 LD unit (LDU), whereby loci within a 1-LDU distance of each other were grouped into one LDB [14, 15].
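The projection and grouping step can be illustrated with a minimal sketch. It assumes the LDU map is available as a sorted list of (base-pair position, LDU) markers, that projection is done by linear interpolation with the nearest marker used outside the map range, and that "within a 1-LDU distance" means chaining neighbouring loci whose LDU gap is at most 1. The function names, SNP identifiers and toy map below are hypothetical and are not taken from the study's pipeline.

```python
from bisect import bisect_left

def bp_to_ldu(pos, markers):
    """Project a base-pair position onto an LDU map.

    markers: list of (bp, ldu) tuples sorted by bp, with distinct bp values.
    Uses linear interpolation between flanking markers (an assumption) and
    the nearest marker when pos lies outside the mapped range.
    """
    bps = [bp for bp, _ in markers]
    i = bisect_left(bps, pos)
    if i == 0:
        return markers[0][1]
    if i == len(markers):
        return markers[-1][1]
    (bp0, ldu0), (bp1, ldu1) = markers[i - 1], markers[i]
    return ldu0 + (ldu1 - ldu0) * (pos - bp0) / (bp1 - bp0)

def group_into_blocks(snp_ldu, window=1.0):
    """Chain loci, ordered by LDU, into blocks while the gap is <= window."""
    blocks = []
    last = float("-inf")
    for snp, ldu in sorted(snp_ldu.items(), key=lambda kv: kv[1]):
        if blocks and ldu - last <= window:
            blocks[-1].append(snp)
        else:
            blocks.append([snp])
        last = ldu
    return blocks

# Toy LDU map and SNP positions (hypothetical values).
toy_map = [(1000, 0.0), (2000, 0.5), (5000, 2.5), (9000, 2.6)]
toy_snps = {"snp_a": 1500, "snp_b": 2000, "snp_c": 5000, "snp_d": 8000, "snp_e": 20000}
toy_ldu = {snp: bp_to_ldu(bp, toy_map) for snp, bp in toy_snps.items()}
toy_blocks = group_into_blocks(toy_ldu)
```

In this toy case the first two SNPs fall within 1 LDU of each other and form one block, while the remaining three form a second block. Whether the study chains neighbours in this way or anchors the window differently is not stated in the text, so the grouping rule should be read as one plausible interpretation.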
In cases where direct interpolation of a locus was not feasible, we used the position of the most adjacent marker instead. Such LDBs and targeted genes were identified as GWAS loci in this study. Monogenic IBD genes retrieved from the literature were also included in the analysis. The curation process for all candidate loci is illustrated in Fig. 1B.

Per-locus GenePy score for the IBD cohort

GenePy v2.0 was developed to cope with large cohort data by addressing the following issues: 1) incorporation of multi-allelic variants into the score (maximum number of alternative alleles = 10); 2) calculation of scores for various genomic regions, such as LDBs; 3) reduction of computational cost, with optional processing on GPU; 4) optional selection of variants that are pathogenic or likely pathogenic. The score assesses the pathogenicity potential of each variant allele in addition to the variant load of a genomic region, integrating information from the Combined Annotation-Dependent Depletion (CADD V1.6) score and the population allele frequency observed in the 200K participants [23]. The GenePy2 score was calculated for each candidate gene or LDB, based on likely-pathogenic variants with CADD phred_score ≥ 15, for every individual in the cohort. Details of the calculation are described in the supplementary methods. The GenePy2.0 pipeline is open source and can be accessed at https://github.com/UoS-HGIG/GenePy-2

LDB/gene-based mutation test

A GenePy2 score-based Mann-Whitney U test was conducted alongside other burden and threshold-based tests (supplementary methods).
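To make the idea of a per-locus, per-individual score concrete, the sketch below computes a GenePy-style burden for one individual over one gene or LDB. It is loosely modelled on the general form used by the earlier GenePy score (a deleteriousness weight multiplied by the negative log of the frequencies of the two carried alleles, summed over variants) and is not the GenePy2 implementation, whose calculation is given in the supplementary methods and may differ. The biallelic simplification, the dictionary keys, the pre-scaled `weight` in [0, 1] and the function name are all assumptions made for illustration.

```python
import math

def genepy_like_score(variants, min_phred=15.0):
    """Sum a deleteriousness-weighted rarity term over the variants carried.

    variants: list of dicts for ONE individual and ONE locus, each with
      alt_count  - number of alternative alleles carried (0, 1 or 2)
      alt_freq   - cohort frequency of the alternative allele, in (0, 1)
      cadd_phred - CADD phred score, used only for filtering here
      weight     - deleteriousness weight in [0, 1] (scaling is assumed)
    Only biallelic sites are handled in this sketch.
    """
    score = 0.0
    for v in variants:
        if v["alt_count"] == 0 or v["cadd_phred"] < min_phred:
            continue  # not carried, or below the likely-pathogenic filter
        f_alt = v["alt_freq"]
        f_ref = 1.0 - f_alt
        # Heterozygote: one ref and one alt allele; homozygote: two alt alleles.
        f1, f2 = (f_alt, f_alt) if v["alt_count"] == 2 else (f_ref, f_alt)
        score += -v["weight"] * math.log10(f1 * f2)
    return score

# Hypothetical individual carrying two qualifying variants in one locus.
example_variants = [
    {"alt_count": 1, "alt_freq": 0.001, "cadd_phred": 25.0, "weight": 0.5},
    {"alt_count": 2, "alt_freq": 0.01, "cadd_phred": 18.0, "weight": 1.0},
    {"alt_count": 1, "alt_freq": 0.2, "cadd_phred": 5.0, "weight": 0.1},  # filtered out
]
example_score = genepy_like_score(example_variants)
```

Under this form an individual carrying no qualifying variant scores zero, and rarer or more deleterious variants contribute more, which is consistent with the sparse score matrix described in the Results.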
To account for the genetic heterogeneity of IBD (for example, NOD2, the most commonly associated gene, is estimated to account for 7.5% of Crohn’s disease cases [7, 24]), we tapered the test from all individuals to those with non-zero scores and then to those with the top 7.5%, 5%, 2.5% and 1% of GenePy scores, selected in cases and controls separately, to provide a more statistically robust characterisation of the contribution of each gene to disease pathogenesis. Each test was followed by a permutation test with 10⁵ permutations to address confounding caused by population stratification. The effect size of the Mann-Whitney U test was evaluated using the Mann-Whitney parameter, ϕ = Pr[X < Y] + 0.5 Pr[X = Y], with bootstrap resampling to estimate its confidence interval [25–27]. All association tests used the same sets of variants and identical LDB/gene coordinates, focusing on pathogenic variants with a CADD phred_score > 15. Mann-Whitney U tests are based on the scikit-learn library of Python 3.7 [28].

Itemset analysis

GenePy status was defined as follows: individuals in the sub-population (i.e. those with the top 7.5%, 5%, 2.5%, or 1% highest scores, or all those with non-zero scores) in which the maximum effect size was observed in the GenePy-based Mann-Whitney U test were deemed GenePy positive for the tested locus, and all others, with lower scores, GenePy negative. The binarization was conducted for UC cases/controls and CD cases/controls separately. The GenePy status of associated loci (Mann-Whitney U test permutation p < 5.65 × 10⁻⁵, addressing multiple testing) was analysed by association rule mining, an unsupervised learning approach, via the Apriori algorithm [29, 30], as implemented in the arules and arulesViz packages of R [31, 32]. To reduce the sparsity of the data, individuals without any positive GenePy status were removed before the association mining. Itemset support (i.e. frequency), lift, and confidence were examined in both the CD and UC cohorts, and for cases and controls separately, to understand the pattern of co-occurrence of association loci and to explore the potential epistatic effects of risk and protective variants. The minimum itemset support for the analysis was 0.0001, and the minimum confidence was 0.5.

RESULTS

LDB and genes in association with IBD

A total of 546 IBD-associated single-nucleotide polymorphisms (SNPs) were identified from 35 association studies (Table S2), corresponding to 718 GWAS genes. This set includes 413 mappedGenes and 448 eGenes, with an overlap of 143 genes, as depicted in Fig. 1B and Table S2. Notably, 13 of the 104 monogenic IBD genes (monoGenes) are GWAS genes, i.e. ‘mendelian-complex’ genes, a significant intersection (Fisher’s exact test; protein-coding genes only, p = 6.72 × 10⁻⁶). Functional gene set enrichment analysis revealed similar enrichment of both GWAS genes and monoGenes in immune-related pathways (Table S3; Figure S1), in line with the anticipated convergence of molecular pathogenic pathways in monogenic and complex IBD. Another feature of the GWAS genes is the enrichment of non-protein-coding pseudogenes (n = 157), which make up 26.39% of mappedGenes and 12.95% of eGenes. This is consistent with the overrepresentation of pseudogenes in the reference used, GENCODE V43 [33, 34], and with our impartial SNP-gene mapping approach, which gives no preference to protein-coding genes or known IBD genes. Whilst there is growing knowledge of their association with disease and immune regulation [35], the majority of the pseudogenes are not covered by the exome capture kit (n = 116; Fig. 1b). Using the European-based LDU map [14], the 546 GWAS SNPs were categorized into 260 LDBs, 150 of which consist of a single association SNP (IBD-association p < 5 × 10⁻⁸), with the remainder defined by ≥ 2 GWAS association SNPs.
The LDBs span from 1.00 to 3.20 LDU, or 3,630 bp to 3,246,717 bp in physical size. The largest LDB, LDB78b, is located at 5q31.1 (Table S2) and encompasses 6 GWAS association SNPs, which include eQTLs of MEIKIN. LDB78a, despite being more than 1 LDU away from LDB78b, encompasses another IBD-associated eQTL of MEIKIN. Such LDBs, which share a common gene through the association SNPs they encompass, are defined as clusters of LDBs (n = 21). As might be expected, the most significant cluster is the HLA region at 6p21.32-33, comprising 7 LDBs (Table S2). One hundred and ninety-four LDBs are captured by the exome sequencing assay. These LDBs encompass the complete sequence of 313 GWAS genes, partially overlap with 201 GWAS genes, and have no intersection with the other 204 GWAS genes. LDBs can also extend beyond mappedGenes and eGenes. For instance, LDB187 at 16q12.1, delineated by 5 GWAS SNPs, covers CYLD, a monoGene but not a GWAS gene, in addition to NOD2 and CYLD-AS1 (Figure S2). The GWAS genes, LDBs and monoGenes together account for 885 target regions to be tested, as the component LDBs within an LDB cluster are tested separately.

The UK Biobank IBD cohort

Following QC, ethnicity- and phenotype-based filtration retained 891 CD cases, 1,409 UC cases, and 60,118 controls. Most of the IBD diagnoses were made in adulthood, whilst 37 CD and 33 UC patients were diagnosed at or before 18 years of age. Further demographic and sub-phenotypic features of the UC and CD patients were derived from the ICD-10 diagnosis codes, as shown in Table 1.
Table 1. Demographic characteristics of the European UK BioBank cohort for the analysis

| Characteristic | CD (n = 891) | UC (n = 1,409) | Controls (n = 60,118) |
|---|---|---|---|
| Demographics | | | |
| Male | 387 | 730 | 32,925 |
| Female | 504 | 679 | 27,193 |
| Age at latest assessment, years: median (IQR) | 59 (51–64) | 61 (54–65) | 58 (50–63) |
| Age at diagnosis, years: median (IQR) | 57 (49–65) | 51 (49–67) | NA |
| Disease subtypes | | | |
| CD: small intestine | 174 | | NA |
| CD: large intestine | 197 | | NA |
| CD: both small & large intestine | 47 | | NA |
| CD: unspecified | 473 | | NA |
| UC: ileocolitis | | 6 | NA |
| UC: proctitis | | 192 | NA |
| UC: rectosigmoiditis | | 81 | NA |
| UC: proctitis & rectosigmoiditis | | 25 | NA |
| UC: unspecified | | 1105 | NA |
| GI complications | | | |
| Fistula disease | 57 | 30 | NA |
| Stricturing disease | 138 | 57 | NA |
| Colon cancer | 26 | 43 | NA |
| Megacolon disease | 4 | 6 | NA |
| Comorbidities with other autoimmune diseases | | | |
| n = 1 | 111 | 151 | NA |
| n = 2 | 15 | 15 | NA |
| n ≥ 3 | 1 | 5 | NA |

Pathogenic mutations of GWAS association loci and monogenic IBD genes

All but 10 of the GWAS-derived set of 794 targets host ≥ 1 variant with CADD phred_score ≥ 15 in the cohort, and all the monoGenes were mutated in at least 1 patient. Despite this, pathogenic variants were very sparsely distributed among the patients. Approximately half of the tested loci had a non-zero GenePy score in fewer than 5% of patients: this was observed for 416 (52.39%) of the GWAS loci and 46 (44.23%) of the monoGenes in CD patients, and similarly for 425 (53.53%) GWAS loci and 48 (46.15%) monoGenes in UC. With more than half of its values being zero, the per-locus/per-individual GenePy score matrix is sparse for downstream analysis. The most mutated genes are the 13 known ‘mendelian-complex’ IBD genes: 8 (61.53%) are mutated in > 5% of both UC and CD patients, the exceptions being CD40, IL2RA, IL10, STAT3 and LACC1, which are rarely mutated in either UC or CD. This sparsity of non-zero GenePy scores among patients corresponds to the genetic heterogeneity of IBD and is the rationale for the following GenePy-based association tests on the subpopulations with the highest scores.

Association of the candidate regions with disease

Under the monogenic IBD model, two significant associations with opposing effects on disease are observed for CD: NOD2, which confers risk under the recessive model, and IL23R, which is protective under both the dominant and additive inheritance models. Both are known IBD genes, with NOD2 also being a ‘mendelian-complex’ IBD gene. No significant associations with UC were detected by this test (Figure S3). The burden-based SKAT-O test highlighted RIPK2-DT as the gene most significantly associated with UC; it is a noncoding eGene associated with the IBD-association SNP rs7015630. RIPK2-DT plays a role in mitigating inflammation induced by free fatty acids but is less well known in IBD than its downstream gene RIPK2 [36, 37]. The RIPK2-DT association was not detected by the GenePy-based rank sum tests. The GenePy-based Mann-Whitney U test uncovered 35 loci in significant association with CD and 25 with UC (Fig. 2; Figure S4). HLA-DQA1 and HLA-DQB2 are the genes most significantly associated with UC, in cases and controls with the top 7.5% of GenePy scores, albeit with modest effect sizes (HLA-DQA1: ϕ = 0.63, CI [0.59, 0.67]; HLA-DQB2: ϕ = 0.66, CI [0.63, 0.70]) compared to other associated genes, e.g. SLC17A1 (ϕ = 0.81, CI [0.73, 0.88]) or the monoGene LIG4 (ϕ = 0.82, CI [0.74, 0.89]). NOD2, together with the co-located LDB187 and the CYLD-AS1 gene (Figure S2), but not the monoGene CYLD, is the most significantly associated with CD (Fig. 2). Such associations, driven by rare pathogenic variants (CADD phred_score ≥ 15), show larger effect sizes than those identified in the original GWAS, with both protective and risk effects observed, and the effects tend to be larger when the affected sub-population is smaller (Fig. 2).
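The effect sizes quoted above are values of the Mann-Whitney parameter ϕ = Pr[X < Y] + 0.5 Pr[X = Y] defined in the Methods, computed on the highest-scoring fraction of each group. The following is a minimal standard-library sketch of that calculation. It assumes X denotes control scores and Y case scores (so ϕ > 0.5 means cases tend to score higher), uses a simple percentile bootstrap for the confidence interval, and counts pairs directly, which is only practical for small inputs. None of these choices, nor the function names, are confirmed by the text; the study cites specific interval methods [25–27] that may differ.

```python
import random

def mann_whitney_phi(x, y):
    """Pr[X < Y] + 0.5 * Pr[X = Y], estimated by direct pair counting."""
    less = sum(1 for a in x for b in y if a < b)
    ties = sum(1 for a in x for b in y if a == b)
    return (less + 0.5 * ties) / (len(x) * len(y))

def top_fraction(scores, frac):
    """Return the highest-scoring fraction of a group (at least one value)."""
    k = max(1, round(len(scores) * frac))
    return sorted(scores, reverse=True)[:k]

def bootstrap_ci(x, y, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for phi (an assumed, simple choice)."""
    rng = random.Random(seed)
    stats = sorted(
        mann_whitney_phi(rng.choices(x, k=len(x)), rng.choices(y, k=len(y)))
        for _ in range(n_boot)
    )
    lower = stats[int(n_boot * alpha / 2)]
    upper = stats[int(n_boot * (1 - alpha / 2)) - 1]
    return lower, upper

# Toy scores (hypothetical): controls as X, cases as Y.
toy_controls = [0, 1, 2, 3]
toy_cases = [2, 3, 4, 5]
toy_phi = mann_whitney_phi(toy_controls, toy_cases)
toy_ci = bootstrap_ci(toy_controls, toy_cases, n_boot=200)
```

In a tapered analysis, `top_fraction` would be applied to cases and controls separately (for example with `frac=0.075`) before calling `mann_whitney_phi`, mirroring the per-group selection described in the Methods.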
Notably, although the smallest p value for NOD2 was observed in individuals with the top 7.5% highest scores (p = 1.41 × 10⁻¹⁷, ϕ = 0.80, CI [0.77, 0.83]), the maximum effect size was observed in those with the top 2.5% highest scores (p = 5.13 × 10⁻⁷, ϕ = 0.81, CI [0.76, 0.86]). The eGene NOTCH1 and the mappedGene CARD9 at locus 9q34.3 are tagged by the association SNPs encompassed by LDB131a/b, and both exhibited significant association with CD, evincing pleiotropic effects at the gene level of a GWAS association locus. In another case, LDB189, which constitutes the portion of the PLCG2 gene encompassing the phospholipase domain, is significantly associated with CD with protective effects, whereas the entire PLCG2 gene is not (Fig. 3), in line with the GWAS findings [38].

Set of highly mutated genes in IBD and controls

The rare variant-based associations detected in both UC and CD appeared to exert both protective and risk effects, and given the large effect sizes, some cases of the disease may constitute oligogenic pathogenesis. We tested this using an itemset association analysis with the Apriori algorithm, in which patients carrying a GenePy2 score higher than the cut-off applied in the association tests were considered GenePy positive for a mutant gene or LDB (Table S5-8). The test was conducted on 398 CD patients and 28,017 controls with any positive GenePy status among the 34 CD-associated genes/LDBs, and similarly on 613 UC patients and 25,748 controls with positive GenePy status among the 25 UC association loci. The GenePy statuses of LDBs/genes within the same GWAS association region tend to be associated because of the existing intra-region overlaps (Table S2 and Table S5-8). Between GWAS regions, considerable coexistence of ‘positive’ GenePy status for LDB187/NOD2 and IL23R/LDB6 was observed in controls (Fig. 4A). This coexistence was completely absent from the itemsets observed in CD cases, with GenePy(+) status of the NOD2 and IL23R regions being mutually exclusive in CD patients (Fig. 4A; Table S5-S6). IL23R and its genomic region also showed the strongest associations with other regions in controls for the UC-associated genes/LDBs (Fig. 4B; Table S7-8), indicating counter-risk effects in both IBD subtypes, although the observation in UC may be biased because the sub-population with IL23R-positive GenePy status constituted 14.08% of the UC cohort (Fig. 3C).

DISCUSSION

Both single DNA variants and the aggregated effects of multiple variants have been used for disease risk stratification [39, 40], but a biomarker based on rare, functional genomic variants is missing for complex disease despite their potentially direct causal effect on disease. Filling the gap relies on large cohorts, but big genomic data bring complex variation, e.g. multi-allelic variation and variants of unknown significance. Based on the UK BioBank cohort, we tackled such complexities using an evolved GenePy2.0 with greater computational efficiency and flexibility, and then tested it on known GWAS loci represented by common variant-based associations. A tailored analysis of IBD was performed, and the results demonstrated significant enrichment of associations represented by the GenePy score, with both risk and protective effects on disease occurrence, which will change our previous outlook on the genetic architecture of IBD. This also exemplifies a new approach to tackling the relationship between GWAS common variants (CVs) and rare variants. IBD is the archetypal ‘complex’ disease, with genetic heterogeneity leading to distinct underlying aetiologies of disease pathogenesis within individuals and governing the role of both triggering and ongoing environmental drivers of disease. In addition to the plethora of GWAS findings that shed light on the genetic pathogenic pathways of the disease, recent analyses of large numbers of patients with WES data have continued to advance knowledge and have implicated more rare variation in pathogenesis [4].
Here we build upon the ability to assess rare variation by applying statistical analysis to determine the maximal contribution of each GWAS locus to IBD pathogenesis within the cohort. We did not limit our view of GWAS to genes, but instead followed a naïve approach, revisiting the SNP associations with LD mapping in addition to evidence-based physical and eQTL mapping of candidate genes. This introduced pseudogenes and intergenic LDBs, which are undesirable targets for WES-based downstream analysis because many of the candidates are not captured by the sequencing assay, not to mention that many are less studied. However, it has also led to novel discoveries. Our analysis points to variation across the entire NOD2-associated LDB, rather than just the gene, as being significantly associated with Crohn’s disease, suggesting important roles for regulatory regions in addition to established coding variants. Similarly, our analysis localises an association in PLCG2 to only part of the gene, which could be used to better understand the underlying biological process through which variants lead to disease. The pseudogenes RIPK2-DT and CYLD-AS1 also stand out in the association tests, which may indicate novel pathogenic gene pathways in IBD. The discovery of associations has been significantly promoted by GenePy2. By capturing the role of rare variation at an individual level, this technique makes it possible both to determine the relative contributions of associated genes to IBD pathogenesis across a cohort and to identify, at an individual level, patients in whom a specific gene (or set of genes) makes a statistical contribution to disease compared to other patients. This opens the possibility of personalising the molecular diagnosis for an individual patient and of identifying genomic biomarkers of disease. By taking the subset of individuals with the highest GenePy scores, we can tackle the genetic heterogeneity of IBD with a straightforward approach.
For instance, rank-based comparison recovered the most significant association of the NOD2 locus for the CD patients with the top 7.5% of GenePy scores, concurring with previous findings, although we found that the largest hazard effect of NOD2 mutation is for the more extreme top 1% of scores. Not all ‘pathogenic’ variants are causal for IBD, as we found both risk and protective effects in the CD and UC cohorts. This is consistent with the evolutionary picture of autoimmune disease [41], and the directionality of genetic variants may be addressed in the near future by burden-based association tests that annotate gain-of-function, loss-of-function and dominant negative effects into the GenePy score. Interestingly, the effect sizes of the GenePy score-based tests are much larger than the GWAS findings on index SNPs, providing the possibility for the scoring tool to be applied as a potential biomarker for implicated genomic () counter each other when occurring in the same individual, as we observed in controls with positive NOD2 GenePy status being also positive for IL23R. Furthermore, identifying this pattern implicates an oligogenic picture of IBD for some patients, with disease aetiology lying between complex and monogenic IBD. Whilst the UK Biobank cohort provides many advantages, including its large size and rich phenotyping data, WES data are by nature not ideal for the analysis of all GWAS targets, as many of the associations lie in noncoding regions, as observed for a large proportion of the LDBs in this study. WGS may provide the opportunity for improvement in both methods and discovery, and in the application of these methods. Another area of potential weakness in UK Biobank data is the precision of the clinical phenotyping, which impedes subtype or genotype-phenotype correlation analysis even with GenePy scores of large effect.
In this study we have attempted to identify genomic associations of specific IBD subtypes and are therefore reliant on the accuracy of clinical data to make correct associations. It is also important to recognise that quality control of phenotypes by specific researchers is not possible and we have used the available data to categorise IBD patients, and to identify controls that are reported to have no other autoimmune conditions. With approved access to the Phase 3 UK Biobank data (project ID140070) and other IBD data, we are looking to replicate the GenePy-based findings on IBD and other diseases, with testing and development of GenePy as a potential DNA biomarker representing rare functional variants for complex diseases. Declarations Acknowledgements This study is funded by AGENDA EPSRC funding on AI health research (EP/Y01720X/1) and was supported by the National Institute for Health Research (NIHR) Southampton Biomedical Research Centre. The views expressed are those of the author(s) and not necessarily those of the NIHR or the Department of Health and Social Care. The authors acknowledge the use of the IRIDIS High Performance Computing Facility, and associated support services at the University of Southampton, in the completion of this work. JJA is funded by an NIHR Advanced fellowship. Authors contributions CG and SE designed and presented the idea. GC executed the analysis and code development. SE, JA, AC, and MB verified the analytical methods. JA and MB supervised the clinical data interpretation and commented on the phenotypic data quality issue; JA helped a lot on the writing of this manuscript and methods of data presentation. References Graham, D.B. and R.J. Xavier, Pathway paradigms revealed from the genetics of inflammatory bowel disease. Nature, 2020. 578 (7796): p. 527-539. Jiang, L., et al., A generalized linear mixed model association tool for biobank-scale data. Nat Genet, 2021. 53 (11): p. 1616-1621. 
Uffelmann, E., et al., Genome-wide association studies. Nature Reviews Methods Primers, 2021. 1 (1). Sazonovs, A., et al., Large-scale sequencing identifies multiple genes and rare variants associated with Crohn's disease susceptibility. Nat Genet, 2022. 54 (9): p. 1275-1283. Gettler, K., et al., Common and Rare Variant Prediction and Penetrance of IBD in a Large, Multi-ethnic, Health System-based Biobank Cohort. Gastroenterology, 2021. 160 (5): p. 1546-1557. Bolton, C., et al., An Integrated Taxonomy for Monogenic Inflammatory Bowel Disease. Gastroenterology, 2022. 162 (3): p. 859-876. Ashton, J.J., et al., Genetic Sequencing of Pediatric Patients Identifies Mutations in Monogenic Inflammatory Bowel Disease Genes that Translate to Distinct Clinical Phenotypes. Clinical and Translational Gastroenterology, 2020. 11 . Zhou, D., et al., A phenome-wide scan reveals convergence of common and rare variant associations. Genome Medicine, 2023. 15 (1). Dickson, S.P., et al., Rare Variants Create Synthetic Genome-Wide Associations. Plos Biology, 2010. 8 (1). Goldstein, D.B., The Importance of Synthetic Associations Will Only Be Resolved Empirically. Plos Biology, 2011. 9 (1). Wray, N.R., S.M. Purcell, and P.M. Visscher, Synthetic Associations Created by Rare Variants Do Not Explain Most GWAS Results. Plos Biology, 2011. 9 (1). Bail, P., How Life Works:A User’s Guide to the New Biology . 2023. Noble, D., It’s time to admit that genes are not the blueprint for life. Nature, 2024. 626 : p. 254-255. Vergara-Lope, A., et al., Linkage disequilibrium maps for European and African populations constructed from whole genome sequence data. Sci Data, 2019. 6 (1): p. 208. Zhang, W.H., et al., Properties of linkage disequilibrium (LD) maps. Proceedings of the National Academy of Sciences of the United States of America, 2002. 99 (26): p. 17004-17007. 
16. Lee, S., et al., Optimal Unified Approach for Rare-Variant Association Testing with Application to Small-Sample Case-Control Whole-Exome Sequencing Studies. American Journal of Human Genetics, 2012. 91(2): p. 224-237.
17. Stafford, I.S., et al., Supervised Machine Learning Classifies Inflammatory Bowel Disease Patients by Subtype Using Whole Exome Sequencing Data. J Crohns Colitis, 2023. 17(10): p. 1672-1680.
18. Seaby, E.G., et al., A gene pathogenicity tool 'GenePy' identifies missed biallelic diagnoses in the 100,000 Genomes Project. Genet Med, 2024: p. 101073.
19. Bycroft, C., et al., The UK Biobank resource with deep phenotyping and genomic data. Nature, 2018. 562(7726): p. 203-209.
20. Szustakowski, J.D., et al., Advancing human genetics research and drug discovery through exome sequencing of the UK Biobank. Nature Genetics, 2021. 53(7): p. 942-948.
21. Sollis, E., et al., The NHGRI-EBI GWAS Catalog: knowledgebase and deposition resource. Nucleic Acids Research, 2023. 51(D1): p. D977-D985.
22. GTEx Consortium, The GTEx Consortium atlas of genetic regulatory effects across human tissues. Science, 2020. 369(6509): p. 1318-1330.
23. Rentzsch, P., et al., CADD: predicting the deleteriousness of variants throughout the human genome. Nucleic Acids Research, 2019. 47(D1): p. D886-D894.
24. Horowitz, J.E., et al., Mutation spectrum of NOD2 reveals recessive inheritance as a main driver of Early Onset Crohn's Disease. Scientific Reports, 2021. 11(1).
25. Lai, M.H.C., Bootstrap Confidence Intervals for Multilevel Standardized Effect Size. Multivariate Behavioral Research, 2021. 56(4): p. 558-578.
26. Mann, H.B. and D.R. Whitney, On a test of whether one of two random variables is stochastically larger than the other. The Annals of Mathematical Statistics, 1947. 18(1): p. 50-60.
27. Fay, M.P. and Y. Malinovsky, Confidence intervals of the Mann-Whitney parameter that are compatible with the Wilcoxon-Mann-Whitney test. Statistics in Medicine, 2018. 37(27): p. 3991-4006.
28. Pedregosa, F., et al., Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 2011. 12: p. 2825-2830.
29. Agrawal, R., T. Imieliński, and A. Swami, Mining association rules between sets of items in large databases. ACM SIGMOD Record, 1993. 22(2): p. 207-216.
30. Huang, L.S., et al., A fast algorithm for mining association rules. Journal of Computer Science and Technology, 2000. 15(6): p. 619-624.
31. Hahsler, M., B. Grün, and K. Hornik, arules: A computational environment for mining association rules and frequent item sets. Journal of Statistical Software, 2005. 14(15).
32. Hahsler, M., arulesViz: Interactive Visualization of Association Rules with R. R Journal, 2017. 9(2): p. 163-175.
33. Frankish, A., et al., GENCODE: reference annotation for the human and mouse genomes in 2023. Nucleic Acids Research, 2023. 51(D1): p. D942-D949.
34. Sisu, C., GENCODE Pseudogenes. Pseudogenes, 2nd Edition, 2021. 2324: p. 67-82.
35. Zheng, D.Y., et al., Pseudogenes in the ENCODE regions: Consensus annotation, analysis of transcription, and evolution. Genome Research, 2007. 17(6): p. 839-851.
36. Tanwar, V.S., et al., Palmitic Acid-Induced Long Noncoding RNA Regulates Inflammation via Interaction With RNA-Binding Protein ELAVL1 in Monocytes and Macrophages. Arteriosclerosis Thrombosis and Vascular Biology, 2023. 43(7): p. 1157-1175.
37. Honjo, H., et al., RIPK2 as a New Therapeutic Target in Inflammatory Bowel Diseases. Frontiers in Pharmacology, 2021. 12.
38. de Lange, K.M., et al., Genome-wide association study implicates immune activation of multiple integrin genes in inflammatory bowel disease. Nature Genetics, 2017. 49(2): p. 256-261.
39. Sitinjak, B.D.P., et al., The Potential of Single Nucleotide Polymorphisms (SNPs) as Biomarkers and Their Association with the Increased Risk of Coronary Heart Disease: A Systematic Review. Vascular Health and Risk Management, 2023. 19: p. 289-301.
40. Lewis, C.M. and E. Vassos, Polygenic risk scores: from research tools to clinical instruments. Genome Medicine, 2020. 12(1).
41. Barrie, W., et al., Ancient DNA reveals evolutionary origins of autoimmune diseases. Nat Rev Immunol, 2024. 24(2): p. 85-86.

Additional Declarations
There is NO Competing Interest.

Supplementary Files
Supplementarytables2024.xlsx (Supplementary Dataset 1)
Supplementarymaterials.docx (Supplementary materials)
\u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eBoth small \u0026amp; large intestine\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003eUnspecified\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003eileocolitis\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003eproctitis\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003erectosigmoiditis\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c9\"\u003e \u003cp\u003eproctitis \u0026amp; rectosigmoiditis\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c10\"\u003e \u003cp\u003eunspecified\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c11\"\u003e \u003cp\u003eNA\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e174\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e197\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e47\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e473\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e6\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e192\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003e81\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c9\"\u003e \u003cp\u003e25\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c10\"\u003e \u003cp\u003e1105\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c11\"\u003e \u003cp\u003eNA\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colspan=\"11\" nameend=\"c11\" namest=\"c1\"\u003e 
\u003cp\u003e\u003cb\u003eGI complications\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eFistula disease\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colspan=\"4\" nameend=\"c5\" namest=\"c2\"\u003e \u003cp\u003e57\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colspan=\"5\" nameend=\"c10\" namest=\"c6\"\u003e \u003cp\u003e30\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c11\"\u003e \u003cp\u003eNA\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eStricturing disease\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colspan=\"4\" nameend=\"c5\" namest=\"c2\"\u003e \u003cp\u003e138\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colspan=\"5\" nameend=\"c10\" namest=\"c6\"\u003e \u003cp\u003e57\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c11\"\u003e \u003cp\u003eNA\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eColon cancer\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colspan=\"4\" nameend=\"c5\" namest=\"c2\"\u003e \u003cp\u003e26\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colspan=\"5\" nameend=\"c10\" namest=\"c6\"\u003e \u003cp\u003e43\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c11\"\u003e \u003cp\u003eNA\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eMegacolon disease\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colspan=\"4\" nameend=\"c5\" namest=\"c2\"\u003e \u003cp\u003e4\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colspan=\"5\" nameend=\"c10\" namest=\"c6\"\u003e \u003cp\u003e6\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c11\"\u003e \u003cp\u003eNA\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd 
align=\"left\" colspan=\"11\" nameend=\"c11\" namest=\"c1\"\u003e \u003cp\u003e\u003cb\u003eComorbidities with other autoimmune diseases\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003en\u0026thinsp;=\u0026thinsp;1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colspan=\"4\" nameend=\"c5\" namest=\"c2\"\u003e \u003cp\u003e111\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colspan=\"5\" nameend=\"c10\" namest=\"c6\"\u003e \u003cp\u003e151\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c11\"\u003e \u003cp\u003eNA\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003en\u0026thinsp;=\u0026thinsp;2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colspan=\"4\" nameend=\"c5\" namest=\"c2\"\u003e \u003cp\u003e15\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colspan=\"5\" nameend=\"c10\" namest=\"c6\"\u003e \u003cp\u003e15\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c11\"\u003e \u003cp\u003eNA\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003en\u0026thinsp;\u0026gt;\u0026thinsp;=\u0026thinsp;3\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colspan=\"4\" nameend=\"c5\" namest=\"c2\"\u003e \u003cp\u003e1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colspan=\"5\" nameend=\"c10\" namest=\"c6\"\u003e \u003cp\u003e5\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c11\"\u003e \u003cp\u003eNA\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec11\" class=\"Section2\"\u003e \u003ch2\u003ePathogenic mutations of GWAS association loci and monogenic IBD genes\u003c/h2\u003e \u003cp\u003eAll but 10 of the GWAS-derived set of 794 targets 
host\u0026thinsp;\u003cspan type=\"Underline\" class=\"Underline\" name=\"Emphasis\"\u003e\u0026ge;\u003c/span\u003e\u0026thinsp;1 variant with CADD\u003csub\u003ephred_score\u003c/sub\u003e\u003cspan type=\"Underline\" class=\"Underline\" name=\"Emphasis\"\u003e\u0026ge;\u003c/span\u003e15 in the cohort, and all the monoGenes were mutated in at least 1 patient. Despite this, pathogenic variants were very sparsely identified in the patients. Approximately half of the tested loci had a non-zero GenePy score in fewer than 5% of patients, as observed for 416 (52.39%) of the GWAS loci and 46 (44.23%) of the monoGenes in CD patients, and similarly for 425 (53.53%) GWAS loci and 48 (46.15%) monoGenes in UC. With more than half of the values being zero, the GenePy score matrix per locus/individual is a sparse matrix for downstream analysis.\u003c/p\u003e \u003cp\u003eThe most frequently mutated genes are among the 13 known \u0026lsquo;mendelian-complex\u0026rsquo; IBD genes: 8 (61.53%) are mutated in \u0026gt;\u0026thinsp;5% of both UC and CD patients, whereas \u003cem\u003eCD40, IL2RA, IL10, STAT3\u003c/em\u003e and \u003cem\u003eLACC1\u003c/em\u003e are rarely mutated in either UC or CD. Such sparsity of non-zero GenePy scores among the patients corresponds to the genetic heterogeneity of IBD and is the rationale for the following GenePy-based association tests on subpopulations with the highest scores.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec12\" class=\"Section2\"\u003e \u003ch2\u003eAssociation of the candidate regions with disease\u003c/h2\u003e \u003cp\u003eUnder the monogenic IBD model, two significant associations with CD are observed, which exert opposing effects on disease: \u003cem\u003eNOD2\u003c/em\u003e confers risk under the recessive model, and \u003cem\u003eIL23R\u003c/em\u003e shows protective effects under both the dominant and additive inheritance models. 
Both genes are known IBD genes, with \u003cem\u003eNOD2\u003c/em\u003e also being a \u0026lsquo;mendelian-complex\u0026rsquo; IBD gene. No significant associations were detected with UC from this test (Figure S3).\u003c/p\u003e \u003cp\u003eThe burden-based SKAT-O test highlighted \u003cem\u003eRIPK2-DT\u003c/em\u003e as the gene most significantly associated with UC; it is a noncoding eGene associated with the IBD-association SNP rs7015630. \u003cem\u003eRIPK2-DT\u003c/em\u003e plays a role in mitigating inflammation induced by free fatty acids but is less well characterised in IBD than its downstream gene \u003cem\u003eRIPK2\u003c/em\u003e [\u003cspan citationid=\"CR36\" class=\"CitationRef\"\u003e36\u003c/span\u003e, \u003cspan citationid=\"CR37\" class=\"CitationRef\"\u003e37\u003c/span\u003e]. The \u003cem\u003eRIPK2-DT\u003c/em\u003e association was not detected in the GenePy-based rank sum tests.\u003c/p\u003e \u003cp\u003eThe GenePy-based Mann-Whitney U test uncovered 35 loci in significant association with CD and 25 with UC (Fig.\u0026nbsp;2; Figure S4). \u003cem\u003eHLA-DQA1\u003c/em\u003e and \u003cem\u003eHLA-DQB2\u003c/em\u003e are the genes most significantly associated with UC, when comparing UC patients and controls with the top 7.5% of GenePy scores, albeit with modest effect sizes (\u003cem\u003eϕ\u003c/em\u003e\u003csub\u003e\u003cem\u003eHLA\u0026minus;DQA1\u003c/em\u003e\u003c/sub\u003e\u0026thinsp;=\u0026thinsp;0.63, CI [0.59,0.67]; \u003cem\u003eϕ\u003c/em\u003e\u003csub\u003e\u003cem\u003eHLA\u0026minus;DQB2\u003c/em\u003e\u003c/sub\u003e\u0026thinsp;=\u0026thinsp;0.66, CI [0.63,0.70]), compared to other associated genes, \u003cem\u003ee.g\u003c/em\u003e. 
\u003cem\u003eϕ\u003c/em\u003e\u003csub\u003e\u003cem\u003eSLC17A1\u003c/em\u003e\u003c/sub\u003e\u0026thinsp;=\u0026thinsp;0.81, CI [0.73,0.88], or the monoGene \u003cem\u003eLIG4\u003c/em\u003e, where \u003cem\u003eϕ\u003c/em\u003e\u003csub\u003e\u003cem\u003eLIG4\u003c/em\u003e\u003c/sub\u003e\u0026thinsp;=\u0026thinsp;0.82, CI [0.74,0.89]. \u003cem\u003eNOD2\u003c/em\u003e, together with the co-located LDB187 and the \u003cem\u003eCYLD-AS1\u003c/em\u003e gene (Figure S2), but not the monoGene \u003cem\u003eCYLD\u003c/em\u003e, are the most significantly associated with CD (Fig.\u0026nbsp;2). Such associations, driven by rare pathogenic variants (CADD\u003csub\u003ephred_score\u003c/sub\u003e\u003cspan type=\"Underline\" class=\"Underline\" name=\"Emphasis\"\u003e\u0026ge;\u003c/span\u003e15), show larger effect sizes than those identified in the original GWAS, with both protective and risk effects observed, and the effects tend to be larger when the affected sub-population is smaller (Fig.\u0026nbsp;2). Notably, although the smallest p value for \u003cem\u003eNOD2\u003c/em\u003e was observed in individuals with the top 7.5% of scores (\u003cem\u003ep\u003c/em\u003e\u0026thinsp;=\u0026thinsp;1.41x10\u003csup\u003e\u0026minus;\u0026thinsp;17\u003c/sup\u003e, \u003cem\u003eϕ\u003c/em\u003e\u0026thinsp;=\u0026thinsp;0.80, CI [0.77,0.83]), the maximum effect size was observed in those with the top 2.5% of scores (\u003cem\u003ep\u003c/em\u003e\u0026thinsp;=\u0026thinsp;5.13x10\u003csup\u003e\u0026minus;\u0026thinsp;7\u003c/sup\u003e, \u003cem\u003eϕ\u003c/em\u003e\u0026thinsp;=\u0026thinsp;0.81, CI [0.76,0.86]).\u003c/p\u003e \u003cp\u003eThe eGene \u003cem\u003eNOTCH1\u003c/em\u003e and the mappedGene \u003cem\u003eCARD9\u003c/em\u003e at locus 9q34.3 are tagged by the association SNPs encompassed by LDB131a/b, and both exhibited significant association with CD, evincing pleiotropic effects at the gene level of a GWAS association locus. 
In another case, LDB189, which covers a portion of the \u003cem\u003ePLCG2\u003c/em\u003e gene encompassing the phospholipase domain, is significantly associated with CD with protective effects, but the entire \u003cem\u003ePLCG2\u003c/em\u003e gene is not (Fig.\u0026nbsp;3), in line with the GWAS findings [\u003cspan citationid=\"CR38\" class=\"CitationRef\"\u003e38\u003c/span\u003e].\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec13\" class=\"Section2\"\u003e \u003ch2\u003eSet of highly mutated genes in IBD and controls\u003c/h2\u003e \u003cp\u003eThe rare variant-based associations tested in both UC and CD appeared to exert both protective and risk effects, and the large effect sizes raise the possibility that some cases of the disease have an oligogenic pathogenesis. We tested this using itemset association analysis with the APRIORI algorithm, in which patients carrying a GenePy2 score higher than the cut-off applied in the association tests were considered GenePy positive for a mutant gene or LDB (Table S5-8). The test was conducted on 398 CD patients and 28,017 controls with positive GenePy status for any of the 34 CD-associated genes/LDBs, and similarly on 613 UC patients and 25,748 controls with positive GenePy status for any of the 25 association loci.\u003c/p\u003e \u003cp\u003eThe GenePy status of LDBs/genes within the same GWAS association region tends to be associated because of the existing intra-region overlaps (Table S2 and Table S5-8). Between GWAS regions, considerable coexistence of \u0026lsquo;positive\u0026rsquo; GenePy status for LDB187/\u003cem\u003eNOD2\u003c/em\u003e and \u003cem\u003eIL23R\u003c/em\u003e/LDB6 was observed in controls (Fig.\u0026nbsp;4A). This coexistence was completely absent from the itemsets observed in CD cases, with GenePy(+) status of the \u003cem\u003eNOD2\u003c/em\u003e and \u003cem\u003eIL23R\u003c/em\u003e regions being mutually exclusive in CD patients (Fig.\u0026nbsp;4A; Table S5-S6). 
\u003cem\u003eIL23R\u003c/em\u003e and its genomic region also showed the strongest associations with other regions among the UC-associated genes/LDBs in controls (Fig.\u0026nbsp;4B; Table S7-8), indicating counter-risk effects in both IBD subtypes, although the observation in UC may be biased, as the sub-population with positive \u003cem\u003eIL23R\u003c/em\u003e GenePy status constituted 14.08% of the UC cohort (Fig.\u0026nbsp;3C).\u003c/p\u003e \u003c/div\u003e"},{"header":"DISCUSSION","content":"\u003cp\u003eBoth single DNA variants and the aggregated effects of multiple variants have been utilized for disease risk stratification [\u003cspan citationid=\"CR39\" class=\"CitationRef\"\u003e39\u003c/span\u003e, \u003cspan citationid=\"CR40\" class=\"CitationRef\"\u003e40\u003c/span\u003e], but a biomarker based on rare functional genomic variants is missing for complex disease, despite their potentially direct causal effect on disease. Filling this gap relies on a large cohort, but large genomic datasets are beset by complex variation, \u003cem\u003ee.g.\u003c/em\u003e multi-allelic variation and variants of unknown significance. Based on the UK BioBank cohort, we tackled such complexities using an evolved GenePy2.0 with greater computational efficiency and flexibility, and then tested it on known GWAS loci identified through common variant-based associations. A tailored analysis of IBD was performed, and the results demonstrated significant enrichment of associations represented by GenePy scores with both risk and protective effects on disease occurrence, which may change our previous outlook on the genetic architecture of IBD. 
This work also exemplifies a new approach to tackling the relationship between GWAS CVs and rare variants.\u003c/p\u003e \u003cp\u003eIBD is the archetypal \u0026lsquo;complex\u0026rsquo; disease, with genetic heterogeneity leading to distinct underlying aetiologies of disease pathogenesis between individuals and governing the role of both triggering and ongoing environmental drivers of disease. In addition to the plethora of GWAS findings which shed light on the genetic pathogenic pathways of the disease, recent analysis of large numbers of patients with WES data has continued to advance knowledge and has implicated more rare variation in pathogenesis [\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e]. Here we build upon the ability to assess rare variation by applying statistical analysis to determine the maximal contribution of each GWAS locus to IBD pathogenesis within the cohort.\u003c/p\u003e \u003cp\u003eWe did not limit our view of GWAS to genes, but instead followed a na\u0026iuml;ve approach, revisiting the SNP associations with LD mapping in addition to evidence-based physical and eQTL mapping of candidate genes. This introduced pseudogenes and intergenic LDBs, which are undesirable targets for the WES-based downstream analysis because many of the candidates are not captured by the sequencing assay and many are less well studied. However, this has also led to novel discoveries. Our analysis points to variation across the entire \u003cem\u003eNOD2\u003c/em\u003e-associated LDB, rather than just the gene, as being significantly associated with Crohn\u0026rsquo;s disease, implying important roles for regulatory regions in addition to established coding variants. Similarly, our analysis localises the association in \u003cem\u003ePLCG2\u003c/em\u003e to only part of the gene, which could be used to better understand the underlying biological process through which variants lead to disease. 
Pseudogenes \u003cem\u003eRIPK2-DT\u003c/em\u003e and \u003cem\u003eCYLD-AS1\u003c/em\u003e also stand out in the association tests, which may indicate novel pathogenic pathways in IBD.\u003c/p\u003e \u003cp\u003eThe discovery of associations has been substantially aided by GenePy2. By capturing the role of rare variation at an individual level, this technique provides the ability both to determine the relative contributions of associated genes to IBD pathogenesis across a cohort, and to identify, at an individual level, patients in whom a specific gene (or set of genes) makes a statistical contribution to disease compared with other patients. This opens the possibility of personalising the molecular diagnosis for an individual patient and identifying genomic biomarkers of disease. By taking the subset of individuals with the highest GenePy scores, we can tackle the genetic heterogeneity of IBD in a straightforward way. For instance, rank-based comparison recovered the most significant association of the \u003cem\u003eNOD2\u003c/em\u003e locus for the CD patients with the top 7.5% of GenePy scores, concurring with previous findings, although we found that the largest hazard effect of \u003cem\u003eNOD2\u003c/em\u003e mutation is for the more extreme top 1% of scores.\u003c/p\u003e \u003cp\u003eNot all \u0026lsquo;pathogenic\u0026rsquo; variants are causal for IBD, as we found both risk and protective effects in the CD and UC cohorts. This is consistent with the evolutionary picture of autoimmune disease [\u003cspan citationid=\"CR41\" class=\"CitationRef\"\u003e41\u003c/span\u003e], and the directionality of genetic variants may be addressed in the near future by burden-based association tests that incorporate gain-of-function, loss-of-function and dominant negative annotations into the GenePy score. 
Interestingly, the effect sizes of the GenePy score-based tests are much larger than those of the GWAS findings on index SNPs, providing the possibility for the scoring tool to be applied as a potential biomarker for implicated genomic ()counter each other when occurring to the same individual, as we observed in controls with positive \u003cem\u003eNOD2\u003c/em\u003e GenePy status being also positive for \u003cem\u003eIL23R\u003c/em\u003e. Furthermore, this pattern suggests an oligogenic picture of IBD for some patients, with disease aetiology lying between complex and monogenic IBD.\u003c/p\u003e \u003cp\u003eWhilst the UK Biobank cohort provides many advantages, including its large size and rich phenotyping data, WES data are not ideal for analysis of all GWAS targets, as many of the associations lie in noncoding regions, as observed in a large proportion of the LDBs in this study. WGS may provide the opportunity for improvement in the methods, their application, and discovery. Another potential weakness of UK Biobank data is the precision of the clinical phenotyping, which impedes subtype or genotype-phenotype correlation analysis even with GenePy scores of large effect. In this study we have attempted to identify genomic associations of specific IBD subtypes and are therefore reliant on the accuracy of clinical data to make correct associations. 
It is also important to recognise that quality control of phenotypes by specific researchers is not possible, and we have used the available data to categorise IBD patients and to identify controls that are reported to have no other autoimmune conditions.\u003c/p\u003e \u003cp\u003eWith approved access to the Phase 3 UK Biobank data (project ID 140070) and other IBD data, we are looking to replicate the GenePy-based findings in IBD and other diseases, with testing and development of GenePy as a potential DNA biomarker representing rare functional variants for complex diseases.\u003c/p\u003e"},{"header":"Declarations","content":"\u003ch2\u003eAcknowledgements\u003c/h2\u003e \u003cp\u003eThis study is funded by AGENDA EPSRC funding on AI health research (EP/Y01720X/1) and was supported by the National Institute for Health Research (NIHR) Southampton Biomedical Research Centre. The views expressed are those of the author(s) and not necessarily those of the NIHR or the Department of Health and Social Care. The authors acknowledge the use of the IRIDIS High Performance Computing Facility, and associated support services at the University of Southampton, in the completion of this work. JJA is funded by an NIHR Advanced fellowship.\u003c/p\u003e\n\u003ch2\u003eAuthors\u0026rsquo; contributions\u003c/h2\u003e \u003cp\u003eGC and SE designed and presented the idea. GC executed the analysis and code development. SE, JA, AC, and MB verified the analytical methods. JA and MB supervised the clinical data interpretation and commented on the phenotypic data quality issue; JA contributed substantially to the writing of this manuscript and the methods of data presentation.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\n\u003cli\u003eGraham, D.B. and R.J. Xavier, \u003cem\u003ePathway paradigms revealed from the genetics of inflammatory bowel disease.\u003c/em\u003e Nature, 2020. \u003cstrong\u003e578\u003c/strong\u003e(7796): p. 
527-539.\u003c/li\u003e\n\u003cli\u003eJiang, L., et al., \u003cem\u003eA generalized linear mixed model association tool for biobank-scale data.\u003c/em\u003e Nat Genet, 2021. \u003cstrong\u003e53\u003c/strong\u003e(11): p. 1616-1621.\u003c/li\u003e\n\u003cli\u003eUffelmann, E., et al., \u003cem\u003eGenome-wide association studies.\u003c/em\u003e Nature Reviews Methods Primers, 2021. \u003cstrong\u003e1\u003c/strong\u003e(1).\u003c/li\u003e\n\u003cli\u003eSazonovs, A., et al., \u003cem\u003eLarge-scale sequencing identifies multiple genes and rare variants associated with Crohn\u0026apos;s disease susceptibility.\u003c/em\u003e Nat Genet, 2022. \u003cstrong\u003e54\u003c/strong\u003e(9): p. 1275-1283.\u003c/li\u003e\n\u003cli\u003eGettler, K., et al., \u003cem\u003eCommon and Rare Variant Prediction and Penetrance of IBD in a Large, Multi-ethnic, Health System-based Biobank Cohort.\u003c/em\u003e Gastroenterology, 2021. \u003cstrong\u003e160\u003c/strong\u003e(5): p. 1546-1557.\u003c/li\u003e\n\u003cli\u003eBolton, C., et al., \u003cem\u003eAn Integrated Taxonomy for Monogenic Inflammatory Bowel Disease.\u003c/em\u003e Gastroenterology, 2022. \u003cstrong\u003e162\u003c/strong\u003e(3): p. 859-876.\u003c/li\u003e\n\u003cli\u003eAshton, J.J., et al., \u003cem\u003eGenetic Sequencing of Pediatric Patients Identifies Mutations in Monogenic Inflammatory Bowel Disease Genes that Translate to Distinct Clinical Phenotypes.\u003c/em\u003e Clinical and Translational Gastroenterology, 2020. \u003cstrong\u003e11\u003c/strong\u003e.\u003c/li\u003e\n\u003cli\u003eZhou, D., et al., \u003cem\u003eA phenome-wide scan reveals convergence of common and rare variant associations.\u003c/em\u003e Genome Medicine, 2023. \u003cstrong\u003e15\u003c/strong\u003e(1).\u003c/li\u003e\n\u003cli\u003eDickson, S.P., et al., \u003cem\u003eRare Variants Create Synthetic Genome-Wide Associations.\u003c/em\u003e Plos Biology, 2010. 
\u003cstrong\u003e8\u003c/strong\u003e(1).\u003c/li\u003e\n\u003cli\u003eGoldstein, D.B., \u003cem\u003eThe Importance of Synthetic Associations Will Only Be Resolved Empirically.\u003c/em\u003e Plos Biology, 2011. \u003cstrong\u003e9\u003c/strong\u003e(1).\u003c/li\u003e\n\u003cli\u003eWray, N.R., S.M. Purcell, and P.M. Visscher, \u003cem\u003eSynthetic Associations Created by Rare Variants Do Not Explain Most GWAS Results.\u003c/em\u003e Plos Biology, 2011. \u003cstrong\u003e9\u003c/strong\u003e(1).\u003c/li\u003e\n\u003cli\u003eBall, P., \u003cem\u003eHow Life Works: A User\u0026rsquo;s Guide to the New Biology\u003c/em\u003e. 2023.\u003c/li\u003e\n\u003cli\u003eNoble, D., \u003cem\u003eIt\u0026rsquo;s time to admit that genes are not the blueprint for life.\u003c/em\u003e Nature, 2024. \u003cstrong\u003e626\u003c/strong\u003e: p. 254-255.\u003c/li\u003e\n\u003cli\u003eVergara-Lope, A., et al., \u003cem\u003eLinkage disequilibrium maps for European and African populations constructed from whole genome sequence data.\u003c/em\u003e Sci Data, 2019. \u003cstrong\u003e6\u003c/strong\u003e(1): p. 208.\u003c/li\u003e\n\u003cli\u003eZhang, W.H., et al., \u003cem\u003eProperties of linkage disequilibrium (LD) maps.\u003c/em\u003e Proceedings of the National Academy of Sciences of the United States of America, 2002. \u003cstrong\u003e99\u003c/strong\u003e(26): p. 17004-17007.\u003c/li\u003e\n\u003cli\u003eLee, S., et al., \u003cem\u003eOptimal Unified Approach for Rare-Variant Association Testing with Application to Small-Sample Case-Control Whole-Exome Sequencing Studies.\u003c/em\u003e American Journal of Human Genetics, 2012. \u003cstrong\u003e91\u003c/strong\u003e(2): p. 224-237.\u003c/li\u003e\n\u003cli\u003eStafford, I.S., et al., \u003cem\u003eSupervised Machine Learning Classifies Inflammatory Bowel Disease Patients by Subtype Using Whole Exome Sequencing Data.\u003c/em\u003e J Crohns Colitis, 2023. \u003cstrong\u003e17\u003c/strong\u003e(10): p. 
1672-1680.\u003c/li\u003e\n\u003cli\u003eSeaby, E.G., et al., \u003cem\u003eA gene pathogenicity tool \u0026apos;GenePy\u0026apos; identifies missed biallelic diagnoses in the 100,000 Genomes Project.\u003c/em\u003e Genet Med, 2024: p. 101073.\u003c/li\u003e\n\u003cli\u003eBycroft, C., et al., \u003cem\u003eThe UK Biobank resource with deep phenotyping and genomic data.\u003c/em\u003e Nature, 2018. \u003cstrong\u003e562\u003c/strong\u003e(7726): p. 203-209.\u003c/li\u003e\n\u003cli\u003eSzustakowski, J.D., et al., \u003cem\u003eAdvancing human genetics research and drug discovery through exome sequencing of the UK Biobank.\u003c/em\u003e Nature Genetics, 2021. \u003cstrong\u003e53\u003c/strong\u003e(7): p. 942-948.\u003c/li\u003e\n\u003cli\u003eSollis, E., et al., \u003cem\u003eThe NHGRI-EBI GWAS Catalog: knowledgebase and deposition resource.\u003c/em\u003e Nucleic Acids Research, 2023. \u003cstrong\u003e51\u003c/strong\u003e(D1): p. D977-D985.\u003c/li\u003e\n\u003cli\u003eGTEx Consortium, \u003cem\u003eThe GTEx Consortium atlas of genetic regulatory effects across human tissues.\u003c/em\u003e Science, 2020. \u003cstrong\u003e369\u003c/strong\u003e(6509): p. 1318-1330.\u003c/li\u003e\n\u003cli\u003eRentzsch, P., et al., \u003cem\u003eCADD: predicting the deleteriousness of variants throughout the human genome.\u003c/em\u003e Nucleic Acids Research, 2019. \u003cstrong\u003e47\u003c/strong\u003e(D1): p. D886-D894.\u003c/li\u003e\n\u003cli\u003eHorowitz, J.E., et al., \u003cem\u003eMutation spectrum of NOD2 reveals recessive inheritance as a main driver of Early Onset Crohn\u0026apos;s Disease.\u003c/em\u003e Scientific Reports, 2021. \u003cstrong\u003e11\u003c/strong\u003e(1).\u003c/li\u003e\n\u003cli\u003eLai, M.H.C., \u003cem\u003eBootstrap Confidence Intervals for Multilevel Standardized Effect Size.\u003c/em\u003e Multivariate Behavioral Research, 2021. \u003cstrong\u003e56\u003c/strong\u003e(4): p. 
558-578.\u003c/li\u003e\n\u003cli\u003eMann, H.B. and D.R. Whitney, \u003cem\u003eOn a test of whether one of two random variables is stochastically larger than the other.\u003c/em\u003e The Annals of Mathematical Statistics, 1947. \u003cstrong\u003e18\u003c/strong\u003e(1): p. 50-60.\u003c/li\u003e\n\u003cli\u003eFay, M.P. and Y. Malinovsky, \u003cem\u003eConfidence intervals of the Mann-Whitney parameter that are compatible with the Wilcoxon-Mann-Whitney test.\u003c/em\u003e Statistics in Medicine, 2018. \u003cstrong\u003e37\u003c/strong\u003e(27): p. 3991-4006.\u003c/li\u003e\n\u003cli\u003ePedregosa, F., et al., \u003cem\u003eScikit-learn: Machine Learning in Python.\u003c/em\u003e Journal of Machine Learning Research, 2011. \u003cstrong\u003e12\u003c/strong\u003e: p. 2825-2830.\u003c/li\u003e\n\u003cli\u003eAgrawal, R., Imieliński, T., \u0026amp; Swami, A., \u003cem\u003eMining association rules between sets of items in large databases.\u003c/em\u003e ACM SIGMOD Record, 1993. \u003cstrong\u003e22\u003c/strong\u003e(2): p. 207-216.\u003c/li\u003e\n\u003cli\u003eHuang, L.S., et al., \u003cem\u003eA fast algorithm for mining association rules.\u003c/em\u003e Journal of Computer Science and Technology, 2000. \u003cstrong\u003e15\u003c/strong\u003e(6): p. 619-624.\u003c/li\u003e\n\u003cli\u003eHahsler, M., B. Gr\u0026uuml;n, and K. Hornik, \u003cem\u003earules - A computational environment for mining association rules and frequent item sets.\u003c/em\u003e Journal of Statistical Software, 2005. \u003cstrong\u003e14\u003c/strong\u003e(15).\u003c/li\u003e\n\u003cli\u003eHahsler, M., \u003cem\u003earulesViz: Interactive Visualization of Association Rules with R.\u003c/em\u003e R Journal, 2017. \u003cstrong\u003e9\u003c/strong\u003e(2): p. 163-175.\u003c/li\u003e\n\u003cli\u003eFrankish, A., et al., \u003cem\u003eGENCODE: reference annotation for the human and mouse genomes in 2023.\u003c/em\u003e Nucleic Acids Research, 2023. 
\u003cstrong\u003e51\u003c/strong\u003e(D1): p. D942-D949.\u003c/li\u003e\n\u003cli\u003eSisu, C., \u003cem\u003eGENCODE Pseudogenes.\u003c/em\u003e Pseudogenes, 2 Edition, 2021. \u003cstrong\u003e2324\u003c/strong\u003e: p. 67-82.\u003c/li\u003e\n\u003cli\u003eZheng, D.Y., et al., \u003cem\u003ePseudogenes in the ENCODE regions:: Consensus annotation, analysis of transcription, and evolution.\u003c/em\u003e Genome Research, 2007. \u003cstrong\u003e17\u003c/strong\u003e(6): p. 839-851.\u003c/li\u003e\n\u003cli\u003eTanwar, V.S., et al., \u003cem\u003ePalmitic Acid-Induced Long Noncoding RNA \u003cem\u003e Regulates Inflammation via Interaction With RNA-Binding Protein ELAVL1 in Monocytes and Macrophages.\u003c/em\u003e Arteriosclerosis Thrombosis and Vascular Biology, 2023. \u003cstrong\u003e43\u003c/strong\u003e(7): p. 1157-1175.\u003c/em\u003e\u003c/li\u003e\n\u003cli\u003eHonjo, H., et al., \u003cem\u003eRIPK2 as a New Therapeutic Target in Inflammatory Bowel Diseases.\u003c/em\u003e Frontiers in Pharmacology, 2021. \u003cstrong\u003e12\u003c/strong\u003e.\u003c/li\u003e\n\u003cli\u003ede Lange, K.M., et al., \u003cem\u003eGenome-wide association study implicates immune activation of multiple integrin genes in inflammatory bowel disease.\u003c/em\u003e Nature Genetics, 2017. \u003cstrong\u003e49\u003c/strong\u003e(2): p. 256-261.\u003c/li\u003e\n\u003cli\u003eSitinjak, B.D.P., et al., \u003cem\u003eThe Potential of Single Nucleotide Polymorphisms (SNPs) as Biomarkers and Their Association with the Increased Risk of Coronary Heart Disease: A Systematic Review.\u003c/em\u003e Vascular Health and Risk Management, 2023. \u003cstrong\u003e19\u003c/strong\u003e: p. 289-301.\u003c/li\u003e\n\u003cli\u003eLewis, C.M. and E. Vassos, \u003cem\u003ePolygenic risk scores: from research tools to clinical instruments.\u003c/em\u003e Genome Medicine, 2020. 
\u003cstrong\u003e12\u003c/strong\u003e(1).\u003c/li\u003e\n\u003cli\u003eBarrie, W., et al., \u003cem\u003eAncient DNA reveals evolutionary origins of autoimmune diseases.\u003c/em\u003e Nat Rev Immunol, 2024. \u003cstrong\u003e24\u003c/strong\u003e(2): p. 85-86.\u003c/li\u003e\n\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":true,"hideJournal":true,"highlight":"","institution":"","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true},"keywords":"IBD, Crohn’s disease, Ulcerative colitis, GWAS, GenePy2, pathogenic burden, genetics","lastPublishedDoi":"10.21203/rs.3.rs-4415057/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-4415057/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003eRare and common variants often converge in the pathogenic pathway of inflammatory bowel disease (IBD), a heterogeneous autoimmune condition with genomic and environmental influences. We identified 794 functionally-targeted-genes/linkage-disequilibrium-mapped blocks (LDBs) implicated by genome-wide association studies (GWAS), then developed GenePy2, a burden score that integrates the functional impacts of rare variants for each gene/LDB, using exome data from the UK-Biobank phase2 IBD cohort. 
Through case/control 2-way Mann-Whitney-U test tuning on subpopulations with extreme GenePy2 scores, 34 genes/LDBs in Crohn\u0026rsquo;s disease (CD) and 25 in Ulcerative Colitis (UC) survived significance testing, confirming roles for rare functional variants. The optimal thresholds of GenePy2 were then pinpointed for each gene/LDB based on the tests\u0026rsquo; maximum effect size. Further itemset association mining of the binarised GenePy2 scores detected an intriguing co-occurrence of extreme scores of the risk \u003cem\u003eNOD2\u003c/em\u003e and protective \u003cem\u003eIL23R\u003c/em\u003e in controls, which are mutually exclusive in CD patients, implicating a \u0026lsquo;rescue\u0026rsquo; of disease by protective rare variants.\u003c/p\u003e","manuscriptTitle":"Tackling the role of rare functional variation in inflammatory bowel disease through application of GenePy2 as a potential DNA biomarker","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2024-05-23 01:51:18","doi":"10.21203/rs.3.rs-4415057/v1","editorialEvents":[{"type":"communityComments","content":0}],"status":"published","journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"a935e8ac-4d56-4ad4-9495-73955183ddf7","owner":[],"postedDate":"May 23rd, 2024","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"posted","subjectAreas":[{"id":32218347,"name":"Health sciences/Risk factors"},{"id":32218348,"name":"Health sciences/Gastroenterology/Gastrointestinal diseases/Inflammatory bowel 
disease"}],"tags":[],"updatedAt":"2024-09-24T14:13:29+00:00","versionOfRecord":[],"versionCreatedAt":"2024-05-23 01:51:18","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-4415057","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-4415057","identity":"rs-4415057","version":["v1"]},"buildId":"qtupq5eGEP_6zYnWcrvyt","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}

Source provenance

europepmc: last seen: 2026-05-20T01:45:00.602351+00:00
unpaywall: last seen: 2026-05-21T05:10:58.409756+00:00

License: CC-BY-4.0