Long read sequencing enhances pathogenic and novel variation discovery in patients with rare diseases

Brief Communication

Ahmad Abou Tayoun, Shruti Sinha, Fatma Rabea, Sathishkumar Ramaswamy, Ikram Chekroun, Maha El Naofal, Ruchi Jain, Roudha Alfalasi, Nour Halabi, Sawsan Yaslam, Massomeh Sheikh Hassani, Shruti Shenbagam, Alan Taylor, Mohammed Uddin, Mohamed Al Marri, Stefan Du Plessis, Alawi Alsheikh-Ali

This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-4235049/v1
This work is licensed under a CC BY 4.0 License.
Status: Published. Journal publication published 14 Mar, 2025 in Nature Communications (https://doi.org/10.1038/s41467-025-57695-9). This is preprint Version 1.

Abstract

With ongoing improvements in accuracy and capacity to detect complex genomic and epigenomic variations, long-read sequencing (LRS) technologies could serve as a unified platform for clinical genetic testing, particularly in rare disease settings, where nearly half of patients remain undiagnosed using existing technologies.
Here, we report a simplified funnel-down filtration strategy aimed at identifying large deleterious variants and abnormal episignature disease profiles from whole-genome LRS data. This approach reduced the number of structural and copy number variants by 98.5–99.9%, while detecting all pathogenic changes in a positive control set (N = 10). When applied to patients who previously had negative short-read testing (N = 39), additional diagnoses were uncovered in 13% of cases, including a novel methylation profile specific to spinal muscular atrophy, thus opening new avenues for diagnosing and treating this life-threatening condition. Our study illustrates the utility of LRS in clinical genetic testing and in the discovery of novel disease variations.

Biological sciences/Genetics/Genomics/Medical genomics
Biological sciences/Genetics/Sequencing

INTRODUCTION

Around 7,000 rare diseases have been identified, collectively imposing a significant health and socio-economic burden [1]. The majority of these diseases have a genetic origin, with causal variants ranging from single nucleotide variants (SNVs) or insertions/deletions of a few nucleotides (INDELs) to large genomic changes such as copy number variants (CNVs), translocations, inversions, transposable element (TE) insertions, or complex rearrangements. Some are also associated with specific epigenomic profiles [2]. This diverse spectrum of disease-causing changes, often detected by different technologies, has challenged current genetic diagnostic strategies, contributed to long diagnostic odysseys averaging 6 years [3], and delayed management or treatment plans for patients with rare disease.

Although short-read sequencing technologies have brought a remarkable leap in the diagnosis of rare genetic diseases [4, 5], more than half of the patients remain undiagnosed.
This is partly due to the inherent limitations of this technology in detecting complex variants such as structural variants, methylation profiles, repeat expansions, or variants embedded in inaccessible regions of the genome, specifically high-homology and GC-rich regions [6]. Recent advances in third-generation sequencing technologies have demonstrated the application of targeted LRS for identifying pathogenic variants in known or novel disease-causing genes [7–9]. However, the clinical implementation of LRS for detecting genome-wide variation and methylation changes in the context of rare diseases has been limited by challenges associated with the annotation and filtration of a large number of variants, and is yet to be explored. Here we optimize a whole-genome LRS workflow and a computational strategy in a cohort of undiagnosed patients with suspected rare diseases, leading to additional diagnoses and the uncovering of a novel methylation signature associated with spinal muscular atrophy (SMA).

We optimized our analysis workflow on a selected cohort of 14 patients with confirmed genetic diagnoses, encompassing a diverse array of genomic and epigenomic pathogenic variants (Fig. 1a and Supplementary Fig. 1a). The study design incorporated a wet-bench protocol optimized for long-read Oxford Nanopore sequencing on a PromethION system, targeting a minimum of 30X coverage with an average N50 of 12 kb (Fig. 1a and Supplementary Fig. 1b). Our computational analysis workflow consists of “genome” and “epigenome” modules (Fig. 1a and Extended Methods). The former consists of detection, annotation, and selection of genome-wide variants, including copy number variations (CNVs), short variants (SNVs and INDELs) and structural variations (SVs). Raw variants were retained if calls were supported by ≥ 5 reads with allele fraction ≥ 0.3 and affected the coding region of genes associated with disease as defined in OMIM or GenCC (Extended Methods).
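The retention rule above, together with the cohort-uniqueness step this study applies next, can be sketched in a few lines of Python. This is a minimal illustration and not the authors' in-house scripts: the `VariantCall` record, its field names, and the use of an exact `variant_id` match to decide whether two patients share a call are all assumptions made for the example (real SV comparison would need breakpoint tolerance). Only the two thresholds (≥ 5 supporting reads, allele fraction ≥ 0.3) come from the text.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable, List, Set

MIN_SUPPORTING_READS = 5    # threshold stated in the text
MIN_ALLELE_FRACTION = 0.3   # threshold stated in the text


@dataclass(frozen=True)
class VariantCall:
    """Hypothetical, simplified record for one SV or CNV call."""
    sample: str
    variant_id: str                 # assumed key used to match calls across samples
    supporting_reads: int
    allele_fraction: float
    coding_genes: FrozenSet[str]    # genes whose coding region the call overlaps


def passes_first_filter(call: VariantCall, disease_genes: FrozenSet[str]) -> bool:
    """Read support, allele fraction, and overlap with a disease-gene list."""
    return (
        call.supporting_reads >= MIN_SUPPORTING_READS
        and call.allele_fraction >= MIN_ALLELE_FRACTION
        and bool(call.coding_genes & disease_genes)
    )


def funnel_down(calls: Iterable[VariantCall],
                disease_genes: FrozenSet[str]) -> List[VariantCall]:
    """Apply the first filter, then keep calls seen in exactly one sample.

    Computing uniqueness on the already-filtered calls is an assumption;
    the text does not say which call set the uniqueness step uses.
    """
    retained = [c for c in calls if passes_first_filter(c, disease_genes)]
    carriers: Dict[str, Set[str]] = {}
    for call in retained:
        carriers.setdefault(call.variant_id, set()).add(call.sample)
    return [c for c in retained if len(carriers[c.variant_id]) == 1]
```

In practice `disease_genes` would be built from the OMIM and GenCC gene lists, and the surviving calls would still go to manual review.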
This reduced the number of variants by 40% for CNVs and 99% for SVs. Further restricting to variants unique to each patient in the cohort reduced CNVs by 98% (average n = 2) and SVs by 99.9% (average n = 12) (Fig. 1b and Supplementary Fig. 1c); the remaining variants were then manually inspected for any clinical correlation. This led to the detection of all associated pathogenic variants in this group (Supplementary Fig. 1d-f).

The epigenomic module is composed of two methods: one scans for episignatures specific to 42 known diseases [2], and the other diagnoses SMA based on a novel methylation signature that we characterize in this study (Extended Methods). SMA is a common, life-threatening autosomal recessive neuromuscular disease caused by biallelic loss of the survival of motor neuron 1 (SMN1) gene, mostly through deletions involving exon 7 [10]. We observed a specific methylation profile across introns 6 to 8 (chr5:70239954–70249165) of the SMN1 gene, where 0%, 50–70% (moderate) and 98–100% (high) of bases carried a methylation modification in SMA patients, carriers and non-carriers, respectively, revealing a unique episignature for SMA (Fig. 1c and Supplementary Fig. 1e). We also confirmed the methylation profile for a control sample (OXN-18) with Angelman syndrome (Supplementary Fig. 1f). Overall, our pipeline correctly identified all the pathogenic variants, including complex rearrangements and aberrant methylation, in the optimization cohort.

We applied this workflow to a set of undiagnosed patients (N = 39) who previously had inconclusive testing using short-read exome sequencing, with 39% also having had chromosomal microarray testing (Fig. 1a and Supplementary Table 1). Patients were mostly of Arab descent (90%), had a roughly balanced gender representation (~40% females) and primarily presented with neurological disorders (44%) (Fig. 1d-e and Supplementary Table 1). Whole-genome LRS in this cohort obtained an average of 53X coverage and an N50 of 12.2 kb.
Approximately 35,000 SVs and 83 CNVs were detected in each sample (Supplementary Table 2), which were reduced by 99.98% and 98.49%, respectively, after applying our filtering and selection criteria (Fig. 2b and Supplementary Fig. 2a). Since all patients previously had inconclusive exome testing, we focused our analysis on SNVs with predicted splicing impact, which could have been previously filtered out. We applied our splicing SNV filtration criteria (see Methods), which retained on average ~54 SNVs in disease-causing genes for each sample, a substantial reduction from the total number of SNVs (~1.6M) (Supplementary Fig. 2b and Supplementary Table 2). We evaluated variants within the genes matching the patient phenotype and identified a single variant in DNMT1 (NM_001130823: c.891+8C>T) in OXN-044, though assessment of its impact on DNMT1 RNA splicing (Supplementary Fig. 2c) and its relatively high allele frequency in the general population led to its classification as clinically benign. No other putative clinically relevant sequence variants were identified.

We next focused on large CNV events and identified pathogenic variants in two patients. For patient OXN-033, three deletions from a total of 59 CNVs were prioritized, of which a heterozygous deletion (1.4 Mb) at 2q11.1-q11.2 was classified as pathogenic after manual inspection and was validated by CMA (Fig. 2b and Supplementary Table 3). Individuals with 2q11.2 deletions have developmental delay, intellectual disability, dysmorphic features and variable skeletal anomalies along with obesity [11, 12], which was consistent with this patient’s phenotype. In another patient (OXN-048), with an unconfirmed diagnosis of anterior segment dysgenesis and a heterozygous pathogenic variant in SLC38A8 identified by exome sequencing, we used LRS to detect a single heterozygous deletion in 16q23.3 (Fig. 2c and Supplementary Table 3) partially encompassing SLC38A8 (exon 8 to the 3’ UTR).
SLC38A8 is associated with autosomal recessive foveal hypoplasia and/or anterior segment dysgenesis, matching the phenotype of the patient [13]. Taking advantage of the long reads, we phased the two variants and observed that each variant lies on a distinct haplotype, confirming the compound heterozygous state in this individual and biallelic impairment of SLC38A8 (Fig. 2c).

We then examined the landscape of structural variants. We identified a homozygous deletion of 3.6 kb partially including the 3’ untranslated region (UTR) of the M-Phase Specific PLK1 Interacting Protein gene (MPLKIP) in patient OXN-027 (Fig. 2d and Supplementary Table 3). This patient showed signs of learning disabilities with distinctive brittle hair, a hallmark of trichothiodystrophy, nonphotosensitive 1, which is associated with non-functional MPLKIP protein. The 3’ UTR is known to regulate mRNA-based processes [14]; hence, we hypothesized that the homozygous 3’ UTR deletion of the MPLKIP gene could alter its expression levels. Indeed, transcriptomic analysis showed that this gene is significantly overexpressed in this patient (Fig. 2d), suggesting that its dysregulation might underlie the observed phenotype. Further investigation is required to understand the functional role of this 3’ UTR deletion.

We next scanned the methylation patterns for all 39 patients and compared them to the episignature profiles associated with 42 known diseases [2]. One patient (OXN-062) had a methylation profile consistent with Hunter McAlpine syndrome (HMA) (Fig. 2e). Independently, we also identified a duplication at 5q35.2-q35.3 containing the NSD1 gene, which was confirmed by chromosomal microarray (Fig. 2e). HMA is characterized by craniosynostosis, intellectual deficit, short stature and facial dysmorphism, matching the clinical indication of the patient.
While deletions of NSD1 and hypomethylation at this locus are associated with Sotos syndrome, HMA has been associated with micro-duplication involving NSD1 and a hypermethylation profile [2], confirming the diagnosis for this patient. We then examined the SMA-specific methylation pattern, described above, across all the patients. Interestingly, we observed one patient (OXN-063) with the characteristic SMA episignature. The biallelic loss of SMN1 in this patient was confirmed by droplet digital PCR (Fig. 2e).

Protocols for analyzing LRS data are still nascent, and no global standard methods have been established specifically for the clinical annotation, filtration and interpretation of the large genomic and epigenomic landscape in patients with rare diseases. In this study, we propose a simplified workflow which substantially reduces the number of putative disease-causing changes detected by whole-genome LRS, while detecting a wide spectrum of genomic and epigenomic pathogenic variation, leading to 13% (5 out of 39) additional diagnoses in patients with rare diseases who had inconclusive testing using traditional methods. We developed an LRS-based “Epimarker” method using the known episignatures of 42 diseases to empirically profile patients in a clinical setting. We also uncover, for the first time, an SMA-specific methylation profile, which was incorporated into our clinical “Epimarker” profiling. Taken together, our results demonstrate the potential of long-read sequencing as a single unified assay for routine clinical genetic testing and the discovery of novel rare disease variation.

METHOD

Patient samples

Control DNA samples with known genomic aberrations (Supplementary Fig. 1a) were used for optimizing the library preparation, sequencing, bioinformatics analysis, and clinical annotation and filtration.
The clinical utility of our approach was then evaluated on DNA from 39 patients with highly suspected monogenic disorders and non-diagnostic short-read whole exome sequencing. Of those patients, 39% (N = 15) also had non-diagnostic chromosomal microarray testing. All patients were consented for clinical genetic testing and deidentified research. Patients with positive findings were further consented for additional investigations and for data sharing. This study was reviewed and approved by the Dubai Scientific Research Ethics Committee, Dubai Health Authority (approval no. DSREC-SR-03/2023_08).

Long Read WGS Library Preparation and sequencing

Genomic DNA was extracted from peripheral whole blood using the QIAsymphony DSP DNA Kit (Qiagen, Hilden, Germany) and the QIAsymphony automated nucleic acid extraction instrument, according to the manufacturer's instructions. A total of 6,000 ng of gDNA was sheared with G-Tubes (Covaris LLC, USA) following the standard 20 kb protocol. The resulting DNA fragments were used for duplicate library preparation per sample using the Ligation Sequencing Kit V14 (Oxford Nanopore, UK), according to the manufacturer's instructions. Libraries were sequenced on the PromethION P48 device with R10.4.1 flow cells (Oxford Nanopore, UK) for 72 hours, with a second library loaded at 24 hours after flow cell washing.

mRNA library preparation and Transcriptome sequencing

Transcriptome sequencing was performed for two patients and two controls (Supplementary Table 4). Total RNA was extracted and purified from human whole blood samples collected in Tempus blood RNA tubes using the Tempus spin RNA isolation kit (Applied Biosystems, US), according to the manufacturer's instructions. For each sample, 270–290 ng of total RNA was used for triplicate library preparation using the TruSeq® Stranded mRNA Library Prep kit (Illumina, USA), according to the manufacturer's instructions. Libraries were sequenced on an Illumina NovaSeq 6000.

Long-read sequencing data analysis

A new pipeline appropriate for long-read nanopore technology was developed in-house using published software (see Extended Methods for details). Briefly, base calling was done in “high-accuracy base calling” (HAC) mode during the run using the MinKNOW distribution (version 22.05.7) and Guppy (version 6.1.5). The methylation tags (MM, ML / mm, ml) were inferred using samtools (version 1.13) for all passed BAM files, which were aligned to the human reference genome (GRCh37/hg19) using minimap2 (version 2.22-r1101). The EPI2ME [15] workflow wf-human-variation (v1.2.0), suitable for long-read technology, was used for detection of genomic variants using its ‘--cnv’, ‘--sv’, ‘--snp’ and ‘--methyl’ modules with default parameters, except for CNV calling, which was run with a bin size of 5. cuteSV (v2.0.3) was applied in conjunction for identifying SVs. CNVs and SVs were annotated using ClassifyCNV (1.1.1) and AnnotSV (v3.2.3). A funnel-down approach was used to filter SVs and CNVs, where SVs with at least 5 supporting reads and allele frequency ≥ 0.3, and CNVs with a log2 fold change of 0.5, were used for downstream analysis. Variants overlapping the coding region of genes associated with disease, as identified from the OMIM and GenCC databases, were retained, and those unique within the cohort and each method were correlated with the patients’ phenotypes using in-house scripts. Matching variants were then manually inspected to identify putative pathogenic ones. Methylation analysis was performed by comparing the methylation profile of the patients with the epigenomic signatures reported in the literature [2]. SMA detection was developed based on the methylation profile in the genomic region capturing exons 6/7 (chr5:70,239,954–70,249,165) of SMN1, where absence of methylation modifications indicated absence of SMN1. SpliceAI was used to detect splicing variants within 50 bp of the annotated exons.
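As an illustration of the SMA call described above, the sketch below maps the percentage of methylation-modified bases in the SMN1 window to a status label. The bands (0% for SMA, 50–70% for carriers, 98–100% for non-carriers) are the values reported in this study's results; the function names, the input of pre-counted bases, and the "indeterminate" label for values outside those bands are assumptions added for the example, and this is not the authors' implementation.

```python
def percent_modified(modified_bases: int, total_bases: int) -> float:
    """Percentage of bases in the SMN1 window carrying a methylation call.

    How bases are counted from the alignment is left out; both counts are
    assumed to come from an upstream step.
    """
    if total_bases <= 0:
        raise ValueError("total_bases must be positive")
    if not 0 <= modified_bases <= total_bases:
        raise ValueError("modified_bases must be between 0 and total_bases")
    return 100.0 * modified_bases / total_bases


def classify_smn1_methylation(percent: float) -> str:
    """Map a methylation percentage to the bands reported in the study."""
    if not 0.0 <= percent <= 100.0:
        raise ValueError("percent must be between 0 and 100")
    if percent == 0.0:
        return "SMA"            # no methylation signal, consistent with SMN1 loss
    if 50.0 <= percent <= 70.0:
        return "carrier"        # moderate band
    if percent >= 98.0:
        return "non-carrier"    # high band
    return "indeterminate"      # outside the reported bands (assumed label)
```

A value that falls between the reported bands is deliberately left unclassified here, since the text gives no rule for it; in the study, positive calls were confirmed by droplet digital PCR.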
Variants with genotype quality ≥ 10, read depth ≥ 30, base quality ≥ 10 and a filter tag of "PASS", present in at least 90% of the isoforms and within 50 bp of an annotated exon from NCBI RefSeq transcripts for build hg19, were identified as high-confidence splicing variants; of these, only those present in disease-causing genes as identified by the GenCC and OMIM databases and matching the patient phenotype were further evaluated.

Transcriptome sequence data analysis

FastQC and MultiQC were used to assess sequencing read quality. High-quality reads (Q ≥ 30) were mapped to GRCh37 (hg19) using STAR (v2.7.8a) with the default settings. Gene counting was performed using featureCounts from Subread (v2.0.1) with the '-p -O -g gene_id -s 2' parameters (Supplementary Table 5), and counts were analysed with DESeq2 (v1.38.3) for batch-effect correction, normalization and differential gene expression analysis. Genes with an adjusted p-value < 0.05 were identified as significant and selected for pathway enrichment analysis using the Enrichr web application (Supplementary Table 6). Additional statistical analysis was performed using the Fisher exact test to rank the top pathways.

Chromosomal Microarray Analysis

Chromosomal microarray analysis (CMA) was performed as previously described [16]. Briefly, CMA was done using the Affymetrix CytoScan HD™ assay, consisting of 2.67 million probes, and analysed using Chromosome Analysis Suite™ software 4.0 to compare, in silico, the hybridization pattern of a patient specimen against a pooled reference sample set. Losses larger than 200 kb (with ≥ 25 probes) or gains larger than 400 kb (≥ 50 probes) are reported, along with smaller variants of pathogenic potential.

Droplet Digital PCR Analysis

The copy numbers of SMN1 and SMN2 were determined by droplet digital PCR (ddPCR) technology as described previously [16], using predesigned proprietary ddPCR assay kits for SMN1 (Catalog No: 186–3500, Bio-Rad).
In addition, experimental controls (0-copy, 1-copy and 2-copy controls for SMN1) were included along with a no-template control. Data analysis was performed using QuantaSoft version 1.7.4.0917 (Bio-Rad) to determine the copy number variation (CNV).

References

1. Kent, A., Parker, A. P., Patel, A., Wynn, S. L. & Steward, C. A. Genomics in rare diseases: an overview for the patient, family and non-specialist healthcare professional. Future Rare Diseases 3, FRD56 (2023).
2. Aref-Eshghi, E. et al. Evaluation of DNA Methylation Episignatures for Diagnosis and Phenotype Correlations in 42 Mendelian Neurodevelopmental Disorders. The American Journal of Human Genetics 106, 356–370 (2020).
3. Blöß, S. et al. Diagnostic needs for rare diseases and shared prediagnostic phenomena: Results of a German-wide expert Delphi survey. PLoS One 12, e0172532 (2017).
4. Mitsuhashi, S. & Matsumoto, N. Long-read sequencing for rare human genetic diseases. J Hum Genet 65, 11–19 (2020).
5. Neerman, N. et al. A clinically validated whole genome pipeline for structural variant detection and analysis. BMC Genomics 20, 545 (2019).
6. Oehler, J. B., Wright, H., Stark, Z., Mallett, A. J. & Schmitz, U. The application of long-read sequencing in clinical settings. Hum Genomics 17, 73 (2023).
7. Mizuguchi, T. et al. A 12-kb structural variation in progressive myoclonic epilepsy was newly identified by long-read whole-genome sequencing. J Hum Genet 64, 359–368 (2019).
8. Miller, D. E. et al. Targeted long-read sequencing identifies missing disease-causing variation. The American Journal of Human Genetics 108, 1436–1449 (2021).
9. Sone, J. et al. Long-read sequencing identifies GGC repeat expansions in NOTCH2NLC associated with neuronal intranuclear inclusion disease. Nat Genet 51, 1215–1221 (2019).
10. Ogino, S. & Wilson, R. B. Genetic testing and risk assessment for spinal muscular atrophy (SMA). Hum Genet 111, 477–500 (2002).
11. Voll, S. et al. Obesity in adults with 22q11.2 deletion syndrome. Genet Med 19, 204–208 (2017).
12. Riley, K. N. et al. Recurrent deletions and duplications of chromosome 2q11.2 and 2q13 are associated with variable outcomes. American Journal of Medical Genetics Part A 167, 2664–2673 (2015).
13. Kuht, H. J. et al. SLC38A8 mutations result in arrested retinal development with loss of cone photoreceptor specialization. Hum Mol Genet 29, 2989–3002 (2020).
14. Mayr, C. What Are 3′ UTRs Doing? Cold Spring Harb Perspect Biol 11, a034728 (2019).
15. EPI2ME Labs 23.02-01 Release. EPI2ME Labs https://labs.epi2me.io/epi2me-labs-23.02.01-release/ (2023).
16. El Naofal, M. et al. The genomic landscape of rare disorders in the Middle East. Genome Medicine 15, 5 (2023).

Additional Declarations

There is NO Competing Interest.

Supplementary Files

Suppfig05.docx
Supplementarytables.xlsx
Also discoverable on Platform About Our Team In Review Editorial Policies Advisory Board Help Center Resources Author Services Accessibility API Access RSS feed Manage Cookie Preferences © Research Square 2026 | ISSN 2693-5015 (online) Privacy Policy Terms of Service Do Not Sell My Personal Information {"props":{"pageProps":{"initialData":{"identity":"rs-4235049","acceptedTermsAndConditions":true,"allowDirectSubmit":false,"archivedVersions":[],"articleType":"Brief Communication","associatedPublications":[],"authors":[{"id":288997536,"identity":"caffed36-bec1-4502-be97-e0dcabeb2cc9","order_by":0,"name":"Ahmad Abou Tayoun","email":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAZAAAAAyAQMAAABI0h/eAAAABlBMVEX///8AAABVwtN+AAAACXBIWXMAAA7EAAAOxAGVKw4bAAAA8UlEQVRIiWNgGAWjYBACNgkgwcPwH8L7wMDGYEBICz9EiwSYwziDGC2SM5C0MPMACYJaDG53J354wyAhb3C8x+yxTQ2fvDkD88MHeLXcObtZcg6DhOGGM2fMjXOOsRnubGAzxmuTwY3cDdJAhzFuuJFjJp3DxpZgcIDBTAKfFvsbuZt/A7XYg7VY/ANpYf/+g4At20C2JIK1MLaBtPCY4dMB1mI5x0AieeaZY2WSvX1shhsO8xTjdRhQy+YbbyokbPuON2+T+PHtGDDo2jd+wGsNRCMDg8IBMOsYMHYIq4cA+QYwVUOs+lEwCkbBKBhBAAC4tUmcYVQszQAAAABJRU5ErkJggg==","orcid":"https://orcid.org/0000-0002-9134-1673","institution":"Al Jalila Children's Specialty Hospital","correspondingAuthor":true,"prefix":"","firstName":"Ahmad","middleName":"Abou","lastName":"Tayoun","suffix":""},{"id":288997537,"identity":"dc9cc784-a770-4853-b4ea-ff5c0023ffdf","order_by":1,"name":"Shruti Sinha","email":"","orcid":"","institution":"Al Jalila Children's Specialty Hospital","correspondingAuthor":false,"prefix":"","firstName":"Shruti","middleName":"","lastName":"Sinha","suffix":""},{"id":288997538,"identity":"2b1c7368-c682-4e95-89cb-1baae0e337d8","order_by":2,"name":"Fatma Rabea","email":"","orcid":"","institution":"Mohammed Bin Rashid University","correspondingAuthor":false,"prefix":"","firstName":"Fatma","middleName":"","lastName":"Rabea","suffix":""},{"id":288997539,"identity":"c3fc37ae-7f35-4363-b1e7-6b9d1330579a","order_by":3,"name":"Sathishkumar 
Ramaswamy","email":"","orcid":"","institution":"Al Jalila Children's Specialty Hospital","correspondingAuthor":false,"prefix":"","firstName":"Sathishkumar","middleName":"","lastName":"Ramaswamy","suffix":""},{"id":288997540,"identity":"61e15d39-0865-462c-b015-1fc18c3f734c","order_by":4,"name":"Ikram Chekroun","email":"","orcid":"","institution":"Mohammed Bin Rashid University","correspondingAuthor":false,"prefix":"","firstName":"Ikram","middleName":"","lastName":"Chekroun","suffix":""},{"id":288997541,"identity":"252cba96-2efa-4f10-a82b-c643ca4c6159","order_by":5,"name":"Maha El Naofal","email":"","orcid":"","institution":"Al Jalila Children's Specialty Hospital","correspondingAuthor":false,"prefix":"","firstName":"Maha","middleName":"El","lastName":"Naofal","suffix":""},{"id":288997542,"identity":"c38eb6a8-dfea-495f-8d83-2384acb9967b","order_by":6,"name":"Ruchi Jain","email":"","orcid":"","institution":"Al Jalila Children's Specialty Hospital","correspondingAuthor":false,"prefix":"","firstName":"Ruchi","middleName":"","lastName":"Jain","suffix":""},{"id":288997543,"identity":"bcf345db-2a6a-4bc3-94f6-d76ff031d605","order_by":7,"name":"Roudha Alfalasi","email":"","orcid":"","institution":"Al Jalila Children's Specialty Hospital","correspondingAuthor":false,"prefix":"","firstName":"Roudha","middleName":"","lastName":"Alfalasi","suffix":""},{"id":288997544,"identity":"c80d2c32-186d-4007-be1f-dc27b7474f2e","order_by":8,"name":"Nour Halabi","email":"","orcid":"","institution":"Al Jalila Children's Specialty Hospital","correspondingAuthor":false,"prefix":"","firstName":"Nour","middleName":"","lastName":"Halabi","suffix":""},{"id":288997545,"identity":"865283f7-5fbd-4ee6-a17e-cf63f50bebff","order_by":9,"name":"Sawsan Yaslam","email":"","orcid":"","institution":"Al Jalila Children's Specialty 
Hospital","correspondingAuthor":false,"prefix":"","firstName":"Sawsan","middleName":"","lastName":"Yaslam","suffix":""},{"id":288997546,"identity":"c2ce7c85-5c21-4d04-9d30-648900e213b0","order_by":10,"name":"Massomeh Sheikh Hassani","email":"","orcid":"","institution":"Al Jalila Children's Specialty Hospital","correspondingAuthor":false,"prefix":"","firstName":"Massomeh","middleName":"Sheikh","lastName":"Hassani","suffix":""},{"id":288997547,"identity":"e3876d22-e584-45b0-87f2-f734508ffaff","order_by":11,"name":"Shruti Shenbagam","email":"","orcid":"","institution":"Al Jalila Children's Specialty Hospital","correspondingAuthor":false,"prefix":"","firstName":"Shruti","middleName":"","lastName":"Shenbagam","suffix":""},{"id":288997548,"identity":"d6a6c4fb-03db-4ed2-a43b-4fae4ced87e3","order_by":12,"name":"Alan Taylor","email":"","orcid":"","institution":"Al Jalila Children's Specialty Hospital","correspondingAuthor":false,"prefix":"","firstName":"Alan","middleName":"","lastName":"Taylor","suffix":""},{"id":288997549,"identity":"1cd97269-cf46-4d77-a54b-e385a17b0d01","order_by":13,"name":"Mohammed Uddin","email":"","orcid":"https://orcid.org/0000-0001-6867-5803","institution":"Mohammed Bin Rashid University of Medicine and Health Sciences","correspondingAuthor":false,"prefix":"","firstName":"Mohammed","middleName":"","lastName":"Uddin","suffix":""},{"id":288997550,"identity":"e31b9608-8667-447d-8f51-edb24db927ee","order_by":14,"name":"Mohamed Al Marri","email":"","orcid":"","institution":"Mohammed Bin Rashid Universty","correspondingAuthor":false,"prefix":"","firstName":"Mohamed","middleName":"Al","lastName":"Marri","suffix":""},{"id":288997551,"identity":"88398195-f338-4f7f-a13b-c8c0b82cac66","order_by":15,"name":"Stefan Du Plessis","email":"","orcid":"","institution":"Mohammed Bin Rashid Universty","correspondingAuthor":false,"prefix":"","firstName":"Stefan","middleName":"","lastName":"Du 
Plessis","suffix":""},{"id":288997552,"identity":"7196f89c-eaaf-4fcf-8cc5-27cb6c61ae3b","order_by":16,"name":"Alawi Alsheikh-Ali","email":"","orcid":"","institution":"Mohammed Bin Rashid Universty","correspondingAuthor":false,"prefix":"","firstName":"Alawi","middleName":"","lastName":"Alsheikh-Ali","suffix":""}],"badges":[],"createdAt":"2024-04-08 08:31:13","currentVersionCode":1,"declarations":"","doi":"10.21203/rs.3.rs-4235049/v1","doiUrl":"https://doi.org/10.21203/rs.3.rs-4235049/v1","draftVersion":[],"editorialEvents":[{"content":"https://doi.org/10.1038/s41467-025-57695-9","type":"published","date":"2025-03-14T04:00:00+00:00"}],"editorialNote":"","failedWorkflow":false,"files":[{"id":54372349,"identity":"964ee6dd-6317-4195-87bb-69fd5158683d","added_by":"auto","created_at":"2024-04-09 13:19:28","extension":"png","order_by":1,"title":"Figure 1","display":"","copyAsset":false,"role":"figure","size":610489,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eStudy design, proof of concept in positive cohort and overview of negative cohort\u003c/strong\u003e. Shown are a) the study design schema, along with the wet bench and bioinformatics workflow with the tools used. b) counts of CNVs and SVs for each method in each filtering step of the “funnel-down” approach. c) episignature of SMA – heatmap of base methylation modification (%) within the region chr5:70239954-70249165 across positive, carrier and negative samples (left panel); IGV methylation view for SMA positive (OXN-068, OXN-069), Carrier (OXN-070, OXN-071) and non-SMA (OXN-012) (top right panel) and ddPCR results with detected copy number (bottom right panel), where 0 CC, 1 CC and 2 CC refers to controls with 0, 1, and 2 copies of \u003cem\u003eSMN1\u003c/em\u003e. 
d) the gender and demography and e) most prevalent primary clinical symptom in the “Negative Cohort”.\u003c/p\u003e","description":"","filename":"Figure1.png","url":"https://assets-eu.researchsquare.com/files/rs-4235049/v1/f71523c6246ee3523cd8f3d1.png"},{"id":54372344,"identity":"79cb5148-0c90-4991-9eba-05c5848ee480","added_by":"auto","created_at":"2024-04-09 13:19:25","extension":"png","order_by":2,"title":"Figure 2","display":"","copyAsset":false,"role":"figure","size":876972,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eDetected pathogenic variants in negative cohort leading to confirmed diagnosis\u003c/strong\u003e. a) Reported on the right side of the graph are the detected number of genomic variants - CNVs, SVs and on the left side the high confidence variants, post funnel down filtering with samples of confirmed diagnosis highlighted in orange identified from CNVs (white line graph) or SVs (grey bars). b) deletion event at 2q11.1-q11.2 in OXN-033 identified by LRS (IGV coverage top panel) and validated by CMA (bottom panel). c) Phased genomic alignment with allele specific INDEL and heterozygous large deletion in \u003cem\u003eSLC38A8\u003c/em\u003e in OXN-048 (left top panel) with a zoomed view of INDEL (bottom panel). CMA profile (right top panel) and PCR assays (bottom panel) corroborating the finding. d) homozygous deletion in the 3’ UTR of \u003cem\u003eMPLKIP\u003c/em\u003edetected by LRS (IGV alignment, top panel), validated by PCR (bottom right panel) and significant difference in the normalized gene expression (bottom left panel). P-value was calculated using Wilcoxon two-tailed test. e) heatmap with methylation profile across negative cohort (including OXN-62) and a published HMA control sample (left panel); duplication event of 5q35.2-q35.3 in OXN-062 detected by LRS (IGV coverage view, top right panel) and validated by CMA (bottom right panel). 
f) heatmap of base methylation modification (%) within the region chr5:70239954-70249165 across negative cohort, SMA positive (OXN-068, OXN-069) and carriers (OXN-070, OXN-071) (left panel), IGV methylation view (right top panel) and ddPCR results (bottom right panel) with detected copy number for OXN-060 and two SMA-negative samples (OXN-035 and OXN-047). 0 CC, 1 CC and 2 CC refers to controls with 0, 1, and 2 copies of \u003cem\u003eSMN1.\u003c/em\u003e\u003c/p\u003e","description":"","filename":"Figure2.png","url":"https://assets-eu.researchsquare.com/files/rs-4235049/v1/f249b5a835966ede6b8533cb.png"},{"id":78566685,"identity":"fa769e88-3e9e-4077-98f3-bc1255e41a53","added_by":"auto","created_at":"2025-03-15 07:06:17","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":2301532,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-4235049/v1/70a48087-b0b9-4ea7-8faf-52586bb49922.pdf"},{"id":54372343,"identity":"9edd151e-f2fc-438f-ae99-163f27c49747","added_by":"auto","created_at":"2024-04-09 13:19:24","extension":"docx","order_by":1,"title":"","display":"","copyAsset":false,"role":"supplement","size":1049748,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cbr\u003e\u003c/p\u003e","description":"","filename":"Suppfig05.docx","url":"https://assets-eu.researchsquare.com/files/rs-4235049/v1/eb6c11c0343a566b4604310d.docx"},{"id":54372359,"identity":"fc815a45-ea9c-4ed4-9d51-898982927a6e","added_by":"auto","created_at":"2024-04-09 13:19:31","extension":"xlsx","order_by":2,"title":"","display":"","copyAsset":false,"role":"supplement","size":34777,"visible":true,"origin":"","legend":"","description":"","filename":"Supplementarytables.xlsx","url":"https://assets-eu.researchsquare.com/files/rs-4235049/v1/8358585bd9c4a0ba13cdbee2.xlsx"}],"financialInterests":"There is \u003cb\u003eNO\u003c/b\u003e Competing 
Interest.","formattedTitle":"Long read sequencing enhances pathogenic and novel variation discovery in patients with rare diseases","fulltext":[{"header":"INTRODUCTION","content":"\u003cp\u003eAround 7,000 rare diseases have been identified, collectively imposing a significant health and socio-economic burden\u003csup\u003e\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e\u003c/sup\u003e. The majority of these diseases have a genetic origin, caused by variants ranging from single nucleotide variants (SNVs) or insertions/deletions of a few nucleotides (INDELs) to large genomic changes such as copy number variants (CNVs), translocations, inversions, transposable element (TE) insertions, or complex rearrangements. Some are also associated with specific epigenomic profiles\u003csup\u003e\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e\u003c/sup\u003e. This diverse spectrum of disease-causing changes, often detected by different technologies, has challenged current genetic diagnostic strategies, contributed to long diagnostic odysseys averaging 6 years\u003csup\u003e\u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e\u003c/sup\u003e, and delayed management or treatment plans for patients with rare diseases.\u003c/p\u003e \u003cp\u003eAlthough short-read sequencing technologies have brought a remarkable leap in the diagnosis of rare genetic diseases\u003csup\u003e\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e,\u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e\u003c/sup\u003e, more than half of the patients remain undiagnosed. 
This is partly due to the inherent limitations of this technology in detecting complex variants such as structural variants, methylation profiles, repeat expansions, or variants embedded in inaccessible regions of the genome, specifically high homology and GC rich regions\u003csup\u003e\u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e6\u003c/span\u003e\u003c/sup\u003e. Recent advances in third generation sequencing technologies have demonstrated the application of targeted LRS for identifying pathogenic variants in known or novel disease-causing genes\u003csup\u003e\u003cspan additionalcitationids=\"CR8\" citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e\u003c/sup\u003e. However, the clinical implementation of LRS for detecting genome-wide variation and methylation changes in the context of rare diseases has been limited by challenges associated with the annotation and filtration of a large number of variants and is yet to be explored. Here we optimize a whole genome LRS workflow and a computational strategy in a cohort of undiagnosed patients with suspected rare diseases leading to additional diagnoses and the uncovering of a novel methylation signature associated with Spinal Muscular Atrophy (SMA).\u003c/p\u003e \u003cp\u003eWe optimized our analysis workflow on a selected cohort of 14 patients with confirmed genetic diagnoses, encompassing a diverse array of genomic and epigenomic pathogenic variants (Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003ea and \u003cb\u003eSupplementary Fig.\u0026nbsp;1a\u003c/b\u003e). 
The study design incorporated a wet bench protocol optimized for long-read Oxford Nanopore sequencing using a PromethION system, targeting a minimum of 30X coverage with an average N50 of 12 kb (Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003ea and \u003cb\u003eSupplementary Fig.\u0026nbsp;1b\u003c/b\u003e). Our computational analysis workflow consists of \u0026ldquo;genome\u0026rdquo; and \u0026ldquo;epigenome\u0026rdquo; modules (Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003ea and \u003cb\u003eExtended Method\u003c/b\u003e). The former module consists of detection, annotation, and selection of genome-wide variants such as copy number variations (CNVs), short variants (SNVs and INDELs) and structural variations (SVs). Raw variants were retained if calls were supported by \u0026ge;\u0026thinsp;5 reads with allele fraction\u0026thinsp;\u0026ge;\u0026thinsp;0.3 and affected the coding region of genes associated with disease as defined in OMIM or GeneCC (Extended Method). This reduced the number of variants by 40% for CNVs and 99% for SVs. Further filtering for variants unique to each patient in the cohort reduced CNVs by 98% (average n\u0026thinsp;=\u0026thinsp;2) and SVs by 99.9% (average n\u0026thinsp;=\u0026thinsp;12) (Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003eb and \u003cb\u003eSupplementary Fig.\u0026nbsp;1c\u003c/b\u003e); the remaining variants were then manually inspected for any clinical correlation. This led to the detection of all associated pathogenic variants in this group (\u003cb\u003eSupplementary Fig.\u0026nbsp;1d-f\u003c/b\u003e). 
The epigenomic module is composed of two methods: one for scanning episignatures specific to 42 known diseases\u003csup\u003e\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e\u003c/sup\u003e, and one for the diagnosis of SMA based on a novel methylation signature we characterize in this study (Extended Methods). SMA is a common, life-threatening autosomal recessive neuromuscular disease caused by biallelic loss, mostly deletions in exon 7, of the survival-of-motor-neuron (\u003cem\u003eSMN1\u003c/em\u003e) gene\u003csup\u003e\u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e10\u003c/span\u003e\u003c/sup\u003e. We observed a specific methylation profile across introns 6 to 8 (chr5:70239954\u0026ndash;70249165) of the \u003cem\u003eSMN1\u003c/em\u003e gene, where 0%, 50\u0026ndash;70% (moderate) and 98\u0026ndash;100% (high) of bases carried methylation modifications in SMA patients, carriers and non-carriers, respectively, revealing a unique episignature for SMA (Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003ec and \u003cb\u003eSupplementary Fig.\u0026nbsp;1e\u003c/b\u003e). We also confirmed the methylation profile for a control sample (OXN-18) with Angelman syndrome (\u003cb\u003eSupplementary Fig.\u0026nbsp;1f\u003c/b\u003e). Overall, our pipeline was able to correctly identify all the pathogenic variants, including complex rearrangements and aberrant methylation, in the optimization cohort.\u003c/p\u003e \u003cp\u003eWe applied this workflow to a set of undiagnosed patients (N\u0026thinsp;=\u0026thinsp;39) who previously had inconclusive testing using short-read exome sequencing, with 39% also having undergone chromosomal microarray testing (Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003ea and \u003cb\u003eSupplementary Table\u0026nbsp;1\u003c/b\u003e). 
Patients were mostly of Arab descent (90%), had roughly balanced gender representation (~\u0026thinsp;40% females) and primarily presented with neurological disorders (44%) (Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003ed-e and \u003cb\u003eSupplementary Table\u0026nbsp;1\u003c/b\u003e). Whole genome LRS in this cohort obtained an average of 53X coverage and an N50 of 12.2 kb. Approximately 35,000 SVs and 83 CNVs were detected in each sample (\u003cb\u003eSupplementary Table\u0026nbsp;2\u003c/b\u003e), which were reduced by 99.98% and 98.49%, respectively, after applying our filtering and selection criteria (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003ea and \u003cb\u003eSupplementary Fig.\u0026nbsp;2a\u003c/b\u003e). Since all patients previously had inconclusive exome testing, we focused our analysis on SNVs with predicted splicing impact, which could have been previously filtered out. We applied our splicing SNV filtration criteria (see Methods), which retained on average\u0026thinsp;~\u0026thinsp;54 SNVs in disease-causing genes for each sample, substantially reducing the total number of SNVs (~\u0026thinsp;1.6M) (\u003cb\u003eSupplementary Fig.\u0026nbsp;2b\u003c/b\u003e and \u003cb\u003eSupplementary Table\u0026nbsp;2\u003c/b\u003e). We evaluated variants within the genes matching the patient phenotype and identified a single variant in \u003cem\u003eDNMT1\u003c/em\u003e (NM_001130823: c.891\u0026thinsp;+\u0026thinsp;8C\u0026thinsp;\u0026gt;\u0026thinsp;T) in OXN-044, though its impact on \u003cem\u003eDNMT1\u003c/em\u003e RNA splicing (\u003cb\u003eSupplementary Fig. \u003cspan refid=\"MOESM2\" class=\"InternalRef\"\u003eS2\u003c/span\u003ec\u003c/b\u003e) and its relatively high allele frequency in the general population led to its classification as clinically benign. 
No other putative clinically relevant sequence variants were identified.\u003c/p\u003e \u003cp\u003eWe next focused on large CNV events and identified pathogenic variants in two patients. For patient OXN-033, three deletions from a total of 59 CNVs were prioritized, of which a heterozygous deletion event (1.4Mb) at 2q11.1-q11.2 was classified as pathogenic post manual inspection and was validated by CMA (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003eb and \u003cb\u003eSupplementary Table\u0026nbsp;3\u003c/b\u003e). Individuals with 2q11.2 deletions have developmental delay, intellectual disability, dysmorphic features and variable skeletal anomalies along with obesity\u003csup\u003e\u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e,\u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e12\u003c/span\u003e\u003c/sup\u003e which was consistent with this patient\u0026rsquo;s phenotype. In another patient (OXN-048), with unconfirmed diagnosis of anterior segment dysgenesis and a heterozygous pathogenic variant in the \u003cem\u003eSLC38A8\u003c/em\u003e identified by exome sequencing, we detected a single heterozygous deletion in 16q23.3 (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003ec and \u003cb\u003eSupplementary Table\u0026nbsp;3\u003c/b\u003e), partially encompassing \u003cem\u003eSLC38A8\u003c/em\u003e (exons 8\u0026ndash;3\u0026rsquo;UTR), using LRS. \u003cem\u003eSLC38A8\u003c/em\u003e is associated with autosomal recessive foveal hypoplasia and/or anterior segment dysgenesis matching the phenotype of the patient\u003csup\u003e\u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e\u003c/sup\u003e. 
Taking advantage of the long reads, we phased the two variants and observed that each variant is on a distinct haplotype, confirming the compound heterozygous state in this individual and biallelic impairment of \u003cem\u003eSLC38A8\u003c/em\u003e (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003ec).\u003c/p\u003e \u003cp\u003eWe then examined the landscape of structural variants. We identified a homozygous deletion of 3.6 kb partially including the 3\u0026rsquo; untranslated region (UTR) of the M-Phase Specific PLK1 Interacting Protein gene (\u003cem\u003eMPLKIP\u003c/em\u003e) in patient OXN-027 (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003ed and \u003cb\u003eSupplementary Table\u0026nbsp;3\u003c/b\u003e). This patient showed signs of learning disabilities with distinctive brittle hair, a hallmark of Trichothiodystrophy nonphotosensitive 1 associated with non-functional MPLKIP protein. The 3\u0026rsquo;UTR region is known to regulate mRNA-based processes\u003csup\u003e\u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e\u003c/sup\u003e, hence we hypothesized that the homozygous 3\u0026rsquo;UTR deletion of the \u003cem\u003eMPLKIP\u003c/em\u003e gene could alter its expression levels. In fact, transcriptomic analysis showed that this gene is significantly overexpressed (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003ed) in this patient, suggesting that its dysregulation might underlie the observed phenotype. 
Further investigation is required to understand the functional role of this 3\u0026rsquo;UTR deletion.\u003c/p\u003e \u003cp\u003eWe next scanned the methylation patterns for all 39 patients and compared them to the episignature profiles associated with 42 known diseases\u003csup\u003e\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e\u003c/sup\u003e. 
One patient (OXN-062) had a methylation profile consistent with Hunter McAlpine syndrome (HMA) (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003ee). Independently, we also identified a duplication at 5q35.2-q35.3 containing the \u003cem\u003eNSD1\u003c/em\u003e gene, which was confirmed by chromosomal microarrays (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003ee). HMA is characterized by craniosynostosis, intellectual deficit, short stature and facial dysmorphism, matching the clinical indication of the patient. While deletions of \u003cem\u003eNSD1\u003c/em\u003e and hypomethylation at this locus are associated with Sotos syndrome, HMA has been associated with micro-duplication involving \u003cem\u003eNSD1\u003c/em\u003e and a hypermethylation profile\u003csup\u003e\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e\u003c/sup\u003e, confirming the diagnosis for this patient. We then examined the SMA-specific methylation pattern, described above, across all the patients. Interestingly, we observed one patient (OXN-063) with the characteristic SMA episignature. The biallelic loss of \u003cem\u003eSMN1\u003c/em\u003e in this patient was confirmed by droplet digital PCR (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003ef).\u003c/p\u003e \u003cp\u003eThe protocols for analyzing LRS are still in nascent stages and no global standard methods have been established specifically for the clinical annotation, filtration and interpretation of the large genomic and epigenomic landscape in patients with rare diseases. 
In this study, we propose a simplified workflow which substantially reduces the number of putative disease-causing changes detected by whole genome LRS, while detecting a wide spectrum of genomic and epigenomic pathogenic variation, leading to 13% (5 out of 39) additional diagnoses in patients with rare diseases who had inconclusive testing using traditional methods. We developed an LRS-based \u0026ldquo;Epimarker\u0026rdquo; method using known episignatures of 42 diseases to empirically profile patients in a clinical setting. We also uncover, for the first time, an SMA-specific methylation profile, which was incorporated into our clinical \u0026ldquo;Epimarker\u0026rdquo; profiling. Taken together, our results demonstrate the potential of long read sequencing as a single unified assay for routine clinical genetic testing and the discovery of novel rare disease variation.\u003c/p\u003e"},{"header":"METHOD","content":"\u003cdiv id=\"Sec3\" class=\"Section2\"\u003e \u003ch2\u003ePatient samples\u003c/h2\u003e \u003cp\u003eControl DNA samples with known genomic aberrations (\u003cb\u003eSupplementary Fig.\u0026nbsp;1a\u003c/b\u003e) were used for optimizing the library preparation, sequencing, bioinformatics analysis, and clinical annotation and filtration. The clinical utility of our approach was then evaluated on DNA from 39 patients with highly suspected monogenic disorders and non-diagnostic short-read whole exome sequencing. 39% of those patients (N\u0026thinsp;=\u0026thinsp;15) also had non-diagnostic chromosomal microarray testing. All patients were consented for clinical genetic testing and deidentified research. Patients with positive findings were further consented for additional investigations and for data sharing. This study was reviewed and approved by the Dubai Scientific Research Ethics Committee, Dubai Health Authority (approval no. 
DSREC-SR-03/2023_08).\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec4\" class=\"Section2\"\u003e \u003ch2\u003eLong Read WGS Library Preparation and sequencing\u003c/h2\u003e \u003cp\u003eGenomic DNA was extracted from peripheral whole blood using the QIAsymphony DSP DNA Kit (Qiagen, Hilden, Germany) and QIAsymphony automated nucleic acid extraction instrument, according to the manufacturer's instructions. 6,000 ng gDNA was sheared with G-Tubes (Covaris LLC, USA) following the standard 20 kb protocol. The resulting DNA fragments were utilized for duplicate library preparation per sample using the Ligation Sequencing Kit V14 (Oxford Nanopore, UK), according to the manufacturer's instructions. Libraries were sequenced on the PromethION P48 device with R10.4.1 flow cells (Oxford Nanopore, UK) for 72 hours, with a second library loaded at 24 hours after flow cell washing.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec5\" class=\"Section2\"\u003e \u003ch2\u003emRNA library preparation and Transcriptome sequencing\u003c/h2\u003e \u003cp\u003eTranscriptome sequencing was performed for two patients and two controls (\u003cb\u003eSupplementary Table\u0026nbsp;4\u003c/b\u003e). Total RNA was extracted and purified from human whole blood samples collected in Tempus blood RNA tubes using the Tempus spin RNA isolation kit (Applied Biosystems, US), according to the manufacturer's instructions. 270\u0026ndash;290 ng of total RNA was utilized for triplicate library preparation per sample using the TruSeq\u0026reg; Stranded mRNA Library Prep kit (Illumina, USA), according to the manufacturer's instructions. 
Libraries were sequenced on an Illumina NovaSeq 6000.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec6\" class=\"Section2\"\u003e \u003ch2\u003eLong-read sequencing data analysis\u003c/h2\u003e \u003cp\u003eA new pipeline appropriate for long-read nanopore technology was developed in-house using published software (see Extended Methods for details). 
Briefly, base calling was done in \u0026ldquo;high-accuracy base calling\u0026rdquo; (HAC) mode during the run using the MinKNOW distribution (version 22.05.7) and Guppy (version 6.1.5). The methylation tags (MM,ML / mm,ml) were inferred using samtools (version 1.13) for all passed BAM files, which were aligned to the human reference genome (GRCh37/hg19) using minimap2 (version 2.22-r1101). The Epi2Me\u003csup\u003e\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e\u003c/sup\u003e workflow wf-human-variation (v1.2.0), suitable for long read technology, was used for detection of the genomic variants using its modules \u0026ndash; \u0026lsquo;--cnv\u0026rsquo;, \u0026lsquo;--sv\u0026rsquo;, \u0026lsquo;--snp\u0026rsquo; and \u0026lsquo;--methyl\u0026rsquo;, with default parameters except for CNV calling, which was run with a bin size of 5. CuteSV (v2.0.3) was applied in conjunction for identifying SVs. CNVs and SVs were annotated using ClassifyCNV (1.1.1) and AnnotSV (v3.2.3). A funnel-down approach was used to filter SVs and CNVs, where SVs with at least 5 supporting reads and allele frequency\u0026thinsp;\u0026ge;\u0026thinsp;0.3 and CNVs with log\u003csub\u003e2\u003c/sub\u003e fold change of 0.5 were used for downstream analysis. Variants overlapping the coding region of genes associated with disease, as identified from the OMIM and GeneCC databases, were retained, and those unique within the cohort and each method were correlated with patients\u0026rsquo; phenotype using in-house scripts. 
Matching variants were then manually inspected to identify putative pathogenic ones.\u003c/p\u003e \u003cp\u003eMethylation analysis was performed by comparing the methylation profile of the patients with those reported in the literature for the epigenomic signature\u003csup\u003e\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e\u003c/sup\u003e. 
SMA detection was developed based on the methylation profile in the genomic region capturing exons 6/7 (chr5:70,239,954\u0026thinsp;\u0026minus;\u0026thinsp;70,249,165) of \u003cem\u003eSMN1\u003c/em\u003e, where absence of methylation modifications indicated absence of \u003cem\u003eSMN1\u003c/em\u003e. SpliceAI was used to detect splicing variants within 50bp of the annotated exons. Variants with genotype quality\u0026thinsp;\u0026gt;\u0026thinsp;=\u0026thinsp;10, read depth\u0026thinsp;\u0026gt;\u0026thinsp;=\u0026thinsp;30, base quality\u0026thinsp;\u0026gt;\u0026thinsp;=\u0026thinsp;10, filter tag as \"PASS\" present in at least 90% of the isoforms and within 50 bp of annotated exon from NCBI Refseq transcripts for build hg19, were identified as high confidence splicing variants; of which only those present in disease-causing genes as identified by GeneCC and OMIM database and matching patient phenotype were further evaluated.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec7\" class=\"Section2\"\u003e \u003ch2\u003eTranscriptome sequence data analysis\u003c/h2\u003e \u003cp\u003eFastQC and MultiQC were used to assess sequencing read quality. High-quality reads (Q\u0026thinsp;\u0026ge;\u0026thinsp;30) were mapped to GRCh37 (hg19) using STAR (v2.7.8a) with the default settings. Gene count was performed using featureCounts from the SubReads (v2.0.1) with the '-p -O -g gene_id -s 2' parameters (\u003cb\u003eSupplementary Table\u0026nbsp;5\u003c/b\u003e) and analysed by DESeq2 (v1.38.3) correcting for batch effects, normalization and differential gene expression analysis. Genes with adj p-value\u0026thinsp;\u0026lt;\u0026thinsp;0.05 were identified as significant and selected for pathway enrichment analysis using Enrichr web-application (\u003cb\u003eSupplementary Table\u0026nbsp;6\u003c/b\u003e). 
Additional statistical analysis was performed using the Fisher exact test to rank the top pathways.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec8\" class=\"Section2\"\u003e \u003ch2\u003eChromosomal Microarray Analysis\u003c/h2\u003e \u003cp\u003eChromosomal microarray analysis was performed as previously described\u003csup\u003e\u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e\u003c/sup\u003e. Briefly, CMA was done using the Affymetrix CytoScan HD\u0026trade; assay consisting of 2.67\u0026nbsp;million probes and analysed using Chromosome Analysis Suite\u0026trade; software 4.0 to compare, \u003cem\u003ein silico\u003c/em\u003e, the hybridization pattern of a patient specimen against a pooled reference sample set. Losses larger than 200 kb (with \u0026ge;\u0026thinsp;25 probes) or gains larger than 400 kb (\u0026ge;\u0026thinsp;50 probes) are reported, along with smaller variants of pathogenic potential.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec9\" class=\"Section2\"\u003e \u003ch2\u003eDroplet Digital PCR Analysis\u003c/h2\u003e \u003cp\u003eThe copy numbers of \u003cem\u003eSMN1\u003c/em\u003e and \u003cem\u003eSMN2\u003c/em\u003e were determined by droplet digital PCR (ddPCR) technology as described previously\u003csup\u003e\u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e\u003c/sup\u003e, using predesigned proprietary ddPCR assay kits for \u003cem\u003eSMN1\u003c/em\u003e (Catalog No: 186\u0026ndash;3500, Bio-Rad). In addition, experimental controls (0-copy, 1-copy and 2-copy controls for \u003cem\u003eSMN1\u003c/em\u003e) were included along with a no-template control. Data analysis was performed using QuantaSoft version 1.7.4.0917 (Bio-Rad) to determine the copy number variation (CNV).\u003c/p\u003e \u003c/div\u003e "},{"header":"References","content":"\u003col\u003e\u003cli\u003e\u003cspan\u003eKent, A., Parker, A. P., Patel, A., Wynn, S. L. \u0026amp; Steward, C. A. 
Genomics in rare diseases: an overview for the patient, family and non-specialist healthcare professional. Future Rare Diseases 3, FRD56 (2023).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eAref-Eshghi, E. \u003cem\u003eet al.\u003c/em\u003e Evaluation of DNA Methylation Episignatures for Diagnosis and Phenotype Correlations in 42 Mendelian Neurodevelopmental Disorders. The American Journal of Human Genetics 106, 356\u0026ndash;370 (2020).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBl\u0026ouml;\u0026szlig;, S. \u003cem\u003eet al.\u003c/em\u003e Diagnostic needs for rare diseases and shared prediagnostic phenomena: Results of a German-wide expert Delphi survey. PLoS One 12, e0172532 (2017).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMitsuhashi, S. \u0026amp; Matsumoto, N. Long-read sequencing for rare human genetic diseases. J Hum Genet 65, 11\u0026ndash;19 (2020).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eNeerman, N. \u003cem\u003eet al.\u003c/em\u003e A clinically validated whole genome pipeline for structural variant detection and analysis. BMC Genomics 20, 545 (2019).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eOehler, J. B., Wright, H., Stark, Z., Mallett, A. J. \u0026amp; Schmitz, U. The application of long-read sequencing in clinical settings. Hum Genomics 17, 73 (2023).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMizuguchi, T. \u003cem\u003eet al.\u003c/em\u003e A 12-kb structural variation in progressive myoclonic epilepsy was newly identified by long-read whole-genome sequencing. J Hum Genet 64, 359\u0026ndash;368 (2019).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMiller, D. E. \u003cem\u003eet al.\u003c/em\u003e Targeted long-read sequencing identifies missing disease-causing variation. 
The American Journal of Human Genetics 108, 1436\u0026ndash;1449 (2021).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSone, J. \u003cem\u003eet al.\u003c/em\u003e Long-read sequencing identifies GGC repeat expansions in NOTCH2NLC associated with neuronal intranuclear inclusion disease. Nat Genet 51, 1215\u0026ndash;1221 (2019).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eOgino, S. \u0026amp; Wilson, R. B. Genetic testing and risk assessment for spinal muscular atrophy (SMA). Hum Genet 111, 477\u0026ndash;500 (2002).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eVoll, S. \u003cem\u003eet al.\u003c/em\u003e Obesity in adults with 22q11.2 deletion syndrome. Genet Med 19, 204\u0026ndash;208 (2017).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eRiley, K. N. \u003cem\u003eet al.\u003c/em\u003e Recurrent deletions and duplications of chromosome 2q11.2 and 2q13 are associated with variable outcomes. American Journal of Medical Genetics Part A 167, 2664\u0026ndash;2673 (2015).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKuht, H. J. \u003cem\u003eet al.\u003c/em\u003e SLC38A8 mutations result in arrested retinal development with loss of cone photoreceptor specialization. Hum Mol Genet 29, 2989\u0026ndash;3002 (2020).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMayr, C. What Are 3\u0026prime; UTRs Doing? Cold Spring Harb Perspect Biol 11, a034728 (2019).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eEPI2ME Labs 23.02-01 Release. \u003cem\u003eEPI2ME Labs\u003c/em\u003e \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://labs.epi2me.io/epi2me-labs-23.02.01-release/\u003c/span\u003e\u003cspan address=\"https://labs.epi2me.io/epi2me-labs-23.02.01-release/\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e (2023).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eEl Naofal, M. 
\u003cem\u003eet al.\u003c/em\u003e The genomic landscape of rare disorders in the Middle East. Genome Medicine 15, 5 (2023).\u003c/span\u003e\u003c/li\u003e\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":true,"hideJournal":false,"highlight":"","institution":"","isAcceptedByJournal":true,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"