Poly(A) selection limits detection of long and alternatively spliced transcripts compared with rRNA depletion in RNA-Sequencing

doi:10.21203/rs.3.rs-8195045/v1


2025 · doi:10.21203/rs.3.rs-8195045/v1

preprint OA: closed



Research Article

Swethaa Natraj Gayathri, Victoria Lillback, Bjarne Udd, Peter Hackman, and 2 more

This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-8195045/v1
This work is licensed under a CC BY 4.0 License.
Status: Published. Journal publication published 12 May, 2026; the published version is in BMC Genomics. Version 1 posted.

Abstract

Eukaryotic transcriptome diversity arises largely from alternative splicing. One of the most widely used high-throughput methods for studying this diversity is RNA sequencing.
RNA sequencing has become a cornerstone of both basic biology and precision medicine, facilitating the quantification of gene and transcript expression as well as the characterization of alternative splicing events and regulatory pathways. Because there is wide interest in studying non-ribosomal RNAs, which constitute about 20% of cellular RNA, it is common either to select for poly(A)+ RNAs or to deplete ribosomal RNAs during the library preparation stage of RNA sequencing. Using blood and skeletal muscle transcriptomics data, we show that poly(A)+ enriched libraries inefficiently detect long transcripts (those longer than 5 kb, which constitute ~16.5% of isoforms in Gencode v39) and predominantly detect the 3′ end rather than the 5′ end of these transcripts. In contrast, rRNA depletion provides more uniform 5′-3′ coverage, improved detection of splicing events, and robust detection of long disease-relevant transcripts. Furthermore, we show that the improved performance of rRNA depleted RNA sequencing, compared to poly(A)+, is particularly evident in the detection of extremely large transcripts, such as those of the sarcomeric genes OBSCN (~39 kb) and TTN (>100 kb). Our findings reveal the advantages of using rRNA depletion over the more commonly used poly(A)+ selection for both research and diagnostic applications, especially where RNA-Seq is employed to analyse long muscle transcripts, detect pathogenic splicing defects, and refine variant interpretation.

Keywords: RNA-Sequencing, rRNA, poly(A)+, transcriptomics, TTN, muscle

Background

RNA sequencing, or RNA-Seq, enables the characterization of gene expression patterns, alternative splicing, and regulatory pathways in different samples and conditions. It is widely used in a broad range of research fields, including life sciences, clinical diagnostics, and the development of novel therapeutics [1–3].
As ribosomal RNAs (rRNAs) account for more than 80% of the total RNA in eukaryotic cells [4, 5], library preparation usually includes either poly(A)+ selection, to enrich for polyadenylated mRNA, or rRNA depletion, to remove rRNA and retain the remaining RNA population. These two approaches have previously been benchmarked and compared for the composition of the detected RNAs, quantification of gene expression, and ability to detect lowly expressed genes [6–8]. In particular, rRNA depleted RNA-Seq has been reported to detect more long non-coding RNAs (lncRNAs) and to better detect lowly expressed transcripts [6–9]. However, this approach also retains immature RNA, including degraded transcripts, which may complicate data interpretation [6]. In contrast, poly(A)+ RNA-Seq predominantly captures mature mRNAs but exhibits a bias toward detecting the 3′ ends of these transcripts [10]. Overall, the sequence reads obtained from rRNA depleted RNA-Seq cover the gene body (i.e. from the 5′ to the 3′ end of the gene) more uniformly [8]. However, one critical aspect that remains insufficiently explored is the extent to which different RNA-Seq library enrichment methods affect the detection of transcripts of varying sizes. This is important because long transcripts, with their complex genetic architecture, diverse splicing patterns, and critical biological functions, present challenges that can complicate their detection by RNA-Seq [11–13]. Furthermore, the size of mRNAs encoded by genes in the human genome can exceed 50 kb, which is too large to be fully detectable by short-read and even some existing long-read RNA-Seq platforms (e.g. Iso-Seq by PacBio) [14, 15]. These long-isoform-coding genes are associated with diverse functions, including neuronal processes, embryonic development, and ageing [13, 16, 17].
Three of the genes that code for the longest mRNAs in humans, namely TTN, NEB and OBSCN, encode sarcomeric proteins that are essential in muscle formation and function [18–21]. Therefore, we propose that the limited detection sensitivity and non-uniform coverage of long mRNAs in poly(A)+ RNA-Seq data particularly impact research areas such as ageing, neuronal development, muscle biology, and disorders affecting neuronal and muscle tissues, although similar biases are expected across all biological research fields. Importantly, this challenge extends to the clinical diagnostic setting, where RNA-Seq is increasingly used to interpret the pathogenicity of genomic variants, especially splice variants in disease-associated genes [22–24]. A critical step in the clinical interpretation of RNA-Seq data involves visualizing candidate disease-causing variants in the Integrative Genomics Viewer (IGV) [25], which is instrumental in identifying variants that cause aberrant splicing patterns. To detect potential disease-associated RNA splicing and gene expression dysregulation, several computational tools have been developed, including DROP [26], FRASER [27], and OUTRIDER [28]. These tools enable detection of aberrant splicing and transcriptional outliers by integrating statistical models and multi-omics data, thereby increasing sensitivity and diagnostic yield for rare disease-associated variants.

Here, we aim to systematically compare poly(A)+ selection and rRNA depletion, two commonly used RNA-Seq library enrichment methods, by analyzing data from two human tissue types (blood and skeletal muscle). We study transcript-body coverage, gene expression detection, and splice variant detection across transcripts of different size groups, focusing on aspects that have not been thoroughly explored previously. We further illustrate how these differences can influence the interpretation of disease-associated variants in very large transcripts like TTN (>100 kb).
Results

1. Ribodepletion RNA-Seq reads offer more uniform transcript coverage

We compared the distribution of mapped sequence reads across the transcript body for transcripts of different lengths when using poly(A)+ selection versus rRNA depletion for library enrichment. For this analysis, twenty-three skeletal muscle samples were sequenced per enrichment method (rRNA depletion and poly(A)+ enrichment). The coefficient of variation (CV) values for muscle RNA-Seq libraries were plotted against transcription length (TL) to assess coverage uniformity (Figure 1A). rRNA depleted RNA-Seq consistently showed lower CV values than poly(A)+ enriched RNA-Seq, particularly for transcripts longer than 5 kb, indicating more uniform transcript coverage when using rRNA depletion. Poly(A)+ selected RNA-Seq detected only a limited number of long transcripts (>40 kb): three in the 40-50 kb range (CWC27, FTX, MYO5A), one between 50 and 100 kb (CCDC26), and one above 100 kb (TTN) (Figure 1A). In contrast, rRNA depleted RNA-Seq detected a higher number of long transcripts: six in the 40-50 kb range (ARID1B, DST, KIAA1109, MAPK10, MYO5A, NF1), four in the 50-100 kb range (ANK2, CCDC26, KMT2C, MACF1), and one over 100 kb (TTN) (Figure 1A).

We also examined the CV distribution against TL in blood samples. In blood RNA-Seq, only a few long transcripts (>40 kb) were detected. Interestingly, only rRNA depleted RNA-Seq detected transcripts larger than 50 kb, namely ANK2 (average TPM 6.2) and TTN (average TPM 2.8), as well as the long non-coding transcripts CCDC26, KCNQ1OT1, and HELLPAR, whereas poly(A)+ RNA-Seq did not detect any transcript larger than 40 kb. Among transcripts in the 1-40 kb range, rRNA depleted RNA-Seq consistently showed lower CV than poly(A)+ RNA-Seq (Figure 1B). This also highlights the reduced sensitivity of poly(A)+ selection for large genes in blood (Figure 1B).
To illustrate how the chosen library enrichment method influences the detection of individual genes, we compared the transcript body-coverage profiles (normalized by total coverage) of two muscle-function genes with long isoforms, OBSCN (~39 kb) and TTN (>100 kb), to that of a gene with a substantially shorter isoform, MYOD1 (~2 kb) (Figure 2A). Poly(A)+ RNA-Seq from muscle biopsies displayed a clear decrease in read coverage toward the 5′ end, highlighting an overall strong 3′ end detection bias. In contrast, rRNA depleted RNA-Seq mostly provided uniform coverage across the transcript body (Figure 2A), with the exception of localized dips in coverage due to low exon usage, particularly in TTN [29] and OBSCN [21]. In blood, transcript body coverage for SYNE1 (~47 kb), MYO9A (~20 kb) and LCN2 (~1 kb) exhibited a strong 3′ end bias and very low coverage toward the 5′ end in poly(A)+ RNA-Seq (Figure 2B). In contrast, rRNA depleted RNA-Seq reads covered each transcript body more uniformly, except at regions where exon usage differs among isoforms, resulting in localized dips in the sequence read coverage. These transcript body coverage profiles are consistent with the 5′ end to 3′ end coverage ratio analysis (Supplementary file 1A), which reflects the uniformity of read coverage across transcript ends. In our data, rRNA depleted RNA-Seq exhibited coverage ratios closer to zero, indicating more uniform 5′-3′ transcript coverage. In contrast, poly(A)+ RNA-Seq showed negative ratios, reflecting its strong 3′ end bias (Supplementary file 1A).

2. Expression detection across gene lengths and biotypes

We examined how gene expression estimates differ between the two library enrichment methods. We plotted the log2 fold change between expression values obtained from poly(A)+ RNA-Seq and those from rRNA-depleted RNA-Seq against TL (log scaled) for muscle and blood samples (Figure 4A and B).
A LOWESS (Locally Weighted Scatterplot Smoothing) curve was fitted to illustrate the overall relationship between TL and fold-change values. In skeletal muscle (Figure 4A), transcripts shorter than 5 kb showed broadly similar expression levels between the two methods. In contrast, for longer transcripts, the distribution shifted toward negative log2 values, indicating higher expression detection in the rRNA-depleted dataset. The LOWESS curve showed a pronounced downward trajectory at transcript lengths between 10^3 and 10^4 (log scale), indicating that discrepancies in expression estimates between the two methods become more pronounced for transcripts longer than ~5 kb. This pattern is consistent with our earlier observations that rRNA-depleted RNA-Seq provides superior coverage and sensitivity for long transcripts. When the analysis was restricted to protein-coding genes (Figure 4A), a similar length-dependent bias favoring rRNA depletion was observed. In blood, the trend was distinct (Figure 4B). The LOWESS curve shows a drastic shift below zero for transcript lengths above 5 kb, with a downward trajectory at transcript lengths between 10^3 and 10^4 (log scale) (Figure 4B). A similar trend was seen when only protein-coding genes were considered; in particular, transcripts exceeding 10 kb showed higher expression detection in rRNA-depleted libraries.

3. rRNA depleted RNA-Seq facilitates improved detection and clinical interpretation of splice variants

To evaluate how the choice of library enrichment strategy influences clinical interpretation and diagnostic sensitivity, we analyzed muscle biopsies from four patients with a confirmed titinopathy diagnosis using both rRNA depletion and poly(A)+ enrichment RNA-Seq methods. Each patient carried confirmed intronic variants in the TTN gene that caused splicing defects.
A two-tiered evaluation combining IGV visualization and the DROP RNA-Seq pipeline was employed to detect pathogenic variants and aberrant splicing events. Comparative analysis across the two sequencing methods indicated that rRNA-depleted RNA-Seq detected pathogenic variants with greater sensitivity and statistical confidence, whereas poly(A)+ enrichment RNA-Seq provided minimal coverage of the affected region (Figure 3; Supplementary file 1B). rRNA-depleted data consistently revealed patient-specific aberrant splice junctions, including complex exon-skipping events and activation of cryptic splice sites, with statistical confidence (padjust < 0.01 in the DROP pipeline) (Figure 3). Notably, in this titinopathy cohort, where strong and previously characterized pathogenic splice variants are present, poly(A)+ RNA-Seq failed to detect many novel and cryptic splicing events that were readily captured by rRNA-depleted RNA-Seq (Supplementary file 1B).

Discussion

Over the last decade, RNA-Seq has become an increasingly important tool in both clinical diagnostics and biomedical research. Its ability to quantify gene expression patterns, detect splicing events, and provide insights into transcriptome-wide alterations has expanded its role in understanding disease mechanisms and improving diagnostic yield [9, 22, 23, 26]. As RNA-Seq becomes increasingly incorporated into clinical settings [1, 30], selecting the appropriate enrichment strategy will be essential to maximize diagnostic yield. Previous studies such as Barrett et al. (2021) conducted head-to-head comparisons of poly(A)-based (SMART-seq V4) and rRNA depletion (SoLo Ovation) RNA-Seq in Caenorhabditis elegans, demonstrating notable advantages for rRNA depletion in the detection of noncoding RNAs, reduction of noise in lowly expressed genes, and more accurate quantification of long transcripts. However, the C. elegans genome differs significantly from that of humans in intron lengths, splicing complexity, gene-length distributions, and expression heterogeneity. Furthermore, to our knowledge, no study has yet systematically evaluated how these two enrichment strategies perform in human, patient-derived tissues, particularly regarding long transcript coverage and splice variant validation. This study addresses this critical gap by directly comparing rRNA depletion and poly(A)+ enrichment in human muscle and blood RNA samples.

Our results indicate that, in both groups of studied samples (skeletal muscle and blood), rRNA depletion leads to markedly lower variation in transcript coverage, suggesting enhanced uniformity of reads along the transcript length and, overall, improved transcript coverage. In contrast, poly(A)+ enriched libraries exhibited a pronounced 3′ end bias, particularly for transcripts longer than 5 kb, resulting in non-uniform coverage. Expression analyses further corroborated these findings, demonstrating that rRNA depletion improves the quantitative detection of transcripts. We calculated relative expression between the two library enrichment methods, where a negative log2 FC (poly(A)+ / rRNA depletion) indicates that the library-size-normalized read counts are higher in the rRNA depleted libraries. This observation supports the notion that rRNA depletion yields more reads aligning to longer transcripts, implying improved coverage for long transcripts. Moreover, visualization of coverage profiles further demonstrated that complex, multi-exon splicing events caused by pathogenic TTN intronic variants were robustly detected only in rRNA-depleted datasets. In contrast, these events were missed or underrepresented in poly(A)+ enrichment RNA-Seq, as shown by our results from both IGV and the DROP pipeline.
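The sign convention for the relative expression described above can be made concrete with a small worked example. The following is a minimal sketch, not the authors' code: the function names, the pseudocount of 1 and the toy read counts are all assumptions introduced for illustration; only the direction of the ratio (poly(A)+ divided by rRNA depletion) and the use of library-size-normalized counts come from the text.

```python
import math

def cpm(counts):
    """Scale one library's gene counts to counts per million of its total."""
    total = sum(counts.values())
    return {gene: c * 1e6 / total for gene, c in counts.items()}

def mean_cpm(libraries):
    """Average the CPM of each gene across libraries of one enrichment method."""
    per_lib = [cpm(lib) for lib in libraries]
    return {g: sum(lib[g] for lib in per_lib) / len(per_lib) for g in per_lib[0]}

def log2_fc(polya_libs, ribo_libs, pseudocount=1.0):
    """log2(poly(A)+ / rRNA-depleted); negative means higher in rRNA-depleted.

    The pseudocount is an assumption to avoid division by zero.
    """
    polya = mean_cpm(polya_libs)
    ribo = mean_cpm(ribo_libs)
    return {g: math.log2((polya[g] + pseudocount) / (ribo[g] + pseudocount))
            for g in polya}

# Hypothetical counts for one short and one long gene in one library per method.
polya_libs = [{"SHORT_GENE": 900, "LONG_GENE": 100}]
ribo_libs = [{"SHORT_GENE": 600, "LONG_GENE": 400}]
fc = log2_fc(polya_libs, ribo_libs)
```

In this toy case the long gene receives a log2 FC of about -2, meaning its normalized count is roughly four times higher in the rRNA-depleted library, while the short gene receives a positive value.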
The lower estimated expression in poly(A)+ RNA-Seq compared to rRNA-depleted libraries, and the exacerbation of this discrepancy with longer genes, can be explained in light of RNA degradation and fragmentation dynamics. Poly(A)+ protocols enrich only molecules that still carry an intact 3′ poly(A) tail, whereas rRNA depletion captures both mature mRNAs and a broad range of additional RNA species, including pre-mRNAs and fragmented transcripts, thereby providing a more comprehensive representation of the transcribed RNA population. Furthermore, longer transcripts are more susceptible to degradation and accumulate more fragmentation events under the same conditions [31, 32]. Because random fragmentation generates a larger number of tail-less fragments for longer transcripts, a smaller proportion of their fragments retain the 3′ poly(A) tail required for capture in poly(A)+ libraries. In contrast, rRNA-depleted protocols can detect any fragment regardless of tail status. This sampling asymmetry results in a length-dependent underestimation of long-gene expression and produces the characteristic 3′ coverage bias observed in degraded poly(A)+ datasets [31, 32]. From a diagnostic perspective, these findings have major implications: improved coverage of long genes directly translates to enhanced detection of aberrant splicing and more reliable variant interpretation in diseases involving large transcripts, particularly those associated with TTN, NEB, and OBSCN, which encode some of the longest mRNAs [18–21]. Long sarcomeric genes are significant targets in genetic testing for muscular dystrophies and cardiomyopathies, yet their complex architecture and large transcript size often hinder reliable read coverage [22, 33].
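The fragmentation argument above can be illustrated with a deliberately simple toy model. This is a sketch under stated assumptions and not an analysis from the study: breaks are assumed to occur independently and uniformly along the transcript (a Poisson process), and `mean_break_spacing_nt` is a hypothetical parameter. Under that assumption, exactly one fragment per molecule keeps the 3′ tail, and a position is captured by poly(A)+ selection only if no break lies between it and the 3′ end.

```python
import math

def tail_fragment_fraction(length_nt, mean_break_spacing_nt):
    """Ratio of the single tail-bearing fragment to the expected fragment count.

    With Poisson breaks, a transcript of length L yields 1 + L/spacing
    fragments on average, and only one of them keeps the poly(A) tail.
    """
    expected_fragments = 1 + length_nt / mean_break_spacing_nt
    return 1 / expected_fragments

def captured_coverage(distance_from_3p_nt, mean_break_spacing_nt):
    """Probability that a position is still attached to the poly(A) tail.

    Equals the probability of no break within the given distance from the
    3' end, so it decays exponentially toward the 5' end.
    """
    return math.exp(-distance_from_3p_nt / mean_break_spacing_nt)

# Hypothetical spacing of one break per 10 kb on average.
short_fraction = tail_fragment_fraction(2_000, 10_000)    # about 0.83
long_fraction = tail_fragment_fraction(100_000, 10_000)   # about 0.09
```

In this model the tail-bearing share falls as transcript length grows, and the capture probability declines with distance from the 3′ end, which matches the direction of both effects described in the text. Real degradation is unlikely to be this uniform, so the numbers are illustrative only.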
Together, our analyses indicate that, although poly(A)+ enrichment remains suitable for standard expression profiling, rRNA depletion is technically and functionally superior when comprehensive transcript coverage and splice-aware variant interpretation are required, particularly for studying long, clinically relevant transcripts.

Despite the clear advantages demonstrated for rRNA depletion, this study is limited by the use of short-read RNA-Seq data, which cannot resolve full-length transcript isoforms or complex splicing patterns with base-pair precision. Short-read approaches risk missing rare or novel isoforms, particularly in large transcripts, and are unable to reliably resolve loci containing long repetitive regions. Therefore, future work integrating long-read sequencing technologies, such as PacBio Iso-Seq or Oxford Nanopore, could complement our findings by enabling the detection of full-length and previously unannotated transcripts that may refine transcript models and isoform-level analyses [14, 15, 34–36]. Notably, several poly(A)-independent protocols have recently been adapted for long-read platforms, including Nanopore-based workflows [37, 38]. Because long-read sequencing methods capture a much broader fraction of the transcriptome, it remains unclear whether the length-dependent differences observed between poly(A)+ and poly(A)-independent libraries persist in long-read total-RNA datasets and, if so, to what extent.

Conclusion

Our data demonstrate that rRNA depleted RNA-Seq provides superior coverage, sensitivity, uniformity, transcript integrity and statistical confidence, enabling detection of splicing aberrations and enhancing variant interpretation. As RNA-Seq becomes increasingly central to molecular diagnostics, careful selection of library enrichment strategies is essential to maximize diagnostic yield and improve variant interpretation.
This study represents the first direct benchmark of rRNA depletion versus poly(A)+ enrichment methods in human patient-derived tissues for the detection and quantification of long transcripts and complex splicing events. Our results suggest rRNA depletion as the preferred method for transcriptome profiling in clinical contexts, particularly where large transcript coverage and splice variant validation are critical.

Materials and methods

In-house RNA sequencing data

A different set of twenty-three patient-derived skeletal muscle (SM) samples was selected for RNA sequencing for each enrichment strategy: rRNA depletion and poly(A)+ selection. Muscle tissue was homogenized in-house using SpeedMill PLUS (Analytik Jena AG, Germany). RNA was extracted with the Qiagen RNeasy Plus Universal Mini Kit (Qiagen, Hilden, Germany) according to the manufacturer's instructions. Total RNA-Seq strand-specific libraries were prepared using the Illumina Ribo-Zero Plus rRNA Depletion Kit (Illumina, Palo Alto, CA, USA) at the Oxford Genomics Center, University of Oxford, Oxford, United Kingdom. Sequencing was performed on NovaSeq 6000 (Illumina), generating approximately 90 million paired-end reads per sample, with a total read length of 302 bp. For poly(A)+ enrichment, the NEBNext Ultra II Directional RNA Library Prep kit (E7760) for Illumina (NEB, Beverly, MA, USA) was used to prepare strand-specific RNA-Seq libraries. Libraries were multiplexed and sequenced on HiSeq4000 (Illumina, CA, USA), and approximately 60 million paired-end reads were produced, also with a total read length of 302 bp.

Public RNA sequencing data

Publicly available blood RNA-Seq data were obtained from the Sequence Read Archive (SRA) under accession number SRP127360. This dataset includes blood samples processed using both rRNA depletion and poly(A)+ enrichment [6].
However, prior to the analysis, and in consultation with the data curator and maintainer, we updated the sample annotation to correct an identified discrepancy. The finalized annotations for both blood and skeletal muscle data are provided in the supplementary materials (Supplementary file 1C and D).

Quality control and read alignment

Raw sequencing reads were subjected to quality control using FastQC [39] to assess base quality scores, GC content, and adapter contamination. All samples exhibited high Phred quality scores across read lengths and were retained for further analysis. Reads were aligned to the human reference genome GRCh38.p13 using STAR v2.7.0a [40] following the two-pass mapping pipeline. The STAR genome index was generated from the Gencode v39 annotation, comprising 61,533 isoforms.

Read Quantification

Transcript-level quantification was obtained with Salmon [41] (Supplementary files 2, 3, 4). The resulting transcripts per million (TPM) values were then summed per gene to obtain gene-level values. Genes with TPM > 1 were used for analysis, ensuring that only sufficiently expressed genes were included in the gene body coverage assessment. This filtering was performed separately for each tissue type to reflect tissue-specific expression profiles. To enable accurate comparisons between library enrichment approaches, gene-level counts were converted to counts per million (CPM), normalizing for sequencing library size. For each gene, the log2 ratio of the average expression in poly(A)+ RNA-Seq to the average expression in rRNA-depleted RNA-Seq was calculated, so that negative values indicate higher expression in the rRNA-depleted libraries. Gene length was defined as the transcription length (TL), calculated by summing the lengths of all annotated exons across all transcripts corresponding to each gene in the Gencode v39 annotation (Supplementary file 5).

RSeQC analysis and uniformity

To assess gene body coverage (GBC), the geneBody_coverage.py tool from the RSeQC package [42] was used.
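The gene-level steps described under Read Quantification (summing transcript TPM per gene, keeping genes with TPM > 1, and deriving TL from annotated exons) can be sketched as follows. This is an illustrative sketch rather than the authors' scripts: the function names and toy inputs are hypothetical, and the text does not state whether exons shared between isoforms are counted once or repeatedly, so merging overlapping exon intervals here is an assumption.

```python
from collections import defaultdict

def gene_tpm(transcript_tpm, tx2gene):
    """Aggregate transcript-level TPM to gene level by summation."""
    totals = defaultdict(float)
    for tx, tpm in transcript_tpm.items():
        totals[tx2gene[tx]] += tpm
    return dict(totals)

def expressed_genes(gene_level_tpm, threshold=1.0):
    """Keep genes whose summed TPM exceeds the threshold (TPM > 1 in the text)."""
    return {g for g, v in gene_level_tpm.items() if v > threshold}

def transcription_length(exons):
    """Total exonic length of a gene from (start, end) 1-based inclusive exons.

    Assumption: overlapping exons from different isoforms are merged so that
    each exonic base is counted once.
    """
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(exons):
        if cur_end is None or start > cur_end:
            if cur_end is not None:
                total += cur_end - cur_start + 1  # close the previous block
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)  # extend the current merged block
    if cur_end is not None:
        total += cur_end - cur_start + 1
    return total

# Hypothetical transcripts, mapping and exon coordinates.
tpm = gene_tpm({"tx1": 0.6, "tx2": 0.7, "tx3": 0.5},
               {"tx1": "GENE_A", "tx2": "GENE_A", "tx3": "GENE_B"})
kept = expressed_genes(tpm)
tl = transcription_length([(1, 100), (50, 150), (201, 300)])
```

Here GENE_A passes the filter only after its two isoforms are summed, and the three exons yield a TL of 250 nt because the first two overlap.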
GBC analysis was performed on the mapped BAM files, restricted to the genes that passed the tissue-specific expression filter. This tool divides each transcript into 100 equally sized bins from the 5′ to the 3′ end and calculates read coverage within each bin, enabling the evaluation of coverage uniformity across transcripts (Supplementary file 6). The coefficient of variation (CV) was calculated across the 100 bins along the length of each gene and log scaled [42]. A low CV value indicates a more uniform read distribution, whereas a higher CV indicates a less uniform read distribution. To further evaluate the 5′ and 3′ end coverage biases, raw read coverage for the first and last 20% of each transcript was extracted from the bin read coverage values. The 5′-end to 3′-end ratio was calculated for specific transcripts, where values closer to zero indicate more balanced coverage across the transcript body. For plotting the transcript body coverage profile of each gene, the raw transcript coverage values were normalized to the sum of the values within each sample.

DROP pipeline to detect aberrant splicing effects

Four SM samples with a confirmed diagnosis were processed using both rRNA depletion and poly(A)+ enrichment methods. The aberrant splicing module (version 1.4.0) in DROP [26] was used to detect pathogenic variants and aberrant splicing. The recommended cohort size is 30 samples for statistical significance; we therefore ran DROP for these four SM samples as part of larger cohorts sharing the same technical aspects of library preparation and sequencing facility (Supplementary file 7). The rRNA depleted samples were part of a 53-sample cohort, and the poly(A)+ enriched samples were part of a 96-sample cohort. We evaluated whether the predicted splicing events were captured by the aberrant splicing module using the default settings.
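The coverage-uniformity metrics described under RSeQC analysis and uniformity can be sketched from a vector of 100 per-bin coverage values ordered 5′ to 3′. This is a minimal sketch and not the authors' implementation: the function names are hypothetical, the use of the population standard deviation, a log2 scale and a pseudocount of 1 for the end ratio are assumptions (the text only states that values near zero indicate balance and negative values indicate 3′ bias), and the CV is returned unlogged because the log base is not given.

```python
import math
import statistics

def coverage_cv(bins):
    """Coefficient of variation of per-bin coverage (population SD / mean)."""
    return statistics.pstdev(bins) / statistics.fmean(bins)

def end_ratio(bins, fraction=0.2, pseudocount=1.0):
    """log2 of 5'-end over 3'-end coverage using the first and last 20% of bins.

    Zero means balanced ends; negative values mean more coverage at the 3' end.
    """
    n = max(1, round(len(bins) * fraction))
    five_prime = sum(bins[:n])
    three_prime = sum(bins[-n:])
    return math.log2((five_prime + pseudocount) / (three_prime + pseudocount))

def normalise_profile(bins):
    """Scale a coverage profile so its values sum to 1 within a sample."""
    total = sum(bins)
    return [b / total for b in bins]

# Hypothetical profiles: flat coverage versus coverage rising toward the 3' end.
uniform = [10.0] * 100
three_prime_biased = [float(i) for i in range(1, 101)]
```

A flat profile gives a CV of 0 and an end ratio of 0, whereas the profile that rises toward the 3′ end gives a positive CV and a negative end ratio, in line with the interpretation given in the text.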
When interpreting the results, we considered events significant either by their original adjusted p-value or by a Bonferroni-corrected p-value calculated only across myogenes.

The use of generative AI and AI-assisted technologies in the writing process

In preparing this work, the authors used ChatGPT to correct the grammar and proofread the text. After applying ChatGPT, the authors reviewed and further modified the text. The authors take full responsibility for the content of this publication.

Declarations

Ethics approval and consent to participate: This study falls under the ethical approval HUS/16896/2022 by the ethics committee of the Hospital District of Helsinki and Uusimaa (HUS) and was performed in accordance with the Declaration of Helsinki.

Consent for publication: Not applicable.

Competing interests: No conflict of interest to declare.

Availability of data and materials: RNA sequencing data for human blood samples were obtained from SRA (SRP127360). RNA sequencing data from human skeletal muscle biopsies are protected under GDPR principles. All code used in this study was taken from the sample scripts provided with each tool package.

Funding: This study is funded by the European Commission under the CoMPaSS-NMD project, funded by HORIZON-HLTH-2022-TOOL-12-two-stage (GA n°101080874 to MS), the Research Council of Finland (#339437, #346209, #361979 to MS), Samfundet Folkhälsan (to MS and BU), the Sigrid Juselius Foundation (#230217 to MS and BU), the European Joint Programme on Rare Diseases ('Improved diagnostic output in large sarcomeric genes IDOLS-G' to BU), and the Magnus Ehrnrooth foundation. Open access was funded by Helsinki University Library.

Authors' contributions: SNG, PH, MS, and AO conceptualized the study. SNG and AO curated the data. SNG, VL, and AO performed the formal analysis. BU, PH, MS, and AO acquired funding. SNG, VL, and AO carried out the investigation. BU, PH, MS, and AO provided supervision.
SNG and AO wrote the original draft, and all authors reviewed and edited the manuscript. MS and AO contributed equally as shared last authors.

Acknowledgements

We would like to thank the IT Center for Science in Finland (CSC) and the IT Center of the University of Helsinki for providing us with the required computing resources throughout this project.

References

1. Peymani F, Farzeen A, Prokisch H (2022) RNA sequencing role and application in clinical diagnostic. Pediatric Investigation 6:29–35
2. Geraci F, Saha I, Bianchini M (2020) Editorial: RNA-Seq Analysis: Methods, Applications and Challenges. Front Genet 11:220
3. Stokes T, Cen HH, Kapranov P, et al (2023) Transcriptomics for Clinical and Experimental Biology Research: Hang on a Seq. Advanced Genetics 4:2200024
4. An W, Yan Y, Ye K (2024) High resolution landscape of ribosomal RNA processing and surveillance. Nucleic Acids Research 52:10630–10644
5. Venema J, Tollervey D (1999) Ribosome Synthesis in Saccharomyces cerevisiae. Annu Rev Genet 33:261–311
6. Zhao S, Zhang Y, Gamini R, Zhang B, Von Schack D (2018) Evaluation of two main RNA-seq approaches for gene quantification in clinical RNA sequencing: polyA+ selection versus rRNA depletion. Sci Rep 8:4781
7. Cui P, Lin Q, Ding F, et al (2010) A comparison between ribo-minus RNA-sequencing and polyA-selected RNA-sequencing. Genomics 96:259–265
8. Barrett A, McWhirter R, Taylor SR, Weinreb A, Miller DM, Hammarlund M (2021) A head-to-head comparison of ribodepletion and polyA selection approaches for Caenorhabditis elegans low input RNA-sequencing libraries. G3 Genes|Genomes|Genetics 11:jkab121
9. Ding X, Zhang S, Li X, et al (2018) Profiling expression of coding genes, long noncoding RNA, and circular RNA in lung adenocarcinoma by ribosomal RNA-depleted RNA sequencing. FEBS Open Bio 8:544–555
10. Viscardi MJ, Arribere JA (2022) Poly(a) selection introduces bias and undue noise in direct RNA-sequencing. BMC Genomics 23:530
11. Treangen TJ, Salzberg SL (2012) Repetitive DNA and next-generation sequencing: computational challenges and solutions. Nat Rev Genet 13:36–46
12. Savarese M, Jonson PH, Huovinen S, Paulin L, Auvinen P, Udd B, Hackman P (2018) The complexity of titin splicing pattern in human adult skeletal muscles. Skelet Muscle 8:11
13. Lopes I, Altab G, Raina P, De Magalhães JP (2021) Gene Size Matters: An Analysis of Gene Length in the Human Genome. Front Genet 12:559998
14. Uapinyoying P, Goecks J, Knoblach SM, Panchapakesan K, Bonnemann CG, Partridge TA, Jaiswal JK, Hoffman EP (2020) A long-read RNA-seq approach to identify novel transcripts of very large genes. Genome Res 30:885–897
15. Wang Y, Zhao Y, Bollas A, Wang Y, Au KF (2021) Nanopore sequencing technology, bioinformatics and applications. Nat Biotechnol 39:1348–1365
16. Brouillette M (2024) Gene length could be a critical factor in the aging of the genome. Proc Natl Acad Sci U S A 121:e2416630121
17. Soheili-Nezhad S, Ibáñez-Solé O, Izeta A, Hoeijmakers JHJ, Stoeger T (2024) Time is ticking faster for long genes in aging. Trends in Genetics 40:299–312
18. Bang ML, Centner T, Fornoff F, et al (2001) The complete gene sequence of titin, expression of an unusual approximately 700-kDa titin isoform, and its interaction with obscurin identify a novel Z-line to I-band linking system. Circ Res 89:1065–1072
19. Savarese M, Maggi L, Vihola A, et al (2018) Interpreting Genetic Variants in Titin in Patients With Muscle Disorders. JAMA Neurol 75:557
20. Lawlor MW, Ottenheijm CA, Lehtokari V-L, Cho K, Pelin K, Wallgren-Pettersson C, Granzier H, Beggs AH (2011) Novel mutations in NEB cause abnormal nebulin expression and markedly impaired muscle force generation in severe nemaline myopathy. Skeletal Muscle 1:23
21. Oghabian A, Jonson PH, Gayathri SN, et al (2025) OBSCN undergoes extensive alternative splicing during human cardiac and skeletal muscle development. Skeletal Muscle 15:5
22. Hong SE, Kneissl J, Cho A, et al (2022) Transcriptome-based variant calling and aberrant mRNA discovery enhance diagnostic efficiency for neuromuscular diseases. J Med Genet 59:1075–1081
23. Pan Y, Nallamilli BRR, Liu R, et al (2025) Unveiling non-coding DMD variants: synergising RNA sequencing and DNA sequencing for enhanced molecular diagnosis. J Med Genet 62:97–106
24. Nielsen AF, Bindereif A, Bozzoni I, et al (2022) Best practice standards for circular RNA research. Nat Methods 19:1208–1220
25. Robinson JT, Thorvaldsdóttir H, Wenger AM, Zehir A, Mesirov JP (2017) Variant Review with the Integrative Genomics Viewer. Cancer Research 77:e31–e34
26. Yépez VA, Mertes C, Müller MF, et al (2021) Detection of aberrant gene expression events in RNA sequencing data. Nat Protoc 16:1276–1296
27. Mertes C, Scheller IF, Yépez VA, Çelik MH, Liang Y, Kremer LS, Gusic M, Prokisch H, Gagneur J (2021) Detection of aberrant splicing events in RNA-seq data using FRASER. Nat Commun 12:529
28. Brechtmann F, Mertes C, Matusevičiūtė A, Yépez VA, Avsec Ž, Herzog M, Bader DM, Prokisch H, Gagneur J (2018) OUTRIDER: A Statistical Method for Detecting Aberrantly Expressed Genes in RNA Sequencing Data. The American Journal of Human Genetics 103:907–917
29. MF F, A O, E N, et al (2024) Inferring disease course from differential exon usage in the wide titinopathy spectrum. Ann Clin Transl Neurol. https://doi.org/10.1002/acn3.52189
30. Byron SA, Van Keuren-Jensen KR, Engelthaler DM, Carpten JD, Craig DW (2016) Translating RNA sequencing into clinical diagnostics: opportunities and challenges. Nat Rev Genet 17:257–271
31. Feng H, Zhang X, Zhang C (2015) mRIN for direct assessment of genome-wide and gene-specific mRNA integrity from large-scale RNA-sequencing data. Nat Commun 6:7816
32. Wang L, Nie J, Sicotte H, et al (2016) Measure transcript integrity using RNA-seq data. BMC Bioinformatics 17:58
33. Gonorazky H, Liang M, Cummings B, et al (2016) RNA seq analysis for the diagnosis of muscular dystrophy. Ann Clin Transl Neurol 3:55–60
34. Kono N, Arakawa K (2019) Nanopore sequencing: Review of potential applications in functional genomics. Dev Growth Differ 61:316–326
35. Rhoads A, Au KF (2015) PacBio Sequencing and Its Applications. Genomics Proteomics Bioinformatics 13:278–289
36. Pollard MO, Gurdasani D, Mentzer AJ, Porter T, Sandhu MS (2018) Long reads: their purpose and place. Human Molecular Genetics 27:R234–R241
37. Ibrahim F, Oppelt J, Maragkakis M, Mourelatos Z (2021) TERA-Seq: true end-to-end sequencing of native RNA molecules for transcriptome characterization. Nucleic Acids Research 49:e115–e115
38. Saville L, Wu L, Habtewold J, Cheng Y, Gollen B, Mitchell L, Stuart-Edwards M, Haight T, Mohajerani M, Zovoilis A (2024) NERD-seq: a novel approach of Nanopore direct RNA sequencing that expands representation of non-coding RNAs. Genome Biol 25:233
39. Lo C-C, Chain PSG (2014) Rapid evaluation and quality control of next generation sequencing data with FaQCs. BMC Bioinformatics 15:366
40. Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, Batut P, Chaisson M, Gingeras TR (2013) STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29:15–21
41. Patro R, Duggal G, Love MI, Irizarry RA, Kingsford C (2017) Salmon provides fast and bias-aware quantification of transcript expression. Nat Methods 14:417–419
42. Wang L, Wang S, Li W (2012) RSeQC: quality control of RNA-seq experiments. Bioinformatics 28:2184–2185

Additional Declarations

No competing interests reported.
Supplementary Files
Supplementary1.pdf, Supplementaryfile2.csv, Supplementaryfile3.csv, Supplementaryfile4.csv, Supplementaryfile5.csv, Supplementaryfile6.xlsx, Supplementary7.xlsx

Authors and affiliations: Swethaa Natraj Gayathri (University of Helsinki; corresponding author), Victoria Lillback (University of Helsinki), Bjarne Udd (Folkhälsans Forskningscentrum), Peter Hackman (Folkhälsans Forskningscentrum), Marco Savarese (Folkhälsans Forskningscentrum), Ali Oghabian (Folkhälsans Forskningscentrum)

Preprint DOI: https://doi.org/10.21203/rs.3.rs-8195045/v1
Published version: https://doi.org/10.1186/s12864-026-12944-z
Figure 1. Transcript coverage variation in RNA-Seq. Boxplots (left) show CV values across transcripts grouped by TL. Each box indicates the interquartile range; whiskers represent the spread of values across genes, not error bars. Line plots (right) highlight the trend. A) Skeletal muscle: for transcripts >5 kb, RiboD (rRNA depletion) consistently shows lower variation than poly(A)+ enrichment. B) Blood: for transcripts >5 kb, rRNA depletion also consistently shows lower variation than poly(A)+ enrichment.

Figure 2. Normalized transcript body coverage profiles for genes of varying lengths in RiboD (rRNA-depleted) and poly(A)+ selected libraries. A) Skeletal muscle samples. B) Blood samples. Poly(A)+ libraries display 3' end bias, whereas rRNA depletion provides more uniform coverage across the transcript body.

Figure 3. Sashimi plots for the same skeletal muscle samples run with both poly(A)+ enrichment and rRNA depletion. Four human patient samples (A, B, C, D) with confirmed titinopathy were run with both RNA library approaches (green for poly(A)+ enrichment and orange for rRNA depletion). Snapshots of IGV sashimi plots show the splice events for each library type in each patient sample. Black arrows indicate the variant site. Splice junctions are coloured to match the library group. The numbered box within each junction curve gives the number of reads supporting that splice junction. Exon numbers are labelled as Enn in pink. BioRender was used to assemble the sashimi snapshots and to mark the arrows and read-count boxes for splice junctions.

Figure 4. Gene expression detection patterns across total exon lengths and biotypes in poly(A)+ enriched and rRNA-depleted RNA-Seq libraries. A) Skeletal muscle. B) Blood samples. Scatterplot of log₂ fold change (poly(A)+ / rRNA depletion) against transcript length (log₁₀ scale). Each point represents a gene, colored by its average expression level (CPM). A LOWESS curve (green) highlights overall trends across gene lengths. Positive log₂ values indicate higher expression in poly(A)+ libraries, whereas negative values indicate higher expression in rRNA-depleted libraries. The equivalent analysis restricted to protein-coding genes illustrates length-dependent expression of protein-coding genes. Scatter plots were made using Python.

Background
RNA sequencing, or RNA-Seq, enables the characterization of gene expression patterns, alternative splicing, and regulatory pathways in different samples and conditions.
It is widely used in a broad range of research fields, including life sciences, clinical diagnostics, and the development of novel therapeutics [1–3]. As ribosomal RNAs (rRNAs) account for more than 80% of the total RNA in eukaryotic cells [4, 5], library preparation typically includes either poly(A)+ selection, to enrich for polyadenylated mRNA, or rRNA depletion, to remove rRNA and retain the remaining RNA population. These two approaches have previously been benchmarked and compared for the composition of the detected RNAs, quantification of gene expression, and ability to detect lowly expressed genes [6–8]. In particular, rRNA-depleted RNA-Seq has been reported to detect more long non-coding RNAs (lncRNAs) and to better detect lowly expressed transcripts [6–9]. However, this approach also retains immature RNA, including degraded transcripts, which may complicate data interpretation [6]. In contrast, poly(A)+ RNA-Seq predominantly captures mature mRNAs but exhibits a bias toward detecting the 3' ends of these transcripts [10]. Overall, the sequence reads obtained from rRNA-depleted RNA-Seq cover the gene body (i.e. from the 5' to the 3' end of the gene) more uniformly [8]. However, one critical aspect that remains insufficiently explored is the extent to which different RNA-Seq library enrichment methods affect the detection of transcripts of varying sizes. This is important since long transcripts with complex genetic architecture, diverse splicing patterns, and critical biological functions present challenges that can complicate their detection by RNA-Seq [11–13].
Furthermore, the mRNAs encoded by genes in the human genome can exceed 50 kb, which is too large to be fully detectable by short-read and even some existing long-read RNA-Seq platforms (e.g. Iso-Seq by PacBio) [14, 15]. These long-isoform-coding genes are associated with diverse functions including neuronal processes, embryonic development and ageing [13, 16, 17]. Three of the genes that code for the longest mRNAs in humans, namely TTN, NEB and OBSCN, encode sarcomeric proteins that are essential in muscle formation and function [18–21]. Therefore, we propose that the limited detection sensitivity and non-uniform coverage of long mRNAs in poly(A)+ RNA-Seq data particularly impact research areas such as ageing, neuronal development, muscle biology, and disorders affecting neuronal and muscle tissues, although similar biases are expected across all biological research fields. Importantly, this challenge extends to the clinical diagnostic setting, where RNA-Seq is increasingly used to interpret the pathogenicity of genomic variants, especially splice variants in disease-associated genes [22–24]. A critical step in the clinical interpretation of RNA-Seq data involves visualizing candidate disease-causing variants in the Integrative Genomics Viewer (IGV) [25], which is instrumental in identifying splice variants that reveal aberrant splicing patterns. To detect potentially disease-associated RNA splicing and gene expression dysregulation, several computational tools, including DROP [26], FRASER [27], and OUTRIDER [28], have been developed.
These tools enable detection of aberrant splicing and transcriptional outliers by integrating statistical models and multi-omics data, thereby increasing sensitivity and diagnostic yield for rare disease-associated variants.

Here, we aim to systematically compare poly(A)+ selection and rRNA depletion, two commonly used RNA-Seq library enrichment methods, by analyzing data from varied human tissue types (i.e. blood and skeletal muscle). We study transcript-body coverage, gene expression detection, and splice variant detection across transcripts of different size groups, focusing on aspects that have not been thoroughly explored previously. We further illustrate how these differences can influence the interpretation of disease-associated variants in very large transcripts like TTN (>100 kb).

Results

1. Ribodepletion RNA-Seq reads offer more uniform transcript coverage

We compared the distribution of mapped sequence reads across the transcript body for transcripts of different lengths when using poly(A)+ selection versus rRNA depletion for library enrichment. For this analysis, twenty-three skeletal muscle samples were sequenced using rRNA depletion and poly(A)+ enrichment RNA sequencing (RNA-Seq). The coefficient of variation (CV) values for muscle RNA-Seq libraries were plotted against transcript length (TL) to assess coverage uniformity (Figure 1A). rRNA-depleted RNA-Seq consistently showed lower CV values than poly(A)+ enriched RNA-Seq, particularly for transcripts longer than 5 kb, indicating more uniform transcript coverage when using rRNA depletion.
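The coverage-uniformity measure described above can be illustrated with a minimal sketch. It assumes that the CV of a transcript is the standard deviation of its per-position read depth divided by the mean depth; the exact definition used in the study is not stated in this section, and the function name and toy depth profiles below are hypothetical.

```python
from statistics import mean, pstdev


def coverage_cv(depths):
    """Coefficient of variation (std / mean) of per-position read depth.

    Assumption: CV is computed over positions along one transcript.
    A lower CV means more uniform coverage across the transcript body.
    """
    mu = mean(depths)
    if mu == 0:
        return float("nan")  # CV is undefined for a transcript with no coverage
    return pstdev(depths) / mu


# Hypothetical toy depth profiles, ordered from the 5' end to the 3' end.
uniform_profile = [10, 11, 9, 10, 10, 11, 9, 10]   # even coverage
three_prime_biased = [0, 0, 1, 2, 5, 12, 25, 35]    # coverage piles up at the 3' end

cv_uniform = coverage_cv(uniform_profile)
cv_biased = coverage_cv(three_prime_biased)
print(f"uniform CV = {cv_uniform:.3f}, 3'-biased CV = {cv_biased:.3f}")
```

Both toy profiles have the same mean depth, so the difference in CV comes only from how evenly the reads are spread, which is the property the comparison in Figure 1 relies on.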
When examining the TL measurements, poly(A)+ selected RNA-Seq detected only a limited number of long transcripts (\u0026gt; 40 kb): three transcripts in the 40-50 kb range (\u003cem\u003eCWC27, FTX, MYO5A\u003c/em\u003e), one between 50-100 kb (\u003cem\u003eCCDC26\u003c/em\u003e), and one above 100 kb (\u003cem\u003eTTN\u003c/em\u003e) (\u003cstrong\u003eFigure 1\u003c/strong\u003e\u003cstrong\u003eA\u003c/strong\u003e). \u0026nbsp;In contrast, the rRNA depleted RNA-Seq detected a higher number of long transcripts: six of which constitute in the 40-50 kb range (\u003cem\u003eARID1B, DST, KIAA1109, MAPK10, MYO5A, NF1\u003c/em\u003e) (\u003cstrong\u003eFigure 1\u003c/strong\u003e\u003cstrong\u003eA\u003c/strong\u003e), four are in the 50-100 kb range (\u003cem\u003eANK2, CCDC26, KMT2C, MACF1\u003c/em\u003e), and one over 100 kb (\u003cem\u003eTTN\u003c/em\u003e).\u003c/p\u003e\n\u003cp\u003eWe also checked CV distribution against TL in blood samples. In blood RNA-Seq, only a few long transcripts (\u0026gt;40 kb) were detected. Interestingly, only rRNA depleted RNA-Seq could detect transcripts larger than 50kb, namely \u003cem\u003eANK2\u0026nbsp;\u003c/em\u003e(average TPM 6.2) and \u003cem\u003eTTN\u0026nbsp;\u003c/em\u003e(average TPM 2.8), as well as long non-coding transcripts \u003cem\u003eCCDC26, KCNQ1OT1, HELLPAR\u003c/em\u003e. Whereas poly(A)+ RNA-Seq did not detect any transcript larger than 40 kb. Within the 1-40 kb range transcripts, rRNA depleted RNA-Seq consistently showed lower CV than poly(A)+ (\u003cstrong\u003eFigure 1\u003c/strong\u003e \u003cstrong\u003eB)\u003c/strong\u003e. 
This also highlights the reduced sensitivity of poly(A)+ selection for large genes in blood (\u003cstrong\u003eFigure 1\u003c/strong\u003e \u003cstrong\u003eB)\u003c/strong\u003e.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003e\u0026nbsp;In order to illustrate how the chosen library enrichment method influences the detection of individual genes, we compared the transcript body-coverage profiles (normalized by total coverage) of two muscle-function genes with long isoforms, \u003cem\u003eOBSCN\u003c/em\u003e (~39 kb) and \u003cem\u003eTTN\u003c/em\u003e (\u0026gt;100 kb), to that of a gene with a substantially shorter isoform, \u003cem\u003eMYOD1\u003c/em\u003e (~2 kb) (\u003cstrong\u003eFigure 2\u003c/strong\u003e\u003cstrong\u003eA\u003c/strong\u003e). Poly(A)+ detection RNA-Seq from muscle biopsies displayed a clear decrease in read coverage toward the 5\u0026apos; end, highlighting an overall strong 3\u0026apos; end detection bias. In contrast, the rRNA depleted RNA-Seq mostly provided uniform coverage signals across the transcript body (\u003cstrong\u003eFigure 2\u003c/strong\u003e\u003cstrong\u003eA)\u003c/strong\u003e, with the exception of localized dips in the coverage due to low exon usage, particularly in \u003cem\u003eTTN\u003c/em\u003e\u003csup\u003e29\u003c/sup\u003e and \u003cem\u003eOBSCN\u003c/em\u003e\u003csup\u003e21\u003c/sup\u003e\u003cem\u003e.\u003c/em\u003e\u003c/p\u003e\n\u003cp\u003e\u003cem\u003e\u0026nbsp;\u003c/em\u003eIn blood, transcript body coverage results for \u003cem\u003eSYNE1\u003c/em\u003e (~47 kb), \u003cem\u003eMYO9A\u003c/em\u003e (~20 kb) and \u003cem\u003eLCN2\u003c/em\u003e (~1kb) exhibited a strong 3\u0026prime; end bias and a very low coverage toward the 5\u0026apos; end in poly(A)+ RNA-Seq (\u003cstrong\u003eFigure 2\u003c/strong\u003e\u003cstrong\u003eB)\u003c/strong\u003e. 
In contrast, rRNA depleted RNA-Seq reads covered each transcript body more uniformly, except at regions where the exon usage differs among isoforms, resulting in localized dips in the sequence read coverage.\u003c/p\u003e\n\u003cp\u003eThese transcript body coverage profile results are consistent with the 5\u0026rsquo;-end to 3\u0026rsquo;-end coverage ratio analysis (Supplementary file 1 A), which reflects the uniformity of read coverage across transcript ends. In our data, rRNA depleted RNA-Seq exhibited coverage ratios closer to zero, indicating more uniform 5\u0026rsquo;-3\u0026rsquo; end transcript coverage. In contrast, poly(A)+ RNA-Seq showed negative ratios, reflecting its strong 3\u0026rsquo; end bias (Supplementary file 1 A).\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e2. \u0026nbsp;\u003c/strong\u003e\u003cstrong\u003eExpression detection across gene lengths and biotypes\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eWe examined how gene expression estimates differ between the two library enrichment methods. We plotted the log₂ fold change between expression values obtained from poly(A)+ RNA-Seq and those from rRNA-depleted RNA-Seq against TL (log scaled) for muscle and blood samples \u003cstrong\u003e(\u003c/strong\u003e\u003cstrong\u003eFigure 4\u003c/strong\u003e\u003cstrong\u003e\u0026nbsp;A\u0026amp;B).\u0026nbsp;\u003c/strong\u003eA LOWESS (Locally Weighted Scatterplot Smoothing) curve was fitted to illustrate the overall relationship between TL and fold-change values. In skeletal muscle (\u003cstrong\u003eFigure 4\u003c/strong\u003e\u003cstrong\u003eA\u003c/strong\u003e), transcripts shorter than 5 kb showed broadly similar expression levels between the two methods. In contrast, for longer transcripts, the distribution shifted toward negative log₂ values, indicating higher expression detection in the rRNA-depleted dataset. 
The LOWESS curve demonstrated a pronounced downward trajectory at log-scaled transcript lengths of 10\u003csup\u003e3\u003c/sup\u003e-10\u003csup\u003e4\u003c/sup\u003e, indicating that discrepancies in expression estimates between the two methods become more pronounced for transcripts longer than ~5 kb. This pattern is consistent with our earlier observations that rRNA-depleted RNA-Seq provides superior coverage and sensitivity for long transcripts. When the analysis was restricted to protein-coding genes (\u003cstrong\u003eFigure 4\u003c/strong\u003e\u003cstrong\u003eA\u003c/strong\u003e), a similar length-dependent bias favoring rRNA depletion was observed. In blood, the trend was distinct (\u003cstrong\u003eFigure 4\u003c/strong\u003e\u003cstrong\u003eB\u003c/strong\u003e). The LOWESS curve showed a drastic shift below zero for transcript lengths above 5 kb, with a downward trajectory at log-scaled transcript lengths of 10\u003csup\u003e3\u003c/sup\u003e-10\u003csup\u003e4\u003c/sup\u003e (\u003cstrong\u003eFigure 4\u003c/strong\u003e\u003cstrong\u003eB\u003c/strong\u003e). A similar trend was seen when only protein-coding genes were considered; in particular, transcripts exceeding 10 kb showed higher expression detection in rRNA-depleted libraries.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e3. \u0026nbsp;\u003c/strong\u003e\u003cstrong\u003erRNA depleted RNA-Seq facilitates improved detection and clinical interpretation of splice variants\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eTo evaluate how the choice of library enrichment strategy influences clinical interpretation and diagnostic sensitivity, we analyzed muscle biopsies from four patients with a confirmed titinopathy diagnosis using both rRNA depletion and poly(A)+ enrichment RNA-Seq methods. Each patient carried confirmed intronic variants in the \u003cem\u003eTTN\u003c/em\u003e gene that caused splicing defects. 
A two-tiered evaluation combining IGV visualization and the DROP RNA-Seq pipeline was employed to detect pathogenic variants and aberrant splicing events. Comparative analysis across the two sequencing methods indicated that rRNA-depleted RNA-Seq detected pathogenic variants with greater sensitivity and statistical confidence, whereas poly(A)+ enrichment RNA-Seq provided minimal coverage of the affected variant (\u003cstrong\u003eFigure 3\u003c/strong\u003e; Supplementary file 1 B). rRNA-depleted data consistently revealed patient-specific aberrant splice junctions, including complex exon-skipping events and activation of cryptic splice sites, with statistical confidence (padjust \u0026lt; 0.01 in the DROP pipeline) (\u003cstrong\u003eFigure 3\u003c/strong\u003e). Notably, in this titinopathy cohort, where strong and previously characterized pathogenic splice variants are present, poly(A)+ RNA-Seq failed to detect many novel and cryptic splicing events that were readily captured by rRNA-depleted RNA-Seq (Supplementary file 1 B).\u003c/p\u003e"},{"header":"Discussion","content":"\u003cp\u003eOver the last decade, RNA-Seq has become an increasingly important tool in both clinical diagnostics and biomedical research. Its ability to quantify gene expression patterns, detect splicing events and provide insights into transcriptome-wide alterations has expanded its role in understanding disease mechanisms and improving diagnostic yield\u003csup\u003e9, 22, 23, 26\u003c/sup\u003e. As RNA-Seq becomes increasingly incorporated into clinical settings\u003csup\u003e1, 30\u003c/sup\u003e, selecting the appropriate enrichment strategy will be essential to maximize diagnostic yield. Previous studies such as Barrett et al. 
(2021) conducted head-to-head comparisons of poly(A)-based (SMART-seq V4) and rRNA depletion (SoLo Ovation) RNA-Seq in \u003cem\u003eCaenorhabditis elegans\u003c/em\u003e, demonstrating notable advantages for rRNA depletion in the detection of noncoding RNAs, reduction of noise in lowly expressed genes, and more accurate quantification of long transcripts. However, the \u003cem\u003eC. elegans\u003c/em\u003e genome differs significantly from that of humans, with notable differences in intron lengths, splicing complexity and gene-length distributions, and expression heterogeneity. Furthermore, to our knowledge, no study has yet systematically evaluated how these two enrichment strategies perform in the context of human, patient-derived tissues, particularly regarding long transcript coverage and splice variant validation. This study addresses this critical gap by directly comparing rRNA depletion and poly(A)\u0026thinsp;+\u0026thinsp;enrichment in human muscle and blood RNA samples.\u003c/p\u003e\u003cp\u003eOur results indicate that, in both groups of studied samples (skeletal muscle and blood), the use of rRNA depletion leads to markedly lower variation in the coverage of transcripts, suggesting enhanced uniformity of reads along transcript length and, overall, improved transcript coverage. In contrast, poly(A)\u0026thinsp;+\u0026thinsp;enriched libraries exhibited a pronounced 3\u0026prime; end bias, particularly for transcripts longer than 5 kb, resulting in non-uniform coverage. Expression analyses further corroborated these findings, demonstrating that rRNA depletion improves the quantitative detection of transcripts. We calculated relative expression between the two library enrichment methods, where a negative log\u003csub\u003e2\u003c/sub\u003eFC (poly(A)+ / rRNA depletion) indicates that the library-size-normalized read counts are higher in the rRNA depleted libraries. 
This observation supports the notion that rRNA depletion yields more reads aligning to longer transcripts, implying improved coverage for long transcripts. Moreover, visualization of coverage profiles further demonstrated that complex, multi-exon splicing events caused by pathogenic \u003cem\u003eTTN\u003c/em\u003e intronic variants were robustly detected only in rRNA-depleted datasets. In contrast, these events were missed or underrepresented in poly(A)\u0026thinsp;+\u0026thinsp;enrichment RNA-Seq, as shown by our results using both IGV and the DROP pipeline.\u003c/p\u003e\u003cp\u003eThe lower estimated expression in poly(A)\u0026thinsp;+\u0026thinsp;RNA-Seq compared to rRNA-depleted libraries, and the exacerbation of this discrepancy with longer genes, can be explained in light of RNA degradation and fragmentation dynamics. Poly(A)\u0026thinsp;+\u0026thinsp;protocols enrich only molecules that still carry an intact 3\u0026prime; poly(A) tail, whereas rRNA depletion captures both mature mRNAs and a broad range of additional RNA species, including pre-mRNAs and fragmented transcripts, thereby providing a more comprehensive representation of the transcribed RNA population. Furthermore, longer transcripts show greater susceptibility to degradation and accumulate more fragmentation events under the same conditions\u003csup\u003e31, 32\u003c/sup\u003e. Because random fragmentation generates a larger number of tail-less fragments for longer transcripts, a smaller proportion of their fragments retain the 3\u0026prime; poly(A) tail required for capture in poly(A)\u0026thinsp;+\u0026thinsp;libraries. In contrast, rRNA-depletion protocols can detect any fragment regardless of tail status. 
This sampling asymmetry results in a length-dependent underestimation of long-gene expression and produces the characteristic 3\u0026prime; coverage bias observed in degraded poly(A)\u0026thinsp;+\u0026thinsp;datasets\u003csup\u003e31, 32\u003c/sup\u003e.\u003c/p\u003e\u003cp\u003eFrom a diagnostic perspective, these findings have major implications: improved coverage of long genes directly translates to enhanced detection of aberrant splicing and more reliable variant interpretation in diseases involving large transcripts, particularly those associated with \u003cem\u003eTTN\u003c/em\u003e, \u003cem\u003eNEB\u003c/em\u003e, and \u003cem\u003eOBSCN\u003c/em\u003e, which encode some of the longest mRNAs\u003csup\u003e18\u0026ndash;21\u003c/sup\u003e. Long sarcomeric genes are significant targets in genetic testing for muscular dystrophies and cardiomyopathies, yet their complex architecture and large transcript size often hinder reliable read coverage\u003csup\u003e22, 33\u003c/sup\u003e. Together, our analyses indicate that, although poly(A)\u0026thinsp;+\u0026thinsp;enrichment remains suitable for standard expression profiling, rRNA depletion is technically and functionally superior when comprehensive transcript coverage and splice-aware variant interpretation are required, particularly for studying long, clinically relevant transcripts.\u003c/p\u003e\u003cp\u003eDespite the clear advantages demonstrated for rRNA depletion, this study is limited by the use of short-read RNA-Seq data, which cannot resolve full-length transcript isoforms or complex splicing patterns with base-pair precision. Short-read approaches risk missing rare or novel isoforms, particularly in large transcripts, and are unable to reliably resolve loci containing long repetitive regions. 
Therefore, future work integrating long-read sequencing technologies, such as PacBio Iso-Seq or Oxford Nanopore, could complement our findings by enabling the detection of full-length and previously unannotated transcripts that may refine transcript models and isoform-level analyses\u003csup\u003e14, 15, 34\u0026ndash;36\u003c/sup\u003e. Notably, several poly(A)-independent protocols have recently been adapted for long-read platforms, including Nanopore-based workflows\u003csup\u003e37, 38\u003c/sup\u003e. Because long-read sequencing methods capture a much broader fraction of the transcriptome, it remains unclear whether the length-dependent differences observed between poly(A)\u0026thinsp;+\u0026thinsp;and poly(A)-independent libraries persist in long-read total-RNA datasets, and if so, to what extent.\u003c/p\u003e"},{"header":"Conclusion","content":"\u003cp\u003eOur data demonstrate that rRNA depleted RNA-Seq provides superior coverage, sensitivity, uniformity, transcript integrity and statistical confidence, enabling detection of splicing aberrations and enhancing variant interpretation. As RNA-Seq becomes increasingly central to molecular diagnostics, careful selection of library enrichment strategies is essential to maximize diagnostic yield and improve variant interpretation. This study represents the first direct benchmark of rRNA depletion versus poly(A)\u0026thinsp;+\u0026thinsp;enrichment methods in human patient-derived tissues for the detection and quantification of long transcripts and complex splicing events. 
Our results suggest rRNA depletion as the preferred method for transcriptome profiling in clinical contexts, particularly where large transcript coverage and splice variant validation are critical.\u003c/p\u003e"},{"header":"Materials and methods","content":"\u003cdiv id=\"Sec6\" class=\"Section2\"\u003e\u003ch2\u003eIn-house RNA sequencing data\u003c/h2\u003e\u003cp\u003eA different set of twenty-three patient-derived skeletal muscle (SM) samples was selected for RNA sequencing for each enrichment strategy: rRNA depletion and poly(A)\u0026thinsp;+\u0026thinsp;selection. Muscle tissue was homogenized in-house using SpeedMill PLUS (Analytik Jena AG, Germany). RNA was extracted with the Qiagen RNeasy Plus Universal Mini Kit (Qiagen, Hilden, Germany) according to the manufacturer\u0026rsquo;s instructions. Total RNA-Seq strand-specific libraries were prepared using the Illumina Ribo-Zero Plus rRNA Depletion Kit (Illumina, Palo Alto, CA, USA) at the Oxford Genomics Center, University of Oxford, Oxford, United Kingdom. Sequencing was performed on NovaSeq 6000 (Illumina), generating approximately 90\u0026nbsp;million paired-end reads per sample, with a total read length of 302 bp. For poly(A)\u0026thinsp;+\u0026thinsp;enrichment, the NEBNext Ultra II Directional RNA Library Prep kit (E7760) for Illumina (NEB, Beverly, MA, USA) was used to prepare strand-specific RNA-Seq libraries. Libraries were multiplexed and sequenced on HiSeq4000 (Illumina, CA, USA), and approximately 60\u0026nbsp;million paired-end reads were produced, also with a total read length of 302 bp.\u003c/p\u003e\u003c/div\u003e\n\u003ch3\u003ePublic RNA sequencing data\u003c/h3\u003e\n\u003cp\u003ePublicly available blood RNA-Seq data were obtained from the Sequence Read Archive (SRA) under accession number SRP127360. This dataset includes blood samples processed using both rRNA depletion and poly(A)\u0026thinsp;+\u0026thinsp;enrichment\u003csup\u003e6\u003c/sup\u003e. 
However, prior to the analysis, and in consultation with the data curator and maintainer, we updated the sample annotation to correct an identified discrepancy. The finalized annotations for both blood and skeletal muscle data are provided in the supplementary materials (Supplementary file 1 C \u0026amp; D).\u003c/p\u003e\u003cdiv id=\"Sec8\" class=\"Section2\"\u003e\u003ch2\u003eQuality control and read alignment\u003c/h2\u003e\u003cp\u003eRaw sequencing reads were subjected to quality control using FastQC\u003csup\u003e39\u003c/sup\u003e to assess base quality scores, GC content, and adapter contamination. All samples exhibited high Phred quality scores across read lengths and were considered for further analysis. Reads were aligned to the human reference genome GRCh38.p13 using STAR v2.7.0a\u003csup\u003e40\u003c/sup\u003e following the two-pass mapping pipeline. The STAR genome index was generated from the Gencode v39 annotation, comprising 61,533 isoforms.\u003c/p\u003e\u003c/div\u003e\n\u003ch3\u003eRead Quantification\u003c/h3\u003e\n\u003cp\u003eTranscript-level quantification was obtained with Salmon\u003csup\u003e41\u003c/sup\u003e (Supplementary files 2, 3, 4). The resulting transcripts per million (TPM) values were then summed to obtain gene-level counts. Genes with TPM\u0026thinsp;\u0026gt;\u0026thinsp;1 were used for analysis, ensuring that only sufficiently expressed transcripts were included in the gene body coverage assessment. This filtering was performed separately for each tissue type to reflect tissue-specific expression profiles. To enable accurate comparisons between library enrichment approaches, these gene-level counts were converted to counts per million (CPM) and normalized for sequencing library size. 
For each gene, the log₂ ratio of the average expression obtained with poly(A)\u0026thinsp;+\u0026thinsp;RNA-Seq to the average expression obtained with rRNA-depleted RNA-Seq was calculated.\u003c/p\u003e\u003cp\u003eGene length was defined as the transcription length (TL), calculated by summing the lengths of all annotated exons across all transcripts corresponding to each gene in the Gencode v39 annotation (Supplementary file 5).\u003c/p\u003e\n\u003ch3\u003eRSeQC analysis and uniformity\u003c/h3\u003e\n\u003cp\u003eTo assess gene body coverage (GBC), the geneBody_coverage.py tool from the RSeQC package\u003csup\u003e42\u003c/sup\u003e was utilized. GBC analysis was performed on the mapped BAM files, restricted to the genes selected based on tissue-specific expression profiles. This tool divides each transcript into 100 equally sized bins from the 5' to the 3' end and calculates read coverage within each bin, enabling the evaluation of coverage uniformity across transcripts (Supplementary file 6). The coefficient of variation (CV) was computed over the 100 bins along the length of each gene and log scaled\u003csup\u003e42\u003c/sup\u003e. A low CV value indicates a more uniform read distribution, whereas a higher CV indicates a less uniform read distribution. To further evaluate the 5' and 3' end coverage biases, raw read coverage for the first and last 20% of each transcript was extracted from the bin read coverage values. Their 5'-end to 3'-end ratio was calculated for specific transcripts; values closer to zero indicate more balanced coverage across the transcript body. 
For plotting the transcript body coverage profile of each gene, the raw transcript coverage values were normalized to the sum of the values within each sample.\u003c/p\u003e\u003cdiv id=\"Sec11\" class=\"Section2\"\u003e\u003ch2\u003eDROP pipeline to detect aberrant splicing effects\u003c/h2\u003e\u003cp\u003eFour SM samples with confirmed diagnosis were processed using both rRNA depletion and poly(A)\u0026thinsp;+\u0026thinsp;enrichment methods. The aberrant splicing module (version 1.4.0) in DROP\u003csup\u003e26\u003c/sup\u003e was used to detect pathogenic variants and aberrant splicing. Because the recommended cohort size for statistical significance is 30 samples, we ran DROP for these four SM samples as part of larger cohorts sharing the same technical aspects of library preparation and sequencing facility (Supplementary file 7). The rRNA depleted samples were part of a 53-sample cohort, and the poly(A)\u0026thinsp;+\u0026thinsp;enriched samples were part of a 96-sample cohort. We evaluated whether the predicted splicing events were captured by the aberrant splicing module using the default settings. When interpreting the results, we checked events significant either by their original adjusted p-value or by a Bonferroni-corrected p-value calculated only across myogenes.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec12\" class=\"Section2\"\u003e\u003ch2\u003eThe use of generative AI and AI-assisted technologies in the writing process\u003c/h2\u003e\u003cp\u003eFor preparation of this work, the authors used ChatGPT to correct the grammar and proofread the text. After applying ChatGPT, the authors reviewed and further modified the text. 
The authors take full responsibility for the content in this publication.\u003c/p\u003e\u003c/div\u003e"},{"header":"Declarations","content":"\u003cp\u003e\u003cstrong\u003eEthics approval and consent to participate\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThis study falls under the ethical approval HUS/16896/2022 by the ethics committee of the Hospital District of Helsinki and Uusimaa (HUS) and was performed in accordance with the Declaration of Helsinki.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eConsent for publication:\u0026nbsp;\u003c/strong\u003eNot applicable\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eCompeting interests:\u0026nbsp;\u003c/strong\u003eNo conflict of interest to declare\u0026nbsp;\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAvailability of data and materials:\u0026nbsp;\u003c/strong\u003eRNA sequencing data for human blood samples were obtained from SRA (SRP127360). RNA sequencing data from human skeletal muscle biopsies are protected under GDPR principles. All code used in this study was taken from the sample scripts provided with each tool package.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eFunding\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThis study is funded by the European Commission under the CoMPaSS-NMD, funded by HORIZON-HLTH-2022-TOOL-12-two-stage (GA n°101080874 to MS), the Research Council of Finland (#339437, #346209, #361979 to MS), Samfundet Folkhälsan (to MS and BU), the Sigrid Juselius Foundation (#230217 to MS and BU), European Joint Programme on Rare Diseases (‘Improved diagnostic output in large sarcomeric genes IDOLS-G’ to BU), and Magnus Ehrnrooth foundation. Open access was funded by Helsinki University Library.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAuthors' contributions\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eSNG, PH, MS, and AO conceptualized the study. SNG and AO curated the data. SNG, VL, and AO performed the formal analysis. BU, PH, MS, and AO acquired funding. 
SNG, VL, and AO carried out the investigation. BU, PH, MS, and AO provided supervision. SNG and AO wrote the original draft, and all authors reviewed and edited the manuscript. MS and AO contributed equally as shared last authors.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAcknowledgements\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eWe would like to thank the IT Center for Science in Finland (CSC) and the IT Center of the University of Helsinki for providing us with the required computing resources throughout this project. \u0026nbsp;\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\n \u003cli\u003ePeymani F, Farzeen A, Prokisch H (2022) RNA sequencing role and application in clinical diagnostic. Pediatric Investigation 6:29\u0026ndash;35\u003c/li\u003e\n \u003cli\u003eGeraci F, Saha I, Bianchini M (2020) Editorial: RNA-Seq Analysis: Methods, Applications and Challenges. Front Genet 11:220\u003c/li\u003e\n \u003cli\u003eStokes T, Cen HH, Kapranov P, et al (2023) Transcriptomics for Clinical and Experimental Biology Research: Hang on a Seq. Advanced Genetics 4:2200024\u003c/li\u003e\n \u003cli\u003eAn W, Yan Y, Ye K (2024) High resolution landscape of ribosomal RNA processing and surveillance. Nucleic Acids Research 52:10630\u0026ndash;10644\u003c/li\u003e\n \u003cli\u003eVenema J, Tollervey D (1999) Ribosome Synthesis in \u003cem\u003eSaccharomyces cerevisiae\u003c/em\u003e. Annu Rev Genet 33:261\u0026ndash;311\u003c/li\u003e\n \u003cli\u003eZhao S, Zhang Y, Gamini R, Zhang B, Von Schack D (2018) Evaluation of two main RNA-seq approaches for gene quantification in clinical RNA sequencing: polyA+ selection versus rRNA depletion. Sci Rep 8:4781\u003c/li\u003e\n \u003cli\u003eCui P, Lin Q, Ding F, et al (2010) A comparison between ribo-minus RNA-sequencing and polyA-selected RNA-sequencing. 
Genomics 96:259\u0026ndash;265\u003c/li\u003e\n \u003cli\u003eBarrett A, McWhirter R, Taylor SR, Weinreb A, Miller DM, Hammarlund M (2021) A head-to-head comparison of ribodepletion and polyA selection approaches for \u003cem\u003eCaenorhabditis elegans\u003c/em\u003e low input RNA-sequencing libraries. G3 Genes|Genomes|Genetics 11:jkab121\u003c/li\u003e\n \u003cli\u003eDing X, Zhang S, Li X, et al (2018) Profiling expression of coding genes, long noncoding RNA , and circular RNA in lung adenocarcinoma by ribosomal RNA ‐depleted RNA sequencing. FEBS Open Bio 8:544\u0026ndash;555\u003c/li\u003e\n \u003cli\u003eViscardi MJ, Arribere JA (2022) Poly(a) selection introduces bias and undue noise in direct RNA-sequencing. BMC Genomics 23:530\u003c/li\u003e\n \u003cli\u003eTreangen TJ, Salzberg SL (2012) Repetitive DNA and next-generation sequencing: computational challenges and solutions. Nat Rev Genet 13:36\u0026ndash;46\u003c/li\u003e\n \u003cli\u003eSavarese M, Jonson PH, Huovinen S, Paulin L, Auvinen P, Udd B, Hackman P (2018) The complexity of titin splicing pattern in human adult skeletal muscles. Skelet Muscle 8:11\u003c/li\u003e\n \u003cli\u003eLopes I, Altab G, Raina P, De Magalh\u0026atilde;es JP (2021) Gene Size Matters: An Analysis of Gene Length in the Human Genome. Front Genet 12:559998\u003c/li\u003e\n \u003cli\u003eUapinyoying P, Goecks J, Knoblach SM, Panchapakesan K, Bonnemann CG, Partridge TA, Jaiswal JK, Hoffman EP (2020) A long-read RNA-seq approach to identify novel transcripts of very large genes. Genome Res 30:885\u0026ndash;897\u003c/li\u003e\n \u003cli\u003eWang Y, Zhao Y, Bollas A, Wang Y, Au KF (2021) Nanopore sequencing technology, bioinformatics and applications. Nat Biotechnol 39:1348\u0026ndash;1365\u003c/li\u003e\n \u003cli\u003eBrouillette M (2024) Gene length could be a critical factor in the aging of the genome. 
Proc Natl Acad Sci U S A 121:e2416630121\u003c/li\u003e\n \u003cli\u003eSoheili-Nezhad S, Ib\u0026aacute;\u0026ntilde;ez-Sol\u0026eacute; O, Izeta A, Hoeijmakers JHJ, Stoeger T (2024) Time is ticking faster for long genes in aging. Trends in Genetics 40:299\u0026ndash;312\u003c/li\u003e\n \u003cli\u003eBang ML, Centner T, Fornoff F, et al (2001) The complete gene sequence of titin, expression of an unusual approximately 700-kDa titin isoform, and its interaction with obscurin identify a novel Z-line to I-band linking system. Circ Res 89:1065\u0026ndash;1072\u003c/li\u003e\n \u003cli\u003eSavarese M, Maggi L, Vihola A, et al (2018) Interpreting Genetic Variants in Titin in Patients With Muscle Disorders. JAMA Neurol 75:557\u003c/li\u003e\n \u003cli\u003eLawlor MW, Ottenheijm CA, Lehtokari V-L, Cho K, Pelin K, Wallgren-Pettersson C, Granzier H, Beggs AH (2011) Novel mutations in NEB cause abnormal nebulin expression and markedly impaired muscle force generation in severe nemaline myopathy. Skeletal Muscle 1:23\u003c/li\u003e\n \u003cli\u003eOghabian A, Jonson PH, Gayathri SN, et al (2025) OBSCN undergoes extensive alternative splicing during human cardiac and skeletal muscle development. Skeletal Muscle 15:5\u003c/li\u003e\n \u003cli\u003eHong SE, Kneissl J, Cho A, et al (2022) Transcriptome-based variant calling and aberrant mRNA discovery enhance diagnostic efficiency for neuromuscular diseases. J Med Genet 59:1075\u0026ndash;1081\u003c/li\u003e\n \u003cli\u003ePan Y, Nallamilli BRR, Liu R, et al (2025) Unveiling non-coding \u003cem\u003eDMD\u003c/em\u003e variants: synergising RNA sequencing and DNA sequencing for enhanced molecular diagnosis. J Med Genet 62:97\u0026ndash;106\u003c/li\u003e\n \u003cli\u003eNielsen AF, Bindereif A, Bozzoni I, et al (2022) Best practice standards for circular RNA research. 
Nat Methods 19:1208\u0026ndash;1220\u003c/li\u003e\n \u003cli\u003eRobinson JT, Thorvaldsd\u0026oacute;ttir H, Wenger AM, Zehir A, Mesirov JP (2017) Variant Review with the Integrative Genomics Viewer. Cancer Research 77:e31\u0026ndash;e34\u003c/li\u003e\n \u003cli\u003eY\u0026eacute;pez VA, Mertes C, M\u0026uuml;ller MF, et al (2021) Detection of aberrant gene expression events in RNA sequencing data. Nat Protoc 16:1276\u0026ndash;1296\u003c/li\u003e\n \u003cli\u003eMertes C, Scheller IF, Y\u0026eacute;pez VA, \u0026Ccedil;elik MH, Liang Y, Kremer LS, Gusic M, Prokisch H, Gagneur J (2021) Detection of aberrant splicing events in RNA-seq data using FRASER. Nat Commun 12:529\u003c/li\u003e\n \u003cli\u003eBrechtmann F, Mertes C, Matusevičiūtė A, Y\u0026eacute;pez VA, Avsec Ž, Herzog M, Bader DM, Prokisch H, Gagneur J (2018) OUTRIDER: A Statistical Method for Detecting Aberrantly Expressed Genes in RNA Sequencing Data. The American Journal of Human Genetics 103:907\u0026ndash;917\u003c/li\u003e\n \u003cli\u003eMF F, A O, E N, et al (2024) Inferring disease course from differential exon usage in the wide titinopathy spectrum. Ann Clin Transl Neurol. https://doi.org/10.1002/acn3.52189.\u003c/li\u003e\n \u003cli\u003eByron SA, Van Keuren-Jensen KR, Engelthaler DM, Carpten JD, Craig DW (2016) Translating RNA sequencing into clinical diagnostics: opportunities and challenges. Nat Rev Genet 17:257\u0026ndash;271\u003c/li\u003e\n \u003cli\u003eFeng H, Zhang X, Zhang C (2015) mRIN for direct assessment of genome-wide and gene-specific mRNA integrity from large-scale RNA-sequencing data. Nat Commun 6:7816\u003c/li\u003e\n \u003cli\u003eWang L, Nie J, Sicotte H, et al (2016) Measure transcript integrity using RNA-seq data. BMC Bioinformatics 17:58\u003c/li\u003e\n \u003cli\u003eGonorazky H, Liang M, Cummings B, et al (2016) RNA seq analysis for the diagnosis of muscular dystrophy. 


