{"paper_id":"19c9ddb7-73c8-4684-8622-bb101b2c4c87","body_text":"Tech Note: Simplified protocol for \nSMARTer Pico kit \nIs total RNA fragmentation needed? Is rRNA depletion affected by fragmentation \ntime? Or affected by input? Can we get reliable data when starting with low input? \nby Orlando Contreras-López, Laia Masvidal, Jun Wang and Elísabet Einarsdóttir\n \nAbstract \nThe SMARTer® Stranded Total RNA-\nSeq Kit v2 – Pico Input Mammalian kit \nfrom Takara® (SMARTer Pico) has \nproved successful and reliable in \ngenerating stranded RNA Illumina \nlibraries from degraded total RNA and \nultra-low input amounts of total RNA \nbelow detection level. Here we \nattempted to streamline and simplify \nthe library prep protocol at the key \nfragmentation step bottleneck. Our key \nfindings were that reduced \nfragmentation times neither affect the \ndepletion efficiency, nor the library \ncomplexity. Skipping the \nfragmentation resulted in longer \nlibraries when examined in the \ncapillary electrophoresis  but this was \ncompensated for during sequencing, \nas long fragments are less likely to \nform clusters during sequencing. \nSkipping the fragmentation also \naffected the gene body coverage, \nwhere a bias to the 5´ end was \nobserved though this compromised \nneither the data quality, complexity nor \nreproducibility. Additionally, using 16 \nPCR cycles seems to have little effect \non the library complexity. Overall, we \ncan see that sample input is the key to \nlibrary complexity and reproducibility, \nwhile fragmentation time has less \nimpact on data. \n \nIntroduction \nThe SMARTer® Stranded Total RNA-\nSeq Kit v2 – Pico Input Mammalian kit \nfrom Takara® (SMARTer Pico) has \nproved successful and reliable in \ngenerating stranded RNA Illumina \nlibraries from degraded total RNA and \nultra-low input amounts of total RNA \nbelow detection level. 
This kit uses \nrandom priming cDNA synthesis of \nboth coding and non-coding RNA, \nwhich is ideal for processing degraded \ntotal RNA input, such as that \nobtained from formalin-fixed, \nparaffin-embedded (FFPE) samples. \nMoreover, the kit incorporates a \ntechnology that enables the removal of \nmammalian ribosomal cDNA without \nloss of other cDNA molecules \noriginating from non-coding or coding \nRNAs. \nThe increased demand for low-input \nprotocols such as the SMARTer Pico \nmotivated us to look at ways to \noptimize the processing of samples. \nTo accomplish this, we \ninvestigated ways to streamline and \nsimplify the SMARTer Pico library prep \nprotocol (see Figure 1). The original \nprotocol contains a fragmentation \nstep, just before the first-strand \nsynthesis, that has to be customized \ndepending on the quality (RIN \nvalue or DV200) of the total RNA \nsample. If a sample is of good quality \n(RIN > 7), the recommended \nfragmentation time is 4 min, whereas if \nthe quality is low (RIN < 3), it is \nrecommended to skip the \nfragmentation. In a given project, our \nusers might have samples with a \nrange of RIN values, or it might be \nimpossible to determine the RIN value \ndue to low sample concentration. \nTherefore, the fragmentation step \npresents a bottleneck in the library \nprep, since samples within a project \nmay require different treatment. \n\nbioRxiv preprint doi: https://doi.org/10.1101/2025.09.02.673416; this version posted September 5, 2025. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license. \n\nTo investigate the effect of \nfragmentation time, we prepared \nlibraries using aliquots of a good-quality \ntotal RNA sample with different \nfragmentation times. 
In addition to the \nfragmentation time, we also varied \nthe input amount (1 ng and \n10 ng), all at a fixed number of PCR \ncycles for PCR2 (Figure 1). Data from a \nuser project as well as data from \nTakara were also used for the analysis. \nThe sequencing data allowed us to \ncompare the different conditions, \nindicating that the fragmentation \ntime is not a major determinant of the quality \nof the generated data. \n \nFigure 1. Schematic of the SMARTer Stranded \nTotal RNA-Seq Kit v2 – Pico Input Mammalian \nprotocol. \nRandom priming allows the generation of cDNA \nfrom all RNA fragments in the sample, including \nrRNA and degraded mRNA. The reverse \ntranscriptase adds additional nucleotides when it \nreaches the 5′ end. Those extra nucleotides are \nused for the synthesis of the second strand of the \ncDNA. The next step is a short round of PCR \namplification, which adds full-length Illumina \nadapters, including barcodes. The cDNA originating \nfrom rRNA is then cleaved by the ZapR enzyme in the \npresence of mammalian-specific R-Probes. This \nprocess leaves the library with only fragments from \nnon-rRNA molecules. These fragments are enriched \nvia a second round of PCR amplification. Image \ntaken from Takara®. \n \nTable 1. Library QC. \nConcentration measurements and average size \ndetermination for each library prepared in house \nstarting from Invitrogen™ AM7852 total RNA. The QC \npass threshold is a concentration > 2 nM. All libraries \npassed QC. Library yield correlates with sample \ninput. \n \nMethodology \nThree different datasets were used to investigate the effects of fragmentation and \ninput amount on the library preparation efficiency and data quality. \nDataset 1 was generated using Universal Human Reference RNA 2×200 µg from AH \ndiagnostics (cat# 740000). This total RNA was diluted to 1.25 ng/µl and 0.125 ng/µl \nfor the 10 ng and 1 ng input, respectively. The total RNA samples were run on the \nFragment Analyzer HS RNA, where the 0.125 ng/µl samples were undetectable. The \n1.25 ng/µl samples had a RIN of 9, confirming the high quality of the samples. Library \npreparation was carried out in duplicate, using 1 ng or 10 ng as input. The samples \nwere subjected to three fragmentation times: 0, 2, and 4 min (see Table 1). The \nnumber of PCR cycles for PCR2 (Figure 1) was 16. As expected, the library \npreparation yields are proportional to the input. Also, the average sizes are larger \nwith shorter fragmentation times (see Table 1 and Figure 2). Samples were \nsequenced on an Illumina NovaSeq 6000 S4 flowcell, with an average of 200M \n2×150 bp reads per sample. \nDataset 2 was generated using a human total RNA sample provided by an NGI user. \nThree different input amounts were used: 3, 10, and 30 ng. The RIN value of these samples, determined \nwith the Fragment Analyzer HS RNA, was >9.4. The samples were \nfragmented for 4 min and 11 cycles were performed for PCR2. Samples were \nsequenced on an Illumina NovaSeq 6000 S4, 2×150 bp reads, with an average of \n200M reads per sample. Samples were named Unsub (Figures 7-9). \nDataset 3 was obtained from Takara’s TechNote on SMARTer Pico. 
Here, Takara \nused human lung FFPE total RNA without fragmentation, with 16 PCR cycles for \nPCR2. The dataset contained the merged data from two library preps for each input, with \n18M and 54.6M reads for the 1 ng and 10 ng input merged data, respectively. \nAll datasets were analyzed using the nf-core/rnaseq pipeline. The same software \nversions and options were used for all the analyses, and MultiQC was used to extract \ndata for all the figures. \n\nResults \nSample input, not fragmentation time, \ndetermines library similarity \nTo determine the effect of the \nfragmentation time and/or sample \ninput on the data, the libraries \nfrom Dataset 1 were clustered. The \ndendrogram was generated from the \ngene counts normalized with \nedgeR: Euclidean distances between \nlog2-normalized CPM values were \ncalculated and clustered. The \nclustering result shown in Figure \n3 indicates that libraries from samples \nwith 10 ng input are more similar to \neach other than to those with 1 ng. \nFurthermore, a fragmentation time of \n2 or 4 min has little impact on the \nlibrary data when 10 ng input is used, \nas those samples fall within the same \nclade. Lower input leads to lower \nreproducibility, regardless of the \nfragmentation time. \nWe assessed the percentage of reads \naligned to the reference genome, as \nwell as the percentage of reads \nassigned to features in the genome \n(Figure 4). We found no large \ndifferences among the datasets, \nsuggesting that regardless of the input \nand fragmentation time, the proportion \nof useful reads remains the same. \n \nFigure 2. 
Bioanalyzer electropherograms of the final \nlibraries. \nThe Bioanalyzer High Sensitivity DNA Assay was used to \nassess the size distribution of the final libraries \nfrom Dataset 1. Samples without fragmentation \nshow a higher proportion of fragments above 1000 \nbp. \n \nFigure 3. Dendrogram showing sample clustering. \nEuclidean distances between log2-normalized CPM \nvalues were calculated and clustered. Libraries \nfrom Dataset 1 are grouped by input regardless of \nthe fragmentation time used for library prep. \n\nLibrary insert size is affected by \nfragmentation time \nTo test whether longer fragmentation times \nled to shorter library inserts, we \nestimated the insert size distribution \nbased on the data from the inner \ndistance module within the MultiQC \nreport. The distance reported is the \nmRNA length between two paired \nreads. To this value, the length of the \npaired reads was added to get the \ninsert size. Figure 5 shows the \ninsert size distribution for the \nlibraries from Dataset 1. The samples \nsubjected to longer fragmentation \ntimes (red and green) have shorter \ninserts, on average. However, this difference is \nsmaller than the difference in average \nlibrary size obtained from the \ncapillary electrophoresis (see Table 1); \nthe difference there was as large as \n500 bp, compared to 80 bp in the \nsequencing data. This could be \nexplained by the difference in \nclustering efficiency between short \nand long fragments, leading to an \nunderrepresentation of the very long \nfragments in the sequencing data. 
\nOverall, the data show that longer \nfragmentation times do reduce the \nsize of the library inserts. \n \nFigure 4. Scatter plot showing the percentage \nof reads assigned to a genome feature and the \npercentage of reads aligned to the genome. \nData were obtained from the MultiQC report. \nThe percentage of assigned reads was obtained \nfrom Subread featureCounts, which counts mapped \nreads for genomic features such as genes, exons, \npromoters, gene bodies, genomic bins, and \nchromosomal locations. The percentage of aligned \nreads was determined with STAR, an \nultrafast universal RNA-seq aligner. \n \nFigure 5. Insert size distribution. \nInsert size was estimated using the data obtained \nfrom the Inner Distance module from RSeQC in the \nMultiQC report. The average insert size is shown for \neach library in its respective color. Shorter \nfragmentation times lead to longer libraries. \n\nLibrary complexity depends on sample \ninput and is not affected by \nfragmentation time \nLibrary complexity is a measure of the \nnumber of unique molecules in the \nlibrary. For any application, the goal is \nto make the libraries as complex as \npossible. This parameter is helpful in \ndetermining where sensitivity loss can \noccur (detection of lowly expressed \ngenes) or where redundant reads are \ncreated (duplicates). It is therefore \nimportant to determine whether the \nfragmentation time affects library \ncomplexity and, if so, to what degree. \nLibrary complexity was estimated \nusing the preseq module from the \nMultiQC report. 
Figure 6 shows the \nrarefaction curves for Dataset 1 (blue, \nred and green), Dataset 2 from the user \nsample (orange), and Dataset 3 from \nTakara (grey). Library complexity depends \nmore on the sample input than \non the fragmentation time. \nrRNA depletion efficiency is \nindependent of sample input and \nfragmentation time \nOne of the key features of the \nSMARTer Pico kit is the depletion of \nrRNA in mammalian samples. The \nremoval of the rRNA depends on \nprobes that target these sequences, \nwhich are then enzymatically cleaved \n(Figure 1). According to the \nmanufacturer, excessive \nfragmentation can lead to an \ninefficient depletion process, but we \ndid not expect a difference in the \ndepletion efficiency when \nfragmentation was skipped on good-quality \nsamples from Dataset \n1 or Dataset 2. \n \nFigure 6. Library Complexity Curve. \nComplexity was estimated using the data obtained \nfrom the Complexity section of the MultiQC report. Libraries \ngroup by input regardless of the fragmentation time \nused for library prep. The purple line shows the \ncurve for an ideal library where each molecule is \nunique. \n \nThe depletion efficiency was \nestimated from the percentage of \nreads overlapping the genomic region \nspanned by the rRNA feature. The \nhigher the percentage of reads \nmapped to rRNA, the lower the \ndepletion efficiency. \nFigure 7 shows that neither the \nfragmentation time nor the sample \ninput affects the depletion efficiency. \nOn the other hand, sample quality \nseems to be crucial, since the libraries \nfrom Dataset 3, generated from FFPE \nsamples, show lower depletion \nefficiency than the high-quality \nsamples from Dataset 1 or Dataset 2. \n\nMitochondrial rRNA depletion \nefficiency is affected neither by \nsample input nor by fragmentation time \nAnother feature of the SMARTer Pico \nkit is the depletion of human \nmitochondrial rRNA (Mt rRNA). The \nremoval of the Mt rRNA also \ndepends on probes that target these \nsequences, which are then enzymatically \ncleaved (Figure 1). Similar to the \nanalysis above, a higher percentage of \nreads mapping to Mt rRNA indicates \nlower depletion efficiency. Figure \n8 shows that the depletion of the Mt \nrRNA is sample dependent (tissue or \ncell type) and is not clearly \naffected by either the fragmentation time or \nthe sample input. \nGene body coverage is biased towards \nthe 5′ end when fragmentation is not \nperformed \nGene Body Coverage calculates read \ncoverage over gene bodies. This is \nused to check whether read coverage is \nuniform and whether there is any 5′ or 3′ \nbias. Figure 9 shows that samples \nfrom all datasets have very similar \ngene body coverage, except for the \nhigh-quality samples that were not \nsubjected to fragmentation (blue). \nThese samples show a bias towards the 5′ \nend of the gene body but have similar \ncoverage over the rest of the gene body \nwhen compared to the other \nsamples. Other inputs or \nfragmentation times do not change the \ngene body coverage. \n \nFigure 7. Percentage of reads mapping to \nribosomal RNA. \nBiotype Counts shows reads overlapping genomic \nfeatures of different biotypes, counted \nby featureCounts. \n \nFigure 8. Percentage of reads mapping to \nmitochondrial ribosomal RNA. \nBiotype Counts shows reads overlapping Mt_rRNA, \ncounted by featureCounts in the MultiQC report. 
\n\nConclusion \nIn this tech note, we aimed to \ndetermine the effects of different \nfragmentation times and input \namounts on the data produced by the \nSMARTer Pico prep. Although we lack \nstatistical power, this approach can \nguide us to the following conclusions: \nShorter-than-recommended \nfragmentation times affect neither the \ndepletion efficiency (Figures 7 and 8) \nnor the library complexity (Figure 6). \nSkipping the fragmentation resulted in \nlonger libraries when examined in the \ncapillary electrophoresis (Figure 2), but \nthis is compensated for during \nsequencing, as long fragments are \nless likely to form clusters \n(Figure 5). Skipping the \nfragmentation also affected the gene \nbody coverage, where a bias towards the 5′ \nend was observed (Figure 9), but this \ncompromised neither the data quality \n(Figure 4), the complexity (Figure 6) nor \nthe reproducibility (Figure 3). \nAdditionally, using 16 PCR cycles \nseems to have little effect on the \nlibrary complexity (Figure 6). Overall, \nwe can see that sample input is the \nkey to library complexity and \nreproducibility, while fragmentation \ntime has less impact on the data. \n \nFigure 9. Gene body coverage. \nGene body coverage was obtained from the RSeQC module \nof the same name in the MultiQC report.","source_license":"CC-BY-4.0","license_restricted":false}