Decoding the RNA binding systems by UltraGen

doi:10.21203/rs.3.rs-4461517/v2

2025 · doi:10.21203/rs.3.rs-4461517/v2

preprint OA: closed CC-BY-4.0

Yaqing Zhang, Hui Wang, Zhaoming Chen, Wenjun Lin, Yuan Jiang, and 7 more

This is a preprint; it has not been peer reviewed by a journal. This work is licensed under a CC BY 4.0 License. Status: Under Review. Version 2.

Abstract

RNA plays multifaceted roles in catalytic reactions and gene regulation. The sequence-encoded binding language across diverse RNA-target interactomes is high-dimensional and complex. Here, we introduce UltraGen, an RNA language model designed to capture RNA binding properties. Utilizing fine-grained self-learning, UltraGen identifies RNA aptamers for a wide range of target sizes, including small molecules, proteins, cells, and tissues.
Additionally, UltraGen discerns tissue specificity for millions of RNA species across 22 human organs based on their 3’-UTR sequences, predicts the tropism of human-pathogenic RNA viruses, and characterizes SARS-CoV-2 replicase RNA binding at single-base resolution.

Biological sciences/Computational biology and bioinformatics/Machine learning; Biological sciences/Biochemistry/RNA; Biological sciences/Biological techniques/Sequencing/RNA sequencing

Introduction

RNA is central to biology, evolution, and biotechnology, interacting crucially with small molecules, proteins, and other ligands. Its inherent sequence-specific interactions, spanning a wide range of affinities and specificities, mediate essential cellular processes in all life, with even weak interactions often playing regulatory roles [1, 2]. Comprehensively profiling RNA intermolecular interactions is critical for decoding RNA binding systems. High-throughput sequencing (HTS) has been applied to characterize diverse RNA-target interactions [3]. In vitro SELEX (Systematic Evolution of Ligands by EXponential Enrichment) iteratively enriches high-affinity RNA binders [4–13], whereas in vivo CLIP (cross-linking and immunoprecipitation) records native interactions [14, 15]. Compared to the multivalent interactions of endogenous RNAs, in vitro enriched RNAs represent more target-specific properties. In the last decade, deep learning methods have been developed to refine RNA-target interactions from SELEX datasets [16–21]. However, the inherent bias of SELEX toward discarding weak binders limits the generalizability of self-supervised RNA models. To address this gap, the recently introduced UltraSelex captures a series of washing eluates that retain weak (or transient) binders typically discarded during SELEX washing steps, providing a broader landscape of RNA interactions with varying strengths [22].
Here, we present UltraGen, an RNA language model pre-trained on 10 million UltraSelex RNA species. UltraGen decodes RNA binding characteristics by embedding sequences in a latent space that reflects quantitative binding potential. It outperforms existing state-of-the-art models across twelve diverse RNA binding targets, including small molecules, proteins, cells, and tissues, with precision up to 75%. When fine-tuned on the 3’-untranslated region (3’-UTR) profiles from 22 human tissues, UltraGen exhibited a three-fold increase in precision and a ten-fold increase in recall compared to the best benchmark model, highlighting its potential in identifying RNA virus tropism, including for pathogenic RNA viruses such as dengue and measles. Analysis of RNA-binding signatures for the SARS-CoV-2 replicase indicated that a conserved CUUG loop preferentially associates with an adjacent base pair such as A-U or unpaired C-U. Expanding UltraGen's pre-training dataset further enhanced its robustness and applicability to RNA binding systems. By analyzing both in vitro and in vivo datasets, UltraGen provides a robust framework for understanding RNA molecular interactions.

Results

Pre-training UltraGen on RNA binding systems

RNA binding often involves the electrostatic properties of negatively charged RNA interacting with positively charged ligands [23, 24]. For instance, proteins interact with RNA primarily through positively charged amino acids like arginine and lysine [25]. Given the fewer geometric constraints in RNA interactions with small molecules, we initiated our study using HTS RNA datasets obtained through UltraSelex on the SiR target [22] (Extended Data Fig. 1a,b). To unravel the underlying RNA binding language, we applied self-supervised learning with UltraGen’s pre-training kernel, which is designed to reconstruct nucleobases and motifs, aiming to identify chemical and conformational properties that govern RNA-target interactions (Fig. 1a).
Additionally, Rotary Position Embedding [26] was employed to better capture long-distance dependencies in the sequence context. The top-ranked ten million RNA sequences were divided into training and test sets, ensuring less than 90% sequence identity between them (see details in Methods). Compared to the pre-training process on the UltraSelex dataset targeting nsp12, the SiR dataset exhibited lower learning loss, likely due to its continuous distribution of binding scores, providing a broader range of binding potentials (Extended Data Fig. 1a-d). Next, we assessed UltraGen’s learning capacity using datasets of various sizes, from one million top-ranked sequences to the full dataset containing over 40 million sequences (Extended Data Fig. 1e). The optimal model, pre-trained on the top ten million sequences of the UltraSelex SiR dataset, was used in all subsequent experiments. Using UMAP dimensionality reduction, UltraGen revealed distinct RNA clustering correlated with indicative binding scores, outperforming both a randomly initialized model and the RNA-FM model [27], which employs a similar transformer framework (Fig. 1b and Extended Data Fig. 1f). These results suggest that reconstructing RNA from a continuously evolving binding context, rather than the model framework, accounts for the effective acquisition of RNA binding intricacies.

High-performance multi-classification by UltraGen

We then assessed UltraGen’s ability to rank SiR RNA binders across the entire UltraSelex dataset, focusing on its multi-classification capability (Supplementary Table 1). During fine-tuning, RNA sequences were labeled by their binding potential, with each class split into training, validation, and test sets at a ratio of 6:1:3. Despite class imbalance spanning several orders of magnitude, UltraGen achieved 78% precision in identifying RNA species with the top 0.01% SiR-binding ability on the held-out test sets (Extended Data Fig. 1g), a substantial improvement over the randomly initialized model (5.6%). Additionally, UltraGen demonstrated 63% precision on the dataset generated through conventional SELEX for the same target [28] (Extended Data Fig. 1h). To ensure a fair comparison and mitigate potential sequence reuse during pre-training, the 0.83% (120,759) of RNA species previously used in unsupervised learning were excluded, even if their binding labels differed (Extended Data Fig. 1i). UltraGen outperformed other neural network-coupled models, including RaptRanker [17], which relies on local sequence and secondary structure-based RNA features, DeepBind [16], which uses a convolutional neural network framework, and SA-Net [29], which employs k-mer embedding. Specifically, UltraGen achieved a ~10-fold improvement in ranking precision for identifying top binders from the SiR SELEX dataset, which was not used in pre-training (Fig. 1c). While RNABERT [30] and RNA-FM, both pre-trained transformer models, showed improvements, their precision reached at most 60% of UltraGen’s. High precision is essential for downstream validation, especially in resource-constrained scenarios. Furthermore, UltraGen also demonstrated advantageous precision and F1 scores across the non-top classes (Extended Data Fig. 1i), implying its potential to capture weak RNA binders. These findings underscore UltraGen’s advanced capability in efficiently ranking RNA binders within a comprehensive RNA binding context.

Identifying RNA aptamers targeted to small molecules, proteins, cells and tissues

While RNAs interacting with small molecules exhibit diverse binding motifs with broad structural preferences [22, 28], RNA-protein interactions are largely defined by k-mer motifs [3]. To determine whether k-mer motifs primarily determine RNA binding patterns, we analyzed their distribution across targets of varying sizes.
Hexamer comparisons between enriched and non-enriched SELEX RNAs showed smaller motif differences for larger targets, even with added secondary structure information (Fig. 2a and Extended Data Fig. 2a), implying that small-ligand binding contains more sequence-based features, whereas RNA-protein interactions are influenced by additional geometric constraints that refine motif specificity. We then fine-tuned UltraGen using a cohort of published SELEX datasets against 12 distinct targets, ranging from small molecules [4–6] and proteins [7–9] to (multi)cellular targets [10–13]. These datasets varied in RNA length, sequence randomness, library architecture, and selection strategies (Supplementary Tables 2–5). RNA categories within each SELEX dataset were partitioned into training, validation, and test sets at a ratio of 6:1:3. In the analysis of small-molecule datasets, RNABERT (pre-trained on 76,237 human small non-coding RNA (ncRNA) species), RNA-FM (pre-trained on 23 million ncRNA species), and UltraGen improved their top binder ranking precision compared to the feature-based DeepBind by 10%, 60%, and 160%, respectively (Fig. 2b). This improvement underscores the efficacy of pre-training on an RNA dataset that features a continuous distribution of binding abilities, rather than species-specific or stochastic ones. Compared to its performance on small molecules, the precision of RaptRanker, which was derived from protein targets, declined by over 50% when applied to protein and (multi)cellular targets, indicating the limitations of predicting RNA-target interactions based solely on sequence-based secondary structure analysis. Despite its smaller model size, UltraGen outperformed RNABERT and RNA-FM by up to 346% in ranking precision on protein and (multi)cellular targets.
Consistent with findings on the SiR binding dataset, pre-trained models generally performed better, showing higher precision and F1 in both top and non-top binder predictions compared to the feature-based models (Extended Data Fig. 2b-d). To further examine the effectiveness of the model’s prediction, we analyzed the predicted top binders from the held-out test set. While target-specific binders showed distinct binding signatures, most experimentally reported binders were identified in the predominant RNA families among the predicted top binders (Extended Data Fig. 2e,f). To determine whether UltraGen captures RNA-target binding features by relying solely on sequence similarity, we excluded sequences within various edit distances from the top binders in the downstream analysis. Although eliminating similar species from their sequence family resulted in a noticeable decline in performance, UltraGen still maintained a leading ranking position with minimal shrinkage (Fig. 2a), indicating that its predictions correlated with the learning source that harbors the most structurally and sequence-similar binding information for inferring interaction. Taken together, UltraGen comprehensively captured in vitro RNA-target binding across molecular size scales.

Discerning 3’-UTR of mRNA in human tissue specificity

The effectiveness of UltraGen in ranking SELEX datasets from (multi)cellular targets prompted an investigation into its potential for analyzing in vivo RNA interactions, where the RNA binding context involves multiple targets and extends beyond simple binding. To assess the feasibility of fine-tuning UltraGen on endogenous RNA sequences, we examined the distribution of hexamer units in mRNA and ncRNA across species. Our analysis revealed substantial conservation within biological RNA species, particularly among mammals, in contrast to the high diversity observed in SELEX libraries (Extended Data Fig. 3a,b).
This discrepancy suggests that functional RNAs may achieve specificity through subtle but critical sequence features. Supporting this, ubiquitously expressed human genes alter their 3’-UTR isoform ratios via alternative polyadenylation (APA) for tissue-specific regulation [31]. We further explored the specificity of 3’-UTR variants across 22 human primary tissues using HTS data from the APASdb database [32], which contains 2.89 million unique RNA cleavage sites with their frequencies. We extracted the 100 nt sequence upstream of each cleavage site together with its abundance. As over 75% of RNA species are enriched in only a few tissues (Supplementary Table 6), we investigated tissue specificities based on their presence or absence across tissues. This presented a challenging classification task with 88 categories (22 tissues, four abundance levels) aimed at discerning the specificity and the abundance levels of individual RNA species across tissues (Extended Data Fig. 3c). The dataset was randomly split into training, validation, and test sets in an approximate ratio of 6:1:3. After fine-tuning, benchmark models exhibited limited ability to predict tissue specificity, with the lowest precision falling below 4% and recall under 1% (Fig. 3a). Even 3UTRBERT [33], a model specifically pre-trained on human 3’-UTRs, achieved only 20% precision and 3% recall. In contrast, UltraGen robustly classified individual RNAs with various tissue specificities, demonstrating a three-fold improvement in precision and a ten-fold improvement in recall for tissue specificity classification. Moreover, in predicting abundance levels, UltraGen achieved an approximately two-fold increase in both precision and recall compared to the best benchmark models (Extended Data Fig. 3d). To evaluate the role of low-abundance species, we fine-tuned UltraGen using 3’-UTR datasets filtered by various abundance thresholds (Supplementary Table 6).
Including low-abundance RNAs significantly enhanced prediction performance, with higher precision and F1 scores across test sets (Extended Data Fig. 3e). UltraGen effectively classified testis RNA species, which are characterized by relatively homogeneous 3’-UTRs due to dominant 3’-UTR shortening at preferred polyadenylation sites [31] (Fig. 3b). Conversely, its performance was weaker on spleen RNAs, where dynamic, T cell-derived, heterogeneous 3’-UTR shortening might result in more diverse 3’-UTRs, complicating the modulation of the RNA target interactome [34]. These findings highlight tissue-specific variation in high-dimensional biological RNA binding data. To investigate the impact of sequence context on model performance, we analyzed sequences extending from the 3’- to the 5’-end, with lengths ranging from 50 nt to 300 nt. The highest F1 score was achieved with the last 100 nt of the RNA sequences (Fig. 3c). Further fine-tuning the pre-trained model with training data including 50 nt or 150 nt sequences confirmed that optimal performance remained with the last 100 nt (Fig. 3d and Extended Data Fig. 3f,g). Canonical polyadenylation signal sites (PASs, e.g. AAUAAA) were predominantly located within 30 nt upstream of cleavage sites across different tissues (Fig. 3e and Extended Data Fig. 4a). Additionally, the region between −100 and −50 nt, which had a substantial impact on model performance, exhibited a less defined A/U-rich element and the cleavage factor I binding motif UGUA (Extended Data Fig. 4a,b). Base substitutions within this 100 nt region substantially impacted predictive performance, particularly near the PASs (Fig. 3f). Collectively, the nuances of 3'-UTR sequences, though subtle, are crucial for the models’ ability to predict tissue-specific distribution.
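As an illustration of the data layout described in this section, the following minimal sketch extracts the window upstream of a cleavage site and maps a (tissue, abundance level) pair onto one of the 88 categories. The flat index `tissue * 4 + level`, the 0-based coordinate convention, and all function names are assumptions made for illustration; the text only states that there are 22 tissues, four abundance levels, and a 100 nt upstream window.

```python
N_TISSUES = 22  # human primary tissues (from the text)
N_LEVELS = 4    # abundance levels per tissue (from the text)


def upstream_window(transcript: str, cleavage_site: int, length: int = 100) -> str:
    """Return up to `length` nt ending immediately before the cleavage site.

    Assumes `cleavage_site` is a 0-based index into `transcript`.
    """
    start = max(0, cleavage_site - length)
    return transcript[start:cleavage_site]


def last_nt(sequence: str, n: int) -> str:
    """Keep only the last `n` nt (the 3'-most part), as in the length scan."""
    return sequence[-n:]


def class_index(tissue: int, level: int) -> int:
    """Hypothetical flat encoding of (tissue, abundance level) into 0..87."""
    if not (0 <= tissue < N_TISSUES and 0 <= level < N_LEVELS):
        raise ValueError("tissue or abundance level out of range")
    return tissue * N_LEVELS + level


def decode_class(index: int) -> tuple:
    """Invert class_index: return (tissue, level)."""
    return divmod(index, N_LEVELS)
```

With this encoding the largest label is `class_index(21, 3)`, and `last_nt(window, 50)` reproduces the shortest context length examined in the length scan.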
RNA virus tropism and binding characteristics

Human-pathogenic RNA viruses exhibit marked tropism [35], characterized by tissue-specific infection and replication due to interactions with host molecules. Remarkably, although viral RNA sequences were absent from the 3’-UTR-based fine-tuning of UltraGen, the model's predictions align closely with the known tissue preferences of various viruses, such as the heart and skeletal muscles for dengue [36] and the lungs and spleen for measles [37, 38] (Fig. 4a and Supplementary Table 7). This suggests that viral RNA 3’-ends may share similarities with host RNAs in their preferred human tissues. We further extended our investigation to SARS-CoV-2 pandemic variants circulating before 2024. Despite the highly mutated coding regions, the 3’-UTRs of these variants remain evolutionarily conserved (Extended Data Fig. 5a). UltraGen consistently identified most variants by their tissue tropism toward the lungs and lymph nodes, aligning with the observed pneumonia with an active immune response associated with SARS-CoV-2 infection [39] (Fig. 4b and Supplementary Table 8). Length-dependent RNA virus tropism was also observed in the model fine-tuned with the full human dataset (Extended Data Fig. 5b,c). These findings may suggest a synchronized evolution between RNA viruses and the human host, involving conserved 3’-UTR regulatory interactions at the sequence level.

Next, we investigated SARS-CoV-2 replication using UltraGen, focusing on the SARS-CoV-2 replicase, which initiates RNA genome synthesis at the 3’-UTR [35]. Visualizing RNA species enriched in UltraSelex against the SARS-CoV-2 replicase nsp12 protein [22] revealed a central cluster of RNA species with the highest binding potential, effectively distinguished by UltraGen (Fig. 4c). k-mer analysis of the RNA species in the central cluster identified conserved CUUGA or CUUG motifs (Fig. 4d and Extended Data Fig. 5d), which are critical for nsp12 protein binding [22]. To assess whether UltraGen can identify and characterize this binding pattern at single-base resolution, we employed masked language modeling to predict the likelihood of each base within the nearby binding context (Extended Data Fig. 5e). By resolving this motif at single-base resolution, we observed a high degree of linear correlation between the experimentally measured binding affinities of mutants [22] and UltraGen's predictions without prior fine-tuning (zero-shot inference) (Fig. 4e,f and Extended Data Fig. 5f). Furthermore, we experimentally measured variants with another single mutation and a double mutation to examine the breadth of replicase binding features with CUUG (Extended Data Fig. 5g and Supplementary Table 9). Unlike molecular docking [40] and AlphaFold3 predictions [41], UltraGen highlighted the stem structure adjacent to the CUUG motif, indicating that the presence of a CUUG motif alone does not ensure binding (Fig. 4e and Extended Data Fig. 5h,i). These findings challenge traditional bioinformatic approaches by emphasizing the importance of context-specific structural features beyond conserved motif presence.

Enhanced performance with broader pre-training sources

To assess how pre-training sources influence model performance, we developed two variants, UltraGen_source_SiR and UltraGen_source_nsp12, by pre-training exclusively on the SiR and nsp12 SELEX datasets [22, 28], respectively, which are enriched for persistent (or stable) RNA binding signatures (Extended Data Fig. 6a). Compared to the original UltraGen, both variants showed diminished performance in identifying top SELEX binders and classifying tissue-specific 3'-UTRs (Extended Data Fig. 6b,c), underscoring the importance of integrating RNA with more transient binding properties to optimize performance in biological RNA binding systems.
To explore the model’s capability to incorporate diverse RNA binding information, we continued pre-training the base UltraGen model on two distinct resources, yielding UltraGen_molecules, trained with RNA sequences against the 12 SELEX targets, and UltraGen_3UTR, trained with endogenous 3’-UTR datasets (Extended Data Fig. 6d,e). In the in vitro RNA binding systems, UltraGen_molecules exhibited improved precision in predicting top binders, while UltraGen_3UTR maintained prediction performance comparable to the base UltraGen model (Fig. 5a and Extended Data Fig. 6f). Similarly, UltraGen_3UTR outperformed the UltraGen_molecules model at discerning tissue-specific 3’-UTR variants (Fig. 5b and Extended Data Fig. 6e). These results suggest that continued pre-training of UltraGen on task-specific data can enhance its ability to capture more diverse and intricate RNA binding information. We then generated UltraGen_plus by integrating both SELEX and 3’-UTR resources for continued pre-training (Extended Data Fig. 6d), achieving well-balanced performance (Fig. 5a,b). To extend UltraGen’s capacity for transcriptome-scale modeling, we incorporated RNA immunoprecipitation (RIP) datasets from ENCODE [14] for continued pre-training, yielding the UltraGen_RIP variant. Following fine-tuning with the individual-nucleotide-resolution UV crosslinking and immunoprecipitation (iCLIP) database [15] (Supplementary Table 10), which covers diverse in vivo human RNA-protein interactions, UltraGen variants demonstrated leading performance in predicting human RNA-protein interactions (Extended Data Fig. 7a). UltraGen_plus achieved the highest F1 score of 0.789, followed by UltraGen_RIP and UltraGen with scores of 0.782 and 0.780, respectively (Fig. 5c). Consistently, UltraGen_plus and its variants achieved the highest prediction performance on the mouse CLIP datasets, with an F1 score of up to 0.671 (Extended Data Fig. 8a,b and Supplementary Table 11).
Considering the prevalence of m6A methylation in RNA-protein interactions [42], we further evaluated the model's ability to recognize this RNA modification. All models, except DeepBind, demonstrated strong performance with F1 scores above 0.948, while UltraGen variants showed a marginal advantage, reaching up to 0.968 (Extended Data Fig. 9a,b). These findings imply that incorporating biological RNA sequences from either 3'-UTR or transcriptome-wide RNA binders enhances learning of RNA interactions.

Discussion

UltraGen demonstrated a robust ability to interpret evolutionary RNA binding contexts, bridging artificial and natural RNA realms through self-learning of RNA-target interactions. Its strengths are evident in systematic ranking predictions, tissue-specific recognition, and single-base zero-shot RNA binding characterization, revealing the RNA evolution towards a more refined spatial language. The UltraGen web server is accessible at https://www.ultrarnalab.com. We benchmarked UltraGen against feature-based and pre-trained models across diverse binding systems. The transformer architecture, strengthened by pre-training on RNA species that included transient interactions from the UltraSelex system, demonstrated improved sequence feature capture. Moreover, UltraSelex RNA species targeting small molecules exhibited a broader interaction landscape than those targeting the protein. Similarly, predicting RNA interactions with small molecules is more challenging than with proteins or (multi)cellular targets in the SELEX systems, likely due to their greater diversity compared to the more structured k-mer motif interactions with proteins [3]. RNA binding proteins preferentially interact with U, followed by A [25]. Consistently, the 3’-UTR, characterized by AU-rich elements and other regulatory factors, interacts with small molecules and proteins, influencing mRNA stability and translation efficiency [43–46].
Low-abundance 3’-terminal RNA species showed notable tissue-specific distribution, likely due to subtle variations upstream of the AU-rich region within the 100 nt segment handled by the 3’-end processing machinery. UltraGen effectively predicted the 3’-UTR of dengue virus, which carries a classical PAS, and of SARS-CoV-2, which only has an assumed U-stretch for 3’-end processing [47], suggesting diverse polyadenylation mechanisms. However, 3’-end polyadenylation sequencing alone does not provide a complete view of mRNA features related to alternative 5’-UTRs, splicing events, or RNA modifications like m6A, which might limit the model's generalization ability. A more comprehensive understanding may emerge from integrating both full-length sequence data and base modification information from single-cell RNA sequencing (Extended Data Fig. 10) to connect multi-level cell-type-specific regulatory networks and clarify their function in different tissues. The first pair in the nearby stem (e.g., G-U, A-U, or C-U, rather than G-C) is critical for RNA interaction with the SARS-CoV-2 replicase nsp12 protein. This finding complements our previous discovery that CUY(U/C)G-containing RNA regions are essential for strong interactions with nsp12 [22]. Notably, CUUG-containing RNA aptamers effectively inhibited the SARS-CoV-2 replicase nsp12/7/8 complex in biochemical RNA extension assays [22]. The CUYG motif forms a conserved stem-loop structure (SL2) in coronaviruses (SARS, MHV, BCoV, OC43, and HKU1) [48]. The virulence of MHV was found to be highly sensitive to genomic site mutations in this SL2 CUUG motif [48]. These findings may suggest that the CUUG motif interacts with the SARS-CoV-2 replicase nsp12 due to its crucial role in viral replication. Future models could benefit from continued pre-training on in vitro selected RNA binding sequences targeting a broader range of molecules with desired biological functions.
Additionally, incorporating a larger cohort of dynamic endogenous RNA species and information on RNA epitranscriptomic modifications under various physiological conditions would be valuable for potential model optimization. Exploring training strategies for RNA language models with a larger parameter framework may also be advantageous. In summary, UltraGen has demonstrated its efficacy as a powerful tool for capturing context-specific RNA binding systems, offering substantial potential for advancing our understanding of RNA-target interactions and their implications within biological contexts.

Methods

UltraGen model architecture

The UltraGen model was constructed using a BERT-style encoder-only transformer architecture [51], incorporating two key components: multi-head self-attention and feedforward network modules. Additionally, it leveraged Rotary Position Embedding [26] for enhanced processing of long-distance dependencies. With a total of 12 layers and an embedding size of 480, the model comprises 33.5 million parameters. Each nucleotide base (A/U/G/C) was treated as an individual token during RNA sequence tokenization. Special tokens marking the start and the end of each sequence were introduced to enhance the capture of global semantic content. Additionally, the UltraGen vocabulary includes tokens for the separator, for padding, for masking, and for unknown elements.

UltraGen pre-training kernel feature description

Compared to SELEX RNA libraries, which lose most transient RNA binders, the UltraSelex RNA library is enriched in both persistent and transient RNA binders [22]. We utilized the sequence information of these UltraSelex RNA datasets for our model training.
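The single-nucleotide tokenization described in the architecture section can be sketched as follows. This is a minimal illustration, not the authors' code: the special-token strings (`<cls>`, `<eos>`, `<sep>`, `<pad>`, `<mask>`, `<unk>`) and the id ordering are hypothetical, since the text specifies only the roles of these tokens, and converting T to U on input is an added convenience.

```python
from typing import List, Optional

# Hypothetical special-token names and ordering; only their roles are given in the text.
SPECIAL_TOKENS = ["<pad>", "<cls>", "<eos>", "<sep>", "<mask>", "<unk>"]
BASES = ["A", "U", "G", "C"]
VOCAB = {token: idx for idx, token in enumerate(SPECIAL_TOKENS + BASES)}


def tokenize(sequence: str, pad_to: Optional[int] = None) -> List[int]:
    """Encode an RNA string as ids, one token per nucleotide.

    A start token is prepended and an end token appended; characters other
    than A/U/G/C map to the unknown token. If `pad_to` is given, the result
    is right-padded to that length.
    """
    sequence = sequence.upper().replace("T", "U")  # accept DNA-style input
    ids = [VOCAB["<cls>"]]
    ids += [VOCAB.get(base, VOCAB["<unk>"]) for base in sequence]
    ids.append(VOCAB["<eos>"])
    if pad_to is not None:
        if len(ids) > pad_to:
            raise ValueError("sequence longer than pad_to")
        ids += [VOCAB["<pad>"]] * (pad_to - len(ids))
    return ids
```

Under these assumptions a 103 nt library member becomes 105 ids (103 bases plus the start and end tokens), so a vocabulary of ten entries suffices.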
The common construct of UltraSelex RNA binders (103 nt) contained two constant primer-binding regions [52] that could be structurally paired with each other, two randomized stretches of 26 nt each to diversify binding characteristics for the wet-lab selection, and a constant 12 nt internal hairpin loop thought to improve the enrichment of high-affinity binders with relatively less RNA-RNA interaction [53]. Moreover, the UltraSelex SiR library contains more diverse RNA binders than the UltraSelex nsp12 library, to which target-specific geometric constraints apply (Extended Data Fig. 1a). Therefore, UltraGen was pre-trained on the full-length RNA sequences (103 nt), without leveraging their secondary structure, derived from the UltraSelex SiR RNA dataset at different binding potential (auc value) thresholds. The optimal UltraGen variant was achieved by pre-training on the top ten million species ranked by their binding potential, suggesting that this subset balances the enrichment of binding signals against background noise from the dataset.

Data pre-processing

The raw HTS datasets, including UltraSelex SiR-B [22] (54.6 million species, sub-panel 1.1.2), UltraSelex Nsp-B [22] (76.8 million species, sub-panel 1.2.2), SELEX SiR [22, 28] (14.6 million species, sub-panel 2.1.4), and SELEX nsp12 [22] (29 million species, sub-panel 2.2.1), were obtained from the UltraSelex data archive panel in UltraRNALab (www.ultrarnalab.com). RNA sequences originating from UltraSelex underwent quality control, adaptor trimming, and hairpin loop confirmation, and were subsequently ranked in descending order by SGREELI auc with default settings to indicate binding potential [22]. Similarly, RNA species from the final round (the 14th) of the SiR SELEX underwent a similar process but were ranked by their detection frequency in descending order.
Top-ranked RNA sequences were extracted for training, with a held-out test set comprising 2% of the total species, ensuring less than 90% sequence identity to the training data using CD-HIT [54] (Supplementary Table 1).

Self-supervised learning and model evaluation

UltraGen integrated two distinct pre-training components: (1) Base Reconstruction Loss ($L_{\text{base}}$), resembling BERT [51], involves randomly selecting 15% of tokens from each sequence for prediction. Among these, 80% were replaced with the mask token, 10% were substituted with other bases, and 10% remained unchanged. (2) Motif Reconstruction Loss ($L_{\text{motif}}$), similar to SpanBERT [55], employed consecutive span masking to predict motifs. Span lengths follow a Poisson distribution (λ = 5) and range from 1 to 10. Unlike SpanBERT, UltraGen reconstructed the original bases from the masked positions, not from tokens at span boundaries. UltraGen optimization combines both pre-training objectives, formally defined as $L = L_{\text{base}} + \alpha \cdot L_{\text{motif}}$, where α adjusts the objective weights and was set to 0.25 during pre-training. We utilized the AdamW optimizer with a warm-up strategy, increasing the learning rate to 4e-4 over 2,000 steps, followed by cosine annealing for each data partition. The model was trained using single-node parallelism across 8 GPUs, with a batch size of 500 per GPU. A mixed-precision training strategy was employed to enhance computational efficiency. The pre-training process exhibited stable convergence: the loss gradually decreased and reached a plateau. Model evaluation involved comparing the average loss on the test set for each partition and UMAP [56] visualization to differentiate RNA binders. The optimal UltraGen model was pre-trained on the top-ranked 10 million RNA species from the UltraSelex SiR-B dataset, utilizing eight 32GB NVIDIA V100 GPUs over a period of 21 days.
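The two masking schemes, the combined objective, and the learning-rate schedule described above can be sketched in plain Python. This is a minimal illustration under stated assumptions rather than the authors' implementation: the `<mask>` string is a hypothetical token name, exactly 15% of positions (rounded) are selected per sequence, a single span is masked per sequence, out-of-range Poisson draws are rejected and redrawn, the warm-up is linear, and the cosine phase decays to zero. None of these details is specified in the text.

```python
import math
import random

BASES = "AUGC"
MASK = "<mask>"  # hypothetical token name


def mask_bases(tokens, rng, rate=0.15):
    """BERT-style corruption: 80% mask, 10% other base, 10% unchanged."""
    corrupted, targets = list(tokens), {}
    n_pick = max(1, round(rate * len(tokens)))
    for pos in rng.sample(range(len(tokens)), n_pick):
        targets[pos] = tokens[pos]  # the model must reconstruct this base
        draw = rng.random()
        if draw < 0.8:
            corrupted[pos] = MASK
        elif draw < 0.9:
            corrupted[pos] = rng.choice([b for b in BASES if b != tokens[pos]])
        # remaining 10%: token left unchanged but still predicted
    return corrupted, targets


def sample_span_length(rng, lam=5.0, lo=1, hi=10):
    """Draw Poisson(lam) via Knuth's method; redraw if outside [lo, hi]."""
    while True:
        limit, k, p = math.exp(-lam), 0, 1.0
        while True:
            p *= rng.random()
            if p <= limit:
                break
            k += 1
        if lo <= k <= hi:
            return k


def mask_span(tokens, rng):
    """Mask one consecutive span; targets are the original bases in it."""
    length = min(sample_span_length(rng), len(tokens))
    start = rng.randrange(len(tokens) - length + 1)
    corrupted, targets = list(tokens), {}
    for pos in range(start, start + length):
        targets[pos] = tokens[pos]
        corrupted[pos] = MASK
    return corrupted, targets


def combined_loss(loss_base, loss_motif, alpha=0.25):
    """L = L_base + alpha * L_motif."""
    return loss_base + alpha * loss_motif


def learning_rate(step, total_steps, peak=4e-4, warmup=2000):
    """Linear warm-up to `peak`, then cosine annealing down to zero."""
    if step < warmup:
        return peak * step / warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    return 0.5 * peak * (1.0 + math.cos(math.pi * min(1.0, progress)))
```

In a training loop, the losses computed on the outputs of `mask_bases` and `mask_span` would be passed to `combined_loss`, and `learning_rate(step, total_steps)` would be evaluated once per optimizer step within each data partition.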
Additionally, other pre-trained models, UltraGen source_SiR and UltraGen source_nsp12, were constructed using SELEX datasets targeting SiR and nsp12, respectively 22, each based on its top-ranked 10 million RNA species.

Systematic ranking of in vitro-selected RNA aptamers by UltraGen

Data pre-processing

The SiR SELEX dataset was obtained and processed as described above. Twelve benchmark SELEX datasets were used in this study, including four small-molecule targets (benzopyrylium-coumarin fluorophores 4, paromomycin 5, maleimide 6, PPACK 6), four protein targets (TAR DNA binding protein 43 7, ribosomal protein S15 8, RNA-binding motif protein 24 7, and HIV-1 reverse transcriptase 9), and four (multi)cellular targets (triple-negative breast cancer cells 10, Chinese hamster ovary K1 cells 11, myeloid-derived suppressor cells 12, and human islets 13) (see details in Supplementary Table 2). Original sequencing datasets underwent standard processing, including quality control, adaptor trimming, length filtering, and RNA conversion (in-house code). RNA species enriched in the final round of each SELEX dataset were then extracted and stratified into buckets (Supplementary Table 3–5). Those absent from the final round were classified into a background set. The binding potential of bucketed RNA species was estimated from their detection frequency, with the highest-ranking bucket comprising the top-ranked 0.1–1% of total species, reflecting the desired experimental selection criteria in SELEX 57, 58. To balance the data, categories with an excess of species, such as the background set, were downsampled to 100,000 instances. Conversely, categories with fewer instances were oversampled by randomly duplicating instances until they reached the same threshold.
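The balancing step described above (downsampling large categories and oversampling small ones to a common size) could look roughly like the following. This is a sketch under assumptions: the function name, the dictionary layout, and the fixed seed are illustrative and do not come from the paper's code.

```python
import random


def balance(categories, target=100_000, seed=0):
    """Return a copy of `categories` in which every non-empty category has `target` items.

    categories: dict mapping a category name to a list of RNA sequences.
    Large categories are downsampled without replacement; small ones keep all
    original items and are topped up with random duplicates.
    """
    rng = random.Random(seed)
    balanced = {}
    for name, seqs in categories.items():
        if len(seqs) >= target:
            balanced[name] = rng.sample(seqs, target)
        else:
            extra = rng.choices(seqs, k=target - len(seqs))  # duplicates allowed
            balanced[name] = list(seqs) + extra
    return balanced
```

For example, `balance({"top": top_bucket, "background": background}, target=100_000)` would return two lists of 100,000 sequences each, where `top_bucket` and `background` are hypothetical input lists.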
Each RNA category within the benchmark dataset was subsequently partitioned into training, validation, and test sets at a ratio of 6:1:3. For the classification tasks derived from the cohort of SELEX binding datasets, the target-specific enriched RNA species varied in each dataset due to differences in RNA length, architecture (different primer A/B from different labs), and library randomness (Supplementary Table 2). As these species differ from the RNA used for pre-training, they were not excluded from the analysis. The exact sample sizes and data balancing strategies are given in Supplementary Tables 3–6.

Model fine-tuning and evaluation

In the fine-tuning phase, following BERT, the final-layer representation of the first token was used as the sequence representation. For each dataset, task-specific non-linear classifiers were implemented, and all parameters were subsequently fine-tuned. Systematic ranking is a multi-class task, for which the categorical cross-entropy loss was calculated. The learning rate was set to 1e-4, with a batch size of 100. Early stopping was used during training, terminating the process if validation performance showed no improvement over 10 consecutive epochs, which limits overfitting and saves computation. This single-label classification task was evaluated using the following metrics: Precision@top, F1@top, F1@all, Precision@all, and Weighted Precision@all. Precision@top and F1@top measure precision and the precision-recall balance (F1 score) for RNA species with the highest binding potential. Precision@all and Weighted Precision@all serve as comprehensive precision metrics over all categories, reflecting how well the systematic binding landscape is predicted; the weighted variant attributes more significance to higher-ranked categories.
The Weighted Precision@all metric was calculated using the formula \(\frac{\sum_{j=1}^{N} \frac{Preci_j}{j}}{\sum_{i=1}^{N} \frac{1}{i}}\), where \(N\) denotes the number of categories and \(Preci_j\) represents the precision associated with the \(j\)th category. Precision@all was calculated without weighting categories. In a further ablation study, sequence homogeneity was assessed using the edit distance metric. Initially, RNA species were ranked in descending order based on their detection frequency in the dataset. Subsequently, RNA species with an edit distance less than or equal to a specified threshold relative to those from the top-ranked bucket were removed, while RNA species from non-top-ranked buckets retained their positions in the training process. After fine-tuning the model’s parameters, the Precision@top metric was calculated and compared across different edit distance criteria. To evaluate the model’s usefulness in downstream applications, predicted binder sequences in the top category of the test sets were clustered and compared with reported binding sequences or core motifs from the literature (Supplementary Table 2). Specifically, the model score (predicted probability) and the weighted abundance across all SELEX rounds were recorded for each sequence. Sequences were then sorted in descending order of model score and clustered using the following algorithm. First, the edit distance between each sequence and the central sequences in the clustering pool was calculated sequentially. If the edit distance exceeded a specified cutoff, a new category was added to the clustering pool with the current sequence serving as its central sequence. Conversely, if the edit distance was no more than the cutoff, the sequence was assigned to that category. After clustering, the weighted abundance of sequences within each category was summed, and categories were sorted in descending order of this summed value.
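The Weighted Precision@all formula and the greedy clustering procedure described above can be sketched as follows. The function names and record layout are hypothetical, and two points are assumptions rather than statements from the text: category 1 is taken to be the top-ranked bucket, and a sequence is assigned to the first cluster centre that lies within the cutoff.

```python
def weighted_precision(precisions):
    """Harmonic-weighted mean: sum_j(Preci_j / j) / sum_i(1 / i), with j starting at 1."""
    num = sum(p / j for j, p in enumerate(precisions, start=1))
    den = sum(1 / i for i in range(1, len(precisions) + 1))
    return num / den


def edit_distance(a, b):
    """Levenshtein distance by dynamic programming over two rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution or match
        prev = cur
    return prev[-1]


def greedy_cluster(records, cutoff):
    """Cluster (sequence, model_score, weighted_abundance) records.

    Sequences are visited in descending model score. Each one joins the first
    existing cluster whose centre is within `cutoff` edits, otherwise it founds
    a new cluster. Clusters are returned sorted by summed weighted abundance.
    """
    clusters = []
    for seq, _score, abundance in sorted(records, key=lambda r: r[1], reverse=True):
        for cluster in clusters:
            if edit_distance(seq, cluster["center"]) <= cutoff:
                cluster["members"].append(seq)
                cluster["abundance"] += abundance
                break
        else:
            clusters.append({"center": seq, "members": [seq], "abundance": abundance})
    clusters.sort(key=lambda c: c["abundance"], reverse=True)
    return clusters
```

Because every new sequence is compared against all existing centres, the cost grows with the number of clusters times the sequence length squared; for large test sets a banded or bit-parallel edit distance would be a natural substitute.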
Distinct cutoff values were chosen according to the level of sequence similarity within each dataset. Datasets exhibiting lower similarity, such as TARDBP and RBM24, used larger edit distances as clustering cutoffs, while datasets featuring higher similarity or shorter sequences were analyzed using smaller edit distance cutoffs. The specific cutoff settings for each dataset were as follows: DAse: 3; BC: 7; PR: 3; MI: 3; TARDBP: 14; RT: 2; RBM24: 13; S15: 6; ISLETS: 4; MDSC: 5; CHO-K1: 4; TNBC: 3. Binders within each dataset were identified accordingly. For the DAse, TARDBP, and RBM24 datasets, positive sequences were selected according to the binding motifs reported in the literature. For the remaining datasets, experimentally verified sequences reported in the literature were considered positive sequences.

Classifying human tissue-specific hallmarks of 3’-terminal non-coding RNA by UltraGen

Data preprocessing

Tandem 3’-terminal end sequencing datasets from 22 human tissues were obtained from APASdb 32. Raw sequencing reads underwent base quality control, adaptor trimming, length filtering, and genome mapping (using bowtie version 1.0.0 with parameters -v 2 -k 2 --best, referencing human genome GRCh37/hg19 from UCSC), along with internal priming filtering as described 59. The 100 nucleotides (nt) upstream of the genome-mapped 3’-end of each sequence were collected and aggregated into a data frame. This summary data frame comprises 2.89 million unique non-coding RNA species and their corresponding detection frequency across 22 human tissues. Given the tissue-specific chromosome-wide gene expression 60, RNA species were randomly partitioned into training, validation, and test sets based on the count of distinct tissues they appear in, following an approximately 6:1:3 ratio.
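One way to read the partitioning described above is as a stratified random split, where species are grouped by the number of tissues in which they are detected and each group is divided at roughly 6:1:3. The sketch below illustrates that reading only; the grouping key, rounding rule, and function name are assumptions.

```python
import random
from collections import defaultdict


def stratified_split(tissue_counts, ratios=(6, 1, 3), seed=0):
    """Split species into train/val/test within each tissue-count stratum.

    tissue_counts: dict mapping an RNA species ID to the number of distinct
    tissues (1..22) in which it was detected.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for species, n_tissues in tissue_counts.items():
        groups[n_tissues].append(species)

    splits = {"train": [], "val": [], "test": []}
    total = sum(ratios)
    for n_tissues in sorted(groups):
        members = sorted(groups[n_tissues])  # sort first so the shuffle is reproducible
        rng.shuffle(members)
        n_train = round(len(members) * ratios[0] / total)
        n_val = round(len(members) * ratios[1] / total)
        splits["train"] += members[:n_train]
        splits["val"] += members[n_train:n_train + n_val]
        splits["test"] += members[n_train + n_val:]  # remainder goes to test
    return splits
```

Small strata cannot be split exactly at 6:1:3, which is consistent with the ratio being described as approximate.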
Human-pathogenic RNA virus genomes were retrieved from the NCBI viruses RefSeq release (https://ftp.ncbi.nlm.nih.gov/refseq/release/viral/), along with their corresponding 3’-UTR annotation, using their NCBI reference accession IDs. Subsequently, the 100 nt upstream of the 3’-end of each 3’-UTR was extracted for downstream featurization. SARS-CoV-2 variant genomes meeting full nucleotide completeness criteria, within the time range from the end of 2019 to the end of 2023, were obtained from the NCBI Virus database (https://www.ncbi.nlm.nih.gov/labs/virus/). Variants monitored closely by the World Health Organization were identified and recorded with their accession IDs. Their 3’-UTR sequences were processed in the same way as the human-pathogenic RNA virus genomes described above.

Model fine-tuning and evaluation

UltraGen adopted a multi-task learning framework to simultaneously address classification tasks for human RNA tissue specificity and the corresponding abundance level. Briefly, for tissue specificity, the presence or absence of each RNA species in a given tissue was predicted as a probability between 0 and 1. RNA species correctly predicted (probability > 0.5 for presence, and ≤ 0.5 for absence) were summarized using precision, recall, and F1 score metrics across 22 tissues. RNA species were further classified into four abundance levels: high (counts ≥ 100), intermediate (100 > counts ≥ 10), low (10 > counts ≥ 1), and non-existent. Thus, each RNA species was associated with 22 (tissues) × 4 (abundance levels) labels for supervised learning. The overall loss function was defined as follows: \(Loss = \sum_{i=1}^{T} BCE(s_i, A_i) + \sum_{i=1}^{T} CE(e_i, B_i)\). Each sample sequence is associated with specificity labels \(A_i\) ∈ {0, 1}, indicating tissue presence across \(T\) tissues, and abundance level labels \(B_i\) ∈ {0, 1, 2, 3}, representing different abundance levels.
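As a worked illustration of this multi-task loss, the sketch below evaluates the sum of per-tissue binary cross-entropy and categorical cross-entropy terms for one sequence. It operates on probabilities rather than raw network outputs, and the mapping of abundance levels to the integer codes 0 to 3 is an assumption, since the text does not state which code corresponds to which level.

```python
import math


def abundance_level(count):
    """Map a read count to a class code (assumed: 0 none, 1 low, 2 intermediate, 3 high)."""
    if count >= 100:
        return 3
    if count >= 10:
        return 2
    if count >= 1:
        return 1
    return 0


def bce(p, y, eps=1e-12):
    """Binary cross-entropy for one predicted probability p and label y in {0, 1}."""
    p = min(max(p, eps), 1 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def ce(probs, label, eps=1e-12):
    """Categorical cross-entropy for one probability vector and an integer label."""
    return -math.log(max(probs[label], eps))


def multitask_loss(s, A, e, B):
    """Loss = sum_i BCE(s_i, A_i) + sum_i CE(e_i, B_i) over T tissues.

    s: T presence probabilities; A: T binary labels;
    e: T probability vectors of length 4; B: T abundance labels in {0, 1, 2, 3}.
    """
    assert len(s) == len(A) == len(e) == len(B)
    specificity = sum(bce(si, ai) for si, ai in zip(s, A))
    abundance = sum(ce(ei, bi) for ei, bi in zip(e, B))
    return specificity + abundance
```

With a single tissue, `s = [0.5]`, `A = [1]`, a uniform abundance prediction, and `B = [2]`, the loss is ln 2 + ln 4 = 3 ln 2, which is the value an uninformed model would produce.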
The model output \(s_i\) predicts tissue preference, while \(e_i\) predicts the tissue abundance level. Binary cross-entropy loss and categorical cross-entropy loss were computed separately for the two tasks. The learning rate was set to 1e-3 and the batch size to 200. The representation from the last model layer served as the sequence feature, followed by two added nonlinear layers for predicting tissue specificity and abundance level. All UltraGen parameters were fine-tuned, and performance metrics, including macro precision, recall, and F1 score, were computed and compared with other methods. Zero-shot inference was conducted for human-pathogenic RNA viruses (Supplementary Table 7) and SARS-CoV-2 variants (Supplementary Table 8) using their 3’-end sequences as input. The predicted tissue specificity of these viruses was then compared with data reported in clinical and research articles.

Predicting in vivo RNA-protein interaction from CLIP datasets and m6A modification

The eleven human iCLIP sequencing datasets were curated from the iONMF repertoire 15, encompassing nine protein targets: hnRNPC 61, U2AF2 61, hnRNPL 62, hnRNPL-like 63, Nsun2 64, TDP-43 65, TIA1 66, and TIAL1 66. Each dataset had been split into three parts using three-fold cross-validation, with 40,000 samples per part, further divided into training and testing sets in a 3:1 ratio. The mouse CLIP sequencing datasets were curated from CLIPdb 67, encompassing positive sequences (101 nt centered on the middle site) from eleven protein targets: EZH2, FUS, HNRNPR, LIN28A, RBFOX2, RBM10, SRSF2, SRSF3, TARDBP, U2AF2, and YTHDC2 (detailed sequence data source IDs are listed in "mouse.txt" on the CLIPdb server, http://clipdb.ncrnalab.org).
Negative sequences (101 nt) were randomly sampled from the mouse transcriptome regions (M2, GRCm38.p2 from Gencode) using the following commands: "bedtools random -l 101 -n 10000000 -g GRCm38_p2.chrom.sizes -seed 922 > mouse_genome_random.bed" and "bedtools intersect -a mouse_genome_random.bed -b gencode_M2_mouse_gene.bed -wa > mouse_gene_101nt_random.bed". To ensure that negative sequences did not overlap with the regions of positive sequences, they were filtered using: "bedtools intersect -a mouse_gene_101nt_random.bed -b positive_regions.bed -v > negative_regions_raw.bed". The remaining negative sequences were randomly shuffled using random.shuffle, with seeds ranging from 922 to 932. To remove sequences with more than 80% redundancy, CD-HIT 54 was applied to both positive and negative datasets. Equal numbers of non-redundant RNA sequences were randomly selected and split into training, validation, and test sets in a 6:1:3 ratio. RNA-protein interaction prediction was defined as a binary classification task. We adopted the same fine-tuning strategy as in the method “Systematic ranking of in vitro-selected RNA aptamers by UltraGen” and then used macro precision, recall, F1 score, and area under the ROC curve (AUC) for model performance comparison. For predicting RNA species harboring m6A modification, a total of 79,021 non-redundant m6A modification sites (filtered from 131,703 raw signals in m6A-Atlas 68) and 849,005 non-m6A sites, along with their flanking 20 nt upstream and 20 nt downstream regions, from nine cell lines (A549, CD8T, ESC, HCT116, HEK293, HEK293T, HeLa, HepG2, and MOLM13) were obtained from a previous study 33. Model performance was then evaluated through cross-validation, pairing the positive set with each of 10 different negative sets 33. Performance metrics, including precision, recall, F1 score, and AUC, were assessed using five different model seeds.
Benchmark deep learning models

The UltraGen classification benchmarks comprised feature-based models (DeepBind 16, SA-Net 29, and RaptRanker 17) and pre-trained models (RNABERT 30, RNA-FM 27, 3UTRBERT 33). DeepBind, a convolutional neural network designed to process nucleotide sequence features for predicting RNA-protein binding, was augmented with task-specific nonlinear classification layers, and all of its parameters were trained. Similarly, SA-Net, which uses a self-attention mechanism and sequence k-mer embedding, and RaptRanker, which incorporates both sequence and secondary structure information, were augmented with the same non-linear framework. The RNA language model RNABERT was originally pre-trained on 76,237 human small ncRNAs with 0.47 million parameters, while RNA-FM was pre-trained on 23 million ncRNA sequences from RNAcentral, with 99.52 million parameters, and 3UTRBERT was pre-trained on 20,362 3’-UTRs with 86.09 million parameters. The RNA-RBP interaction model BERT-RBP 69 adopts the BERT architecture and was built upon the DNABERT-3 model, which was pre-trained on the human reference genome GRCh38.p13 using 3-mer representations of nucleotide sequence and comprises approximately 86 million parameters. The first special classification token from the final layer of these pre-trained models was used to represent the sequence. Task-specific nonlinear classification layers were then added, and all parameters were fine-tuned. When comparing pre-trained UltraGen with other benchmark approaches, all models were augmented with identical task-specific nonlinear layers and fine-tuned using the same downstream binding datasets.

Sequence motif and structural analysis of SELEX RNA species

RNA species from the SELEX datasets were classified into "Binding" (detected enrichment > 0) and "Non-binding" (no detection) groups based on their abundance in the final SELEX round.
The "Binding" group includes all enriched RNA species, while an equal number of non-detected species were randomly selected to form the "Non-binding" group. For sequence analysis, RNA sequences were dissected into consecutive hexamer units to examine distribution patterns between groups. For structural analysis, RNA secondary structures were predicted by LinearPartition 70 using maximum expected accuracy (MEA). Each nucleotide was further annotated with one of six structural elements: dangling start (F), dangling end (T), internal loop (I), hairpin loop (H), multibranched loop (M), and stem region (S), as previously described 69. The positional frequencies of these structural elements were then analyzed and compared between the two groups.

Characterizing SARS-CoV-2 replicase nsp12 at single-base resolution

The experimental binder sequence (wild type, 113-50H + L) and mutated variants (M1-5) of the nsp12 RNA binders were obtained from our previous work 22. Additionally, two mutants (M6-7) (Supplementary Table 9) were designed and analyzed in this study. Each base of the wild-type sequence was masked in turn, and position-specific likelihoods were calculated. These likelihoods were then converted to nucleotide base probabilities using the softmax function. The probability of each base was compared with that of the wild-type base using the log odds ratio score to indicate binding preference. For simultaneous mutations, the collective effect was taken as the average of the model's per-position scores. Here, x wt and x mt represent the wild-type and mutant sequences, respectively; x i refers to the nucleotide base at position i, and x i−1 represents the sequence with a mask applied to position i. m denotes the count of mutations, and M specifies their positions; for example, with mutations at positions 6 and 9, M = (6, 9).
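The scoring equation itself is not present in the text above, so the sketch below shows only one plausible reading of the description, not the authors' formula: for each mutated position, the log ratio of the mutant-base probability to the wild-type-base probability under the masked model, averaged over the mutated positions. The sign convention (mutant over wild type), the 0-based indexing, and the input format are all assumptions.

```python
import math

BASES = "ACGU"


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]


def log_odds(logits_at_i, wt_base, mt_base):
    """log( P(mutant base) / P(wild-type base) ) at one masked position."""
    probs = dict(zip(BASES, softmax(logits_at_i)))
    return math.log(probs[mt_base] / probs[wt_base])


def mutation_score(masked_logits, wt, mt):
    """Average log odds over all positions where `mt` differs from `wt`.

    masked_logits[i] holds the model's four scores (in A, C, G, U order) for
    position i, obtained with position i of the wild-type sequence masked.
    Positions are 0-based here, unlike the 1-based positions in the text.
    """
    positions = [i for i, (a, b) in enumerate(zip(wt, mt)) if a != b]
    if not positions:
        return 0.0
    return sum(log_odds(masked_logits[i], wt[i], mt[i]) for i in positions) / len(positions)
```

Under this convention a negative score means the model considers the mutant bases less likely than the wild-type bases in that sequence context.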
Evolutionary similarity analysis of mRNA and ncRNA across species

mRNA and non-coding RNA were obtained from the transcriptome databases of various organisms: Homo sapiens (human, Ensembl release 109), Mus musculus (mouse, Ensembl release 109), Danio rerio (zebrafish, Ensembl release 109), Saccharomyces cerevisiae (yeast, Ensembl release 109), Arabidopsis thaliana (plant, Ensembl release 57), 5508 randomly selected bacterial species (bacterial, Ensembl release 57), RNAcentral (rnacentral_active.fasta.gz from https://ftp.ebi.ac.uk/pub/databases/RNAcentral), and RNA viruses (Cardiovirus, Cosavirus, Coxsackie, Rhinovirus, Poliovirus, Dengue, West Nile, Yellow Fever, Zika, H1N1, H3N2, Marburg, Ebola, Astrovirus, Chikungunya, Hantavirus, HIV, Lassa, Leishmania, Rabies) sourced from the NCBI viruses RefSeq release. Subsequently, the RNA sequences were segmented into hexamers using a sliding window approach. The abundance distribution of hexamers in each source was then compared with that of the other species using Pearson’s correlation.

Experimental binding affinity determination by bio-layer interferometry

Bio-layer interferometry (BLI) measurements were conducted using an Octet® R8 system (Sartorius). Prior to the equilibrium dissociation constant (K D) measurement, Octet Ni-NTA biosensors (Sartorius) were equilibrated in 1X ERBL buffer (2 mM Tris-HCl pH 7.5, 100 mM KCl, 5% (v/v) glycerol, 10 mM Mg(OAc) 2, 1 mM TCEP, 0.02% TWEEN 20 (Carl Roth)) for approximately 5 minutes. A two-fold dilution series of each RNA ligand (RNA was in vitro transcribed from the corresponding double-stranded DNA template and then purified by 10% PAGE) was prepared in 1X ERBL buffer, with 1X ERBL buffer without RNA serving as a reference. Protein loading was performed using 20 ng/µL of nsp12-His 10. The assay comprised a 60-second baseline-1 step, a 180–240 second protein loading step, a 60-second baseline-2 step, a 900-second association step, and a 600-second dissociation step.
Data analysis was conducted using the Octet Data Analysis software, involving pre-processing steps such as reference subtraction, y-axis alignment based on the average of the baseline, inter-step correction by dissociation, and Savitzky-Golay filtering. A 1:1 binding model was applied, and the K D was obtained by local or global fitting.

Continued pre-training of UltraGen for diverse data sources

To enhance the model's capability in integrating diverse RNA binding information, the basic UltraGen model underwent further pre-training, resulting in four additional variants. The first variant, UltraGen molecules, was further pre-trained using 3.24 million RNA species from the training sets of the 12 SELEX targets (Supplementary Table 3–5). The second variant, UltraGen 3 UTR, was further pre-trained with 1.69 million endogenous RNA sequences from the preliminary 3’-UTR full training datasets (Supplementary Table 6). The third variant, UltraGen plus, was developed as the comprehensive model by integrating all data used for the previous two variants. The fourth variant, UltraGen RIP, was further pre-trained using 6.88 million RNA species from human RIP-Seq in ENCODE 14. For human RIP-Seq data preprocessing, nine ENCODE experiments were selected, and their genome-mapped BAM files (ENCFF660QLC, ENCFF041SHW, ENCFF241CCX, ENCFF956PEU, ENCFF879QSE, ENCFF844CIH, ENCFF023IWI, ENCFF257FPS, ENCFF492HQI) were downloaded from ENCODE. The properly mapped paired-end sequencing reads were extracted using samtools (v1.13) with the parameter -f 2. The corresponding genomic sequences were extracted from the reference GRCh37/hg19 from UCSC based on the mapping information. For sequences sharing the same left-most mapped position but differing in length, the most abundant one within a length range of 90 to 150 nt was collected, considering that the insert length of the RIP-Seq library is mostly within 150 nt.
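The fragment selection rule described above (one representative per left-most mapped position, chosen by abundance among fragments of 90 to 150 nt) might be sketched as follows. The tuple layout, the inclusion of strand in the grouping key, and the tie-breaking rule (first encountered wins) are assumptions for illustration.

```python
from collections import Counter


def pick_representatives(fragments, min_len=90, max_len=150):
    """Choose the most abundant fragment length for each left-most mapped position.

    fragments: iterable of (chrom, strand, start, length) tuples, one per
    properly paired read pair. Returns {(chrom, strand, start): length}.
    """
    # Count identical fragments, ignoring those outside the accepted length range.
    counts = Counter(f for f in fragments if min_len <= f[3] <= max_len)

    best = {}
    for (chrom, strand, start, length), n in counts.items():
        key = (chrom, strand, start)
        # Strictly greater keeps the first-seen fragment on ties.
        if key not in best or n > best[key][1]:
            best[key] = (length, n)
    return {key: length for key, (length, _n) in best.items()}
```

The returned coordinates could then be used to pull sequences from the reference genome, which is the step the text attributes to the mapping information.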
This process resulted in 6.88 million high-quality non-redundant RNA species for continued pre-training of the model. Each model variant underwent an additional round of pre-training on its respective dataset, following the same pre-training procedure while building upon the base UltraGen model. This approach was designed to minimize the risk of overfitting while preserving the core knowledge embedded in the base model. Continued pre-training of UltraGen was performed on a single 32GB NVIDIA V100 GPU with a batch size of 500, using the AdamW optimizer and a learning rate warm-up strategy. The learning rate was increased to 4e-4 over 2,000 steps, followed by cosine annealing for each data partition.

Analysis of single-cell 3’ readout RNA sequencing of human lung adenocarcinoma

We performed single-cell analysis of human lung adenocarcinoma (LUAD) cells 71 (BioProject ID: PRJNA973717), which comprised five single-cell 3’ readout RNA sequencing datasets (SRR24626848, SRR24626849, SRR24626850, SRR24626853, SRR24626854). For each dataset, we ran the NCBI fastq-dump utility with the --split-files argument to retrieve the corresponding FASTQ files. These FASTQ files were then renamed according to the bcl2fastq file naming convention to meet the requirements of Cell Ranger (v8.0.1). To build a custom reference for Cell Ranger, we ran the cellranger mkref command on the human genome GRCh38 data (https://cf.10xgenomics.com/supp/cell-exp/refdata-gex-GRCh38-2024-A.tar.gz). The FASTQ files were then aligned to the reference genome with the cellranger count command, specifying the path to the reference genome (GRCh38 with GENCODE v44) and the sequence data (e.g. SRR24626848). The resulting output files contained per-molecule information and feature barcode matrices, which were further aggregated with the cellranger aggr utility.
This process produced a unified feature barcode matrix along with secondary analysis outputs, including clustered sequences and their two-dimensional coordinates. Clustering was performed using the t-SNE algorithm with default parameters: tsne_perplexity 30, tsne_theta 0.5, tsne_max_dims 2, tsne_max_iter 1000, tsne_stop_lying_iter 250, and tsne_mom_switch_iter 250. The clustered sequences were then annotated using Azimuth (0.5.0), an automated cell type annotation tool that uses a pre-annotated reference single-cell dataset, following a standard 10X Genomics annotation process (www.10xgenomics.com/cn/analysis-guides/automated-cell-type-annotation-from-r-to-loupe-using-louper). Briefly, the human lung reference data was downloaded using the command "InstallData('lungref')". Azimuth was then used to compare the gene expression profile of each individual cell in the query dataset against the reference, assigning its cell type accordingly. The annotated Seurat object was subsequently converted into a .cloupe file containing embedded dimension coordinates. Finally, the t-SNE plot at annotation level 3 was generated by loading the .cloupe file into the LoupeR program (1.1.1). The cell numbers of each cell type from the clustered datasets were as follows: T cell lineage: 29140; Innate lymphoid cell NK: 5025; Mast cells: 4048; B cell lineage: 3818; AT2: 3634; Secretory: 3420; Fibroblasts: 3066; EC capillary: 2878; AT1: 2507; EC venous: 2452; Macrophages: 1748; Dendritic cells: 1716; Monocytes: 1635; EC arterial: 1211; None: 343; Multiciliated lineage: 330; Myofibroblasts: 271; Lymphatic EC differentiating: 108; Basal: 72; Lymphatic EC mature: 24; Ionocyte: 3; SM activated stress response: 1; Neuroendocrine: 1. For lung tissue specificity analysis, we extracted the 3'-end reads by cutting off the longest continuous run of T bases at their 5'-end. The resulting reads were then assigned to the corresponding cell types based on their 5'-end cell barcodes.
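The trimming and assignment steps just described can be pictured with the small sketch below. It reads "the longest continuous T bases at their 5'-end" as the uninterrupted run of T that starts at the first base of the read, which is an interpretation; the function names and the barcode lookup table are hypothetical.

```python
from collections import defaultdict


def trim_leading_t(read):
    """Remove the uninterrupted run of T bases at the 5' end of a read."""
    return read.lstrip("T")


def assign_reads(reads, barcode_to_type):
    """Group trimmed reads by annotated cell type.

    reads: iterable of (cell_barcode, sequence) pairs.
    barcode_to_type: dict mapping a cell barcode to its annotated cell type.
    Reads whose barcode has no annotation, or that are empty after trimming,
    are skipped.
    """
    by_type = defaultdict(list)
    for barcode, seq in reads:
        cell_type = barcode_to_type.get(barcode)
        trimmed = trim_leading_t(seq)
        if cell_type is not None and trimmed:
            by_type[cell_type].append(trimmed)
    return dict(by_type)
```

A production pipeline would usually also tolerate sequencing errors inside the poly-T stretch, for example by allowing a small number of mismatches, which this exact-match sketch does not do.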
These reads were subsequently processed as previously described for the SAPAS 3'UTR analysis 59. In short, the 3'-end sequencing reads underwent quality control, genome mapping (hg19 from UCSC; bowtie -v 3 -k 2 --best), internal priming filtering, and removal of sequences that overlapped with the 22-tissue training dataset. The tandem 100 nt RNA sequences extracted from mapped loci were analyzed using the UltraGen model. Clustered cells were annotated with the average logistic regression coefficient of five biological samples, with high predicted values indicating lung tissue specificity (the highest classification to lung, or a probability (range 0–1) above 0.9).

Methods-only references

Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Proceedings of NAACL-HLT 1 , 4171–4186 (2019). Famulok, M. Molecular Recognition of Amino Acids by RNA-Aptamers: An L-Citrulline Binding RNA Motif and Its Evolution into an L-Arginine Binder. J Am Chem Soc 116 , 1698-1706 (1994). Davis, J.H. & Szostak, J.W. Isolation of high-affinity GTP aptamers from partially structured RNA libraries. Proc Natl Acad Sci U S A 99 , 11616-11621 (2002). Fu, L., Niu, B., Zhu, Z., Wu, S. & Li, W. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics 28 , 3150-3152 (2012). Joshi, M. et al. SpanBERT: Improving Pre-training by Representing and Predicting Spans. Transactions of the Association for Computational Linguistics 8 , 64-77 (2020). McInnes, L., Healy, J., Saul, N. & Großberger, L. UMAP: Uniform Manifold Approximation and Projection. Journal of Open Source Software 3 (2018). Boussebayle, A., Groher, F. & Suess, B. RNA-based Capture-SELEX for the selection of small molecule-binding aptamers. Methods 161 , 10-15 (2019). Sunbul, M. et al. Super-resolution RNA imaging using a rhodamine-binding aptamer with fast exchange kinetics. Nat Biotechnol 39 , 686-690 (2021). Fu, Y. et al.
Differential genome-wide profiling of tandem 3' UTRs among human breast cancer and normal cells by high-throughput sequencing. Genome Res 21 , 741-747 (2011). Patkar, S. et al. Hard wiring of normal tissue-specific chromosome-wide gene expression levels is an additional factor driving cancer type-specific aneuploidies. Genome Med 13 , 93 (2021). Zarnack, K. et al. Direct competition between hnRNP C and U2AF65 protects the transcriptome from the exonization of Alu elements. Cell 152 , 453-466 (2013). Konig, J. et al. iCLIP reveals the function of hnRNP particles in splicing at individual nucleotide resolution. Nat Struct Mol Biol 17 , 909-915 (2010). Rossbach, O. et al. Crosslinking-immunoprecipitation (iCLIP) analysis reveals global regulatory roles of hnRNP L. RNA Biol 11 , 146-155 (2014). Hussain, S. et al. NSun2-mediated cytosine-5 methylation of vault noncoding RNA determines its processing into regulatory small RNAs. Cell Rep 4 , 255-261 (2013). Tollervey, J.R. et al. Characterizing the RNA targets and position-dependent splicing regulation by TDP-43. Nat Neurosci 14 , 452-458 (2011). Wang, Z. et al. iCLIP predicts the dual splicing effects of TIA-RNA interactions. PLoS Biol 8 , e1000530 (2010). Yang, Y.C. et al. CLIPdb: a CLIP-seq database for protein-RNA interactions. BMC Genomics 16 , 51 (2015). Tang, Y. et al. m6A-Atlas: a comprehensive knowledgebase for unraveling the N6-methyladenosine (m6A) epitranscriptome. Nucleic Acids Res 49 , D134-D143 (2021). Yamada, K. & Hamada, M. Prediction of RNA-protein interactions using a nucleotide language model. Bioinform Adv 2 , vbac023 (2022). Zhang, H., Zhang, L., Mathews, D.H. & Huang, L. LinearPartition: linear-time approximation of RNA folding partition function and base-pairing probabilities. Bioinformatics 36 , i258-i267 (2020). Fan, F. et al. Elevated Mast Cell Abundance Is Associated with Enrichment of CCR2+ Cytotoxic T Cells and Favorable Prognosis in Lung Adenocarcinoma. Cancer Res 83 , 2690-2703 (2023). 
Declarations

Data Availability
UltraSelex HTS raw data supporting the findings of this study are available for academic use in the online repository (https://www.ultrarnalab.com). A mirror data repository can be accessed on the NCBI Sequence Read Archive under BioProject PRJNA1216547. Other HTS data utilized in this study are available on Zenodo (https://doi.org/10.5281/zenodo.15294875).

Code Availability
The code is freely available at CodeOcean (https://codeocean.com/capsule/1240603/tree/v1) as well as the online repository (https://www.ultrarnalab.com).

Acknowledgements
We thank BAAI-Health Center members, S. Li (EMBL), and M. Möhler (IPMB) for discussions; B. Suess (TU Darmstadt), J. Taipale (University of Cambridge), M. Meyer (Boston College), D. Burke (University of Missouri), and H. Craighead (Cornell University) for access to SELEX library information; W. Nickel (Heidelberg University) for access to BLI-Octet; D. Ibberson (Heidelberg University) and V. Benes (EMBL) for access to HTS; Y. Wu (Heidelberg University) for figure design; BAAI-JIUDING for GPU cluster computation resources and data storage; SDS@hd for data sharing; and bwForCluster Helix for CPU cluster computation resources.

Contributions
Y.Z., H.W., Z.C., and Q.Y. conceptualized the study. H.W., Z.C., W.H., and Y.Z. performed UltraGen modeling and data analysis. Y.Z., Y.J., and J.Z. constructed HTS RNA libraries and measured ligand affinities. W.L. and Y.Z. simulated RNA-ligand interactions. H.W. and H.X. constructed the UltraGen API server. Y.Z., H.W., and Z.C. wrote the original draft. Y.Z. supervised this study. A.J., Y.F., and all other authors reviewed and edited the draft.

Competing interests
The authors declare no competing interests.

References
1. Duss, O., Stepanyuk, G.A., Puglisi, J.D. & Williamson, J.R. Transient Protein-RNA Interactions Guide Nascent Ribosomal RNA Folding. Cell 179, 1357-1369 e1316 (2019).
2. Van Treeck, B. & Parker, R. Emerging Roles for Intermolecular RNA-RNA Interactions in RNP Assemblies. Cell 174, 791-802 (2018).
3. Van Nostrand, E.L. et al. A large-scale binding and functional map of human RNA-binding proteins. Nature 583, 711-719 (2020).
4. Zhang, J., Wang, L., Jäschke, A. & Sunbul, M. A Color-Shifting Near-Infrared Fluorescent Aptamer-Fluorophore Module for Live-Cell RNA Imaging. Angew Chem Int Ed Engl 60, 21441-21448 (2021).
5. Boussebayle, A. et al. Next-level riboswitch development-implementation of Capture-SELEX facilitates identification of a new synthetic riboswitch. Nucleic Acids Res 47, 4883-4895 (2019).
6. Ameta, S., Winz, M.L., Previti, C. & Jäschke, A. Next-generation sequencing reveals how RNA catalysts evolve from random space. Nucleic Acids Res 42, 1303-1310 (2014).
7. Jolma, A. et al. Binding specificities of human RNA-binding proteins toward structured and linear RNA sequences. Genome Res 30, 962-973 (2020).
8. Pei, S., Slinger, B.L. & Meyer, M.M. Recognizing RNA structural motifs in HT-SELEX data for ribosomal protein S15. BMC Bioinformatics 18, 298 (2017).
9. Whatley, A.S. et al. Potent Inhibition of HIV-1 Reverse Transcriptase and Replication by Nonpseudoknot, "UCAA-motif" RNA Aptamers. Mol Ther Nucleic Acids 2, e71 (2013).
10. Camorani, S. et al. Novel Aptamers Selected on Living Cells for Specific Recognition of Triple-Negative Breast Cancer. iScience 23, 100979 (2020).
11. Nguyen Quang, N., Bouvier, C., Henriques, A., Lelandais, B. & Duconge, F. Time-lapse imaging of molecular evolution by high-throughput sequencing. Nucleic Acids Res 46, 7480-7494 (2018).
12. De La Fuente, A. et al. Aptamers against mouse and human tumor-infiltrating myeloid cells as reagents for targeted chemotherapy. Sci Transl Med 12 (2020).
13. Van Simaeys, D. et al. RNA aptamers specific for transmembrane p24 trafficking protein 6 and Clusterin for the targeted delivery of imaging reagents and RNA therapeutics to human beta cells. Nat Commun 13, 1815 (2022).
14. ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57-74 (2012).
15. Strazar, M., Zitnik, M., Zupan, B., Ule, J. & Curk, T. Orthogonal matrix factorization enables integrative analysis of multiple RNA binding proteins. Bioinformatics 32, 1527-1535 (2016).
16. Alipanahi, B., Delong, A., Weirauch, M.T. & Frey, B.J. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nature Biotechnology 33, 831-838 (2015).
17. Ishida, R. et al. RaptRanker: in silico RNA aptamer selection from HT-SELEX experiment based on local sequence and structure information. Nucleic Acids Res 48, e82 (2020).
18. Bashir, A. et al. Machine learning guided aptamer refinement and discovery. Nat Commun 12, 2366 (2021).
19. Chen, J.C. et al. Generating experimentally unrelated target molecule-binding highly functionalized nucleic-acid polymers using machine learning. Nat Commun 13, 4541 (2022).
20. Iwano, N., Adachi, T., Aoki, K., Nakamura, Y. & Hamada, M. Generative aptamer discovery using RaptGen. Nature Computational Science 2, 378-386 (2022).
21. Rube, H.T. et al. Prediction of protein-ligand binding affinity from sequencing data with interpretable machine learning. Nature Biotechnology 40, 1520-1527 (2022).
22. Zhang, Y. et al. Single-step discovery of high-affinity RNA ligands by UltraSelex. Nat Chem Biol (2025).
23. Muller, F. et al. A prebiotically plausible scenario of an RNA-peptide world. Nature 605, 279-284 (2022).
24. Lincoln, T.A. & Joyce, G.F. Self-sustained replication of an RNA enzyme. Science 323, 1229-1232 (2009).
25. Corley, M., Burns, M.C. & Yeo, G.W. How RNA-Binding Proteins Interact with RNA: Molecules and Mechanisms. Mol Cell 78, 9-29 (2020).
26. Su, J. et al. RoFormer: Enhanced transformer with Rotary Position Embedding. Neurocomputing 568 (2024).
27. Chen, J. et al. Interpretable RNA Foundation Model from Unannotated Data for Highly Accurate RNA Structure and Function Predictions. arXiv preprint (2022).
28. Wirth, R., Gao, P., Nienhaus, G.U., Sunbul, M. & Jäschke, A. SiRA: A Silicon Rhodamine-Binding Aptamer for Live-Cell Super-Resolution RNA Imaging. J Am Chem Soc 141, 7562-7571 (2019).
29. Wang, X., Zhang, M., Long, C., Yao, L. & Zhu, M. Self-Attention Based Neural Network for Predicting RNA-Protein Binding Sites. IEEE/ACM Trans Comput Biol Bioinform 20, 1469-1479 (2023).
30. Akiyama, M. & Sakakibara, Y. Informative RNA base embedding for RNA structural alignment and clustering by deep representation learning. NAR Genom Bioinform 4, lqac012 (2022).
31. Mayr, C. Evolution and Biological Roles of Alternative 3′UTRs. Trends in Cell Biology 26, 227-237 (2016).
32. You, L. et al. APASdb: a database describing alternative poly(A) sites and selection of heterogeneous cleavage sites downstream of poly(A) signals. Nucleic Acids Res 43, D59-67 (2015).
33. Yang, Y. et al. Deciphering 3'UTR Mediated Gene Regulation Using Interpretable Deep Representation Learning. Adv Sci (Weinh), e2407013 (2024).
34. Gruber, A.R. et al. Global 3′ UTR shortening has a limited effect on protein abundance in proliferating T cells. Nature Communications 5 (2014).
35. Malone, B., Urakova, N., Snijder, E.J. & Campbell, E.A. Structures and functions of coronavirus replication-transcription complexes and their relevance for SARS-CoV-2 drug design. Nat Rev Mol Cell Biol 23, 21-39 (2022).
36. Salgado, D.M. et al. Heart and skeletal muscle are targets of dengue virus infection. Pediatr Infect Dis J 29, 238-242 (2010).
37. Takeda, M. et al. A human lung carcinoma cell line supports efficient measles virus growth and syncytium formation via a SLAM- and CD46-independent mechanism. J Virol 81, 12091-12096 (2007).
38. Oldstone, M.B. et al. Measles virus infection in a transgenic model: virus-induced immunosuppression and central nervous system disease. Cell 98, 629-640 (1999).
39. Grant, R.A. et al. Circuits between infected macrophages and T cells in SARS-CoV-2 pneumonia. Nature 590, 635-641 (2021).
40. van Zundert, G.C.P. et al. The HADDOCK2.2 Web Server: User-Friendly Integrative Modeling of Biomolecular Complexes. J Mol Biol 428, 720-725 (2016).
41. Abramson, J. et al. Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature (2024).
42. Wang, S. et al. Dynamic regulation and functions of mRNA m6A modification. Cancer Cell Int 22, 48 (2022).
43. Jing, Q. et al. Involvement of microRNA in AU-rich element-mediated mRNA instability. Cell 120, 623-634 (2005).
44. Sandberg, R., Neilson, J.R., Sarma, A., Sharp, P.A. & Burge, C.B. Proliferating cells express mRNAs with shortened 3' untranslated regions and fewer microRNA target sites. Science 320, 1643-1647 (2008).
45. Mitschka, S. & Mayr, C. Context-specific regulation and function of mRNA alternative polyadenylation. Nat Rev Mol Cell Biol 23, 779-796 (2022).
46. Zou, T. et al. Polyamines regulate the stability of JunD mRNA by modulating the competitive binding of its 3' untranslated region to HuR and AUF1. Mol Cell Biol 30, 5021-5032 (2010).
47. Brant, A.C., Tian, W., Majerciak, V., Yang, W. & Zheng, Z.M. SARS-CoV-2: from its discovery to genome structure, transcription, and replication. Cell Biosci 11, 136 (2021).
48. Lee, C.W., Li, L. & Giedroc, D.P. The solution structure of coronaviral stem-loop 2 (SL2) reveals a canonical CUYG tetraloop fold. FEBS Lett 585, 1049-1053 (2011).
49. Stroup, E.K. & Ji, Z. Deep learning of human polyadenylation sites at nucleotide resolution reveals molecular determinants of site usage and relevance in disease. Nat Commun 14, 7378 (2023).
50. Griesemer, D. et al. Genome-wide functional screen of 3'UTR variants uncovers causal variants for human disease and evolution. Cell 184, 5247-5260 e5219 (2021).

Additional Declarations
There is NO Competing Interest.
Supplementary Files nreditorialpolicychecklistflatten.pdf editorial policy checklist nrreportingsummaryflatten.pdf reporting summary nrsoftwarepolicyflatten.pdf software policy UltraGenSI20240428.pdf Supplementary Materials © Research Square 2026 | ISSN 2693-5015 (online) {"props":{"pageProps":{"initialData":{"identity":"rs-4461517","acceptedTermsAndConditions":true,"allowDirectSubmit":false,"archivedVersions":[{"code":1,"date":"2024-07-08 05:58:14","editorialEvents":[{"type":"communityComments","content":0}],"status":"published","journal":{"display":true,"email":"[email protected]","identity":"nature-communications","isNatureJournal":true,"hasQc":false,"allowDirectSubmit":false,"externalIdentity":"NCOMMS","sideBox":"Learn more about [Nature Communications](http://www.nature.com/ncomms/)","snPcode":"","submissionUrl":"https://mts-ncomms.nature.com/","title":"Nature Communications","twitterHandle":"","acdcEnabled":true,"dfaEnabled":true,"editorialSystem":"ejp","reportingPortfolio":"Nature
Communications","inReviewEnabled":true,"inReviewRevisionsEnabled":false}}],"articleType":"Article","associatedPublications":[],"authors":[{"id":452982830,"identity":"2c6c39c7-3a14-4b97-a3ad-c876da1df452","order_by":0,"name":"Yaqing Zhang","email":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAZAAAAAyAQMAAABI0h/eAAAABlBMVEX///8AAABVwtN+AAAACXBIWXMAAA7EAAAOxAGVKw4bAAABR0lEQVRIie3Rv0vDQBQH8BcCzXKx6yst6J9wEKgWK/lXUgqZ0hgoSIcOBSFT6KzQP0LxDzDloF2icczg0FLoYoaAS5FSvDsQoeePVTDf4Xjce5+74QGUKfNHE4uDiqMAIID8hsCJI3vkF6JdfRL8kcAH0eUAysnvCZ2zWxa8gX1sPLCXs/C5AfXLaZwP0K/W50+QD1SSuAG7HoPeinz3tBeuCTRmznSSYL829gJtkqgk9igzI6jwomn1QkbsTNyE2LlJiKOboUrSXBLCi6bV4gTwvGDmDjv3kuxUIt4kG0BeWCtNEg+YOeK/ECPWzZFCatk6EAOUZnlTix4Fcel0MsM+JgR4oZCDtHv3SrZtm6aeVWwumA3YXRX5sO1XI2O5yIcKOYr5RrRQ7AIquNckNFYAwOFILHEra73YaxqLL0SZMmXK/L+8A7GHcy5G/MxaAAAAAElFTkSuQmCC","orcid":"https://orcid.org/0000-0003-2879-3068","institution":"Heidelberg University","correspondingAuthor":true,"prefix":"","firstName":"Yaqing","middleName":"","lastName":"Zhang","suffix":""},{"id":452982831,"identity":"a5fddce6-5deb-4a3b-8ab0-5dd34917ef27","order_by":1,"name":"Hui Wang","email":"","orcid":"","institution":"Beijing Academy of Artificial Intelligence (BAAI)","correspondingAuthor":false,"prefix":"","firstName":"Hui","middleName":"","lastName":"Wang","suffix":""},{"id":452982832,"identity":"5d7760a0-224c-4ca1-b78a-6be9fc0c53df","order_by":2,"name":"Zhaoming Chen","email":"","orcid":"","institution":"Beijing Academy of Artificial Intelligence (BAAI)","correspondingAuthor":false,"prefix":"","firstName":"Zhaoming","middleName":"","lastName":"Chen","suffix":""},{"id":452982833,"identity":"96fdfabc-8d84-4a9f-8a16-1eb487c766be","order_by":3,"name":"Wenjun Lin","email":"","orcid":"","institution":"Beijing Academy of Artificial Intelligence 
(BAAI)","correspondingAuthor":false,"prefix":"","firstName":"Wenjun","middleName":"","lastName":"Lin","suffix":""},{"id":452982834,"identity":"90587746-04e5-49d0-89c2-21e085e5d939","order_by":4,"name":"Yuan Jiang","email":"","orcid":"","institution":"Heidelberg University","correspondingAuthor":false,"prefix":"","firstName":"Yuan","middleName":"","lastName":"Jiang","suffix":""},{"id":452982835,"identity":"f36f2b81-d76d-473e-a778-2769da080c07","order_by":5,"name":"Jingye Zhang","email":"","orcid":"","institution":"Heidelberg University","correspondingAuthor":false,"prefix":"","firstName":"Jingye","middleName":"","lastName":"Zhang","suffix":""},{"id":452982836,"identity":"36865ae0-281e-4e7a-864f-f1262f17f0c6","order_by":6,"name":"Wenhao Huang","email":"","orcid":"","institution":"Beijing Academy of Artificial Intelligence (BAAI)","correspondingAuthor":false,"prefix":"","firstName":"Wenhao","middleName":"","lastName":"Huang","suffix":""},{"id":452982837,"identity":"1f27136b-cf2a-41eb-afd4-0d4feb13f8d9","order_by":7,"name":"Yonggui Fu","email":"","orcid":"","institution":"Sun Yat-sen University","correspondingAuthor":false,"prefix":"","firstName":"Yonggui","middleName":"","lastName":"Fu","suffix":""},{"id":452982838,"identity":"37623503-8374-4cd0-8be8-2d48098b2540","order_by":8,"name":"Hongwang Xiao","email":"","orcid":"","institution":"Beijing Academy of Artificial Intelligence (BAAI)","correspondingAuthor":false,"prefix":"","firstName":"Hongwang","middleName":"","lastName":"Xiao","suffix":""},{"id":452982839,"identity":"f2a7305d-4099-420f-84ab-07ba8c356da8","order_by":9,"name":"David Kuster","email":"","orcid":"","institution":"Max Planck Institute of Molecular Cell Biology and Genetics (MPI-CBG)","correspondingAuthor":false,"prefix":"","firstName":"David","middleName":"","lastName":"Kuster","suffix":""},{"id":452982840,"identity":"8e1cec02-296b-4c34-88db-7798439263f2","order_by":10,"name":"Andres Jäschke","email":"","orcid":"","institution":"Heidelberg 
University","correspondingAuthor":false,"prefix":"","firstName":"Andres","middleName":"","lastName":"Jäschke","suffix":""},{"id":452982841,"identity":"d6c740d9-923f-4ab2-b315-57ceec6eb902","order_by":11,"name":"Qiwei Ye","email":"","orcid":"","institution":"Beijing Academy of Artificial Intelligence, Beijing, China","correspondingAuthor":false,"prefix":"","firstName":"Qiwei","middleName":"","lastName":"Ye","suffix":""}],"badges":[],"createdAt":"2024-05-22 14:15:23","currentVersionCode":2,"declarations":"","doi":"10.21203/rs.3.rs-4461517/v2","doiUrl":"https://doi.org/10.21203/rs.3.rs-4461517/v2","draftVersion":[],"editorialEvents":[],"editorialNote":"","failedWorkflow":false,"files":[{"id":82231080,"identity":"eda95168-555d-45e9-96a4-3c57517ab444","added_by":"auto","created_at":"2025-05-08 06:05:16","extension":"png","order_by":1,"title":"Figure 1","display":"","copyAsset":false,"role":"figure","size":216227,"visible":true,"origin":"","legend":"\u003cp\u003eModel architecture and ranking performance of UltraGen.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003ea,\u003c/strong\u003e UltraGen employs masked self-supervised learning for capturing RNA-target binding context in artificial and natural RNA evolution. \u003cstrong\u003eb,\u003c/strong\u003e UMAP projections of the top one million RNA binders from UltraSelex SiR-B by UltraGen (top panel) and RNA-FM (bottom panel), colored by UltraSelex \u003cem\u003eauc\u003c/em\u003e values (“Indicative binding score”). \u003cstrong\u003ec,\u003c/strong\u003e Model ranking comparison for top 1% binders on the held-out test set between UltraGen and other feature-based and pre-trained models. 
Error bars represent the mean ± standard deviation (sd), with n=5 model replicates.\u003c/p\u003e","description":"","filename":"floatimage1.png","url":"https://assets-eu.researchsquare.com/files/rs-4461517/v2/f024c7e7872b20b9e83c5992.png"},{"id":82231085,"identity":"201ee881-c7d2-4304-93fa-d6640df3898e","added_by":"auto","created_at":"2025-05-08 06:05:17","extension":"png","order_by":2,"title":"Figure 2","display":"","copyAsset":false,"role":"figure","size":262681,"visible":true,"origin":"","legend":"\u003cp\u003eUltraGen identifies RNA aptamers against various targets \u003cem\u003ein vitro\u003c/em\u003e.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003ea,\u003c/strong\u003e Distribution discrepancy of hexamer units from RNA species in SELEX systems, distinguishing enriched (“Binding”) and not enriched (“Non-binding”) species by their distribution peaks, marked by dashed lines with arrows. The inset line plot shows model performance upon removing no (“N”), certain (“C”), and most (“M”) similar species from the dataset. Error bars represent mean ± sd, with n=3 model replicates. \u003cstrong\u003eb,\u003c/strong\u003e Comprehensive ranking of top binders in SELEX datasets by UltraGen and other benchmark models. SELEX dataset abbreviations: DAse (Diels-Alderase), BC (Benzopyrylium-coumarin fluorophores), PR (Paromomycin), MI (Mechanistic inhibitor of serine proteases PPACK), TARDBP (TAR DNA binding protein 43), RT (HIV-1 reverse transcriptase), RMB24 (RNA-binding motif protein 24), S15 (Ribosomal protein S15), ISLETS (Human islets), MDSC (Myeloid-derived suppressor cells), CHO-K1 (Chinese hamster ovary K1 cells), and TNBC (Triple-negative breast cancer cells). See ranking criteria in Supplementary Tables 3-5. 
Error bars represent the mean + sd, with n=5 model replicates.\u003c/p\u003e","description":"","filename":"floatimage2.png","url":"https://assets-eu.researchsquare.com/files/rs-4461517/v2/d3e9671e3c7e02ca562c23b8.png"},{"id":82229971,"identity":"3ad9a630-cd4c-4083-becc-2feefdf19b66","added_by":"auto","created_at":"2025-05-08 05:40:58","extension":"png","order_by":3,"title":"Figure 3","display":"","copyAsset":false,"role":"figure","size":266294,"visible":true,"origin":"","legend":"\u003cp\u003eUltraGen resolves tissue specificity of human 3’-UTRs.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003ea,\u003c/strong\u003e Comparison of multi-classification performance for individual 3’-UTRs across 22 human tissues by UltraGen and other models. The tissue specificity prediction for each RNA species in each tissue (\u0026gt;0.5 for presence and ≤ 0.5 for absence) was compared with its actual presence or absence. The average classification metrics across the 22 human tissues are presented, with error bars representing mean + sd from n=5 model replicates. \u003cstrong\u003eb,\u003c/strong\u003e Detailed multi-classification metrics for tissue specificity on individual human tissues by UltraGen. \u003cstrong\u003ec,\u003c/strong\u003e Length specificity of input RNA for model prediction of tissue classification. Tandem 3’-end sequences (50 - 300 nt) were analyzed using the UltraGen model fine-tuned with 100 nt sequences. Error bands indicate sd, with n=5 model replicates. \u003cstrong\u003ed,\u003c/strong\u003e Length specificity comparison with UltraGen variants fine-tuned with 50, 100, and 150 nt sequences. Violin plots show optimal performance for test sequences ranging from 50 to 300 nt, with n=5 model replicates. The bar inside each violin represents the interquartile range spanning the 25% to 75% quantiles, with a point indicating the median. 
\u003cstrong\u003ee,\u003c/strong\u003e Distribution of polyadenylation \u003cem\u003ecis\u003c/em\u003e-regulatory elements from the upstream region of the cleavage site across all 22 human tissues. G- and U-rich motifs are defined as hexamers starting with these nucleotides and containing five or more of the same nucleotide\u003csup\u003e49\u003c/sup\u003e. CU/GU-rich motifs are referenced accordingly\u003csup\u003e50\u003c/sup\u003e. \u003cstrong\u003ef,\u003c/strong\u003e Impact of single nucleotide mutation on predicting tissue specificity along the tandem 3’-end region. For each tissue type \u003cem\u003ei\u003c/em\u003e, 1000 RNA species were randomly selected from the test set for the analysis. Single nucleotide substitutions (A, C, G, and U) were introduced at each position within these sequences. The change in probability (“prob”) was defined as \u003cem\u003eProb \u003c/em\u003e\u003csub\u003etissue_type_\u003c/sub\u003e\u003csub\u003e\u003cem\u003ei \u003c/em\u003e\u003c/sub\u003e\u0026nbsp;seq_mutant - \u003cem\u003eProb\u003c/em\u003e\u003csub\u003e tissue_type_i \u003c/sub\u003eseq_origin, with lower values indicating a greater impact of the mutation on tissue specificity prediction. The line plot displays average probability changes across all 22 tissues, with error bands indicating sd. 
The box plot shows the distribution of summed probability changes for single species, and tissues were ranked by mutation effect dynamics.\u0026nbsp;\u003c/p\u003e","description":"","filename":"floatimage3.png","url":"https://assets-eu.researchsquare.com/files/rs-4461517/v2/70b0d00d11eb60c0b35016b3.png"},{"id":82231220,"identity":"ac8b467c-e400-4dbd-88a6-6397a01d85c9","added_by":"auto","created_at":"2025-05-08 06:06:08","extension":"png","order_by":4,"title":"Figure 4","display":"","copyAsset":false,"role":"figure","size":260586,"visible":true,"origin":"","legend":"\u003cp\u003eCharacterization of RNA virus interaction features by UltraGen.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003ea,\u003c/strong\u003e Characterization of tissue-specific hallmarks in the 3’-UTRs of human-pathogenic RNA viruses by UltraGen. RNA virus abbreviations: TBE (Tick-borne encephalitis virus), MERS (Middle East respiratory syndrome-related coronavirus), Coxsackie (Human coxsackievirus A2 strain Fleetwood). Other sequence information is provided in Supplementary Table 6. \u003cstrong\u003eb,\u003c/strong\u003e Predominant tissue-specific classification of 3’-UTRs from SARS-CoV-2 variants over time. \u003cstrong\u003ec,\u003c/strong\u003e UMAP projections of the top one million RNA binders from UltraSelex Nsp-B by UltraGen, colored by UltraSelex \u003cem\u003eauc\u003c/em\u003e values. \u003cstrong\u003ed,\u003c/strong\u003e Histogram of 5-mer distribution of high-\u003cem\u003eauc\u003c/em\u003e RNA species from the centered cluster denoted by the red rectangle in Fig. 4c. \u003cstrong\u003ee,\u003c/strong\u003e Zero-shot inference of nsp12 RNA binder (113-50H+T) at single-base resolution by UltraGen. \u003cstrong\u003ef,\u003c/strong\u003e Binding affinity of RNA mutants for the nsp12 protein, measured by bio-layer interferometry. The likelihood of each base (“predictive binding”) correlated with the experimental binding affinity. 
For mutant abbreviations, see Supplementary Table 9.\u003c/p\u003e","description":"","filename":"floatimage4.png","url":"https://assets-eu.researchsquare.com/files/rs-4461517/v2/b6711e6e7717c0987a5c3724.png"},{"id":82229980,"identity":"82c7439c-0c54-4a78-8434-4e477f354c1c","added_by":"auto","created_at":"2025-05-08 05:40:58","extension":"png","order_by":5,"title":"Figure 5","display":"","copyAsset":false,"role":"figure","size":283356,"visible":true,"origin":"","legend":"\u003cp\u003eExpansion of pre-training RNA sources for optimizing model performance.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003ea, \u003c/strong\u003eComparison of top-binder ranking on SELEX datasets by UltraGen and its continued pre-trained variants. UltraGen\u003csup\u003emolecules\u003c/sup\u003e and UltraGen\u003csup\u003e3UTR\u003c/sup\u003e were obtained by continued pre-training on the RNA sequences against the twelve SELEX targets and on endogenous RNA sequences from the preliminary 3’-UTR dataset, respectively. UltraGen\u003csup\u003eplus\u003c/sup\u003e was pre-trained using the unified RNA sequences from the twelve SELEX datasets and 3’-UTR sequences. SELEX dataset abbreviations are the same as in Fig. 2b. Error bars represent the mean + sd, with n=5 model replicates. \u003cstrong\u003eb, \u003c/strong\u003eComparison of human tissue specificity classification between UltraGen and the continued pre-trained UltraGen variants from panel a. The analysis is identical to that in Fig. 3a. Error bars represent the mean + sd, with n=5 model replicates. \u003cstrong\u003ec,\u003c/strong\u003e Comparison of binary classification of human RNA-protein interactions from the iCLIP datasets between UltraGen variants and other models. 
The average performance metrics across the eleven iCLIP datasets are presented.\u003cstrong\u003e\u0026nbsp;\u003c/strong\u003e\u003c/p\u003e","description":"","filename":"floatimage5.png","url":"https://assets-eu.researchsquare.com/files/rs-4461517/v2/7e3d9ef50f7593bb7b1fa4a6.png"},{"id":82231233,"identity":"19fcb3df-02b9-4008-b237-4f346072a4f1","added_by":"auto","created_at":"2025-05-08 06:06:18","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":2980809,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-4461517/v2/23412d8b-e287-4b61-96c5-e8d48399ce3b.pdf"},{"id":82231088,"identity":"0796f5d4-87bd-4049-b64b-13645e62ac23","added_by":"auto","created_at":"2025-05-08 06:05:25","extension":"pdf","order_by":1,"title":"","display":"","copyAsset":false,"role":"supplement","size":54056,"visible":true,"origin":"","legend":"editorial policy checklist","description":"","filename":"nreditorialpolicychecklistflatten.pdf","url":"https://assets-eu.researchsquare.com/files/rs-4461517/v2/979c706790eb9dcc4182ce08.pdf"},{"id":82229968,"identity":"073a9379-5b1c-49f1-b9c4-c95da47b5dfa","added_by":"auto","created_at":"2025-05-08 05:40:57","extension":"pdf","order_by":2,"title":"","display":"","copyAsset":false,"role":"supplement","size":50751,"visible":true,"origin":"","legend":"reporting summary","description":"","filename":"nrreportingsummaryflatten.pdf","url":"https://assets-eu.researchsquare.com/files/rs-4461517/v2/4b7b6c0fa9077fed930487e1.pdf"},{"id":82229266,"identity":"d4766aa7-4205-4c14-8801-331ac0ed4aa1","added_by":"auto","created_at":"2025-05-08 05:32:58","extension":"pdf","order_by":3,"title":"","display":"","copyAsset":false,"role":"supplement","size":149083,"visible":true,"origin":"","legend":"software 
policy","description":"","filename":"nrsoftwarepolicyflatten.pdf","url":"https://assets-eu.researchsquare.com/files/rs-4461517/v2/5a69dbb6bdcd671742e9acac.pdf"},{"id":82229282,"identity":"7544e1cd-260c-4efe-9e26-db8d9930b982","added_by":"auto","created_at":"2025-05-08 05:32:58","extension":"pdf","order_by":4,"title":"","display":"","copyAsset":false,"role":"supplement","size":6169870,"visible":true,"origin":"","legend":"Supplementary Materials","description":"","filename":"UltraGenSI20240428.pdf","url":"https://assets-eu.researchsquare.com/files/rs-4461517/v2/a256a186cce372d89ac99b4d.pdf"}],"financialInterests":"There is \u003cb\u003eNO\u003c/b\u003e Competing Interest.","formattedTitle":"Decoding the RNA binding systems by UltraGen","fulltext":[{"header":"Introduction","content":"\u003cp\u003eRNA is central to biology, evolution, and biotechnology, interacting crucially with small molecules, proteins, and other ligands. These sequence-specific interactions, spanning a wide range of affinities and specificities, mediate essential cellular processes in all forms of life, with even weak interactions often playing regulatory roles\u003csup\u003e\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e, \u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e\u003c/sup\u003e. Comprehensively profiling RNA intermolecular interactions is critical for decoding the RNA binding systems.\u003c/p\u003e \u003cp\u003eHigh-throughput sequencing (HTS) has been applied to characterize diverse RNA-target interactions\u003csup\u003e\u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e\u003c/sup\u003e. 
\u003cem\u003eIn vitro\u003c/em\u003e SELEX (\u003cspan type=\"Underline\" class=\"Underline\" name=\"Emphasis\"\u003eS\u003c/span\u003eystematic \u003cspan type=\"Underline\" class=\"Underline\" name=\"Emphasis\"\u003eE\u003c/span\u003evolution of \u003cspan type=\"Underline\" class=\"Underline\" name=\"Emphasis\"\u003eL\u003c/span\u003eigands by \u003cspan type=\"Underline\" class=\"Underline\" name=\"Emphasis\"\u003eEX\u003c/span\u003eponential \u003cspan type=\"Underline\" class=\"Underline\" name=\"Emphasis\"\u003eE\u003c/span\u003enrichment) iteratively enriches high-affinity RNA binders\u003csup\u003e\u003cspan additionalcitationids=\"CR5 CR6 CR7 CR8 CR9 CR10 CR11 CR12\" citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e\u003c/sup\u003e, whereas \u003cem\u003ein vivo\u003c/em\u003e CLIP (Cross-linking and immunoprecipitation) records native interactions\u003csup\u003e\u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e, \u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e\u003c/sup\u003e. Compared to the multivalent interactions of endogenous RNAs, \u003cem\u003ein vitro\u003c/em\u003e enriched RNAs exhibit more target-specific properties. In the last decade, deep learning methods have been developed to refine RNA-target interactions from SELEX datasets\u003csup\u003e\u003cspan additionalcitationids=\"CR17 CR18 CR19 CR20\" citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e\u003c/sup\u003e. However, the inherent bias of SELEX toward discarding weak binders limits the generalizability of self-supervised RNA models. 
To address this gap, the recently introduced UltraSelex captures a series of washing eluates that retain weak (or transient) binders typically discarded during SELEX washing steps, providing a broader landscape of RNA interactions with varying strengths\u003csup\u003e\u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e22\u003c/span\u003e\u003c/sup\u003e.\u003c/p\u003e \u003cp\u003eHere, we present UltraGen, an RNA language model pre-trained on 10\u0026nbsp;million UltraSelex RNA species. UltraGen decodes RNA binding characteristics by embedding sequences in a latent space that reflects quantitative binding potential. It outperforms existing state-of-the-art models across twelve diverse RNA binding targets, including small molecules, proteins, cells, and tissues, with precision up to 75%. When fine-tuned on the 3\u0026rsquo;-untranslated region (3\u0026rsquo;-UTR) profiles from 22 human tissues, UltraGen exhibited a three-fold increase in precision and a ten-fold increase in recall compared to the best benchmark model, highlighting its potential for identifying the tropism of pathogenic RNA viruses such as dengue and measles. Analysis of RNA-binding signatures for the SARS-CoV-2 replicase indicated that a conserved CUUG loop preferentially associates with an adjacent base pair such as A-U or unpaired C-U. Expanding UltraGen's pre-training dataset further enhanced its robustness and applicability to the RNA binding systems. 
By analyzing both \u003cem\u003ein vitro\u003c/em\u003e and \u003cem\u003ein vivo\u003c/em\u003e datasets, UltraGen provides a robust framework for understanding RNA molecular interactions.\u003c/p\u003e"},{"header":"Results","content":"\u003cdiv id=\"Sec3\" class=\"Section2\"\u003e \u003ch2\u003ePre-training UltraGen on RNA binding systems\u003c/h2\u003e \u003cp\u003eRNA binding often involves the electrostatic properties of negatively charged RNA interacting with positively charged ligands\u003csup\u003e\u003cspan citationid=\"CR23\" class=\"CitationRef\"\u003e23\u003c/span\u003e, \u003cspan citationid=\"CR24\" class=\"CitationRef\"\u003e24\u003c/span\u003e\u003c/sup\u003e. For instance, proteins interact with RNA primarily through positively charged amino acids like arginine and lysine\u003csup\u003e\u003cspan citationid=\"CR25\" class=\"CitationRef\"\u003e25\u003c/span\u003e\u003c/sup\u003e. Given fewer geometric constraints in RNA interactions with small molecules, we initiated our study using HTS RNA datasets obtained through UltraSelex on the SiR target\u003csup\u003e\u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e22\u003c/span\u003e\u003c/sup\u003e (\u003cb\u003eExtended Data\u003c/b\u003e Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003ea,b). To unravel the underlying RNA binding language, we utilized self-supervised learning on UltraGen\u0026rsquo;s pre-training kernel designed to reconstruct nucleobases and motifs, aiming to identify chemical and conformational properties that govern RNA-target interactions (Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003ea). 
Additionally, Rotary Position Embedding\u003csup\u003e\u003cspan citationid=\"CR26\" class=\"CitationRef\"\u003e26\u003c/span\u003e\u003c/sup\u003e was employed to better capture long-distance dependencies in the sequence context.\u003c/p\u003e \u003cp\u003eThe top-ranked ten million RNA sequences were divided into training and test sets, ensuring less than 90% sequence identity between them (see details in \u003cb\u003eMethods\u003c/b\u003e). Compared with pre-training on the UltraSelex dataset targeting nsp12, pre-training on the SiR dataset exhibited lower learning loss, likely because its continuous distribution of binding scores provides a broader range of binding potentials (\u003cb\u003eExtended Data\u003c/b\u003e Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003ea-d).\u003c/p\u003e \u003cp\u003eNext, we assessed UltraGen\u0026rsquo;s learning capacity using datasets of various sizes, from one million top-ranked sequences to the full dataset containing over 40\u0026nbsp;million sequences (\u003cb\u003eExtended Data\u003c/b\u003e Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003ee). The optimal model, pre-trained on the top ten million sequences of the UltraSelex SiR dataset, was used in all subsequent experiments. Using UMAP dimensionality reduction, UltraGen revealed distinct RNA clustering correlated with indicative binding scores, outperforming both a randomly initialized model and the RNA-FM model\u003csup\u003e\u003cspan citationid=\"CR27\" class=\"CitationRef\"\u003e27\u003c/span\u003e\u003c/sup\u003e that employs a similar transformer framework (Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003eb and \u003cb\u003eExtended Data\u003c/b\u003e Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003ef). 
These results suggest that reconstructing RNA from a continuously evolving binding context, rather than the model framework, accounts for the effective acquisition of the RNA binding intricacies.\u003c/p\u003e \u003c/div\u003e\n\u003ch3\u003eHigh-performance multi-classification by UltraGen\u003c/h3\u003e\n\u003cp\u003eWe then assessed UltraGen\u0026rsquo;s ability to rank SiR RNA binders across the entire UltraSelex dataset, focusing on its multi-classification capability (\u003cb\u003eSupplementary Table\u0026nbsp;1\u003c/b\u003e). During fine-tuning, RNA sequences were labeled by their binding potential, with each class split into training, validation, and test sets at a ratio of 6:1:3. Despite several orders of magnitude of multi-class imbalance, UltraGen achieved 78% precision in identifying RNA species with the top 0.01% SiR-binding ability on the held-out test sets (\u003cb\u003eExtended Data\u003c/b\u003e Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003eg), a substantial improvement over the randomly initialized model (5.6%). Additionally, UltraGen demonstrated 63% precision in the dataset generated through conventional SELEX for the same target\u003csup\u003e\u003cspan citationid=\"CR28\" class=\"CitationRef\"\u003e28\u003c/span\u003e\u003c/sup\u003e (\u003cb\u003eExtended Data\u003c/b\u003e Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003eh).\u003c/p\u003e \u003cp\u003eTo ensure a fair comparison and mitigate potential sequence reuse during pre-training, 0.83% (120,759) RNA species previously used in unsupervised learning were meticulously excluded, even if their binding labels differed (\u003cb\u003eExtended Data\u003c/b\u003e Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003ei). 
UltraGen outperformed other neural network-coupled models, including RaptRanker\u003csup\u003e\u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e\u003c/sup\u003e, which relied on local sequence-secondary structure-based RNA features, DeepBind\u003csup\u003e\u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e\u003c/sup\u003e, which utilized the convolutional neural network framework, and SA-Net\u003csup\u003e\u003cspan citationid=\"CR29\" class=\"CitationRef\"\u003e29\u003c/span\u003e\u003c/sup\u003e, which employed k-mer embedding. Specifically, UltraGen achieved a\u0026thinsp;~\u0026thinsp;10-fold improvement in ranking precision for identifying top binders from the SiR SELEX dataset, which was not used for pre-training (Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003ec). While both RNABERT\u003csup\u003e\u003cspan citationid=\"CR30\" class=\"CitationRef\"\u003e30\u003c/span\u003e\u003c/sup\u003e and RNA-FM, pre-trained on transformer frameworks, showed improvements, their precision reached only up to 60% of UltraGen\u0026rsquo;s performance. High precision is essential for downstream validation, especially in resource-constrained scenarios. Furthermore, UltraGen demonstrated advantageous precision and F1 scores across other non-top classes (\u003cb\u003eExtended Data\u003c/b\u003e Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003ei), implying its potential to capture weak RNA binders. 
These findings underscore UltraGen\u0026rsquo;s advanced capability in efficiently ranking RNA binders within a comprehensive RNA binding context.\u003c/p\u003e\n\u003ch3\u003eIdentifying RNA aptamers targeted to small molecules, proteins, cells and tissues\u003c/h3\u003e\n\u003cp\u003eWhile RNAs interacting with small molecules exhibit diverse binding motifs with broad structural preferences\u003csup\u003e\u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e22\u003c/span\u003e, \u003cspan citationid=\"CR28\" class=\"CitationRef\"\u003e28\u003c/span\u003e\u003c/sup\u003e, RNA-protein interactions are largely defined by k-mer motifs\u003csup\u003e\u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e\u003c/sup\u003e. To determine whether k-mer motifs primarily determine RNA binding patterns, we analyzed their distribution across targets of varying sizes. Hexamer comparisons between enriched and non-enriched SELEX RNAs showed smaller motif differences for larger targets, even with added secondary structure information (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003ea and \u003cb\u003eExtended Data\u003c/b\u003e Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003ea), implying small ligand binding contains more sequence-based features, whereas RNA-protein interactions are influenced by additional geometric constraints refining motif specificity.\u003c/p\u003e \u003cp\u003eWe then fine-tuned UltraGen using a cohort of published SELEX datasets against 12 distinct targets, ranging from small molecules\u003csup\u003e\u003cspan additionalcitationids=\"CR5\" citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e6\u003c/span\u003e\u003c/sup\u003e, proteins\u003csup\u003e\u003cspan additionalcitationids=\"CR8\" citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e\u0026ndash;\u003cspan 
citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e\u003c/sup\u003e, to (multi)cellular targets\u003csup\u003e\u003cspan additionalcitationids=\"CR11 CR12\" citationid=\"CR10\" class=\"CitationRef\"\u003e10\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e\u003c/sup\u003e. These datasets varied in RNA length, sequence randomness, library architecture, and selection strategies (\u003cb\u003eSupplementary Table\u0026nbsp;2\u0026ndash;5\u003c/b\u003e). RNA categories within each SELEX dataset were partitioned into training, validation, and test sets at a ratio of 6:1:3. In the analysis of small-molecule datasets, RNABERT (pre-trained on the 76,237 human small non-coding RNA (nc-RNA) species), RNA-FM (pre-trained on 23\u0026nbsp;million ncRNA species) and UltraGen models improved their top binder ranking precision compared to the feature-based DeepBind by 10%, 60%, and 160%, respectively (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003eb). This improvement underscores the efficacy of pre-training on an RNA dataset that features a continuous distribution of binding abilities, rather than species-specific or stochastic ones.\u003c/p\u003e \u003cp\u003eCompared to its performance on small-molecules, the precision of RaptRanker, which was derived from protein-targets, declined by over 50% when applied to protein and (multi)cellular targets, indicating the limitations of predicting RNA-target interactions based solely on sequence-based secondary structure analysis. Despite its smaller model size, UltraGen outperformed RNABERT and RNA-FM by up to 346% in ranking precision on protein and (multi)cellular targets. 
Consistent with findings on the SiR binding dataset, pre-trained models generally performed better, showing higher precision and F1 in both top and non-top binder predictions compared to the feature-based models (\u003cb\u003eExtended Data\u003c/b\u003e Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003eb-d). To further examine the effectiveness of the model\u0026rsquo;s prediction, we analyzed the predicted top binders from the held-out test set. While target-specific binders showed distinct binding signatures, most experimentally reported binders were identified in the predominant RNA families among the predicted top binders (\u003cb\u003eExtended Data\u003c/b\u003e Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003ee,f).\u003c/p\u003e \u003cp\u003eTo determine whether UltraGen captures RNA-target binding features solely relying on the sequence similarity, we excluded sequences within various edit distances from the top binders in the downstream analysis. Although eliminating similar species from their sequence family resulted in a noticeable decline in performance, UltraGen still maintained a leading ranking position with minimal shrinkage (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003ea), indicating that its predictions correlated with the learning source that harbors the most structurally and sequence-similar binding information for inferring interaction. 
Taken together, UltraGen comprehensively captured \u003cem\u003ein vitro\u003c/em\u003e RNA-target binding across molecular size scales.\u003c/p\u003e\n\u003ch3\u003eDiscerning 3’-UTR of mRNA in human tissue specificity\u003c/h3\u003e\n\u003cp\u003eThe effectiveness of UltraGen in ranking SELEX datasets from (multi)cellular targets prompted an investigation into its potential for analyzing \u003cem\u003ein vivo\u003c/em\u003e RNA interactions, where the RNA binding context involves multiple targets and extends beyond simple binding.\u003c/p\u003e \u003cp\u003eTo assess the feasibility of fine-tuning UltraGen on endogenous RNA sequences, we examined the distribution of hexamer units in mRNA and ncRNA across species. Our analysis revealed substantial conservation within biological RNA species, particularly among mammals, in contrast to the high diversity observed in SELEX libraries (\u003cb\u003eExtended Data\u003c/b\u003e Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003ea, b). This discrepancy suggests functional RNAs may achieve specificity through subtle but critical sequence features. Supporting this, ubiquitously expressed human genes alter their 3\u0026rsquo;-UTR isoform ratios via alternative polyadenylation (APA) for tissue-specific regulation\u003csup\u003e\u003cspan citationid=\"CR31\" class=\"CitationRef\"\u003e31\u003c/span\u003e\u003c/sup\u003e.\u003c/p\u003e \u003cp\u003eWe further explored the specificity of 3\u0026rsquo;-UTR variants across 22 human primary tissues using HTS data from the APASdb database\u003csup\u003e\u003cspan citationid=\"CR32\" class=\"CitationRef\"\u003e32\u003c/span\u003e\u003c/sup\u003e, which contains 2.89\u0026nbsp;million unique RNA cleavage sites with their frequencies. We extracted the 100 nt upstream sequence of each cleavage site and its abundances. 
As over 75% of RNA species are enriched in only a few tissues (\u003cb\u003eSupplementary Table\u0026nbsp;6\u003c/b\u003e), we investigated tissue specificities based on their presence or absence across tissues. This presented a challenging classification task with 88 categories (22 tissues, four abundance levels) aimed at discerning the specificity and the abundance levels of individual RNA species across tissues (\u003cb\u003eExtended Data\u003c/b\u003e Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003ec). The dataset was randomly split into training, validation, and test sets in an approximate ratio of 6:1:3.\u003c/p\u003e \u003cp\u003eAfter fine-tuning, benchmark models exhibited limited ability to predict tissue specificity, with the lowest precision falling below 4% and recall under 1% (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003ea). Even 3UTRBERT\u003csup\u003e\u003cspan citationid=\"CR33\" class=\"CitationRef\"\u003e33\u003c/span\u003e\u003c/sup\u003e, a model specifically pre-trained on human 3\u0026rsquo;-UTRs, achieved only 20% precision and 3% recall. In contrast, UltraGen robustly classified individual RNAs with various tissue specificities, demonstrating a three-fold improvement in precision and a ten-fold improvement in recall for tissue specificity classification. Moreover, in predicting abundance levels, UltraGen achieved an approximately two-fold increase in both precision and recall compared to the best benchmark models (\u003cb\u003eExtended Data\u003c/b\u003e Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003ed).\u003c/p\u003e \u003cp\u003eTo evaluate the role of low-abundance species, we fine-tuned UltraGen using 3\u0026rsquo;-UTR datasets filtered by various abundance thresholds (\u003cb\u003eSupplementary Table\u0026nbsp;6\u003c/b\u003e). 
Including low-abundant RNAs significantly enhanced prediction performance, with higher precision and F1 scores across test sets (\u003cb\u003eExtended Data\u003c/b\u003e Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003ee). UltraGen effectively classified testis RNA species, which are characterized by relatively homogeneous 3\u0026rsquo;-UTRs due to dominant 3\u0026rsquo;-UTR shortening at preferred polyadenylation sites\u003csup\u003e\u003cspan citationid=\"CR31\" class=\"CitationRef\"\u003e31\u003c/span\u003e\u003c/sup\u003e (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003eb). Conversely, its performance was weakened on spleen RNAs, where dynamic T cell-derived heterogeneous 3\u0026rsquo;-UTR shortening might result in more diverse 3\u0026rsquo;-UTRs, complicating the modulation of the RNA target interactome\u003csup\u003e\u003cspan citationid=\"CR34\" class=\"CitationRef\"\u003e34\u003c/span\u003e\u003c/sup\u003e. These findings highlight tissue-specific variation in high-dimensional biological RNA binding data.\u003c/p\u003e \u003cp\u003eTo investigate the impact of sequence context on model performance, we analyzed sequences extending from the 3\u0026rsquo;- to 5\u0026rsquo;-end, with lengths ranging from 50 nt to 300 nt. The highest F1 score was achieved with the last 100 nt RNA sequences (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003ec). Further fine-tuning the pre-trained model with training data including 50 nt or 150 nt sequences confirmed the optimal performance remained with the last 100 nt sequences (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003ed and \u003cb\u003eExtended Data\u003c/b\u003e Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003ef,g). Canonical polyadenylation signal sites (PASs, e.g. 
AAUAAA) were predominantly located within 30 nt upstream of cleavage sites across different tissues (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003ee and \u003cb\u003eExtended Data\u003c/b\u003e Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003ea). Additionally, the region between \u0026minus;\u0026thinsp;100 and \u0026minus;\u0026thinsp;50 nt, which had a substantial impact on model performance, exhibited a less defined A/U-rich element and cleavage factor I binding motif UGUA (\u003cb\u003eExtended Data\u003c/b\u003e Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003ea,b). Base substitutions within this 100 nt region substantially impacted predictive performance, particularly near the PASs (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003ef). Collectively, the nuances of 3'-UTR sequences, though subtle, are crucial for the models\u0026rsquo; ability to predict tissue-specific distribution.\u003c/p\u003e\n\u003ch3\u003eRNA virus tropism and binding characteristics\u003c/h3\u003e\n\u003cp\u003eHuman-pathogenic RNA viruses exhibit marked tropism\u003csup\u003e\u003cspan citationid=\"CR35\" class=\"CitationRef\"\u003e35\u003c/span\u003e\u003c/sup\u003e characterized by tissue-specific infection and replication due to interactions with host molecules. 
Remarkably, despite viral RNA sequences not being present in the 3\u0026rsquo;-UTR-based fine-tuning of UltraGen, the model's predictions align closely with the known tissue preferences for various viruses, such as the heart and skeletal muscles for dengue\u003csup\u003e\u003cspan citationid=\"CR36\" class=\"CitationRef\"\u003e36\u003c/span\u003e\u003c/sup\u003e and the lungs and spleen for measles\u003csup\u003e\u003cspan citationid=\"CR37\" class=\"CitationRef\"\u003e37\u003c/span\u003e, \u003cspan citationid=\"CR38\" class=\"CitationRef\"\u003e38\u003c/span\u003e\u003c/sup\u003e (Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003ea and \u003cb\u003eSupplementary Table\u0026nbsp;7\u003c/b\u003e). This suggests that viral RNA 3\u0026rsquo;-ends may share similarities with host RNAs in their preferred human tissues. We further extended our investigation to SARS-CoV-2 pandemic variants circulating before 2024. Despite the highly mutated coding regions, the 3\u0026rsquo;-UTRs of these variants remain evolutionarily conserved (\u003cb\u003eExtended Data\u003c/b\u003e Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003ea). UltraGen consistently identified most variants by their tissue tropism toward the lungs and lymph nodes, aligning with the observed pneumonia with an active immune response associated with SARS-CoV-2 infection\u003csup\u003e\u003cspan citationid=\"CR39\" class=\"CitationRef\"\u003e39\u003c/span\u003e\u003c/sup\u003e (Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003eb and \u003cb\u003eSupplementary Table\u0026nbsp;8\u003c/b\u003e). Length-dependent RNA virus tropism was also observed in the model fine-tuned with the full human dataset (\u003cb\u003eExtended Data\u003c/b\u003e Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003eb,c). 
These findings may suggest a synchronized evolution between RNA viruses and the human host, involving conserved 3\u0026rsquo;-UTR regulatory interactions at the sequence level.\u003c/p\u003e \u003cp\u003eNext, we investigated SARS-CoV-2 replication using UltraGen, focusing on the SARS-CoV-2 replicase, which initiates RNA genome synthesis at the 3\u0026rsquo;-UTR\u003csup\u003e\u003cspan citationid=\"CR35\" class=\"CitationRef\"\u003e35\u003c/span\u003e\u003c/sup\u003e. Visualizing RNA species enriched in UltraSelex against the SARS-CoV-2 replicase nsp12 protein\u003csup\u003e\u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e22\u003c/span\u003e\u003c/sup\u003e revealed a central cluster of RNA species with the highest binding potential, effectively distinguished by UltraGen (Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003ec). k-mer analysis of the RNA species in the central cluster identified conserved CUUGA or CUUG motifs (Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003ed and \u003cb\u003eExtended Data\u003c/b\u003e Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003ed), which are critical for nsp12 protein binding\u003csup\u003e\u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e22\u003c/span\u003e\u003c/sup\u003e.\u003c/p\u003e \u003cp\u003eTo assess whether UltraGen can identify and characterize this binding pattern at single-base resolution, we employed masked language modeling to predict the likelihood of each base within the nearby binding context (\u003cb\u003eExtended Data\u003c/b\u003e Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003ee). 
By resolving this motif at single-base resolution, we observed a high degree of linear correlation between the experimentally measured binding affinities of mutants\u003csup\u003e\u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e22\u003c/span\u003e\u003c/sup\u003e and UltraGen's predictions without prior fine-tuning (zero-shot inference) (Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003ee,f and \u003cb\u003eExtended Data\u003c/b\u003e Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003ef). Furthermore, we experimentally measured variants with another single mutation and a double mutation to examine the breadth of replicase binding features with CUUG (\u003cb\u003eExtended Data\u003c/b\u003e Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003eg and \u003cb\u003eSupplementary Table\u0026nbsp;9\u003c/b\u003e). Unlike molecular docking\u003csup\u003e\u003cspan citationid=\"CR40\" class=\"CitationRef\"\u003e40\u003c/span\u003e\u003c/sup\u003e and AlphaFold3 predictions\u003csup\u003e\u003cspan citationid=\"CR41\" class=\"CitationRef\"\u003e41\u003c/span\u003e\u003c/sup\u003e, UltraGen highlighted the stem structure adjacent to the CUUG motif, indicating that the presence of a CUUG motif alone does not ensure binding (Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003ee and \u003cb\u003eExtended Data\u003c/b\u003e Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003eh,i). 
These findings challenge traditional bioinformatic approaches by emphasizing the importance of context-specific structural features beyond conserved motif presence.\u003c/p\u003e \u003cdiv id=\"Sec8\" class=\"Section2\"\u003e \u003ch2\u003eEnhanced performance with broader pre-training sources\u003c/h2\u003e \u003cp\u003eTo assess how pre-training sources influence model performance, we developed two variants, UltraGen\u003csup\u003esource_SiR\u003c/sup\u003e and UltraGen\u003csup\u003esource_nsp12\u003c/sup\u003e, by pre-training exclusively on the SiR and nsp12 SELEX datasets\u003csup\u003e\u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e22\u003c/span\u003e, \u003cspan citationid=\"CR28\" class=\"CitationRef\"\u003e28\u003c/span\u003e\u003c/sup\u003e, respectively, which are enriched for persistent (or stable) RNA binding signatures (\u003cb\u003eExtended Data Fig.\u0026nbsp;6a\u003c/b\u003e). Compared to the original UltraGen, both variants showed diminished performance in identifying top SELEX binders and classifying tissue-specific 3'-UTRs (\u003cb\u003eExtended Data Fig.\u0026nbsp;6b,c\u003c/b\u003e), underscoring the importance of integrating RNA with more transient binding properties to optimize performance in biological RNA binding systems.\u003c/p\u003e \u003cp\u003eTo explore the model\u0026rsquo;s capability to incorporate diverse RNA binding information, we continued pre-training the base UltraGen model on two distinct resources, yielding UltraGen\u003csup\u003emolecules\u003c/sup\u003e, trained with RNA sequences against the 12 SELEX targets, and UltraGen\u003csup\u003e3UTR\u003c/sup\u003e, trained with endogenous 3\u0026rsquo;-UTR datasets (\u003cb\u003eExtended Data Fig.\u0026nbsp;6d,e\u003c/b\u003e). 
In the \u003cem\u003ein vitro\u003c/em\u003e RNA binding systems, UltraGen\u003csup\u003emolecules\u003c/sup\u003e exhibited improved precision in predicting top binders, while UltraGen\u003csup\u003e3UTR\u003c/sup\u003e maintained prediction performance comparable to that of the base UltraGen model (Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003ea and \u003cb\u003eExtended Data Fig.\u0026nbsp;6f\u003c/b\u003e). Similarly, UltraGen\u003csup\u003e3UTR\u003c/sup\u003e outperformed the UltraGen\u003csup\u003emolecules\u003c/sup\u003e model at discerning tissue-specific 3\u0026rsquo;-UTR variants (Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003eb and \u003cb\u003eExtended Data Fig.\u0026nbsp;6e\u003c/b\u003e). These results suggest that continued pre-training of UltraGen on task-specific data can enhance its ability to capture more diverse and intricate RNA binding information. We then generated UltraGen\u003csup\u003eplus\u003c/sup\u003e by integrating both SELEX and 3\u0026rsquo;-UTR resources for continued pre-training (\u003cb\u003eExtended Data Fig.\u0026nbsp;6d\u003c/b\u003e), achieving well-balanced performance (Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003ea,b).\u003c/p\u003e \u003cp\u003eTo extend UltraGen\u0026rsquo;s capacity for transcriptome-scale modeling, we incorporated RNA immunoprecipitation (RIP) datasets from ENCODE\u003csup\u003e\u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e\u003c/sup\u003e for continued pre-training, yielding the UltraGen\u003csup\u003eRIP\u003c/sup\u003e variant. 
Following fine-tuning with the individual-nucleotide-resolution UV crosslinking and immunoprecipitation (iCLIP) database\u003csup\u003e\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e\u003c/sup\u003e (\u003cb\u003eSupplementary Table\u0026nbsp;10\u003c/b\u003e), which covers diverse \u003cem\u003ein vivo\u003c/em\u003e human RNA-protein interactions, UltraGen variants demonstrated leading performance in predicting human RNA-protein interactions (\u003cb\u003eExtended Data Fig.\u0026nbsp;7a\u003c/b\u003e). UltraGen\u003csup\u003eplus\u003c/sup\u003e achieved the highest F1 score of 0.789, followed by UltraGen\u003csup\u003eRIP\u003c/sup\u003e and UltraGen with scores of 0.782 and 0.780, respectively (Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003ec). Consistently, UltraGen\u003csup\u003eplus\u003c/sup\u003e and its variants achieve the highest prediction performance on the mouse CLIP datasets, with an F1 score of up to 0.671 (\u003cb\u003eExtended Data Fig.\u0026nbsp;8a,b\u003c/b\u003e and \u003cb\u003eSupplementary Table\u0026nbsp;11\u003c/b\u003e). Considering the prevalence of m6A methylation in RNA-protein interactions\u003csup\u003e\u003cspan citationid=\"CR42\" class=\"CitationRef\"\u003e42\u003c/span\u003e\u003c/sup\u003e, we further evaluated the model's ability to recognize this RNA modification. All models, except DeepBind, demonstrated strong performance with F1 scores above 0.948, while UltraGen variants showed a marginal advantage, reaching up to 0.968 (\u003cb\u003eExtended Data Fig.\u0026nbsp;9a,b\u003c/b\u003e). 
These findings imply that incorporating biological RNA sequences from either 3'-UTRs or transcriptome-wide RNA binders enhances learning of RNA interactions.\u003c/p\u003e \u003c/div\u003e"},{"header":"Discussion","content":"\u003cp\u003eUltraGen demonstrated a robust ability to interpret evolutionary RNA binding contexts, bridging artificial and natural RNA realms through self-learning RNA-target interactions. Its strengths are evident in systematic ranking predictions, tissue-specific recognition, and single-base zero-shot RNA binding characterization, revealing RNA evolution towards a more refined spatial language. The UltraGen web server is accessible at \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://www.ultrarnalab.com\u003c/span\u003e\u003cspan address=\"https://www.ultrarnalab.com\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/p\u003e \u003cp\u003eWe benchmarked UltraGen against feature-based and pre-trained models across diverse binding systems. The transformer architecture, strengthened by pre-training on RNA species including transient interactions from the UltraSelex system, demonstrated improved sequence feature capture. Moreover, UltraSelex RNA species targeting small molecules exhibited a broader interaction landscape than those targeting the protein. Similarly, predicting RNA interactions with small molecules is more challenging than with proteins or (multi)cellular targets in the SELEX systems, likely due to their greater diversity compared to the more structured k-mer motif interactions with proteins\u003csup\u003e\u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e\u003c/sup\u003e.\u003c/p\u003e \u003cp\u003eRNA binding proteins preferentially interact with U, followed by A\u003csup\u003e\u003cspan citationid=\"CR25\" class=\"CitationRef\"\u003e25\u003c/span\u003e\u003c/sup\u003e. 
Consistently, the 3\u0026rsquo;-UTR, characterized by AU-rich elements and other regulatory factors, interacts with small molecules and proteins, influencing mRNA stability and translation efficiency\u003csup\u003e\u003cspan additionalcitationids=\"CR44 CR45\" citationid=\"CR43\" class=\"CitationRef\"\u003e43\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR46\" class=\"CitationRef\"\u003e46\u003c/span\u003e\u003c/sup\u003e. Low-abundant 3\u0026rsquo;-terminal RNA species showed notable tissue-specific distribution, likely due to subtle variations upstream of the AU-rich region within the 100 nt segment handled by the 3\u0026rsquo;-end processing machinery. UltraGen effectively predicted the 3\u0026rsquo;-UTR of dengue virus, which carries a classical PAS, and SARS-CoV-2, which only has an assumed U-stretch for 3\u0026rsquo;-end processing\u003csup\u003e\u003cspan citationid=\"CR47\" class=\"CitationRef\"\u003e47\u003c/span\u003e\u003c/sup\u003e, suggesting diverse polyadenylation mechanisms. However, 3\u0026rsquo;-end polyadenylation sequencing alone does not provide a complete view of mRNA related to alternative 5\u0026rsquo;-UTR, splicing events, or RNA modifications like m\u003csup\u003e6\u003c/sup\u003eA, which might limit the model's generalization ability. A more comprehensive understanding may emerge from integrating both full-length sequence data and base modification information from single-cell RNA sequencing (\u003cb\u003eExtended Data Fig.\u0026nbsp;10\u003c/b\u003e) to connect multi-level cell-type-specific regulatory networks and clarify their function in different tissues.\u003c/p\u003e \u003cp\u003eThe first pair in the nearby stem (e.g., G-U, A-U, or C-U, rather than G-C) is critical for RNA interaction with the SARS-CoV-2 replicase nsp12 protein. 
This finding complements our previous discovery that CUY(U/C)G-containing RNA regions are essential for strong interactions with nsp12\u003csup\u003e22\u003c/sup\u003e. Notably, CUUG-containing RNA aptamers effectively inhibited the SARS-CoV-2 replicase nsp12/7/8 complex in biochemical RNA extension assays\u003csup\u003e\u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e22\u003c/span\u003e\u003c/sup\u003e. The CUYG motif forms a conserved stem-loop structure (SL2) in coronaviruses (SARS, MHV, BCoV, OC43, and HKU1)\u003csup\u003e\u003cspan citationid=\"CR48\" class=\"CitationRef\"\u003e48\u003c/span\u003e\u003c/sup\u003e. The virulence of MHV was found to be highly sensitive to genomic site mutations in this SL2 CUUG motif\u003csup\u003e\u003cspan citationid=\"CR48\" class=\"CitationRef\"\u003e48\u003c/span\u003e\u003c/sup\u003e. These findings may suggest that the CUUG-motif interacts with the SARS-CoV-2 replicase nsp12 due to its crucial role in viral replication.\u003c/p\u003e \u003cp\u003eFuture models could benefit from continued pre-training on \u003cem\u003ein vitro\u003c/em\u003e selected RNA binding sequences targeting a broader range of molecules with desired biological functions. Additionally, incorporating a larger cohort of dynamic endogenous RNA species and information on RNA epitranscriptomic modifications under various physiological conditions would be valuable for potential model optimization. 
Exploring training strategies for RNA language models with a larger parameter framework may also be advantageous.\u003c/p\u003e \u003cp\u003eIn summary, UltraGen has demonstrated its efficacy as a powerful tool for capturing context-specific RNA binding systems, offering substantial potential for advancing our understanding of RNA-target interactions and their implications within biological contexts.\u003c/p\u003e "},{"header":"Methods","content":"\u003cp\u003e\u003cstrong\u003eUltraGen model architecture.\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe UltraGen model was constructed using a BERT-style encoder-only transformer architecture\u003csup\u003e\u003cspan class=\"CitationRef\"\u003e51\u003c/span\u003e\u003c/sup\u003e, incorporating two key components: multi-head self-attention and feedforward network modules. Additionally, it leveraged Rotary Position Embedding\u003csup\u003e\u003cspan class=\"CitationRef\"\u003e26\u003c/span\u003e\u003c/sup\u003e for enhanced processing of long-distance dependencies. With a total of 12 layers and an embedding size of 480, the model comprises 33.5 million parameters. Each nucleotide base (A/U/G/C) was treated as an individual token during RNA sequence tokenization. Unique tokens, such as \u0026lt;\u0026thinsp;CLS\u0026thinsp;\u0026gt;\u0026thinsp;at the start and \u0026lt;\u0026thinsp;EOS\u0026thinsp;\u0026gt;\u0026thinsp;at the end, were introduced to enhance the capture of global semantic content. 
Additionally, the UltraGen vocabulary includes tokens\u0026thinsp;\u0026lt;\u0026thinsp;EOS\u0026thinsp;\u0026gt;\u0026thinsp;for the separator, \u0026lt;PAD\u0026thinsp;\u0026gt;\u0026thinsp;for padding, \u0026lt;MASK\u0026thinsp;\u0026gt;\u0026thinsp;for masking, and \u0026lt;\u0026thinsp;UNK\u0026thinsp;\u0026gt;\u0026thinsp;for unknown elements.\u003c/p\u003e\n\u003cdiv id=\"Sec11\" class=\"Section2\"\u003e\n \u003ch2\u003eUltraGen pre-training kernel\u003c/h2\u003e\n \u003cdiv id=\"Sec12\" class=\"Section3\"\u003e\n \u003ch2\u003efeature description\u003c/h2\u003e\n \u003cp\u003eCompared to SELEX RNA libraries that lost most transient RNA binders, the UltraSelex RNA library enriched both persistent and transient RNA binders\u003csup\u003e\u003cspan class=\"CitationRef\"\u003e22\u003c/span\u003e\u003c/sup\u003e. We utilized the sequence information of these UltraSelex RNA datasets for our model training. The common construct of UltraSelex RNA binders (103 nt) contained two constant primer-binding regions\u003csup\u003e\u003cspan class=\"CitationRef\"\u003e52\u003c/span\u003e\u003c/sup\u003e that could be structurally paired with each other, two randomized stretches of 26 nt each to diversify binding characteristics for the wet-lab selection, and a constant 12 nt internal hairpin loop thought to improve the enrichment of high-affinity binders with relatively less RNA-RNA interaction\u003csup\u003e\u003cspan class=\"CitationRef\"\u003e53\u003c/span\u003e\u003c/sup\u003e. Moreover, the UltraSelex SiR library contains more diverse RNA binders than the UltraSelex nsp12 library, in which target-specific geometric constraints applied (\u003cstrong\u003eExtended Data\u003c/strong\u003e Fig. \u003cspan class=\"InternalRef\"\u003e1\u003c/span\u003ea). 
Therefore, UltraGen was pre-trained on the full-length RNA sequence (103 nt) without leveraging their secondary structure, derived from UltraSelex SiR RNA dataset with different binding potential (\u003cem\u003eauc\u003c/em\u003e value) threshold. The optimal UltraGen variant was achieved by pre-training on the top ten million species ranked by their binding potential, suggesting its learning performance balances the enrichment of binding signals while reducing background noise from the dataset.\u003c/p\u003e\n \u003c/div\u003e\n\u003c/div\u003e\n\u003cdiv id=\"Sec13\" class=\"Section2\"\u003e\n \u003ch2\u003edata pre-processing\u003c/h2\u003e\n \u003cp\u003eThe raw HTS datasets, including UltraSelex SiR-B\u003csup\u003e\u003cspan class=\"CitationRef\"\u003e22\u003c/span\u003e\u003c/sup\u003e (54.6 million species, sub-panel 1.1.2), UltraSelex Nsp-B\u003csup\u003e\u003cspan class=\"CitationRef\"\u003e22\u003c/span\u003e\u003c/sup\u003e (76.8 million species, sub-panel 1.2.2), SELEX SiR\u003csup\u003e\u003cspan class=\"CitationRef\"\u003e22\u003c/span\u003e, \u003cspan class=\"CitationRef\"\u003e28\u003c/span\u003e\u003c/sup\u003e (14.6 million species, sub-panel 2.1.4) and SELEX nsp12\u003csup\u003e22\u003c/sup\u003e (29 million species, sub-panel 2.2.1), were obtained from the UltraSelex data archive panel in UltraRNALab (\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ewww.ultrarnalab.com\u003c/span\u003e\u003c/span\u003e). RNA sequences originated from UltraSelex underwent quality control, adaptor trimming, hairpin loop confirmation, and were subsequently ranked based on SGREELI \u003cem\u003eauc\u003c/em\u003e with default setting to indicate binding potential in descending order\u003csup\u003e\u003cspan class=\"CitationRef\"\u003e22\u003c/span\u003e\u003c/sup\u003e. Similarly, RNA species from the final round (the 14th ) of the SiR SELEX underwent a similar process but were ranked by their detection frequency in descending order. 
Top-ranked RNA sequences were extracted for training, with a held-out test set comprising 2% of the total species, ensuring less than 90% sequence identity to the training data using CD-HIT\u003csup\u003e\u003cspan class=\"CitationRef\"\u003e54\u003c/span\u003e\u003c/sup\u003e (Supplementary Table 1).\u003c/p\u003e\n\u003c/div\u003e\n\u003cdiv id=\"Sec14\" class=\"Section2\"\u003e\n \u003ch2\u003eself-supervised learning and model evaluation\u003c/h2\u003e\n \u003cp\u003eUltraGen integrated two distinct pre-training components: (1) Base Reconstruction Loss (\u003cem\u003eL\u003c/em\u003e\u003csub\u003e\u003cem\u003ebase\u003c/em\u003e\u003c/sub\u003e), resembling BERT\u003csup\u003e\u003cspan class=\"CitationRef\"\u003e51\u003c/span\u003e\u003c/sup\u003e, involves randomly selecting 15% of tokens from each sequence for prediction. Among these, 80% were replaced with \u0026lt;\u0026thinsp;MASK\u0026gt;, 10% were substituted with other bases, and 10% remain unchanged. (2) Motif Reconstruction Loss (\u003cem\u003eL\u003c/em\u003e\u003csub\u003e\u003cem\u003emotif\u003c/em\u003e\u003c/sub\u003e), similar to SpanBERT\u003csup\u003e\u003cspan class=\"CitationRef\"\u003e55\u003c/span\u003e\u003c/sup\u003e, employed consecutive span masking to predict motifs. Spans follow a Poisson distribution (\u0026lambda;\u0026thinsp;=\u0026thinsp;5) with lengths ranging from 1 to 10. Unlike SpanBERT, UltraGen reconstructed original bases from masked positions, not tokens at span boundaries. UltraGen optimization combines both pre-training objectives, formally defined as: \u003cem\u003eL\u003c/em\u003e\u0026thinsp;=\u0026thinsp;\u003cem\u003eL\u003c/em\u003e\u003csub\u003e\u003cem\u003ebase\u003c/em\u003e\u003c/sub\u003e\u0026thinsp;+\u0026thinsp;\u0026alpha; \u0026middot; \u003cem\u003eL\u003c/em\u003e\u003csub\u003e\u003cem\u003emotif\u003c/em\u003e\u003c/sub\u003e, where \u0026alpha; adjusts objective weights, set to 0.25 during practical pre-training. 
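\u003c/p\u003e\n \u003cp\u003eThe two masking schemes and the combined objective described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors\u0026rsquo; implementation: all function names are hypothetical, span lengths are drawn from a Poisson distribution by rejecting values outside 1 to 10 (the text does not say whether out-of-range lengths are rejected or clipped), and a single span is masked per sequence.\u003c/p\u003e

```python
import math
import random

BASES = 'AUGC'
MASK = '<MASK>'


def sample_span_length(rng, lam=5.0, lo=1, hi=10):
    # Assumption: draw from Poisson(lam) with Knuth's method and reject
    # values outside [lo, hi]; the source only states the range 1 to 10.
    threshold = math.exp(-lam)
    while True:
        k, p = 0, 1.0
        while True:
            p *= rng.random()
            if p <= threshold:
                break
            k += 1
        if lo <= k <= hi:
            return k


def corrupt_token(base, rng):
    # BERT-style rule for one selected position:
    # 80% mask token, 10% a different base, 10% unchanged.
    r = rng.random()
    if r < 0.8:
        return MASK
    if r < 0.9:
        return rng.choice([b for b in BASES if b != base])
    return base


def mask_bases(seq, rng, rate=0.15):
    # Base reconstruction view: select 15% of positions at random.
    tokens = list(seq)
    n_pick = max(1, round(rate * len(tokens)))
    targets = sorted(rng.sample(range(len(tokens)), n_pick))
    for i in targets:
        tokens[i] = corrupt_token(seq[i], rng)
    return tokens, targets


def mask_span(seq, rng):
    # Motif reconstruction view: mask one contiguous span (assumption: one
    # span per sequence); the prediction targets are the original bases.
    length = min(sample_span_length(rng), len(seq))
    start = rng.randrange(len(seq) - length + 1)
    tokens = list(seq)
    for i in range(start, start + length):
        tokens[i] = MASK
    return tokens, list(range(start, start + length))


def combined_loss(loss_base, loss_motif, alpha=0.25):
    # L = L_base + alpha * L_motif, with alpha = 0.25 as stated in the text.
    return loss_base + alpha * loss_motif
```

\u003cp\u003eIn this sketch, mask_bases and mask_span each return the corrupted token list together with the positions whose original bases the model would be asked to reconstruct.\u003c/p\u003e\n \u003cp\u003e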
We utilized the AdamW optimizer with a warm-up strategy, increasing the learning rate to 4e-4 over 2,000 steps, followed by cosine annealing for each data partition. The model was trained using single-node parallelism across 8 GPUs, with a batch size of 500 per GPU. A mixed-precision training strategy was employed to enhance computational efficiency. The pre-training loss decreased gradually and reached a plateau, indicating stable convergence. Model evaluation involved comparing the average loss on the test set for each partition and UMAP\u003csup\u003e\u003cspan class=\"CitationRef\"\u003e56\u003c/span\u003e\u003c/sup\u003e visualization to differentiate RNA binders. The optimal UltraGen model was pre-trained on the top-ranked 10 million RNA species from the UltraSelex SiR-B dataset, utilizing eight 32GB NVIDIA V100 GPUs over a period of 21 days. Additionally, other pre-trained models, UltraGen\u003csup\u003esource_SiR\u003c/sup\u003e and UltraGen\u003csup\u003esource_nsp12\u003c/sup\u003e, were constructed using SELEX datasets targeting SiR and nsp12\u003csup\u003e\u003cspan class=\"CitationRef\"\u003e22\u003c/span\u003e\u003c/sup\u003e, each based on their respective top-ranked 10\u0026nbsp;million RNA species.\u003c/p\u003e\n\u003c/div\u003e\n\u003cdiv id=\"Sec15\" class=\"Section2\"\u003e\n \u003ch2\u003eSystematic ranking of in vitro-selected RNA aptamers by UltraGen\u003c/h2\u003e\n \u003cdiv id=\"Sec16\" class=\"Section3\"\u003e\n \u003ch2\u003edata pre-processing\u003c/h2\u003e\n \u003cp\u003eThe SiR SELEX dataset was obtained and processed as described above. 
Twelve benchmark SELEX datasets were utilized in this study, including four small-molecule targets (benzopyrylium-coumarin fluorophores\u003csup\u003e\u003cspan class=\"CitationRef\"\u003e4\u003c/span\u003e\u003c/sup\u003e, paromomycin\u003csup\u003e\u003cspan class=\"CitationRef\"\u003e5\u003c/span\u003e\u003c/sup\u003e, maleimide\u003csup\u003e\u003cspan class=\"CitationRef\"\u003e6\u003c/span\u003e\u003c/sup\u003e, PPACK\u003csup\u003e\u003cspan class=\"CitationRef\"\u003e6\u003c/span\u003e\u003c/sup\u003e), four protein targets (TAR DNA binding protein 43\u003csup\u003e7\u003c/sup\u003e, ribosomal protein S15\u003csup\u003e8\u003c/sup\u003e, RNA-binding motif protein 24\u003csup\u003e7\u003c/sup\u003e, and HIV-1 reverse transcriptase\u003csup\u003e\u003cspan class=\"CitationRef\"\u003e9\u003c/span\u003e\u003c/sup\u003e), and four (multi)cellular targets (Triple-negative breast cancer cells\u003csup\u003e\u003cspan class=\"CitationRef\"\u003e10\u003c/span\u003e\u003c/sup\u003e, Chinese hamster ovary K1 cells\u003csup\u003e\u003cspan class=\"CitationRef\"\u003e11\u003c/span\u003e\u003c/sup\u003e, myeloid-derived suppressor cells\u003csup\u003e\u003cspan class=\"CitationRef\"\u003e12\u003c/span\u003e\u003c/sup\u003e, and human islets\u003csup\u003e\u003cspan class=\"CitationRef\"\u003e13\u003c/span\u003e\u003c/sup\u003e) (see details in Supplementary Table\u0026nbsp;2). Original sequencing datasets underwent standard processing, including quality-control, adaptor trimming, length filtering, and RNA conversion (in-house code). RNA species enriched in the final round of SELEX datasets were then extracted and stratified accordingly (Supplementary Table\u0026nbsp;3\u0026ndash;5). Those absent in the final round were classified into a background set. 
The binding potential of bucketed RNA species was estimated based on their detection frequency, with the highest-ranking bucket comprising the top-ranked 0.1\u0026ndash;1% of total species, reflecting desired experimental selection criteria in SELEX\u003csup\u003e\u003cspan class=\"CitationRef\"\u003e57\u003c/span\u003e, \u003cspan class=\"CitationRef\"\u003e58\u003c/span\u003e\u003c/sup\u003e. Data balance was ensured by implementing a downsampling strategy for categories with an excess of species, such as those in the background set, limiting them to 100,000 instances. Conversely, categories with fewer instances underwent oversampling, with instances randomly duplicated to reach the same threshold, ensuring a fair distribution. Each RNA category within the benchmark dataset was subsequently partitioned into training, validation, and test sets at a ratio of 6:1:3.\u003c/p\u003e\n \u003cp\u003eFor the classification tasks derived from the cohort of SELEX binding datasets, the target-specific enriched RNA species varied in each dataset due to differences in RNA length, architecture (different primer A/B from different labs), and library randomness (Supplementary Table\u0026nbsp;2). As these species differ from the RNA used for pre-training, they were not excluded from the analysis. The exact sample sizes and data balancing strategies are provided in Supplementary Tables\u0026nbsp;3\u0026ndash;6.\u003c/p\u003e\n \u003c/div\u003e\n\u003c/div\u003e\n\u003cdiv id=\"Sec17\" class=\"Section2\"\u003e\n \u003ch2\u003emodel fine-tuning and evaluation\u003c/h2\u003e\n \u003cp\u003eIn the fine-tuning phase, a BERT strategy was employed to extract features from the first\u0026thinsp;\u0026lt;\u0026thinsp;CLS\u0026thinsp;\u0026gt;\u0026thinsp;token in the final layer as sequence representation. For each dataset, task-specific non-linear classifiers were implemented, and all parameters were subsequently fine-tuned. 
Systematic ranking is a multi-class task, for which Categorical Cross-Entropy Loss is calculated. The learning rate was set to 1e-4, with a batch size of 100. An early stopping mechanism was employed during training, terminating the process if validation performance showed no improvement over 10 consecutive epochs. This approach reduced overfitting and unnecessary computation. This single-label classification task was evaluated using four metrics: \u003cem\u003ePrecision@top\u003c/em\u003e, \u003cem\u003eF1@top\u003c/em\u003e, \u003cem\u003eF1@all Precision@all\u003c/em\u003e, and the \u003cem\u003eWeighted Precision@all\u003c/em\u003e scores. \u003cem\u003ePrecision@top\u003c/em\u003e and \u003cem\u003eF1@top\u003c/em\u003e specifically denoted the precision and F1 score for predicting RNA species with the highest binding potential. Additionally, \u003cem\u003ePrecision@all\u003c/em\u003e and \u003cem\u003eWeighted Precision@all\u003c/em\u003e serve as comprehensive precision metrics for predicting the systematic binding landscape; the weighted variant assigns the \u003cem\u003ej\u003c/em\u003e\u003csup\u003e\u003cem\u003eth\u003c/em\u003e\u003c/sup\u003e category a weight of 1/\u003cem\u003ej\u003c/em\u003e. The \u003cem\u003eWeighted Precision@all\u003c/em\u003e metric was calculated using the formula: \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\frac{\\sum_{j=1}^{N}\\left(\\frac{Preci_{j}}{j}\\right)}{\\sum_{i=1}^{N}\\frac{1}{i}}\\)\u003c/span\u003e\u003c/span\u003e, where N denotes the number of categories, and \u003cem\u003ePreci\u003c/em\u003e\u003csub\u003e\u003cem\u003ej\u003c/em\u003e\u003c/sub\u003e represents the precision associated with the \u003cem\u003ej\u003c/em\u003e\u003csup\u003e\u003cem\u003eth\u003c/em\u003e\u003c/sup\u003e category. \u003cem\u003ePrecision@all\u003c/em\u003e was calculated without weighting \u003cem\u003ecategories\u003c/em\u003e. In a further ablation study, sequence homogeneity was assessed using the edit distance metric. 
Initially, RNA species were ranked in descending order based on their detection frequency in the dataset. Subsequently, RNA species with an edit distance less than or equal to a specified threshold relative to those from the top-ranked bucket were removed, while RNA species from non-top-ranked buckets retained their positions in the training process. After fine-tuning the model\u0026rsquo;s parameters, the \u003cem\u003ePrecision@top\u003c/em\u003e metric was calculated and compared across different edit distance criteria.\u003c/p\u003e\n \u003cp\u003eTo evaluate the effectiveness of the model\u0026rsquo;s downstream application, predicted binder sequences in the top category from test sets were clustered and compared with reported binding sequences or core motifs from the literature (Supplementary Table\u0026nbsp;2). Specifically, the model score (predicted probability) and the weighted abundance across all SELEX rounds of each sequence were recorded. Sequences were then sorted in descending order based on their model scores and clustered using the following algorithm: First, the edit distance between each sequence and the central sequence in the clustering pool was calculated sequentially. If the edit distance exceeded a specified cutoff, a new category was introduced to the clustering pool with the current sequence serving as the central sequence for that category. Conversely, if the edit distance was no more than the cutoff, the sequence was assigned to that category. After clustering, the weighted abundance of sequences within each category was summed, and categories were sorted in descending order based on this summed value. Distinct cutoff values were determined based on the level of sequence similarity within the datasets. Datasets exhibiting lower similarity, such as TARDBP and RBM24, utilized larger edit distances as clustering cutoffs, whereas datasets featuring higher similarity or shorter sequences were analyzed using smaller edit distance cutoffs. 
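\u003c/p\u003e\n \u003cp\u003eThe score-ordered clustering procedure described above can be sketched as follows. This is an illustrative reading of the text rather than the authors\u0026rsquo; code: the function names and the record layout are hypothetical, edit distance is taken to be Levenshtein distance, and a sequence is assumed to join the first existing category whose central sequence lies within the cutoff.\u003c/p\u003e

```python
def edit_distance(a, b):
    # Levenshtein distance computed with a rolling one-row table.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def cluster_by_score(records, cutoff):
    # records: iterable of (sequence, model_score, weighted_abundance).
    clusters = []
    ranked = sorted(records, key=lambda r: r[1], reverse=True)
    for seq, _score, abundance in ranked:
        for cluster in clusters:
            # Assumption: join the first category whose center is close enough.
            if edit_distance(seq, cluster['center']) <= cutoff:
                cluster['members'].append(seq)
                cluster['abundance'] += abundance
                break
        else:
            # No center within the cutoff: this sequence founds a new category.
            clusters.append({'center': seq, 'members': [seq],
                             'abundance': abundance})
    # Categories are finally ordered by their summed weighted abundance.
    return sorted(clusters, key=lambda c: c['abundance'], reverse=True)
```

\u003cp\u003eBecause sequences are visited in descending score order, each central sequence is the highest-scoring member of its category in this sketch.\u003c/p\u003e\n \u003cp\u003e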
The specific cutoff settings for each dataset were as follows: DAse = 3, BC = 7, PR = 3, MI = 3, TARDBP = 14, RT = 2, RBM24 = 13, S15 = 6, ISLETS = 4, MDSC = 5, CHO-K1 = 4, TNBC = 3. The identification of binders within each dataset was determined accordingly. For the DAse, TARDBP, and RBM24 datasets, positive sequences were selected according to the binding motifs reported in the literature. For the remaining datasets, experimentally verified sequences reported in the literature were considered as positive sequences.\u003c/p\u003e\n\u003c/div\u003e\n\u003cdiv id=\"Sec18\" class=\"Section2\"\u003e\n \u003ch2\u003eClassifying human tissue-specific hallmarks of 3\u0026rsquo;-terminal non-coding RNA by UltraGen\u003c/h2\u003e\n \u003cdiv id=\"Sec19\" class=\"Section3\"\u003e\n \u003ch2\u003edata preprocessing\u003c/h2\u003e\n \u003cp\u003eTandem 3\u0026rsquo;-terminal end sequencing datasets from 22 human tissues were obtained from APASdb\u003csup\u003e\u003cspan class=\"CitationRef\"\u003e32\u003c/span\u003e\u003c/sup\u003e. Raw sequencing reads underwent base quality control, adaptor trimming, length filtering, and genome mapping (using bowtie version 1.0.0 with parameters -v 2 -k 2 --best, referencing human genome GRCh37/hg19 from UCSC), along with internal priming filtering as described\u003csup\u003e\u003cspan class=\"CitationRef\"\u003e59\u003c/span\u003e\u003c/sup\u003e. The 100 nucleotides (nt) upstream from the genome-mapped 3\u0026rsquo;-end of each sequence were collected and aggregated into a data frame. This summary data frame comprises 2.89\u0026nbsp;million unique non-coding RNA species and their corresponding detection frequency across 22 human tissues. 
Given the tissue-specific chromosome-wide gene expression\u003csup\u003e\u003cspan class=\"CitationRef\"\u003e60\u003c/span\u003e\u003c/sup\u003e, RNA species were randomly partitioned into training, validation, and test sets, stratified by the number of distinct tissues in which they appear, following an approximately 6:1:3 ratio.\u003c/p\u003e\n \u003cp\u003eHuman-pathogenic RNA virus genomes were retrieved from the NCBI viruses RefSeq release (\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://ftp.ncbi.nlm.nih.gov/refseq/release/viral/\u003c/span\u003e\u003c/span\u003e), along with their corresponding 3\u0026rsquo;-UTR annotation, using their NCBI reference accession IDs. Subsequently, the 100 nt upstream from the 3\u0026rsquo;-end of each 3\u0026rsquo;-UTR was extracted for downstream featurization. SARS-CoV-2 variant genomes meeting full nucleotide completeness criteria, within the time range from the end of 2019 to the end of 2023, were obtained from the NCBI Virus database (\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://www.ncbi.nlm.nih.gov/labs/virus/\u003c/span\u003e\u003c/span\u003e). Variants monitored closely by the World Health Organization were identified and recorded with their accession IDs. Their 3\u0026rsquo;-UTR sequences were processed similarly to the human-pathogenic RNA virus genomes described above.\u003c/p\u003e\n \u003c/div\u003e\n\u003c/div\u003e\n\u003cdiv id=\"Sec20\" class=\"Section2\"\u003e\n \u003ch2\u003emodel fine-tuning and evaluation\u003c/h2\u003e\n \u003cp\u003eUltraGen adopted a multi-task learning framework to simultaneously address classification tasks for human RNA tissue specificity and the corresponding abundance level. Briefly, during the tissue specificity calculation, the presence or absence of each RNA species in the specified tissue was predicted as a probability between 0 and 1. 
RNA species correctly predicted (probability\u0026thinsp;\u0026gt;\u0026thinsp;0.5 for presence, and \u0026le;\u0026thinsp;0.5 for absence) were summarized using precision, recall, and F1 score metrics across 22 tissues. RNA species were further classified into four abundance levels: high (counts\u0026thinsp;\u0026ge;\u0026thinsp;100), intermediate (100\u0026thinsp;\u0026gt;\u0026thinsp;counts\u0026thinsp;\u0026ge;\u0026thinsp;10), low (10\u0026thinsp;\u0026gt;\u0026thinsp;counts\u0026thinsp;\u0026ge;\u0026thinsp;1), and non-existent. Thus, each RNA species was associated with 22 (tissues) \u0026times; 4 (abundance levels) labels for supervised learning. The overall loss function was defined as follows: \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(Loss = \\sum_{i=1}^{T} BCE(s_{i}, A_{i}) + \\sum_{i=1}^{T} CE(e_{i}, B_{i})\\)\u003c/span\u003e\u003c/span\u003e. Each sample sequence is associated with specificity labels \u003cem\u003eA\u003c/em\u003e\u003csub\u003e\u003cem\u003ei\u003c/em\u003e\u003c/sub\u003e \u0026isin; {0, 1}, indicating tissue presence across \u003cem\u003eT\u003c/em\u003e tissues, and abundance level labels \u003cem\u003eB\u003c/em\u003e\u003csub\u003e\u003cem\u003ei\u003c/em\u003e\u003c/sub\u003e \u0026isin; {0, 1, 2, 3}, representing different abundance levels. The model output \u003cem\u003es\u003c/em\u003e\u003csub\u003e\u003cem\u003ei\u003c/em\u003e\u003c/sub\u003e predicts tissue preference, while \u003cem\u003ee\u003c/em\u003e\u003csub\u003e\u003cem\u003ei\u003c/em\u003e\u003c/sub\u003e predicts tissue abundance levels. Binary cross-entropy loss and categorical cross-entropy loss were computed separately for these tasks. The learning rate was set to 1e-3, and the batch size was configured as 200. 
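\u003c/p\u003e\n \u003cp\u003eThe label construction and the two-term loss above can be illustrated with a small dependency-free sketch. Several choices here are assumptions rather than details given in the text: the function names are hypothetical, the integer codes 0 to 3 are mapped to non-existent, low, intermediate, and high in that order, presence is taken to mean a count of at least 1, the specificity outputs are treated as probabilities, and the abundance outputs are treated as unnormalized logits.\u003c/p\u003e

```python
import math


def abundance_level(count):
    # Thresholds from the text; the integer coding itself is an assumption:
    # 0 = non-existent, 1 = low, 2 = intermediate, 3 = high.
    if count >= 100:
        return 3
    if count >= 10:
        return 2
    if count >= 1:
        return 1
    return 0


def make_labels(counts):
    # counts: per-tissue detection counts for one RNA species (length T).
    levels = [abundance_level(c) for c in counts]
    presence = [int(level > 0) for level in levels]
    return presence, levels


def bce(p, a, eps=1e-12):
    # Binary cross-entropy for one tissue; p is a predicted probability.
    p = min(max(p, eps), 1.0 - eps)
    return -(a * math.log(p) + (1 - a) * math.log(1.0 - p))


def ce(logits, b):
    # Categorical cross-entropy for one tissue over the four abundance levels,
    # using a numerically stable log-sum-exp.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[b]


def multitask_loss(s, e, A, B):
    # Loss = sum_i BCE(s_i, A_i) + sum_i CE(e_i, B_i), summed over T tissues.
    specificity = sum(bce(si, ai) for si, ai in zip(s, A))
    abundance = sum(ce(ei, bi) for ei, bi in zip(e, B))
    return specificity + abundance
```

\u003cp\u003eIn practice these terms would be computed by a deep learning framework over batches; the sketch only makes the per-sequence arithmetic explicit.\u003c/p\u003e\n \u003cp\u003e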
The \u0026lt;CLS\u0026gt; representation from the last model layer served as the sequence feature, followed by the addition of two nonlinear layers for predicting tissue specificity and abundance level. UltraGen\u0026apos;s entire parameters were fine-tuned, and performance metrics, including macro precision, recall, and F1 score, were computed and compared with other methods. Zero-shot inference was conducted for human pathogenic RNA viruses (Supplementary Table 7) and SARS-CoV-2 variants (Supplementary Table 8) using their 3\u0026rsquo;-end sequences as input. The tissue specificity of these viruses was further compared with data reported in clinical and research articles.\u003c/p\u003e\n \u003cp\u003e\u003cstrong\u003ePredicting\u003c/strong\u003e \u003cstrong\u003ein vivo\u003c/strong\u003e \u003cstrong\u003eRNA-protein interaction from CLIP datasets and m6A modification.\u003c/strong\u003e\u003c/p\u003e\n \u003cp\u003eThe eleven human iCLIP sequencing datasets were curated from the iONMF repertoire\u003csup\u003e\u003cspan class=\"CitationRef\"\u003e15\u003c/span\u003e\u003c/sup\u003e, encompassing nine protein targets: hnRNPC\u003csup\u003e\u003cspan class=\"CitationRef\"\u003e61\u003c/span\u003e\u003c/sup\u003e, U2AF2\u003csup\u003e61\u003c/sup\u003e, hnRNPL\u003csup\u003e\u003cspan class=\"CitationRef\"\u003e62\u003c/span\u003e\u003c/sup\u003e, hnRNPL-like\u003csup\u003e\u003cspan class=\"CitationRef\"\u003e63\u003c/span\u003e\u003c/sup\u003e, Nsun2\u003csup\u003e64\u003c/sup\u003e, TDP-43\u003csup\u003e65\u003c/sup\u003e, TIA1\u003csup\u003e66\u003c/sup\u003e, and TIAL1\u003csup\u003e66\u003c/sup\u003e. 
Each dataset had been split into three parts using three-fold cross-validation, with 40,000 samples per part, further divided into training and testing sets in a 3:1 ratio.\u003c/p\u003e\n \u003cp\u003eThe mouse CLIP sequencing datasets were curated from the CLIPdb\u003csup\u003e\u003cspan class=\"CitationRef\"\u003e67\u003c/span\u003e\u003c/sup\u003e, encompassing positive sequences (101 nt centered on the middle site) from eleven protein targets: EZH2, FUS, HNRNPR, LIN28A, RBFOX2, RBM10, SRSF2, SRSF3, TARDBP, U2AF2, YTHDC2 (detailed sequence data source IDs are listed in \u0026quot;mouse.txt\u0026quot; on the CLIPdb server \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttp://clipdb.ncrnalab.org\u003c/span\u003e\u003c/span\u003e). Negative sequences (101 nt) were randomly sampled from the mouse transcriptome regions (M2, GRCm38.p2 from Gencode) using the following commands \u0026quot;bedtools random -l 101 -n 10000000 -g GRCm38_p2.chrom.sizes -seed 922\u0026thinsp;\u0026gt;\u0026thinsp;mouse_genome_random.bed\u0026quot; and \u0026quot;bedtools intersect -a mouse_genome_random.bed -b gencode_M2_mouse_gene.bed -wa\u0026thinsp;\u0026gt;\u0026thinsp;mouse_gene_101nt_random.bed\u0026quot;. To ensure that negative sequences did not overlap with the regions of positive sequences, they were filtered using: \u0026quot;bedtools intersect -a mouse_gene_101nt_random.bed -b positive_regions.bed -v\u0026thinsp;\u0026gt;\u0026thinsp;negative_regions_raw.bed\u0026quot;. The remaining negative sequences were randomly shuffled using random.shuffle, with seeds ranging from 922 to 932. To remove sequence redundancy over 80%, CD-HIT\u003csup\u003e\u003cspan class=\"CitationRef\"\u003e54\u003c/span\u003e\u003c/sup\u003e was applied to both positive and negative datasets. 
Equal numbers of non-redundant RNA sequences were randomly selected and split into training, validation, and test sets in a 6:1:3 ratio.\u003c/p\u003e\n \u003cp\u003eRNA-protein interactions were defined as a binary classification task. We adopted the same fine-tuning strategy as in the method \u0026ldquo;\u003cem\u003eSystematic Ranking of in vitro-selected RNA Aptamers\u003c/em\u003e\u0026rdquo; and then used macro precision, recall, F1 score, and area under the ROC curve (AUC) for model performance comparison.\u003c/p\u003e\n \u003cp\u003eFor predicting RNA species harboring m6A modification, a total of 79,021 non-redundant m6A modification sites (filtered from 131,703 raw m6A-Atlas signals\u003csup\u003e\u003cspan class=\"CitationRef\"\u003e68\u003c/span\u003e\u003c/sup\u003e) and 849,005 non-m6A sites, along with their flanking 20 nt upstream and 20 nt downstream regions, from nine cell lines (A549, CD8T, ESC, HCT116, HEK293, HEK293T, Hela, HepG2, and MOLM113) were obtained from a previous study\u003csup\u003e\u003cspan class=\"CitationRef\"\u003e33\u003c/span\u003e\u003c/sup\u003e. Model performance was then evaluated through cross-validation, pairing the positive set with each of 10 different negative sets\u003csup\u003e\u003cspan class=\"CitationRef\"\u003e33\u003c/span\u003e\u003c/sup\u003e. 
Performance metrics, including precision, recall, F1 score, and AUC were assessed using five different model seeds.\u003c/p\u003e\n\u003c/div\u003e\n\u003cdiv id=\"Sec21\" class=\"Section2\"\u003e\n \u003ch2\u003eBenchmark deep learning models\u003c/h2\u003e\n \u003cp\u003eThe UltraGen classification benchmarks comprised feature-based models (DeepBind\u003csup\u003e\u003cspan class=\"CitationRef\"\u003e16\u003c/span\u003e\u003c/sup\u003e, SA-Net\u003csup\u003e\u003cspan class=\"CitationRef\"\u003e29\u003c/span\u003e\u003c/sup\u003e, and RaptRanker\u003csup\u003e\u003cspan class=\"CitationRef\"\u003e17\u003c/span\u003e\u003c/sup\u003e) and pre-trained models (RNABERT\u003csup\u003e\u003cspan class=\"CitationRef\"\u003e30\u003c/span\u003e\u003c/sup\u003e, RNA-FM\u003csup\u003e\u003cspan class=\"CitationRef\"\u003e27\u003c/span\u003e\u003c/sup\u003e, 3UTRBERT\u003csup\u003e\u003cspan class=\"CitationRef\"\u003e33\u003c/span\u003e\u003c/sup\u003e). DeepBind, a convolutional neural network specifically designed to process sequential nucleotides input features for predicting RNA protein binding, underwent augmentation with task-specific nonlinear classification layers and comprehensive parameters training. Similarly, SA-Net, utilizing a self-attention mechanism and sequence k-mer embedding, along with RaptRanker, which incorporates both sequence and secondary structure information, were augmented with the same non-linear framework. The large RNA-protein binding model RNABERT was originally pre-trained on 76,237 human small ncRNAs with 0.47\u0026nbsp;million parameters, while RNA-FM was pre-trained on a 23\u0026nbsp;million ncRNA source from RNACentral, utilizing 99.52\u0026nbsp;million parameters, and 3UTRBERT was pre-trained on 20,362 3\u0026rsquo;-UTRs with 86.09\u0026nbsp;million parameters. 
RNA-RBP interaction model BERT-RBP\u003csup\u003e\u003cspan class=\"CitationRef\"\u003e69\u003c/span\u003e\u003c/sup\u003e adopts the BERT architecture and was built upon the DNABERT-3 model that was pre-trained on the human reference genome GRCh38.p13, using 3-mer representations of nucleotide sequence and comprising approximately 86 million parameters. The first special classification token from the final layer of these pre-trained models was utilized to represent the sequence. Furthermore, task-specific nonlinear classification layers were integrated and subsequently fine-tuned to optimize all parameters. When comparing pre-trained UltraGen with other benchmark approaches, all models were augmented with identical task-specific nonlinear layers and fine-tuned using the same downstream binding datasets.\u003c/p\u003e\n \u003cp\u003e\u003cstrong\u003eSequence motif and structural analysis of SELEX RNA species.\u003c/strong\u003e\u003c/p\u003e\n \u003cp\u003eRNA species from the SELEX datasets were classified into \u0026apos;Binding\u0026apos; (detected enrichment\u0026thinsp;\u0026gt;\u0026thinsp;0) and \u0026quot;\u003cem\u003eNon-binding\u003c/em\u003e\u0026quot; (no detection) groups based on their abundance in the final SELEX round. The \u0026apos;Binding\u0026apos; group includes all enriched RNA species, while an equal number of non-detected species were randomly selected to form the \u0026quot;\u003cem\u003eNon-binding\u003c/em\u003e\u0026quot; group.\u003c/p\u003e\n \u003cp\u003eFor sequence analysis, RNA sequences were dissected into consecutive hexamer units to examine distribution patterns between groups. For structural analysis, RNA secondary structures were predicted by LinearPartition\u003csup\u003e\u003cspan class=\"CitationRef\"\u003e70\u003c/span\u003e\u003c/sup\u003e using maximum expected accuracy (MEA). 
Each nucleotide was further annotated with one of six structural elements: dangling start (F), dangling end (T), internal loop (I), hairpin loop (H), multibranched loop (M), and stem region (S) as previously described\u003csup\u003e\u003cspan class=\"CitationRef\"\u003e69\u003c/span\u003e\u003c/sup\u003e. The positional frequencies of these structural elements were then analyzed and compared between the two groups.\u003c/p\u003e\n\u003c/div\u003e\n\u003cdiv id=\"Sec22\" class=\"Section2\"\u003e\n \u003ch2\u003eCharacterizing SARS-CoV-2 replicase nsp12 at single-base resolution\u003c/h2\u003e\n \u003cp\u003eThe experimental binder sequence (wild type, 113-50H\u0026thinsp;+\u0026thinsp;L) and mutated variants (M1-5) of the nsp12 RNA binders were obtained from our previous work\u003csup\u003e\u003cspan class=\"CitationRef\"\u003e22\u003c/span\u003e\u003c/sup\u003e. Additionally, two mutants (M6-7) (Supplementary Table 9) were designed and analyzed in this study. Each base of the wild-type sequences was masked, and position-specific likelihoods were calculated. These likelihoods were then converted to nucleotide base probabilities using the softmax function. The representation probabilities of each base were compared with the wild-type sequence using the log odds ratio score to indicate binding preference. For simultaneous mutations, the collective effect was determined using the average score of model predictions, calculated as follows: where \u003cem\u003ex\u003c/em\u003e\u003csup\u003e\u003cem\u003ewt\u003c/em\u003e\u003c/sup\u003e and \u003cem\u003ex\u003c/em\u003e\u003csup\u003e\u003cem\u003emt\u003c/em\u003e\u003c/sup\u003e represent the wild-type and mutant sequences, respectively. 
\u003cem\u003ex\u003c/em\u003e\u003csub\u003e\u003cem\u003ei\u003c/em\u003e\u003c/sub\u003e refers to the nucleotide base at position \u003cem\u003ei\u003c/em\u003e, and \u003cem\u003ex\u003c/em\u003e\u003csub\u003e\u003cem\u003ei\u0026minus;1\u003c/em\u003e\u003c/sub\u003e represents the sequence with a mask applied to position \u003cem\u003ei\u003c/em\u003e. \u003cem\u003em\u003c/em\u003e denotes the count of mutations, and \u003cem\u003eM\u003c/em\u003e specifies their positions; for example, with mutations at positions 6 and 9, M = (6, 9).\u003c/p\u003e\n \u003cdiv id=\"Sec23\" class=\"Section3\"\u003e\n \u003ch2\u003eEvolutionary similarity analysis of mRNA and ncRNA across species\u003c/h2\u003e\n \u003cp\u003emRNA and non-coding RNA were obtained from the transcriptome database of various organisms: \u003cem\u003eHomo sapiens\u003c/em\u003e (human, Ensembl release 109), \u003cem\u003eMus musculus\u003c/em\u003e (mouse, Ensembl release 109), \u003cem\u003eDanio rerio\u003c/em\u003e (zebrafish, Ensembl release 109), \u003cem\u003eSaccharomyces cerevisiae\u003c/em\u003e (yeast, Ensembl release 109), \u003cem\u003eArabidopsis thaliana\u003c/em\u003e (plant, Ensembl release 57), 5508 randomly selected bacterial species (bacterial, Ensembl release 57), RNAcentral (rnacentral_active.fasta.gz from \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://ftp.ebi.ac.uk/pub/databases/RNAcentral\u003c/span\u003e\u003c/span\u003e), and RNA viruses (Cardiovirus, Cosavirus, Coxsackie, Rhinovirus, Poliovirus, Dengue, West Nile, Yellow Fever, Zika, H1N1, H3N2, Marburg, Ebola, Astrovirus, Chikungunya, Hantavirus, HIV, Lassa, Leishmania, Rabies) sourced from the NCBI viruses RefSeq release. Subsequently, the RNA sequences were segmented into hexamers using a sliding window approach. 
The hexamer abundance distribution of each species was then compared with that of the other species using Pearson\u0026rsquo;s correlation.\u003c/p\u003e\n \u003c/div\u003e\n\u003c/div\u003e\n\u003cdiv id=\"Sec24\" class=\"Section2\"\u003e\n \u003ch2\u003eExperimental binding affinity determination by bio-layer interferometry\u003c/h2\u003e\n \u003cp\u003eBio-layer interferometry (BLI) measurements were conducted using an Octet\u0026reg; R8 system (Sartorius). Prior to the equilibrium dissociation constant (\u003cem\u003eK\u003c/em\u003e\u003csub\u003e\u003cem\u003eD\u003c/em\u003e\u003c/sub\u003e) measurement, Octet Ni-NTA biosensors (Sartorius) were equilibrated in 1X ERBL buffer (2 mM Tris-HCl pH 7.5, 100 mM KCl, 5% (v/v) glycerol, 10 mM Mg(OAc)\u003csub\u003e2\u003c/sub\u003e, 1 mM TCEP, 0.02% TWEEN 20 (Carl Roth)) for approximately 5 minutes. A two-fold dilution series of each RNA ligand (RNA was \u003cem\u003ein vitro\u003c/em\u003e transcribed from the corresponding double-stranded DNA template and then purified by 10% PAGE) in 1X ERBL buffer was prepared, with 1X ERBL buffer without RNA serving as a reference. Protein loading was performed using 20 ng/\u0026micro;L of nsp12-His\u003csub\u003e10\u003c/sub\u003e. The assay comprised a 60-second baseline-1 step, a 180\u0026ndash;240-second protein loading step, a 60-second baseline-2 step, a 900-second association step, and a 600-second dissociation step. Data analysis was conducted using the Octet Data Analysis software, involving pre-processing steps such as reference subtraction, y-axis alignment based on the average of the baseline, inter-step correction by dissociation, and Savitzky-Golay filtering. 
A 1:1 binding model was applied, and the \u003cem\u003eK\u003c/em\u003e\u003csub\u003e\u003cem\u003eD\u003c/em\u003e\u003c/sub\u003e was calculated by curve fitting (either local or global).\u003c/p\u003e\n \u003cdiv id=\"Sec25\" class=\"Section3\"\u003e\n \u003ch2\u003eContinued pre-training of UltraGen for diverse data sources\u003c/h2\u003e\n \u003cp\u003eTo enhance the model\u0026apos;s capability to integrate diverse RNA binding information, the base UltraGen model underwent further pre-training, resulting in four additional variants. The first variant, UltraGen\u003csup\u003emolecules\u003c/sup\u003e, was further pre-trained on 3.24\u0026nbsp;million RNA species from the 12 SELEX targets\u0026rsquo; training sets (Supplementary Tables\u0026nbsp;3\u0026ndash;5). The second variant, UltraGen\u003csup\u003e3UTR\u003c/sup\u003e, was further pre-trained on 1.69\u0026nbsp;million endogenous RNA sequences from the preliminary 3\u0026rsquo;-UTR full training datasets (Supplementary Table\u0026nbsp;6). The third variant, UltraGen\u003csup\u003eplus\u003c/sup\u003e, was developed as the comprehensive model by integrating all data used for the previous two variants. The fourth variant, UltraGen\u003csup\u003eRIP\u003c/sup\u003e, was further pre-trained on 6.88\u0026nbsp;million RNA species from human RIP-Seq in ENCODE\u003csup\u003e\u003cspan class=\"CitationRef\"\u003e14\u003c/span\u003e\u003c/sup\u003e. For human RIP-Seq data preprocessing, nine ENCODE experiments were selected, and their genome-mapped BAM files (\u003cem\u003eENCFF660QLC, ENCFF041SHW, ENCFF241CCX, ENCFF956PEU, ENCFF879QSE, ENCFF844CIH, ENCFF023IWI, ENCFF257FPS, ENCFF492HQI\u003c/em\u003e) were downloaded from ENCODE. The properly mapped paired-end sequencing reads were extracted using samtools (v1.13) with the parameter -f 2. 
The sequences of these genome-mapped regions were extracted from the GRCh37/hg19 reference genome (UCSC) based on their mapping information. For sequences sharing the same left-most mapped position but differing in length, the most abundant one within a length range of 90 to 150 nt was collected, because the insert length of the RIP-Seq library is mostly within 150 nt. This process resulted in 6.88\u0026nbsp;million high-quality non-redundant RNA species for continued pre-training of the model.\u003c/p\u003e\n \u003cp\u003eEach model variant underwent an additional round of pre-training on its respective dataset, adhering to the pre-training kernel while building upon the base UltraGen model. This approach was designed to minimize the risk of overfitting while preserving the core knowledge embedded in the base model. Continued pre-training of UltraGen was performed on a single 32 GB NVIDIA V100 GPU with a batch size of 500, using the AdamW optimizer and a learning rate warm-up strategy. The learning rate was increased to 4e-4 over 2,000 steps, followed by cosine annealing for each data partition.\u003c/p\u003e\n \u003c/div\u003e\n \u003cdiv id=\"Sec26\" class=\"Section3\"\u003e\n \u003ch2\u003eAnalysis of single-cell 3\u0026rsquo; readout RNA sequencing of human lung adenocarcinoma\u003c/h2\u003e\n \u003cp\u003eWe performed single-cell analysis for human lung adenocarcinoma (LUAD) cells\u003csup\u003e\u003cspan class=\"CitationRef\"\u003e71\u003c/span\u003e\u003c/sup\u003e (BioProject ID: PRJNA973717), which contained five single-cell 3\u0026rsquo; readout RNA sequencing datasets (SRR24626848, SRR24626849, SRR24626850, SRR24626853, SRR24626854). For each dataset, we ran the NCBI fastq-dump utility with the --split-files argument to retrieve the corresponding FASTQ files. These retrieved FASTQ files were then renamed according to the bcl2fastq file naming convention to meet the requirements of Cell Ranger (v8.0.1). 
To build a custom reference for Cell Ranger, we ran the cellranger mkref command on the human genome GRCh38 data (\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://cf.10xgenomics.com/supp/cell-exp/refdata-gex-GRCh38-2024-A.tar.gz\u003c/span\u003e\u003c/span\u003e). The FASTQ files were then aligned to the reference genome with the cellranger count command, specifying the paths to the reference genome (GRCh38 with GENCODE v44) and the sequence data (e.g., SRR24626848). The resulting output files contained per-molecule information and feature barcode matrices, which were further aggregated by the cellranger aggr utility. This process produced a unified feature barcode matrix along with secondary analysis outputs, including clustered sequences and their two-dimensional coordinates. Clustering was performed using the t-SNE algorithm with default parameters: tsne_perplexity 30, tsne_theta 0.5, tsne_max_dims 2, tsne_max_iter 1000, tsne_stop_lying_iter 250, and tsne_mom_switch_iter 250. The clustered sequences were then annotated using Azimuth (0.5.0), an automated cell type annotation tool that utilizes a pre-annotated reference single-cell dataset, following a standard 10x Genomics annotation process (\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ewww.10xgenomics.com/cn/analysis-guides/automated-cell-type-annotation-from-r-to-loupe-using-louper\u003c/span\u003e\u003c/span\u003e). Briefly, the human lung reference data was downloaded using the command \u0026quot;InstallData(\u0026apos;lungref\u0026apos;)\u0026quot;. Azimuth was then used to compare the gene expression profile of each individual cell in the query dataset against the reference, assigning its cell type accordingly. The annotated Seurat object was subsequently converted into a .\u003cem\u003ecloupe\u003c/em\u003e file containing embedded dimension coordinates. 
Finally, the t-SNE plot at annotation level 3 was generated by loading the .cloupe file into the LoupeR program (1.1.1). The cell numbers for each cell type from the clustered datasets were as follows: T cell lineage \u0026minus;\u0026thinsp;29140, Innate lymphoid cell NK \u0026minus;\u0026thinsp;5025, Mast cells \u0026minus;\u0026thinsp;4048, B cell lineage \u0026minus;\u0026thinsp;3818, AT2 \u0026minus;\u0026thinsp;3634, Secretory \u0026minus;\u0026thinsp;3420, Fibroblasts \u0026minus;\u0026thinsp;3066, EC capillary \u0026minus;\u0026thinsp;2878, AT1 \u0026minus;\u0026thinsp;2507, EC venous \u0026minus;\u0026thinsp;2452, Macrophages \u0026minus;\u0026thinsp;1748, Dendritic cells \u0026minus;\u0026thinsp;1716, Monocytes \u0026minus;\u0026thinsp;1635, EC arterial \u0026minus;\u0026thinsp;1211, None \u0026minus;\u0026thinsp;343, Multiciliated lineage \u0026minus;\u0026thinsp;330, Myofibroblasts \u0026minus;\u0026thinsp;271, Lymphatic EC differentiating \u0026minus;\u0026thinsp;108, Basal \u0026minus;\u0026thinsp;72, Lymphatic EC mature \u0026minus;\u0026thinsp;24, Ionocyte \u0026minus;\u0026thinsp;3, SM activated stress response \u0026minus;\u0026thinsp;1, Neuroendocrine \u0026minus;\u0026thinsp;1.\u003c/p\u003e\n \u003cp\u003eFor lung tissue specificity analysis, we extracted the 3\u0026apos;-end reads by cutting off the longest continuous run of T bases at their 5\u0026apos;-end. The resulting reads were then assigned to the corresponding cell types based on their 5\u0026apos;-end cell barcodes. These reads were subsequently processed as previously described for the SAPAS 3\u0026apos;UTR analysis\u003csup\u003e\u003cspan class=\"CitationRef\"\u003e59\u003c/span\u003e\u003c/sup\u003e. In short, the 3\u0026apos;-end sequencing reads underwent quality control, genome mapping (hg19 from UCSC; bowtie -v 3 -k 2 --best), internal priming filtering, and removal of sequences duplicated in the 22-tissue training dataset. The tandem 100 nt RNA sequences extracted from mapped loci were analyzed using the UltraGen model. 
Clustered cells were annotated with the average logistic regression coefficient of five biological samples, with high predicted values indicating lung tissue specificity (lung as the top-ranked class or a lung probability (range 0\u0026ndash;1) above 0.9).\u003c/p\u003e\n \u003c/div\u003e\n\u003c/div\u003e\n\u003cp\u003e\u003cstrong\u003eMethods-only references\u003c/strong\u003e\u003c/p\u003e\n\u003col start=\"51\"\u003e\n \u003cli\u003eDevlin, J., Chang, M.-W., Lee, K. \u0026amp; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. \u003cem\u003eProceedings of NAACL-HLT\u003c/em\u003e \u003cstrong\u003e1\u003c/strong\u003e, 4171\u0026ndash;4186 (2019).\u003c/li\u003e\n \u003cli\u003eFamulok, M. Molecular Recognition of Amino Acids by RNA-Aptamers: An L-Citrulline Binding RNA Motif and Its Evolution into an L-Arginine Binder. \u003cem\u003eJ Am Chem Soc\u003c/em\u003e \u003cstrong\u003e116\u003c/strong\u003e, 1698-1706 (1994).\u003c/li\u003e\n \u003cli\u003eDavis, J.H. \u0026amp; Szostak, J.W. Isolation of high-affinity GTP aptamers from partially structured RNA libraries. \u003cem\u003eProc Natl Acad Sci U S A\u003c/em\u003e \u003cstrong\u003e99\u003c/strong\u003e, 11616-11621 (2002).\u003c/li\u003e\n \u003cli\u003eFu, L., Niu, B., Zhu, Z., Wu, S. \u0026amp; Li, W. CD-HIT: accelerated for clustering the next-generation sequencing data. \u003cem\u003eBioinformatics\u003c/em\u003e \u003cstrong\u003e28\u003c/strong\u003e, 3150-3152 (2012).\u003c/li\u003e\n \u003cli\u003eJoshi, M. et al. SpanBERT: Improving Pre-training by Representing and Predicting Spans. \u003cem\u003eTransactions of the Association for Computational Linguistics\u003c/em\u003e \u003cstrong\u003e8\u003c/strong\u003e, 64-77 (2020).\u003c/li\u003e\n \u003cli\u003eMcInnes, L., Healy, J., Saul, N. \u0026amp; Gro\u0026szlig;berger, L. UMAP: Uniform Manifold Approximation and Projection. 
\u003cem\u003eJournal of Open Source Software\u003c/em\u003e \u003cstrong\u003e3\u003c/strong\u003e (2018).\u003c/li\u003e\n \u003cli\u003eBoussebayle, A., Groher, F. \u0026amp; Suess, B. RNA-based Capture-SELEX for the selection of small molecule-binding aptamers. \u003cem\u003eMethods\u003c/em\u003e \u003cstrong\u003e161\u003c/strong\u003e, 10-15 (2019).\u003c/li\u003e\n \u003cli\u003eSunbul, M. et al. Super-resolution RNA imaging using a rhodamine-binding aptamer with fast exchange kinetics. \u003cem\u003eNat Biotechnol\u003c/em\u003e \u003cstrong\u003e39\u003c/strong\u003e, 686-690 (2021).\u003c/li\u003e\n \u003cli\u003eFu, Y. et al. Differential genome-wide profiling of tandem 3\u0026apos; UTRs among human breast cancer and normal cells by high-throughput sequencing. \u003cem\u003eGenome Res\u003c/em\u003e \u003cstrong\u003e21\u003c/strong\u003e, 741-747 (2011).\u003c/li\u003e\n \u003cli\u003ePatkar, S. et al. Hard wiring of normal tissue-specific chromosome-wide gene expression levels is an additional factor driving cancer type-specific aneuploidies. \u003cem\u003eGenome Med\u003c/em\u003e \u003cstrong\u003e13\u003c/strong\u003e, 93 (2021).\u003c/li\u003e\n \u003cli\u003eZarnack, K. et al. Direct competition between hnRNP C and U2AF65 protects the transcriptome from the exonization of Alu elements. \u003cem\u003eCell\u003c/em\u003e \u003cstrong\u003e152\u003c/strong\u003e, 453-466 (2013).\u003c/li\u003e\n \u003cli\u003eKonig, J. et al. iCLIP reveals the function of hnRNP particles in splicing at individual nucleotide resolution. \u003cem\u003eNat Struct Mol Biol\u003c/em\u003e \u003cstrong\u003e17\u003c/strong\u003e, 909-915 (2010).\u003c/li\u003e\n \u003cli\u003eRossbach, O. et al. Crosslinking-immunoprecipitation (iCLIP) analysis reveals global regulatory roles of hnRNP L. \u003cem\u003eRNA Biol\u003c/em\u003e \u003cstrong\u003e11\u003c/strong\u003e, 146-155 (2014).\u003c/li\u003e\n \u003cli\u003eHussain, S. et al. 
NSun2-mediated cytosine-5 methylation of vault noncoding RNA determines its processing into regulatory small RNAs. \u003cem\u003eCell Rep\u003c/em\u003e \u003cstrong\u003e4\u003c/strong\u003e, 255-261 (2013).\u003c/li\u003e\n \u003cli\u003eTollervey, J.R. et al. Characterizing the RNA targets and position-dependent splicing regulation by TDP-43. \u003cem\u003eNat Neurosci\u003c/em\u003e \u003cstrong\u003e14\u003c/strong\u003e, 452-458 (2011).\u003c/li\u003e\n \u003cli\u003eWang, Z. et al. iCLIP predicts the dual splicing effects of TIA-RNA interactions. \u003cem\u003ePLoS Biol\u003c/em\u003e \u003cstrong\u003e8\u003c/strong\u003e, e1000530 (2010).\u003c/li\u003e\n \u003cli\u003eYang, Y.C. et al. CLIPdb: a CLIP-seq database for protein-RNA interactions. \u003cem\u003eBMC Genomics\u003c/em\u003e \u003cstrong\u003e16\u003c/strong\u003e, 51 (2015).\u003c/li\u003e\n \u003cli\u003eTang, Y. et al. m6A-Atlas: a comprehensive knowledgebase for unraveling the N6-methyladenosine (m6A) epitranscriptome. \u003cem\u003eNucleic Acids Res\u003c/em\u003e \u003cstrong\u003e49\u003c/strong\u003e, D134-D143 (2021).\u003c/li\u003e\n \u003cli\u003eYamada, K. \u0026amp; Hamada, M. Prediction of RNA-protein interactions using a nucleotide language model. \u003cem\u003eBioinform Adv\u003c/em\u003e \u003cstrong\u003e2\u003c/strong\u003e, vbac023 (2022).\u003c/li\u003e\n \u003cli\u003eZhang, H., Zhang, L., Mathews, D.H. \u0026amp; Huang, L. LinearPartition: linear-time approximation of RNA folding partition function and base-pairing probabilities. \u003cem\u003eBioinformatics\u003c/em\u003e \u003cstrong\u003e36\u003c/strong\u003e, i258-i267 (2020).\u003c/li\u003e\n \u003cli\u003eFan, F. et al. Elevated Mast Cell Abundance Is Associated with Enrichment of CCR2+ Cytotoxic T Cells and Favorable Prognosis in Lung Adenocarcinoma. 
\u003cem\u003eCancer Res\u003c/em\u003e \u003cstrong\u003e83\u003c/strong\u003e, 2690-2703 (2023).\u003c/li\u003e\n\u003c/ol\u003e"},{"header":"Declarations","content":"\u003cp\u003eData Availability\u003c/p\u003e\n\u003cp\u003eUltraSelex HTS raw data supporting the findings of this study are available for academic use in the online repository (https://www.ultrarnalab.com). A mirror data repository can be accessed on NCBI Sequence Read Archive under BioProject (PRJNA1216547). Other HTS data utilized in this study are available on Zenodo (https://doi.org/10.5281/zenodo.15294875).\u003c/p\u003e\n\u003cp\u003eCode Availability\u003c/p\u003e\n\u003cp\u003eThe code is freely available at CodeOcean (https://codeocean.com/capsule/1240603/tree/v1) as well as the online repository (https://www.ultrarnalab.com).\u003c/p\u003e\n\u003cp\u003eAcknowledgements\u003c/p\u003e\n\u003cp\u003eWe thank BAAI-Health Center members, S. Li (EMBL) and M. M\u0026ouml;hler (IPMB) for discussions; B. Suess (TU Darmstadt), J. Taipale (University of Cambridge), M. Meyer (Boston College), D. Burke (University of Missouri), H. Craighead (Cornell University) for access to SELEX library information. W. Nickel (Heidelberg University) for access to BLI-Octet; D. Ibberson (Heidelberg University) and V. Benes (EMBL) for access to HTS; Y. Wu (University of Heidelberg) for aesthetic figures; BAAI-JIUDING for GPU cluster computation resources and data storage. SDS@hd for data sharing and bwForCluster Helix for CPU cluster computation resources.\u003c/p\u003e\n\n\u003cp\u003eContributions\u003c/p\u003e\n\u003cp\u003eY.Z., H.W., Z.C., and Q.Y. conceptualized the study. H.W., Z.C., W.H., and Y.Z. performed UltraGen modeling and data analysis. Y.Z., Y.J., and J.Z. constructed HTS RNA libraries and measured ligand affinities. W.L. and Y.Z. simulated RNA ligand interaction. H.W. and H.X. constructed UltraGen API server. Y.Z., H.W., and Z.C. wrote the original draft. Y.Z. supervised this study. 
A.J., Y.F., and all authors reviewed and edited the draft.\u003c/p\u003e\n\n\u003cp\u003eCompeting interests\u003c/p\u003e\n\u003cp\u003eThe authors declare no competing interests.\u003c/p\u003e\n\n\n"},{"header":"References","content":"\u003col\u003e\n\u003cli\u003eDuss, O., Stepanyuk, G.A., Puglisi, J.D. \u0026amp; Williamson, J.R. Transient Protein-RNA Interactions Guide Nascent Ribosomal RNA Folding. \u003cem\u003eCell\u003c/em\u003e \u003cstrong\u003e179\u003c/strong\u003e, 1357-1369 e1316 (2019).\u003c/li\u003e\n\u003cli\u003eVan Treeck, B. \u0026amp; Parker, R. Emerging Roles for Intermolecular RNA-RNA Interactions in RNP Assemblies. \u003cem\u003eCell\u003c/em\u003e \u003cstrong\u003e174\u003c/strong\u003e, 791-802 (2018).\u003c/li\u003e\n\u003cli\u003eVan Nostrand, E.L. et al. A large-scale binding and functional map of human RNA-binding proteins. \u003cem\u003eNature\u003c/em\u003e \u003cstrong\u003e583\u003c/strong\u003e, 711-719 (2020).\u003c/li\u003e\n\u003cli\u003eZhang, J., Wang, L., Jaschke, A. \u0026amp; Sunbul, M. A Color-Shifting Near-Infrared Fluorescent Aptamer-Fluorophore Module for Live-Cell RNA Imaging. \u003cem\u003eAngew Chem Int Ed Engl\u003c/em\u003e \u003cstrong\u003e60\u003c/strong\u003e, 21441-21448 (2021).\u003c/li\u003e\n\u003cli\u003eBoussebayle, A. et al. Next-level riboswitch development-implementation of Capture-SELEX facilitates identification of a new synthetic riboswitch. \u003cem\u003eNucleic Acids Res\u003c/em\u003e \u003cstrong\u003e47\u003c/strong\u003e, 4883-4895 (2019).\u003c/li\u003e\n\u003cli\u003eAmeta, S., Winz, M.L., Previti, C. \u0026amp; Jaschke, A. Next-generation sequencing reveals how RNA catalysts evolve from random space. \u003cem\u003eNucleic Acids Res\u003c/em\u003e \u003cstrong\u003e42\u003c/strong\u003e, 1303-1310 (2014).\u003c/li\u003e\n\u003cli\u003eJolma, A. et al. Binding specificities of human RNA-binding proteins toward structured and linear RNA sequences. 
\u003cem\u003eGenome Res\u003c/em\u003e \u003cstrong\u003e30\u003c/strong\u003e, 962-973 (2020).\u003c/li\u003e\n\u003cli\u003ePei, S., Slinger, B.L. \u0026amp; Meyer, M.M. Recognizing RNA structural motifs in HT-SELEX data for ribosomal protein S15. \u003cem\u003eBMC Bioinformatics\u003c/em\u003e \u003cstrong\u003e18\u003c/strong\u003e, 298 (2017).\u003c/li\u003e\n\u003cli\u003eWhatley, A.S. et al. Potent Inhibition of HIV-1 Reverse Transcriptase and Replication by Nonpseudoknot, \u0026quot;UCAA-motif\u0026quot; RNA Aptamers. \u003cem\u003eMol Ther Nucleic Acids\u003c/em\u003e \u003cstrong\u003e2\u003c/strong\u003e, e71 (2013).\u003c/li\u003e\n\u003cli\u003eCamorani, S. et al. Novel Aptamers Selected on Living Cells for Specific Recognition of Triple-Negative Breast Cancer. \u003cem\u003eiScience\u003c/em\u003e \u003cstrong\u003e23\u003c/strong\u003e, 100979 (2020).\u003c/li\u003e\n\u003cli\u003eNguyen Quang, N., Bouvier, C., Henriques, A., Lelandais, B. \u0026amp; Duconge, F. Time-lapse imaging of molecular evolution by high-throughput sequencing. \u003cem\u003eNucleic Acids Res\u003c/em\u003e \u003cstrong\u003e46\u003c/strong\u003e, 7480-7494 (2018).\u003c/li\u003e\n\u003cli\u003eDe La Fuente, A. et al. Aptamers against mouse and human tumor-infiltrating myeloid cells as reagents for targeted chemotherapy. \u003cem\u003eSci Transl Med\u003c/em\u003e \u003cstrong\u003e12\u003c/strong\u003e (2020).\u003c/li\u003e\n\u003cli\u003eVan Simaeys, D. et al. RNA aptamers specific for transmembrane p24 trafficking protein 6 and Clusterin for the targeted delivery of imaging reagents and RNA therapeutics to human beta cells. \u003cem\u003eNat Commun\u003c/em\u003e \u003cstrong\u003e13\u003c/strong\u003e, 1815 (2022).\u003c/li\u003e\n\u003cli\u003eConsortium, E.P. An integrated encyclopedia of DNA elements in the human genome. 
\u003cem\u003eNature\u003c/em\u003e \u003cstrong\u003e489\u003c/strong\u003e, 57-74 (2012).\u003c/li\u003e\n\u003cli\u003eStrazar, M., Zitnik, M., Zupan, B., Ule, J. \u0026amp; Curk, T. Orthogonal matrix factorization enables integrative analysis of multiple RNA binding proteins. \u003cem\u003eBioinformatics\u003c/em\u003e \u003cstrong\u003e32\u003c/strong\u003e, 1527-1535 (2016).\u003c/li\u003e\n\u003cli\u003eAlipanahi, B., Delong, A., Weirauch, M.T. \u0026amp; Frey, B.J. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. \u003cem\u003eNature Biotechnology\u003c/em\u003e \u003cstrong\u003e33\u003c/strong\u003e, 831-838 (2015).\u003c/li\u003e\n\u003cli\u003eIshida, R. et al. RaptRanker: in silico RNA aptamer selection from HT-SELEX experiment based on local sequence and structure information. \u003cem\u003eNucleic Acids Res\u003c/em\u003e \u003cstrong\u003e48\u003c/strong\u003e, e82 (2020).\u003c/li\u003e\n\u003cli\u003eBashir, A. et al. Machine learning guided aptamer refinement and discovery. \u003cem\u003eNat Commun\u003c/em\u003e \u003cstrong\u003e12\u003c/strong\u003e, 2366 (2021).\u003c/li\u003e\n\u003cli\u003eChen, J.C. et al. Generating experimentally unrelated target molecule-binding highly functionalized nucleic-acid polymers using machine learning. \u003cem\u003eNat Commun\u003c/em\u003e \u003cstrong\u003e13\u003c/strong\u003e, 4541 (2022).\u003c/li\u003e\n\u003cli\u003eIwano, N., Adachi, T., Aoki, K., Nakamura, Y. \u0026amp; Hamada, M. Generative aptamer discovery using RaptGen. \u003cem\u003eNature Computational Science\u003c/em\u003e \u003cstrong\u003e2\u003c/strong\u003e, 378-386 (2022).\u003c/li\u003e\n\u003cli\u003eRube, H.T. et al. Prediction of protein-ligand binding affinity from sequencing data with interpretable machine learning. \u003cem\u003eNature Biotechnology\u003c/em\u003e \u003cstrong\u003e40\u003c/strong\u003e, 1520-1527 (2022).\u003c/li\u003e\n\u003cli\u003eZhang, Y. et al. 
Single-step discovery of high-affinity RNA ligands by UltraSelex. \u003cem\u003eNat Chem Biol\u003c/em\u003e (2025).\u003c/li\u003e\n\u003cli\u003eMuller, F. et al. A prebiotically plausible scenario of an RNA-peptide world. \u003cem\u003eNature\u003c/em\u003e \u003cstrong\u003e605\u003c/strong\u003e, 279-284 (2022).\u003c/li\u003e\n\u003cli\u003eLincoln, T.A. \u0026amp; Joyce, G.F. Self-sustained replication of an RNA enzyme. \u003cem\u003eScience\u003c/em\u003e \u003cstrong\u003e323\u003c/strong\u003e, 1229-1232 (2009).\u003c/li\u003e\n\u003cli\u003eCorley, M., Burns, M.C. \u0026amp; Yeo, G.W. How RNA-Binding Proteins Interact with RNA: Molecules and Mechanisms. \u003cem\u003eMol Cell\u003c/em\u003e \u003cstrong\u003e78\u003c/strong\u003e, 9-29 (2020).\u003c/li\u003e\n\u003cli\u003eSu, J. et al. RoFormer: Enhanced transformer with Rotary Position Embedding. \u003cem\u003eNeurocomputing\u003c/em\u003e \u003cstrong\u003e568\u003c/strong\u003e (2024).\u003c/li\u003e\n\u003cli\u003eChen, J. et al. Interpretable RNA Foundation Model from Unannotated Data for Highly Accurate RNA Structure and Function Predictions. \u003cem\u003earXiv preprint\u003c/em\u003e (2022).\u003c/li\u003e\n\u003cli\u003eWirth, R., Gao, P., Nienhaus, G.U., Sunbul, M. \u0026amp; J\u0026auml;schke, A. SiRA: A Silicon Rhodamine-Binding Aptamer for Live-Cell Super-Resolution RNA Imaging. \u003cem\u003eJ Am Chem Soc\u003c/em\u003e \u003cstrong\u003e141\u003c/strong\u003e, 7562-7571 (2019).\u003c/li\u003e\n\u003cli\u003eWang, X., Zhang, M., Long, C., Yao, L. \u0026amp; Zhu, M. Self-Attention Based Neural Network for Predicting RNA-Protein Binding Sites. \u003cem\u003eIEEE/ACM Trans Comput Biol Bioinform\u003c/em\u003e \u003cstrong\u003e20\u003c/strong\u003e, 1469-1479 (2023).\u003c/li\u003e\n\u003cli\u003eAkiyama, M. \u0026amp; Sakakibara, Y. Informative RNA base embedding for RNA structural alignment and clustering by deep representation learning. 
\u003cem\u003eNAR Genom Bioinform\u003c/em\u003e \u003cstrong\u003e4\u003c/strong\u003e, lqac012 (2022).\u003c/li\u003e\n\u003cli\u003eMayr, C. Evolution and Biological Roles of Alternative 3\u0026prime;UTRs. \u003cem\u003eTrends in Cell Biology\u003c/em\u003e \u003cstrong\u003e26\u003c/strong\u003e, 227-237 (2016).\u003c/li\u003e\n\u003cli\u003eYou, L. et al. APASdb: a database describing alternative poly(A) sites and selection of heterogeneous cleavage sites downstream of poly(A) signals. \u003cem\u003eNucleic Acids Res\u003c/em\u003e \u003cstrong\u003e43\u003c/strong\u003e, D59-67 (2015).\u003c/li\u003e\n\u003cli\u003eYang, Y. et al. Deciphering 3\u0026apos;UTR Mediated Gene Regulation Using Interpretable Deep Representation Learning. \u003cem\u003eAdv Sci (Weinh)\u003c/em\u003e, e2407013 (2024).\u003c/li\u003e\n\u003cli\u003eGruber, A.R. et al. Global 3\u0026prime; UTR shortening has a limited effect on protein abundance in proliferating T cells. \u003cem\u003eNature Communications\u003c/em\u003e \u003cstrong\u003e5\u003c/strong\u003e (2014).\u003c/li\u003e\n\u003cli\u003eMalone, B., Urakova, N., Snijder, E.J. \u0026amp; Campbell, E.A. Structures and functions of coronavirus replication-transcription complexes and their relevance for SARS-CoV-2 drug design. \u003cem\u003eNat Rev Mol Cell Biol\u003c/em\u003e \u003cstrong\u003e23\u003c/strong\u003e, 21-39 (2022).\u003c/li\u003e\n\u003cli\u003eSalgado, D.M. et al. Heart and skeletal muscle are targets of dengue virus infection. \u003cem\u003ePediatr Infect Dis J\u003c/em\u003e \u003cstrong\u003e29\u003c/strong\u003e, 238-242 (2010).\u003c/li\u003e\n\u003cli\u003eTakeda, M. et al. A human lung carcinoma cell line supports efficient measles virus growth and syncytium formation via a SLAM- and CD46-independent mechanism. \u003cem\u003eJ Virol\u003c/em\u003e \u003cstrong\u003e81\u003c/strong\u003e, 12091-12096 (2007).\u003c/li\u003e\n\u003cli\u003eOldstone, M.B. et al. 
Measles virus infection in a transgenic model: virus-induced immunosuppression and central nervous system disease. \u003cem\u003eCell\u003c/em\u003e \u003cstrong\u003e98\u003c/strong\u003e, 629-640 (1999).\u003c/li\u003e\n\u003cli\u003eGrant, R.A. et al. Circuits between infected macrophages and T cells in SARS-CoV-2 pneumonia. \u003cem\u003eNature\u003c/em\u003e \u003cstrong\u003e590\u003c/strong\u003e, 635-641 (2021).\u003c/li\u003e\n\u003cli\u003evan Zundert, G.C.P. et al. The HADDOCK2.2 Web Server: User-Friendly Integrative Modeling of Biomolecular Complexes. \u003cem\u003eJ Mol Biol\u003c/em\u003e \u003cstrong\u003e428\u003c/strong\u003e, 720-725 (2016).\u003c/li\u003e\n\u003cli\u003eAbramson, J. et al. Accurate structure prediction of biomolecular interactions with AlphaFold 3. \u003cem\u003eNature\u003c/em\u003e (2024).\u003c/li\u003e\n\u003cli\u003eWang, S. et al. Dynamic regulation and functions of mRNA m6A modification. \u003cem\u003eCancer Cell Int\u003c/em\u003e \u003cstrong\u003e22\u003c/strong\u003e, 48 (2022).\u003c/li\u003e\n\u003cli\u003eJing, Q. et al. Involvement of microRNA in AU-rich element-mediated mRNA instability. \u003cem\u003eCell\u003c/em\u003e \u003cstrong\u003e120\u003c/strong\u003e, 623-634 (2005).\u003c/li\u003e\n\u003cli\u003eSandberg, R., Neilson, J.R., Sarma, A., Sharp, P.A. \u0026amp; Burge, C.B. Proliferating cells express mRNAs with shortened 3\u0026apos; untranslated regions and fewer microRNA target sites. \u003cem\u003eScience\u003c/em\u003e \u003cstrong\u003e320\u003c/strong\u003e, 1643-1647 (2008).\u003c/li\u003e\n\u003cli\u003eMitschka, S. \u0026amp; Mayr, C. Context-specific regulation and function of mRNA alternative polyadenylation. \u003cem\u003eNat Rev Mol Cell Biol\u003c/em\u003e \u003cstrong\u003e23\u003c/strong\u003e, 779-796 (2022).\u003c/li\u003e\n\u003cli\u003eZou, T. et al. 
Polyamines regulate the stability of JunD mRNA by modulating the competitive binding of its 3\u0026apos; untranslated region to HuR and AUF1. \u003cem\u003eMol Cell Biol\u003c/em\u003e \u003cstrong\u003e30\u003c/strong\u003e, 5021-5032 (2010).\u003c/li\u003e\n\u003cli\u003eBrant, A.C., Tian, W., Majerciak, V., Yang, W. \u0026amp; Zheng, Z.M. SARS-CoV-2: from its discovery to genome structure, transcription, and replication. \u003cem\u003eCell Biosci\u003c/em\u003e \u003cstrong\u003e11\u003c/strong\u003e, 136 (2021).\u003c/li\u003e\n\u003cli\u003eLee, C.W., Li, L. \u0026amp; Giedroc, D.P. The solution structure of coronaviral stem-loop 2 (SL2) reveals a canonical CUYG tetraloop fold. \u003cem\u003eFEBS Lett\u003c/em\u003e \u003cstrong\u003e585\u003c/strong\u003e, 1049-1053 (2011).\u003c/li\u003e\n\u003cli\u003eStroup, E.K. \u0026amp; Ji, Z. Deep learning of human polyadenylation sites at nucleotide resolution reveals molecular determinants of site usage and relevance in disease. \u003cem\u003eNat Commun\u003c/em\u003e \u003cstrong\u003e14\u003c/strong\u003e, 7378 (2023).\u003c/li\u003e\n\u003cli\u003eGriesemer, D. et al. Genome-wide functional screen of 3\u0026apos;UTR variants uncovers causal variants for human disease and evolution. 
\u003cem\u003eCell\u003c/em\u003e \u003cstrong\u003e184\u003c/strong\u003e, 5247-5260 e5219 (2021).\u003c/li\u003e\n\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":true,"hideJournal":false,"highlight":"","institution":"","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"[email protected]","identity":"nature-portfolio","isNatureJournal":true,"hasQc":false,"allowDirectSubmit":false,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"","title":"Nature Portfolio","twitterHandle":"","acdcEnabled":false,"dfaEnabled":false,"editorialSystem":"ejp","reportingPortfolio":"","inReviewEnabled":true,"inReviewRevisionsEnabled":false},"keywords":"","lastPublishedDoi":"10.21203/rs.3.rs-4461517/v2","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-4461517/v2","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003eRNA plays multifaceted roles in catalytic reactions and gene regulation. The sequence-encoded binding language across diverse RNA-target interactomes is high-dimensional and complex. Here, we introduce UltraGen, an RNA language model designed to capture RNA binding properties. Utilizing fine-grained self-learning, UltraGen identifies RNA aptamers for a wide range of target sizes, including small molecules, proteins, cells, and tissues. 
Additionally, UltraGen discerns tissue specificity for millions of RNA species across 22 human organs based on their 3\u0026rsquo;-UTR sequences, predicts the tropism of human-pathogenic RNA viruses, and characterizes SARS-CoV-2 replicase RNA binding at single-base resolution.\u003c/p\u003e","manuscriptTitle":"Decoding the RNA binding systems by UltraGen","msid":"","msnumber":"","nonDraftVersions":[{"code":2,"date":"2025-05-08 05:32:53","doi":"10.21203/rs.3.rs-4461517/v2","editorialEvents":[],"status":"published","journal":{"display":true,"email":"[email protected]","identity":"nature-machine-intelligence","isNatureJournal":true,"hasQc":false,"allowDirectSubmit":false,"externalIdentity":"natmachintell","sideBox":"Learn more about [Nature Machine Intelligence](http://www.nature.com/natmachintell/)","snPcode":"","submissionUrl":"","title":"Nature Machine Intelligence","twitterHandle":"","acdcEnabled":true,"dfaEnabled":true,"editorialSystem":"ejp","reportingPortfolio":"Nature Research","inReviewEnabled":true,"inReviewRevisionsEnabled":false}}],"origin":"","ownerIdentity":"7b7119f1-4e6d-4730-8234-08b3d204202c","owner":[],"postedDate":"May 8th, 2025","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"under-review","subjectAreas":[{"id":48172622,"name":"Biological sciences/Computational biology and bioinformatics/Machine learning"},{"id":48172623,"name":"Biological sciences/Biochemistry/RNA"},{"id":48172624,"name":"Biological sciences/Biological techniques/Sequencing/RNA sequencing"}],"tags":[],"updatedAt":"2025-06-05T14:12:04+00:00","versionOfRecord":[],"versionCreatedAt":"2025-05-08 
05:32:53","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v2","identity":"rs-4461517","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-4461517","identity":"rs-4461517","version":["v2"]},"buildId":"XKTyCvWXoU3ODBz1xrDgd","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}
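
Worked sketch for the single-base nsp12 scoring described in the Methods. The scoring formula itself did not survive extraction, so this is only a minimal sketch under stated assumptions: the score is taken to be the mean, over the mutated positions, of log p(mutant base) minus log p(wild-type base), with probabilities obtained by a softmax over the model's logits at the masked position. A log-probability ratio is used here as a stand-in for the paper's log odds ratio, and the function names and toy logits are hypothetical, not UltraGen's API.

```python
import math

BASES = "ACGU"  # assumed ordering of the four logits per position


def softmax(logits):
    """Convert raw per-base logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mutation_score(wt_seq, mt_seq, masked_logits):
    """Average log-ratio score over the positions where mt differs from wt.

    masked_logits[i] holds four logits (order A, C, G, U) that a model
    returned for position i when that position was masked in the wild type.
    This list is an assumed stand-in for real model output.
    """
    positions = [i for i, (w, m) in enumerate(zip(wt_seq, mt_seq)) if w != m]
    if not positions:
        return 0.0
    total = 0.0
    for i in positions:
        probs = softmax(masked_logits[i])
        p_wt = probs[BASES.index(wt_seq[i])]
        p_mt = probs[BASES.index(mt_seq[i])]
        total += math.log(p_mt / p_wt)
    return total / len(positions)


# Toy input: uniform logits everywhere except position 1, where G is favored.
wt = "ACGU"
mt = "AGGU"
logits = [[0.0, 0.0, 0.0, 0.0] for _ in wt]
logits[1] = [0.0, 0.0, 1.0, 0.0]
score = mutation_score(wt, mt, logits)
```

A positive score means the model assigns the mutant bases a higher probability than the wild-type bases at the mutated positions; whether that corresponds to stronger binding depends on how the model was trained.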
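
The hexamer comparison in the evolutionary similarity analysis can be sketched as follows. This is a minimal illustration, assuming a window step of one base, relative abundances over all 4^6 possible hexamers, and a plain Pearson correlation between two species; the helper names and the toy sequences are hypothetical and the authors' actual pipeline may differ.

```python
import math
from collections import Counter
from itertools import product


def hexamer_counts(sequences, k=6):
    """Count every k-mer seen by sliding a window one base at a time."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper().replace("T", "U")  # treat DNA-style input as RNA
        for start in range(len(seq) - k + 1):
            counts[seq[start:start + k]] += 1
    return counts


def abundance_vector(counts, k=6):
    """Relative abundance of all 4**k possible k-mers, in a fixed order."""
    kmers = ["".join(p) for p in product("ACGU", repeat=k)]
    total = sum(counts[kmer] for kmer in kmers)
    return [counts[kmer] / total if total else 0.0 for kmer in kmers]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy transcript sets standing in for two species.
species_a = ["ACGUACGUAC", "GGGACGUACG"]
species_b = ["ACGUACGUUU", "CCCACGUACG"]
vec_a = abundance_vector(hexamer_counts(species_a))
vec_b = abundance_vector(hexamer_counts(species_b))
r = pearson(vec_a, vec_b)
```

Using a fixed ordering over all 4,096 hexamers keeps the two vectors aligned even when a hexamer is absent from one species.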
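
The learning-rate schedule stated for continued pre-training (warm-up to 4e-4 over 2,000 steps, then cosine annealing) can be sketched as follows. The text does not give the warm-up shape, the number of steps per data partition, or the final learning rate, so the linear warm-up, `total_steps=20000`, and `min_lr=0.0` below are assumptions chosen only for illustration.

```python
import math


def learning_rate(step, peak_lr=4e-4, warmup_steps=2000,
                  total_steps=20000, min_lr=0.0):
    """Linear warm-up to peak_lr, then cosine annealing down to min_lr.

    total_steps and min_lr are hypothetical; the schedule would restart
    from step 0 for each data partition under this reading of the text.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    span = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / span)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Sample the schedule at a few steps of one partition.
schedule = [learning_rate(s) for s in (0, 1000, 2000, 11000, 20000)]
```

In a training loop, the value returned for the current step would be written to the optimizer's learning rate before each update.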



License: CC-BY-4.0