A direct RNA-seq-based EBV Latency Transcriptome Offers Insights into the Biogenesis of EBV Gene Products

preprint OA: gold CC-BY-NC-4.0
📄 Open PDF Full text JSON View at publisher
Full text 85,609 characters · extracted from oa-pdf · 8 sections · click to expand

Abstract

Epstein-Barr virus (EBV) is a ubiquitous human pathogen that establishes lifelong persistence in B cells. To reach this persistent state, EBV activates and drives proliferation of the infected B cell. In vitro, these EBV-infected cells can grow almost indefinitely as a lymphoblastoid cell line (LCL). EBV’s transcription profile in LCLs (Latency III) produces six nuclear proteins (EBNAs), two latency membrane proteins (LMPs) and various non-coding RNAs, including microRNAs from the BHRF1 gene locus and the BamA-rightward transcripts (BARTs). The BART and EBNA transcription units are characterised by extensive alternative splicing. Here we have conducted a nanopore direct RNA-seq analysis of the LCL transcriptome of the widely used B95-8 BAC clone, including viruses with engineered variations (barcodes) in the first and last repeat of internal repeat 1 (IR1). Our pipeline ensures appropriate mapping of the W promoter (Wp) 5’ exon, and corrects W1-W2 exon counts that misalign to IR1. Characterisation of EBNA reads revealed that transcripts using Wp more frequently encoded the EBNA-LP start codon than Cp transcripts; identified a novel splice donor in exon W2 and polyA site just before the EBNA2 start codon; provided insights into BHRF1 miRNA processing; and suggested co-ordination between polyA site usage and alternative splicing decisions, although improved read depth and integrity are required to confirm this. The BAC region disrupts the integrity of BART transcripts through premature polyadenylation and cryptic splice sites in the antisense of the hygromycin expression cassette. Finally, we identified a few extended reads that cross established gene boundaries, running from EBNA, to BART to LMP2 gene regions, including some novel exon usage between EBNA1 and the BART promoter that may have more relevance or abundance in other phases of the EBV lifecycle. We provide annotations based on these findings to enable better characterisation of the B95-8 BAC transcriptome. .CC-BY-NC 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted April 23, 2025. ; https://doi.org/10.1101/2025.04.23.648900doi: bioRxiv preprint

Introduction

Epstein-Barr virus (EBV) is a ubiquitous human gamma-herpesvirus, that establishes life-long persistence in resting memory B cells in the vast majority of the adult population. EBV establishes a persistent latent state, transitioning to productive replication when these differentiate into plasma cells, or when infecting mucosal epithelia, where it is shed into body fluids such as saliva, which facilitates transmission. Immunodeficiency and/or mutations that disrupt normal virus-host relationships in these cell types can promote the wide range of B cell and epithelial malignancies in which EBV is implicated. Infection of B cells in vitro results in the activation and hyper-proliferation of B cells, that can lead to the outgrowth of transformed lymphoblastoid cell lines (LCLs). Different virus strains have different transformation efficiencies, with the lab strain B95-8 being widely used due to its ready ability to generate LCLs. This strain of EBV was cloned in an F-factor plasmid (better known as a bacterial artificial chromosome, or BAC) 25 years ago [1], enabling the genetic modification of the EBV genome in bacteria, as a reverse genetic system for studying EBV gene products. All herpesviruses are large DNA viruses, that generate a large number of transcripts, many of which overlap, with shared polyadenylation sites, which makes it challenging to accurately quantitate transcript abundance from short-read RNA-seq data. Nevertheless, considerable progress has been made in characterising the transcriptome of the productive cycle of EBV [2]. However, EBV has a particularly complex latency transcriptome, combining alternative splicing, polyadenylation and promoter usage to encode a diverse array of proteins and non-coding RNAs. The transcription programme found in LCLs (termed Latency III) has three main zones of transcription: The EBV nuclear antigen (EBNA) transcripts; the latency membrane protein (LMP) transcripts; and the BamHI A rightward transcripts (BARTs), shown schematically in figure 1. The EBNA transcripts are the first to be transcribed after infection. In genomic order: EBNA-LP (EBNA-leader protein) facilitates expression of the other virus genes by preventing genome silencing by antiviral proteins [3, 4]; EBNA2 initiates the activation and proliferation of the B cells [5], while the EBNA3 proteins induce epigenetically stable changes to gene expression that prevent apoptosis and facilitate germinal centre differentiation [6]. Finally, EBNA1 – through binding to the EBV latency replication origin, oriP – both facilitates viral gene expression and mediates the replication and segregation of the extra-chromosomal EBV genome in synchrony with the cell cycle. EBV transcription begins using a promoter (Wp) within the major internal repeats (IR1, or BamW repeats): there is a copy of Wp in each of the 3kb repeat units of IR1. Then, once newly produced EBNA1 has established oriP [7], and EBNA2 has bound to Cp [8] and to regulatory regions upstream of the first Wp [9], transcription largely switches from Wp to the BamC promoter (Cp) that lies just upstream of IR1. Both Cp and Wp-derived transcripts are alternatively spliced and polyadenylated to variously encode the six EBNA proteins and the pro-survival Bcl2 homologue, BHRF1 [10, 11]. In addition, three miRNAs (mirBHRF1-1,-2 and -3) are processed from sites either side of the BHRF1 ORF [12, 13], while some RNAs excised from pre-mRNAs are stable non-coding RNAs of unknown function [14, 15]. In vivo, as EBV progresses to its final latency state, Cp and Wp shut down, switching to Qp that makes only EBNA1 transcripts. .CC-BY-NC 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted April 23, 2025. ; https://doi.org/10.1101/2025.04.23.648900doi: bioRxiv preprint Figure 1. Schematic overview of main latency transcripts of B95-8 EBV. Gene and repeat structure of the virus genome (normally circular, but viewed with its ends at oriP) is shown above an overview of the main transcript structures. Gene and transcript arrows indicate transcription direction. Genes are colour-coded as EBV nuclear antigens (EBNA - red) and latency membrane proteins (LMPs – blue). EBNA promoters (Cp, Wp and Qp) and the U exon are labelled. Repeat regions (magenta) are family of repeats in oriP, internal repeats IR1 (Bam W repeat); IR2 (part of oriLyt); IR3 (glycine-alanine repeats in EBNA1), while B95-8 deletion (∆B95-8) loses IR4 and contains the BAC (not shown); Terminal repeats (TR) lie between LMPs. Note that some genes have more than the indicated number of splicing events. .CC-BY-NC 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted April 23, 2025. ; https://doi.org/10.1101/2025.04.23.648900doi: bioRxiv preprint On the other side of oriP from Cp are two genes for EBV-encoded RNAs (EBERs), encoding abundant RNA polymerase III-transcribed non-coding RNAs around 170 nucleotides in length [16, 17]. Slightly further from oriP, and spanning the viral terminal repeats, lie the latency membrane protein genes LMP1 and LMP2. The LMP1 and LMP2A proteins provide signals required for B cell survival during affinity maturation, while LMP2B lacks the LMP2A signalling domain through alternative promoter usage: A divergent promoter region gives rise to LMP1 and LMP2B transcripts [18], with the LMP2A promoter situated beyond the LMP1 pA site, regulated by a CTCF-binding region between LMP2A exon and LMP1 [19]. Finally, the Bam A rightward transcripts (BARTs) are a collection of alternatively spliced RNAs that span an approximately 20 kb region of the EBV genome [20]. The BART RNAs appear to be nuclear long non-coding RNAs [21, 22], but some have the potential to produce proteins (A73; RPMS1; RK-BARF0), although these proteins have yet to be detected in EBV-infected cells [21]. The BART introns produce around 40 miRNAs from 22 pre-miRNA stem-loops, mainly in two clusters [23, 24], that modulate a wide range of cell functions [25, 26]. However, much of the BART region is deleted in the commonly used B95-8 strain, and the B95-8 reverse genetic system has a bacterial artificial chromosome (BAC) and expression cassettes for GFP and hygromycin phosphotransferase inserted into this deletion, the impact of which is unknown. The complexity of the splicing in EBV makes the quantitation of distinct transcripts from RNA-seq data challenging, as simplifications of the transcriptome annotation may offer misleading indications of the abundance of different proteins. Therefore we aimed to develop a pipeline to accurately determine and quantify EBV transcript diversity, and assess any co-dependencies of alternative splicing options in the EBV latency transcriptome of the B95-8 BAC, using long-read direct RNA sequencing (dRNA-seq) developed by Oxford Nanopore. Directly sequencing the RNA molecules avoids amplification biases while potentially reading intact whole transcript molecules. This comes at the cost of base-calling accuracy, which is less developed for RNA than DNA, and complicated by the larger range of RNA modifications [27, 28]. Lower nucleotide accuracy is mitigated by the ability to sequence whole transcripts (except for 10-12 nucleotides at the 5’ end) in a single read, allowing quantification of different combinations of splice sites, transcription start sites and polyadenylation sites of individual reads, even across repetitive regions of the genome like IR1. Here we present a dRNA-seq pipeline with EBV-specific modifications (to cope with the small 5’ exon of Wp, and the repetitiveness of IR1 exons) to characterise the latency transcriptome of LCLs established with BAC-cloned B95-8. We demonstrate how this allows characterisation and quantitation of EBV’s most complex transcripts, and provide a detailed analysis of the B95-8 EBV-BAC’s transcriptiona and splicing profile in LCLs.

Results

Generation of a high quality long read transcriptome for EBV. To enable us to specifically characterise splicing of EBNA transcripts spanning IR1 exons, we generated a recombinant EBV genome (pMR11) containing synonymous SNPs (“barcodes”) in the first and last W2 exons of IR1 (Supporting Figure S1), to allow quantitation of how consistently the first and last W2 exon are included in such transcripts. Two LCLs were established from pMR11 EBV, as well .CC-BY-NC 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted April 23, 2025. ; https://doi.org/10.1101/2025.04.23.648900doi: bioRxiv preprint as an LCL from pHB9 (the parental B95-8 BAC which – like B95-8 – harbours a STOP codon at the end of one of its six W1 exons [29]) and its derivative WTw whose homogeneous IR1 lacks this W1 STOP codon [3]. All these BAC clones contain 6.6 repeat units in IR1 (i.e. 6 pairs of W1 and W2 exons). RNA from these LCLs was directly sequenced using an Oxford Nanopore MinION v9 flow cell, reads were base-called, and aligned to the EBV genome (to both the generic WTw sequence, and the matched pMR11 or pHB9 genomes – “self”) in a splice-aware manner to avoid spurious splice junctions. The origin of pHB9 was set within oriP (NC_007605 position 8315) to ensure all reads could align in full, as this was the only part of the EBV genome not crossed by transcripts (and aligners cannot use circular templates). A script was developed to correct for the failure of minimap2 to align 5’ ends of reads to exon W0. A second script was then used to identify commonly seen 5’ and 3’ ends (at least 5 reads across the four samples that start within a 5 nt window), and define reads with 5’ and 3’ ends within 20 nt either side of these positions as “full length”. Across the 4 LCLs, 3693 reads aligned to the EBV genome. Almost half of these were from the WTw-LCL, and across all four samples, the vast majority of reads mapped to latency loci (LMPs, EBNAs and BARTs) and GFP transgene. A few reads were classical ‘lytic’ transcripts, but their numbers were too few for any meaningful analysis in these study, There were also no EBER transcripts identified by our pipeline, most likely because RNA polymerase III transcripts lack the polyA sequence required by the nanopore library preparation. EBNA transcripts primarily use Cp, but Wp contributes significantly to EBNA- LP production. As illustrated in Figure 2A, the EBNA and BHRF1 loci are transcribed from either Cp or Wp, and the splice junction from promoter exons (W0 or C2) to the first IR1 exon (W1) has both alternative donor exons (W0, C2 or C2’ [30]) and alternative acceptors (W1 or W1’). Splicing of C2 or W0 to W1’ produces the AUG required for translation of EBNA-LP. Other splice junctions allow translation of downstream proteins [30]. Among reads that have both conventional 5’ Cp or Wp exons and a 3’ Y exon, 90% arise from Cp, and 10% from Wp (Figure 2B). Cp transcripts, we observed use of both C2 and C2’ exons (C2’ being found in 35-42% of Cp reads, consistent with the ≈35% in a PacBio RNA-seq of a B95-8 LCL [30]), while a very small number (≈1%) of Cp reads skipped exon C2. When the splice donor is W0 or C2’, there is about a 60:40 ratio of W1:W1’ splice acceptor usage. In contrast the C2 splice donor has a lower preference for the W1’ splice acceptor - approximately 75:25%. Overall, 18% of EBNA transcripts can initiate EBNA-LP translation, although Wp transcripts are more than twice as likely to encode EBNA-LP than Cp transcripts. It is, however, notable that the pHB9 LCL has a much lower frequency – under 7% – of EBNA-LP transcripts (Figure 2C). This does not reflect reduced usage of the W1’ splice acceptor, as C2’-W1’ splicing is similar to (or higher than) the other LCLs. Reviewing the reads aligned to the W exons (Supporting Figure S2) confirmed the presence of a STOP codon [29] in the first W1 exon of pHB9. This mutation in the first W1 exon from B95-8 was also found in two cDNA clones [31] and in PacBio reads from a B95-8 LCL (data from [30], which observed a similar 8% of Cp transcripts from B95-8 encoding EBNA-LP). This AUG-dependent reduction in transcript abundance, matched by reduced EBNA-LP protein levels in pHB9 LCLs (Figure 3A and [29]) suggests translation-associated nonsense-mediated decay of pHB9 transcripts with premature STOP codons. .CC-BY-NC 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted April 23, 2025. ; https://doi.org/10.1101/2025.04.23.648900doi: bioRxiv preprint B Figure 2. Promoter usage and alternative splicing of the EBNA transcripts. (A) Schematic diagram showing the main splice variants and their frequency (expressed as percentage of transcripts from all four samples combined splicing into or within IR1). Magenta splice junctions would retain the EBNA-LP open reading frame (B) Graph showing distribution of promoter splice junctions for all combined samples (as A) and separated by the four LCLs analysed. (C) The frequency of W1’ usage as the splice acceptor for different splice donors, for all reads combined (total) or individual LCLs made with the different viruses indicated. C A ∆ W1’ usage for different splice donors by sample C2-W1 W0-W1 EBNA -LP C2’-W1’ C2’-W1 .CC-BY-NC 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted April 23, 2025. ; https://doi.org/10.1101/2025.04.23.648900doi: bioRxiv preprint Splicing across IR1 is efficient, and produces a novel EBNA-LP splice variant. To define the origin of the ladder of different sized EBNA-LP proteins commonly seen on western blots, as seen for the four LCLs analysed (Figure 3A), we compared protein sizes with transcript structure. The WTw and MR11 LCLs produce EBNA-LP protein containing predominantly 6 W repeat domains (EBNA-LP- 6W), with progressively less EBNA-LP with fewer repeats, whereas the HB9 LCL contains very little EBNA-LP-6W, but more EBNA-LP-4W (Figure 3A). There are also traces of larger EBNA-LP protein isoforms. Densitometry suggests an approximately 3:1 ratio of 6W:5W in the two MR11 cell lines, and a lower 3:2 ratio for WTw (Figure 3B). All three lines show an approximately 2:1 ratio of 5 repeat to 4 repeat EBNA-LPs. We counted the numbers of W1 and W2 exons in reads derived from IR1 to examine how this size variation arises, using the length of reads (in nucleotides) to test the integrity of the alignment of reads to IR1. Around 25% of reads were >85 nucleotides longer than was expected from the number of exons aligned to by the read. This was either because of retained introns (presumably incompletely spliced pre-mRNAs) or a large insertion within a W2 exon, almost always approximately a multiple of the 198 nucleotides of a W1-W2 exon pair (Supplementary figure S3), suggesting misalignment of reads to IR1. These misaligned insertions were always found in exon W2, most frequently in the penultimate W2 exon (Supplementary figure S4A). Further analysis also identified a novel splice donor within exon W2 that lies 21 nt (sometimes 24 nt) upstream of the canonical splice donor (see examples in Supplementary figure S3 and schematic Figure 2A), producing exon W2∆ – a shorter version of exon W2 – that if in a transcript encoding EBNA-LP, would lack its last 7 (or 8) nucleotides of its repeat domain. Exon W2∆ was observed in 3-4% of exons aligned to repeats 1-5, and half that (1.7%) for the last repeat exon (Supplementary table ST3). In total, 13% of IR1 transcripts contain at least one W2∆ exon. To accurately reflect the IR copy number for each read we calculated the sum of nucleotides aligned to IR1 exons (thereby omitting retained introns), relative to the expected nucleotide size of those exons, and multiplied this value by the number of W1+W2 exons we observed to give a final estimate of repeat exon number (Supplementary table ST3). These estimates are distributed close to whole numbers of exons in IR1 (Supplementary figure S3B), suggesting this approach is a reasonable estimate of IR1 exon usage. By this measure, the vast majority of Cp transcripts had 5 or more pairs of repeats, while Wp transcripts had 4 or more (Figure 3C), and when split by cell line, most of the reads with fewer than 6 W exon pairs are from the WTw cell line (Figure 3D and Supplementary figure S4B). Rescuing EBV BACs from the LCLs into bacteria, and analysing their restriction pattern, showed that the WTw LCLs had a higher proportion of EBV genomes with a 5 repeat IR1 than the other LCLs (Supplementary table ST2), albeit not enough to fully explain the increased 5 repeat transcripts in the WTw LCL. One read had 7 W1-W2 exon pairs: more than expected from the genome sequence, but implied by the larger EBNA-LP isoforms detected on western blots (Figure 3A and [29]). Nevertheless together these data imply that number of repeats in transcripts largely matches the number in the underlying genomes, suggesting that exon skipping within IR1 is rare. .CC-BY-NC 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted April 23, 2025. ; https://doi.org/10.1101/2025.04.23.648900doi: bioRxiv preprint Figure 3. Transcripts spanning IR1 show relatively efficient splicing, and diverse stochastic splicing decisions. (A) Western blot of EBNA-LP from the LCLs used in this study. Position and size (kDa) of the marker are shown right. Numerical labels to the left indicate the likely number of W1W2 exon pairs likely in each band size. (B) Densitometry of western blot, corrected for the different numbers of antibody epitopes in different length EBNA-LPs, showing relative abundance of EBNA-LP-W4, -W5, and -W6 for each cell line. Number indicates number of W1W2 exon pairs. (C) Graph shows percentage of Cp or Wp transcripts with different numbers of W1 and W1’ exons based on counting exons to which nucleotides are mapped, and correcting for total number of aligned nucleotides. (D) Distribution of IR1 lengths in Cp transcripts in different cell lines. A 55 43 34 26 EBNA-LP (JF186) 6W 5W 4W ɣ-tubulin HB9 WTw AJ2 MJ2 MR11 B C Number of exon pairs % of transcripts 0% 20% 40% 60% 80% 100% WTw HB9 MR11-AJ2 MR11_MJ2 Cp all Cp all Cp all Cp all 6 5.5 5 4.5 4 6 5 4 WTw HB9 AJ2 MJ2 MR11 6 5 4 6 5 4 6 5 4 D 0% 20% 40% 60% 80% WTw HB9 MR11-1 AJ2 MR11-2 MJ2 6 5 4 WTw HB9 AJ2 MJ2 MR11 6 5 4 6 5 4 6 5 4 Relative abundance of EBNA-LP sizes marker kDa .CC-BY-NC 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted April 23, 2025. ; https://doi.org/10.1101/2025.04.23.648900doi: bioRxiv preprint Alignment to IR1 is slightly improved by barcoding first and last repeat. We hypothesised that barcoding the first and last exons, and mapping to the barcoded sequence would improve the fidelity of mapping reads to IR1 exons, by better securing the alignment of the first and last repeat unit. Alignment to the non- barcoded WTw sequence increased the reads skipping exons in the first repeat unit by about 50%, with a corresponding increase in reads aligning to the second repeat unit as their first W exons, and also a slight increase in reads aligning to repeat 5 but not the last repeat (Supplementary Table ST4). However, barcoding did not reduce the overall number of mis-mapped large insertions, although it did alter their positioning (Supplementary Table ST5). Thus the fidelity of alignment to homogeneous repeats is only marginally improved by barcodes in first and last repeats. Defining full-length transcripts reveals RNA processing sites. In principle, transcript abundance can be estimated from nanopore dRNA-seq data by either measuring frequencies based on either polyadenylated 3’ ends; from defined full length reads; or from any reads that span the region being analysed. We initially extracted ‘full-length’ reads based on the expectation that authentic 5’ and 3’ ends would be present in at least 5 transcripts. This identified the main viral latency promoters (Cp, Wp in repeats 1 and 2, LMP1, LMP2A and BARTs) and polyA sites, but also a number of non-canonical 5’ ends (Supplementary table ST6). Indeed, over 96% of reads contained a reproducible or previously annotated polyadenylation site, whereas under 60% of reads had a biologically explicable 5’ end (Supplementary table ST7). Non-canonical 5’ ends were particularly frequent across the GC-rich repeat sequences of EBNA1 and EBNA2, and the 3’ UTR of LMP1, suggesting these regions are more frequently break, or have RNA structures that can block the nanopore, curtailing sequencing. Conversely, a large number of reads with a 3’ end at the annotated BHRF1 polyadenylation site, had a 5’ end close to the end of miR- BHRF1-2 (and much more rarely miR-BHRF1-3 or miR-BHRF1-1) (Supplementary table ST7). The much higher abundance of ends generated by miR-BHRF1-2 cleavage may have implications for the order in which BHRF1 miRNAs are processed, while reads polyadenylated at BHRF1 and cleaved at the site of mirBHRF1-1 processing suggest that at least some BHRF1 miRNA processing precedes RNA splicing, rather than miR-BHRF1-1 excision from the intron (Figure 4A). Characterising the 3’ ends of transcripts confirmed the known polyadenylation sites of EBNA, LMP and BART transcripts [32], and identified a novel EBNA transcript polyA site between the Y3 splice donor and the EBNA2 start codon (herein called Y3 pA) (Figure 4A). EBNA transcripts most commonly used the canonical EBNA2 or BHRF1 polyA sites, with Y3 pA used in 10% of transcripts (Figure 4B). Since a third of EBNA transcripts use the BHRF1 pA site (and lytic BHRF1 transcripts were rare), we can infer that – at least in LCLs – BHRF1 transcripts are a fundamental part of the latency gene expression profile of LCLs, despite the BHRF1 protein being barely detectable. EBNA3C pA reads whose 5’ end was upstream of the EBNA3C exon 1 splice acceptor (n=26) showed 5:8 ratio (approximately 60:40 split) of 3C:3B-encoding transcripts. Combining this with pA site usage (assuming all RNAs are equally stable) allows inference of the ratio of different splicing events downstream of the U exon, suggesting that more 3’ EBNA gene transcripts are progressively more frequent (Figure 4A & 4B), suggesting the U exon splice donor .CC-BY-NC 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted April 23, 2025. ; https://doi.org/10.1101/2025.04.23.648900doi: bioRxiv preprint more frequently uses the more distant acceptor, though whether this is determined by splicing or polyadenylation mechanisms cannot be distinguished. We did not observe any of these downstream EBNA transcripts initiating at the latency II/I-specific Qp promoter, but cannot formally exclude its usage due to the very high breakage rates of EBNA1 and EBNA3 transcripts (Supplementary table ST7). We did detect sporadic transcripts indicative of viral productive replication (lytic cycle) in a small proportion of cells, but will not discuss these further as they have been studied in detail elsewhere [2, 33]. EBNA promoter, polyadenylation and splice site usages appear to influence each other. The alternative splicing of the EBNA transcripts has a wide range of options, and we aimed to test the hypothesis that there is co-ordination between promoter usage, polyadenylation, and alternative splicing. Certain splice acceptor-donor combinations appear to be preferred, with Y2-BHRF1 being hugely preferred to Y3- BHRF1, and conversely the U exon acceptor was spliced exclusively to Exon Y3 in our data, even though Y2-U splice junctions have previously been observed elsewhere [30, 34, 35]. We next assessed whether different polyadenylation signals are associated with different proportions of EBNA-LP initiating transcripts. Wp-initiated transcripts were consistent (albeit with low read numbers), with an approximately 50:50 use of W1 and W1’ regardless of pA site. Shorter Cp transcripts appeared generally more likely to encode the EBNA-LP start codon than longer ones, although this trend slightly reversed in U exon transcripts (Figure 4C). This change is associated with progressively increasing use of C2’/W1 splicing and reduced C2/W1’ in the EBNA2- pAand BHRF1-pA transcripts relative to those with Y3 pA. The other two splice junctions are unchanged in frequency (Figure 4D). While the trend is clear, these differences in frequency from the average for all reads are not statistically significant (p<0.05) with two-tail analysis and multiple test correction). The BART transcripts in the EBV BAC interact with inserted expression cassettes. The deletion in B95-8 removes BART Exons II to IV (including the RPMS1 start codon), and retains only four BART miRNAs from cluster 1 (Figure 5A) and miR-BART-2 (between exon IV and V). We therefore assessed how this deletion and the inserted BAC expression cassette (lying between the BARTs’ promoter and downstream exons) affected transcripts from this region. Antisense to the BARTs, nearly 400 reads used the GFP polyadenylation site were detected, compared to just 11 for the SV40 pA used by the hygromycin phosphotransferase cassette (Supplementary tables ST7). Despite the large number of GFP reads, when viewed by fluorescence microscopy, only a small proportion of the cells are visibly green, which suggests that many of the GFP reads may derive from a small percentage of the cells. The BART transcripts predominantly used either the canonical site at the end of BART exon VII [20], which confirms that BARTs can still be transcribed across the BAC insert, or the SV40-derived pA sequence between the Hygromycin and GFP cassettes (Supplementary table ST7 and Figure 5A). In total over 200 reads use one of these pA sites (that we collectively call BART transcripts), of which one third reach Exon VII, while two thirds terminate in the SV40-derived pA ussed to terminate GFP transcripts. .CC-BY-NC 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted April 23, 2025. ; https://doi.org/10.1101/2025.04.23.648900doi: bioRxiv preprint Figure 4. Frequencies of EBNA transcript polyadenylation and alternative splicing events. (A) Schematic representation of the main Cp/Wp-derived RNAs observed or inferred from the nanopore reads in this study. Blue arrows indicate poly adenylation sites. Reads/exons in grey are previously observed splicing events that are infrequent or absent in these data. (B) Table of relative frequencies of transcript pA sites inferred to arise from Cp or Wp promoters (EBNA transcripts). EBNA3B and EBNA3C relative abundance within the EBNA3C pA group was inferred from the ratio of reads extending upstream of 3C Exon1. (C & D) Graphs testing co- ordination between alternative splicing of promoter exons and the polyadenylation site usage, shown either (C) in terms of whether the splicing generates the AUG start codon for EBNA-LP, or (D) frequency of different Cp splice combinations for each pA site. All pA sites downstream of the U exon have been merged into one group. Dashed lines indicate trends noted in the article text. 53% 23% 15% 9% <1% 13% 3% 40% 11% 33% EBNA pA % Y3 pA 10.3% E2 pA 43.2% BHRF1 pA 33.1% 3A pA 1.2% 3B/C pA 3B 1.9% 3C 3.1% E1 pA 7.1% B A all pA Y3 pA E2 pA BHRF1 pA U exon n=432 n=86 n=211 n=62 n=47 all pA Y3 pA E2 pA BHRF1 pA U exon n=432 n=86 n=211 n=62 n=47 C D .CC-BY-NC 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted April 23, 2025. ; https://doi.org/10.1101/2025.04.23.648900doi: bioRxiv preprint The short BART transcripts using SV40pA start with Exon I and usually (60%) Exon Ia, splicing to either of two cryptic splice sites, within either the truncated mirBART5 sequence or 60 nt before the end of the antisense Hygromycin phosphotransferase (Hyg) ORF (Figure 5A). Both of these exons (BAC_Ex2a and b respectively) extend across all but 20 nt of the Hyg ORF, then splicing to another exon (BAC_Ex3) just past the SV40 promoter. The GFP and BAC_Ex3 polyadenylation sites are those used by the bidirectional transcripts of the SV40 polyomavirus, having 3’UTRs with 80 nt of reverse complementarity to each other. (Figure 5A). The longer BART transcripts – ending in the canonical exon VII – exhibit alternative splicing in exons V and VII that have been previously reported [36]. We saw usage of exons V, Va and a 281 nucleotide exon (Va’’) [37], but no examples (in 41 reads) of exon V’ required to encode BARF0 [36], which implies (p<0.05) that exon Va’ has no more than 7% abundance in LCLs. Similarly there was no evidence for transcripts initiating close enough upstream of hypothesised ORFs A73 or RPMS1 for their translation. EBV transcripts can extend between distinct transcription units, spanning the EBV genome. As well as these canonical BARTs, ≈15% of BART reads extend 5’ of the canonical BART promoter (Supplementary table ST8). These reads contain a collection of exons from 20 kb upstream of the BARTs, that connect to the BARTs via a novel splice acceptor around 25 nt upstream of the BART Exon I transcription start site. Many of these are antisense to established late gene ORFs, but some arise within the BGRF-BDRF transcript region (Figure 5B). The novel exons could encode possible alternative versions of the BGRF1-BDRF1 terminase large subunit, and three other small ORFs, but whether these are translated will depend on the promoter usage of these transcripts, which may differ depending on cell type and latency transcription programs. In these LCL transcripts, there is no new promoter, but rather the few with long enough sequences are spliced from Cp, W repeat or U exons within EBNA transcripts. For the LMP2 transcripts, there is a 3:1 ratio of the LMP2A:2B splice junctions, but 2.5% of splice donors to LMP2 Ex2 come from exons further upstream (Supplementary table ST8): half are from a novel splice donor within BART Exon VII, but others use BART Exon I, BAC_Ex2, and EBNA U exon splice donors. Overall, around 1% of reads with a Cp or Wp intiation site terminates in a canonical polyA site downstream of EBNA1. Because these longer reads are more likely to fragment, it is unclear how common they truly are, but does suggest that either RNA polymerase can extend transcripts for the full length of the EBV genome from Cp to the LMP2 polyA, or that compound EBV reads can arise from trans-splicing at a detectable level. Whether such reads are biological noise, or hold relevance to EBV biology remains an open question.

Discussion

The transcriptome of EBV has been studied for over 40 years, from cosmid cloning and sequencing of the B95-8 EBV genome [38], followed by Northern blotting, S1 mapping, sequencing of cDNA clones [38, 39], and subsequently updated to the composite EBV annotation NC_007605 that is the current EBV reference [40]. Additional details have been revealed by RNA-seq of latency III B95-8 [41] and reactivated Akata and Mutu EBVs [33], and mapping of lytic cycle transcription start .CC-BY-NC 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted April 23, 2025. ; https://doi.org/10.1101/2025.04.23.648900doi: bioRxiv preprint Figure 5. The splicing profile of BART transcripts of the EBV BAC. (A) Schematic representation of transcripts from the BART promoter spanning the EBV BAC. Percentage values are in vertical groups, and indicate the proportion of differential splice donor, splice acceptor or exon usage events. The Hyg transcript is omitted because only one read supported its existence. (B) Novel exons identified upstream of the canonical BART promoter (ExI) are shown as rectangles below the EBV genetic map. Novel exon names are based on the ORFs in which they are found. ORFs are colour coded by sense (orange) or antisense (yellow) orientation. B A BAC_Ex2a BAC_Ex2b BAC_Ex3 .CC-BY-NC 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted April 23, 2025. ; https://doi.org/10.1101/2025.04.23.648900doi: bioRxiv preprint [42] and polyadenylation sites [32], and recent long-read sequencing of lytic cycle cDNAs [2, 43], to mapping of viral circular RNAs [44]. Novel insights into the EBV latency transcriptome. This long-read direct RNA transcriptome of EBV latency transcripts defines a new latency polyA site near exon Y3, short version of Exon W2, the altered BART splicing of the B95-8 BAC, and previously unreported exons mapping in between EBNA1 and BARTs, as examples of long transcripts that can span conventional gene boundaries. Similar transcripts extending from the BARTs into LMP2 [2] were identified in the EBV lytic cycle, although those transcripts were characterised by extensive exon skipping in LMP2, in contrast with the canonical LMP2 splicing in our extended transcripts. Intriguingly, long transcripts were also seen in a reactivated KSHV transcriptome [45] although these were generally 2 or 3 exon transcripts. While some of these long distance splice junctions were detectable during KSHV latency, it is unclear whether complex EBV-like transcripts are also made by KSHV during latency, and whether they are biologically important. Since our (and other long-read herpesvirus transcriptomes) analysed total RNA, is unclear which RNAs are exported to the cytoplasm for translation, are functional nuclear lncRNAs, or are incompletely processed or aberrant RNAs with no coherent function. The complexity of the EBNA transcripts can only be addressed with long read approaches. The vast majority of Cp transcripts include all IR1-derived exons, suggesting exon skipping across IR1 is rare, although there remains uncertainty because the IR1 repeat number is not entirely stable. Conversely, Wp-intiated transcripts, although they often initiat at the Wp promoters closest to Cp in these LCLs, vary more in length than Cp transcripts. This implies that the majority of EBNA-LP deriving from Cp will correspond to the maximum size for that genome, while Wp activity likely contribute the smaller EBNA-LP isoforms. Additionally, since Wp more often generates EBNA-LP transcripts than Cp, Wp transcripts likely produce a disproportionate amount of the cells’ EBNA-LP. Since Wp is the first promoter activated after infection [46], its higher production of EBNA-LP suggests that EBNA-LP protein production is more important in the early stages of infection, fitting both EBNA-LP’s early detection after B cell infection (alongside EBNA2) [47, 48], and its importance for transcription of other viral genes [3], likely by preventing Sp100 and Sp140L from repressing un-transcribed regions of the EBV genome [4]. Similarly, BHRF1 transcripts are made early after infection, but after the first few days post infection, BHRF1 protein becomes very hard to detect by western blotting [49]. Our analysis does not provide an easy explanation for why this might be: BHRF1 transcripts from Cp are less likely to contain an EBNA-LP AUG than those terminating at Y3 or EBNA2 polyA sites (Figure 4D), so competition with EBNA-LP translation is unlikely. Perhaps skipping exon Y3 makes the cryptic AUG from exon Y2 more accessible (by changing the mRNA secondary structure), thereby suppressing downstream translation of BHRF1. Given the abundant fully processed BHRF1 transcripts, it seems unlikely that miRNA processing (representing only 15% of transcripts, albeit probably more short-lived) would be sufficient to so substantially reduce translation. Together this suggests that the suppression of BHRF1 protein synthesis in latency III cells may lie downstream of RNA processing. Nevertheless, miRNA processing was evident in 15% of BHRF1 reads, most of which were truncated at the 3’ cleavage site mirBHRF1-2. Since expression constructs with a mutated mirBHRF1-2 hairpin exhibit loss of mirBHRF1- 3 production [50], this suggests that mirBHRF1-2 processing is a pre-requisite for .CC-BY-NC 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted April 23, 2025. ; https://doi.org/10.1101/2025.04.23.648900doi: bioRxiv preprint (and precedes) mirBHRF1-3 processing, and that after mirBHRF1-3 processing the RNA is rapidly degraded. BHRF1 transcripts containing W1-W1 splice junctions reportedly produced during the lytic cycle of Akata and Mutu EBV [2] were absent from our latency III transcriptome. Instead, >99% of BHRF1 transcripts spliced from exon Y2 to BHRF1 exon 2, consistent with previous analyses [30, 49]. This contrasts with splicing to the U exon, where ≈90% of transcripts also include exon Y3 (similar to the 84% seen by PacBio sequencing [30]). Thus both Y2 and Y3 splice junctions are key in defining alternative splicing of EBNA transcripts. We have also characterised the disruption of BART splicing that results from the

Introduction

of a BAC into the B95-8 genome. Both the Hygromycin phosphotransferase and GFP expression cassettes use pA sites from SV40 polyomavirus. The pA downstream of the Hygromycin resistance gene contains only the SV40 T antigen polyA site, whereas the longer region of SV40 after GFP also contains the antisense pA used by the SV40 late transcripts. It is this antisense pA that terminates 2/3 of the BART transcripts prematurely. It is perhaps surprising, however, that the cryptic exons antisense to Hygromycin are never incorporated in longer BART transcripts. This suggests that there are evolved RNA structures or RNA-binding proteins that support the correct processing BART transcripts using the canonical exon VII pA. Improving EBV genome annotation. One of the goals of this project was to create an EBV transcriptome annotation that captures the most common latency transcript isoforms (downloadable for the B95-8 BAC from ebv.org.uk). The relatively efficient IR1 splicing of Cp transcripts supports a simplification that Cp transcripts contain all IR1 exons. However, the variable Wp usage and new W2∆ exon (being distributed throughout IR1) are challenging to annotate without creating an unreasonable number of distinct mRNAs. The identification of transcripts spanning almost the entire genome led us to use oriP as the “ends” of the EBV genome, rather than the terminal repeats used in NCBI annotations, which breaks LMP2 transcripts. Although LCLs had no reads in oriP, these were observed during lytic reactivation [51]. Thus, in the absence of RNA-seq aligners able to map to circular genomes, future studies will need to tailor annotation files and analysis pipelines to specifically capture the biology they are investigating. Limitations, future optimisation and utility of dRNA-seq of EBV. Despite the advantages of direct RNA sequencing, which avoids biases-introduced through PCR, and captures long-range associations between promoter usage, polyadenylation, and splicing, we identified shortcomings and biases even in this minimally manipulated direct RNA sequencing data. We wrote a script to compensate for the failure of minimap2 to correctly align the short 5’ W0 exon, while imprecise splice site assignments (also reported elsewhere [52]) required us to use reference-guided alignment, but could not resolve the position of the W2∆ splice donor. While these will be improved by better basecalling accuracy, the miscounting of W exons due to incorrect alignment to the IR1 repeat require strategies like the nucleotide counting used herein to resolve, until aligners’ handling of repetitive sequences improves. Other inaccuracies stem from sequencing being terminated before the canonical 5’ end of the transcript. This was particularly prevalent in EBNA1, but also in GC-rich regions of EBNA3B/C, EBNA2 and BHRF1 transcripts. The Gly-Ala repeat .CC-BY-NC 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted April 23, 2025. ; https://doi.org/10.1101/2025.04.23.648900doi: bioRxiv preprint in EBNA1 forms G quadruplex structures [53], suggesting that RNA structures blocking pores (rather than hydrolysis of the RNA’s phosphate backbone) is responsible for regions of increased sequencing termination. This could be reduced by using a reverse transcriptase with better processivity, such as Induro (NEB), which improves the proportion of intact 5’ ends of transcripts [54], and is now recommended by Oxford Nanopore Technologies. This, alongside improved flowcells and basecalling, should produce even more robust dRNA-seq data going foreward. The complexity and low abundance of EBV latency transcripts makes it particularly challenging to fully characterise. Alongside improved RNA integrity, the small proportion of cellular reads originating from EBV limits the study of EBV transcript complexity. To increase the number of EBV-derived reads requires development of enrichment strategies such as targeted library preparation, hybrid capture strategies analogous those used for EBV genome sequencing [55] but suitable for maintaining intact RNA transcripts, or adaptive sampling during sequencing [56]. The polyA-based seqeuncing means that EBV’s many abundant but non- coding RNAs are not detected in our data-set. These include 1.3kb and 0.9kb RNAs from BHRF1 liberated from BHRF1 transcripts by miRNA processing, that are abundant during the EBV lytic cycle [15, 50]; the RNA-pol III-transcribed EBERs; the putative stable intronic sequence RNAs (sisRNA-1 and -2) that arise from the introns between repeating W1 and W2 exons [14]; intronic RNAs spanning the terminal repeats that work with EBER2 to regulate LMP2 [57]. Detecting such RNAs in direct RNA-seq would require specifically targeted adapters, as used for direct rRNA sequencing (eg [58]). Nevertheless, direct RNA whole transcript sequencing promises to provide the most comprehensive and least biased snapshot of the EBV transcriptome. It will be important to determine differences between lab strains and authentic circulating EBVs, and validate whether genetic modifications have unexpected impacts on transcript processing, as seen for an accidental splice acceptor, and elevated EBNA2/LP and LMP1 protein expression that characterised two mirBHRF1-mutant viruses [30], and the BAC-induced alterations in BART splicing and polyadenylation, and reduced C2-W1’ splice junctions as a signature of nonsense mediated decay in HB9 EBNA-LP-encoding transcripts reported here. Interpreting the transcriptome of EBV is further complicated by the recent discovery that LCLs are not homogeneous, but rather represent a collection of cells with diverse B cell differentiation states [59, 60]. It remains possible that some EBV transcript variants are tied to particular B cell differentiation states (influencing, or influenced by specific aspects of B cell biology), rather than simply being the stochastic variations in splice site usage. Truly understanding the nature and control of EBV transcription will require a combination of sorted or single cell techniques with targeted long read sequencing.

Materials

and methods. Generation of recombinant EBV containing modified W2 exons. To generate EBV with a ‘barcoded’ IR1, the Mfe I/Pci I restriction fragment from IR1, which contains the W exons, was cloned into a pBR322-based plasmid. The plasmid was amplified by PCR (primers in Supplementary table ST1) and circularised by InFusion cloning (Takara Bio), introducing pairs of synonymous SNPs (creating either Mlu I or Pvu I restriction sites) in exon W2. The Mfe I/Pci I restriction .CC-BY-NC 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted April 23, 2025. ; https://doi.org/10.1101/2025.04.23.648900doi: bioRxiv preprint fragments were cloned into either p8359 (Mlu I barcode) or p8342 (Pvu I barcode), which were combined by Gibson Assembly (NEBuilder HiFi Assembly mix) with wild- type BamW fragments to assemble an IR1 repeat array as described previously [3], with the Pvu I SNPs in the first repeat unit, and the Mlu I SNPs in the last repeat unit. This IR1 repeat array was recombineered into the B95-8 BAC (WKO.4 from [3]) using RecA-based recombineering [61]. The resulting BAC (pHB9-W4-MR11+4.27) was purified and transformed into HEK293 cells to generate EBV-producing cell clones (-AJ2 and -MJ2) as previously described [3]. Cells and viruses Virus production was induced from EBV-producing 293 cell clones carrying episomal EBV genomes as described previously [3]. Specifically, virus was isolated from the two barcoded B95-8-BAC producers (293-MR11-AJ2 and -MJ2); from 293-HB9-A4 containing our clone (pHB9) of the p2089 B95-8 BAC [1], whose IR1 contains 6.6 repeat units [29]; and 293-WTw.1-A11 [3], which is pHB9 engineered with a homogeneous IR1 repeat [3]. These viruses were used to infect B cells isolated (RossetteSep B cell purification kit) and pooled from buffy coat residues of two anonymous blood donors, and grown into LCLs in RPMI supplemented with 10% FCS and 4mM L-glutamine. RNA extraction and sequencing. LCLs were seeded at 3x105 cells/ml 24 hours prior to harvesting cells. RNA was extracted using Trizol, followed by two additional washes in 70% ethanol prior to resuspension, and polyA-based library preparation with RNA002 kit (Oxford Nanopore Technologies) including reverse transcription to stabilise the RNA (using Superscript III; Life Technologies), followed by direct RNA sequencing using MIN106D version R9.4.1 flow cells. FAST5 files were base-called to generate FASTQ files using Guppy (version 6.1.5) [62]. Transcript mapping and characterisation. Specific commands used for the processing and mapping of FastQ reads are available as a script (see Data Availability below). To summarise, reads were aligned separately for each cell line in an annotation-aware manner, to either the corresponding exact EBV-BAC sequence (“vs self”) or to WTw EBV-BAC sequence (all identical IR1 repeat units) with minimap2 (v2.24) [63, 64]. Where not specified, analyses were performed on the latter “vs self” alignment. Manipulations of SAM and BAM files were performed using samtools (v1.3.1). For pooled analyses, SAM files were concatenated. IDs of EBV-aligned reads were extracted from the SAM file and used to extract EBV-specific FAST5 data (fast5_subset command within ont-fast5- api v4.1.1). An in-house script (W0_correcting.r) identified W0 exon-derived sequences that had been removed by soft-clipping during the initial alignment by Minimap2, and modified the SAM files to appropriately align these W0-derived sequences. Common polyadenylation (pA) sites and transcription start sites (TSS) were identified and reads sorted into full-length and various incomplete/rare read groups using an in-house script (transcript_classifying.r). Full length transcripts are defined as those whose 5’ and 3’ ends are both no more than 20 nucleotides away from a reproducible start or stop position. Reproducible start and stop sites are where a 5’ or 3’ end occurs in at least 5 reads within a 5 nt window. Next a third in- house script (IR1_repeat_counting.r) was used to extract the number of IR1 repeat .CC-BY-NC 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted April 23, 2025. ; https://doi.org/10.1101/2025.04.23.648900doi: bioRxiv preprint units to which reads were aligned, and the number of nucleotides aligned to each IR1 W exon in each repeat unit (including the novel W2’ exon identified herein). Information about polyA lengths was extracted using nanopolish (v0.14.0) [65]. Reads were then parsed using previously published perl scripts (classify_transcripts_and_polya_segmented_V2.pl and all_transcripts_grouped_by_polya.pl [66]) to extract and collate 5’ end, 3’ end, and splice site co-ordinates from each read, to filter reads according to their pA tail status, and adjust the 5’ end position to correct for the ≈10 most 5’ nucleotides, that are lost by nanopore sequencing [66, 67]. Subsequent characterisation and quantitation of splice junctions were conducted using excel formulae to extract details from these outputs. IR1 analyses were conducted on full-length reads only. Other transcript analyses used the polyadenylated subset of reads to define splice donors and acceptors, but used all reads to quantify splice site usage. Read alignments were visualised using Integrated Genome Viewer (IGV) v2.16.2. Western blotting. Immunoblots were performed using RIPA protein extracts and mini-Protean gel electrophoresis and blotting equipment (Bio-Rad) to transfer the gel contents to nitrocellulose as described previously [29] and imaged using an Azure 600 chemiluminescence imager. The anti-EBNA-LP JF186 antibody hybridoma supernatant was used at 1:100; Mouse anti-ɣ-tubulin (clone GTU-88; Sigma-Aldrich) was used at 1:1000. Analysis of EBV episomes. To extract EBV genomes from stably infected cell lines (“episome rescue”), DNA was extracted from approximately 3x106 cells using the low molecular weight DNA extraction protocol [68], with additional use of Maxtract tubes (Qiagen) to maintain separation of phases during the phenol extraction step. DNA was resuspended in 50 µl of TE-RNAse, of which 2 µl was electroporated into NEB10-beta bacteria using a Bio-Rad Gene Pulser. BAC DNA was extracted chloramphenicol-resistant colonies by alkaline-lysis miniprep, digested with EcoRI and analysed by pulsed field gel electrophoresis (Bio-Rad CHEF-DR II) to visualise the IR1 band (largest fragment). Data availability. Perl scripts from [66] are available with documentation from zenodo.org/record/7101768. In-house scripts with documentation is available from github.com/robertewhite/ebv-transcriptomics-tools. The Bam files for reads aligned to EBV (with modified W0 alignment), the FAST5 and fastq files for EBV reads are available in the EMBL Nucleotide Archive (Study accession number: PRJEB83447). All raw and processed EBV data from this study, including EBV-BAC genome annotation, EBV plus cellular fastq reads, and scripts are linked from ebv.org.uk.

Acknowledgements

Thanks to Claudio Elgueta for providing the B cells used to produce the LCLs described in this project. Thanks to Micah Luftig for unpublished observations from .CC-BY-NC 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted April 23, 2025. ; https://doi.org/10.1101/2025.04.23.648900doi: bioRxiv preprint PacBio sequencing from [30]. Thanks to Paul Farrell for critical reading of this manuscript. Funding REW and WW were supported by MRC grant MR/S0022597/1. AM-L, IZ, MR, JH were supported by Imperial College MSc project stipends. AG was supported by Wellcome Trust (grant WT099273/Z/12/Z) to Prof Martin Allday. DAM and ID-B were supported by BBSRC grant BB/M02542X/1. Author contributions Conceptualisation: REW, AM-L, IZ Computational Biology and programming: REW, AM-L, IZ, DAM Data Analysis REW, AM-L, IZ, JC, HIC Transcriptome annotation: REW, AG Experimental Work: JH, MR, WW, ID-B Manuscript preparation: REW, AM-L

References

1. Delecluse HJ, Hilsendegen T, Pich D, Zeidler R, Hammerschmidt W. Propagation and recovery of intact, infectious Epstein-Barr virus from prokaryotic to human cells. Proc Natl Acad Sci U S A 1998;95(14):8245-8250. 2. O’Grady T, Wang X, Höner Zu Bentrup K, Baddoo M, Concha M et al. Global transcript structure resolution of high gene density genomes through multi-platform data integration. Nucleic Acids Research 2016;44(18):e145-e145. 3. Szymula A, Palermo RD, Bayoumy A, Groves IJ, Ba Abdullah M et al. Epstein-Barr virus nuclear antigen EBNA-LP is essential for transforming naïve B cells, and facilitates recruitment of transcription factors to the viral genome. PLOS Pathogens 2018;14(2):e1006890. 4. Cable JM, Wongwiwat W, Grabowski JC, White RE, Luftig MA. Sp140L Is a Novel Herpesvirus Restriction Factor. bioRxiv 2024:2024.2012.2013.628399. 5. Kempkes B, Ling PD. EBNA2 and Its Coactivator EBNA-LP. Curr Top Microbiol Immunol 2015;391:35-59. 6. Styles CT, Paschos K, White RE, Farrell PJ. The Cooperative Functions of the EBNA3 Proteins Are Central to EBV Persistence and Latency. Pathogens (Basel, Switzerland), Review 2018;7(1):31. 7. Altmann M, Pich D, Ruiss R, Wang J, Sugden B et al. Transcriptional activation by EBV nuclear antigen 1 is essential for the expression of EBV's transforming genes. Proceedings of the National Academy of Sciences 2006;103(38):14188-14193. 8. Yoo L, Speck SH. Determining the role of the Epstein-Barr virus Cp EBNA2-dependent enhancer during the establishment of latency by using mutant and wild-type viruses recovered from cottontop marmoset lymphoblastoid cell lines. J Virol 2000;74(23):11115- 11120. 9. Yoo LI, Woloszynek J, Templeton S, Speck SH. Deletion of Epstein-Barr virus regulatory sequences upstream of the EBNA gene promoter Wp1 is unfavorable for B-Cell immortalization. J Virol 2002;76(22):11763-11769. .CC-BY-NC 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted April 23, 2025. ; https://doi.org/10.1101/2025.04.23.648900doi: bioRxiv preprint 10. Austin PJ, Flemington E, Yandava CN, Strominger JL, Speck SH. Complex transcription of the Epstein-Barr virus BamHI fragment H rightward open reading frame 1 (BHRF1) in latently and lytically infected B lymphocytes. Proc Natl Acad Sci U S A 1988;85(11):3678-3682. 11. Rogers RP, Woisetschlaeger M, Speck SH. Alternative splicing dictates translational start in Epstein-Barr virus transcripts. The EMBO Journal, Comparative Study 1990;9(7):2273-2277. 12. Cai X, Schäfer A, Lu S, Bilello JP, Desrosiers RC et al. Epstein-Barr virus microRNAs are evolutionarily conserved and differentially expressed. PLOS Pathogens 2006;2(3):e23. 13. Pfeffer S, Zavolan M, Grässer FA, Chien M, Russo JJ et al. Identification of virus- encoded microRNAs. Science (New York, NY) 2004;304(5671):734-736. 14. Moss WN, Steitz JA. Genome-wide analyses of Epstein-Barr virus reveal conserved RNA structures and a novel stable intronic sequence RNA. BMC Genomics 2013;14:543. 15. Xing L, Kieff E. Epstein-Barr virus BHRF1 micro- and stable RNAs during latency III and after induction of replication. The Journal of Virology 2007;81(18):9967-9975. 16. Rosa MD, Gottlieb E, Lerner MR, Steitz JA. Striking similarities are exhibited by two small Epstein-Barr virus-encoded ribonucleic acids and the adenovirus-associated ribonucleic acids VAI and VAII. Mol Cell Biol 1981;1(9):785-796. 17. Howe JG, Shu MD. Epstein-Barr virus small RNA (EBER) genes: unique transcription units that combine RNA polymerase II and III promoter elements. Cell 1989;57(5):825-834. 18. Laux G, Economou A, Farrell PJ. The terminal protein gene 2 of Epstein-Barr virus is transcribed from a bidirectional latent promoter region. J Gen Virol 1989;70 ( Pt 11):3079- 3084. 19. Chen H-S, Martin KA, Lu F, Lupey LN, Mueller JM et al. Epigenetic deregulation of the LMP1/LMP2 locus of Epstein-Barr virus by mutation of a single CTCF-cohesin binding site. Journal of virology 2014;88(3):1703-1713. 20. Smith PR, de Jesus O, Turner D, Hollyoake M, Karstegl CE et al. Structure and coding content of CST (BART) family RNAs of Epstein-Barr virus. The Journal of Virology 2000;74(7):3082-3092. 21. Al-Mozaini M, Bodelon G, Karstegl CE, Jin B, Al-Ahdal M et al. Epstein-Barr virus BART gene expression. J Gen Virol 2009;90(Pt 2):307-316. 22. Marquitz AR, Raab-Traub N. Host Gene Expression Is Regulated by Two Types of Noncoding RNAs Transcribed from the Epstein-Barr Virus BamHI A Rightward Transcript Region. Journal of virology 2015;89(22):11256-11268. 23. Edwards RH, Marquitz AR, Raab-Traub N. Epstein-Barr virus BART microRNAs are produced from a large intron prior to splicing. Journal of virology 2008;82(18):9094-9106. 24. Marquitz AR, Raab-Traub N. The role of miRNAs and EBV BARTs in NPC. Semin Cancer Biol 2012;22(2):166-172. 25. Iizasa H, Kim H, Kartika AV, Kanehiro Y, Yoshiyama H. Role of Viral and Host microRNAs in Immune Regulation of Epstein-Barr Virus-Associated Diseases. Front Immunol 2020;11:367. 26. Fachko DN, Goff B, Chen Y, Skalsky RL. Functional Targets for Epstein-Barr Virus BART MicroRNAs in B Cell Lymphomas. Cancers (Basel) 2024;16(20). 27. Jain M, Abu-Shumays R, Olsen HE, Akeson M. Advances in nanopore direct RNA sequencing. Nat Methods 2022;19(10):1160-1164. .CC-BY-NC 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted April 23, 2025. ; https://doi.org/10.1101/2025.04.23.648900doi: bioRxiv preprint 28. Workman RE, Tang AD, Tang PS, Jain M, Tyson JR et al. Nanopore native RNA sequencing of a human poly(A) transcriptome. Nature Methods, OriginalPaper 2019;16(12):1297-1305. 29. Ba abdullah MM, Palermo RD, Palser AL, Grayson NE, Kellam P et al. Heterogeneity of the Epstein-Barr Virus (EBV) Major Internal Repeat Reveals Evolutionary Mechanisms of EBV and a Functional Defect in the Prototype EBV Strain B95-8. Journal of virology 2017;91(23):e00920-00917. 30. Poling BC, Price AM, Luftig MA, Cullen BR. The Epstein-Barr virus miR-BHRF1 microRNAs regulate viral gene expression in cis. Virology 2017;512:113-123. 31. Allan GJ. Structure and function of the Epstein-Barr virus leader protein. Doctor of Philosophy Thesis, University of London; 1991. 32. Majerciak V, Yang W, Zheng J, Zhu J, Zheng Z-M. A genome-wide Epstein-Barr virus polyadenylation map and its antisense RNA to EBNA. Journal of virology 2018:JVI.01593- 01518. 33. Concha M, Wang X, Cao S, Baddoo M, Fewell C et al. Identification of new viral genes and transcript isoforms during Epstein-Barr virus reactivation using RNA-Seq. J Virol 2012;86(3):1458-1467. 34. Speck SH, Pfitzner A, Strominger JL. An Epstein-Barr virus transcript from a latently infected, growth-transformed B-cell line encodes a highly repetitive polypeptide. Proceedings of the National Academy of Sciences 1986;83(24):9298-9302. 35. Sample J, Brooks L, Sample C, Young L, Rowe M et al. Restricted Epstein-Barr virus protein expression in Burkitt lymphoma is due to a different Epstein-Barr nuclear antigen 1 transcriptional initiation site. Proc Natl Acad Sci U S A 1991;88(14):6343-6347. 36. Sadler RH, Raab-Traub N. Structural analyses of the Epstein-Barr virus BamHI A transcripts. J Virol 1995;69(2):1132-1141. 37. Yamamoto T, Iwatsuki K. Diversity of Epstein-Barr virus BamHI-A rightward transcripts and their expression patterns in the lytic and latent infections. Journal of medical microbiology 2012. 38. Baer R, Bankier AT, Biggin MD, Deininger PL, Farrell PJ et al. DNA sequence and expression of the B95-8 Epstein-Barr virus genome. Nature 1984;310(5974):207-211. 39. Farrell PJ. DNA Viruses: Epstein-Barr virus (EBV). In: O’Brien SJ (editor). Genetic Maps - Locus Maps of Complex Genomes, Book 1: Viruses, 6th Edition: Cold Spring Harbour Laboratory Press; 1993. pp. 120-133. 40. de Jesus O, Smith PR, Spender LC, Elgueta Karstegl C, Niller HH et al. Updated Epstein-Barr virus (EBV) DNA sequence and analysis of a promoter for the BART (CST, BARF0) RNAs of EBV. The Journal of general virology 2003;84(Pt 6):1443-1450. 41. Arvey A, Tempera I, Tsai K, Chen HS, Tikhmyanova N et al. An atlas of the Epstein- Barr virus transcriptome and epigenome reveals host-virus regulatory interactions. Cell Host Microbe 2012;12(2):233-245. 42. Wang Y, Ungerleider N, Hoffman BA, Kara M, Farrell PJ et al. A Polymorphism in the Epstein-Barr Virus EBER2 Noncoding RNA Drives In Vivo Expansion of Latently Infected B Cells. mBio 2022;13(3):e0083622. 43. Fulop A, Torma G, Moldovan N, Szenthe K, Banati F et al. Integrative profiling of Epstein-Barr virus transcriptome using a multiplatform approach. Virol J 2022;19(1):7. 44. Ungerleider N, Concha M, Lin Z, Roberts C, Wang X et al. The Epstein Barr virus circRNAome. PLOS Pathogens 2018;14(8):e1007206. .CC-BY-NC 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted April 23, 2025. ; https://doi.org/10.1101/2025.04.23.648900doi: bioRxiv preprint 45. Shekhar R, O'Grady T, Keil N, Feswick A, Amador DAM et al. High-density resolution of the Kaposi's sarcoma associated herpesvirus transcriptome identifies novel transcript isoforms generated by long-range transcription and alternative splicing. Nucleic Acids Res 2024;52(13):7720-7739. 46. Woisetschlaeger M, Yandava CN, Furmanski LA, Strominger JL, Speck SH. Promoter switching in Epstein-Barr virus during the initial stages of infection of B lymphocytes. Proceedings of the National Academy of Sciences 1990;87(5):1725-1729. 47. Alfieri C, Birkenbach M, Kieff E. Early events in Epstein-Barr virus infection of human B lymphocytes. Virology 1991;181(2):595-608. 48. Allday MJ, Crawford DH, Griffin BE. Epstein-Barr virus latent gene expression during the initiation of B cell immortalization. The Journal of general virology 1989;70 ( Pt 7)(7):1755-1764. 49. Kelly GL, Long HM, Stylianou J, Thomas WA, Leese A et al. An Epstein-Barr virus anti-apoptotic protein constitutively expressed in transformed cells and implicated in burkitt lymphomagenesis: the Wp/BHRF1 link. PLOS Pathogens 2009;5(3):e1000341. 50. Xing L, Kieff E. cis-Acting effects on RNA processing and Drosha cleavage prevent Epstein-Barr virus latency III BHRF1 expression. Journal of virology 2011;85(17):8929-8939. 51. Cao S, Moss W, O'Grady T, Concha M, Strong MJ et al. New Noncoding Lytic Transcripts Derived from the Epstein-Barr Virus Latency Origin of Replication, oriP, Are Hyperedited, Bind the Paraspeckle Protein, NONO/p54nrb, and Support Viral Lytic Transcription. J Virol 2015;89(14):7120-7132. 52. Nguyen Quang N, Goudey S, Segeral E, Mohammad A, Lemoine S et al. Dynamic nanopore long-read sequencing analysis of HIV-1 splicing events during the early steps of infection. Retrovirology 2020;17(1):25. 53. Murat P, Zhong J, Lekieffre L, Cowieson NP, Clancy JL et al. G-quadruplexes regulate Epstein-Barr virus-encoded nuclear antigen 1 mRNA translation. Nat Chem Biol 2014;10(5):358-364. 54. Zeglinski K, Montellese C, Ritchie ME, Alhamdoosh M, Vonarburg C et al. An optimised protocol for quality control of gene therapy vectors using Nanopore direct RNA sequencing. bioRxiv 2024:2023.2012.2003.569756. 55. Kwok H, Wu CW, Kellam P, Chiang AKS. Genomic diversity of Epstein-Barr virus genomes isolated from primary nasopharyngeal carcinoma biopsy samples. Journal of virology 2014;88(18):10662-10672. 56. Wang J, Yang L, Cheng A, Tham CY, Tan W et al. Direct RNA sequencing coupled with adaptive sampling enriches RNAs of interest in the transcriptome. Nat Commun 2024;15(1):481. 57. Lee N, Moss WN, Yario TA, Steitz JA. EBV Noncoding RNA Binds Nascent RNA to Drive Host PAX5 to Viral DNA. Cell 2015;160(4):607-618. 58. Naarmann-de Vries IS, Zorbas C, Lemsara A, Piechotta M, Ernst FGM et al. Comprehensive identification of diverse ribosomal RNA modifications by targeted nanopore direct RNA sequencing and JACUSA2. RNA Biol 2023;20(1):652-665. 59. SoRelle ED, Reinoso-Vizcaino NM, Horn GQ, Luftig MA. Epstein-Barr virus perpetuates B cell germinal center dynamics and generation of autoimmune-associated phenotypes in vitro. Front Immunol 2022;13:1001145. 60. SoRelle ED, Dai J, Bonglack EN, Heckenberg EM, Zhou JY et al. Single-cell RNA-seq reveals transcriptomic heterogeneity mediated by host-pathogen dynamics in lymphoblastoid cell lines. Elife 2021;10. .CC-BY-NC 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted April 23, 2025. ; https://doi.org/10.1101/2025.04.23.648900doi: bioRxiv preprint 61. White RE, Calderwood MA, Whitehouse A. Generation and precise modification of a herpesvirus saimiri bacterial artificial chromosome demonstrates that the terminal repeats are required for both virus production and episomal persistence. The Journal of general virology 2003;84(Pt 12):3393-3403. 62. Wick RR, Judd LM, Holt KE. Performance of neural network basecalling tools for Oxford Nanopore sequencing. Genome Biol 2019;20(1):129. 63. Li H. New strategies to improve minimap2 alignment accuracy. Bioinformatics 2021;37(23):4572-4574. 64. Li H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 2018;34(18):3094-3100. 65. Loman NJ, Quick J, Simpson JT. A complete bacterial genome assembled de novo using only nanopore sequencing data. Nat Methods 2015;12(8):733-735. 66. Donovan-Banfield I, Turnell AS, Hiscox JA, Leppard KN, Matthews DA. Deep splicing plasticity of the human adenovirus type 5 transcriptome drives virus evolution. Commun Biol 2020;3(1):124. 67. Abebe JS, Alwie Y, Fuhrmann E, Leins J, Mai J et al. Nanopore guided annotation of transcriptome architectures. mSystems 2024;9(7):e0050524. 68. Wade-Martins R, Frampton J, James MR. Long-term stability of large insert genomic DNA episomal shuttle vectors in human cells. Nucleic Acids Research 1999;27(7):1674-1682. .CC-BY-NC 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted April 23, 2025. ; https://doi.org/10.1101/2025.04.23.648900doi: bioRxiv preprint Figure S1. Barcoding sequences of pHB9-MR11. Schematic representation of the IR1 repeat indicating the Cp, W1 and Y exons (Orange), with the W2 exons coloured according to their sequence, which is shown below: With the first W2 exon PvuI barcoded (Magenta; top sequence); the next four having Wild-type Exon W2 (Brick Red; middle sequence) and final W2 exon MluI barcoded (Blue; bottom sequence). exon W1.1 exon W1.2 exon W1.3 exon W1.4 exon W1.5 exon W1.6 Figure S2. The STOP codon in the W1 exon of HB9 is in the first exon. Visualisations (using Integrated Genome Browser) of the BAM file of the pHB9-LCL reads aligned to the WTw genome, showing each of the six W1 exons, with genome positions (genome position 1 is the Mlu I restriction site in oriP) marked above. Mismatches of reads with consensus sequence are indicated with coloured boxes, and summarised across all reads in the bars at the top. Grey indicates matches to consensus; black line is deletion and blue flanged element insertion, both relative to consensus. .CC-BY-NC 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted April 23, 2025. ; https://doi.org/10.1101/2025.04.23.648900doi: bioRxiv preprint Figure S3. Misalignment of reads to IR1 is evident as multuiple exon pair insertions. Visualisation, using integrated genome browser (indels <4 nt are hidden), of reads aligned to a W2 exon. Un-aligned insertions are indicated by blue bars containing white writing. Expected size of W1-W2 exon pair is 193 nt (W1-W2’ is 172nt). .CC-BY-NC 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted April 23, 2025. ; https://doi.org/10.1101/2025.04.23.648900doi: bioRxiv preprint 0% 10% 20% 30% 40% 2 2.5 3 3.5 4 4.5 5 5.5 6 6.5 7 Cp transcripts Wp transcripts Figure S4. Validation of IR1 counting strategy. (A) Frequency of mis-mapping in different W2 exons in IR1 (repeat number is after the “W2.” Percent with insertion is defined as having a number of nucleotides mapped to the exon that is over 1.5 times the ≈130 nucleotides normally observed aligning to that exon. (B) The percentage Cp or Wp transcripts (that also map to at least one canonical exon downstream of IR1) with calculated IR1 repeat numbers, plotted as 0.1 repeat bins (centred on integer numbers). The read numbers cluster around 0.5 repeat values, suggesting this approach is effective at estimating the number of IR1 exons in each transcript. (C) The distribution of IR1 length estimates of Wp reads, comparing WTw with the other three samples. 0% 5% 10% W2.1 W2.2 W2.3 W2.4 W2.5 W2.6 % with large insertion Exon A B 0.0% 20.0% 40.0% 60.0% 80.0% WTwNon WTw WTw n=29 6 5.5 5 4.5 4 all non-WTw n=14 6 5.5 5 4.5 4 C Wp transcript length distribution % of Wp transcripts .CC-BY-NC 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted April 23, 2025. ; https://doi.org/10.1101/2025.04.23.648900doi: bioRxiv preprint Figure Legends Figure 1. Schematic overview of main latency transcripts of B95-8 EBV. Gene and repeat structure of the virus genome (normally circular, but viewed with its ends at oriP) is shown above an overview of the main transcript structures. Gene and transcript arrows indicate transcription direction. Genes are colour-coded as EBV nuclear antigens (EBNA - red) and latency membrane proteins (LMPs – blue). EBNA promoters (Cp, Wp and Qp) and the U exon are labelled. Repeat regions (magenta) are family of repeats in oriP, internal repeats IR1 (Bam W repeat); IR2 (part of oriLyt); IR3 (glycine-alanine repeats in EBNA1), while B95-8 deletion (∆B95-8) loses IR4 and contains the BAC (not shown); Terminal repeats (TR) lie between LMPs. Note that some genes have more than the indicated number of splicing events. Figure 2. Promoter usage and alternative splicing of the EBNA transcripts. (A) Schematic diagram showing the main splice variants and their frequency (expressed as percentage of transcripts from all four samples combined splicing into or within IR1). Magenta splice junctions would retain the EBNA-LP open reading frame (B) Graph showing distribution of promoter splice junctions for all combined samples (as A) and separated by the four LCLs analysed. (C) The frequency of W1’ usage as the splice acceptor for different splice donors, for all reads combined (total) or individual LCLs made with the different viruses indicated. Figure 3. Transcripts spanning IR1 show relatively efficient splicing, and diverse stochastic splicing decisions. (A) Western blot of EBNA-LP from the LCLs used in this study. Position and size (kDa) of the marker are shown right. Numerical labels to the left indicate the likely number of W1W2 exon pairs likely in each band size. (B) Densitometry of western blot, corrected for the different numbers of antibody epitopes in different length EBNA-LPs, showing relative abundance of EBNA-LP-W4, -W5, and -W6 for each cell line. Number indicates number of W1W2 exon pairs. (C) Graph shows percentage of Cp or Wp transcripts with different numbers of W1 and W1’ exons based on counting exons to which nucleotides are mapped, and correcting for total number of aligned nucleotides. (D) Distribution of IR1 lengths in Cp transcripts in different cell lines. Figure 4. Frequencies of EBNA transcript polyadenylation and alternative splicing events. (A) Schematic representation of the main Cp/Wp-derived RNAs observed or inferred from the nanopore reads in this study. Blue arrows indicate poly adenylation sites. Reads/exons in grey are previously observed splicing events that are infrequent or absent in these data. (B) Table of relative frequencies of transcript pA sites inferred to arise from Cp or Wp promoters (EBNA transcripts). EBNA3B and EBNA3C relative abundance within the EBNA3C pA group was inferred from the ratio of reads extending upstream of 3C Exon1. (C & D) Graphs testing co-ordination between alternative splicing of promoter exons and the polyadenylation site usage, shown either (C) in terms of whether the splicing generates the AUG start codon for EBNA-LP, or (D) frequency of different Cp splice combinations for each pA site. All pA sites downstream of the U exon have been merged into one group. Dashed lines indicate trends noted in the article text. Figure 5. The splicing profile of BART transcripts of the EBV BAC. (A) Schematic representation of transcripts from the BART promoter spanning the EBV BAC. Percentage values are in vertical groups, and indicate the proportion of differential splice donor, splice acceptor or exon usage events. The Hyg transcript is omitted because only one read supported its existence. (B) Novel exons identified upstream of the canonical BART promoter (ExI) are shown as rectangles below the EBV genetic map. Novel exon names are based on the ORFs in which they are found. ORFs are colour coded by sense (orange) or antisense (yellow) orientation. .CC-BY-NC 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted April 23, 2025. ; https://doi.org/10.1101/2025.04.23.648900doi: bioRxiv preprint Supporting data – Figure and table legends. Supporting Figure S1. Barcoding sequences of pHB9-MR11. Schematic representation of the IR1 repeat indicating the Cp, W1 and Y exons (Orange), with the W2 exons coloured according to their sequence, which is shown below: With the first W2 exon PvuI barcoded (Magenta; top sequence); the next four having Wild-type Exon W2 (Brick Red; middle sequence) and final W2 exon MluI barcoded (Blue; bottom sequence). Supporting Figure S2. The STOP codon in the W1 exon of HB9 is in the first exon. Visualisations (using Integrated Genome Browser) of the BAM file of the pHB9-LCL reads aligned to the WTw genome, showing each of the six W1 exons, with genome positions (genome position 1 is the Mlu I restriction site in oriP) marked above. Mismatches of reads with consensus sequence are indicated with coloured boxes, and summarised across all reads in the bars at the top. Grey indicates matches to consensus; black line is deletion and blue flanged element insertion, both relative to consensus. Supporting Figure S3. Misalignment of reads to IR1 is evident as multuiple exon pair insertions. Visualisation, using integrated genome browser (indels <4 nt are hidden), of reads aligned to a W2 exon. Un-aligned insertions are indicated by blue bars containing white writing. Expected size of W1-W2 exon pair is 193 nt (W1-W2’ is 172nt). Supporting Figure S4. Validation of IR1 counting strategy. (A) Frequency of mis- mapping in different W2 exons in IR1 (repeat number is after the “W2.” Percent with insertion is defined as having a number of nucleotides mapped to the exon that is over 1.5 times the ≈130 nucleotides normally observed aligning to that exon. (B) The percentage Cp or Wp transcripts (that also map to at least one canonical exon downstream of IR1) with calculated IR1 repeat numbers, plotted as 0.1 repeat bins (centred on integer numbers). The read numbers cluster around 0.5 repeat values, suggesting this approach is effective at estimating the number of IR1 exons in each transcript. (C) The distribution of IR1 length estimates of Wp reads, comparing WTw with the other three samples. Supplementary table ST1: Oligos for barcoding IR1 Supplementary table ST2: Episome rescue W repeat totals Supporting table ST3: Calculation of IR1 number of repeats. In-house script IR1_repeat_counting.r was used to count the number of nucleotides aligned to each of the latency exons. Only reads with C1 or C2 exons, plus those previously processed for W0 correction are analysed. Calulcated W1-W2 exon number compares the observed number of nucleotides aligned to W exons with the enumber expected for the count of exons to which nucleotides are aligned (column O) , and uses that to calculate a repeat number corrected for alignment problems (column P). Highlighting indicates mismatches between the number of W1 and W2 exons, or between observed and expected repeat number. Supporting table ST4: Reads skipping barcoded exons in vs self and vs WTw alignments. Supporting table ST5: Number of oversize W2 exon alignments by sample. Supporting table ST6: 5’ and 3’ positions used to define full lenth reads. Unbiased identification of "full length" reads defines promoters and pA usage, but 5' ends also include sites of frequent read breakage in EBNA2 and EBNA1 genes. .CC-BY-NC 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted April 23, 2025. ; https://doi.org/10.1101/2025.04.23.648900doi: bioRxiv preprint Supporting table ST7: Read integrity separated by different promoter and polyA site usages. This table combines all canonical 5' and 3' ends identified in table ST6, plus all sites that correspond to previously described promoters or polyA sites. Total usage is the read count for reads with that pA site or 5' end. % full length represents % of reads by pA site atha also contain a defined 5' end (left two tables); or 5' ends that also contain a defined pA site (right tables). These data are summarised in the tables below that show how frequently a read has a canonical 3' (left) or 5' end, showing that (as expected for the pA-intiated seqeuncing method) that the vast majority of reads have canonical 3' ends (pA sites) but under half of sense reads run all the way from pA site to promoter. Supporting table 8. Sense orientation splice junctions. Includes only sites that are frequent in our data or previously annotated. Numbers in cells are number of reads with that junction. Donors marked undefined do not have a consistent exon. .CC-BY-NC 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted April 23, 2025. ; https://doi.org/10.1101/2025.04.23.648900doi: bioRxiv preprint

Text is read by the "Ask this paper" AI Q&A widget below. Extraction quality varies by source — PMC NXML preserves structure cleanly, OA-HTML may include some navigation residue, and OA-PDF can have broken hyphenation. The publisher copy (via DOI) is the canonical version.

My notes (saved in your browser only)

Ask this paper AI returns verbatim quotes from the full text · source: oa-pdf

Answers must be backed by verbatim quotes from this paper's full text. Hallucinated quotes are dropped automatically; if no verbatim passage answers the question, we say so. How this works

Citation neighborhood (no data yet)

We don't have any in-corpus citations linked to this paper yet. This is a recent paper (2025) — citers typically take a year or two to land, and the OpenAlex reference graph may still be filling in.

Source provenance

europepmc
last seen: 2026-05-20T01:45:00.602351+00:00
unpaywall
last seen: 2026-05-21T05:10:58.409756+00:00
License: CC-BY-NC-4.0