A dedicated caller for DUX4 rearrangements from whole-genome sequencing data

preprint OA: gold CC-BY-ND-4.0

Abstract

Rearrangements involving the DUX4 gene (DUX4-r) define a subtype of paediatric and adult acute lymphoblastic leukaemia (ALL) with a favourable outcome. Currently, there is no ‘standard of care’ diagnostic method for their confident identification. Here, we present Pelops, an open-source software tool designed to detect DUX4-r from short-read, whole-genome sequencing (WGS) data. Evaluation on a cohort of 210 paediatric ALL cases showed that Pelops detects all known, as well as previously unidentified, cases of IGH::DUX4 and rearrangements with other partner genes. These findings demonstrate the possibility of robustly detecting DUX4-r using WGS in the routine clinical setting.

bioRxiv preprint, doi: https://doi.org/10.1101/2024.05.23.595509; this version posted May 24, 2024. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-ND 4.0 International license.

Keywords

DUX4; acute lymphoblastic leukaemia; ALL; whole-genome sequencing; IGH::DUX4; IGH enhancer hijacking

Background

In childhood and adult B-cell acute lymphoblastic leukaemia (B-ALL), chromosomal abnormalities play a significant role in risk stratification for treatment within clinical trials worldwide. Although a wide range of genetic aberrations have been known for many years, recent sequencing studies have uncovered a wealth of additional genetic information of prognostic relevance with implications for changes to treatment strategies. One such genetic alteration, DUX4 rearrangements (DUX4-r), defines a recently reported subtype that affects 4–7% of paediatric patients [1] and ∼5% of adolescents and young adults [2]. Patients in all age groups exhibit favourable outcomes but, due to the presence of concomitant risk factors, are frequently treated as intermediate or high risk [3,4]. However, for reasons described below, the detection of DUX4-r is challenging, and if optimal therapy is to be given to these patients, their accurate identification is paramount.

DUX4 (Double Homeobox 4) is a transcription factor that is selectively and transiently expressed in cleavage-stage embryos [5] and germ cells of the testis [1]. A copy of the DUX4 gene, encoding two homeoboxes, is located within each unit of the D4Z4 macrosatellite repeat array in the subtelomeric region of the long arm of chromosome 4 (4q) and in a similar repeat array on chromosome 10q. The D4Z4 array, built from ∼3.3 kb repeat units, is polymorphic in length and has 11–100 copies in healthy individuals [6]. When ectopically activated, DUX4 can upregulate expression of multiple genes and initiate transcription from alternative promoters, leading to non-canonical transcript isoforms [7]. In fact, it has been shown that contraction of the D4Z4 repeat array below 11 copies decreases the epigenetic repression of DUX4, causing autosomal dominant facioscapulohumeral muscular dystrophy (FSHD) [8]. DUX4-r cases in ALL were initially discovered through their distinctive gene expression profile [9].
DUX4 is commonly rearranged with the Immunoglobulin Heavy Chain Locus (IGH), although multiple fusion partners have been identified [10–14]. The rearrangement typically creates a chimeric transcript that retains the 5’ end of DUX4 but replaces the 3’ coding sequence with a section of IGH. This event, most likely via IGH enhancer hijacking, leads to activation of expression of DUX4 in developing lymphocytes. The resulting change in the transcriptional landscape of the affected cell is thought to lead to oncogenic transformation [13].

DUX4-r have also been discovered in a distinct rare subtype of CIC (Capicua Transcriptional Repressor)-rearranged non-Ewing sarcoma, accounting for less than 1% of all sarcomas [15], primarily affecting young adults and associated with a poor outcome [16]. In CIC::DUX4 sarcoma, similar to ALL, the resulting fusion protein acts as a transcriptional activator driving oncogenesis [17].

In ALL, DUX4-r were first thought to be driven by co-occurring deletions in the ERG transcription factor, which was used as a surrogate for their identification. However, more recent studies have indicated that ERG deletions are present in only a subset of DUX4-r patients and that they are likely to be subclonal [3,12,14,18,19]. Subsequently, using next generation sequencing (NGS) approaches, two independent studies confirmed DUX4-r to be the driving lesion [11,13]. The complex and cryptic nature of DUX4, and the presence of very similar DUX4 copies throughout the genome, have precluded its accurate detection using current bioinformatics tools and other standard of care genetic tests.
Common approaches that look for accumulation of discordant sequence read pairs to identify breakpoints typically fail because the multiple possible mapping locations scatter the supporting reads. Furthermore, most structural variant callers disregard sequence reads with multiple equivalent mapping positions. Clearly, robust genetic testing methods are urgently needed.

We reasoned that a custom DUX4-r caller, which takes into account all read pairs spanning any DUX4 copy and the gene partner of interest, was required: an approach that we have described previously [12]. We present here an improved implementation of this method as Pelops, an open-source software tool that can be integrated into existing bioinformatics analysis pipelines. We evaluated Pelops on a paediatric B-ALL cohort of 210 patients [12] and demonstrate that Pelops is a robust tool for identifying DUX4 rearrangements from tumour-only WGS data. This proof-of-concept work indicates a path to improved diagnostic testing for this good risk genetic subtype in clinical WGS pipelines.

Results

Pelops overview

Pelops is a software tool that implements a tumour-only analysis to identify the signal of DUX4 rearrangements from short-read WGS data, building on our earlier version of the method [12]. The most common DUX4-r, IGH::DUX4, is identified using a targeted approach by finding read pairs spanning any part of the IGH and DUX4 regions and then calculating the number of spanning read pairs per billion total reads (SRPB) (Figure 1A, Methods, Supplementary Figure S1). For DUX4-r cases involving DUX4 fusions with other partner genes, an untargeted, genome-wide approach is required. Pelops finds evidence for these cases by identifying genomic regions containing multiple mates of improperly paired reads anchored in the DUX4 region (Figure 1B, Methods, Supplementary Figure S1).

Evaluation on the paediatric ALL cohort

IGH::DUX4 rearrangements

An earlier implementation of this approach identified 57 IGH::DUX4 cases in a cohort of 210 paediatric B-cell ALL patients [12] (Supplementary Table 1). As described in the Methods section, we modified the original method to find all spanning reads in each sample and developed the Pelops software. Initially, we ran Pelops on the same 210 leukaemia samples, using their matched germline samples as negative controls. We found excellent concordance between the published spanning reads per billion (SRPB) values from the earlier implementation and those from Pelops (Supplementary Figure S2).

The SRPB values depend on how the DUX4 region is defined, and we show results for two complementary definitions: the ‘core DUX4’ region and the ‘extended DUX4’ region. The extended region covers the DUX4 repeat arrays on chromosomes 4 and 10 with a 100 kb margin, as well as additional DUX4 pseudogenes on other chromosomes. The core region is a subset of the extended region, and only covers the subtelomeric DUX4 repeats on chromosomes 4 and 10, with a 1 kb margin (Figure 1A, Supplementary Table 2). By using SRPB values calculated for these two regions, it was possible to identify all samples with an IGH::DUX4 fusion (Figure 2).
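How SRPB values from the two region definitions might be combined into a single call can be sketched as follows. This is an illustration only, not the Pelops implementation: the function name and the "either region reaches its threshold" rule are assumptions, the thresholds (SRPB ≥ 5 for the core region, SRPB ≥ 15 for the extended region) are those reported later in this section for DRAGEN alignments, and the core value used for the germline sample is a placeholder because the text does not report it.

```python
# Thresholds reported in the Results for DRAGEN alignments.
CORE_SRPB_THRESHOLD = 5.0
EXTENDED_SRPB_THRESHOLD = 15.0


def call_igh_dux4(core_srpb: float,
                  extended_srpb: float,
                  core_threshold: float = CORE_SRPB_THRESHOLD,
                  extended_threshold: float = EXTENDED_SRPB_THRESHOLD) -> bool:
    """Assumed rule: call IGH::DUX4 if either region reaches its threshold."""
    return core_srpb >= core_threshold or extended_srpb >= extended_threshold


# Patient 20720 (leukaemia): no spanning reads in the core region,
# SRPB = 52.8 in the extended region.
patient_20720 = call_igh_dux4(core_srpb=0.0, extended_srpb=52.8)

# Germline sample of patient 21322: SRPB = 9.1 in the extended region.
# The core value is not reported; 0.0 is a placeholder consistent with the
# stated noise level (SRPB < 3.1 in the core region).
germline_21322 = call_igh_dux4(core_srpb=0.0, extended_srpb=9.1)

print(patient_20720, germline_21322)
```

Under these assumptions the leukaemia sample is called and the germline sample is not, which shows why the extended region needs the higher threshold.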
These results were independent of the read aligner used: while we discuss results for DRAGEN alignments (Figure 2) for the remainder of this section, bwa and Isaac [20,21] yielded qualitatively identical results (Supplementary Figure S3, Supplementary Table 3).

Among the 57 samples identified by Pelops as IGH::DUX4 fusions, orthogonal evidence was available for a total of 55 (Figure 3, Supplementary Table 1). Previous analysis confirmed 43 samples through ERG deletions and through whole-transcriptome sequencing (WTS) [12]. ERG deletions associated with DUX4-r were confirmed by WGS and by Multiplex Ligation-dependent Probe Amplification (MLPA). WTS was used to confirm cases based on their gene expression profile and the presence of IGH::DUX4 RNA fusions. In this study, we used Pelops outputs to create de novo sequence assemblies of DUX4-r breakpoint junctions, as described in more detail below. Based on these sequences, we then designed primers to confirm a further 12 IGH::DUX4 cases by PCR amplicon sequencing (Figure 3, Supplementary Table 1).

Using the core DUX4 region, all but one of the orthogonally validated samples were correctly called using a threshold of SRPB ≥ 5 (Figure 2). The remaining case (patient 20720) was called using the extended region as described below. Of the 21 IGH::DUX4 samples with relatively low SRPB (5 ≤ SRPB < 40), 18 had orthogonal evidence of DUX4-r (Figure 3). The level of noise in germline and leukaemia samples with no known IGH::DUX4 fusion was very low, with SRPB < 3.1 in the core region.
Thus, we are confident that Pelops has a precision of 100% in this cohort, at a threshold of SRPB ≥ 5 for DRAGEN alignments in the core DUX4 region.

The paediatric validation cohort contained only one case (patient 20720) with an ERG deletion but no spanning reads between the core DUX4 region and IGH (Figure 2). The DUX4-r was successfully identified (SRPB = 52.8) only by using the extended DUX4 region that includes sequence 100 kb upstream of DUX4. In other samples, using the extended DUX4 region also uncovered additional breakpoints (Supplementary Figure S6). However, it also led to an increase in false positive spanning reads in germline samples (e.g. patient 21322; SRPB = 9.1). This necessitated an increase in the threshold to SRPB ≥ 15 when using the extended region definition.

The number of spanning reads defining a genuine IGH::DUX4 rearrangement should theoretically depend on several sequencing and sample parameters. A longer read length should increase the split read evidence, while a larger insert size should increase the paired read evidence. Higher tumour content and sequencing coverage should lead to an increase in both types of spanning reads. There is a clear correlation between each of these parameters and SRPB (Supplementary Figure S4). While the SRPB measure is normalised for the number of reads, sequencing parameters changed over the course of the paediatric ALL study, and tumour samples sequenced earlier had lower coverage, read length and fragment size compared to those sequenced later (Supplementary Figure S5).
As a result of this interdependency, it was not possible to separate the effects of each of the parameters on SRPB. However, the data in Supplementary Figure S4 show that Pelops called IGH::DUX4 rearrangements in the most challenging samples of the cohort, which have 100 base read length (vs 150 base in most samples), short median insert size (< 300 bp), low coverage (30–40×), or low tumour content (< 50%).

Finally, the choice of aligner also has a clear impact on the SRPB values, as they are sensitive to the aligner’s behaviour in challenging-to-map regions (Supplementary Figure S3, Supplementary Table 3). For reads aligned with bwa, SRPB values are higher on average than with DRAGEN, while Isaac alignments yield lower values. Considering these factors, it is remarkable that simple fixed thresholds of SRPB ≥ 5 for the core DUX4 region and SRPB ≥ 15 for the extended DUX4 region consistently yield correct results in this cohort irrespective of the aligner used.

DUX4 rearrangements with other partner genes

The earlier analysis of the paediatric B-ALL cohort [12] identified two DUX4-r in which DUX4 expression was possibly activated through a fusion with non-IGH genes active in developing lymphocytes, MYB and DNTT. We further improved and automated the method for calling such rearrangements (Figure 1B), as described in more detail in the Methods. Pelops was run on all leukaemia samples from the paediatric cohort and was able to recall the MYB::DUX4 and DNTT::DUX4 rearrangements (Figure 3). It also called a QSOX1::DUX4 rearrangement which is part of an IGH::QSOX1::DUX4 triple fusion. Three more, previously undescribed, rearrangements were called in samples which also contained an IGH::DUX4 fusion. In patients 10876 and 19827, the translocations were found in introns of SDR16C6P and ENTREP1, respectively. In patient 23507, the rearrangement did not fall inside a gene or pseudogene. Furthermore, despite the presence of a somatic ERG missense variant, we found no evidence for a DUX4-r in one case (patient 22355) using Pelops, which is in agreement with the previous implementation.

De novo assembly and validation of DUX4-rearrangement breakpoints

To provide additional confidence in this method, we sought to confirm that the spanning reads identified by Pelops could be used for a de novo assembly of the translocation breakpoints. For all 59 DUX4-r cases, we were able to obtain at least one sequence assembly that aligned to a relevant breakpoint junction (Methods, Supplementary Table 4). All six cases of DUX4-r with non-IGH regions were also confirmed in this way. This provides an important additional validation of the predicted DUX4-r and confirmed that Pelops made no false positive calls within the paediatric patient cohort.

Amongst the 57 samples with IGH::DUX4 rearrangements, we were able to find two or more breakpoint junctions in 67% (38/57) of cases (Supplementary Figure S6). Due to the repetitive nature of the DUX4 region, it was not always possible to distinguish between an insertion of a DUX4 sequence into IGH, a reciprocal translocation event, or multiple independent rearrangements of the two alleles. However, for five samples, the de novo assembled sequence captures an entire insertion of a DUX4 sequence into the IGH locus, which has been described as the most common mechanism for creating a DUX4 fusion that acts as an oncogenic transcriptional activator [1].
When looking at the genomic positions of breakpoints, we observed a clear association with V, D and J segments of the IGH gene (Supplementary Figure S7), the targets of V(D)J rearrangements during lymphocyte development. Most breakpoints were clustered in a 60 kb genomic region containing IGH-D and IGH-J segments, which makes up less than 5% of the entire IGH region.

To obtain orthogonal evidence and confirm Pelops results for the 16 DUX4-r cases that were not validated in the previous study [12], the sequence assemblies described above were used to aid PCR primer design for amplicon sequencing. Out of these 16 samples, DNA was available for 13. Two additional samples, previously confirmed through RNA sequencing, were used as positive controls, bringing the total number of samples used in this experiment to 15. A unique PCR product was obtained in 14 cases (Supplementary Table 1, Figure 3). PCR primer design was not possible for the remaining case due to the repetitive nature of the rearrangement junction. The PCR amplification products were successfully sequenced, confirming that the de novo assembly based on spanning reads identified by Pelops accurately represented the rearranged sequence (Methods).

Discussion

In the NHS England Genomic Medicine Service, all acute leukaemia cases are eligible for WGS for genetic subtyping to assist in treatment decisions. However, there is currently no available bioinformatics tool that can robustly detect the DUX4-r cases from WGS data. The results presented here show that Pelops, our dedicated DUX4-r caller, can accurately detect IGH::DUX4 rearrangements that have been previously identified by one or more of the following: gene expression profiling, PCR-based assays (MLPA), co-occurring ERG deletions, and the presence of RNA fusions. Furthermore, in agreement with our previously published study [12], Pelops detected cases of either IGH::DUX4 or DUX4 rearranged with other gene partners that were not identified using current standard-of-care methods. Additional validation of these cases, using de novo assemblies of supporting sequencing reads identified by Pelops followed by PCR amplicon sequencing, provides further confidence by confirming these results. Importantly, there were no false positive cases among 208 germline samples and 151 leukaemia samples from other currently known B-ALL subtypes.

To our knowledge, no other direct or indirect method can robustly detect DUX4-r cases. Although the correlation of DUX4-r with ERG abnormalities, primarily deletions, is known, our data confirm the findings of other studies [3,12,14,18,19] by showing that only 34/57 (60%) of DUX4-r cases in our paediatric B-ALL cohort had a concurrent ERG deletion. We also observed no clear correlation of missense mutations in ERG with DUX4 rearrangements. Furthermore, while WTS may be used to classify some DUX4-r cases from their distinct gene expression profile, RNA sequencing of leukaemia samples is not yet widely implemented in the clinical setting.

We have shown that rearrangement-supporting reads identified by Pelops can be used to assemble contigs and/or scaffolds spanning the junction between fusion genes, which provides additional confidence in our tool. These assemblies can be used for further validation by PCR-based approaches and examined to elucidate mechanistic origins of these rearrangements.
One intriguing possibility that requires further investigation is the involvement of aberrant RAG1/2 recombinase activity in the mutagenesis process, given that many IGH rearrangement breakpoints joining DUX4 are in the proximity of RAG recombination signal sequences flanking V, D and J segments of the IGH variable region. Similar off-target V(D)J rearrangements involving known oncogenes and tumour suppressor genes have previously been described in the ETV6::RUNX1 subtype of ALL [22] and in mouse models of lymphoma [23].

It was reassuring that Pelops successfully detected IGH::DUX4 rearrangements from samples that were analysed using a variety of sequencing and library preparation methods, and had variable tumour content. While we do observe some positive correlation of SRPB values with insert size, sequencing coverage and tumour content, the method was robust enough to detect all known cases in the cohort we analysed. Subclonal complexity and low tumour content were not observed in our validation cohort, but these could impair detection and should be considered when developing sequencing coverage targets.

We demonstrate that Pelops can be successfully used on data generated by different sequencing read aligners, including the most widely used open-source tool, bwa. However, this study showed that each alignment tool leads to a different level of “background noise”, which can be further complicated by using different versions of the reference genome.
While detection sensitivity of IGH::DUX4 rearrangements can be easily adjusted by shifting the SRPB threshold for calling positive cases, detecting rearrangements of DUX4 with other gene partners in a genome-wide approach may require calibrating both the SRPB threshold and the mapping quality threshold for read mates, as well as blacklisting genomic regions that lead to recurrent false positive calls.

Besides its application in ALL, where the majority of rearrangements occur between DUX4 and IGH, we envisage other applications in ALL, as well as in other cancers, where DUX4 is rearranged with other gene partners. Notably, a subtype of non-Ewing sarcoma with CIC rearrangements is known to primarily consist of CIC::DUX4 fusions, which are similarly difficult to detect with common bioinformatics tools [16]. To accommodate this need, we designed Pelops to detect DUX4 rearrangements in a gene-partner agnostic approach in addition to the IGH-targeted approach. In fact, Pelops has been able to detect an orthogonally validated CIC::DUX4 fusion in one available case of non-Ewing sarcoma (data not shown). Although validation of the gene-agnostic method is not the subject of this study, we expect that the availability of Pelops will catalyse this process in appropriate cohorts, as well as the implementation of DUX4-r detection in a wider range of tumour types. We also envisage that this approach could be successfully adapted and applied to identify rearrangements in other genes within repetitive regions that are challenging to detect by current workflows.

Conclusions

Pelops is an open-source software tool designed to detect DUX4 rearrangements in short-read whole-genome sequencing data. In our cohort of paediatric ALL samples, Pelops reliably detected all known IGH::DUX4 rearrangements as well as additional cases, including DUX4 rearrangements with other gene partners. Pelops is easy to integrate into existing bioinformatics pipelines and supports inputs created from the most widely used alignment tools such as bwa and DRAGEN. The work described here demonstrates that there is a path to using WGS to meet the current need to aid in the diagnosis and clinical management of ALL and other cancer types driven by DUX4-r.

Methods

Sequence alignment

Sequence reads were aligned to human genome reference GRCh38 using three different workflows:
• Isaac Aligner (version SAAC01325.18.01.29). The full workflow was described previously [12].
• DRAGEN (version 4.0.3).
• bwa mem (version 0.7.17).

Following alignment, read pairs were annotated (‘fixmate’) and sorted, and duplicates were marked using samtools version 1.15.1.

Calling IGH::DUX4 fusions

In order to call an IGH::DUX4 fusion, Pelops finds read pairs spanning the translocation breakpoint(s), which can either be paired reads or split reads. In paired reads, one of the reads maps to DUX4 and the other to IGH. In split reads, one of the two paired reads has a split alignment, with the primary segment aligned to IGH and a supplementary alignment in DUX4, or vice versa. As both IGH and DUX4 have numerous homologous repeats, segments of spanning reads are frequently mapped to different repeats and could be found across the entire IGH and DUX4 regions. Therefore, the following strategy was used to find IGH::DUX4 fusions:
1. Count reads spanning the DUX4 and IGH regions. Reads flagged as duplicates or QC failed are filtered out. There is no filter based on mapping quality, as relevant reads are frequently in regions with low mappability due to the repetitive character of IGH and DUX4.
2. Normalise the counts to account for sample-specific coverage, to obtain spanning read pairs per billion (SRPB):
$$\mathrm{SRPB} = \frac{\text{Number of spanning read pairs}}{\text{Total number of unique and mapped reads}} \times 10^{9}$$
3. Call an IGH::DUX4 fusion if SRPB is at or above a given threshold.

Defining the IGH and DUX4 regions. The full definitions of the IGH and DUX4 regions are listed in Supplementary Table 2. The IGH region definition was taken from IMGT [24].
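The counting and normalisation steps of the IGH::DUX4 strategy above can be sketched in Python. This is a minimal illustration under stated assumptions, not the Pelops implementation: the `Region`, `Segment` and `ReadPair` types, the function names and the region coordinates are hypothetical placeholders (the actual region definitions are in Supplementary Table 2), and read pairs are modelled as plain records rather than parsed from a BAM file.

```python
from dataclasses import dataclass
from typing import Iterable, Sequence


@dataclass(frozen=True)
class Region:
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive


@dataclass(frozen=True)
class Segment:
    """One aligned segment of a read (primary or supplementary)."""
    chrom: str
    start: int
    end: int


@dataclass(frozen=True)
class ReadPair:
    segments: Sequence[Segment]  # all aligned segments of both mates
    is_duplicate: bool = False
    qc_fail: bool = False


def in_any(seg: Segment, regions: Iterable[Region]) -> bool:
    """True if the segment overlaps at least one of the regions."""
    return any(seg.chrom == r.chrom and seg.start < r.end and seg.end > r.start
               for r in regions)


def is_spanning(pair: ReadPair, igh: Sequence[Region], dux4: Sequence[Region]) -> bool:
    """Paired or split evidence: one segment in IGH and one in DUX4.

    Duplicates and QC-failed reads are dropped; mapping quality is not used.
    """
    if pair.is_duplicate or pair.qc_fail:
        return False
    return (any(in_any(s, igh) for s in pair.segments)
            and any(in_any(s, dux4) for s in pair.segments))


def srpb(n_spanning_pairs: int, n_total_reads: int) -> float:
    """Spanning read pairs per billion unique, mapped reads."""
    if n_total_reads <= 0:
        raise ValueError("n_total_reads must be positive")
    return n_spanning_pairs * 1e9 / n_total_reads


# Placeholder coordinates for illustration only (not Supplementary Table 2).
IGH = [Region("chr14", 105_000_000, 107_000_000)]
DUX4 = [Region("chr4", 190_000_000, 190_200_000),
        Region("chr10", 133_600_000, 133_800_000)]

pairs = [
    # Paired evidence: one mate in IGH, the other in DUX4.
    ReadPair([Segment("chr14", 105_500_000, 105_500_150),
              Segment("chr4", 190_100_000, 190_100_150)]),
    # Split evidence: both mates in IGH, supplementary segment in DUX4.
    ReadPair([Segment("chr14", 105_600_000, 105_600_150),
              Segment("chr14", 105_600_300, 105_600_380),
              Segment("chr10", 133_700_000, 133_700_070)]),
    # Duplicate of the first pair: filtered out.
    ReadPair([Segment("chr14", 105_500_000, 105_500_150),
              Segment("chr4", 190_100_000, 190_100_150)], is_duplicate=True),
    # Ordinary pair entirely inside IGH: not spanning.
    ReadPair([Segment("chr14", 105_700_000, 105_700_150),
              Segment("chr14", 105_700_300, 105_700_450)]),
]

n_spanning = sum(is_spanning(p, IGH, DUX4) for p in pairs)
sample_srpb = srpb(n_spanning, n_total_reads=400_000_000)
print(n_spanning, sample_srpb)
```

Treating a pair as spanning whenever any of its segments falls in each region covers both the paired and the split case with one test, which is one simple way to count pairs rather than individual reads.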
To define the DUX4 regions, several aspects were taken into account:
• IGH has a strong enhancer that can activate DUX4 expression from a long genomic distance, which implies that read evidence for DUX4-r could be mapped far upstream of the DUX4 repeat arrays.
• There are multiple annotated DUX4-like pseudogenes outside the DUX4 repeat arrays. While they are not known to be implicated in DUX4-r, spanning reads may still be mapped there.
• A wide region definition could lead to false positive spanning reads due to the presence of repetitive genomic elements.

To accommodate these conflicting requirements, SRPB was calculated based on two DUX4 region definitions.

The “core DUX4 region” contains all DUX4 genes (and pseudogenes) in the repeat arrays on chromosomes 4 and 10 as annotated by Ensembl v91 [25], with a 1 kb margin. It covers a very limited amount of intergenic sequence and leads to few false positive spanning reads in negative samples. Furthermore, this definition is used to call non-IGH DUX4-rearrangements with Pelops (see below).

The “extended DUX4 region” contains the complete subtelomeric repeat arrays on chromosomes 4 and 10 with a 100 kb margin. DUX4-like pseudogenes on other chromosomes, as annotated by Ensembl v91, are also included with a 1 kb margin. This region can be used to call rare IGH::DUX4 rearrangements with breakpoints further than 1 kb upstream of the DUX4 repeat arrays, and rearrangements where the majority of spanning reads were mapped to DUX4 pseudogenes outside the repeat arrays (although we did not observe such a case in our study).
This comes at the expense of having to set a higher SRPB threshold for calling DUX4-r, due to the higher rate of false positive spanning reads.

Differences to previously published method. While Pelops results are similar to those of the previously published method, three key improvements were made:
1. Counting of all spanning reads. Previously, only reads flagged as improperly paired in the BAM file were used to identify spanning reads. In Pelops, split reads, which are often flagged as properly paired, are also included.
2. Calculation of SRPB. Previously, the numerator of the SRPB equation was based on the number of spanning reads. In Pelops, the calculation is based on spanning read pairs to avoid ambiguity in how to count reads with multiple aligned segments.
3. Definitions of DUX4 and IGH regions. Both the IGH and the DUX4 region definitions were changed to reduce noise. The IGH region used by Pelops is reduced by 50 kb on either side to match the IMGT [24] definition, as all previously described IGH::DUX4 rearrangements have breakpoints inside the IGH region [1]. Pelops uses two complementary DUX4 region definitions, while the previous implementation used only one to call all DUX4 rearrangements. Both the core and the extended DUX4 regions differ from the one previously used, which covered the DUX4 repeat arrays with approximately 70 kb and 50 kb margins on chromosomes 4 and 10, respectively.

Calling DUX4-rearrangements with non-IGH regions

Pelops’ method for calling DUX4 fusions with non-IGH genes involves three main steps.
First, candidate regions in the genome that have a potential translocation with DUX4 are identified. Secondly, a slightly modified version of the algorithm developed for IGH::DUX4 fusions is run for each of the candidate regions. Finally, candidate regions with low evidence based on the results of the previous step are filtered out.

Finding candidate regions. Candidate regions are found by looking for reads flagged as improperly paired in the core DUX4 region. Only rearrangements with the core DUX4 region are considered, to increase the specificity of this caller and to reduce noise from repetitive genomic elements located in the extended DUX4 region. The mapping locations of their mates are counted in 1 kb genomic bins. Bins with more than two improperly paired reads are retained, and adjacent genomic bins are merged. All genomic bins overlapping with the extended DUX4 and IGH regions are removed. This forms a provisional set of candidate genomic regions which may be rearranged with DUX4.

Finding evidence for rearrangements in each candidate region. For each candidate region, spanning reads are found in the same manner as described for IGH::DUX4 fusions, except for one difference: spanning reads are only counted if the aligned segment in the candidate region has a mapping quality ≥ 10 (default threshold that can be changed by the user). This is to reduce the number of false positive rearrangements with candidate regions that contain repetitive sequences.

Filtering candidate regions. For the final output, blacklisted regions that were repeatedly called in normal samples, and regions with SRPB < 20 (default threshold that can be changed by the user), are removed. The blacklist used for filtering candidate regions contains all candidate regions that are called at least twice with SRPB ≥ 10 amongst the normal samples of the paediatric ALL cohort. As the blacklist is highly dependent on the read aligner used, separate blacklists were generated for DRAGEN and bwa analyses. The regions included in each blacklist are listed in Supplementary Table 5.

Validation of optimal parameters. The non-IGH DUX4-r caller is dependent on several parameters: the minimum mapping quality of reads in a candidate region, the minimum SRPB threshold for filtering candidate regions, and the minimum SRPB threshold used for generating the blacklist. To find the optimal settings, a random permutation cross validation was performed 10 times. The normal samples were randomly divided into a training (80%) and an evaluation (20%) set (the test set consists of the tumour samples). All candidate regions that were called at least twice in the training set, passing a given blacklist SRPB threshold, were added to the blacklist. Then, in the evaluation set, candidate regions above a given filtering SRPB threshold were compared against this blacklist. The effect of changing the minimum mapping quality threshold to 1 was also investigated in this way. This validation was done for both bwa and DRAGEN alignments. The final parameter values were chosen such that there were no additional calls in the evaluation set for any of the validation rounds, while yielding a minimal blacklist to minimise the risk of missing true positives in the test set.

De novo assembly of breakpoints

As input for de novo assembly, spanning reads from DRAGEN-aligned BAM files were exported to a SAM file using Pelops. De novo assemblies were produced with SPAdes version 3.11.1 [26] with default parameters.
Where this method failed, Velvet version 1.2.10 [27] was used for de novo assembly, with the following non-default parameters:
• K. The size of the k-mers used to create de Bruijn graphs in Velvet was set to K = 21.
• Expected k-mer coverage. This was calculated from the average coverage over the entire genome reported by DRAGEN.
• Fragment length. The fragment length was calculated from the median insert length and read length reported by DRAGEN.

Mapping of de novo assemblies to the reference genome

The assembled scaffolds were mapped to the GRCh38 reference genome using BLAT [28]. Per sample, up to 500 alignments with an alignment score ≥ 20 were evaluated and ranked by their alignment score. Next, BLAT results were manually filtered to remove duplicate alignments resulting from the repetitive nature of the IGH and DUX4 genomic regions. In order of importance, the following criteria were used:
• The scaffold should be aligned as completely as possible to segments of the reference genome.
• The highest-scoring alignments should be used to cover each segment of the scaffold.
• Alignments should be consistent (e.g. for DUX4, all segments should be aligned to the same chromosome).
If multiple alignments were equivalent according to these criteria, the alignment most upstream on the reference genome was used.

In BLAT, each alignment can consist of multiple blocks. A custom Python script was used to extract these blocks and merge adjacent blocks if they differed only by short indels of length < 20 bp. Another custom Python script was used to annotate each aligned segment as IGH, core DUX4, extended DUX4 (if not already covered by core DUX4), other, or unknown sequence (gaps in scaffold marked by Ns). Unannotated sequences > 25 bp were separately aligned with BLAT, with no minimum alignment score, to annotate short sequences that were previously missed. From these alignments, a list of breakpoints was obtained. For our analyses, we only considered those breakpoints that have a continuous sequence in the de novo assembly, that is, without any unknown sequence connecting IGH and DUX4 segments.

Amplicon sequencing

Rearrangement junctions were confirmed by successful amplification of PCR products using primer pairs specific to each assembly. Primers were designed using primer3 software [29–31] (Supplementary Table 6). PCR products were converted to sequencing libraries by combining 1 µl of PCR product with 19 µl of nuclease-free water for input into the Nextera XT DNA Library Prep Kit (Illumina PN# 15032354, 15032355, 15052163). PCR amplicons were simultaneously fragmented and tagged, then adapter sequences were added by 12 cycles of PCR as per the manufacturer’s protocol. Final library concentration was assessed on the Agilent 4200 TapeStation System using the High Sensitivity D1000 tape, and libraries were diluted to 2 nM with RSB and pooled. A 10 µl aliquot of pooled libraries was denatured with 10 µl of freshly diluted 0.2 N NaOH (Sigma-Aldrich PN# 72068) by incubation for 5 minutes at room temperature. Following denaturation, the pooled library was diluted to 20 pM by the addition of 980 µl of pre-chilled HT1 buffer (Illumina PN# 15027041) and further diluted to a final loading concentration of 14 pM. The library was mixed with a 50% PhiX spike-in (Illumina PN# 15017397) and the final pool was sequenced on a MiSeq using a MiSeq reagent kit 150cy v3 (2 × 75 bp reads) as per the manufacturer’s instructions (Illumina PN# 15043893, 15043894).

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Availability of data and materials

Pelops is publicly available as a Python package named ilmn-pelops. The results of this paper are based on Pelops version 0.8.0, which is available through the Python Package Index (https://pypi.org/project/ilmn-pelops/0.8.0/). The source code can also be found at https://github.com/Illumina/Pelops. The code has extensive unit test coverage (98.1% at commit 6f29f33e) and tests pass on Python 3.7, 3.9, 3.11.
Pelops has minimal dependencies on third-party packages, with the exception of pysam [32]. The IGH::DUX4 caller will also be implemented in Illumina DRAGEN 4.3.

Competing interests

PG, SB, KJC, CF, IA, HN, DJM, PJC, JB, DRB, MM, and MTR are employees of Illumina, a public company that develops and markets systems for genetic analysis.

Funding

This study was supported by Blood Cancer UK (grant 15036). SLR is a Career Development Fellow supported by Cancer Research UK (CRUK) (grant C60802/A27193).

Authors' contributions

PG and MM provided guidance for software development, evaluated Pelops on the paediatric patient cohort and wrote the manuscript. SB, KJC, PG, and PJC developed the Pelops software. JFP conceived the methods for calling DUX4-rearrangements. DJM designed PCR primers for amplicon sequencing of the rearrangement junctions, while CF, AI and HN performed and optimised PCR and sequencing. SLR, CJH, and AVM provided patient samples and orthogonal data. JB, MTR, SLR, CJH, and AVM revised the manuscript. SLR, DRB, CJH, AVM and MTR conceived the study. All authors read and approved the final manuscript.

Acknowledgements

Primary childhood leukaemia samples used in this study were provided by the VIVO Biobank for Children and Young People with Cancer. We also thank all the members of the NCRI Childhood Cancer and Leukaemia Group (CCLG) Leukaemia Subgroup for access to material and data on clinical trial patients.

References

1. Rehn JA, O’Connor MJ, White DL, Yeung DT. DUX Hunting—Clinical Features and Diagnostic Challenges Associated with DUX4-Rearranged Leukaemia. Cancers. 2020;12:2815.
2. Paietta E, Roberts KG, Wang V, Gu Z, Buck GAN, Pei D, et al. Molecular classification improves risk assessment in adult BCR-ABL1-negative B-ALL. Blood. 2021;138:948–58.
3. Harvey RC, Mullighan CG, Wang X, Dobbin KK, Davidson GS, Bedrick EJ, et al. Identification of novel cluster groups in pediatric high-risk B-precursor acute lymphoblastic leukemia with gene expression profiling: correlation with genome-wide DNA copy number alterations, clinical characteristics, and outcome. Blood. 2010;116:4874–84.
4. Schwab C, Cranston RE, Ryan SL, Butler E, Winterman E, Hawking Z, et al. Integrative genomic analysis of childhood acute lymphoblastic leukaemia lacking a genetic biomarker in the UKALL2003 clinical trial. Leukemia. 2023;37:529–38.
5. Hendrickson PG, Doráis JA, Grow EJ, Whiddon JL, Lim J-W, Wike CL, et al. Conserved roles of mouse DUX and human DUX4 in activating cleavage-stage genes and MERVL/HERVL retrotransposons. Nat Genet. 2017;49:925–34.
6. van Deutekom JC, Wijmenga C, van Tienhoven EA, Gruter AM, Hewitt JE, Padberg GW, et al. FSHD associated DNA rearrangements are due to deletions of integral copies of a 3.2 kb tandemly repeated unit. Hum Mol Genet. 1993;2:2037–42.
7. Choi SH, Gearhart MD, Cui Z, Bosnakovski D, Kim M, Schennum N, et al. DUX4 recruits p300/CBP through its C-terminus and induces global H3K27 acetylation changes. Nucleic Acids Res. 2016;44:5161–73.
8. Young JM, Whiddon JL, Yao Z, Kasinathan B, Snider L, Geng LN, et al. DUX4 binding to retroelements creates promoters that are active in FSHD muscle and testis. PLoS Genet. 2013;9:e1003947.
9. Yeoh E-J, Ross ME, Shurtleff SA, Williams WK, Patel D, Mahfouz R, et al. Classification, subtype discovery, and prediction of outcome in pediatric acute lymphoblastic leukemia by gene expression profiling. Cancer Cell. 2002;1:133–43.
10. Gu Z, Churchman ML, Roberts KG, Moore I, Zhou X, Nakitandwe J, et al. PAX5-driven subtypes of B-progenitor acute lymphoblastic leukemia. Nat Genet. 2019;51:296–307.
11. Lilljebjörn H, Henningsson R, Hyrenius-Wirsten A, Olsson L, Orsmark-Pietras C, von Palffy S, et al. Identification of ETV6-RUNX1-like and DUX4-rearranged subtypes in paediatric B-cell precursor acute lymphoblastic leukaemia. Nat Commun. 2016;7:11790.
12. Ryan SL, Peden JF, Kingsbury Z, Schwab CJ, James T, Polonen P, et al. Whole genome sequencing provides comprehensive genetic testing in childhood B-cell acute lymphoblastic leukaemia. Leukemia. 2023;1–11.
13. Yasuda T, Tsuzuki S, Kawazu M, Hayakawa F, Kojima S, Ueno T, et al. Recurrent DUX4 fusions in B cell acute lymphoblastic leukemia of adolescents and young adults. Nat Genet. 2016;48:569–74.
14. Zhang J, McCastlain K, Yoshihara H, Xu B, Chang Y, Churchman ML, et al. Deregulation of DUX4 and ERG in acute lymphoblastic leukemia. Nat Genet. 2016;48:1481–9.
15. Mancarella C, Carrabotta M, Toracchio L, Scotlandi K. CIC-Rearranged Sarcomas: An Intriguing Entity That May Lead the Way to the Comprehension of More Common Cancers. Cancers. 2022;14:5411.
16. Antonescu CR, Owosho AA, Zhang L, Chen S, Deniz K, Huryn JM, et al. Sarcomas with CIC-rearrangements are a distinct pathologic entity with aggressive outcome: A clinicopathologic and molecular study of 115 cases. Am J Surg Pathol. 2017;41:941–9.
17. Kawamura-Saito M, Yamazaki Y, Kaneko K, Kawaguchi N, Kanda H, Mukai H, et al. Fusion between CIC and DUX4 up-regulates PEA3 family genes in Ewing-like sarcomas with t(4;19)(q35;q13) translocation. Hum Mol Genet. 2006;15:2125–37.
18. Potuckova E, Zuna J, Hovorkova L, Starkova J, Stary J, Trka J, et al. Intragenic ERG Deletions Do Not Explain the Biology of ERG-Related Acute Lymphoblastic Leukemia. PloS One. 2016;11:e0160385.
19. Zaliova M, Potuckova E, Hovorkova L, Musilova A, Winkowska L, Fiser K, et al. ERG deletions in childhood acute lymphoblastic leukemia with DUX4 rearrangements are mostly polyclonal, prognostically relevant and their detection rate strongly depends on screening method sensitivity. Haematologica. 2019;104:1407–16.
20. Li H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM [Internet]. arXiv; 2013 [cited 2024 May 10]. Available from: http://arxiv.org/abs/1303.3997
21. Raczy C, Petrovski R, Saunders CT, Chorny I, Kruglyak S, Margulies EH, et al. Isaac: ultra-fast whole-genome secondary analysis on Illumina sequencing platforms. Bioinformatics. 2013;29:2041–3.
22. Papaemmanuil E, Rapado I, Li Y, Potter NE, Wedge DC, Tubio J, et al. RAG-mediated recombination is the predominant driver of oncogenic rearrangement in ETV6-RUNX1 acute lymphoblastic leukemia. Nat Genet. 2014;46:116–25.
23. Mijušković M, Chou Y-F, Gigi V, Lindsay CR, Shestova O, Lewis SM, et al. Off-Target V(D)J Recombination Drives Lymphomagenesis and Is Escalated by Loss of the Rag2 C Terminus. Cell Rep. 2015;12:1842–52.
24. Manso T, Folch G, Giudicelli V, Jabado-Michaloud J, Kushwaha A, Nguefack Ngoune V, et al. IMGT® databases, related tools and web resources through three main axes of research and development. Nucleic Acids Res. 2022;50:D1262–72.
25. Martin FJ, Amode MR, Aneja A, Austine-Orimoloye O, Azov AG, Barnes I, et al. Ensembl 2023. Nucleic Acids Res. 2023;51:D933–41.
26. Prjibelski A, Antipov D, Meleshko D, Lapidus A, Korobeynikov A. Using SPAdes De Novo Assembler. Curr Protoc Bioinforma. 2020;70:e102.
27. Zerbino DR, Birney E. Velvet: Algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 2008;18:821–9.
28. Kent WJ. BLAT—The BLAST-Like Alignment Tool. Genome Res. 2002;12:656–64.
29. Koressaar T, Remm M. Enhancements and modifications of primer design program Primer3. Bioinforma Oxf Engl. 2007;23:1289–91.
30. Untergasser A, Cutcutache I, Koressaar T, Ye J, Faircloth BC, Remm M, et al. Primer3--new capabilities and interfaces. Nucleic Acids Res. 2012;40:e115.
31. Kõressaar T, Lepamets M, Kaplinski L, Raime K, Andreson R, Remm M. Primer3_masker: integrating masking of template sequence with primer design software. Bioinforma Oxf Engl. 2018;34:1937–8.
32. Bonfield JK, Marshall J, Danecek P, Li H, Ohan V, Whitwham A, et al. HTSlib: C library for reading/writing high-throughput sequencing data. GigaScience. 2021;10:giab007.

Tables

See separate Excel sheet for Supplementary Tables.

Figures

Figure 1. Overview of Pelops’ DUX4-r detection method. (A) Detection of IGH::DUX4 fusions. Reads spanning the IGH::DUX4 translocation breakpoint contain segments that align to IGH and DUX4. These segments are frequently aligned to several different repeats. Pelops finds spanning reads across the whole IGH and DUX4 regions and normalises their count to spanning read pairs per billion (SRPB). (B) Detection of other DUX4 rearrangements.
First, Pelops identifies improperly paired reads in the core DUX4 region, with mates mapping anywhere else across the genome. Then, regions where multiple mates are clustering are identified. Finally, any regions with the number of mates below a threshold, as well as recurrent false positive regions, are removed. SRPB values are calculated for all remaining regions.

[Figure 1 image not reproduced. Panel A: find spanning reads between IGH and the core and extended DUX4 regions (reads can have multiple equivalent alignments), then normalise. Panel B: find mates of improperly paired reads in core DUX4, calculate SRPB per region, and remove regions with too few reads or that are recurrent false positives. Formula shown in the figure: $\mathrm{SRPB} = \frac{\text{spanning read pairs}}{\text{total reads}} \times 10^{9}$.]

Figure 2. Spanning read pairs per billion (SRPB) for IGH::DUX4 calculated by Pelops for all 210 tumour and 208 matched germline samples of the evaluation cohort, using alignments by DRAGEN. The x-axis shows the SRPB distribution calculated based on the core DUX4 region definition, while on the y-axis calculations are based on the extended DUX4 region definition. Colours indicate DUX4 fusion type of the samples predicted by Pelops. The dashed horizontal and vertical lines indicate SRPB thresholds of 15 and 5, for extended and core DUX4 regions, respectively, on which the identification of IGH::DUX4 fusions is based. The marker indicates whether orthogonal evidence for DUX4-r is available, based either on RNA-sequencing (gene expression profile and/or IGH::DUX4 fusion), the presence of ERG deletions, or amplicon sequencing.
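The SRPB normalisation in Figure 1 and the two thresholds quoted in the Figure 2 caption can be illustrated with a short Python sketch. It assumes that the denominator is the total number of unique, mapped reads (as labelled in the flowchart of Supplementary Figure S1) and that a threshold is met when SRPB is greater than or equal to it, which the captions do not state explicitly. The function names `srpb` and `threshold_flags` are hypothetical, and because the captions do not say how the core and extended results are combined into a final call, the sketch reports the two flags separately.

```python
CORE_THRESHOLD = 5.0       # SRPB threshold for the core DUX4 region (Figure 2)
EXTENDED_THRESHOLD = 15.0  # SRPB threshold for the extended DUX4 region (Figure 2)


def srpb(spanning_read_pairs: int, total_reads: int) -> float:
    """Spanning read pairs per billion reads."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    # Multiply before dividing so that simple inputs give exact results.
    return spanning_read_pairs * 1_000_000_000 / total_reads


def threshold_flags(core_pairs: int, extended_pairs: int, total_reads: int) -> dict:
    """Hypothetical helper: SRPB per region plus one flag per threshold.

    Using >= for the comparison is an assumption of this sketch.
    """
    core = srpb(core_pairs, total_reads)
    extended = srpb(extended_pairs, total_reads)
    return {
        "core_srpb": core,
        "extended_srpb": extended,
        "core_above_threshold": core >= CORE_THRESHOLD,
        "extended_above_threshold": extended >= EXTENDED_THRESHOLD,
    }


# Toy numbers: 800 million reads, 2 core and 16 extended spanning read pairs.
flags = threshold_flags(core_pairs=2, extended_pairs=16, total_reads=800_000_000)
print(flags)
```

With these toy numbers the core region gives an SRPB of 2.5 (below 5) and the extended region gives 20.0 (above 15). Normalising by total read count is what makes the counts comparable between samples sequenced to different depths.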
Figure 3. Summary of orthogonal evidence for DUX4 rearrangements available for samples classified as DUX4-r in Ryan et al, 2023 and evidence from Pelops. Patient IDs are shown at the bottom of the plot.

[Figure 3 image not reproduced. Rows: IGH::DUX4 SRPB (core and extended DUX4); IGH::DUX4 (Pelops); DUX4::other (Pelops); expression profile; RNA fusion; ERG disruptive SV (WGS); ERG small variant (WGS); ERG deletion (MLPA); PCR amplicon seq. Pelops prediction categories in the legend: IGH::DUX4, QSOX1::DUX4, DNTT::DUX4, MYB::DUX4, SDR16C6P::DUX4, ENTREP1::DUX4, chr14_interg::DUX4, none found.]

Supplementary Figures

[Supplementary Figure S1 flowchart not reproduced. Recoverable labels: the inputs are a BAM/CRAM file and a blacklist of common false-positive regions; the targeted IGH::DUX4 caller finds evidence for rearrangement between IGH and both the core and the extended DUX4 regions (minimum MAPQ = 0); the untargeted DUX4-rearrangement caller divides GRCh38 into 1 kb bins, keeps bins with more than two improperly paired reads whose mates lie in the core DUX4 region, merges directly adjacent candidate regions, removes candidates overlapping IGH, extended DUX4 or any blacklisted region, then finds evidence for each candidate region (minimum MAPQ = 10) and includes it if SRPB ≥ minimum SRPB (20); duplicate read segments are discarded; SRPB is calculated from the count of spanning read pairs and the total number of unique and mapped reads in the BAM/CRAM; the outputs are a JSON report and a SAM file for each rearrangement.]

Supplementary Figure S1. Flowchart overview of Pelops. This overview illustrates the algorithm used by Pelops for calling DUX4-rearrangements, and the inputs, outputs and parameters of the pipeline. All parameters in the light orange parallelograms with a black frame can be changed by the user of Pelops through its command line interface.

Supplementary Figure S2. Comparison of Pelops to the previously published implementation (Ryan et al., 2023) using 210 tumour and 208 matched germline samples for the paediatric ALL cohort on Isaac-aligned BAM files. Colours indicate DUX4-r status of the sample, as previously published. The line indicates the expected difference based on counting spanning read pairs in Pelops versus the previously published method, which counted spanning reads individually.

Supplementary Figure S3. Comparison of Pelops’ IGH::DUX4 caller results for three different aligners. Spanning read pairs per billion (SRPB) for IGH::DUX4 were calculated by Pelops for all 210 tumour and 208 germline samples of the paediatric ALL validation cohort, using read alignments created by DRAGEN, bwa, and Isaac. The x-axis shows the SRPB distribution calculated based on the core DUX4 region definition, while on the y-axis calculations are based on the extended DUX4 region definition. Colours indicate DUX4 fusion status of the samples predicted by Pelops. The dashed horizontal and vertical lines indicate SRPB thresholds of 15 and 5, for extended and core DUX4 regions, respectively. The marker indicates whether orthogonal evidence for DUX4-r is available, based on RNA sequencing, the presence of ERG deletions, or amplicon sequencing.

Supplementary Figure S4. Spanning read pairs per billion (SRPB) are plotted against the following sequencing parameters: read length, median insert size, mean coverage over genome and tumour content estimated by DRAGEN.

Supplementary Figure S5. Pair plot showing relationships between read length, median insert size, average alignment coverage over genome, and estimated tumour content in cohort samples. Off-diagonal scatterplots show relationships between two variables, while the on-diagonal plots visualise distributions of a single variable.

Supplementary Figure S6. Number of IGH::DUX4 breakpoint junctions per sample in the de novo assemblies. Only junctions which have a continuously assembled sequence are counted. The left plot shows the number of junctions in the core DUX4 region, the right plot the numbers in the extended DUX4 region. Samples for which a full insertion of DUX4 is observed are highlighted. Note that each insertion is counted as two separate breakpoint junctions.

Supplementary Figure S7. Genomic location of DUX4-r breakpoints in the IGH locus for the paediatric patient cohort. Breakpoint orientation + means that the DUX4 segment is downstream of the breakpoint, while - means that the DUX4 segment is upstream of the breakpoint. Genomic positions are based on the GRCh38 reference. Segments of the IGH locus not shown here do not contain any DUX4-r breakpoint.