Abstract
Rearrangements involving the DUX4 gene (DUX4-r) define a subtype of paediatric and adult
acute lymphoblastic leukaemia (ALL) with a favourable outcome. Currently, there is no
‘standard of care’ diagnostic method for their confident identification. Here, we present
Pelops, an open-source software tool designed to detect DUX4-r from short-read, whole-
genome sequencing (WGS) data. Evaluation on a cohort of 210 paediatric ALL cases showed
that Pelops detects all known, as well as previously unidentified, cases of IGH::DUX4 and
rearrangements with other partner genes. These findings demonstrate the possibility of
robustly detecting DUX4-r using WGS in the routine clinical setting.
bioRxiv preprint doi: https://doi.org/10.1101/2024.05.23.595509; this version posted May 24, 2024. The copyright holder for this preprint
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is
made available under a CC-BY-ND 4.0 International license.
Pelops: A dedicated caller for DUX4-r from WGS data
Keywords
DUX4; acute lymphoblastic leukaemia; ALL; whole-genome sequencing; IGH::DUX4; IGH
enhancer hijacking
Background
In childhood and adult B-cell acute lymphoblastic leukaemia (B-ALL), chromosomal
abnormalities play a significant role in risk stratification for treatment within clinical trials
worldwide. Although a wide range of genetic aberrations have been known for many years,
recent sequencing studies have uncovered a wealth of additional genetic information of
prognostic relevance with implications for changes to treatment strategies. One such
genetic alteration, DUX4 rearrangements (DUX4-r), defines a recently reported subtype that
affects 4–7% of paediatric patients [1] and ∼5% of adolescents and young adults [2].
Patients in all age groups exhibit favourable outcomes but, due to the presence of
concomitant risk factors, are frequently treated as intermediate or high risk [3,4]. However,
for reasons described below, the detection of DUX4-r is challenging and, if optimal therapy is
to be given to these patients, their accurate identification is paramount.
DUX4 (Double Homeobox 4) is a transcription factor that is selectively and transiently
expressed in cleavage-stage embryos [5] and germ cells of the testis [1]. A copy of the DUX4
gene, encoding two homeoboxes, is located within each unit of the D4Z4 macrosatellite
repeat array in the subtelomeric region of the long arm of chromosome 4 (4q) and in a similar
repeat array on chromosome 10q. The ∼3.3 kb D4Z4 repeat is polymorphic in length and has
11-100 copies in healthy individuals [6]. When ectopically activated, DUX4 can upregulate
expression of multiple genes and initiate transcription from alternative promoters, leading to
non-canonical transcript isoforms [7]. In fact, it has been shown that contraction of the D4Z4
repeat array below 11 copies decreases the epigenetic repression of DUX4, causing
autosomal dominant facioscapulohumeral muscular dystrophy (FSHD) [8].
DUX4-r cases in ALL were initially discovered through their distinctive gene expression
profile [9]. DUX4 is commonly rearranged with the Immunoglobulin Heavy Chain Locus
(IGH), although multiple fusion partners have been identified [10–14]. The rearrangement
typically creates a chimeric transcript that retains the 5’ end of DUX4 but replaces the 3’
coding sequence with a section of IGH. This event, most likely via IGH enhancer hijacking,
leads to activation of expression of DUX4 in developing lymphocytes. The resulting change in
the transcriptional landscape of the affected cell is thought to lead to oncogenic
transformation [13]. DUX4-r have also been discovered in a distinct rare subtype of CIC
(Capicua Transcriptional Repressor)-rearranged non-Ewing sarcoma, accounting for less than
1% of all sarcomas [15], primarily affecting young adults and associated with a poor outcome
[16]. In CIC::DUX4 sarcoma, similar to ALL, the resulting fusion protein acts as a
transcriptional activator driving the oncogenesis [17].
In ALL, DUX4-r were first thought to be driven by co-occurring deletions in the ERG
transcription factor, which was used as a surrogate for their identification. However, more
recent studies have now indicated that ERG deletions are present in only a subset of DUX4-r
patients and that they are likely to be subclonal [3,12,14,18,19]. Subsequently, using next
generation sequencing (NGS) approaches, two independent studies confirmed DUX4-r to be
the driving lesion [11,13].
The complex and cryptic nature of DUX4, and the presence of very similar DUX4 copies
throughout the genome, have precluded its accurate detection using current bioinformatics
tools and other standard of care genetic tests. Common approaches that look for
accumulation of discordant sequence read-pairs to identify breakpoints typically fail due to
the multiple possible mapping locations leading to scattering of supporting reads.
Furthermore, most structural variant callers disregard sequence reads with multiple
equivalent mapping positions. Clearly, robust genetic testing methods are urgently needed.
We reasoned that a custom DUX4-r caller, which takes into account all read-pairs spanning
any DUX4 copy and the gene partner of interest, was required: an approach that we have
described previously [12]. We present here an improved implementation of this method as
Pelops, an open-source software tool that can be integrated into existing bioinformatics
analysis pipelines. We evaluated Pelops on a paediatric B-ALL cohort of 210 patients [12] and
demonstrate that Pelops is a robust tool for identifying DUX4 rearrangements from tumour-
only WGS data. This proof-of-concept work indicates a path to improved diagnostic testing
for this good risk genetic subtype in clinical WGS pipelines.
Results
Pelops overview
Pelops is a software tool that implements a tumour-only analysis to identify the signal of
DUX4 rearrangements from short-read WGS data, building on our earlier version of the
method [12]. The most common DUX4-r, IGH::DUX4, is identified using a targeted approach
by finding read-pairs spanning any part of the IGH and DUX4 regions and then calculating
the number of spanning read pairs per billion total reads (SRPB) (Figure 1A, Methods,
Supplementary Figure S1).
For DUX4-r cases involving DUX4 fusions with other partner genes, an untargeted, genome-
wide approach is required. Pelops finds evidence for these cases by identifying genomic
regions containing multiple mates of improperly paired reads anchored in the DUX4 region
(Figure 1B, Methods, Supplementary Figure S1).
Evaluation on the paediatric ALL cohort
IGH::DUX4 rearrangements
An earlier implementation of this approach identified 57 IGH::DUX4 cases in a cohort of 210
paediatric B-cell ALL patients [12] (Supplementary Table 1). As described in the Methods
section, we modified the original method to find all spanning reads in each sample and
developed the Pelops software. Initially, we ran Pelops on the same 210 leukaemia samples,
using their matched germline samples as negative controls. We found excellent concordance
between the published spanning reads per billion (SRPB) values from the earlier implementation
and those from Pelops (Supplementary Figure S2). The SRPB values depend on how the DUX4
region is defined, and we show results for two complementary definitions: the ‘core DUX4’
region and the ‘extended DUX4’ region. The extended region covers the DUX4 repeat arrays
on chromosomes 4 and 10 with a 100 kb margin, as well as additional DUX4 pseudogenes on
other chromosomes. The core region is a subset of the extended region, and only covers the
subtelomeric DUX4 repeats on chromosomes 4 and 10, with a 1 kb margin (Figure 1A,
Supplementary Table 2). By using SRPB values calculated for these two regions, it was
possible to identify all samples with an IGH::DUX4 fusion (Figure 2). These results were
independent of the read aligner used: while we discuss results for DRAGEN alignments
(Figure 2) for the remainder of this section, bwa and Isaac [20,21] yielded qualitatively
identical results (Supplementary Figure S3, Supplementary Table 3).
Among the 57 samples identified by Pelops as IGH::DUX4 fusions, orthogonal evidence was
available for a total of 55 (Figure 3, Supplementary Table 1). Previous analysis confirmed 43
samples through ERG deletions, and through whole-transcriptome sequencing (WTS) [12].
ERG deletions associated with DUX4-r were confirmed by WGS and by Multiplex Ligation-
dependent Probe Amplification (MLPA). WTS was used to
confirm cases based on their gene expression profile and presence of IGH::DUX4 RNA
fusions. In this study, we used Pelops outputs to create de novo sequence assemblies of
DUX4-r breakpoint junctions, as described in more detail below. Based on these sequences,
we then designed primers to confirm a further 12 IGH::DUX4 cases by PCR amplicon
sequencing (Figure 3, Supplementary Table 1).
Using the core DUX4 region, all but one of the orthogonally validated samples were correctly
called using a threshold of SRPB ≥ 5 (Figure 2). The remaining case (patient 20720) was
called using the extended region as described below. Of the 21 IGH::DUX4 samples with
relatively low SRPB (5 ≤ SRPB < 40), 18 had orthogonal evidence of DUX4-r (Figure 3). The
level of noise in germline and leukaemia samples with no known IGH::DUX4 fusion was very
low, with SRPB < 3.1 in the core region. Thus, we are confident that Pelops has a precision of
100% in this cohort, at a threshold of SRPB ≥ 5 for DRAGEN alignments in the core DUX4
region.
The paediatric validation cohort contained only one case (patient 20720) with an ERG
deletion but no spanning reads between the core DUX4 region and IGH (Figure 2). The
DUX4-r was successfully identified (SRPB = 52.8) only by using the extended DUX4 region
that includes sequence 100 kb upstream of DUX4. In other samples, using the extended
DUX4 region also uncovered additional breakpoints (Supplementary Figure S6). However, it
also led to an increase in false positive spanning reads in germline samples
(e.g. patient 21322; SRPB = 9.1). This necessitated an increase in the threshold to SRPB ≥ 15
when using the extended region definition.
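The two-threshold rule described above can be sketched in a few lines of Python. This is only an illustration and not the Pelops implementation: the function name `call_igh_dux4` and the decision to combine the two region definitions with a logical OR are assumptions made here, while the default thresholds (SRPB ≥ 5 for the core region, SRPB ≥ 15 for the extended region) are the values reported for this cohort.

```python
# Illustrative sketch only; names and the OR-combination are assumptions,
# not taken from the Pelops source code.
CORE_THRESHOLD = 5.0       # SRPB threshold reported for the core DUX4 region
EXTENDED_THRESHOLD = 15.0  # SRPB threshold reported for the extended DUX4 region


def call_igh_dux4(core_srpb: float,
                  extended_srpb: float,
                  core_threshold: float = CORE_THRESHOLD,
                  extended_threshold: float = EXTENDED_THRESHOLD) -> bool:
    """Return True if either region definition supports an IGH::DUX4 call."""
    return core_srpb >= core_threshold or extended_srpb >= extended_threshold


# Patient 20720: no core spanning reads, extended SRPB = 52.8 -> called.
print(call_igh_dux4(core_srpb=0.0, extended_srpb=52.8))
# A germline-like sample: extended SRPB = 9.1 with an assumed core SRPB of 3.0
# (core noise was reported as < 3.1) -> not called.
print(call_igh_dux4(core_srpb=3.0, extended_srpb=9.1))
```

Keeping the thresholds as parameters reflects the observation that the appropriate cut-off depends on the region definition and may need recalibration for other aligners or cohorts.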
The number of spanning reads defining a genuine IGH::DUX4 rearrangement should
theoretically depend on several sequencing and sample parameters. A longer read length
should increase the split read evidence, while a larger insert size should increase the paired
read evidence. Higher tumour content and sequencing coverage should lead to an increase
in both types of spanning reads. There is a clear correlation between each of these parameters
and SRPB (Supplementary Figure S4). While the SRPB measure is normalised for the number
of reads, sequencing parameters changed over the course of the paediatric ALL study, and
tumour samples sequenced earlier had lower coverage, read length and fragment size
compared to those sequenced later (Supplementary Figure S5). As a result of this
interdependency, it was not possible to separate the effects of each of the parameters on
SRPB. However, the data in Supplementary Figure S4 show that Pelops called IGH::DUX4
rearrangements in the most challenging samples of the cohort, which have 100 base read
length (vs 150 base in most samples), short median insert size (< 300 bp), low coverage
(30-40×), or low tumour content (< 50%). Finally, the choice of aligner also has a clear impact on
the SRPB values as they are sensitive to the aligner’s behaviour in challenging-to-map
regions (Supplementary Figure S3, Supplementary Table 3). For reads aligned with bwa,
SRPB values are higher on average than with DRAGEN, while Isaac alignments yield lower
values. Considering these factors, it is remarkable that simple fixed thresholds of SRPB ≥ 5
for the core DUX4 region and SRPB ≥ 15 for the extended DUX4 region consistently yield
correct results in this cohort irrespective of the aligner used.
DUX4 rearrangements with other partner genes
The earlier analysis of the paediatric B-ALL cohort [12] identified two DUX4-r in which DUX4
expression was possibly activated through a fusion with non-IGH genes active in developing
lymphocytes, MYB and DNTT. We further improved and automated the method for calling
such rearrangements (Figure 1B), as described in more detail in the Methods. Pelops was
run on all leukaemia samples from the paediatric cohort and was able to recall the
MYB::DUX4 and DNTT::DUX4 rearrangements (Figure 3). It also called a QSOX1::DUX4
rearrangement which is part of an IGH::QSOX1::DUX4 triple fusion. Three more, previously
undescribed, rearrangements were called in samples which also contained an IGH::DUX4
fusion. In patients 10876 and 19827, the translocations were found in introns of SDR16C6P
and ENTREP1, respectively. In patient 23507, the rearrangement did not fall inside a gene or
pseudogene. Furthermore, despite the presence of a somatic ERG missense variant, we
found no evidence for a DUX4-r in one case (patient 22355) using Pelops, which is in
agreement with the previous implementation.
De novo assembly and validation of DUX4-rearrangement breakpoints
To provide additional confidence in this method, we sought to confirm that the spanning
reads identified by Pelops could be used for a de novo assembly of the translocation
breakpoints. For all 59 DUX4-r cases, we were able to obtain at least one sequence assembly
that aligned to a relevant breakpoint junction (Methods, Supplementary Table 4). All six
cases of DUX4-r with non-IGH regions were also confirmed in this way. This provides an
important additional validation of the predicted DUX4-r and confirms that Pelops made no
false positive calls within the paediatric patient cohort.
Amongst the 57 samples with IGH::DUX4 rearrangements, we were able to find two or more
breakpoint junctions in 67% (38/57) of cases (Supplementary Figure S6). Due to the
repetitive nature of the DUX4 region, it was not always possible to distinguish between an
insertion of a DUX4 sequence into IGH, a reciprocal translocation event, or multiple
independent rearrangements of the two alleles. However, for five samples, the de novo
assembled sequence captures an entire insertion of a DUX4 sequence into the IGH locus,
which has been described as the most common mechanism for creating a DUX4 fusion that
acts as an oncogenic transcriptional activator [1].
When looking at the genomic positions of breakpoints, we observed a clear association with
V, D and J segments of the IGH gene (Supplementary Figure S7), the targets of V(D)J
rearrangements during lymphocyte development. Most breakpoints were clustered in a 60
kb genomic region containing IGH-D and IGH-J segments, which makes up less than 5% of
the entire IGH region.
To obtain orthogonal evidence and confirm Pelops results for the 16 DUX4-r cases that were
not validated in the previous study [12], the sequence assemblies described above were
used to aid PCR primer design for amplicon sequencing. Out of these 16 samples, DNA was
available for 13. Two additional samples, previously confirmed through RNA-sequencing,
were used as positive controls, bringing the total number of samples used in this experiment
to 15. A unique PCR product was obtained in 14 cases (Supplementary Table 1, Figure 3).
PCR primer design was not possible for the remaining case due to the repetitive nature of
the rearrangement junction. The PCR amplification products were successfully sequenced,
confirming that the de novo assembly based on spanning reads identified by Pelops
accurately represented the rearranged sequence (Methods).
Discussion
In the NHS England Genomic Medicine Service, all acute leukaemia cases are eligible for
WGS for genetic subtyping to assist in treatment decisions. However, there is currently no
available bioinformatics tool that can robustly detect the DUX4-r cases from WGS data. The
results presented here show that Pelops, our dedicated DUX4-r caller, can accurately detect
IGH::DUX4 rearrangements that have been previously identified by one or more of the
following: gene expression profiling, PCR-based assays (MLPA), co-occurring ERG deletions,
and the presence of RNA fusions. Furthermore, in agreement with our previously published
study [12], Pelops detected cases of either IGH::DUX4 or DUX4 rearranged with other gene
partners that were not identified using current standard-of-care methods. Additional
validation of these cases using de novo assemblies of supporting sequencing reads identified
by Pelops followed by PCR amplicon sequencing provides further confidence by confirming
these results. Importantly, there were no false positive cases among 208 germline samples
and 151 leukaemia samples from other currently known B-ALL subtypes.
To our knowledge, no other direct or indirect method can robustly detect DUX4-r cases.
Although the correlation of DUX4-r with ERG abnormalities, primarily deletions, is known,
our data confirm the findings of other studies [3,12,14,18,19] by showing that only 34/57
(60%) of DUX4-r cases in our paediatric B-ALL cohort had a concurrent ERG deletion. We also
observed no clear correlation of missense mutations in ERG with DUX4 rearrangements.
Furthermore, while WTS may be used to classify some DUX4-r cases from their distinct gene
expression profile, RNA sequencing of leukaemia samples is not yet widely implemented in
the clinical setting.
We have shown that rearrangement supporting reads identified by Pelops can be used to
assemble contigs and/or scaffolds spanning the junction between fusion genes, which
provides additional confidence in our tool. These assemblies can be used for further
validation by PCR-based approaches and examined to elucidate mechanistic origins of these
rearrangements. One intriguing possibility that requires further investigation is involvement
of the aberrant RAG1/2 recombinase activity in the mutagenesis process, given the fact that
many IGH rearrangement breakpoints joining DUX4 are in the proximity of RAG
recombination signal sequences flanking V, D and J segments of the IGH variable region.
Similar off-target V(D)J rearrangements involving known oncogenes and tumour suppressor
genes have previously been described in the ETV6::RUNX1 subtype of ALL [22] and in mouse
models of lymphoma [23].
It was reassuring that Pelops successfully detected IGH::DUX4 rearrangements from samples
that were analysed using a variety of sequencing and library preparation methods, and had
variable tumour content. While we do observe some positive correlation of SRPB values with
insert size, sequencing coverage and tumour content, the method was robust enough to
detect all known cases in the cohort we analysed. Subclonal complexity and low tumour
content were not observed in our validation cohort, but these could impair detection and
should be considered when developing sequencing coverage targets.
We demonstrate that Pelops can be successfully used on data generated by different
sequencing read aligners, including the most widely used open-source tool, bwa. However,
this study showed that each alignment tool leads to a different level of “background noise”,
which can be further complicated by using different versions of the reference genome. While
detection sensitivity of IGH::DUX4 rearrangements can be easily adjusted by shifting the
SRPB threshold for calling positive cases, detecting rearrangements of DUX4 with other gene
partners in a genome-wide approach may require calibrating both the SRPB threshold and the
mapping quality threshold for read mates, as well as blacklisting genomic regions that lead to
recurrent false positive calls.
Besides its application in ALL, where the majority of rearrangements occur between DUX4
and IGH, we envisage other applications in ALL, as well as other cancers, where DUX4 is
rearranged with other gene partners. Notably, a subtype of non-Ewing sarcoma with CIC-
rearrangements is known to primarily consist of CIC::DUX4 fusions, which are similarly
difficult to detect with common bioinformatics tools [16]. To accommodate this need, we
designed Pelops to detect DUX4 rearrangements in a gene-partner agnostic approach in
addition to the IGH-targeted approach. In fact, Pelops has been able to detect an
orthogonally validated CIC::DUX4 fusion in one available case of non-Ewing sarcoma (data
not shown). Although validation of the gene-agnostic method is not the subject of this study,
we expect that availability of Pelops will catalyse this process in appropriate cohorts and
implementation of DUX4-r detection in a wider range of tumour types. We also envisage
that this approach could be successfully adapted and applied to identify rearrangements in
other genes within repetitive regions that are challenging to detect by current workflows.
Conclusions
Pelops is an open-source software tool designed to detect DUX4 rearrangements in short
read whole genome sequencing data. In our cohort of paediatric ALL samples, Pelops
reliably detected all known IGH::DUX4 rearrangements as well as additional cases, including
DUX4 rearrangements with other gene partners. Pelops is easy to integrate into existing
bioinformatics pipelines and supports inputs created from the most widely used alignment
tools such as bwa and DRAGEN. The work described here demonstrates that there is a path
to using WGS to meet the current need to aid in the diagnosis and clinical management of
ALL and other cancer types driven by DUX4-r.
Methods
Sequence alignment
Sequence reads were aligned to human genome reference GRCh38 using three different
workflows:
• Isaac Aligner (version SAAC01325.18.01.29). The full workflow was described
previously [12].
• DRAGEN (version 4.0.3).
• bwa mem (version 0.7.17). Following alignment, read pairs were annotated
(‘fixmate’) and sorted, and duplicates were marked using samtools version 1.15.1.
Calling IGH::DUX4 fusions
In order to call an IGH::DUX4 fusion, Pelops finds read pairs spanning the translocation
breakpoint(s), which can either be paired reads or split reads. In paired reads, one of the
reads maps to DUX4 and the other to IGH. In split reads, one of the two paired reads has a
split alignment, with the primary segment aligned to IGH and a supplementary alignment in
DUX4, or vice versa. As both IGH and DUX4 have numerous homologous repeats, segments
of spanning reads are frequently mapped to different repeats and could be found across the
entire IGH and DUX4 regions. Therefore, the following strategy was used to find IGH::DUX4
fusions:
1. Count reads spanning the DUX4 and IGH regions. Reads flagged as duplicates or
QC failed are filtered out. There is no filter based on mapping quality as relevant
reads are frequently in regions with low mappability due to the repetitive
character of IGH and DUX4.
2. Normalise the counts to account for sample-specific coverage, to obtain spanning
read pairs per billion (SRPB):
\[ \mathrm{SRPB} = \frac{\text{Number of spanning read pairs}}{\text{Total number of unique and mapped reads}} \times 10^{9} \]
3. Call IGH::DUX4 fusion if SRPB is above a given threshold.
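The three steps above can be illustrated with a minimal Python sketch. It is not the Pelops code: the `Region` class, the helper names and the toy coordinates are hypothetical, and each read pair is assumed to have already been reduced to the list of positions of all its aligned segments (primary and supplementary), with duplicate and QC-failed reads removed upstream.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Region:
    """Hypothetical half-open genomic interval [start, end)."""
    chrom: str
    start: int
    end: int


def in_any(chrom: str, pos: int, regions: Iterable[Region]) -> bool:
    """True if the position falls inside at least one region."""
    return any(r.chrom == chrom and r.start <= pos < r.end for r in regions)


def is_spanning(segments: List[Tuple[str, int]],
                dux4: List[Region], igh: List[Region]) -> bool:
    """A pair spans the fusion if any aligned segment (from either read,
    primary or supplementary) lies in DUX4 and any other lies in IGH.
    This covers both the paired-read and the split-read case."""
    hits_dux4 = any(in_any(c, p, dux4) for c, p in segments)
    hits_igh = any(in_any(c, p, igh) for c, p in segments)
    return hits_dux4 and hits_igh


def srpb(n_spanning_pairs: int, n_unique_mapped_reads: int) -> float:
    """Spanning read pairs per billion unique, mapped reads."""
    if n_unique_mapped_reads <= 0:
        raise ValueError("total read count must be positive")
    return n_spanning_pairs / n_unique_mapped_reads * 1e9


# Toy coordinates, chosen for illustration only.
dux4_regions = [Region("chr4", 100, 200)]
igh_regions = [Region("chr14", 100, 200)]
pairs = [
    [("chr14", 150), ("chr4", 120)],                 # paired-read evidence
    [("chr14", 110), ("chr14", 130), ("chr4", 199)], # split-read evidence
    [("chr14", 150), ("chr14", 160)],                # ordinary IGH pair
]
n_spanning = sum(is_spanning(p, dux4_regions, igh_regions) for p in pairs)
print(n_spanning, srpb(n_spanning, 800_000_000))
```

Counting pairs rather than individual reads means a pair with several aligned segments contributes exactly once to the numerator.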
Defining the IGH and DUX4 regions. The full definitions of the IGH and DUX4 regions are
listed in Supplementary Table 2. The IGH region definition was taken from IMGT [24]. To
define the DUX4 regions, several aspects were taken into account:
• IGH has a strong enhancer that can activate DUX4 expression from a long
genomic distance, which implies that read evidence for DUX4-r could be mapped
far upstream of the DUX4 repeat arrays.
• There are multiple annotated DUX4-like pseudogenes outside the DUX4 repeat
arrays. While they are not known to be implicated in DUX4-r, spanning reads may
still be mapped there.
• A wide region definition could lead to false positive spanning reads due to the
presence of repetitive genomic elements.
To accommodate those conflicting requirements, SRPB was calculated based on two DUX4
region definitions.
The “core DUX4 region” contains all DUX4 genes (and pseudogenes) in the repeat arrays on
chromosomes 4 and 10 as annotated by Ensembl v91 [25], with a 1 kb margin. It covers a
very limited amount of intergenic sequence and leads to few false positive spanning reads in
negative samples. Furthermore, this definition is used to call non-IGH DUX4-rearrangements
with Pelops (see below).
The “extended DUX4 region” contains the complete subtelomeric repeat arrays on
chromosomes 4 and 10 with a 100 kb margin. DUX4-like pseudogenes on other
chromosomes, as annotated by Ensembl v91, are also included with a 1 kb margin. This
region can be used to call rare IGH::DUX4 rearrangements with breakpoints further than 1
kb upstream of the DUX4 repeat arrays and rearrangements where the majority of spanning
reads were mapped to DUX4 pseudogenes outside the repeat arrays (although we did not
observe such a case in our study). This comes at the expense of having to set a higher SRPB
threshold for calling DUX4-r, due to the higher rate of false positive spanning reads.
Differences to previously published method. While Pelops results are similar to those of the
previously published method, three key improvements were made:
1. Counting of all spanning reads. Previously, only reads flagged as improperly
paired in the BAM file were used to identify spanning reads. In Pelops, split reads,
which are often flagged as properly paired, are also included.
2. Calculation of SRPB. Previously, the numerator of the SRPB equation was based
on the number of spanning reads. In Pelops, the calculation is based on spanning
read pairs to avoid ambiguity in how to count reads with multiple aligned
segments.
3. Definitions of DUX4 and IGH regions. Both the IGH and the DUX4 region
definitions were changed to reduce noise. The IGH region used by Pelops is
reduced by 50 kb on either side to match the IMGT [24] definition, as all
previously described IGH::DUX4 rearrangements have breakpoints inside the IGH
region [1]. Pelops uses two complementary DUX4 region definitions while the
previous implementation used only one to call all DUX4 rearrangements. Both
the core and the extended DUX4 regions are different to the one previously used,
which covered the DUX4 repeat arrays with approximately 70 kb and 50 kb
margins on chromosomes 4 and 10, respectively.
Calling DUX4-rearrangements with non-IGH regions
Pelops’ method for calling DUX4 fusions with non-IGH genes involves three main steps. First,
candidate regions in the genome that have a potential translocation with DUX4 are
identified. Secondly, a slightly modified version of the algorithm developed for IGH::DUX4
fusions is run for each of the candidate regions. Finally, candidate regions with low evidence
based on the results of the previous step are filtered out.
Finding candidate regions. Candidate regions are found by looking for reads flagged as
improperly paired in the core DUX4 region. Only rearrangements with the core DUX4 region
are considered, to increase the specificity of this caller and to reduce noise from repetitive
genomic elements located in the extended DUX4 region. The mapping locations of the mates
of these reads are counted in 1 kb genomic bins. Bins with more than two improperly paired
reads are retained, and adjacent genomic bins are merged. All genomic bins overlapping with the
extended DUX4 and IGH regions are removed. This forms a provisional set of candidate
genomic regions which may be rearranged with DUX4.
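The binning step can be sketched as follows. This is an illustrative outline under stated assumptions rather than the Pelops implementation: mate positions are given as hypothetical (chromosome, position) tuples, regions are half-open intervals, and the excluded IGH and extended DUX4 regions are passed in by the caller (the coordinates in the example are placeholders, not the definitions used by Pelops).

```python
from collections import Counter
from typing import Iterable, List, Tuple

BIN_SIZE = 1_000                    # 1 kb bins, as described in the text
Region = Tuple[str, int, int]       # (chromosome, start, end), half-open


def overlaps(a: Region, b: Region) -> bool:
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def find_candidate_regions(mate_positions: Iterable[Tuple[str, int]],
                           excluded: List[Region]) -> List[Region]:
    # Count mates of improperly paired core-DUX4 reads per 1 kb bin.
    counts = Counter((chrom, pos // BIN_SIZE) for chrom, pos in mate_positions)
    # Retain bins with more than two improperly paired reads.
    kept = sorted(key for key, n in counts.items() if n > 2)
    # Merge directly adjacent bins on the same chromosome.
    merged: List[Region] = []
    for chrom, idx in kept:
        start, end = idx * BIN_SIZE, (idx + 1) * BIN_SIZE
        if merged and merged[-1][0] == chrom and merged[-1][2] == start:
            merged[-1] = (chrom, merged[-1][1], end)
        else:
            merged.append((chrom, start, end))
    # Drop regions overlapping the excluded (IGH, extended DUX4) regions.
    return [r for r in merged if not any(overlaps(r, ex) for ex in excluded)]


mates = ([("chr1", 1_500)] * 3 + [("chr1", 2_100)] * 3
         + [("chr1", 9_000)] * 2 + [("chr14", 105_600_000)] * 5)
placeholder_excluded = [("chr14", 105_000_000, 106_000_000)]
candidates = find_candidate_regions(mates, placeholder_excluded)
```

In this toy input the two adjacent chr1 bins merge into one candidate, the bin with only two reads is dropped, and the chr14 bin is removed because it overlaps the excluded region.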
Finding evidence for rearrangements in each candidate region. For each candidate region,
spanning reads are found in the same manner as described for IGH::DUX4 fusions, except for
one difference: spanning reads are only counted if the aligned segment in the candidate
region has a mapping quality ≥ 10 (default threshold that can be changed by the user). This
is to reduce the number of false positive rearrangements with candidate regions that
contain repetitive sequences.
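A minimal sketch of this filter is given below. The `Segment` record and its field names are hypothetical and stand in for aligned read segments. In line with the flowchart in Supplementary Figure S1, duplicates are skipped in both regions, while the mapping quality threshold is applied only to segments in the candidate region.

```python
from dataclasses import dataclass
from typing import Iterable, Set


@dataclass(frozen=True)
class Segment:
    read_name: str
    region: str            # "DUX4" (core region) or "candidate"
    mapq: int
    is_duplicate: bool = False


def spanning_pair_names(segments: Iterable[Segment],
                        min_candidate_mapq: int = 10) -> Set[str]:
    """Names of read pairs with segments in both core DUX4 and the candidate."""
    in_dux4: Set[str] = set()
    in_candidate: Set[str] = set()
    for seg in segments:
        if seg.is_duplicate:
            continue
        if seg.region == "DUX4":
            # No MAPQ filter here: DUX4 repeats often align ambiguously.
            in_dux4.add(seg.read_name)
        elif seg.region == "candidate" and seg.mapq >= min_candidate_mapq:
            in_candidate.add(seg.read_name)
    return in_dux4 & in_candidate


example = [
    Segment("r1", "DUX4", 0), Segment("r1", "candidate", 60),
    Segment("r2", "DUX4", 0), Segment("r2", "candidate", 3),
    Segment("r3", "DUX4", 0), Segment("r3", "candidate", 40, is_duplicate=True),
    Segment("r4", "candidate", 60),
]
default_pairs = spanning_pair_names(example)
relaxed_pairs = spanning_pair_names(example, min_candidate_mapq=1)
```

Lowering the threshold to 1 admits pair r2, whose candidate-region segment has a mapping quality of 3.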
Filtering candidate regions. For the final output, blacklisted regions that were repeatedly
called in normal samples, and regions with SRPB < 20 (default threshold that can be changed
by the user) are removed. The blacklist used for filtering candidate regions contains all
candidate regions that are called at least twice with SRPB ≥ 10 amongst the normal samples
of the paediatric ALL cohort. As the blacklist is highly dependent on the read aligner used,
separate blacklists were generated for DRAGEN and bwa analyses. The regions included in
each blacklist are listed in Supplementary Table 5.
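The blacklist construction and the final filter can be sketched as below. This is a simplified illustration, not the published code: it assumes that a recurrent region is reported with identical coordinates in each normal sample, whereas a real implementation would probably need overlap-based matching. All sample names and coordinates are invented for the example.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple

Region = Tuple[str, int, int]        # (chromosome, start, end), half-open
Call = Tuple[Region, float]          # (candidate region, SRPB)


def overlaps(a: Region, b: Region) -> bool:
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def build_blacklist(normal_calls: Dict[str, List[Call]],
                    min_srpb: float = 10.0,
                    min_samples: int = 2) -> Set[Region]:
    """Regions called with SRPB >= min_srpb in at least min_samples normals."""
    hits: Counter = Counter()
    for calls in normal_calls.values():
        # A set ensures each region is counted once per sample.
        hits.update({region for region, value in calls if value >= min_srpb})
    return {region for region, n in hits.items() if n >= min_samples}


def filter_candidates(candidates: List[Call], blacklist: Set[Region],
                      min_srpb: float = 20.0) -> List[Call]:
    """Remove blacklisted regions and regions with SRPB below the threshold."""
    return [(region, value) for region, value in candidates
            if value >= min_srpb
            and not any(overlaps(region, b) for b in blacklist)]


normals = {
    "n1": [(("chr2", 5_000, 6_000), 12.0)],
    "n2": [(("chr2", 5_000, 6_000), 40.0), (("chr7", 100, 1_100), 15.0)],
    "n3": [(("chr7", 100, 1_100), 5.0)],
}
blacklist = build_blacklist(normals)
tumour_calls = [
    (("chr2", 5_500, 7_000), 80.0),            # overlaps a blacklisted region
    (("chr6", 135_000_000, 135_002_000), 45.0),
    (("chr9", 1_000, 2_000), 12.0),            # SRPB below 20
]
final_calls = filter_candidates(tumour_calls, blacklist)
```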
Validation of optimal parameters. The non-IGH DUX4-r caller is dependent on several
parameters: the minimum mapping quality of reads in a candidate region, the minimum
SRPB threshold for filtering candidate regions, and the minimum SRPB threshold used for
generating the blacklist. To find the optimal settings, a random permutation cross validation
was performed 10 times. The normal samples were randomly divided into a training (80%)
and an evaluation (20%) set (the test set consists of the tumour samples). All candidate
regions that were called at least twice in the training set, passing a given blacklist SRPB
threshold, were added to the blacklist. Then, in the evaluation set, candidate regions above
a given filtering SRPB threshold were compared against this blacklist. The effect of changing
the minimum mapping quality threshold to 1 was also investigated in this way. This
validation was done for both bwa and DRAGEN alignments. The final parameter values were
chosen such that there were no additional calls in the evaluation set for any of the validation
rounds, while yielding a minimal blacklist to minimise the risk of missing true positives in the
test set.
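The validation loop can be outlined as follows. This is a sketch under several assumptions: candidate calls per normal sample are already available as hypothetical (region identifier, SRPB) pairs, regions are matched by identifier, and the random seed and function names are illustrative. The mapping quality threshold is not modelled here, since it would act upstream when the calls are generated.

```python
import random
from collections import Counter
from typing import Dict, List, Set, Tuple

Calls = Dict[str, List[Tuple[str, float]]]   # sample -> [(region_id, SRPB)]


def build_blacklist(calls: Calls, samples: List[str],
                    blacklist_srpb: float) -> Set[str]:
    hits: Counter = Counter()
    for sample in samples:
        hits.update({r for r, value in calls[sample] if value >= blacklist_srpb})
    # Regions called at least twice in the training set are blacklisted.
    return {region for region, n in hits.items() if n >= 2}


def additional_calls(calls: Calls, blacklist_srpb: float, filter_srpb: float,
                     rounds: int = 10, train_fraction: float = 0.8,
                     seed: int = 0) -> int:
    """Total non-blacklisted calls in the evaluation sets over all rounds."""
    rng = random.Random(seed)
    samples = sorted(calls)
    n_train = int(round(train_fraction * len(samples)))
    total = 0
    for _ in range(rounds):
        shuffled = samples[:]
        rng.shuffle(shuffled)
        train, evaluation = shuffled[:n_train], shuffled[n_train:]
        blacklist = build_blacklist(calls, train, blacklist_srpb)
        for sample in evaluation:
            total += sum(1 for region, value in calls[sample]
                         if value >= filter_srpb and region not in blacklist)
    return total


# Toy cohort: ten normals that all share one recurrent artefact region.
toy_calls = {f"normal{i}": [("chr1:1000-3000", 30.0)] for i in range(10)}
strict = additional_calls(toy_calls, blacklist_srpb=10.0, filter_srpb=20.0)
loose = additional_calls(toy_calls, blacklist_srpb=50.0, filter_srpb=20.0)
```

A parameter combination would be acceptable in the sense described above when the returned count is zero. In the toy cohort a blacklist threshold of 50 never blacklists the artefact, so every evaluation sample yields an additional call.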
De novo assembly of breakpoints
As input for de novo assembly, spanning reads from DRAGEN-aligned BAM files were
exported to a SAM file using Pelops. De novo assemblies were produced with SPAdes version
3.11.1 [26] with default parameters. Where this method failed, Velvet version 1.2.10 [27]
was used for de novo assembly, with the following non-default parameters:
• K. The size of the k-mers used to create de Bruijn graphs in Velvet was set to
K = 21.
• Expected k-mer coverage. This was calculated from the average coverage over the
entire genome reported by DRAGEN.
• Fragment length. The fragment length was calculated from the median insert
length and read length reported by DRAGEN.
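The text does not state how the expected k-mer coverage was derived from the genome coverage. One plausible route, shown here as an assumption, is the conversion given in the Velvet manual, C_k = C × (L − k + 1) / L, where C is the nucleotide coverage, L the read length and k the k-mer size. The input values in the example are illustrative and are not taken from the cohort.

```python
def expected_kmer_coverage(coverage: float, read_length: int, k: int = 21) -> float:
    """Convert nucleotide coverage to k-mer coverage (Velvet manual formula).

    C_k = C * (L - k + 1) / L
    """
    if k > read_length:
        raise ValueError("k must not exceed the read length")
    return coverage * (read_length - k + 1) / read_length


# Illustrative values only: 40x coverage, 150 bp reads, k = 21.
example_ck = expected_kmer_coverage(40.0, 150, 21)
```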
Mapping of de novo assemblies to the reference genome
The assembled scaffolds were mapped to the GRCh38 reference genome using BLAT [28].
Per sample, up to 500 alignments with an alignment score ≥ 20 were evaluated and ranked
by their alignment score. Next, BLAT results were manually filtered to remove duplicate
alignments resulting from the repetitive nature of the IGH and DUX4 genomic regions. In
order of importance, the following criteria were used:
• The scaffold should be aligned as completely as possible to segments of the
reference genome.
• The highest-scoring alignments should be used to cover each segment of the
scaffold.
• Alignments should be consistent (e.g. for DUX4, all segments should be aligned to
the same chromosome).
If multiple alignments were equivalent according to these criteria, the alignment most
upstream on the reference genome was used.
In BLAT, each alignment can consist of multiple blocks. A custom Python script was used to
extract these blocks and merge adjacent blocks if they differed only by short indels of
length < 20 bp. Another custom Python script was used to annotate each aligned segment as
IGH, core DUX4, extended DUX4 (if not already covered by core DUX4), other, or unknown
sequence (gaps in scaffold marked by Ns). Unannotated sequences > 25 bp were separately
aligned with BLAT, with no minimum alignment score, to annotate short sequences that
were previously missed.
From these alignments, a list of breakpoints was obtained. For our analyses, we only
considered those breakpoints that have a continuous sequence in the de novo assembly,
that is, without any unknown sequence connecting IGH and DUX4 segments.
Amplicon sequencing
Rearrangement junctions were confirmed by successful amplification of PCR products using
primer pairs specific to each assembly. Primers were designed using primer3 software [29–31]
(Supplementary Table 6). PCR products were converted to sequencing libraries by
combining 1 µl of PCR product with 19 µl of nuclease free water for input into the Nextera
XT DNA Library Prep Kit (Illumina PN# 15032354, 15032355, 15052163). PCR amplicons were
simultaneously fragmented and tagged, and adapter sequences were then added by 12 cycles of PCR
as per the manufacturer’s protocol. Final library concentration was assessed on the Agilent 4200
TapeStation System using the High Sensitivity D1000 tape, and libraries were diluted to 2 nM
with RSB and pooled. A 10 µl aliquot of pooled libraries was denatured with 10 µl freshly
diluted 0.2 N NaOH (Sigma-Aldrich PN# 72068), through incubation for 5 minutes at room
temperature. Following denaturation, the pooled library was diluted to 20 pM by the
addition of 980 µl of pre-chilled HT1 buffer (Illumina PN# 15027041) and further diluted to a
final loading concentration of 14 pM. The library was mixed with a 50% PhiX spike-in
(Illumina PN# 15017397) and the final pool was sequenced on a MiSeq using a MiSeq
reagent kit 150cy v3 (2x75bp reads) as per the manufacturer’s instructions (Illumina PN#
15043893, 15043894).
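The dilution series can be checked with simple arithmetic. The sketch below only restates the volumes and concentrations given above. The final step reports the fraction of the 20 pM library needed to reach 14 pM, because the loading volume is not stated in the text.

```python
def dilute(concentration: float, volume_in: float, volume_added: float) -> float:
    """Concentration after adding `volume_added` to `volume_in` (same units)."""
    return concentration * volume_in / (volume_in + volume_added)


pooled_pm = 2_000.0                           # 2 nM pooled library, in pM
after_naoh = dilute(pooled_pm, 10.0, 10.0)    # 10 µl library + 10 µl NaOH
after_ht1 = dilute(after_naoh, 20.0, 980.0)   # + 980 µl HT1 buffer
# Fraction of the 20 pM library in the final 14 pM loading mix.
library_fraction = 14.0 / after_ht1
```

The intermediate values are 1000 pM after denaturation and 20 pM after adding HT1 buffer, consistent with the concentrations quoted in the protocol.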
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Availability of data and materials
Pelops is publicly available as a Python package named ilmn-pelops. The results of this paper
are based on Pelops version 0.8.0, which is available through the Python Package Index
(https://pypi.org/project/ilmn-pelops/0.8.0/). The source code can also be found at
https://github.com/Illumina/Pelops. The code has extensive unit test coverage (98.1% at
commit 6f29f33e) and tests pass on Python 3.7, 3.9 and 3.11. Pelops has minimal dependencies
on third-party packages, with the exception of pysam [32]. The IGH::DUX4 caller will also be
implemented in Illumina DRAGEN 4.3.
Competing interests
PG, SB, KJC, CF, IA, HN, DJM, PJC, JB, DRB, MM, and MTR are employees of Illumina, a public
company that develops and markets systems for genetic analysis.
Funding
This study was supported by Blood Cancer UK (grant 15036). SLR is a Career Development
Fellow supported by Cancer Research UK (CRUK) (grant C60802/A27193).
Authors' contributions
PG and MM provided guidance for software development, evaluated Pelops on the
paediatric patient cohort and wrote the manuscript. SB, KJC, PG, and PJC developed the
Pelops software. JFP conceived the methods for calling DUX4-rearrangements. DJM designed
PCR primers for amplicon sequencing of the rearrangement junctions, while CF, AI and HN
performed and optimised PCR and sequencing. SLR, CJH, and AVM provided patient samples
and orthogonal data. JB, MTR, SLR, CJH, and AVM revised the manuscript. SLR, DRB, CJH,
AVM and MTR conceived the study. All authors read and approved the final manuscript.
Acknowledgements
Primary childhood leukaemia samples used in this study were provided by the VIVO Biobank
for Children and Young People with Cancer. We also thank all the members of the NCRI
Childhood Cancer and Leukaemia Group (CCLG) Leukaemia Subgroup for access to material
and data on clinical trial patients.
References
1. Rehn JA, O’Connor MJ, White DL, Yeung DT. DUX Hunting—Clinical Features and Diagnostic Challenges Associated with DUX4-Rearranged Leukaemia. Cancers. 2020;12:2815.
2. Paietta E, Roberts KG, Wang V, Gu Z, Buck GAN, Pei D, et al. Molecular classification improves risk assessment in adult BCR-ABL1-negative B-ALL. Blood. 2021;138:948–58.
3. Harvey RC, Mullighan CG, Wang X, Dobbin KK, Davidson GS, Bedrick EJ, et al. Identification of novel cluster groups in pediatric high-risk B-precursor acute lymphoblastic leukemia with gene expression profiling: correlation with genome-wide DNA copy number alterations, clinical characteristics, and outcome. Blood. 2010;116:4874–84.
4. Schwab C, Cranston RE, Ryan SL, Butler E, Winterman E, Hawking Z, et al. Integrative genomic analysis of childhood acute lymphoblastic leukaemia lacking a genetic biomarker in the UKALL2003 clinical trial. Leukemia. 2023;37:529–38.
5. Hendrickson PG, Doráis JA, Grow EJ, Whiddon JL, Lim J-W, Wike CL, et al. Conserved roles of mouse DUX and human DUX4 in activating cleavage-stage genes and MERVL/HERVL retrotransposons. Nat Genet. 2017;49:925–34.
6. van Deutekom JC, Wijmenga C, van Tienhoven EA, Gruter AM, Hewitt JE, Padberg GW, et al. FSHD associated DNA rearrangements are due to deletions of integral copies of a 3.2 kb tandemly repeated unit. Hum Mol Genet. 1993;2:2037–42.
7. Choi SH, Gearhart MD, Cui Z, Bosnakovski D, Kim M, Schennum N, et al. DUX4 recruits p300/CBP through its C-terminus and induces global H3K27 acetylation changes. Nucleic Acids Res. 2016;44:5161–73.
8. Young JM, Whiddon JL, Yao Z, Kasinathan B, Snider L, Geng LN, et al. DUX4 binding to retroelements creates promoters that are active in FSHD muscle and testis. PLoS Genet. 2013;9:e1003947.
9. Yeoh E-J, Ross ME, Shurtleff SA, Williams WK, Patel D, Mahfouz R, et al. Classification, subtype discovery, and prediction of outcome in pediatric acute lymphoblastic leukemia by gene expression profiling. Cancer Cell. 2002;1:133–43.
10. Gu Z, Churchman ML, Roberts KG, Moore I, Zhou X, Nakitandwe J, et al. PAX5-driven subtypes of B-progenitor acute lymphoblastic leukemia. Nat Genet. 2019;51:296–307.
11. Lilljebjörn H, Henningsson R, Hyrenius-Wirsten A, Olsson L, Orsmark-Pietras C, von Palffy S, et al. Identification of ETV6-RUNX1-like and DUX4-rearranged subtypes in paediatric B-cell precursor acute lymphoblastic leukaemia. Nat Commun. 2016;7:11790.
12. Ryan SL, Peden JF, Kingsbury Z, Schwab CJ, James T, Polonen P, et al. Whole genome sequencing provides comprehensive genetic testing in childhood B-cell acute lymphoblastic leukaemia. Leukemia. 2023;1–11.
13. Yasuda T, Tsuzuki S, Kawazu M, Hayakawa F, Kojima S, Ueno T, et al. Recurrent DUX4 fusions in B cell acute lymphoblastic leukemia of adolescents and young adults. Nat Genet. 2016;48:569–74.
14. Zhang J, McCastlain K, Yoshihara H, Xu B, Chang Y, Churchman ML, et al. Deregulation of DUX4 and ERG in acute lymphoblastic leukemia. Nat Genet. 2016;48:1481–9.
15. Mancarella C, Carrabotta M, Toracchio L, Scotlandi K. CIC-Rearranged Sarcomas: An Intriguing Entity That May Lead the Way to the Comprehension of More Common Cancers. Cancers. 2022;14:5411.
16. Antonescu CR, Owosho AA, Zhang L, Chen S, Deniz K, Huryn JM, et al. Sarcomas with CIC-rearrangements are a distinct pathologic entity with aggressive outcome: A clinicopathologic and molecular study of 115 cases. Am J Surg Pathol. 2017;41:941–9.
17. Kawamura-Saito M, Yamazaki Y, Kaneko K, Kawaguchi N, Kanda H, Mukai H, et al. Fusion between CIC and DUX4 up-regulates PEA3 family genes in Ewing-like sarcomas with t(4;19)(q35;q13) translocation. Hum Mol Genet. 2006;15:2125–37.
18. Potuckova E, Zuna J, Hovorkova L, Starkova J, Stary J, Trka J, et al. Intragenic ERG Deletions Do Not Explain the Biology of ERG-Related Acute Lymphoblastic Leukemia. PloS One. 2016;11:e0160385.
19. Zaliova M, Potuckova E, Hovorkova L, Musilova A, Winkowska L, Fiser K, et al. ERG deletions in childhood acute lymphoblastic leukemia with DUX4 rearrangements are mostly polyclonal, prognostically relevant and their detection rate strongly depends on screening method sensitivity. Haematologica. 2019;104:1407–16.
20. Li H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM [Internet]. arXiv; 2013 [cited 2024 May 10]. Available from: http://arxiv.org/abs/1303.3997
21. Raczy C, Petrovski R, Saunders CT, Chorny I, Kruglyak S, Margulies EH, et al. Isaac: ultra-fast whole-genome secondary analysis on Illumina sequencing platforms. Bioinformatics. 2013;29:2041–3.
22. Papaemmanuil E, Rapado I, Li Y, Potter NE, Wedge DC, Tubio J, et al. RAG-mediated recombination is the predominant driver of oncogenic rearrangement in ETV6-RUNX1 acute lymphoblastic leukemia. Nat Genet. 2014;46:116–25.
23. Mijušković M, Chou Y-F, Gigi V, Lindsay CR, Shestova O, Lewis SM, et al. Off-Target V(D)J Recombination Drives Lymphomagenesis and Is Escalated by Loss of the Rag2 C Terminus. Cell Rep. 2015;12:1842–52.
24. Manso T, Folch G, Giudicelli V, Jabado-Michaloud J, Kushwaha A, Nguefack Ngoune V, et al. IMGT® databases, related tools and web resources through three main axes of research and development. Nucleic Acids Res. 2022;50:D1262–72.
25. Martin FJ, Amode MR, Aneja A, Austine-Orimoloye O, Azov AG, Barnes I, et al. Ensembl 2023. Nucleic Acids Res. 2023;51:D933–41.
26. Prjibelski A, Antipov D, Meleshko D, Lapidus A, Korobeynikov A. Using SPAdes De Novo Assembler. Curr Protoc Bioinforma. 2020;70:e102.
27. Zerbino DR, Birney E. Velvet: Algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 2008;18:821–9.
28. Kent WJ. BLAT—The BLAST-Like Alignment Tool. Genome Res. 2002;12:656–64.
29. Koressaar T, Remm M. Enhancements and modifications of primer design program Primer3. Bioinforma Oxf Engl. 2007;23:1289–91.
30. Untergasser A, Cutcutache I, Koressaar T, Ye J, Faircloth BC, Remm M, et al. Primer3--new capabilities and interfaces. Nucleic Acids Res. 2012;40:e115.
31. Kõressaar T, Lepamets M, Kaplinski L, Raime K, Andreson R, Remm M. Primer3_masker: integrating masking of template sequence with primer design software. Bioinforma Oxf Engl. 2018;34:1937–8.
32. Bonfield JK, Marshall J, Danecek P, Li H, Ohan V, Whitwham A, et al. HTSlib: C library for reading/writing high-throughput sequencing data. GigaScience. 2021;10:giab007.
Tables
See separate Excel sheet for Supplementary Tables.
Figures
Figure 1. Overview of Pelops’ DUX4-r detection method. (A) Detection of IGH::DUX4 fusions.
Reads spanning the IGH::DUX4 translocation breakpoint contain segments that align to IGH
and DUX4. These segments are frequently aligned to several different repeats. Pelops finds
spanning reads across the whole IGH and DUX4 regions and normalises their count to
spanning read pairs per billion (SRPB). (B) Detection of other DUX4 rearrangements. First,
Pelops identifies improperly paired reads in the core DUX4 region, with mates mapping
anywhere else across the genome. Then, regions where multiple mates are clustering are
identified. Finally, any regions with the number of mates below a threshold, as well as
recurrent false positive regions, are removed. SRPB values are calculated for all remaining
regions.
[Figure 1 schematic. Panel A: IGH (14q32) and the core and extended DUX4 regions (4q35, 10q36); multiple equivalent alignments of reads; Pelops finds spanning reads and normalises their count as $\mathrm{SRPB} = \frac{\text{spanning read pairs}}{\text{total reads}} \times 10^{9}$. Panel B: improperly paired reads in core DUX4; find mates (regions X, Y and Z); outcomes shown are "calculate SRPB", "too few reads" and "recurrent false-positive".]
Figure 2. Spanning read pairs per billion (SRPB) for IGH::DUX4 calculated by Pelops for all
210 tumour and 208 matched germline samples of the evaluation cohort, using alignments
by DRAGEN. The x-axis shows the SRPB distribution calculated based on the core DUX4
region definition, while on the y-axis calculations are based on the extended DUX4 region
definition. Colours indicate DUX4 fusion type of the samples predicted by Pelops. The
dashed horizontal and vertical lines indicate SRPB thresholds of 15 and 5, for extended and
core DUX4 regions, respectively, on which the identification of IGH::DUX4 fusions is based.
The marker indicates whether orthogonal evidence for DUX4-r is available, based either on
RNA-sequencing (gene expression profile and/or IGH::DUX4 fusion), the presence of ERG
deletions, or amplicon sequencing.
Figure 3. Summary of orthogonal evidence for DUX4 rearrangements available for samples
classified as DUX4-r in Ryan et al., 2023 and evidence from Pelops. Patient IDs are shown at
the bottom of the plot.
[Figure 3 plot area. Top panel: IGH::DUX4 SRPB (y-axis 0 to 100) for the core and extended DUX4 regions. Annotation tracks, one column per patient: IGH::DUX4 (Pelops), DUX4::other (Pelops), expression profile, RNA fusion, ERG disruptive SV (WGS), ERG small variant (WGS), ERG deletion (MLPA), PCR amplicon seq. X-axis: patient IDs (60 samples). Legend: Pelops prediction (IGH::DUX4, QSOX1::DUX4, DNTT::DUX4, MYB::DUX4, SDR16C6P::DUX4, ENTREP1::DUX4, chr14_interg::DUX4, none found); WTS (DUX4-r, IGH::DUX4, IGH::QSOX1::DUX4, none found, not available); WGS (DEL, INV, BND, missense, frameshift, inframe, none found); ERG deletion by MLPA (DEL, none found, not available); PCR amplicon seq (IGH::DUX4, none found, not available).]
Supplementary Figures
[Supplementary Figure S1 flowchart. Inputs: BAM/CRAM file; blacklist of common false-positive regions. Outputs: JSON report; SAM file for each rearrangement.
Targeted IGH::DUX4 caller: find evidence for DUX4-rearrangement between IGH and the core DUX4 region, and between IGH and the extended DUX4 region (minimum MAPQ = 0 in both cases).
Untargeted DUX4-rearrangement caller: find candidate regions; for each candidate region, find evidence for DUX4-rearrangement between the core DUX4 region and the candidate region (minimum MAPQ = 10); include the rearrangement with the candidate region if SRPB ≥ minimum SRPB (minimum SRPB = 20).
Find evidence for DUX4-rearrangement: for each read segment aligned to the DUX4 region, keep the segment unless it is a duplicate; for each read segment aligned to the rearrangement region, keep the segment unless it is a duplicate or its MAPQ is below the minimum MAPQ; count spanning read pairs (split and paired reads) that have segments aligned to both regions; calculate SRPB using the total number of unique and mapped reads in the BAM/CRAM file.
Find candidate regions: find all improperly paired reads in the core DUX4 region; divide GRCh38 into bins of size 1 kb; for each bin, count the improperly paired reads in the bin with a mate in the core DUX4 region and keep the bin as a candidate region if the count > 2; merge directly adjacent candidate regions; remove candidate regions overlapping with IGH or extended DUX4; remove candidate regions overlapping with any blacklisted region.]
Supplementary Figure S1. Flowchart overview of Pelops. This overview illustrates the
algorithm used by Pelops for calling DUX4-rearrangements, and the inputs, outputs and
parameters of the pipeline. All parameters in the light orange parallelograms with a black
frame can be changed by the user of Pelops through its command line interface.
Supplementary Figure S2. Comparison of Pelops to the previously published implementation
(Ryan et al., 2023) using 210 tumour and 208 matched germline samples for the paediatric
ALL cohort on Isaac-aligned BAM files. Colours indicate DUX4-r status of the sample, as
previously published. The line indicates the expected difference based on counting spanning
read pairs in Pelops versus the previously published method, which counted spanning reads
individually.
Supplementary Figure S3. Comparison of Pelops’ IGH::DUX4 caller results for three different
aligners. Spanning read pairs per billion (SRPB) for IGH::DUX4 were calculated by Pelops for
all 210 tumour and 208 germline samples of the paediatric ALL validation cohort, using read
alignments created by DRAGEN, bwa, and Isaac. The x-axis shows the SRPB distribution
calculated based on the core DUX4 region definition, while on the y-axis calculations are
based on the extended DUX4 region definition. Colours indicate DUX4 fusion status of the
samples predicted by Pelops. The dashed horizontal and vertical lines indicate SRPB
thresholds of 15 and 5, for extended and core DUX4 regions, respectively. The marker
indicates whether orthogonal evidence for DUX4-r is available, based on RNA sequencing,
the presence of ERG deletions, or amplicon sequencing.
Supplementary Figure S4. Spanning read pairs per billion (SRPB) are plotted against the
following sequencing parameters: read length, median insert size, mean coverage over
genome and tumour content estimated by DRAGEN.
Supplementary Figure S5. Pair plot showing relationships between read length, median
insert size, average alignment coverage over genome, and estimated tumour content in
cohort samples. Off-diagonal scatterplots show relationships between two variables, while
the on-diagonal plots visualise distributions of a single variable.
Supplementary Figure S6. Number of IGH::DUX4 breakpoint junctions per sample in the de
novo assemblies. Only junctions which have a continuously assembled sequence are
counted. The left plot shows the number of junctions in the core DUX4 region, the right plot
the numbers in the extended DUX4 region. Samples for which a full insertion of DUX4 is
observed are highlighted. Note that each insertion is counted as two separate breakpoint
junctions.
Supplementary Figure S7. Genomic location of DUX4-r breakpoints in the IGH locus for the
paediatric patient cohort. Breakpoint orientation + means that the DUX4 segment is
downstream of the breakpoint, while - means that the DUX4 segment is upstream of the
breakpoint. Genomic positions are based on the GRCh38 reference. Segments of the IGH
locus not shown here do not contain any DUX4-r breakpoint.