Results
Cohort characteristics and pathobiology of the samples
We obtained fresh frozen tissue samples from pathologically normal lung tissues, benign pulmonary lesions ,
carcinoma in situ, locally invasive carcinomas, and metastatic lung carcinomas from a cohort of deidentified donors
(Figure 1B) under an IRB approved protocol. The pathologically normal tissues were obtained from the cancer
patients, and the benign pulmonary lesions were collected from preventive surgeries. The pathologically benign
lesions were typically identified as suspicious nodules of indeterminate potential and carried additional characteristics
that are associated with elevated cancer risk. For instance, S4 was a well-circumscribed 0.5 x 0.5 x 0.4 cm pulmonary
nodule collected via wedge resection, which was considered non-malignant. S5 was a mass measuring 1.5 x 1.3 x 1.2
cm, with ill -defined boundaries, firm and had diffuse interstitial involvement by non -necrotizing granulomatous
inflammation where the stains for infectious agents were negative. S6 was characterized by a calcified hard nodule
of 3.5 x 2 x 1.1 cm with calcification, and focal atypical cells, suggestive of abnormal pulmonary growth. This patient
had a prior history of squamous cell carcinoma and received chemoradiation, the malignant tumor was removed via
wedge resection and did not overlap the nodule . Regional inflammation was noted for several benign lesions, but
stains for infectious agents were negativ e. Among the cancerous lesions about a third were presented with locally
invasive and/or metastatic disease, while others were carcinoma in-situ. Histological characteristics indicated the
presence of papillary, lepidic, and solid subtypes, and some tumors had mixed histology. For instance, S9 had an
infiltrating adenocarcinoma of lung of 2.0 x 1.7 x 1.9 cm with lepidic (50%), papillary (5%), and acinar (45%) pattern
characteristics; lymph node or pleural invasion not detected. The patient had a prior history of prostate cancer . S17
was diagnosed with invasive 1.6 cm mucin-producing adenocarcinoma, with mixed subtypes consisting of lepidic,
acinar, and papillary patterns associated with granulomatous inflammation; invasion of visceral pleura was noted .
S20 was a n invasive, moderately differentiated lung adenocarcinoma, acinar predominant, and had lymph node
metastasis (pT1c, pN1).
OM-based genomic profiling of the cohort
We performed optical genome mapping at 400X effective coverage on the frozen samples using BioNano Saphyr
platform16. The optical mapping (OM) data was processed and analyzed after multiple QC filters ( see Methods for
details). The overall OM data quality was excellent; on average, the length of the input maps was 2375219 ±526469.2
Mb, with an N50 length of 205.35±47.30 kb, and more than 91.55% ± 6.52% of the genome maps could be mapped
to the human reference genome (GRCh38), achieving a genome coverage 351.76±45.90. Some genomic regions had
multiple contigs representing multiple alternate alleles due to the presence of one more genomic alteration events.
We applied the BNG rare variant analysis pipeline to identify different classes of structural variations and copy
number alterations. The landscapes of genomic rearrangements and other alterations in these samples are presented
using circus plots ( Figure 1B). The insertions (median: 5.45kb) and deletions (median: 16.79kb) were the most
prevalent variants reported in the dataset , while intra -chromosomal fusions and inter -chromosomal translocations
were less common in the cohort, and more so in the non-malignant samples.
For comparative assessment and validation, we performed whole genome sequencing (WGS) of 4 samples -2 sets of
advanced stage carcinomas (S8 and S18) and paired normal tissues (S1 and S2) using 150bp short read PE sequencing
on the Illumina platform (see Methods for details). The numbers of detectable somatic point mutations (S8: 33,541,
S18: 18,445) and InDels (S8: 1396, S18: 570) were within the typical range expected for lung carcinomas 17. WGS
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted November 12, 2024. ; https://doi.org/10.1101/2024.11.11.623122doi: bioRxiv preprint
Short DNA fragments
(102-103bp)
Frozen
biospecimen
Ultra -long DNA
fragments (10 4-105bp)
Cohort
Optical mapping
with BioNano Saphyr
Short read sequencing
with Illumina
Landscape of
genomic
alterations
Genomic locus with complex
genomic rearrangement.
Multiple contigs showing
overlapping genomic
alterations
Inference about the
genesis of complex
genomic
rearrangement
A
B Normal bronchia Benign pulmonary lesion Carcinoma in situ Invasive carcinoma Metastatic carcinoma
S1
S2
S3
S4
S5
S6
S7
S8
S9
S10
S11
S12
S13
S14
S15
S16
S17
S18
S19
S20
Figure 1: A) Outline of the study design profiling genomic landscapes of pathologically normal lung tissues, benign
dysplastic and hyperplastic lesions, and malignant lung carcinoma samples. B) Circos plots showing somatic genomic
structural alterations and rearrangements in the samples in the cohort, grouped according to the disease stages. C)
Representative genomic alteration in a sample (S18) is shown at high resolution, and corresponding variation in sequencing
coverage in Illumina short read sequencing of the tumor is highlighted (tumor: orange, paired normal: blue). The genomic
resolution of optical mapping and short read sequencing are different. The affected and adjoining genomic regions and gene
annotations therein are shown below.
S18: chr17:33723835 -35948096C
Short read
sequencing
coverage
Tumor
Normal
Optical mapping
Genomic
rearrangement
Tumor
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted November 12, 2024. ; https://doi.org/10.1101/2024.11.11.623122doi: bioRxiv preprint
3
using short read technology identified several structural variations in the tumor samples (n=40 for S8, and n=15 for
S18), but the frequency was lower than that identified using the OM technology. Furthermore, a vast majority of the
high confidence somatic structural variations in tumors identified on the short-read platform were below the size
threshold for detection using the OM technology. Among the large structural variations within detection range of
both technologies, those called by WGS were also detected by OM . For instance, S18 has one deletion in chr17:
33.7Mb-35.9Mb that found in both platform ( Figure 1C), we totally identified four SVs (two inversions and two
deletions), which were identified by both platforms . In contrast, most of the high confidence, high allele frequency
(>10%) OM structural variation calls were not identified by WGS. In some cases, genomic evidence on the Illumina
platform was below the confidence level for detection threshold (also see Discussion), while some others had
rearrangement junctions in the repetitive or low complexity sequences that are difficult to resolve using short read
technologies. Nonetheless, our observations about the extent of concordance between the platforms are similar to
other studies18,19.
A comparative analysis of the genomic landscapes of the samples profiled using OM across different disease stages
indicated that there is an increased level of genomic rearrangements with progressively advanced disease (Figure
1B). In particular, the large copy number alterations and inter-chromosomal rearrangements were rare in
pathologically normal tissues and benign samples, but prevalent in the in-situ and advanced carcinomas. This is
qualitatively consistent with the observations on the mutational landscapes of healthy, diseased, and malignant lung
samples profiled using short read technologies17,20. While a significant burden of somatic point mutations and small
InDels in lung cancer genomes arise prior to malignancy during development and aging 5,12,15, these observations
collectively suggest that large genomic rearrangements are typically uncommon in non-malignant tissues.
Contig reconstruction based on the l ong OM traces allowed detection of both junctions of rearrangements events ,
even if those are 104-106bp apart, enabling end-to-end characterization and allelic reconstruction of the genomic
alteration events. Based on the contig -level data, we further identified the instances where multiple rearrangement
and/or copy number alteration events overlap or occur within the same OM contig and tagged those as clustered SVs.
We identified 25-36 clustered SVs in the benign lesions, 24-59 in carcinoma in situ, and 32-59 in the invasive tumors
and those with metastasis , respectively . Most clustered SVs consist ed of relatively small, localized structural
variations, including deletions, insertions, duplications, and inversions, but some of clusters spanned local (~hundreds
of kb) or distal (~tens of Mb) regions of chromosomes, and/or involved inter -chromosomal rearrangement events.
The contig -level OM data revealed that some of those clustered SVs were due to multiple entangled genomic
alteration breakpoints occurring on the same allelic contig at the locus representing complex genomic rearrangements.
Long sequences captured by Bionano OM allowed us to identify and annotate known cancer genes proximal to
genomic rearrangement junctions. Clustered genomic rearrangements proximal to U2AF1, IRF4, ZNF521, PDE4DIP,
HIP1, WWTR1, NOTCH2, NRG1, and MAPK1 were detected in multiple samples across different disease stages.
U2AF1 and HIP1 proximal rearrangement events were common in the distant and invasive carcinomas, although the
preferential enrichment in advanced tumors was not statistically significant. Considering the genomic resolution of
the OM data and absence of point mutation data, this gene-set does not represent a complete set of oncogenic drivers
in these tumors. Receptor tyrosine kinase (RTK)/RAS/RAF pathway is frequently mutated in non -small cell lung
carcinomas, while MET and ERBB2 focal amplifications are also common17. Although we did not detect these focal
amplifications in our cohort using OM platform, reanalyzing existing single cell transcriptomic data28 for the samples
in the cohort , we assessed relevant oncogenic pathway activities. S17 and S20, both of which are advanced
carcinomas, had prominently elevated expression of the EGFR pathway, suggesting activated potential RTK
signaling. Interestingly, high EGFR expression was also observed in S7, which is a benign lesion. S19, S11, and S10
had high KRAS expression, which might be due to RAS-driven tumorigenesis. Notably, S19 is an acinar carcinoma,
while S10 is of papillary subtype - both subtypes are known to have frequent KRAS mutations21.
omcplR for characterization of complex genomic rearrangements
We developed omcplR, a framework to identify complex genomic rearrangements (CGRs) and resolve their genomic
structures (see Methods). As the first step, omcpIR divided the genome into non-overlapping, equal-sized segments,
and scanned them to identify the genomic segments having multiple contigs that harbor different sets of structural
variation (SV) junctions. It considered those genomic segments, as well as their corresponding contigs as the initial
seeds. For each seed, it then iter atively identified other co -occurring contigs, which were subsequently used in the
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted November 12, 2024. ; https://doi.org/10.1101/2024.11.11.623122doi: bioRxiv preprint
Figure 2: A) A schematic representation of the computational pipeline to identify complex genomic rearrangements and infer underlying order of events. B) Benchmarking using simulated datasets where the order of rearrangement events are known in advance.
abcdef
Clonal size
Time (generation)
ab
cd
ef
Inferred order of events
a bcd
cdab
ab
cdabef
f e
contigs
refContig-level data
Temporal clonal dynamics
Inference on simulated dataset with known order of events
t
contigs
Rearrangement junctionsJP1-JP2JP4-JP5JP5-JP6JP3-JP6JP4-JP6
C1C2C3C4
Contig –rearrangement junction relationship
JP1-JP2
JP4-JP5JP5-JP6JP3-JP6
JP4-JP6
Inferred Bayesian network
JP1JP2JP3JP4JP5JP6DNA BreakpointsRef. GenomeContigs with rearrangement junctionsC1C2C3 C4
Locus of complex rearrangementAcontigsjunctions
rij : OM coverage supporting a junction
r11 r12 0 0 0r21 0 r23 0 0r31 0 0 r34 00 0 r43 0 r45
Contig –rearrangement junction matrix with coverage
Annotated Order-of-Events
JP1-JP2
JP4-JP5JP5-JP6JP3-JP6
JP4-JP6
Genomic, epigenomic, and functional context annotation
C1C2C3
ab cd efC1C2C3
B
Contig –rearrangement junction matrix with coverage
4
next iteration. The recursive search continued until no new contig was identified for a given seed. SV junctions were then pulled from each cluster contigs. Unique contigs were retained as alternative alleles that differed in their content, orientation, or combinations of underlying structural variations. After the processing, each non-redundant set represented a complex genomic rearrangement (CGR) event. Within the CGR events, some of the contigs carried multiple rearrangement junctions, and on the other hand, a subset of the rearrangement junctions was shared across multiple contigs. omcpIR used the joint patterns of contigs and rearrangement junctions to infer the conditional relationships among the junctions, and thereby predict the likely order of precedence of rearrangement events therein (Figure 2A). For each CGR event, omcpIR estimated the frequency of pairwise co-occurrence between the junctions, and then used a Bayesian approach to determine conditional relationships among all possible junction pairs – such that the CGR event was graphically represented as a Bayesian network, where the nodes were SV junctions and the pairs of junctions with significant conditional relationships related to edges. The likely sequence of rearrangement events during tumorigenesis could be inferred from the topology of the network top-down – early rearrangement junction events were at or near the top of the graphs, while downstream events were below, such that one can annotate the rearrangement junctions within a hierarchy as early, intermediate, or late (Figure 2A). Outside the direct decedent relationships, the order of events inference needs cautious interpretation. If a complex event has multiple disjoint graphs, no statistical inference can be made regarding the relative temporal order of occurrence of rearrangement events across the graphs. Likewise, no inference could be made regarding temporal order between different unrelated CGR events in a cancer genome. We first evaluated the performance of omcpIR using computational simulation where the ground truth about the order of events was known. We simulated population dynamics of clonally derived cells with the Galton Watson branching process (see Methods), where the cells had certain probabilities of generating genomic rearrangements at each round of cell division, which are then passed on to its progenies. Thus, a cell lineage may acquire one or more simple rearrangement events during a cell division, and then the genomic region impacted by such events may accumulate further genomic alterations in subsequent cell divisions in the progenies, resulting in a complex event. Alternately, a genomic region may occasionally accumulate multiple entangled rearrangements in a single, rare catastrophic event. While a single catastrophic CGR event would have comparable allele frequencies of the interconnected events, those that arose due to temporally distinct events would have different allele frequencies within the same CGR. Using different model parameters, representing different evolutionary scenarios and noise-level in the OM data, we propagated the clonal lineages in the simulation keeping track of the CGRs, their order of events, and tabulated the final allele frequencies of the genomic alterations therein (Figure 2B). We found that omcpIR could correctly identify the order of events in complex CGRs. The order inference was robust when the timing of emergence of the events along a clonal lineage are sufficiently spaced in time. But the model did not always distinguish the events that arose in a single catastrophic event from those that arose in quick succession in time, since those may have negligible variations in allele frequencies in a realistic dataset. Next, we scanned the somatic genomes of the premalignant lesions, as well as localized and advanced carcinomas in our cohort using omcpIR to identify the CGR events and predict their likely order of events. Most events were heterologous i.e. involving multiple classes of genomic alterations and involved multiple overlapping contigs – which reflected diversity and heterogeneity within the CGR events. A majority of the CGR events were intra-chromosomal localized within 1Mb regions, such that the rearrangements were within the same topological domains and might not perturb long-range functional interactions within the rearranged regions. Distal (>1Mb) intra-chromosomal events were more common in malignant tumors. As we discuss later, the burden of inter-chromosomal complex events increased from non-malignant to malignant samples. As a case study, we showcase a complex genomic rearrangement event located in chromosome 15 between 23.13Mb-28.62Mb of distant tumor S20, which is an advanced carcinoma (Figure 3A; see Methods for details). This CGR involved 7 alternative DNA contigs that harbored different combinations of structural variations, often involving common subsets of rearrangement junctions. We show the joint distributions of the rearrangement junctions on the contigs using a circus plot, where the width of the arcs indicates the extent of co-occurrence of rearrangement junction-pairs in that sample, involving the most prevalent joint junctions: Chr15_28.42MB||Chr15_28.42MB - Chr15_28.50MB||Chr15_28.62MB. Gene-level annotation suggested that the CGR impacted the genes GOLGA8DP, GOLGA6L2, GOLGA6L7 and MKRN3. The rearrangement junctions in this CGR were within the same chromatin
5
domain and shared similar (epi)genomic contexts (Figure 3B). All the junctions were in the strong euchromatin context in nuclear core, and nearly all of them overlapped with repetitive elements (2 junctions lacked nuclear context annotations, Chr15:23.13MB||Chr15:23.13MB and Chr15:23.36MB||Chr15:28.34MB). Close sequence-level and topological proximity of all involved rearrangement junctions are consistent with a model that DNA breaks within a replication factory and recursive rearrangements could give rise to such events (Figure 3C). Furthermore, the order of events analysis of this complex region indicated that the segment junctions Chr15:23.13MB||Chr15:23.13MB and Chr15:23.59MB||Chr15:28.51MB might be the early event (Figure 3B). Close sequence-level and topological proximity of all involved rearrangement junctions within the same chromatin domain context suggest that the DNA breaks likely occurred within a common replication factory, and we speculate that recursive DNA breaks perhaps due to persistent replication factory crisis22 might lead to such events (Figure 3C). As another case study, we examine a CGR in the in-situ tumor S8, chromosome 9 between 62.81Mb-68.23Mb that impacted the genes GXYLT1P3, TMEM252, CBWD5, FOXD4L3, FOXD4L4, LERFS, ANKRD20A3, CBWD3, and PGM5(Figure 3D). There was 8 alternative DNA contigs that exhibited different combinations of structural variations, often involving common subsets of rearrangement junctions – suggesting genetic heterogeneity. To visually represent the joint distributions of the segment junctions on the contigs, we created a circos plot, highlighting the most prevalent joint junctions: Chr9:62.81MB||Chr9:65.82MB - Chr9:65.67MB||Chr9:65.67MB, Chr9:62.81MB||Chr9:65.82MB - Chr9:63.45MB||Chr9:65.82MB, and Chr9:62.81MB|| Chr9:65.82MB - Chr9:62.90MB||Chr9:62.94MB (Figure 3E). Order of events analysis of this complex region indicated that the segment junctions Chr9:62.81MB||Chr9:65.82MB, Chr9:62.88MB||Chr9:62.95MB and Chr9:62.90MB||Chr9:62.94MB early event (Figure 3F), and Chr9:62.87MB||Chr9:62.96MB and Chr9_63.45MB||Chr9_65.82MB as the late event. The rearrangement junctions had different genomic and epigenomic contexts. Among the three early events, the majority are predominantly composed of strong chromatin within lamina-associated domains at the nuclear periphery. The immediate and late junctions occurred within different chromatin (strong open chromatin and weak open chromatin) and nuclear localization contexts (lamina-associated domains at the nuclear periphery and nuclear core) (Figure 3F). Differences between the rearrangement junctions in their allele frequencies and genomic contexts suggest that this CGR event likely arose from multiple independent rearrangement events sequentially over the course of tumorigenesis. We note that genomic rearrangements may affect chromatin domain organization, which in turn may affect DNA double strand breaks and repair pathway choices, such that etiology of such events may be complex. We identified 1116 CGRs in an initial genome-wide scan in our cohort; we refined the catalog based on the allele frequency of underlying rearrangements and overlap with high confidence genomic alterations as ‘major’. A total 216 major CGRs were found in the cohort, of which 124 CGRs had all rearrangement junctions mapped. The major CGRs were uncommon in the benign lesions (0-2 per sample), but prevalent in the carcinoma in situ (0-33 per sample), locally invasive tumors (20-24) and those with distant metastasis (6-45). Several known cancer genes such as U2AF1, IRF4, ZNF521, PDE4DIP, HIP1, and WWTR1 were near the CGR junctions in multiple samples, however the rearrangement patterns and junctions were sample specific. These genes also had frequent copy number alterations (2-11% cases) in the TCGA lung adenocarcinoma and squamous cell carcinoma cohorts21. Anyhow, based on the patterns of shared and unique rearrangement junctions, omcpIR established the order-of-events for 13 major CGRs. For the remaining cases, no clear order of precedence could be established reliably. It is possible that some of the latter may have arisen due to a single event or events that were not sufficiently separated temporally such that intermediate allelic states could be obscured by clonal sweep or low OM coverage at the junctions. (Epi)genomic contexts and DSB repair mechanisms Analyses of the topological complexity of the order-of events in the CGRs (Figure 4A) indicated a few common patterns. The numbers of relatively early and late events were comparable, and higher than those classified as intermediate (~22%). This could be partially attributed to the fact that intermediate events are only present in the CGRs with order >2. Annotations of genomic contexts indicated that 73% of all junctions in CGRs had at least one of the two junction end sequences in repeat regions (Figure 4B). Late events were marginally enriched for both junction end-sequences in the repeat regions. While most of the junction end-sequences were either both in repeat regions (62.9%), or both in non-repetitive regions (26.9%), early events were significantly more enriched (64.5%, p-value<0.05, Binomial test; Figure 4B) in heterologous junctions, where one of the two junction end-sequences was in the repeat regions. Chromatin context analysis indicates that a majority of the CGR rearrangement junctions are in
D
Example-2: S8 : chr9: 62.81Mb-68.23Mb
Chr9_62.87MB||Chr9_62.96MBChr9_63.46MB||Chr9:65.82MBChr9:65.67MB||Chr9:65.67MBChr9_62.81MB||Chr9_68.23MBChr9_62.81MB||Chr9_65.67MBChr9_62.90MB||Chr9_62.94MBChr9:63.45MB||Chr9_65.82MB
Chr9_62.81MB||Chr9_65.82MBE
Chr9_62.81MB||Chr9_65.67MB
**Chr9_62.81MBb||Chr9_68.23MB
**
Chr9_62.81MB||Chr9_65.82MB
**
**Chr9_65.67MB||Chr9_65.67MB
**Chr9_62.90MBb||Chr9_62.94MB
**Chr9_63.46MB||Chr9_65.82MB
**Chr9_63.45MB||Chr9_65.82MB
**Chr9_62.87MB||Chr9_62.96MB
**
F
Chr9_62.88MB||Chr9_62.95MB
AExample-1: S20 : chr15: 23.13Mb-28.62Mb23.2Mb24Mb25.6Mb26.4Mb24.8Mb27.2Mb28Mb28.4Mb
C Plausible persistent replication-dependent DNA breaks and chromoanasynthesis
Junctions are within the same nuclear, chromatin and replication contextsGenomic DNA DNA breaks
timeDNA break
Chr15_23.13MB ||Chr15_23.13MBChr15_23.21MB||Chr15_28.56MB
Chr15_28.42MB||Chr15_28.42MB
Chr15_28.50MB||Chr15_28.62MB
Chr15_23.59MB||Chr15_28.51MBChr15_23.36MB||Chr15_28.34MB
** **
********
**Chr15_23.22MB||Chr15_23.22MB
B
**
Chromatin: Strong Euchromatin Weak Euchromatin Inter. Chromatin Weak Het. Chromatin Strong Het.chromatinReplication: Early Late NANuclear: Core Periphery NAContext: Repeat Exon Telomere Other**
Figure 3: A) A representative intra-chromosomal complex rearrangement is shown from the sample S20, affecting genomic region chr15: 23.13Mb-28.62Mb. Genomic rearrangements and associated contigs are shown. B) Inferred order-of-events involved in the complex rearrangement. Associated nuclear, chromatin, replication and genomic contexts of the junctions are highlighted. All junctions have similar genomic and epigenomic contexts in this example. C) Replication factory crisis leading to chromoanasynthesis is being proposed as a likely mechanism to explain the genesis of this complex event. D) A representative intra-chromosomal complex rearrangement is shown from the sample S8, affecting genomic regions chr9: 62.81Mb-68.23Mb. E) Circos plot showing pairwise co-occurrence of rearrangement junctions on the same contigs. Widths of the arcs indicate optical mapping coverage. F) Inferred order-of-events involved in the complex rearrangement. (Epi)genomic contexts of the junctions are as above.
6
open chromatin, and both the junction end-sequences in open chromatin are the most common (open|open, 33.3%). Surprisingly, this was significantly more common than both the junction end-sequences in closed chromatin (closed|closed, 0.7%, p-value <0.05, Binomial test; Figure 4B). Taken together, repeat sequences in open chromatin may contribute towards a significant majority of rearrangement junctions in the CGRs. Subsequently, we investigated whether complex genomic rearrangements (CGRs) stem from heightened genomic fragility and instability within specific genomic loci. Notably, most junctions within each CGR event were unique (Figure 4C). Additionally, upon stratifying by disparate cancer stages, we observed that advanced cancer stage patients exhibited a higher prevalence of unique structural variation junctions, underscoring the complexity of CGRs (Figure 4D, Benign: 0.81±0.10; Malignant: 0.94±0.16, Mann-Whitney test p value=0.5067). Moreover, comparing contigs across different chromosomal levels (intrachromosome and interchromosome), we observed numerous CGR junctions incorporating at least one interchromosomal junction in the pre-filtered set, but this was not evident for the high confident set of major CGR events (intra-chromosomal: 4.80±3.70, interchromosomal: 7.41±5.47, Mann-Whitney test p value=3.86e-05). Further statistical analysis revealed the order of events for these junctions, indicating that parental junctions were notably complex and originated from both the same and different chromatin contexts, with a higher prevalence observed in different chromatin regions compared to identical ones (Figure 4E; Similar: 5.5±2.42; Dissimilar: 14±10, Mann-Whitney test p value= 0.1962). This trend persisted for child nodes (Figure 4F; Similar: 5.67±2.73; Dissimilar: 12.67±8.88, Mann-Whitney test p value= 0.1994) and when considering all contigs collectively (Figure 4G; Similar: 4.83±2.32; Dissimilar: 10.50±8.52, Mann-Whitney test p value= 0.1978), emphasizing the significance of accounting for different chromatin contexts in understanding CGR complexity. We examined whether the clustering patterns of breakpoints within the major CGRs could be an artifact of random overlapping events or arise by non-random, localized DNA double strand breaks due to genomic instability during tumorigenesis. For each sample, we randomly shuffled the rearrangement breakpoints within respective chromosomes preserving their chromatin, length, and genomic contexts to create simulated landscapes of genomic alterations expected by chance, repeated the simulation 103 times using a published approach23, and each time calculated the relative frequencies of occurrence of CGRs. We found that it was significantly unlikely to observe the CGRs (simulation p-value 2) rearrangement breakpoints per sample by chance at the frequency observed in our cohort, and even more unlikely to observe the CGRs with the level of complexity observed in the in situ and advanced tumors. We observed similar results by using different strategies to create genomic rearrangements and using alternative null models in the simulation (see Methods). We further analyzed scRNAseq data for pathway-level signatures, and observed that the advanced carcinomas (e.g. S17, S19, S20 etc) with extensive CGRs also had elevated signatures of DNA repair processes (Figure 4H). Although these pathway signatures are not CGR-specific, we note that active cell cycle and DNA break repair signatures in advanced carcinomas suggest continued DNA repair, which can contribute to generation of replication-dependent or independent simple and complex rearrangements, and ultimately persistent genomic instability. Structural inference of the CGRs allowed us to examine their likely etiologies considering the known mutagenesis and DNA repair mechanisms broadly classified as chromoanagenesis 6-8 . We assessed the possibilities by analyzing four attributes of the CGR events (i) intra and inter-chromosomal CGR events, (ii) order-of-events sequence, (iii) chromosomal location, and (iv) sequence contexts of the junction regions for this purpose (Figure 4I). We did not find any CGR event involving interconnected multiple inter- and intra-translocations and deletions, that are typical of the chromoplexy events. As noted earlier, most of the 216 major CGR events were intra-chromosomal and involved localized rearrangement events. None of those were within 5Mb of the telomere or had characteristic inversion and copy number profiles supporting break-fusion-bridge fusion (BFB). Among the CGRs with established order-of-events (n=13), parent and child junctions had clear allele frequency differences, which would rule out a single catastrophic event such as chromothrypsis7,24. A majority of these CGR events also showed branched architecture, often involving regions far away from the telomere, which makes exclusive BFB events unlikely as well. A vast majority of all rearrangement junctions in these CGRs was in low complexity or repeat sequences within gene-rich, transcriptionally active open chromatin domains, although we did not have base pair resolution information on the junction regions. Chromatin domain organization shapes replication timing25, such that the sequences within a chromatin domain typically replicate concurrently, and repetitive sequences permit replication-associated DNA break and repair processes - that is favorable for sustaining chromoanasynthesis-like mechanism7,26. Analysis of scRNAseq data, as above, indicated active replication associated DNA repair activities. Taken together, the genomic signatures
C
Figure 4. A) A schematic representation of order of event of complex genomic rearrangement and its genesis mechanism. B) Proportions of the rearrangements associated with CGRs that have similar or dissimilar genomic and epigenomic contexts at the junction end regions. C) Regression analysis showcasing the relationship between contigs cataloging complex structural variation junctions and the unique junctions identified. D) Distribution of unique complex structural variation junctions across benign and malignant (i.e. in situ, locally invasive, and distant) stages of carcinogenesis. Sequential portrayal of CGR events at the parent node (E), child node (F), and all nodes (G), encompassing both intra- and inter-chromosomal rearrangements alongside their chromatin annotations in Figure2D Similar: Contigs containing identical chromatin within intra-chromosomal regions, Dissimilar: Contigs containing disparate chromatin within intra-chromosomal regions. H) Differences in DNA repair pathway scores in the epithelial cells from different samples based on single cell RNAseq data. I) Evaluating the CGR events in light of known molecular mechanisms driving complex genomic alterations.
R2=0.08, 95% CI [0.01, 0.18], n=124Unique junction frequency
Dissimilar intra-chromosomal
Similar intra-chromosomal
AJP1JP2JP3JP4JP5JP6
C1C2C3 C4
contigs
junctions
JP1-JP2
JP4-JP5JP5-JP6JP3-JP6
JP4-JP6
parentchild
Unique junction frequency
D
Benign Malignant
Similar DissimilarSimilar DissimilarSimilar Dissimilar
E F G
Both junctions in open chromatinBoth junctions in closed chromatin
B
repeatsnon-repeats
openother
Both junctions in repeatsBoth junctions in non-repeatsother
other
6040200percent
6040200percent
Multiple localized DNA breaks over timeChromo-anasynthesisLikely prevalent
Epithelial cells
LS4LS5LS7LS9LS10LS11
LS13LS14LS15LS17LS19
HDNA repair pathway signature
LS17
LS10
LS14
LS20LS19
LS19LS4
LS7LS11LS5LS20
LS9LS13
cohort
CGR events
ChromoplexyNot detected
Circular inter-chromosomal translocations
Closed circular inter-chromosomal rearrangementsNo
Telomeric clustered Del/Dup/Inv
Chromosome shattering and multiple DNA breaks in one catastrophic event
No
Order-of-events inferenceIndeterminateDetermined
Repeated telomeric bridge and break Break-bridge-fusionNot detectedChromothripsisPossible
Localized within the similar replication timing domainsGenomic and chromatin contexts of junctionsSimilar/dissimilarSimilarGenomic and chromatin contexts of junctionsYes
I
7
of the CGRs, in aggregate, were supportive of chromoanasynthesis-like mechanisms, while did not find evidence for widespread occurrence of other chromoanagenesis mechanisms (Figure 4I). Thus, using the Principles of Ockham’s razor, we argue that chromoanasynthesis appears to be the predominant mechanism for the CGRs, especially those with inferred order of events, in our cohort (also see Discussion). Chromoanasynthesis is not exclusive and could co-operate with other DNA repair processes. Signatures of selection Genomic alterations can gain clonal prominence by selection or drift under neutral evolution (Figure 5A). Allele frequency spectrum analyses indicated that a vast majority of point mutations are under neutral evolution27, but genomic rearrangements typically have larger effects, although a similar analysis for genomic rearrangements based on short read sequencing technology is technically challenging. The direct profiling of ultra-long DNA molecules using OM allows direct inference of allele frequency at the rearrangement breakpoints enabling us to assess the clonal dominance and signatures of selection. Overall, allele frequencies of the major CGR breakpoints were higher in locally advanced and distant carcinomas compared to that in the in-situ tumors, and the differences persisted even after adjusting for tumor fraction. Since a minority of the breakpoints in any CGR are early, we hypothesized that it was likely driven by the allele frequency patterns of the late events, potentially due to selection on the secondary events instead of allelic dominance of the corresponding early events. To test that hypothesis, we compared the allele frequencies of the secondary events after normalizing them by the allele frequency of the primary events within the respective CGR and found that the normalized allele frequencies of the CGR breakpoints of the secondary events were significantly higher in locally advanced and distant carcinomas compared to that in the in-situ tumors (Figure 5B; p-value <0.05, Mann Whitney U test). Furthermore, the shape of the allele frequency distribution in the advanced tumors was skewed compared to that observed for the in-situ tumors, with an excess of proportionally higher allele frequency secondary events. Interpreting these observations in light of the site frequency distribution analysis, originally developed for the inference of selection on somatic point mutations27, we argued that the data suggests a relatively stronger positive selection on the secondary events in the advanced tumors. We integrated scRNAseq and OM data to further evaluate transcriptomic and phenotypic signatures associated with the late CGR events in light of known oncogenic driver genes, but the genomic resolution of the integrated dataset was typically limited to assess that conclusively. Nonetheless, several advanced tumors (e.g. S17) showed presence of transcriptionally distinct subclonal cell populations with higher proliferation scores and differences in hallmark pathway signatures indicating intra-tumor population dynamics, as reported by Venkat et al. 28. Anyhow, taken together, our data suggest that early events in the CGRs typically arise before the last clonal sweep, and positive selection on the secondary events promote their overall clonal prominence during the ongoing tumor evolution. Patterns of structural variations in different disease stages We compared the genome-wide patterns of simple and complex genomic rearrangements in the samples grouped according to their disease stages based on histopathological characteristics (Figure 5C-E). The burden of small structural variation events (typically 102-104bp; insertions, deletions, duplications, and inversions) increased in tumors compared to the non-malignant tissues, but the differences were marginal (Figure 5C). In contrast, copy number gain/loss and large-scale structural variations such as rearrangements within and between chromosomes were rare in precancerous tissues but orders of magnitude higher in tumors (Figure 5D). For instance, the normal or benign samples typically had less than 2 large copy number events, whereas advanced tumors had a notably higher number of events (e.g. 118 CNVs in S8, an in-situ tumor). Advanced tumors also exhibited significantly higher number of intra-/inter- chromosome translocation. Notably, we detected a substantial number of 257 intra-chromosomal fusions, primarily within chromosome 8, in the in-situ sample (S14). Somatic complex rearrangement events are more frequent in samples grouped by carcinogenesis stages, with a notable increase in prevalence in advanced disease stages (Figure 5E). These observations provide evidence for increased genomic disorganization at multiple levels with carcinogenesis. Progressive genomic disorganization with disease stages in pseudotime While some classes of genomic alterations were common in both precancerous and cancerous genomes, others arose predominantly in malignant stages. Lasso regression and Random Forest feature importance scores indicated that intra-/inter-chromosome translocation, copy number variations, local clustered SVs and complex genomic rearrangements were the most significant factors that differentiated non-malignant and malignant samples. Complex
Complex rearrangementsInversionCNV gainCNV_lossDeletionDuplicationInsertionIntra-chromosomal fusionInter-chromosomal translocationlocal cluster SVs
C
D
E
F
G
Figure 5: A) A schematic representation of allele frequency change from two possible selection pressure. (B) Positive selection on cells carrying mutations result in an allele frequency distribution with an excess of high allele frequency variants. Normalized variation allele frequency for contigs observed in complex genomic rearrangements at various stages of carcinogenesis show significant increases in invasive and distant tumors. P value calculated by Kolmogorov-Smirnov test. C) Frequency of somatic genomic deletions, duplications, insertions, and inversions in the samples grouped by the stages of carcinogenesis. The frequency distribution shows modest increase with advanced disease stages. Y axis is in the log scale. D) Frequency of somatic intra-chromosomal fusions, inter-chromosomal translocations, CNV gain and losses in the samples grouped by the stages of carcinogenesis. The frequency distribution shows prevalence in normal tissues or benign lesions, but substantial increase starting with carcinoma in situ. Y axis is in the log scale. E) Frequency of somatic complex rearrangement events in the samples grouped by the stages of carcinogenesis, also shows greater prevalence with advanced disease stages. F) Random forest and Lasso feature importance for the different classes of genomic alterations in distinguishing pre-cancerous and cancerous samples. G) Pseudotime plot showing the trajectory of decline in genome stability and organization along the clinical stages of carcinogenesis, as measured by increased burden of different classes of genomic alterations.
Complexrearrangement
CNV_gainComplex rearrangementsIntra-chromosomal fusionCNV_lossInter-chromosome translocationRearrangement clustersDeletionInversionInsertionDuplication0 0.5 1.0 1.5MeanDecreaseGini
Coefficients- 0.02 -0.01 0 0.01 0.02 0.03
0.14 0.018 0.0025 0.00034
-2 -4 -6 -8
Lambda
Log Lambda
BenignNormal
Normal
Benign
Benign
BenignBenignIn situIn situIn situIn situ
In situIn situ
In situIn situ
Distant
Distant
In situInvasiveInvasiveRearranged genomemalignant/aggressive benign/normal
A p-value = 0.0031p-value = 0.0007
Neutral evolutiont tPositive selection
Shift in allele frequency distribution
Allele freq. range0.01.0
Relative density
B
8
genomic rearrangement had the highest coefficient in the LASSO analysis (Figure 5F), indicating that accumulation of complex genomic rearrangements is an important aspect of malignant disease progression. We projected the samples, which ranged from normal lung tissues to advanced lung tumors, on the UMAP space based on the features with high coefficients in the LASSO analyses to assess the extent of variation among the samples in terms of genomic disorganization and compare that with the clinical stages of disease progression (Figure 5G). There were two predominant groups, one where the normal lung tissues and benign lesions were clustered together, and the other where the locally advanced and metastatic tumors were grouped together; the latter group showed greater sample-to-sample variation indicating high inter-tumor genetic heterogeneity in advanced tumors. In situ carcinomas were the most diverse, while a subset of those were positioned with benign and normal tissues, a majority clustered with advanced malignancies, while others connected the two clusters. Thus, the UMAP projection based on the structural variation features also recapitulated the clinical stages of disease progression. A trace of the non-linear fit through the samples represented the trajectory of progressive genomic disorganization (Figure 5G). Since normal tissues have mostly diploid stable genomes, and tumor genomes are highly rearranged, especially in advanced NSCLC, the path along the trajectory can be interpreted as a measure of progressive decay of genomic stability and organization in pseudotime. This analysis indicated that the velocity of decline of genomic stability varies with disease stages. While somatic genome evolution is a continuous process and many benign lesions progress to malignancies, these results suggest that the decline in genomic organization and stability may increase after malignant transformation. DISCUSSION In summary, using deep-coverage optical mapping technology to profile the genomic landscapes of normal lung tissues, benign hyperplasia, dysplasia, carcinoma in situ, and advanced carcinomas we found that large copy-number alterations, chromosome translocations, local clustered SVs and CGRs, especially inter-chromosomal complex rearrangements were uncommon in premalignant tissues but became prevalent in carcinomas. omcplR, developed here, allowed us to resolve the architecture of complex genomic rearrangements using genome-graph concept with OM data and predict their genesis. The early rearrangement junctions in CGRs often localized within the same chromatin domains, but later events had more diverse contexts. It is known that replication timing closely follows chromatin states25, and topologically associated domains tend to replicate in foci called replication factories. Replication stress in the replication factories are frequently associated with DNA double strand breaks, leading us to speculate that replication factory crises may contribute to the genesis of many CGR events, especially those in the topologically associated domains with similar chromatin contexts. A substantial number of CGRs had multiple early events, which were joined by intermediate or late events. It is possible that genomic and epigenomic instability arising from early events may trigger downstream events. Genomic rearrangements can drive epigenetic and chromatin changes in the proximal regions29, such that one may argue30 that early genomic rearrangements may trigger downstream epigenetic changes and thereby plausibly redistribute the risk31 of subsequent genomic alteration events in the contexts of CGRs. We found that advanced tumors had more complex contigs and more inter-chromosomal events, which is consistent with this hypothesis, but further work needs to be done to establish causal inferences. Comparative assessment of data from the OM platform with short read sequencing and evolutionary simulations provided a balanced perspective on the potentials and limitations of our approach to characterize CGRs. Short read sequence technology is excellent at calling single nucleotide variants, copy number events, and small indels; but it is usually insufficient for calling rearrangement junctions in repetitive regions or detecting connected genomic alteration events occurring on the same allelic sequence, especially if such events are kb-Mb apart. and fail to fully resolve allelic heterogeneity of such events, where OM offers clear advantages. OM was particularly suited for our objective of characterization of complex, intertwined genomic rearrangement events. Reference genome assembly versions can potentially have consequences for the OM rearrangement calls, but the rearrangement junctions in the major CGR events in our cohort were usually unique. Ultimately, a hybrid approach utilizing both technologies may help resolve the complete genome architecture and provide a reasonably complete landscape of genomic alterations at molecular resolution. Nonetheless, inference on the CGR architectural helped establish their etiologies. We found that when the underlying events along a clonal lineage are sufficiently separated in time, the order inference is robust, but it is not trivial to distinguish the events that are temporally clustered from those that arose in a single catastrophic event given negligible variation in allele frequencies, considering noise in a realistic dataset. Anyhow, from the
9
perspective of somatic evolution, those events likely have similar consequences on the population dynamics downstream. Several different mechanisms such as chromoanasynthesis, chromothripsis, chromoplexy, and break-bridge fusion have been discussed in the context of complex genomic alterations - these are collectively termed as chromoanagenesis 6-8. Chromothrypsis24 is a catastrophic event involving shattering of a chromosome followed by a random restitching of chromosomal fragments, resulting in the formation of clusters of complex genomic rearrangements involving intra-chromosomal translocations and deletions; many loci affected by chromothrypsis show genomic stability in subsequent rounds of cell divisions. Therefore, most of the rearrangement junctions should be localized and have similar ploidy-adjusted allele frequencies. Nonhomologous end-joining (NHEJ) is the primary repair mechanism for chromothrypsis7,24. In contrast, chromoanasynthesis7,26, a replication-based complex rearrangement process, is dependent on context-dependent replication stress, which is persistent over time, leading to recurrent DNA breaks and different classes of rearrangements over multiple cell divisions. Therefore, many rearrangement junctions involved in such processes should have parent-child relationships. Replication-transcription collision leading to replication fork stalling, and subsequent error-prone replication rescue mechanisms such as fork-stalling and template switching (FosTes) and Microhomology-mediated break-induced replication (MMBIR) 32, which involves local sequence homologies, are suspected to be key mechanisms 8,26,32. Although a lack of base-pair resolution data for the junction regions did not allow us to assess the involvement of HR, NHEJ, or MMBIR, we note that most of the CGR junctions were in repeat sequences, within gene-rich, transcriptionally active open chromatin regions that are permissive of chromoanasynthesis-type processes. Furthermore, allele frequency differences between parent and child junctions within the same CGRs suggest that those would be unlikely due to a single catastrophic event such as chromothrypsis. Break-fusion-bridge events7 arise during mitotic process characterized by a linear sequence of rearrangement events involving inversions, duplications, and deletions in telomere-proximal regions, but the major CGR events in the filtered catalog in our cohort were not in the telomeric regions and did not have the characteristic inversion and copy number profiles. Likewise, some CGRs with indeterminate order-of-events could potentially be due to a single mutagenesis event or multiple distinct events. We did not find any CGR event in our cohort that fit the classic definition of chromoplexy7,8,33, a “closed-chain” process characterized by interconnected multiple inter- and intra-translocations and deletions arising from DSBs with precise junctions. Therefore, we suspect that chromoanasynthesis is probably the predominant mechanism driving the CGRs, especially those with clear order-of-events, in our cohort, notwithstanding the fact that in some cases it could co-occur with other DNA repair mechanisms 7. Ordering the genomic landscapes of normal, precancerous lesions, and carcinomas in pseudotime suggested a transition in genomic disorganization and instability with the onset of malignancy. While some classes of genomic alterations were common in both precancerous and cancerous genomes, others arose predominantly in malignant stages. Lasso regression and Random Forest feature importance scores indicated that intra-/inter-chromosome translocation, copy number variations, local clustered SVs and complex genomic rearrangements were the most significant factors that differentiated non-malignant and malignant samples. This analysis indicated that the velocity of decline of genomic stability varies with disease stages. The rate of genomic aging, as manifested by the extent of genomic disorganization in pseudotime, of different tumors may be different, and may change further with the accumulation of additional driver events during tumorigenesis. While somatic genome evolution is a continuous process and many benign lesions progress to malignancies, these results suggest that the decline in genomic organization and instability may increase after malignant transformation. This might be related to the overcoming of replicative senescence with the onset of malignancy. In situ tumors show substantial variation in their extent of genomic disorganization and instability. Given the inherent stochasticity in accumulation of driver genetic alterations and dynamic complexity of tumor-microenvironment interactions, the evolutionary trajectories of such tumors may not follow a deterministic path, underscoring the clinical challenges in stratifying the patients. Progression from premalignant benign lesions to malignant carcinoma involves genomic changes in tumor cells, and concurrent disruption of the coordination of interactions among different cell types typical in the normal tissues 13,14,34. In a synergistic study, Venkat et al. 28 examined the changes in the transcriptome of immune, stromal, and other cells in the tumor microenvironment in these tumors. They found changes in the transcriptome of tumor and non-tumor cells that not only promote oncogenic pathways but likely also perturb the balance of inter-cellular interactions and ultimately complex tissue-level processes. Genomic rearrangements may change copy number and regulation of
10
many genes, and also higher order, chromatin domain-level organization beyond their specific genomic contexts 30, providing substrates for such changes directly or indirectly. On the other hand, tissue-level changes such as inflammation may influence mutagenesis and DNA repair processes 34-36. Hence, there might be a complex crosstalk between the progressive deterioration of genome organization and disruption of normal patterns of tissue-level cell-cell interactions promoting cancer hallmarks during tumorigenesis in lung, such that considering both aspects may provide complementary and synergistic overview of tumor progression. METHODS Cohort selection and biospecimen preparation The cohort included pathologically normal lung tissues (n=2), benign pulmonary lesions (n=7), carcinoma in situ (n=4), locally invasive carcinoma (n=2) and metastatic lung carcinoma (n=3) samples (Fig. 1A, Table 1), obtained from deidentified donors under an IRB approved protocol. The tumors were typically resected during lobotomy or wedge resection. Tumor adjacent pathologically normal tissues were procured from the cancer patients, while the benign pulmonary lesion samples were collected from preventive surgeries. Stains for infectious agents were negative. Optical genome mapping using BioNano platform For each sample, a minimum of 0.12g frozen tissue sample was used for extraction and purification of ultra-high molecular weight genomic DNA (UHMW gDNA) following manufacturer’s protocols (Bionano, San Diego, USA). Briefly, frozen tissue was homogenized in buffer, and coarse homogenate was filtered, with resulting suspension washed and pelleted (2200 ×g for 5 minutes), and then enzymatically treated to release gDNA. 750 ng of purified high molecular weight DNA were labeled using Direct Label Enzyme 1 (DLE-1) to conjugate DL-green fluorophores at the occurrence of a 6bp motif. After clean-up of the excess of DL-green fluorophores and rapid digestion of the remaining DLE-1 enzyme by proteinase K, DNA was counterstained overnight. Labeled UHMW gDNA solution (4-16ng/uL) was loaded on Saphyr Chips (Bionano, San Diego, USA) for linearization into nanochannel arrays and imaging in the Saphyr System. Images were digitized into whole-genome optical mapping (OM) molecule and label information in real time, collecting to a minimum throughput target of 1500 Gbp. All samples were evaluated based on molecule size, labeling efficiency, and map rate and coverage of the GRCh38 reference. Rare Variant analysis and Structural Variant calling Initial QC, genome alignment, and structural variation detection and copy number calling were performed using the BioNano pipeline (BioNano Solve v3.7.1)16. Briefly, the OM molecules of a given dataset were first aligned against GRCh38. OM molecules mapping to the same genomic regions were assembled into consensus maps. OM molecule clusters not fully aligned with the reference diploid genome were identified as the candidates for capturing putative structural variants (SVs); those were assembled locally into consensus maps and realigned to the reference to generate refined and annotated SV calls, which are queried against an OGM-specific database of 179 phenotypically healthy control individuals. In parallel, a genome-wide copy number profile was also calculated based on depth of coverage, followed by normalization against control datasets and segmentation to identify genomic regions of copy number gain or loss, and aneusomies. For each SV and copy number variant (CNV) call, confidence scores were calculated. Variants occurring in masked regions of GRCh38 were not considered for interpretation and reporting. Circos plots representing different classes of genomic alterations were prepared using Bionano Access v1.7.116. software. Only the SVs with high allele frequency were shown in Fig. 1 for simplicity of visualization, although the downstream analyses considered all high confidence events including those present at low allele frequency. Short read sequencing and cross platform comparison Genomic DNA was extracted from 4 samples – 2 advanced lung adenocarcinoma and 2 pathologically normal tissues, and whole genome sequencing was performed using 150bp PE sequencing on an Illumina HiSeq X platform (~30X for tumor samples, and ~12X for the non-malignant samples, respectively). PE reads were aligned to GRCh38Decoy build of human with BWA-MEM (v0.7.13)37, after removing sequencing adaptors and duplicates using fastp (v.0.21.0)38. Single base substitutions (SBS) and small insertions and deletions (InDels) were identified using Strelka (v2.9.2)39. Variants were filtered for QSS < 40(SNVs) or QSI <40 (Indels) and having at least 20× coverage in tumor sample and 10x coverage for matched normal samples using bcftools (v1.940), and at least four reads supporting the variant. The resulting variants were annotated with ANNOVAR41, after excluding those overlapping with centromeres and segmental duplications (genomicSuperDups) and other variants in the dbSNPv150 and 1000 Genomes Project42. Alongside, Manta (v1.6.0)43 was employed to detect structural variations (SVs) using the default
11
parameters, followed by the extraction of the final SVs marked as 'PASS'. SV calls were compared between the BioNano OM and Illumina short read technology platforms for the 2 samples. We conducted an analysis to determine the degree of overlap between both junction ends within a 100kb range for high-confidence SVs of the same class. Detection of genomic regions with clusters of SVs We filtered out reported SVs (primarily insertions and deletions) that were shorter than 1kb. In addition, in some cases, the same SV may be represented by multiple OM molecule clusters with minor differences at the termini. To reduce such redundancy, we identified all analogous SVs that satisfy two criteria: (i) if both junction ends of the SVs fall within 5kb of one another, and (ii) the SVs are of the same type; in such cases we merged those into a single consensus SV to prepare the non-redundant list of structural variations in the somatic genomes. SV clusters are ³ 4 SVs that overlap with one another or have rearrangement junctions within 10kb distance. We analyzed the filtered set of SVs from the OM data, to identify such SV clusters, which were further examined to identify complex events. OM recorded structural variations using contigs, ensuring that each contig contains at least one structural variation. Additionally, the orientation and reference position of these structural variations are recorded. This enables us to infer the group of complex structural variations (complex rearrangement events) along with their orientation information. Prior to conducting any analysis, we performed a filtering step to exclude contigs with low confidence or potential false positives in recording structural variations. Contigs labeled as "_common" or "_segdupe", as well as those with a variant allele frequency (VAF) below 0.2, were removed. Subsequently, we retained the high quality contigs. OmcplR pipeline We developed omcplR package to identify the complex rearrangement events in the somatic genomes. omcplr first segment the genome into equal sized non-overlapping bins (G, default: 10kb), and track the OM contigs that capturing SVs and SV junctions that map to those bins. The genomic bins harboring multiple contigs (C, default: ³4) are selected as initial seeds for complex rearrangement detection; typically one of those contigs represent the reference allele, while other contigs may harbor one or more genomic rearrangements. The non-reference contigs in the seed region and SV rearrangement junctions therein are retained as the initial set. Next, each contig in that set is iteratively scanned to identify additional rearrangement junctions therein and their corresponding genomic bins, which are included in the set. The process is repeated until no new non-reference contig or genomic bin is identified for a given seed, and the resulting set of contigs that capturing SVs therein denote a complex rearrangement event. A complex event may be intra- or inter-chromosomal. In order to detect the complex rearrangement events within large structural variation events (such as copy number gain/loss and intra/inter-chromosomal structural variations) across various cancer stages, we employed the Grange function in R to overlap the initially identified complex rearrangement events with these regions. Subsequently, we tallied the retained complex rearrangement events. A complex event with ! contigs and " genomic rearrangement junctions therein can be represented as a "×! matrix to show the contig-junction relationship. Each element of the matrix is populated by the effective coverage of the OM traces of rearrangement junctions in the respective contigs. omcplR uses a Bayesian approach to determine the likely order-of -events within a complex rearrangement. It first identifies pairwise co-occurrence between the rearrangement junctions, before determining conditional relationships between the co-occurring pairs to predict potential order-of-event, and finally performing global optimization to finalize the relationship among pairs involved in a complex event using constraint-based algorithm (PC-stable). Each complex event is graphically represented as a Bayesian network, where the direct causal relationship junctions are connected using edges; among them, those with direct causal relationships are connected using arrows instead. Pearson's Correlation statistic was used to calculate the significance of the connection, and p-value < 0.05 was considered significant difference. Functional annotation We annotated the genomic alterations and affected regions based on (i) genomic features––whole genes, exons, repeats, and telomeres, (ii) chromatin features – strong euchromatin, weak euchromatin, intermediate chromatin, weak heterochromatin, strong heterochromatin, (iii) nuclear localization––inter- lamina regions at the nuclear interior, and lamina- associated regions at the nuclear periphery, and (iv) replication timing based on the human reference genome hg38- using a published approach44. We employed the TxDb.Hsapiens.UCSC.hg38.knownGene45 R function to retrieve geneIDs, and used the org.Hs.eg.db package to convert the geneIDs into gene symbols. Analysis of transcriptomic data
12
We obtained processed scRNAseq data from Venkat et al.28 who used 10X Chromium platform to perform single nuclei sequencing to characterize cellular transcriptome, and used that to infer cell type composition, cellular processes, and intercellular interactions of lung tissue samples across different stages of disease progression. This dataset was available for S3-S5, S7, S9-S11, S13-S17, and S19-S20; OM and scRNAseq were performed on different tissue core specimens from the same tumors. Pathway activities were estimated using approaches as described in Venkat et al.28. Data on scRNAseq-guided copy number estimates and clonal inferences for advanced carcinomas were obtained from this study as well. Evolutionary analysis and signatures of selection To generate the expected genomic landscape of rearrangements under the null model, we shuffled the genomic rearrangements 103 times preserving their chromosomes, lengths, genomic and chromatin contexts using an approach proposed previously23. For inter-chromosomal translocations, we shuffled the pairs of endpoints after preserving their chromosomal assignments and genomic and chromatin contexts. We identified the CGRs using the criteria described above during simulation and analyzed their attributes to compared against the observed CGRs in the in-situ, invasive, and distant tumors. For each CGR, we designated those with the highest allele frequency (f) as the primary events (P), and others as the secondary events (S), and normalized the allele frequencies of the latter as: f’(S)=f(S)/f(P). We used the site frequency spectrum approach proposed by Williams et al.27 to test for neutrality for the secondary events by comparing the observed allele frequency distributions for respective CGRs against that expected under a neutral evolution model. Feature importance We used LASSO (least absolute shrinkage and selection operator) a regression-based method to identify the structural variations and genomic alterations whose prevalence distinguish between malignant and non-malignant samples in the cohort. We complemented it by using Random Forest using randomforest R package and calculating feature importance scores. Evolutionary trajectory analysis We projected the samples on the UMAP or PCA space based on selected structural variation features, which had high coefficients in the LASSO analyses. SCORPIUS46 was used to draw the non-linear fit on the UMAP projection of the samples. The inflection point in the plot is indicated with an arrow. FIGURES Figure 1: A) Outline of the study design profiling genomic landscapes of pathologically normal lung tissues, benign dysplastic and hyperplastic lesions, and malignant lung carcinoma samples. B) Circos plots showing somatic genomic structural alterations and rearrangements in the samples in the cohort, grouped according to the disease stages. C) Representative genomic alteration in a sample (S18) is shown at high resolution, and corresponding variation in sequencing coverage in Illumina short read sequencing of the tumor is highlighted (tumor: orange, paired normal: blue). The genomic resolution of optical mapping and short read sequencing are different. The affected and adjoining genomic regions and gene annotations therein are shown below. Figure 2: A) A schematic representation of the computational pipeline to identify complex genomic rearrangements and infer underlying order of events. B) Benchmarking using simulated datasets where the order of rearrangement events are known in advance. Figure 3: A) A representative intra-chromosomal complex rearrangement is shown from the sample S20, affecting genomic region chr15: 23.13Mb-28.62Mb. Genomic rearrangements and associated contigs are shown. B) Inferred order-of-events involved in the complex rearrangement. Associated nuclear, chromatin, replication and genomic contexts of the junctions are highlighted. All junctions have similar genomic and epigenomic contexts in this example. C) Replication factory crisis leading to chromoanasynthesis is being proposed as a likely mechanism to explain the genesis of this complex event. D) A representative intra-chromosomal complex rearrangement is shown from the sample S8, affecting genomic regions chr9: 62.81Mb-68.23Mb. E) Circos plot showing pairwise co-occurrence of rearrangement junctions on the same contigs. Widths of the arcs indicate optical mapping coverage.
13
F) Inferred order-of-events involved in the complex rearrangement. (Epi)genomic contexts of the junctions are as above. Figure 4. A) A schematic representation of order of event of complex genomic rearrangement and its genesis mechanism. B) Proportions of the rearrangements associated with CGRs that have similar or dissimilar genomic and epigenomic contexts at the junction end regions. C) Regression analysis showcasing the relationship between contigs cataloging complex structural variation junctions and the unique junctions identified. D) Distribution of unique complex structural variation junctions across benign and malignant (i.e. in situ, locally invasive, and distant) stages of carcinogenesis. Sequential portrayal of CGR events at the parent node (E), child node (F), and all nodes (G), encompassing both intra- and inter-chromosomal rearrangements alongside their chromatin annotations in Figure2D Similar: Contigs containing identical chromatin within intra-chromosomal regions, Dissimilar: Contigs containing disparate chromatin within intra-chromosomal regions. H) Differences in DNA repair pathway scores in the epithelial cells from different samples based on single cell RNAseq data. I) Evaluating the CGR events in light of known molecular mechanisms driving complex genomic alterations. Figure 5: A) A schematic representation of allele frequency change from two possible selection pressure. (B) Positive selection on cells carrying mutations result in an allele frequency distribution with an excess of high allele frequency variants. Normalized variation allele frequency for contigs observed in complex genomic rearrangements at various stages of carcinogenesis show significant increases in invasive and distant tumors. P value calculated by Kolmogorov-Smirnov test. C) Frequency of somatic genomic deletions, duplications, insertions, and inversions in the samples grouped by the stages of carcinogenesis. The frequency distribution shows modest increase with advanced disease stages. Y axis is in the log scale. D) Frequency of somatic intra-chromosomal fusions, inter-chromosomal translocations, CNV gain and losses in the samples grouped by the stages of carcinogenesis. The frequency distribution shows prevalence in normal tissues or benign lesions, but substantial increase starting with carcinoma in situ. Y axis is in the log scale. E) Frequency of somatic complex rearrangement events in the samples grouped by the stages of carcinogenesis, also shows greater prevalence with advanced disease stages. F) Random forest and Lasso feature importance for the different classes of genomic alterations in distinguishing pre-cancerous and cancerous samples. G) Pseudotime plot showing the trajectory of decline in genome stability and organization along the clinical stages of carcinogenesis, as measured by increased burden of different classes of genomic alterations. REFERENCES 1 Stratton, M. R., Campbell, P. J. & Futreal, P. A. The cancer genome. Nature 458, 719-724 (2009). https://doi.org:10.1038/nature07943 2 Alexandrov, L. B. et al. The repertoire of mutational signatures in human cancer. Nature 578, 94-101 (2020). https://doi.org:10.1038/s41586-020-1943-3 3 Li, Y. et al. Patterns of somatic structural variation in human cancer genomes. Nature 578, 112-121 (2020). https://doi.org:10.1038/s41586-019-1913-9 4 Baker, T. M., Waise, S., Tarabichi, M. & Van Loo, P. Aneuploidy and complex genomic rearrangements in cancer evolution. Nat Cancer 5, 228-239 (2024). https://doi.org:10.1038/s43018-023-00711-y 5 Tomasetti, C., Vogelstein, B. & Parmigiani, G. Half or more of the somatic mutations in cancers of self-renewing tissues originate prior to tumor initiation. Proc Natl Acad Sci U S A 110, 1999-2004 (2013). https://doi.org:10.1073/pnas.1221068110 6 Holland, A. J. & Cleveland, D. W. Chromoanagenesis and cancer: mechanisms and consequences of localized, complex chromosomal rearrangements. Nat Med 18, 1630-1638 (2012). https://doi.org:10.1038/nm.2988 7 Krupina, K., Goginashvili, A. & Cleveland, D. W. Scrambling the genome in cancer: causes and consequences of complex chromosome rearrangements. Nat Rev Genet 25, 196-210 (2024). https://doi.org:10.1038/s41576-023-00663-0 8 Pellestor, F., Gaillard, J. B., Schneider, A., Puechberty, J. & Gatinois, V. Chromoanagenesis, the mechanisms of a genomic chaos. Semin Cell Dev Biol 123, 90-99 (2022). https://doi.org:10.1016/j.semcdb.2021.01.004 9 Tanaka, H. & Watanabe, T. Mechanisms Underlying Recurrent Genomic Amplification in Human Cancers. Trends Cancer 6, 462-477 (2020). https://doi.org:10.1016/j.trecan.2020.02.019 10 Turner, K. M. et al. Extrachromosomal oncogene amplification drives tumour evolution and genetic heterogeneity. Nature 543, 122-125 (2017). https://doi.org:10.1038/nature21356
14
11 Ahsan, M. U., Liu, Q., Perdomo, J. E., Fang, L. & Wang, K. A survey of algorithms for the detection of genomic structural variants from long-read sequencing data. Nat Methods 20, 1143-1158 (2023). https://doi.org:10.1038/s41592-023-01932-w 12 Srivastava, S., Wagner, P. D., Hughes, S. K. & Ghosh, S. PreCancer Atlas: Present and Future. Cancer Prev Res (Phila) 16, 379-384 (2023). https://doi.org:10.1158/1940-6207.CAPR-22-0435 13 Curtius, K., Wright, N. A. & Graham, T. A. An evolutionary perspective on field cancerization. Nat Rev Cancer 18, 19-32 (2018). https://doi.org:10.1038/nrc.2017.102 14 Arrigoni, M. G., Woolner, L. B., Bernatz, P. E., Miller, W. E. & Fontana, R. S. Benign tumors of the lung. A ten-year surgical experience. J Thorac Cardiovasc Surg 60, 589-599 (1970). 15 Han, G. et al. An atlas of epithelial cell states and plasticity in lung adenocarcinoma. Nature 627, 656-663 (2024). https://doi.org:10.1038/s41586-024-07113-9 16 Chan, S. et al. Structural Variation Detection and Analysis Using Bionano Optical Mapping. Methods Mol Biol 1833, 193-203 (2018). https://doi.org:10.1007/978-1-4939-8666-8_16 17 Consortium, I. T. P.-C. A. o. W. G. Pan-cancer analysis of whole genomes. Nature 578, 82-93 (2020). https://doi.org:10.1038/s41586-020-1969-6 18 Xia, L. et al. Multiplatform discovery and regulatory function analysis of structural variations in non-small cell lung carcinoma. Cell Rep 36, 109660 (2021). https://doi.org:10.1016/j.celrep.2021.109660 19 Eisfeldt, J. et al. Comprehensive structural variation genome map of individuals carrying complex chromosomal rearrangements. PLoS Genet 15, e1007858 (2019). https://doi.org:10.1371/journal.pgen.1007858 20 Frankell, A. M. et al. The evolution of lung cancer and impact of subclonal selection in TRACERx. Nature 616, 525-533 (2023). https://doi.org:10.1038/s41586-023-05783-5 21 Kadota, K. et al. KRAS Mutation Is a Significant Prognostic Factor in Early-stage Lung Adenocarcinoma. Am J Surg Pathol 40, 1579-1590 (2016). https://doi.org:10.1097/PAS.0000000000000744 22 Gaillard, H., Garcia-Muse, T. & Aguilera, A. Replication stress and cancer. Nat Rev Cancer 15, 276-289 (2015). https://doi.org:10.1038/nrc3916 23 Pedersen, B. S. & De, S. Loss of heterozygosity preferentially occurs in early replicating regions in cancer genomes. Nucleic Acids Res 41, 7615-7624 (2013). https://doi.org:10.1093/nar/gkt552 24 Stephens, P. J. et al. Massive genomic rearrangement acquired in a single catastrophic event during cancer development. Cell 144, 27-40 (2011). https://doi.org:10.1016/j.cell.2010.11.055 25 Gindin, Y., Valenzuela, M. S., Aladjem, M. I., Meltzer, P. S. & Bilke, S. A chromatin structure-based model accurately predicts DNA replication timing in human cells. Mol Syst Biol 10, 722 (2014). https://doi.org:10.1002/msb.134859 26 Liu, P. et al. Chromosome catastrophes involve replication mechanisms generating complex genomic rearrangements. Cell 146, 889-903 (2011). https://doi.org:10.1016/j.cell.2011.07.042 27 Williams, M. J. et al. Quantification of subclonal selection in cancer from bulk sequencing data. Nat Genet 50, 895-903 (2018). https://doi.org:10.1038/s41588-018-0128-6 28 Venkat, V. et al. Disruptive changes in tissue microenvironment prime oncogenic processes at different stages of carcinogenesis in lung. bioRxiv, 1-20 (2024). 29 Akdemir, K. C. et al. Disruption of chromatin folding domains by somatic genomic rearrangements in human cancer. Nat Genet 52, 294-305 (2020). https://doi.org:10.1038/s41588-019-0564-y 30 Hu, X. & De, S. Redistribution of mutation risk in cancer. Nat Cancer 5, 216-217 (2024). https://doi.org:10.1038/s43018-024-00728-x 31 Salvadores, M. & Supek, F. Cell cycle gene alterations associate with a redistribution of mutation risk across chromosomal domains in human cancers. Nat Cancer 5, 330-346 (2024). https://doi.org:10.1038/s43018-023-00707-8 32 Zhang, F. et al. The DNA replication FoSTeS/MMBIR mechanism can generate genomic, genic and exonic complex rearrangements in humans. Nat Genet 41, 849-853 (2009). https://doi.org:10.1038/ng.399 33 Baca, S. C. et al. Punctuated evolution of prostate cancer genomes. Cell 153, 666-677 (2013). https://doi.org:10.1016/j.cell.2013.03.021 34 Zhang, S. et al. Tumor initiation and early tumorigenesis: molecular mechanisms and interventional targets. Signal Transduct Target Ther 9, 149 (2024). https://doi.org:10.1038/s41392-024-01848-7 35 de Visser, K. E. & Joyce, J. A. The evolving tumor microenvironment: From cancer initiation to metastatic outgrowth. Cancer Cell 41, 374-403 (2023). https://doi.org:10.1016/j.ccell.2023.02.016 36 Hanahan, D. Hallmarks of Cancer: New Dimensions. Cancer Discov 12, 31-46 (2022). https://doi.org:10.1158/2159-8290.CD-21-1059 37 Li, H. Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences. Bioinformatics 32, 2103-2110 (2016). https://doi.org:10.1093/bioinformatics/btw152 38 Chen, S., Zhou, Y., Chen, Y. & Gu, J. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics 34, i884-i890 (2018). https://doi.org:10.1093/bioinformatics/bty560 39 Saunders, C. T. et al. Strelka: accurate somatic small-variant calling from sequenced tumor-normal sample pairs. Bioinformatics 28, 1811-1817 (2012). https://doi.org:10.1093/bioinformatics/bts271
15
40 Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078-2079 (2009). https://doi.org:10.1093/bioinformatics/btp352 41 Wang, K., Li, M. & Hakonarson, H. ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res 38, e164 (2010). https://doi.org:10.1093/nar/gkq603 42 Genomes Project, C. et al. A global reference for human genetic variation. Nature 526, 68-74 (2015). https://doi.org:10.1038/nature15393 43 Chen, X. et al. Manta: rapid detection of structural variants and indels for germline and cancer sequencing applications. Bioinformatics 32, 1220-1222 (2016). https://doi.org:10.1093/bioinformatics/btv710 44 Smith, K. S., Liu, L. L., Ganesan, S., Michor, F. & De, S. Nuclear topology modulates the mutational landscapes of cancer genomes. Nat Struct Mol Biol 24, 1000-1006 (2017). https://doi.org:10.1038/nsmb.3474 45 Haeussler, M. et al. The UCSC Genome Browser database: 2019 update. Nucleic Acids Res 47, D853-D858 (2019). https://doi.org:10.1093/nar/gky1095 46 Saelens, W., Cannoodt, R., Todorov, H. & Saeys, Y. A comparison of single-cell trajectory inference methods. Nat Biotechnol 37, 547-554 (2019). https://doi.org:10.1038/s41587-019-0071-9