Complete de novo assembly of yeast genomes using enzyme-free, dense optical genome mapping | Research Square

Fredrik Westerlund, Luis Leal-Garza, Hanna Zachrisson, Albertas Dvirnas, and 4 more

This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-9608945/v1
This work is licensed under a CC BY 4.0 License
Status: Under Review
Version 1

Abstract

Large repetitive elements and complex chromosomal organization continue to challenge accurate assembly of eukaryotic genomes. Optical genome mapping (OGM) provides long-range genomic information by imaging individual DNA molecules. While most OGM protocols rely on sparse enzymatic labeling that restricts resolution in poorly labeled or structurally complex regions, dense labeling strategies generate continuous fluorescence intensity profiles that offer a complementary representation of genome structure.
Here we extend our Dense Optical Genome Mapping Assembly (DOGMA) pipeline, previously used for bacterial genomes, to eukaryotic genomes using a competitive binding (CB)-based dense OGM protocol that produces continuous intensity profiles, reflecting local AT/GC-content, along individual DNA molecules. We analyzed long DNA molecules from two model eukaryotes, Saccharomyces cerevisiae and Schizosaccharomyces pombe, which have compact, yet structurally rich genomes. CB-based OGM captures chromosome-spanning molecules and large repetitive arrays (>500 kbp), while also revealing genome-scale compositional features that influence barcode similarity and assembly behavior. By explicitly accounting for these features inherent to eukaryotic genomes, DOGMA reconstructs genome-wide optical maps while avoiding collapse of non-adjacent repetitive loci. The resulting assemblies identify structural variations and repetitive arrays that are incompletely represented in reference genomes. These results establish a framework for scaling dense OGM to increasingly complex genomes.

Biological sciences/Genetics/Genomics
Physical sciences/Nanoscience and technology/Nanobiotechnology
Biological sciences/Biological techniques/Genetic techniques

Introduction

Accurate and complete assembly of eukaryotic genomes remains a fundamental challenge in modern genomics. Despite substantial advances in sequencing technologies, many assemblies continue to suffer from unresolved gaps, collapsed repeats, and structural ambiguities. These limitations are particularly pronounced in regions characterized by long repetitive elements, segmental duplications, and complex chromosomal architectures 1. While short-read sequencing platforms provide base-level accuracy, they inherently lack long-range information.
Third generation long-read technologies extend read lengths into tens of kilobases, but remain susceptible to coverage biases, elevated error rates, and difficulties in resolving highly repetitive or low-complexity regions 2 . As a result, even well-studied eukaryotic genomes often contain poorly resolved regions such as centromeres, telomeres, ribosomal DNA (rDNA) arrays, and subtelomeric regions. Optical Genome Mapping (OGM) has emerged as a powerful complementary approach for capturing long-range genomic structure. By imaging individual DNA molecules, hundreds of kilobases to megabases in length, OGM provides direct access to genome-wide structural information that is difficult to obtain through sequencing alone. OGM has proven particularly valuable for characterizing large structural variants, copy number changes, and repeat expansions, and has been successfully applied in both research and clinical settings 3 – 5 . Commercial implementations, such as those by Bionano Genomics, rely on enzymatic labeling of specific sequence motifs, producing sparse fluorescent tag patterns (on average 9–15 per 100 kbp) along DNA molecules 6 . While this approach has demonstrated clinical utility, the relatively low label density limits its ability to resolve certain genomic regions, especially those with low motif frequency or structural complexity, including centromeres and large repeats. As an alternative to sparse labeling, dense labeling strategies that create continuous intensity profiles have emerged. Some approaches employ DNA methyltransferases with short (4-nt) recognition motifs, resulting in closely spaced labels whose fluorescence signals overlap due to the diffraction limit 7 . Other dense OGM strategies avoid enzymatic labeling altogether, instead using AT/GC-specific non-covalent binding proteins or synthetic dyes, such as the AT-specific TAMRA polypyrrole 8 , 9 . 
A distinct enzyme-free dense labeling strategy developed in our laboratory is based on competitive binding (CB) 10 . CB-based OGM exploits the competitive interaction between YOYO-1, a non-specific fluorescent DNA bis-intercalator, and netropsin, a non-fluorescent molecule with binding specificity for AT-rich regions 10 . This interaction produces a continuous fluorescence intensity profile along each DNA molecule, where the signal strength reflects the underlying GC content. While this method cannot resolve individual fluorophores, and hence sequence sites, it generates information-dense intensity profiles (barcodes) that offer a fundamentally different representation of genomic structure compared to motif-based OGM. These barcodes can then be aligned to reference genomes or used for de novo genome assembly 11 – 13 . Since the fluorescence signal in CB-based OGM arises from YOYO-1 labeling of the DNA backbone, a dye already widely used in OGM and single-molecule assays 14 , CB can be implemented without major modifications to existing instrumentation. This compatibility creates opportunities to integrate CB-based OGM with complementary labeling strategies, enabling multimodal analysis of genomic features along individual DNA molecules. To exploit the full potential of CB-based OGM, we have recently developed DOGMA (Dense Optical Genome Mapping Assembly), a computational pipeline designed for de novo assembly of genomes using CB-derived barcodes 13 . DOGMA was initially validated on Escherichia coli , a bacterial genome characterized by limited redundancy, low large-scale compositional variation, and relatively uniform sequence organization, providing an ideal test for overlap accuracy and assembly contiguity. Extending DOGMA to eukaryotic genomes introduces a complex set of challenges. In contrast to E. 
coli , even relatively small eukaryotic genomes contain complex features, including extensive repetitive regions and large-scale compositional domains that severely impede both sequence- and OGM-based assembly methods 1 , 15 . Repetitive elements often exceed sequencing read length and occur at multiple loci, making it difficult to assign reads or barcodes to unique genomic positions. Large-scale compositional domains, such as extended AT-/GC-rich regions or gradual changes in AT/GC content, further reduce barcode uniqueness by producing similar intensity profiles across long genomic regions. In OGM, both types of features can generate similar barcode patterns at distant loci, leading to ambiguous placement during alignment and assembly. Examples of these features include rDNA arrays, centromeres, telomeres, and subtelomeric regions, which are structurally complex, highly repetitive, and often appear at multiple loci in a genome 16 – 19 . As a result, these regions are systematically underrepresented or misassembled in reference genomes. In this study, we extend DOGMA to assembly of eukaryotic genomes using Saccharomyces cerevisiae and Schizosaccharomyces pombe as model systems. These two yeasts provide complementary test cases for CB-based OGM due to their compact, yet structurally rich, genomes and distinct chromosomal architectures. In S. cerevisiae , regions such as the rDNA array on chromosome XII are associated with genome stability, cellular aging, and nucleolar organization and are notoriously difficult to resolve 20 – 22 . Likewise, in S. pombe , similar arrays located near both ends of chromosome III pose challenges during DNA replication and remain incompletely resolved by sequencing-based approaches, consistent with their repetitive structure 23 , 24 . Similar repetitive structures are widespread in larger eukaryotic genomes. 
Here, DOGMA is used to generate de novo OGM assemblies of eukaryotic genomes that capture key structural features, including telomeric regions and large repetitive arrays, while explicitly examining regions where assembly contiguity breaks down. Reference-based analyses are used for validation and biological interpretation, without guiding the assembly process itself. These results demonstrate the feasibility of dense OGM assembly in eukaryotes and highlight the biologically driven challenges that must be addressed when extending OGM-based assembly methods beyond bacterial genomes.

Results

To perform OGM, we stained genomic DNA samples and stretched them by confinement in nanofluidic devices. Details of the experimental and computational workflows are described in Materials and Methods. In brief, DNA molecules were extracted from yeast cells enclosed in agarose plugs, labeled by competitive binding of YOYO-1 and netropsin, extended in nanochannels, and imaged by fluorescence microscopy, producing continuous intensity profiles along individual DNA molecules. Images were then processed to extract one-dimensional fluorescence intensity profiles, referred to as barcodes (Fig. 1a). The extracted barcodes underwent quality control and repetitive region screening prior to assembly (Fig. 1b). Assembly was carried out using a multi-step implementation of the DOGMA pipeline 13, grouping barcodes into bargroups (contiguous optical maps created from barcodes) based on similarity and contextual consistency (Fig. 1c). Assembled bargroups were then aligned to reference genomes for validation and structural comparison (Fig. 1d). Finally, consensus bargroups were generated and compared to each reference chromosome (Fig. 1e).

Reference-based alignment

A total of 1,102 S. pombe and 551 S. cerevisiae CB-labeled DNA molecules were imaged and converted into barcodes.
To evaluate genome-wide coverage prior to the de novo assembly, the barcodes were aligned to theoretical reference barcodes derived from the corresponding reference genomes (see Methods) 25,26 . The alignment was performed using a full-length alignment strategy, in which each query barcode is aligned in its entirety to the reference, with the Pearson cross-correlation (CC-value) scoring method ( Fig. 2 , see Methods for details). Genome-wide alignments were obtained for all three S. pombe chromosomes and all sixteen S. cerevisiae chromosomes (Fig. 2) . Reference-based coverage exceeded 98% for both genomes, with mean coverage depths of 29.5x for S. pombe and 16.4x for S. cerevisiae , indicating that the dataset provides near-complete genome representation for downstream assembly. Despite the general overall uniformity in coverage, several genomic regions, particularly in S. pombe , displayed elevated coverage. Given the random sampling of DNA inherent to OGM, pronounced local enrichment is not expected. Elevated coverage is more likely to reflect alignment ambiguity, where barcodes from distinct loci exhibit high full-length similarity and are therefore assigned to alternative genomic positions. Such ambiguity can arise when regions at different loci share similar intensity profiles, or when sequences are incompletely represented in the reference assembly, leading to misplacement. A complementary pattern was observed in specific regions of the S. cerevisiae genome, where reduced coverage was observed. This pattern is consistent with a limitation of the full-length alignment strategy, which requires query barcodes to be shorter than the reference. Molecules approaching or exceeding chromosome length are therefore unable to align optimally, and are instead placed at alternative loci, contributing both to apparent depletion at their true origin and enrichment elsewhere. 
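The full-length alignment described above can be illustrated with a minimal sketch. It assumes barcodes and reference are plain lists of intensities on the same pixel grid, scans both molecule orientations, and ignores stretching, circularity and significance testing, all of which a real pipeline would need. The function names `pearson` and `full_length_align` are hypothetical and are not taken from DOGMA.

```python
from math import sqrt

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a flat profile carries no alignment information
    return cov / sqrt(var_a * var_b)

def full_length_align(query, reference):
    """Slide the whole query along the reference in both orientations.

    Returns (best_cc, offset, flipped). The query must not be longer than
    the reference, which is the limitation discussed in the text.
    """
    if len(query) > len(reference):
        raise ValueError("full-length alignment needs query <= reference")
    best = (-2.0, 0, False)
    for flipped in (False, True):
        q = query[::-1] if flipped else query
        for offset in range(len(reference) - len(q) + 1):
            cc = pearson(q, reference[offset:offset + len(q)])
            if cc > best[0]:
                best = (cc, offset, flipped)
    return best

# Toy example: the query is an exact excerpt of a made-up reference.
reference = [0.2, 0.9, 0.4, 0.7, 0.1, 0.8, 0.3, 0.95, 0.5, 0.15, 0.6, 0.35]
query = reference[3:8]
cc, offset, flipped = full_length_align(query, reference)
```

Because the query is scored in its entirety, a molecule longer than its chromosome of origin raises an error here, mirroring why such molecules end up misplaced or excluded under this strategy.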
As an additional specificity control, barcodes from each species were aligned against the reference genome of the other species (Fig. 2) . While a majority of the barcodes (76.4%) aligned preferentially to their correct species, a subset exhibited comparable or higher global similarity to regions in the reference of the other species. After inspecting cross-species matches ( Fig. 2 , insets), we noticed that they occurred mainly around specific loci for each genome, creating alignment ambiguity hotspots. We also observed that cross-species matches were enriched for barcodes characterized by either long-range (200-300 kbp) low-frequency intensity trends or extended repetitive regions, both of which reduce barcode uniqueness and increase alignment ambiguity. Overall, the reference-based analysis indicated that global alignment provides reliable genome-wide coverage estimates, but that it is susceptible to alignment ambiguity under specific structural conditions. To clarify the origin of these effects and assess their impact on assembly, we next examined experimental and theoretical barcode properties influencing misalignment prior to initializing the de novo assembly. Structural features driving alignment ambiguity in eukaryotic OGM Analysis of the barcode dataset revealed three features that strongly influence similarity scoring and assembly behavior: chromosome-spanning molecules, repetitive barcode components, and long-range intensity trends. Each of these properties can affect similarity scores and influence the grouping during assembly. Chromosome-spanning molecules are captured by OGM Using the full-length alignment framework, molecules exceeding the theoretical chromosome length cannot be properly aligned, leading to misplacement or exclusion. This effect was particularly evident for the smaller chromosomes in S. cerevisiae , where full-length molecules exceed the corresponding theoretical barcode length. 
Individual barcodes, up to 800 kbp in size, fully spanned ten of the sixteen S. cerevisiae chromosomes (examples in Fig. 3a ), demonstrating the presence of chromosome-scale molecules in the dataset. This observation highlights the capacity of CB-based OGM to capture entire eukaryotic chromosomes within single intact DNA molecules. For sufficiently small chromosomes, and given limited DNA fragmentation, assembly may therefore be simplified through local alignment (see Methods), which identifies the highest-scoring subregions between barcode and reference without requiring full-length matching. Repetitive barcode components cause alignment ambiguity A subset of barcodes exhibited broad alignment across multiple genomic loci and were identified as being enriched in repetitive features. The identified repetitive units ranged between 9 and 10.5 kb in size, consistent with known repetitive regions in S. cerevisiae and S. pombe , respectively. This repetitive DNA is known to be underrepresented or collapsed to just two copies per locus in current reference assemblies 26,27 . When repetitive regions are not masked, their repetitive components dominate the alignment, producing visible misalignments to incorrect loci (Fig. 3b.i) . To address this issue, and to incorporate repetitive DNA into the alignment and assembly more effectively, we developed a repetitive-element detection algorithm (see Supplementary Methods 2) that identifies and masks repetitive elements. The algorithm also classifies barcodes as fully repetitive, partially repetitive (when a subregion of the barcode has at least five contiguous repetitive units), and non-repetitive. After masking the repetitive segments, the remaining non-repetitive portion of the barcode can be used to determine the correct genomic locus. As illustrated in Fig. 3b.ii , masking resolves the ambiguity and enables consistent alignment of the same molecules to a single genomic locus. 
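The classification and masking step can be sketched as follows. The detection algorithm itself is described in Supplementary Methods 2 and is not reproduced here. This sketch assumes a per-pixel boolean mask has already been produced by some detector, and only shows how the "at least five contiguous repetitive units" rule and the masking could be applied. The `full_fraction` cutoff for calling a barcode fully repetitive is an illustrative assumption, as are all function names.

```python
def longest_run(mask):
    """Length of the longest contiguous stretch of True values."""
    best = run = 0
    for flagged in mask:
        run = run + 1 if flagged else 0
        best = max(best, run)
    return best

def classify_barcode(mask, unit_px, min_units=5, full_fraction=0.9):
    """Label a barcode from its per-pixel repeat mask.

    mask          -- list of bools, True where a pixel was flagged repetitive
    unit_px       -- length of one repeat unit in pixels
    min_units     -- contiguous units needed for 'partially repetitive'
                     (five, following the text)
    full_fraction -- assumed cutoff for 'fully repetitive' (hypothetical)
    """
    if not mask:
        raise ValueError("empty mask")
    if sum(mask) / len(mask) >= full_fraction:
        return "fully repetitive"
    if longest_run(mask) >= min_units * unit_px:
        return "partially repetitive"
    return "non-repetitive"

def apply_mask(barcode, mask):
    """Replace flagged pixels with None so they are kept but not scored."""
    return [None if flagged else value for value, flagged in zip(barcode, mask)]

# Toy example: 40 unique pixels followed by six repeat units of 10 pixels.
example_mask = [False] * 40 + [True] * 60
label = classify_barcode(example_mask, unit_px=10)
```

Keeping masked pixels as `None` rather than deleting them preserves the molecule's length and coordinates, so the repetitive content remains available for later analysis.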
Since the pair-wise similarity scoring of the assembly process is similar to the scoring system from the alignment process, it is expected that the resolution of alignment ambiguity via masking of repetitive regions can improve the bargroup formation during DOGMA assembly as well. Long-range GC-dependent intensity gradients influence barcode similarity The reference-based analysis revealed recurrent cases in which barcodes from distinct genomic loci exhibited high global similarity. Inspection of these barcodes suggested the presence of gradual, long-range intensity variations in addition to the typical CB-based OGM barcode features. To better understand the origin of this behavior, we searched for long-range intensity trends in the theoretical reference barcodes. Smoothing of theoretical reference barcodes using a moving-average filter revealed pronounced long-range intensity trends spanning hundreds of kilobases in several locations (Fig. 3c) , reflecting genome-scale compositional domains that can be detected by dense CB-based labeling. In S. pombe , these features were particularly evident, with gradually decreasing GC-content extending ~300 kbp toward chromosome ends and characteristic low-GC motifs spanning ~200 kbp at centromeric regions. While local barcode structure is defined by short-scale intensity variations (5–20 kbp), these long-range trends introduce an underlying “low-frequency” barcode component across much larger genomic distances. Consistent with this organization, experimental barcodes sharing similar long-range intensity trends frequently exhibited ambiguous alignments, including preferential placement at incorrect loci or, in some cases, within the reference genome of the other species (Fig. 2) . These results demonstrate that CB-based OGM captures biologically meaningful genome-scale compositional information that is largely inaccessible to sparse labeling approaches that just label specific motifs. 
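The moving-average smoothing used to expose long-range trends can be sketched in a few lines. The version below is a centered moving average whose window shrinks at the barcode ends, followed by a split into a low-frequency trend and a local residual. The window length, the edge handling and the function names are assumptions for illustration; the filter settings actually used are not given in this section.

```python
def moving_average(signal, window):
    """Centered moving average; the window shrinks near the ends."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    smoothed = []
    for i in range(len(signal)):
        lo = max(0, i - half)
        hi = min(len(signal), i + half + 1)
        segment = signal[lo:hi]
        smoothed.append(sum(segment) / len(segment))
    return smoothed

def split_trend(signal, window):
    """Separate a barcode into a long-range trend and local structure."""
    trend = moving_average(signal, window)
    local = [s - t for s, t in zip(signal, trend)]
    return trend, local

# Toy example: a pure linear gradient has no local structure in its interior.
ramp = list(range(20))
trend, local = split_trend(ramp, window=5)
```

In practice the window would be chosen to be much longer than the 5–20 kbp local features and shorter than the 200–300 kbp trends, so that the residual retains the locus-specific pattern while the trend captures the compositional domain.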
While these long-range intensity trends pose challenges for similarity-based alignment and assembly, they also provide direct insight into large-scale chromosomal organization, highlighting both a limitation and a unique strength of dense OGM. To validate that the elevated coverage observed in Fig. 2 arises from intrinsic genome-scale signal organization, rather than coverage limitations, intragenomic self-similarity analysis was performed across the S. pombe chromosomes (Supplementary Fig. 4). Chromosomal regions ≥150 kbp exhibiting CC-value ≥0.75 were considered highly similar. This analysis revealed discrete similarity “hotspots” shared between chromosomes, consistent with shared long-range signal features captured in CB-based OGM data (Fig. 3c) . These long-range features can dominate global similarity measures despite divergent local structure, providing a mechanistic explanation for the coverage inflation and reduced alignment confidence observed in the reference-based analysis. Together, these findings indicate that structural properties intrinsic to eukaryotic genomes, rather than insufficient coverage, represent the primary constraints on assembly using dense OGM. Molecule characteristics before and after quality control processing Mean molecule lengths prior to quality control (QC) were 337.8 kbp for S. pombe and 359.7 kbp for S. cerevisiae . The QC procedures included stitching molecules spanning multiple imaging fields and masking barcode regions dominated by repetitive features (see Methods). Repetitive segments were masked rather than removed, preserving the non-repetitive parts for downstream analysis. Following QC, the mean molecule lengths remained largely unchanged (337.6 kbp for S. pombe and 366.0 kbp for S. cerevisiae) , indicating that filtering primarily addressed problematic features without introducing size-related bias. A second reference-based alignment was performed to evaluate the QC (Supplementary Fig. 3). 
This alignment implemented local barcode-to-reference matching to accommodate chromosome spanning molecules, repetitive masking to account for repetitive regions, and a ΔCC threshold to filter barcodes with alignment ambiguity. Increasing the ΔCC threshold improved organism-specific alignment accuracy while reducing the fraction of retained barcodes, illustrating the trade-off between specificity and molecule retention (Fig. 3d) . An optimal threshold of ΔCC = 0.03 maximized correct alignments while maintaining a near complete estimated coverage of both genomes. Bargroup formation and consensus barcode construction Having characterized the structural features influencing similarity metrics, we next assembled QC-filtered and repeat-masked barcodes into bargroups using the DOGMA pipeline. Repetitive elements were identified and masked from barcodes prior to assembly. Assembly without masking produced bargroups containing misaligned repetitive and non-repetitive segments (Fig. 4a.i) . Masking allowed the non-repetitive parts of barcodes to drive the assembly, while still maintaining the repetitive content for post-assembly analysis. This strategy enabled correct assembly of molecules that would otherwise be mis-grouped or left ungrouped, while also resolving incorrect groupings driven by repetitive segments (Fig. 4a.ii) . To mitigate grouping driven primarily by shared long-range intensity trends rather than local structural similarity, we introduced a bargroup scoring metric based on signal-to-noise characteristics (see Methods). This metric prioritizes bargroups with well-defined local features consistent among their constituting barcodes (Fig. 4b.i) and penalizes bargroups dominated by smooth, low-frequency signal components with constituting barcodes that differ in their short-frequency features (Fig. 4b.ii) . 
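The ΔCC filter mentioned above can be illustrated with a small sketch. The exact definition of ΔCC is given in the Methods and is not restated in this section. The sketch assumes that ΔCC is the gap between the best CC-value and the runner-up CC-value at a distinct candidate locus. Barcode names, scores and function names are made up for the example.

```python
def delta_cc(scores):
    """Gap between the best and second-best CC-value (assumed definition)."""
    ranked = sorted(scores, reverse=True)
    if not ranked:
        raise ValueError("no candidate alignments")
    if len(ranked) == 1:
        return float("inf")  # a single candidate is unambiguous by construction
    return ranked[0] - ranked[1]

def filter_by_delta_cc(candidates, threshold=0.03):
    """Keep barcodes whose best placement beats the runner-up by >= threshold."""
    return {name: scores for name, scores in candidates.items()
            if delta_cc(scores) >= threshold}

def retained_fraction(candidates, threshold):
    """Fraction of barcodes surviving the filter at a given threshold."""
    return len(filter_by_delta_cc(candidates, threshold)) / len(candidates)

# Hypothetical CC-values at candidate loci for three barcodes.
candidates = {
    "mol_1": [0.90, 0.80],  # clear winner
    "mol_2": [0.85, 0.84],  # ambiguous placement
    "mol_3": [0.70],        # only one candidate
}
kept = filter_by_delta_cc(candidates, threshold=0.03)
```

Sweeping `threshold` through `retained_fraction` gives the kind of specificity-versus-retention curve described in the text: raising the threshold removes ambiguous barcodes at the cost of discarding more molecules.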
Bargroup assembly applied pairwise similarity constraints incorporating limited stretching variation, positional tolerance in pixel coordinates, and statistical thresholds on correlation significance (see Methods). The assembled bargroups were subsequently aligned to the corresponding reference genomes using a local bargroup-to-reference alignment strategy to assess genome-wide coverage and contiguity. Each bargroup consensus was represented as an averaged barcode intensity profile derived from its constituent barcode members (Fig. 4c) . Overall, approximately 20% and 30% of the barcodes used as input for the bargrouping process contributed to bargroup formation from S. pombe and S. cerevisiae , respectively, while the rest remained ungrouped despite passing QC. Ungrouped molecules were retained for downstream coverage analysis and gap resolution. Genome-wide assemblies reveal chromosomal extensions Genome-wide coverage by assembled bargroups closely mirrored the high reference-alignment coverage observed prior to assembly, indicating that most chromosomal regions were represented within contiguous optical maps. Across multiple chromosomes, bargroups spanned megabase-scale intervals, demonstrating robust large-scale assembly performance. Following initial assembly, bargroups were aligned to the reference genomes using a local bargroup-to-reference alignment strategy. Adjacent bargroups with consistent placements were subsequently merged based on local bargroup-to-bargroup similarity to generate chromosome-scale assemblies. The resulting assemblies retained only bargroups satisfying the meanCC, stdCC, and SNR quality criteria defined during bargroup formation (Supplementary Table 3). The final assemblies generated for both S. pombe and S. cerevisiae included highly repetitive regions typically underrepresented in sequence-based assemblies (Fig. 5) . 
Some of the assembled chromosomes show minor extensions (>6 kbp) relative to the reference chromosomes at one or both chromosome ends. These can be explained as a product of our theoretical reference model, which omits the intensity of the first and last 6 pixels, assuming them to be uncertain. Local breaks in contiguity were observed in S. pombe chromosome I from position 4466 kbp to 4787 kbp and in chromosome II from position 3208 kbp to 3211 kbp (Supplementary Fig. 5). To distinguish assembly limitations from absence of coverage, we examined high-confidence barcodes that aligned strongly to the reference, but did not group into bargroups. Inclusion of these molecules revealed that regions lacking assembled contigs were nevertheless supported by individual barcodes, resulting in near-complete genome-wide molecular representation. Large repetitive structures were resolved in both organisms. In S. cerevisiae, a large repetitive block interrupted the reference assembly of chromosome XII near 479 kbp. This region extended ≥550 kbp and consisted of ~9 kbp repeat units, corresponding to ≥60 tandem copies. In S. pombe, extended repeat arrays were observed at both ends of chromosome III beyond the length represented in the current reference assembly, spanning at least 1 Mb in total (≥500 kbp from each telomeric end). Individual repeat units measured ~10.5 kbp, consistent with rDNA repeats 19,27 and indicating the presence of ≥50 tandem copies at each end. Structural variations (SVs) in the form of smaller insertions were also detected in S. cerevisiae chromosomes VIII and III. We were able to manually curate a ~30 kbp insertion in chromosome VIII (position 206 kbp). A similar-sized insertion on chromosome III between positions 100 kbp and 200 kbp was also identified, but its insertion point was not precisely determined. SV detection is not yet integrated into our pipeline, making it non-trivial to pinpoint exact insertion points.
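As a quick check on the repeat-array figures quoted earlier in this section, the implied copy number is simply the array length divided by the repeat-unit length. The sketch below uses only the lower-bound lengths and approximate unit sizes stated in the text. The function name is hypothetical.

```python
def implied_copies(array_kbp, unit_kbp):
    """Number of tandem units that fit in an array of the given length."""
    if unit_kbp <= 0:
        raise ValueError("repeat unit length must be positive")
    return array_kbp / unit_kbp

# S. cerevisiae chromosome XII: >=550 kbp array of ~9 kbp units.
cerevisiae_copies = implied_copies(550, 9)

# S. pombe chromosome III: >=500 kbp per end, ~10.5 kbp units.
pombe_copies_per_end = implied_copies(500, 10.5)

print(f"S. cerevisiae: about {cerevisiae_copies:.1f} copies")
print(f"S. pombe: about {pombe_copies_per_end:.1f} copies per end")
```

The S. cerevisiae values give roughly 61 units, in line with the stated ≥60 copies. For S. pombe, exactly 500 kbp of 10.5 kbp units corresponds to roughly 48 units, so the stated ≥50 copies would correspond to arrays of about 525 kbp or more. Since both the array length (a lower bound) and the unit size are approximate, this is only a rough consistency check.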
Discussion This work demonstrates that dense optical genome mapping (OGM) assembly using the DOGMA pipeline can be extended from bacterial to eukaryotic genomes. In doing so, it reveals constraints and opportunities that are largely invisible in simpler systems, such as bacterial genomes, and require targeted methodological adaptations. Applying DOGMA to S. cerevisiae and S. pombe demonstrates that OGM-based assembly of eukaryotic genomes is constrained less by genome size and more by repeat composition and long-range genome organization. For E. coli , DOGMA achieved nearly complete genome reconstruction with minimal preprocessing, reflecting the relative simplicity and low redundancy in bacterial genomes 28 . The transition to S. cerevisiae and S. pombe revealed a fundamentally different assembly landscape. Yeast genomes introduced an increase in barcode ambiguity due to repetitive genomic features such as rDNA arrays and subtelomeric repeats 21,22,25,29 . By explicitly identifying and isolating repetitive signal components, DOGMA can assemble repeat-rich regions without collapsing non-adjacent loci and subsequently reintegrate them using unique flanking context. Assembly of these repetitive regions with OGM is solely limited to the size of the isolated DNA to ensure the collection of data including repetitive regions and their adjacent non-repetitive loci. This demonstrates that repeat content is not a fundamental limitation of dense optical genome assembly; rather, the critical factor is the ability to detect, classify, and appropriately weight repetitive and long-range signal features during assembly. Another key insight of this study is that long-range intensity trends, spanning hundreds of kilobases, contribute substantially to barcode similarity across distinct loci. In S. pombe , these trends manifest as chromosome GC gradients and centromeric low-GC domains, which are consistently detectable in both theoretical and experimental barcodes. 
While such trends can bias global similarity metrics and promote spurious alignments, they also reflect large-scale chromosomal organization that is inaccessible to sparse-label OGM approaches. Accounting for this signal component will therefore be essential for future OGM assembly frameworks. Together, these findings position dense OGM as both an assembly tool and a probe of large-scale genome organization. While further methodological refinement will be required to fully exploit long-range signal information in even more complex genomes, the complete yeast assemblies presented here establish an important foundation for extending dense OGM to larger eukaryotic systems and to biologically and clinically relevant repetitive loci. From a methodological perspective, the combination of repeat-aware masking and characterization of long-range intensity profiles provides a generalizable strategy for extending dense, CB-based, OGM beyond yeast. Many of the challenges encountered here, such as repetitive arrays and long-range compositional structure, are even more common in mammalian genomes, where repetitive regions constitute over half of the human genome and remain systematically underrepresented or collapsed in many reference assemblies 30,31 . Despite recent advances, such as telomere-to-telomere assemblies, large repetitive loci remain difficult to characterize robustly with sequencing alone, particularly when copy number, structural heterogeneity, or long-range organization are biologically relevant 30 . In this context, dense OGM offers a complementary view, capturing continuous intensity profiles, that are orthogonal to sequence-based information, over hundreds of kilobases. While evaluation in human genomes lies beyond the scope of this study, the framework presented here suggests a path toward resolving medically relevant repeat structures, including large satellite arrays, rDNA clusters, and disease-associated macrosatellites 3,17 . 
An additional advantage of CB-based, dense OGM is that the intensity variation signal arises from YOYO-1 labeling of the DNA backbone, a fluorophore already used in many OGM and single-molecule assays to simply detect the DNA backbone. Therefore, the incorporation of competitive binding does not require additional excitation sources, emission filters, or major instrumentation changes, making this signal layer effectively available in already existing platforms. This creates opportunities to combine CB-based OGM with other labeling strategies, such as enzymatic motif labeling, damage mapping 32 , or epigenetic assays 14 , while retaining long-range genomic context. In this way, CB-based OGM is not limited to structural reconstruction, but can serve as a scaffold for integrating multiple molecular readouts along individual DNA molecules. In conclusion, this study demonstrates that dense, continuous optical genome mapping can be extended from bacterial to eukaryotic genomes when eukaryotic structural constraints are appropriately considered. By combining repeat-aware masking with local alignment assembly, DOGMA reconstructs genome-wide optical maps in yeast without collapsing non-adjacent loci. Importantly, CB-based OGM preserves large-scale compositional features of chromosomes, enabling analysis of genomic structure beyond strictly local sequence similarity. Together, these results establish DOGMA as a robust framework for eukaryotic optical genome assembly and lay the groundwork for applying dense OGM to increasingly complex genomes. Methods Yeast cultivation and DNA extraction Experiments were performed using Saccharomyces cerevisiae strain BY4742 and Schizosaccharomyces pombe strain NS112 ( leu1-32 his3-D1 ura4-D18 ade6-M210 ). The protocol for DNA extraction was adapted from previous work 33 . In short, S. cerevisiae was grown overnight and used to inoculate a culture that was grown at 30°C for 6 to 8 hours to reach an OD600 of 0.1. 
Afterwards, 6 × 10⁸ cells were harvested and frozen at -20°C until the DNA extraction was performed. For S. pombe 34, 50 mL of exponentially growing cells at an initial concentration of 0.5 million cells/mL were grown in Edinburgh minimal media for 16 h at 30°C until they reached 14 million cells/mL. Cells were harvested by centrifugation at 3000 rpm for 5 min and washed twice with cold 50 mM EDTA. Cells were frozen at -20°C prior to DNA extraction.
The extraction of DNA from both types of yeast cells was performed using the CHEF Yeast Genomic DNA Plug Kit from BIO-RAD. In short, 1 mL of agarose plugs was created for each strain by combining 6 × 10⁸ cells with 625 µL CSB (Cell Suspension Buffer) and 375 µL 2% CCA (CleanCut agarose). Agarose plugs were then treated with Lyticase, RNase, and Proteinase K, and washed several times according to the CHEF Yeast Genomic DNA Plug Kit protocol. DNA was then extracted from the agarose plugs by first incubating them in 1x CutSmart (New England Biolabs) solution at 70°C for 10 minutes, followed by incubation at 42°C for 10 minutes and then a 42°C incubation with 2 U agarase (New England Biolabs) for 2 hours. DNA was then quantified using the Qubit dsDNA Broad Range Assay kit (Invitrogen).
CB-based OGM lab procedures
Densely labelled DNA molecules were obtained by a single-step competitive binding (CB)-based staining reaction, as described previously 33. The CB staining reactions were performed by adding 4-6 µM DNA (in bases), 0.4-0.6 µM YOYO-1, and 120-180 µM netropsin, in a 10:1:300 ratio, to a 10 µL solution of 0.5X TBE (Tris-borate-EDTA). DNA from bacteriophage λ (48,502 bp, purchased from Roche) was added to make up ~30% of the DNA mass in the staining reactions, to serve as an internal size reference. Finally, an incubation at 50°C for 30 minutes was followed by a 15-fold dilution with MQ-water and 3 µL of β-mercaptoethanol (BME) to reduce photodamage.
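Because the three staining components are tied together by the stated 10:1:300 ratio (DNA in bases : YOYO-1 : netropsin), choosing the DNA concentration fixes the other two. The short Python sketch below works through that arithmetic for the quoted 4-6 µM DNA range. The function name `cb_staining_mix` is hypothetical and used only for illustration; it is not part of the authors' protocol or code.

```python
def cb_staining_mix(dna_uM, ratio=(10, 1, 300)):
    """Return (DNA, YOYO-1, netropsin) concentrations in µM.

    `ratio` is DNA (in bases) : YOYO-1 : netropsin, as quoted in the text.
    This helper is illustrative only and not part of any published package.
    """
    dna_part, yoyo_part, netropsin_part = ratio
    unit = dna_uM / dna_part  # concentration corresponding to one "part"
    return dna_uM, unit * yoyo_part, unit * netropsin_part


# Walk through the DNA concentration range quoted in the protocol.
for dna in (4.0, 5.0, 6.0):
    dna_uM, yoyo_uM, netropsin_uM = cb_staining_mix(dna)
    print(f"DNA {dna_uM:.1f} µM, YOYO-1 {yoyo_uM:.2f} µM, netropsin {netropsin_uM:.0f} µM")
```

The end points of the DNA range reproduce the quoted YOYO-1 (0.4-0.6 µM) and netropsin (120-180 µM) ranges, which is a quick consistency check on the ratio.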
Nanofluidic experiments
Stained DNA molecules were stretched for imaging using a nanofluidic device whose fabrication is described elsewhere 34–36. These devices have two pairs of loading wells, with the wells of each pair connected via a separate microchannel. The two microchannels are connected to each other by 120 parallel nanochannels, each with a cross section of 100 × 150 nm² and a length of 500 µm. To ensure uniform conditions, the chips were pre-conditioned with 0.033X TBE buffer supplemented with 2% BME v/v prior to sample loading. DNA was loaded by adding 10 µL of stained sample into one of the loading wells, while the others contained the same buffer used for pre-conditioning, but no DNA. DNA molecules were then driven first into the microchannels and then into the nanochannels by pressure-driven N₂ flow. Stretched and stained DNA molecules in the nanochannels were imaged using an inverted fluorescence microscope (Zeiss Axio Observer D.1) equipped with a 63x oil immersion objective (NA = 1.46, Zeiss), a Colibri 7 light source (Zeiss), and a Photometrics Evolve EMCCD camera. Videos of up to 100 frames were systematically acquired for each molecule, using an exposure time of 100 ms and a FITC filter set (Zeiss).
Barcode processing
Barcodes were generated from videos using previously developed custom MATLAB software called 'lldev' 37,38. In short, DNA molecules are detected in each time frame of a movie. The detected DNA molecules are then turned into kymographs, where each row of pixels represents the intensity along the DNA molecule in a single frame. The kymographs are aligned to compensate for small thermal fluctuations of the DNA inside the nanochannels.
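The kymograph idea can be illustrated with a deliberately simplified sketch in Python. It assumes that frame-to-frame motion is a rigid shift by a whole number of pixels, aligns every frame to the first one by maximizing their overlap, and then averages the aligned frames over time to obtain a one-dimensional intensity trace. The actual 'lldev' software is MATLAB code and its alignment is not reproduced here; all function names below are hypothetical.

```python
def shift_row(row, shift, fill=0.0):
    """Shift a row right by `shift` pixels (left if negative), padding with `fill`."""
    n = len(row)
    out = [fill] * n
    for i, value in enumerate(row):
        j = i + shift
        if 0 <= j < n:
            out[j] = value
    return out


def best_shift(reference, row, max_shift):
    """Integer shift of `row` that maximizes its dot product with `reference`."""
    def score(shift):
        return sum(a * b for a, b in zip(reference, shift_row(row, shift)))
    return max(range(-max_shift, max_shift + 1), key=score)


def kymograph_to_barcode(kymograph, max_shift=3):
    """Align every frame (row) to the first frame, then average over time."""
    reference = kymograph[0]
    aligned = [shift_row(row, best_shift(reference, row, max_shift))
               for row in kymograph]
    n_frames = len(aligned)
    return [sum(column) / n_frames for column in zip(*aligned)]


# Toy kymograph: one bright feature that drifts by a pixel between frames.
kymograph = [
    [0, 0, 1, 5, 1, 0, 0, 0],
    [0, 0, 0, 1, 5, 1, 0, 0],  # drifted one pixel to the right
    [0, 1, 5, 1, 0, 0, 0, 0],  # drifted one pixel to the left
]
print(kymograph_to_barcode(kymograph))
```

Averaging without the alignment step would smear the feature over three pixels, which is why the frames are registered first.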
Aligned kymographs are then processed by a custom quality control (QC) algorithm to merge molecules spanning multiple fields of view (multi-FoV) and to remove molecules that show a significant change in size during imaging (see Supplementary Methods 1). Finally, the filtered and aligned kymographs are time-averaged, yielding a time-series-like 1D intensity trace along the extension of the DNA molecule, referred to as a barcode.
Reference alignment
Reference genome sequences were used as input to previously established methods 11 to generate theoretical reference barcodes in silico for alignment and coverage analysis. For S. cerevisiae, the S288C reference genome (R64 assembly), obtained from the Saccharomyces Genome Database 26, was used as the reference since the BY4742 strain was derived from S288C. For S. pombe, the reference genome of strain 972h−, obtained from PomBase 27, was used as the reference since strain NS112 (leu1-32 his3-D1 ura4-D18 ade6-M210) was derived from the 972 strain background.
Genome-wide barcode alignment was performed with a full-length similarity search strategy to estimate coverage across the reference genomes. This approach, implemented in previous work 33, uses a Pearson correlation coefficient (CC) scoring system to identify the best placement of a query barcode within a given reference. In this strategy, the full query barcode is aligned to a segment of the reference, which requires the query to be shorter than the reference.
To identify individual molecules spanning complete S. cerevisiae chromosomes, a local similarity alignment was used in a barcode-to-reference mode. This approach uses a sliding window to identify the highest-scoring subregions between query and reference, allowing partial matches and overhang beyond reference boundaries. Both query and reference must exceed the selected overlap window length.
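The full-length strategy described above can be sketched as follows: the entire query is slid along the reference and the Pearson correlation coefficient is computed at every offset, keeping the best one. This Python sketch is a minimal illustration under simplifying assumptions. It ignores molecule stretching, orientation, and any significance testing, and the function names are hypothetical rather than taken from the authors' code.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a flat segment carries no pattern to correlate
    return cov / math.sqrt(var_a * var_b)


def best_full_length_placement(query, reference):
    """Slide the whole query along the reference; return (best_offset, best_cc)."""
    if len(query) > len(reference):
        raise ValueError("query must not be longer than the reference")
    best_offset, best_cc = 0, -1.0
    for offset in range(len(reference) - len(query) + 1):
        cc = pearson(query, reference[offset:offset + len(query)])
        if cc > best_cc:
            best_offset, best_cc = offset, cc
    return best_offset, best_cc


# Toy example: the query is a rescaled copy of reference[3:7], so the Pearson
# score is insensitive to its different brightness and background level.
reference = [1, 3, 2, 5, 4, 7, 1, 0, 2, 6]
query = [2 * v + 10 for v in reference[3:7]]
offset, cc = best_full_length_placement(query, reference)
print(offset, round(cc, 3))
```

Because the whole query must fit inside the reference, this mode cannot place molecules that overhang a chromosome end, which is the case the local alignment mode is meant to handle.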
The local alignment is based on a matrix profile framework and is also used in a barcode-to-barcode mode as the core function for pairwise similarity calculations in DOGMA 13,39. For each chromosome, an overlap window corresponding to 70% of the chromosome length was used. Candidate barcodes were restricted to lengths between 70% and 150% of the corresponding chromosome size to accommodate molecule length variability while ensuring chromosome-scale coverage.
Following assembly, bargroup alignment to the reference genome was performed using local similarity alignment rather than full-length alignment. This choice accounts for assembled contigs that may extend beyond the edges, or exceed the full length, of the corresponding reference chromosomes because of insertions or repetitive or unresolved genomic segments. Local alignments were calculated using an overlap window corresponding to 70% of the non-repetitive portion of each contig. For bargroup alignment to the reference, chromosome XII of the S. cerevisiae genome was split into left and right sides after we observed partial alignment caused by repetitive regions not included in the reference. The breakpoint for the insertion of the repetitive region into chromosome XII was manually determined to be around position 472 kbp.
Repetitive Finder
Since DOGMA uses local barcode-to-barcode alignment for its bargrouping procedure, repetitive regions shared between molecules can dominate the matching process, potentially leading to non-canonical groupings independent of the non-repetitive content elsewhere in the molecules. For this reason, it is important to identify molecules with repetitive regions and mask the repetitive part of each so that the rest of the molecule can be used as a contextual “anchor” in the bargrouping process. In this way, bargroups are created based on the similarity of the non-repetitive parts of each barcode, while the repetitive parts are preserved and can be “unmasked” for downstream analysis.
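To make the idea of finding and masking a repetitive stretch concrete, the Python sketch below estimates a repeat period from the autocorrelation of a barcode and then flags runs in which consecutive period-long windows correlate strongly. It is an illustrative simplification and not the authors' MATLAB Repetitive Finder: the function names, the toy data, and the way runs are merged are assumptions, and the defaults of 0.4 and five repeat units simply mirror the thresholds quoted in this section.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def estimate_period(barcode, min_lag, max_lag):
    """Lag at which the barcode correlates best with a shifted copy of itself."""
    return max(range(min_lag, max_lag + 1),
               key=lambda lag: pearson(barcode[:-lag], barcode[lag:]))


def repetitive_mask(barcode, period, threshold=0.4, min_units=5):
    """True for pixels inside runs of at least `min_units` repeat units."""
    n = len(barcode)
    # flagged[s]: window starting at s resembles the window one period later
    flagged = [pearson(barcode[s:s + period],
                       barcode[s + period:s + 2 * period]) > threshold
               for s in range(max(0, n - 2 * period + 1))]
    mask = [False] * n
    s = 0
    while s < len(flagged):
        if not flagged[s]:
            s += 1
            continue
        end = s
        while end + 1 < len(flagged) and flagged[end + 1]:
            end += 1
        first_px, last_px = s, end + 2 * period - 1
        if last_px - first_px + 1 >= min_units * period:
            for i in range(first_px, last_px + 1):
                mask[i] = True
        s = end + 1
    return mask


# Toy barcode: a non-repetitive flank followed by six copies of a 4-pixel unit.
barcode = [0, 1, 2, 3, 3, 2, 1, 0] + [0, 1, 3, 1] * 6
period = estimate_period(barcode, min_lag=2, max_lag=6)
mask = repetitive_mask(barcode, period)
anchor = [v for v, masked in zip(barcode, mask) if not masked]
print(period, anchor)
```

Only the unmasked flank (`anchor`) would then take part in grouping, while the masked pixels are kept aside rather than discarded.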
To identify and mask highly repetitive regions prior to assembly, we developed a custom Repetitive Finder algorithm implemented in MATLAB (Supplementary Fig. 2). The method detects periodic patterns in barcode intensity profiles using the autocorrelation function to estimate repeat periodicity, followed by Fourier transform validation. A rolling-window correlation then identifies continuous repetitive regions, with segmentation achieved using the findchangepts function (Supplementary Methods 2). Regions with an average correlation >0.4 across at least five repeat units are marked as repetitive and masked in subsequent assembly steps. These empirically determined thresholds effectively separate highly repetitive barcodes from non-repetitive ones (Supplementary Fig. 2).
DOGMA pipeline implementation and bargroup quality metrics
Assembly was performed using the DOGMA algorithm as previously described 13. Pairwise barcode comparisons were evaluated using normalized Pearson cross-correlation under constrained alignment parameters, including stretch tolerance, positional tolerance, and statistical significance thresholds defined by the DOGMA null model framework (Supplementary Table 1). Because repetitive, partially repetitive, and non-repetitive barcode classes exhibit distinct similarity distributions, parameter values within the null-model-based p-value framework were optimized separately for each class using the parametrization.m script within the DOGMA package (Supplementary Table 2). We recommend running this parametrization once per experimental setup, since the parameters are strongly affected by the pixel size and effective point spread function (PSF) of the setup.
To enable incorporation of repetitive regions, assembly was performed in multiple stages reflecting barcode class. Independent DOGMA assemblies were first carried out on non-repetitive and partially repetitive masked barcodes.
Repetitive barcodes were then assembled in a subsequent stage using the partially repetitive assemblies as anchoring context. This staged strategy prevents repetitive segments from driving initial grouping while allowing their incorporation once non-repetitive context has been established. Finally, all resulting bargroup contigs were merged through an additional DOGMA-based local assembly step to generate unified bargroups spanning all barcode classes. Consensus bargroups were constructed by averaging aligned barcode intensity profiles following stretch normalization.
To mitigate grouping driven by long-range low-frequency trends and to ensure that only structurally consistent and information-rich groups were retained, three complementary quality metrics were evaluated for each bargroup. Internal similarity was first assessed using Pearson cross-correlation coefficients (CC) computed between each barcode within a bargroup and the bargroup consensus. The mean of these values (meanCC) was required to exceed 0.7, ensuring overall coherence among members. The standard deviation of the CC values (stdCC) was also evaluated to measure internal consistency; bargroups with stdCC > 0.05 were discarded, as high dispersion indicates heterogeneous membership despite acceptable mean similarity. Finally, a signal-to-noise ratio (SNR) was computed to quantify the strength of the consensus structure relative to variability among its members. For a bargroup with $N$ members, let $I_j(x)$ denote the intensity at pixel $x$ of barcode $j$, and $\bar{I}(x)$ the consensus intensity. The SNR was defined as in equation (1), where the numerator reflects variation across the consensus profile, and the denominator reflects the average per-pixel variability among individual barcodes. Bargroups with SNR < 1.0 were excluded.
$$\mathrm{SNR} = \frac{\sigma_x\left[\bar{I}(x)\right]}{\left\langle \sigma_j\left[I_j(x)\right] \right\rangle_x} \qquad (1)$$
Only bargroups satisfying all three criteria were retained for downstream assembly.
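The three acceptance criteria can be expressed as a short Python sketch. It assumes the member barcodes have already been aligned and resampled to a common length, takes the consensus as the per-pixel mean, and uses one plausible reading of the SNR (standard deviation of the consensus across pixels divided by the mean per-pixel standard deviation across members). That SNR form, the use of population standard deviations, and the function names are assumptions made for illustration; only the thresholds (meanCC > 0.7, stdCC ≤ 0.05, SNR ≥ 1.0) come from the text.

```python
import math
from statistics import mean, pstdev


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_a, mean_b = mean(a), mean(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def bargroup_metrics(barcodes):
    """meanCC, stdCC and SNR for pre-aligned, equal-length member barcodes."""
    columns = list(zip(*barcodes))            # one tuple of member values per pixel
    consensus = [mean(col) for col in columns]
    ccs = [pearson(b, consensus) for b in barcodes]
    signal = pstdev(consensus)                # variation along the consensus
    noise = mean([pstdev(col) for col in columns])  # mean per-pixel member spread
    snr = signal / noise if noise > 0 else math.inf
    return {"meanCC": mean(ccs), "stdCC": pstdev(ccs), "SNR": snr}


def passes_quality(metrics, min_mean_cc=0.7, max_std_cc=0.05, min_snr=1.0):
    """Apply the three thresholds quoted in the text."""
    return (metrics["meanCC"] > min_mean_cc
            and metrics["stdCC"] <= max_std_cc
            and metrics["SNR"] >= min_snr)


coherent = [
    [0.0, 2.0, 5.0, 1.0, 4.0, 0.0],
    [0.2, 2.1, 4.8, 1.1, 4.2, 0.1],
    [-0.1, 1.9, 5.1, 0.9, 3.9, 0.0],
]
heterogeneous = [
    [0, 2, 5, 1, 4, 0],
    [5, 1, 0, 4, 2, 3],
    [1, 4, 2, 0, 5, 3],
]
for name, group in (("coherent", coherent), ("heterogeneous", heterogeneous)):
    print(name, passes_quality(bargroup_metrics(group)))
```

Requiring all three criteria together guards against groups that look acceptable on average but contain outliers (stdCC) or whose consensus is nearly featureless relative to member noise (SNR).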
Barcodes that remained unassembled but exhibited high-confidence reference alignments were subsequently anchored to the theoretical genome map to maximize genome-wide representation.
Declarations
Acknowledgements
We thank Dr. Mikael Molin for providing the Saccharomyces cerevisiae strain BY4742 and for valuable advice on yeast cultivation. The nanofluidic devices used in this work were fabricated using the Chalmers MyFab cleanroom facility.
Funding sources
F.W. discloses support for the research of this work from the Swedish Childhood Cancer Fund (Barncancerfonden, MT2022-0003) and the Swedish Cancer Foundation (Cancerfonden, 24 3885 Pj). Work in the N.S. lab was supported by the Knut and Alice Wallenberg Foundation (KAW.2021.0173). T.A. and the rest of the authors declare no relevant funding.
Author Contributions
Conceptualization: L.M.L.G., G.G., F.W.
Methodology: L.M.L.G., A.D.
Software: L.M.L.G., A.D.
Validation: L.M.L.G.
Formal analysis: L.M.L.G.
Investigation: L.M.L.G., H.Z.
Resources: S.K.K., I.O., N.S.
Data curation: L.M.L.G., H.Z.
Writing – original draft: L.M.L.G., F.W.
Writing – review & editing: All authors
Visualization: L.M.L.G.
Supervision: F.W., T.A.
Project administration: L.M.L.G., F.W.
Funding acquisition: F.W.
Competing interests
The authors declare no competing interests.
Data and Code availability
The datasets generated and analyzed during this study, together with the custom MATLAB code used for barcode processing and assembly, are available to reviewers via a private Google Drive link provided at submission (https://drive.google.com/drive/folders/1aEM9YhjKS1UgG0qovO57dkADrqzxVKeG?usp=sharing). Upon acceptance, these materials will be made publicly available through a permanent repository (e.g. Zenodo) with assigned DOIs. Source data, including raw microscopy images, processed barcode data, and assembly outputs, are included in these deposits.
References
1. Liao X et al (2019) Current challenges and solutions of de novo assembly. Quantitative Biology 7:90–109. https://doi.org/10.1007/s40484-019-0166-9
2. Yuan Y, Chung CYL, Chan TF (2020) Advances in optical mapping for genomic research. Computational and Structural Biotechnology Journal 18:2051–2062. https://doi.org/10.1016/j.csbj.2020.07.018
3. Dai Y et al (2020) Single-molecule optical mapping enables quantitative measurement of D4Z4 repeats in facioscapulohumeral muscular dystrophy (FSHD). J Med Genet 57:109–120
4. van der Sanden B et al (2025) Optical genome mapping enables accurate testing of large repeat expansions. Genome Res 35:810–823
5. Weissensteiner MH et al (2017) Combination of short-read, long-read, and optical mapping assemblies reveals large-scale tandem repeat arrays with population genetic implications. Genome Res 27:697–708
6. Dremsek P et al (2021) Optical genome mapping in routine human genetic diagnostics: its advantages and limitations. Genes (Basel) 12
7. Grunwald A et al (2015) Bacteriophage strain typing by rapid single molecule analysis. Nucleic Acids Res 43
8. Park J et al (2019) Single-molecule DNA visualization using AT-specific red and non-specific green DNA-binding fluorescent proteins. Analyst 144:921–927
9. Lee S et al (2018) TAMRA-polypyrrole for A/T sequence visualization on DNA molecules. Nucleic Acids Res 46
10. Nyberg LK et al (2012) A single-step competitive binding assay for mapping of single DNA molecules. Biochem Biophys Res Commun 417:404–408
11. Nilsson AN et al (2014) Competitive binding-based optical DNA mapping for fast identification of bacteria: multi-ligand transfer matrix theory and experimental applications on Escherichia coli. Nucleic Acids Res 42
12. Müller V et al (2020) Cultivation-free typing of bacteria using optical DNA mapping. ACS Infect Dis 6:1076–1084
13. Dvirnas A et al (2025) DOGMA: de novo assembly of densely labelled optical DNA maps using a matrix profile approach. PLoS ONE 20:e0335633
14. Jeffet J, Margalit S, Michaeli Y, Ebenstein Y (2021) Single-molecule optical genome mapping in nanochannels: multidisciplinarity at the nanoscale. Essays in Biochemistry 65:51–66. https://doi.org/10.1042/EBC20200021
15. Hartley G, O’Neill RJ (2019) Centromere repeats: hidden gems of the genome. Genes 10. https://doi.org/10.3390/genes10030223
16. Klein SJ, O’Neill RJ (2018) Transposable elements: genome innovation, chromosome diversity, and centromere conflict. Chromosome Research 26:5–23. https://doi.org/10.1007/s10577-017-9569-5
17. Yang J, Li F (2017) Are all repeats created equal? Understanding DNA repeats at an individual level. Current Genetics 63:57–63. https://doi.org/10.1007/s00294-016-0619-x
18. Smirnov E, Chmúrčiaková N, Liška F, Bažantová P, Cmarko D (2021) Variability of human rDNA. Cells 10:1–14. https://doi.org/10.3390/cells10020196
19. Oizumi Y et al (2021) Complete sequences of Schizosaccharomyces pombe subtelomeres reveal multiple patterns of genome variation. Nat Commun 12
20. Egidi A, Di Felice F, Camilloni G (2020) Saccharomyces cerevisiae rDNA as super-hub: the region where replication, transcription and recombination meet. Cellular and Molecular Life Sciences 77:4787–4798. https://doi.org/10.1007/s00018-020-03562-3
21. D’Alfonso A, Micheli G, Camilloni G (2024) rDNA transcription, replication and stability in Saccharomyces cerevisiae. Seminars in Cell and Developmental Biology 159–160:1–9. https://doi.org/10.1016/j.semcdb.2024.01.004
22. James SA et al (2009) Repetitive sequence variation and dynamics in the ribosomal DNA array of Saccharomyces cerevisiae as revealed by whole-genome resequencing. Genome Res 19:626–635
23. Aquiles Sanchez J, Kim S-M, Huberman JA (1998) Ribosomal DNA replication in the fission yeast, Schizosaccharomyces pombe. Exp Cell Res 238
24. Sabouri N, McDonald KR, Webb CJ, Cristea IM, Zakian VA (2012) DNA replication through hard-to-replicate sites, including both highly transcribed RNA Pol II and Pol III genes, requires the S. pombe Pfh1 helicase. Genes Dev 26:581–593
25. Wood V et al (2002) The genome sequence of Schizosaccharomyces pombe. Nature 415:871–880
26. Cherry JM et al (2012) Saccharomyces Genome Database: the genomics resource of budding yeast. Nucleic Acids Res 40
27. Harris MA et al (2022) Fission stories: using PomBase to understand Schizosaccharomyces pombe biology. Genetics 220
28. Sela I, Wolf YI, Koonin EV (2016) Theory of prokaryotic genome evolution. Proc Natl Acad Sci USA 113:11399–11407
29. Pasero P, Marilley M (1993) Size variation of rDNA clusters in the yeasts Saccharomyces cerevisiae and Schizosaccharomyces pombe. Mol Gen Genet 236:448–452
30. Nurk S et al (2022) The complete sequence of a human genome. Science 376:44–53
31. Lander ES et al (2001) Initial sequencing and analysis of the human genome. Nature 409
32. Detinis Zur T et al (2025) Single-molecule toxicogenomics: optical genome mapping of DNA-damage in nanochannel arrays. DNA Repair (Amst) 146
33. Müller V et al (2019) Enzyme-free optical DNA mapping of the human genome using competitive binding. Nucleic Acids Res 47
34. Persson F, Tegenfeldt JO (2010) DNA in nanochannels-directly visualizing genomic information. Chem Soc Rev 39:985–999
35. Frykholm K, Müller V, Sriram KK, Dorfman KD, Westerlund F (2022) DNA in nanochannels: theory and applications. Quarterly Reviews of Biophysics 55. https://doi.org/10.1017/S0033583522000117
36. Sriram KK, Persson F et al (2024) Fluorescence microscopy of nanochannel-confined DNA. In: Single Molecule Analysis: Methods and Protocols (eds Heller I, Dulin D et al), 175–202. Springer US, New York, NY. https://doi.org/10.1007/978-1-0716-3377-9_9
37. Nyblom M et al (2023) Strain-level bacterial typing directly from patient samples using optical DNA mapping. Commun Med 3
38. Dvirnas A, Lin Y-L (2021) lldev.v.0.5.3. https://doi.org/10.5281/zenodo.5718208
39. Yeh C-CM et al (2016) Matrix Profile I: All Pairs Similarity Joins for Time Series: A Unifying View that Includes Motifs, Discords and Shapelets. In: IEEE 16th International Conference on Data Mining (ICDM), Barcelona, 1317–1322. https://doi.org/10.1109/ICDM.2016.0179
Supplementary Files
YeastassemblySIsubmission.docx: Supplementary Information
SupplementaryDatacode.zip: Analysis Code
Figures
Figure 1. Workflow overview. a CB-labeled DNA molecules are stretched in nanochannels, imaged and detected (red box) to generate continuous fluorescence intensity profiles, referred to as barcodes. b Extracted barcodes undergo quality control and repetitive region (red) detection. c Multi-step assembly with the DOGMA pipeline generates barcode contigs (bargroups, gray) from individual barcodes (blue). d Bargroups (colored) are aligned to reference genomes (gray) for validation and structural comparison. Roman numerals indicate chromosome numbers. e Consensus (black) is generated from bargroups and aligned to each reference chromosome (gray). Segments present in the reference but not in the consensus are highlighted in yellow and segments present in the consensus but not in the reference in blue.
Figure 2. Alignment to reference genomes. Alignment of all barcodes to the reference genomes of a S. cerevisiae BY4742 and b S. pombe 972h-, with chromosomes indicated as Roman numerals. Matches to the correct species are represented in blue and to the incorrect species in yellow. The theoretical optical maps are shown in gray. Black lines represent coverage depth. Insets highlight the most enriched genomic loci for each organism.
Figure 3. Challenges in eukaryotic dense OGM assembly. a Representative barcodes (colored) spanning entire reference chromosomes (gray) in S. cerevisiae. b.i, b.ii Examples of partially repetitive barcodes (colored) aligned to the reference genome (gray) before (b.i) and after (b.ii) masking of repetitive regions. The consensus barcode of each alignment is shown in black for comparison. c Theoretical barcodes for the three S. pombe chromosomes (gray) and their smoothed trends (black). Orange boxes mark centromeres, showing conserved long-range low-intensity motifs, and blue boxes indicate regions with intensity profile gradients due to extended GC-content gradients toward chromosome ends. d Percentage of barcodes aligning to the correct organism (solid lines), percentage of retained barcodes (dashed lines) and percentage of estimated coverage (dotted lines) after using a local barcode-to-reference alignment of repeat-masked barcodes under varying thresholds of ∆CC (difference in CC-value between the first- and second-best match for each molecule) for S. cerevisiae (orange) and S. pombe (purple).
Figure 4. Assembly of bargroups. a Example bargroup assembled without masking (a.i) and after masking of repetitive regions (a.ii). Constituent barcodes are shown in color and the resulting bargroup consensus in black. b Examples of bargroups illustrating different signal-to-noise characteristics: b.i a bargroup that fails the SNR criterion (SNR=0.9) and b.ii a bargroup that passes the SNR criterion (SNR=1.4). c Example of assembled bargroups aligned to the reference genome. Consensus bargroups (black) constructed from multiple barcode members (blue) are aligned post-assembly to their best matching regions along the theoretical reference barcodes (gray) in chromosomes I, I and II of S. pombe (from top to bottom, respectively).
Figure 5. Full genome assemblies. a All assembled S. cerevisiae chromosomes (black) aligned to the reference (gray). b All assembled S. pombe chromosomes (black) aligned to the reference (gray). Chromosomal extensions relative to the reference are highlighted in blue and regions not covered by the assembly are highlighted in yellow. a and b have different size scales for their x-axes.
Since the fluorescence signal in CB-based OGM arises from YOYO-1 labeling of the DNA backbone, a dye already widely used in OGM and single-molecule assays\u003csup\u003e\u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e\u003c/sup\u003e, CB can be implemented without major modifications to existing instrumentation. This compatibility creates opportunities to integrate CB-based OGM with complementary labeling strategies, enabling multimodal analysis of genomic features along individual DNA molecules.\u003c/p\u003e \u003cp\u003eTo exploit the full potential of CB-based OGM, we have recently developed DOGMA (Dense Optical Genome Mapping Assembly), a computational pipeline designed for \u003cem\u003ede novo\u003c/em\u003e assembly of genomes using CB-derived barcodes\u003csup\u003e\u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e\u003c/sup\u003e. DOGMA was initially validated on \u003cem\u003eEscherichia coli\u003c/em\u003e, a bacterial genome characterized by limited redundancy, low large-scale compositional variation, and relatively uniform sequence organization, providing an ideal test for overlap accuracy and assembly contiguity.\u003c/p\u003e \u003cp\u003eExtending DOGMA to eukaryotic genomes introduces a complex set of challenges. In contrast to \u003cem\u003eE. coli\u003c/em\u003e, even relatively small eukaryotic genomes contain complex features, including extensive repetitive regions and large-scale compositional domains that severely impede both sequence- and OGM-based assembly methods\u003csup\u003e\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e,\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e\u003c/sup\u003e. Repetitive elements often exceed sequencing read length and occur at multiple loci, making it difficult to assign reads or barcodes to unique genomic positions. 
Large-scale compositional domains, such as extended AT-/GC-rich regions or gradual changes in AT/GC content, further reduce barcode uniqueness by producing similar intensity profiles across long genomic regions. In OGM, both types of features can generate similar barcode patterns at distant loci, leading to ambiguous placement during alignment and assembly. Examples of these features include rDNA arrays, centromeres, telomeres, and subtelomeric regions, which are structurally complex, highly repetitive, and often appear at multiple loci in a genome\u003csup\u003e\u003cspan additionalcitationids=\"CR17 CR18\" citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e19\u003c/span\u003e\u003c/sup\u003e. As a result, these regions are systematically underrepresented or misassembled in reference genomes.\u003c/p\u003e \u003cp\u003eIn this study, we extend DOGMA to assembly of eukaryotic genomes using \u003cem\u003eSaccharomyces cerevisiae\u003c/em\u003e and \u003cem\u003eSchizosaccharomyces pombe\u003c/em\u003e as model systems. These two yeasts provide complementary test cases for CB-based OGM due to their compact, yet structurally rich, genomes and distinct chromosomal architectures. In \u003cem\u003eS. cerevisiae\u003c/em\u003e, regions such as the rDNA array on chromosome XII are associated with genome stability, cellular aging, and nucleolar organization and are notoriously difficult to resolve\u003csup\u003e\u003cspan additionalcitationids=\"CR21\" citationid=\"CR20\" class=\"CitationRef\"\u003e20\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e22\u003c/span\u003e\u003c/sup\u003e. Likewise, in \u003cem\u003eS. 
pombe\u003c/em\u003e, similar arrays located near both ends of chromosome III pose challenges during DNA replication and remain incompletely resolved by sequencing-based approaches, consistent with their repetitive structure\u003csup\u003e\u003cspan citationid=\"CR23\" class=\"CitationRef\"\u003e23\u003c/span\u003e,\u003cspan citationid=\"CR24\" class=\"CitationRef\"\u003e24\u003c/span\u003e\u003c/sup\u003e. Similar repetitive structures are widespread in larger eukaryotic genomes.\u003c/p\u003e \u003cp\u003eDOGMA is here used to generate \u003cem\u003ede novo\u003c/em\u003e OGM assemblies of eukaryotic genomes that capture key structural features, including telomeric regions and large repetitive arrays, while explicitly examining regions where assembly contiguity breaks down. Reference-based analyses are used for validation and biological interpretation, without guiding the assembly process itself. These results demonstrate the feasibility of dense OGM assembly in eukaryotes and the biologically driven challenges that must be addressed when extending OGM-based assembly methods beyond bacterial genomes.\u003c/p\u003e"},{"header":"Results","content":"\u003cp\u003eTo perform OGM, we stained genomic DNA samples and stretched them by confinement in nanofluidic devices. Details of the experimental and computational workflows are described in Materials and Methods. In brief, DNA molecules were extracted from yeast cells enclosed in agarose plugs, labeled by competitive binding of YOYO-1 and netropsin, extended in nanochannels, and imaged by fluorescence microscopy, producing continuous intensity profiles along individual DNA molecules. Images were then processed to extract one-dimensional fluorescence intensity profiles, referred to as barcodes \u003cb\u003e(\u003c/b\u003eFig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003ea\u003cb\u003e)\u003c/b\u003e. 
The extracted barcodes underwent quality control and repetitive region screening prior to assembly \u003cb\u003e(\u003c/b\u003eFig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003eb\u003cb\u003e)\u003c/b\u003e. Assembly was carried out using a multi-step implementation of the DOGMA pipeline\u003csup\u003e\u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e\u003c/sup\u003e, grouping barcodes into bargroups (contiguous optical maps created from barcodes) based on similarity and contextual consistency \u003cb\u003e(\u003c/b\u003eFig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003ec\u003cb\u003e)\u003c/b\u003e. Assembled bargroups were then aligned to reference genomes for validation and structural comparison \u003cb\u003e(\u003c/b\u003eFig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003ed\u003cb\u003e)\u003c/b\u003e. Finally, consensus bargroups were generated and compared to each reference chromosome \u003cb\u003e(\u003c/b\u003eFig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003ee\u003cb\u003e)\u003c/b\u003e.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eReference-based alignment\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eA total of 1,102 \u003cem\u003eS. pombe\u003c/em\u003e and 551 \u003cem\u003eS. cerevisiae\u003c/em\u003e CB-labeled DNA molecules were imaged and converted into barcodes. To evaluate genome-wide coverage prior to the \u003cem\u003ede novo\u003c/em\u003e assembly, the barcodes were aligned to theoretical reference barcodes derived from the corresponding reference genomes (see Methods)\u003csup\u003e25,26\u003c/sup\u003e. The alignment was performed\u0026nbsp;using a full-length alignment strategy, in which each query barcode is aligned in its entirety to the reference, with the Pearson cross-correlation (CC-value) scoring method (\u003cstrong\u003eFig. 
2\u003c/strong\u003e, see Methods for details).\u003c/p\u003e\n\u003cp\u003eGenome-wide alignments were obtained for all three \u003cem\u003eS. pombe\u003c/em\u003e chromosomes and all sixteen \u003cem\u003eS. cerevisiae\u003c/em\u003e chromosomes \u003cstrong\u003e(Fig. 2)\u003c/strong\u003e. Reference-based coverage exceeded 98% for both genomes, with mean coverage depths of 29.5x for \u003cem\u003eS. pombe\u003c/em\u003e and 16.4x for \u003cem\u003eS. cerevisiae\u003c/em\u003e, indicating that the dataset provides near-complete genome representation for downstream assembly.\u003c/p\u003e\n\u003cp\u003eAlthough coverage was generally uniform, several genomic regions, particularly in \u003cem\u003eS. pombe\u003c/em\u003e, displayed elevated coverage. Given the random sampling of DNA inherent to OGM, pronounced local enrichment is not expected. Elevated coverage is more likely to reflect alignment ambiguity, where barcodes from distinct loci exhibit high full-length similarity and are therefore assigned to alternative genomic positions. Such ambiguity can arise when regions at different loci share similar intensity profiles, or when sequences are incompletely represented in the reference assembly, leading to misplacement.\u003c/p\u003e\n\u003cp\u003eA complementary pattern of reduced coverage was observed in specific regions of the \u003cem\u003eS. cerevisiae\u003c/em\u003e genome. This pattern is consistent with a limitation of the full-length alignment strategy, which requires query barcodes to be shorter than the reference. Molecules approaching or exceeding chromosome length are therefore unable to align optimally, and are instead placed at alternative loci, contributing both to apparent depletion at their true origin and enrichment elsewhere.\u003c/p\u003e\n\u003cp\u003eAs an additional specificity control, barcodes from each species were aligned against the reference genome of the other species \u003cstrong\u003e(Fig.
2)\u003c/strong\u003e. While a majority of the barcodes (76.4%) aligned preferentially to their correct species, a subset exhibited comparable or higher global similarity to regions in the reference of the other species. After inspecting cross-species matches (\u003cstrong\u003eFig. 2\u003c/strong\u003e, insets), we noticed that they occurred mainly around specific loci for each genome, creating alignment ambiguity hotspots. We also observed that cross-species matches were enriched for barcodes characterized by either long-range (200-300 kbp) low-frequency intensity trends or extended repetitive regions, both of which reduce barcode uniqueness and increase alignment ambiguity.\u003c/p\u003e\n\u003cp\u003eOverall, the reference-based analysis indicated that global alignment provides reliable genome-wide coverage estimates, but that it is susceptible to alignment ambiguity under specific structural conditions. To clarify the origin of these effects and assess their impact on assembly, we next examined experimental and theoretical barcode properties influencing misalignment prior to initializing the\u0026nbsp;\u003cem\u003ede novo\u003c/em\u003e assembly.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eStructural features driving alignment ambiguity in eukaryotic OGM\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eAnalysis of the barcode dataset revealed three features that strongly influence similarity scoring and assembly behavior: chromosome-spanning molecules, repetitive barcode components, and long-range intensity trends. Each of these properties can affect similarity scores and influence the grouping during assembly.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e\u003cem\u003eChromosome-spanning molecules are captured by OGM\u003c/em\u003e\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eUsing the full-length alignment framework, molecules exceeding the theoretical chromosome length cannot be properly aligned, leading to misplacement or exclusion. 
This effect was particularly evident for the smaller chromosomes in \u003cem\u003eS. cerevisiae\u003c/em\u003e, where full-length molecules exceed the corresponding theoretical barcode length.\u003c/p\u003e\n\u003cp\u003eIndividual barcodes, up to 800 kbp in size, fully spanned ten of the sixteen \u003cem\u003eS. cerevisiae\u003c/em\u003e chromosomes (examples in \u003cstrong\u003eFig. 3a\u003c/strong\u003e), demonstrating the presence of chromosome-scale molecules in the dataset. This observation highlights the capacity of CB-based OGM to capture entire eukaryotic chromosomes within single intact DNA molecules. For sufficiently small chromosomes, and given limited DNA fragmentation, assembly may therefore be simplified through local alignment (see Methods), which identifies the highest-scoring subregions between barcode and reference without requiring full-length matching.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e\u003cem\u003eRepetitive barcode components cause alignment ambiguity\u003c/em\u003e\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eA subset of barcodes aligned broadly across multiple genomic loci and was found to be enriched in repetitive features. The identified repetitive units measured approximately 9 kbp and 10.5 kbp, consistent with known repetitive regions in \u003cem\u003eS. cerevisiae\u003c/em\u003e and \u003cem\u003eS. pombe\u003c/em\u003e, respectively. This repetitive DNA is known to be underrepresented or collapsed to just two copies per locus in current reference assemblies\u003csup\u003e26,27\u003c/sup\u003e.\u003c/p\u003e\n\u003cp\u003eWhen repetitive regions are not masked, their repetitive components dominate the alignment, producing visible misalignments to incorrect loci \u003cstrong\u003e(Fig. 3b.i)\u003c/strong\u003e.
To address this issue, and to incorporate repetitive DNA into the alignment and assembly more effectively, we developed a repetitive-element detection algorithm (see Supplementary Methods 2) that identifies and masks repetitive elements. The algorithm also classifies barcodes as fully repetitive, partially repetitive (when a subregion of the barcode has at least five contiguous repetitive units), and non-repetitive.\u003c/p\u003e\n\u003cp\u003eAfter masking the repetitive segments, the remaining non-repetitive portion of the barcode can be used to determine the correct genomic locus. As illustrated in \u003cstrong\u003eFig. 3b.ii\u003c/strong\u003e, masking resolves the ambiguity and enables consistent alignment of the same molecules to a single genomic locus. Since the pair-wise similarity scoring of the assembly process is similar to the scoring system from the alignment process, it is expected that the resolution of alignment ambiguity via masking of repetitive regions can improve the bargroup formation during DOGMA assembly as well.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e\u003cem\u003eLong-range GC-dependent intensity gradients influence barcode similarity\u003c/em\u003e\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe reference-based analysis revealed recurrent cases in which barcodes from distinct genomic loci exhibited high global similarity. Inspection of these barcodes suggested the presence of gradual, long-range intensity variations in addition to the typical CB-based OGM barcode features. To better understand the origin of this behavior, we searched for long-range intensity trends in the theoretical reference barcodes.\u003c/p\u003e\n\u003cp\u003eSmoothing of theoretical reference barcodes using a moving-average filter revealed pronounced long-range intensity trends spanning hundreds of kilobases in several locations \u003cstrong\u003e(Fig. 
3c)\u003c/strong\u003e, reflecting genome-scale compositional domains that can be detected by dense CB-based labeling. In \u003cem\u003eS. pombe\u003c/em\u003e, these features were particularly evident, with gradually decreasing GC-content extending ~300 kbp toward chromosome ends and characteristic low-GC motifs spanning ~200 kbp at centromeric regions. While local barcode structure is defined by short-scale intensity variations (5\u0026ndash;20 kbp), these long-range trends introduce an underlying \u0026ldquo;low-frequency\u0026rdquo; barcode component across much larger genomic distances.\u003c/p\u003e\n\u003cp\u003eConsistent with this organization, experimental barcodes sharing similar long-range intensity trends frequently exhibited ambiguous alignments, including preferential placement at incorrect loci or, in some cases, within the reference genome of the other species \u003cstrong\u003e(Fig. 2)\u003c/strong\u003e.\u003c/p\u003e\n\u003cp\u003eThese results demonstrate that CB-based OGM captures biologically meaningful genome-scale compositional information that is largely inaccessible to sparse labeling approaches that just label specific motifs. While these long-range intensity trends pose challenges for similarity-based alignment and assembly, they also provide direct insight into large-scale chromosomal organization, highlighting both a limitation and a unique strength of dense OGM.\u003c/p\u003e\n\u003cp\u003eTo validate that the elevated coverage observed in \u003cstrong\u003eFig. 2\u003c/strong\u003e arises from intrinsic genome-scale signal organization, rather than coverage limitations, intragenomic self-similarity analysis was performed across the \u003cem\u003eS. pombe\u003c/em\u003e chromosomes (Supplementary Fig. 4). Chromosomal regions \u0026ge;150 kbp exhibiting CC-value \u0026ge;0.75 were considered highly similar. 
This analysis revealed discrete similarity \u0026ldquo;hotspots\u0026rdquo; shared between chromosomes, consistent with shared long-range signal features captured in CB-based OGM data \u003cstrong\u003e(Fig. 3c)\u003c/strong\u003e. These long-range features can dominate global similarity measures despite divergent local structure, providing a mechanistic explanation for the coverage inflation and reduced alignment confidence observed in the reference-based analysis.\u003c/p\u003e\n\u003cp\u003eTogether, these findings indicate that structural properties intrinsic to eukaryotic genomes, rather than insufficient coverage, represent the primary constraints on assembly using dense OGM.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e\u003cem\u003eMolecule characteristics before and after quality control processing\u003c/em\u003e\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eMean molecule lengths prior to quality control (QC) were 337.8 kbp for \u003cem\u003eS. pombe\u0026nbsp;\u003c/em\u003eand 359.7 kbp for \u003cem\u003eS. cerevisiae\u003c/em\u003e. The QC procedures included stitching molecules spanning multiple imaging fields and masking barcode regions dominated by repetitive features (see Methods). Repetitive segments were masked rather than removed, preserving the non-repetitive parts for downstream analysis. Following QC, the mean molecule lengths remained largely unchanged (337.6 kbp for \u003cem\u003eS. pombe\u003c/em\u003e and 366.0 kbp for \u003cem\u003eS. cerevisiae)\u003c/em\u003e, indicating that filtering primarily addressed problematic features without introducing size-related bias.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eA second reference-based alignment was performed to evaluate the QC (Supplementary Fig. 3). This alignment implemented local barcode-to-reference matching to accommodate chromosome spanning molecules, repetitive masking to account for repetitive regions, and a \u0026Delta;CC threshold to filter barcodes with alignment ambiguity. 
Increasing the \u0026Delta;CC threshold improved organism-specific alignment accuracy while reducing the fraction of retained barcodes, illustrating the trade-off between specificity and molecule retention \u003cstrong\u003e(Fig. 3d)\u003c/strong\u003e. An optimal threshold of \u0026Delta;CC = 0.03 maximized correct alignments while maintaining a near complete estimated coverage of both genomes.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eBargroup formation and consensus barcode construction\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eHaving characterized the structural features influencing similarity metrics, we next assembled QC-filtered and repeat-masked barcodes into bargroups using the DOGMA pipeline. Repetitive elements were identified and masked from barcodes prior to assembly. Assembly without masking produced bargroups containing misaligned repetitive and non-repetitive segments \u003cstrong\u003e(Fig. 4a.i)\u003c/strong\u003e. Masking allowed the non-repetitive parts of barcodes to drive the assembly, while still maintaining the repetitive content for post-assembly analysis. This strategy enabled correct assembly of molecules that would otherwise be mis-grouped or left ungrouped, while also resolving incorrect groupings driven by repetitive segments \u003cstrong\u003e(Fig. 4a.ii)\u003c/strong\u003e.\u003c/p\u003e\n\u003cp\u003eTo mitigate grouping driven primarily by shared long-range intensity trends rather than local structural similarity, we introduced a bargroup scoring metric based on signal-to-noise characteristics (see Methods). This metric prioritizes bargroups with well-defined local features consistent among their constituting barcodes \u003cstrong\u003e(Fig. 4b.i)\u003c/strong\u003e and penalizes bargroups dominated by smooth, low-frequency signal components with constituting barcodes that differ in their short-frequency features \u003cstrong\u003e(Fig. 
4b.ii)\u003c/strong\u003e.\u003c/p\u003e\n\u003cp\u003eBargroup assembly applied pairwise similarity constraints incorporating limited stretching variation, positional tolerance in pixel coordinates, and statistical thresholds on correlation significance (see Methods). The assembled bargroups were subsequently aligned to the corresponding reference genomes using a local bargroup-to-reference alignment strategy to assess genome-wide coverage and contiguity. Each bargroup consensus was represented as an averaged barcode intensity profile derived from its constituent barcode members \u003cstrong\u003e(Fig. 4c)\u003c/strong\u003e.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eOverall, approximately 20% and 30% of the barcodes used as input for the bargrouping process contributed to bargroup formation from \u003cem\u003eS. pombe\u003c/em\u003e and \u003cem\u003eS. cerevisiae\u003c/em\u003e, respectively, while the rest remained ungrouped despite passing QC. Ungrouped molecules were retained for downstream coverage analysis and gap resolution.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eGenome-wide assemblies reveal chromosomal extensions\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eGenome-wide coverage by assembled bargroups closely mirrored the high reference-alignment coverage observed prior to assembly, indicating that most chromosomal regions were represented within contiguous optical maps. Across multiple chromosomes, bargroups spanned megabase-scale intervals, demonstrating robust large-scale assembly performance.\u003c/p\u003e\n\u003cp\u003eFollowing initial assembly, bargroups were aligned to the reference genomes using a local bargroup-to-reference alignment strategy. Adjacent bargroups with consistent placements were subsequently merged based on local bargroup-to-bargroup similarity to generate chromosome-scale assemblies. 
The resulting assemblies retained only bargroups satisfying the meanCC, stdCC, and SNR quality criteria defined during bargroup formation (Supplementary Table 3).\u003c/p\u003e\n\u003cp\u003eThe final assemblies generated for both \u003cem\u003eS. pombe\u003c/em\u003e and \u003cem\u003eS. cerevisiae\u003c/em\u003e included highly repetitive regions typically underrepresented in sequence-based assemblies \u003cstrong\u003e(Fig. 5)\u003c/strong\u003e. Some of the assembled chromosomes show minor extensions (\u0026gt;6 kbp) relative to the reference chromosomes at one or both chromosome ends. These can be explained as a product of our theoretical reference model, which omits the intensity of the first and last 6 pixels, assuming them to be uncertain.\u003c/p\u003e\n\u003cp\u003eLocal breaks in contiguity were observed in \u003cem\u003eS. pombe\u0026nbsp;\u003c/em\u003echromosome I from position 4466 kbp to 4787 kbp and chromosome II from position 3208 kbp to 3211 kbp (Supplementary Fig. 5). To distinguish assembly limitations from absence of coverage, we examined high-confidence barcodes that aligned strongly to the reference, but did not group into bargroups. Inclusion of these molecules revealed that regions lacking assembled contigs were nevertheless supported by individual barcodes, resulting in near-complete genome-wide molecular representation.\u003c/p\u003e\n\u003cp\u003eLarge repetitive structures were resolved in both organisms. In \u003cem\u003eS. cerevisiae\u003c/em\u003e, a large repetitive block interrupted the reference assembly of chromosome XII near 479 kbp. This region extended \u0026ge;550 kbp and consisted of ~9 kbp repeat units, corresponding to \u0026ge;60 tandem copies. In \u003cem\u003eS. pombe\u003c/em\u003e, extended repeat arrays were observed at both ends of chromosome III beyond the length represented in the current reference assembly, spanning at least 1 Mb in total (\u0026ge;500 kbp from each telomeric end).
Individual repeat units measured ~10.5 kbp, consistent with rDNA repeats\u003csup\u003e19,27\u003c/sup\u003e and indicating the presence of \u0026ge;50 tandem copies at each end.\u003c/p\u003e\n\u003cp\u003eStructural variants (SVs) in the form of smaller insertions were also detected, for example in \u003cem\u003eS. cerevisiae\u003c/em\u003e chromosomes VIII and III. We were able to manually curate a ~30 kbp insertion in chromosome VIII (position 206 kbp). A similar-sized insertion on chromosome III between positions 100 kbp and 200 kbp was also identified, but its insertion point was not precisely determined. SV detection is not yet integrated into our pipeline, which makes pinpointing exact insertion points non-trivial.\u003c/p\u003e"},{"header":"Discussion","content":"\u003cp\u003eThis work demonstrates that dense optical genome mapping (OGM) assembly using the DOGMA pipeline can be extended from bacterial to eukaryotic genomes. In doing so, it reveals constraints and opportunities that are largely invisible in simpler systems, such as bacterial genomes, and that require targeted methodological adaptations. Applying DOGMA to \u003cem\u003eS. cerevisiae\u003c/em\u003e and \u003cem\u003eS. pombe\u003c/em\u003e demonstrates that OGM-based assembly of eukaryotic genomes is constrained less by genome size and more by repeat composition and long-range genome organization.\u003c/p\u003e\n\u003cp\u003eFor \u003cem\u003eE. coli\u003c/em\u003e, DOGMA achieved nearly complete genome reconstruction with minimal preprocessing, reflecting the relative simplicity and low redundancy in bacterial genomes\u003csup\u003e28\u003c/sup\u003e. The transition to \u003cem\u003eS. cerevisiae\u003c/em\u003e and \u003cem\u003eS. pombe\u003c/em\u003e revealed a fundamentally different assembly landscape.
Yeast genomes introduced an increase in barcode ambiguity due to repetitive genomic features such as rDNA arrays and subtelomeric repeats\u003csup\u003e21,22,25,29\u003c/sup\u003e.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eBy explicitly identifying and isolating repetitive signal components, DOGMA can assemble repeat-rich regions without collapsing non-adjacent loci and subsequently reintegrate them using unique flanking context. Assembly of these repetitive regions with OGM is limited only by the size of the isolated DNA molecules, which must be long enough to span both the repetitive region and its adjacent non-repetitive loci. This demonstrates that repeat content is not a fundamental limitation of dense optical genome assembly; rather, the critical factor is the ability to detect, classify, and appropriately weight repetitive and long-range signal features during assembly.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eAnother key insight of this study is that long-range intensity trends, spanning hundreds of kilobases, contribute substantially to barcode similarity across distinct loci. In \u003cem\u003eS. pombe\u003c/em\u003e, these trends manifest as chromosome GC gradients and centromeric low-GC domains, which are consistently detectable in both theoretical and experimental barcodes. While such trends can bias global similarity metrics and promote spurious alignments, they also reflect large-scale chromosomal organization that is inaccessible to sparse-label OGM approaches. Accounting for this signal component will therefore be essential for future OGM assembly frameworks.\u003c/p\u003e\n\u003cp\u003eTogether, these findings position dense OGM as both an assembly tool and a probe of large-scale genome organization.
While further methodological refinement will be required to fully exploit long-range signal information in even more complex genomes, the complete yeast assemblies presented here establish an important foundation for extending dense OGM to larger eukaryotic systems and to biologically and clinically relevant repetitive loci.\u003c/p\u003e\n\u003cp\u003eFrom a methodological perspective, the combination of repeat-aware masking and characterization of long-range intensity profiles provides a generalizable strategy for extending dense, CB-based OGM beyond yeast. Many of the challenges encountered here, such as repetitive arrays and long-range compositional structure, are even more common in mammalian genomes; in the human genome, repetitive regions constitute over half of the sequence and remain systematically underrepresented or collapsed in many reference assemblies\u003csup\u003e30,31\u003c/sup\u003e. Despite recent advances, such as telomere-to-telomere assemblies, large repetitive loci remain difficult to characterize robustly with sequencing alone, particularly when copy number, structural heterogeneity, or long-range organization is biologically relevant\u003csup\u003e30\u003c/sup\u003e. In this context, dense OGM offers a complementary view, capturing continuous intensity profiles over hundreds of kilobases that are orthogonal to sequence-based information. While evaluation in human genomes lies beyond the scope of this study, the framework presented here suggests a path toward resolving medically relevant repeat structures, including large satellite arrays, rDNA clusters, and disease-associated macrosatellites\u003csup\u003e3,17\u003c/sup\u003e.\u003c/p\u003e\n\u003cp\u003eAn additional advantage of CB-based, dense OGM is that the intensity variation signal arises from DNA labeling with YOYO-1, a fluorophore already used in many OGM and single-molecule assays simply to visualize the DNA backbone. 
Therefore, the incorporation of competitive binding does not require additional excitation sources, emission filters, or major instrumentation changes, making this signal layer effectively available on existing platforms. This creates opportunities to combine CB-based OGM with other labeling strategies, such as enzymatic motif labeling, damage mapping\u003csup\u003e32\u003c/sup\u003e, or epigenetic assays\u003csup\u003e14\u003c/sup\u003e, while retaining long-range genomic context. In this way, CB-based OGM is not limited to structural reconstruction, but can serve as a scaffold for integrating multiple molecular readouts along individual DNA molecules.\u003c/p\u003e\n\u003cp\u003eIn conclusion, this study demonstrates that dense, continuous optical genome mapping can be extended from bacterial to eukaryotic genomes when eukaryotic structural constraints are appropriately considered. By combining repeat-aware masking with local alignment assembly, DOGMA reconstructs genome-wide optical maps in yeast without collapsing non-adjacent loci. Importantly, CB-based OGM preserves large-scale compositional features of chromosomes, enabling analysis of genomic structure beyond strictly local sequence similarity. Together, these results establish DOGMA as a robust framework for eukaryotic optical genome assembly and lay the groundwork for applying dense OGM to increasingly complex genomes.\u003c/p\u003e"},{"header":"Methods","content":"\u003cp\u003e\u003cstrong\u003eYeast cultivation and DNA extraction\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eExperiments were performed using \u003cem\u003eSaccharomyces cerevisiae\u003c/em\u003e strain BY4742 and \u003cem\u003eSchizosaccharomyces pombe\u003c/em\u003e strain NS112 (\u003cem\u003eleu1-32 his3-D1 ura4-D18 ade6-M210\u003c/em\u003e). The protocol for DNA extraction was adapted from previous work\u003csup\u003e33\u003c/sup\u003e. In short, \u003cem\u003eS. 
cerevisiae\u003c/em\u003e was grown overnight and used to inoculate a culture that was grown at 30\u0026deg;C for 6 to 8 hours to reach an OD600 of 0.1. Afterwards, 6 x 10\u003csup\u003e8\u003c/sup\u003e cells were harvested and frozen at -20\u0026deg;C until the DNA extraction was performed. For \u003cem\u003eS. pombe\u003c/em\u003e\u003csup\u003e34\u003c/sup\u003e, 50 mL of exponentially growing cells at an initial concentration of 0.5 million cells/mL were grown in Edinburgh minimal medium for 16 h at 30\u0026deg;C until they reached 14 million cells/mL. Cells were harvested by centrifugation at 3000 rpm for 5 min and washed twice with cold 50 mM EDTA. Cells were frozen at -20\u0026deg;C prior to DNA extraction. DNA from both yeast species was extracted using the CHEF Yeast Genomic DNA Plug Kit (Bio-Rad). In short, 1 mL of agarose plugs was prepared for each strain by combining 6 x 10\u003csup\u003e8\u003c/sup\u003e cells with 625 \u0026micro;L CSB (Cell Suspension Buffer) and 375 \u0026micro;L 2% CCA (CleanCut agarose). Agarose plugs were then treated with Lyticase, RNase, and Proteinase K, and washed several times according to the CHEF Yeast Genomic DNA Plug Kit protocol.\u003c/p\u003e\n\u003cp\u003eDNA was then extracted from the agarose plugs. The plugs were first incubated in 1x CutSmart (New England Biolabs) solution at 70\u0026deg;C for 10 minutes, then at 42\u0026deg;C for 10 minutes. This was followed by a 42\u0026deg;C incubation with 2 U agarase (New England Biolabs) for 2 hours. 
DNA was then quantified using the Qubit dsDNA Broad Range Assay kit (Invitrogen).\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eCB-based OGM lab procedures\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eDensely labelled DNA molecules were obtained by a single-step competitive binding (CB) based staining reaction, as described previously\u003csup\u003e33\u003c/sup\u003e. The CB staining reactions were prepared in 10 \u0026micro;L of 0.5X TBE (Tris-borate-EDTA) containing 4-6 \u0026micro;M DNA (in bases), 0.4-0.6 \u0026micro;M YOYO-1, and 120-180 \u0026micro;M netropsin, in a 10:1:300 ratio. DNA from bacteriophage \u0026lambda; (48,502 bp; Roche) was added to make up ~30% of the DNA mass in the staining reactions, to serve as an internal size reference. Finally, the reactions were incubated at 50\u0026deg;C for 30 minutes and then diluted 15-fold with MQ water and 3 \u0026micro;L of \u0026beta;-mercaptoethanol (BME), added to reduce photodamage.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eNanofluidic experiments\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eStained DNA molecules were stretched for imaging using a nanofluidic device whose fabrication is described elsewhere\u003csup\u003e34\u0026ndash;36\u003c/sup\u003e. These devices have two sets of two loading wells, where each set is connected via a separate microchannel. The two microchannels are connected to each other by 120 parallel nanochannels, each having a cross section of 100 x 150 nm\u003csup\u003e2\u003c/sup\u003e and extending 500 \u0026micro;m in length. To ensure uniform conditions, the chips were pre-conditioned with 0.033X TBE buffer supplemented with 2% BME v/v prior to sample loading. DNA was loaded by adding 10 \u0026micro;L of stained sample into one of the loading wells, while the others contained the same buffer used for pre-conditioning, but no DNA. 
DNA molecules were then driven first into the microchannels and then into the nanochannels by pressure-driven N\u003csub\u003e2\u003c/sub\u003e flow. Stretched and stained DNA molecules in the nanochannels were then imaged using an inverted fluorescence microscope (Zeiss Axio Observer D.1) equipped with a 63x oil immersion objective (NA = 1.46, Zeiss), a Colibri 7 light source (Zeiss), and a Photometrics Evolve EMCCD camera. Videos of up to 100 frames were acquired for each molecule, using an exposure time of 100 ms and a FITC filter set (Zeiss).\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eBarcode processing\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eBarcodes were generated from videos using a previously developed custom MATLAB software package called \u0026lsquo;lldev\u0026rsquo;\u003csup\u003e37,38\u003c/sup\u003e. In short, DNA molecules are detected in each time frame within a video. The detected DNA molecules are then turned into kymographs, where each row of pixels represents the intensity along the DNA molecule in a single frame. The kymographs are aligned to compensate for small thermal fluctuations of the DNA inside the nanochannels. Aligned kymographs are then processed by a custom quality control (QC) algorithm to merge molecules spanning multiple fields of view (multi-FoV) and to remove molecules that show a significant change in size during imaging (see Supplementary Methods 1). Finally, the filtered and aligned kymographs are time-averaged, rendering a time series-like 1D intensity trace along the extension of the DNA molecule, referred to as a barcode.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eReference alignment\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eReference genome sequences were used as an input to previously established methods\u003csup\u003e11\u003c/sup\u003e to \u003cem\u003ein silico\u003c/em\u003e generate theoretical reference barcodes that can be used for alignment and coverage analysis. 
For \u003cem\u003eS. cerevisiae\u003c/em\u003e, the S288C reference genome (R64 assembly), obtained from the Saccharomyces Genome Database\u003csup\u003e26\u003c/sup\u003e, was used as the reference since the BY4742 strain was derived from S288C. For \u003cem\u003eS. pombe\u003c/em\u003e, the reference genome of strain 972h\u0026minus;, obtained from PomBase\u003csup\u003e27\u003c/sup\u003e, was used as the reference since strain NS112 (\u003cem\u003eleu1-32 his3-D1 ura4-D18 ade6-M210\u003c/em\u003e) was derived from the 972 strain background.\u003c/p\u003e\n\u003cp\u003eGenome-wide barcode alignment was performed with a full-length similarity search alignment strategy to estimate coverage across the reference genomes. This approach, implemented in previous work\u003csup\u003e33\u003c/sup\u003e, is based on a Pearson correlation coefficient (CC) scoring system to identify the best placement for a query barcode within a given reference. In this strategy, the full query barcode is aligned to a segment of the reference, which requires the query to be shorter than the reference.\u003c/p\u003e\n\u003cp\u003eTo identify individual molecules spanning complete \u003cem\u003eS. cerevisiae\u003c/em\u003e chromosomes, a local similarity alignment was used in a barcode-to-reference mode. This approach uses a sliding window to identify the highest-scoring subregions between query and reference, allowing partial matches and overhang beyond reference boundaries. Both query and reference must exceed the selected overlap window length. The local alignment is based on a matrix profile framework and is also used in a barcode-to-barcode mode as the core function for pairwise similarity calculations in DOGMA\u003csup\u003e13,39\u003c/sup\u003e. For each chromosome, an overlap window corresponding to 70% of the chromosome length was used. 
Candidate barcodes were restricted to lengths between 70% and 150% of the corresponding chromosome size to accommodate molecule length variability while ensuring chromosome-scale coverage.\u003c/p\u003e\n\u003cp\u003eFollowing assembly, bargroup alignment to the reference genome was performed using local similarity alignment rather than full-length alignment. This choice accounts for assembled contigs that may extend over the edges or exceed the full length of the corresponding reference chromosomes due to the presence of insertions or of repetitive or unresolved genomic segments. Local alignments were calculated using an overlap window corresponding to 70% of the non-repetitive portion of each contig.\u003c/p\u003e\n\u003cp\u003eFor bargroup alignment to the reference, chromosome XII of the \u003cem\u003eS. cerevisiae\u003c/em\u003e genome was split into left and right segments after noticing partial alignment due to repetitive regions not included in the reference. The breakpoint for the insertion of the repetitive region into chromosome XII was manually determined to be around position 472 kbp.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eRepetitive Finder\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eSince DOGMA uses local barcode-to-barcode alignment for its bargrouping procedure, repetitive regions shared between molecules can dominate the matching process, potentially leading to non-canonical groupings independent of the non-repetitive content elsewhere in the molecules. For this reason, it is important to identify molecules with repetitive regions and mask their repetitive parts so that the rest of the molecule can be used as a contextual \u0026ldquo;anchor\u0026rdquo; in the bargrouping process. 
This way, bargroups are created based on the similarity of the non-repetitive parts of each barcode, while the repetitive parts are still preserved and can be \u0026ldquo;unmasked\u0026rdquo; for downstream analysis.\u003c/p\u003e\n\u003cp\u003eTo identify and mask highly repetitive regions prior to assembly, we developed a custom \u003cem\u003eRepetitive Finder\u003c/em\u003e algorithm implemented in MATLAB (Supplementary Fig. 2). The method detects periodic patterns in barcode intensity profiles using the autocorrelation function to estimate repeat periodicity, followed by Fourier transform validation. A rolling-window correlation then identifies continuous repetitive regions, with segmentation achieved using the MATLAB findchangepts function (Supplementary Methods 2). Regions with an average correlation \u0026gt;0.4 across at least five repeat units are marked as repetitive and masked in subsequent assembly steps. These empirically determined thresholds effectively separate highly repetitive barcodes from non-repetitive ones (Supplementary Fig. 2).\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eDOGMA pipeline implementation and bargroup quality metrics\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eAssembly was performed using the DOGMA algorithm as previously described\u003csup\u003e13\u003c/sup\u003e. Pairwise barcode comparisons were evaluated using normalized Pearson cross-correlation under constrained alignment parameters, including stretch tolerance, positional tolerance, and statistical significance thresholds defined by the DOGMA null model framework (Supplementary Table 1).\u003c/p\u003e\n\u003cp\u003eBecause repetitive, partially repetitive, and non-repetitive barcode classes exhibit distinct similarity distributions, parameter values within the null-model-based p-value framework were optimized separately for each class using the parametrization.m script within the DOGMA package (Supplementary Table 2). 
We recommend running this parametrization once per experimental setup, since the parameters depend strongly on the pixel size and effective point spread function (PSF) of the setup.\u003c/p\u003e\n\u003cp\u003eTo enable incorporation of repetitive regions, assembly was performed in multiple stages reflecting barcode class. Independent DOGMA assemblies were first carried out on non-repetitive and partially repetitive masked barcodes. Repetitive barcodes were then assembled in a subsequent stage using the partially repetitive assemblies as anchoring context. This staged strategy prevents repetitive segments from driving initial grouping while allowing their incorporation once non-repetitive context has been established. Finally, all resulting bargroup contigs were merged through an additional DOGMA-based local assembly step to generate unified bargroups spanning all barcode classes. Consensus bargroups were constructed by averaging aligned barcode intensity profiles following stretch normalization. To mitigate grouping driven by long-range low-frequency trends and to ensure that only structurally consistent and information-rich groups were retained, three complementary quality metrics were evaluated for each bargroup.\u003c/p\u003e\n\u003cp\u003eInternal similarity was first assessed using pairwise Pearson cross-correlation coefficients (CC) computed between all barcodes within each bargroup and their consensus. The mean of these values (meanCC) was required to exceed 0.7, ensuring overall coherence among members. The standard deviation of the CC values (stdCC) was also evaluated to measure internal consistency; bargroups with stdCC \u0026gt; 0.05 were discarded, as high dispersion indicates heterogeneous membership despite acceptable mean similarity.\u003c/p\u003e\n\u003cp\u003eFinally, a signal-to-noise ratio (SNR) was computed to quantify the strength of the consensus structure relative to variability among its members. 
For a bargroup with\u0026nbsp;\u003cimg width=\"12\" height=\"22\" src=\"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABIAAAAhCAMAAADj/gtmAAAAAXNSR0IArs4c6QAAAFFQTFRFAAAAAAAAAAA6AABmADqQAGa2OgAAOgA6Ojo6OpDbZgAAZmYAZraQZrb/kDoAkNv/tmYAtv+2tv//25A625Bm27Zm2////7Zm/9uQ//+2///bQQE1RQAAAAF0Uk5TAEDm2GYAAAAJcEhZcwAAFiUAABYlAUlSJPAAAAAZdEVYdFNvZnR3YXJlAE1pY3Jvc29mdCBPZmZpY2V/7TVxAAAAk0lEQVQoU82QWxaDIAxEEx9obcVHtCD7X2iTEKRLMB+QM3AnAwDPrYCIvcSTpvtq0AOx3bNmCoT2jbNI0enGRd1pZLOYtPbXpCRlHuCaXskLmbxOEYdhARKSz0wKfD06XkK14slK3lbJy3Uh15Iqh2Fyq1bqwOTnL6g+i3As7slnh+juB57WMWlB64cQlqCW9/nbD44SBscFrerRAAAAAElFTkSuQmCC\" v:shapes=\"_x0000_i1025\" alt=\"image\"\u003e\u0026nbsp;members, let\u0026nbsp;\u003cimg width=\"37\" height=\"22\" src=\"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAADgAAAAhCAMAAAC7gZh/AAAAAXNSR0IArs4c6QAAAHtQTFRFAAAAAAAAAAA6AABmADo6ADpmADqQAGa2OgAAOjo6OjqQOmaQOma2OpDbZgAAZjoAZpC2Zra2ZrbbZrb/kDoAkDo6kJA6kLbbkNv/tmYAtmY6ttv/tv/btv//25A627Zm27aQ29v/2////7Zm/9uQ/9u2/9vb//+2///bWL1TDQAAAAF0Uk5TAEDm2GYAAAAJcEhZcwAAFiUAABYlAUlSJPAAAAAZdEVYdFNvZnR3YXJlAE1pY3Jvc29mdCBPZmZpY2V/7TVxAAABb0lEQVRIS+1U2VbDIBCF1DSolbjEhagkLcTw/18oywwMth77qsd5SYC5s90LjP3b75nA3N9Uxer2mawNT3ax29ctuaGhfuFw7reH4jTxzTtjboyfYqukTrDvBuJkePRwA38guMrldDjNYydfgCbtHpku4VX6XWXzUty+SRjcsAM/gwjQvCPhrcCV7fnWN7/0kEphfCtCiI+R78jAfBio1F4dwv8qcXb5BOhoX6t2oP6457O7IY/YYGUpwiRaSgbWH4Ee9VSoSRWG3bpFFYmqgExxEhWBq0yUZjaBVDpiQkEoPGXELwCzQgjQCjpwBCALqmLDV5dltMg7qj4cDnhYEVNoDgAcuukW2bHMXXSJRCW9uEk0KYMGEaMAfCTfknd2j6CrFMTrIdrlLbCvYO4oOcWv/ck68BZwJy9NKAClfYbIqWKswJmcca0oUBcWfr7IFKg2b2PJefx01MIkSCOa+7Kce/omeHVVj1V1Kf7g4hOVKxyxh9o7mgAAAABJRU5ErkJggg==\" v:shapes=\"_x0000_i1025\" alt=\"image\"\u003e\u0026nbsp;denote the intensity at pixel\u0026nbsp;\u003cimg width=\"9\" height=\"22\" 
src=\"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAA4AAAAhCAMAAADebGoAAAAAAXNSR0IArs4c6QAAAFRQTFRFAAAAAAAAAAA6AABmADpmADqQAGa2OgAAOjqQOmaQOpDbZgAAZjoAZra2Zrb/kDoAkDo6kJA6kNv/tmYAtv//25A62////7Zm/9uQ/9u2//+2///bxNCSSgAAAAF0Uk5TAEDm2GYAAAAJcEhZcwAAFiUAABYlAUlSJPAAAAAZdEVYdFNvZnR3YXJlAE1pY3Jvc29mdCBPZmZpY2V/7TVxAAAAaklEQVQoU91QWRaAIAhkymzRVtM073/PsPJ5hV7zxSzAA6I/wGs0O1HQmPgc3zmLgU6FmsUE38o4Ny6fyvVaGJGBeIN3wt5TMnwLWVhQo8qthwxKkqm2x7YAO7w5Lkkx6HnhOUO8/hcefQFx2QQIL9rLtAAAAABJRU5ErkJggg==\" v:shapes=\"_x0000_i1025\" alt=\"image\"\u003e\u0026nbsp;of barcode\u0026nbsp;\u003cimg width=\"5\" height=\"22\" src=\"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAgAAAAhCAMAAADTchpHAAAAAXNSR0IArs4c6QAAAD9QTFRFAAAAAAAAAAA6AABmADqQAGa2OgAAOpDbZgAAZrbbZrb/kDoAkNv/tmYAtpA6tv//2////7Zm/9uQ//+2///bidrZ8AAAAAF0Uk5TAEDm2GYAAAAJcEhZcwAAFiUAABYlAUlSJPAAAAAZdEVYdFNvZnR3YXJlAE1pY3Jvc29mdCBPZmZpY2V/7TVxAAAASklEQVQoU2NgoAXgZWQHGyvCzcSD1XxhDkaoEmEOqBJhDmZ+sFohNhYBMEMQqoSBl5ELahymElaI+YKMrCJ8IEtFuBmZOCH6qQoAGKwCCWCz0YcAAAAASUVORK5CYII=\" v:shapes=\"_x0000_i1025\" alt=\"image\"\u003e, and\u0026nbsp;\u003cimg width=\"139\" height=\"30\" 
src=\"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAANAAAAAtCAMAAADodhWBAAAAAXNSR0IArs4c6QAAAJlQTFRFAAAAAAAAAAA6AABmADo6ADpmADqQAGa2OgAAOgA6OgBmOjo6OjqQOmaQOma2OpDbZgAAZgA6ZgBmZjoAZmZmZpBmZpC2Zra2ZrbbZrb/kDoAkDo6kJA6kJCQkLbbkNv/tmYAtmY6ttv/tv+2tv/btv//25A625Bm27Zm27aQ29v/2/+22////7Zm/9uQ/9u2/9vb//+2///bKPBhmQAAAAF0Uk5TAEDm2GYAAAAJcEhZcwAAFiUAABYlAUlSJPAAAAAZdEVYdFNvZnR3YXJlAE1pY3Jvc29mdCBPZmZpY2V/7TVxAAAD4klEQVRoQ+1YiZLTMAyNu7QNsHQL5dpwbMoRWmhSkv//OORbcuzECSFToJ5h6LaW9XT4SVaSXNfVA1cPXKAHDo8fLhDVaEjHHbv5Olr68gTLzan4pwwCF/9vBhWMrZOkyRi7v7wE8yLqi1AuLlm9vftT9nx7ydfHyY7vMajJXvHY1M//Gi7EBlUpw4snWf3i83Z1Skr4N2odd05oi2VvMFyZCBGEjUQoZ0zpb+AjN6i6PeWLh6SAizRiNdmiBf+4Q84plQMfbb7r4z0yRKQHBoi/tVvqrSlL8JEbBJZU6V2ThSnBg0kfWPPgtlaToUpxEAqbPdLrkSEiXRYBar5sVgA6HYtcGJTfA8WtfnRdoRYm62t/jcN2lkzgBxqVLgtA97smImcQP5eGDIrFhycd7YSLyagpkaeI7sLWgELuMQaFZJBIhBloC/ACDnnFLYE4dl0hF1NfgPiBRofMA/gGLmo4QERkmEEJVFLES4XQnIdcLc52MBl9VRp0A+cZlWHyU6FcZmWqHVuBM887FU0jMtAeHg6tLUnUlRLZF1pAKwST2aciB3+74AC/clqVco/93LONZA/7w+2Jf7YsZU8TsNCycP0QYfegouNiMqeqyAnqp+AAkQqegrb8pKSMDHdDugY+0jRpRIZGiKfQoL7NxWSvEHIdBQdYFVDp90O6lKSjo63+WL23rjUiww2CMA8JkYPJ6OsAZwxy05XIgGeVnfzMGINIJiLDuznA8ZCLKde1pwOcQVdv5W7F/DRCcKHQ1Y0xKBA8KmoABnY7mGzWU4NoHdEq9P9eg6CCIJrEqIaRAq3V6Fpai/AcwsVkd9ELTkqZvuGarXP9K5I5b9/gzmk8KWAc3qjQOUQLk4e2EwecYWelq0o1f2hyLtfn7TrBtQfT9iBe6PWEM4egmKAsm7zXRbINTiOV/UFzSBdaSMvAOUBMYETzTpXgsYVVXwn1SMUArV/Qk8PFhH7SudsGp1of6ALEevra9ORaJmfP4Ls6Y0tlz9juFN1keDXw0u1rmO2XLUw5eeyIfqAFjtIXzZ+pm1OUqZKXJECotXJJ88LPdjJ7CL1iurz9m88Hd6aDuLESDzz/cCRsUJXiri/igde6316Z6AceuB7PdPhrCC3ARgFq7WGDCtopep/gus8JcJXvCd4jYk9yZjoms4RVHJsCGJ1y+c2XPenMjzunUS9jhiRUJkJEm9Q702kBVC0kmkMQR5cpHlEMqheTbO6d6XgAtuYQkyCZ6JD+mc5EiuY6JmKmMxeUSfSIAW/PTGcSRXMdEjPTmQvLJHpiZjqTKJrpkJiZzkxQrmquHrh6YLgHfgHtsGvhFEteegAAAABJRU5ErkJggg==\" v:shapes=\"_x0000_i1025\" alt=\"image\"\u003e\u0026nbsp;the consensus intensity. 
The SNR was defined as in equation (1) where the numerator reflects variation across the consensus profile, and the denominator reflects the average per-pixel variability among individual barcodes. Bargroups with SNR \u0026lt; 1.0 were excluded.\u003c/p\u003e\n\u003cp\u003e\u003cimg width=\"151\" height=\"34\" src=\"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAOMAAAAzCAMAAABSbk9RAAAAAXNSR0IArs4c6QAAALRQTFRFAAAAAAAAAAA6AABmADo6ADpmADqQAGaQAGa2OgAAOgA6OgBmOjoAOjo6OjpmOmZmOmaQOma2OpC2OpDbZgAAZgA6ZjoAZjo6ZmYAZmY6Zma2ZpC2ZpDbZrbbZrb/kDoAkDo6kGY6kGZmkJBmkLbbkNv/tmYAtmY6tmZmtpA6tpBmtrb/ttvbttv/tv//25A625Bm27Zm27aQ29u22//b2////7Zm/7aQ/9uQ/9u2//+2///bur114AAAAAF0Uk5TAEDm2GYAAAAJcEhZcwAAFiUAABYlAUlSJPAAAAAZdEVYdFNvZnR3YXJlAE1pY3Jvc29mdCBPZmZpY2V/7TVxAAAFFElEQVRoQ+2aa3/SMBTGGwbaqnObIl42hlMpUzcrKFja7/+9fM45KU2alssotfNHXxSahab/nEtOns7zjsdxBo4zsOcMpCOllL/nTY4/b2oGkr56MjcHiwOlXhdGX94YLdH5Pf05VOrkR1NPuec4y36R8ZruGMFTlepc0ff42diahVO5XLSaMenndklHpYxeCIL0qwJw0mfq/IifswHbzRgHGxkFnTtGTuYJuaWdjOkN/K83niLkFFlo+RbuGJTaMekTxwK9tBnTifLn3jTgZvbSdjKGJ/ccXNqOcUBP7eQc9sw4wMcsuABKh8MvusaXxcWUPECwW8mYjsAoAOyrFHJeRTxyzukMTZSkf/ZBglLiuZWM3iJQL7+sGIWuglFSDpkwytInpkgvFumovYxe+i0g1xQ7JrxqlDNKOErKeVSMyZuxl05gG2EUuwhpMuhcw8xo5kDUPRZUDaxc8vazROYDfHWG5KY6Z4gUlA6U7WgEtIz1OqzU6aVVicg4DzhgnLlH/kdflt9x/1fzdBL0MPTtj9BHQsHowhgRziygSZCcEw0n1wjD+IIeRfpsH48ThcBe9hkOgS6OIZOL1E2Xy0Eh9T0Aj3+SfsTcdSkgfwaKKjJ89O5G3XMZCzk0e/4FVzldnltJoiGVPJHqGav/1ozI33RnWojorGvFdMStMU8kRnEqyIdiVv8uK9jERuYhK77bsjVjJI+fjpix85l8lC5NRjjxwTc56c0LYXMZ967lIvPx4foh+2XTjNHw0zj0E1r+XMa9a/KFQkLLDjDCL3HtMBZ3OzX7agjvWQQcbIfYWyGZcq7KojKiPGMz6rBcdeKEoI8sndcMXe/tUOuuKCnzQGPwbcbZoGft3+odv5m7LbEbEHNwdoX3jld2ZGu9qmd5bIamahSsgLJb4RUEaee3kVexdm4TjaYDt+n7ihkOylUuM1LakTVJApEM+29tUM/osoQII8qJk3cGI0dofkgN8phyji4iuPrNGEnaNGuAx25I3qeS6aSWy2oNixHMth5Rj+80dhckGBTHU2ESa3La0dcSiViXt0k7jT30jgPN3r9AaJ2SqMB7K4k82aKy3sCWfUyC7Y4T0ET3stpNz7MRIZW9PEtD9ywRvTWWKanB9dSaCrruZarn0qugoXuxJaJvvf06rDVtxioFPeuVq+fyVM6+y7NE9FYyVinomtFQz4XR1dA9U0SvmRER488GECNinDgbo0JUpNV4pBhh35T1sIsm
efpNCrpmzNVzy4ymhm6J6DUzAqE7pNPlXAIpDqBIkfRGm8M/Irl1r3CyNQV5+k0KumbM1XNhLNHQLRG9bkaeYxa7xKNCOlFV8etOL0bco1RPXaegc4iaOYfVc9HKVwy5hm6J6AdhZAJ9kpIWVvp1gzWYJ6CKcZ2CzniaMVfPpaVEX8bouYh+YEYtX5IfdoZzbeRKxk0KujAa6rlOOa6G3iTj6u0Ep8oiY0EXX6Ogi4jmqOfSXKKhH9RX4/7TOcRoDJ0MZNsJ/WD28j5BO9Rwes1BPeSPBV18jYIurueo59ojXQ3dEtFr9lXsrJVPuzKICDhhkqeIQlLD8bYKarjyuQeVvlqDN3TxNQq6rPLM6KrnJRr6QdeOHcuhal3cvpEsD07Fp1cNV0M/ZA2wI+IaXdy+U8RWdxiluc213Hpd3IIM/eUHemNZeAsizW2uyTfo4iYk3lUhpp29lTRTxWj+f1JL91Y7+vix+3EG/t8Z+AuPUNDM+xL8iQAAAABJRU5ErkJggg==\" v:shapes=\"_x0000_i1025\" alt=\"image\"\u003e\u0026nbsp;(1)\u003c/p\u003e\n\u003cp\u003eOnly bargroups satisfying all three criteria were retained for downstream assembly.\u003c/p\u003e\n\u003cp\u003eBarcodes that remained unassembled but exhibited high-confidence reference alignments were subsequently anchored to the theoretical genome map to maximize genome-wide representation.\u003cstrong\u003e\u003cbr clear=\"all\"\u003e\u0026nbsp;\u003c/strong\u003e\u003c/p\u003e"},{"header":"Declarations","content":"\u003cp\u003e\u003cstrong\u003eAcknowledgements\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eWe thank Dr. Mikael Molin for providing the Saccharomyces cerevisiae strain BY4742 and for valuable advice on yeast cultivation. The nanofluidic devices used in this work were fabricated using Chalmers MyFab cleanroom facility.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eFunding sources\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eF.W. discloses support for the research of this work from the Swedish Childhood Cancer fund (Barncancerfonden, MT2022-0003) and the Swedish Cancer Foundation (Cancerfonden, 24 3885 Pj). Work in N.S. lab was supported by Knut and Alice Wallenberg foundations (KAW.2021.0173). T.A. 
and the rest of the authors declare no relevant funding.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAuthor Contributions\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eConceptualization: L.M.L.G., G.G., F.W.\u003c/p\u003e\n\u003cp\u003eMethodology: L.M.L.G., A.D.\u003c/p\u003e\n\u003cp\u003eSoftware: L.M.L.G., A.D.\u003c/p\u003e\n\u003cp\u003eValidation: L.M.L.G.\u003c/p\u003e\n\u003cp\u003eFormal analysis: L.M.L.G.\u003c/p\u003e\n\u003cp\u003eInvestigation: L.M.L.G., H.Z.\u003c/p\u003e\n\u003cp\u003eResources: S.K.K., I.O., N.S.\u003c/p\u003e\n\u003cp\u003eData curation: L.M.L.G., H.Z.\u003c/p\u003e\n\u003cp\u003eWriting \u0026ndash; original draft: L.M.L.G., F.W.\u003c/p\u003e\n\u003cp\u003eWriting \u0026ndash; review \u0026amp; editing: All authors\u003c/p\u003e\n\u003cp\u003eVisualization: L.M.L.G.\u003c/p\u003e\n\u003cp\u003eSupervision: F.W., T.A.\u003c/p\u003e\n\u003cp\u003eProject administration: L.M.L.G., F.W.\u003c/p\u003e\n\u003cp\u003eFunding acquisition: F.W.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eCompeting interests\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe authors declare no competing interests.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eData and Code availability\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe datasets generated and analyzed during this study, together with the custom MATLAB code used for barcode processing and assembly, are available to reviewers via a private Google Drive link provided at submission (https://drive.google.com/drive/folders/1aEM9YhjKS1UgG0qovO57dkADrqzxVKeG?usp=sharing). Upon acceptance, these materials will be made publicly available through a permanent repository (e.g. Zenodo) with assigned DOIs. Source data including raw microscopy images, processed barcode data and assembly outputs, are included in these deposits.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\u003cli\u003e\u003cspan\u003eLiao X et al (2019) Current challenges and solutions of de novo assembly. 
\u003cem\u003eQuantitative Biology\u003c/em\u003e vol. 7 90\u0026ndash;109 Preprint at \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1007/s40484-019-0166-9\u003c/span\u003e\u003cspan address=\"10.1007/s40484-019-0166-9\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eYuan Y, Chung CYL, Chan TF (2020) Advances in optical mapping for genomic research. \u003cem\u003eComputational and Structural Biotechnology Journal\u003c/em\u003e vol. 18 2051\u0026ndash;2062 Preprint at \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1016/j.csbj.2020.07.018\u003c/span\u003e\u003cspan address=\"10.1016/j.csbj.2020.07.018\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eDai Y et al (2020) Single-molecule optical mapping enables quantitative measurement of D4Z4 repeats in facioscapulohumeral muscular dystrophy (FSHD). J Med Genet 57:109\u0026ndash;120\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003evan der Sanden B et al (2025) Optical genome mapping enables accurate testing of large repeat expansions. Genome Res 35:810\u0026ndash;823\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWeissensteiner MH et al (2017) Combination of short-read, long-read, and optical mapping assemblies reveals large-scale tandem repeat arrays with population genetic implications. Genome Res 27:697\u0026ndash;708\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eDremsek P et al (2021) Optical genome mapping in routine human genetic diagnostics\u0026mdash;its advantages and limitations. Genes (Basel). 12\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGrunwald A et al (2015) Bacteriophage strain typing by rapid single molecule analysis. Nucleic Acids Res. 
43\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003ePark J et al (2019) Single-molecule DNA visualization using AT-specific red and non-specific green DNA-binding fluorescent proteins. Analyst 144:921\u0026ndash;927\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLee S et al (2018) TAMRA-polypyrrole for A/T sequence visualization on DNA molecules. Nucleic Acids Res 46\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eNyberg LK et al (2012) A single-step competitive binding assay for mapping of single DNA molecules. Biochem Biophys Res Commun 417:404\u0026ndash;408\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eNilsson AN et al (2014) Competitive binding-based optical DNA mapping for fast identification of Bacteria - Multi-ligand transfer matrix theory and experimental applications on Escherichia coli. Nucleic Acids Res 42\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eM\u0026uuml;ller V et al (2020) Cultivation-Free Typing of Bacteria Using Optical DNA Mapping. ACS Infect Dis 6:1076\u0026ndash;1084\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eDvirnas A et al (2025) DOGMA: de novo assembly of densely labelled optical DNA maps using a matrix profile approach. PLoS ONE 20:e0335633\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eJeffet J, Margalit S, Michaeli Y, Ebenstein Y (2021) Single-molecule optical genome mapping in nanochannels: Multidisciplinarity at the nanoscale. \u003cem\u003eEssays in Biochemistry\u003c/em\u003e vol. 65 51\u0026ndash;66 Preprint at \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1042/EBC20200021\u003c/span\u003e\u003cspan address=\"10.1042/EBC20200021\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHartley G, O\u0026rsquo;Neill RJ (2019) Centromere repeats: Hidden gems of the genome. 
\u003cem\u003eGenes\u003c/em\u003e vol. 10 Preprint at \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.3390/genes10030223\u003c/span\u003e\u003cspan address=\"10.3390/genes10030223\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKlein SJ, O\u0026rsquo;Neill RJ (2018) Transposable elements: genome innovation, chromosome diversity, and centromere conflict. \u003cem\u003eChromosome Research\u003c/em\u003e vol. 26 5\u0026ndash;23 Preprint at \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1007/s10577-017-9569-5\u003c/span\u003e\u003cspan address=\"10.1007/s10577-017-9569-5\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eYang J, Li F (2017) Are all repeats created equal? Understanding DNA repeats at an individual level. \u003cem\u003eCurrent Genetics\u003c/em\u003e vol. 63 57\u0026ndash;63 Preprint at \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1007/s00294-016-0619-x\u003c/span\u003e\u003cspan address=\"10.1007/s00294-016-0619-x\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSmirnov E, Chm\u0026uacute;rčiakov\u0026aacute; N, Liška F, Bažantov\u0026aacute; P, Cmarko D (2021) Variability of human rDNA. \u003cem\u003eCells\u003c/em\u003e vol. 
10 1\u0026ndash;14 Preprint at \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.3390/cells10020196\u003c/span\u003e\u003cspan address=\"10.3390/cells10020196\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eOizumi Y et al (2021) Complete sequences of Schizosaccharomyces pombe subtelomeres reveal multiple patterns of genome variation. Nat Commun 12\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eEgidi A, Di Felice F, Camilloni G (2020) Saccharomyces cerevisiae rDNA as super-hub: the region where replication, transcription and recombination meet. \u003cem\u003eCellular and Molecular Life Sciences\u003c/em\u003e vol. 77 4787\u0026ndash;4798 Preprint at \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1007/s00018-020-03562-3\u003c/span\u003e\u003cspan address=\"10.1007/s00018-020-03562-3\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eD\u0026rsquo;Alfonso A, Micheli G, Camilloni G (2024) rDNA transcription, replication and stability in Saccharomyces cerevisiae. \u003cem\u003eSeminars in Cell and Developmental Biology\u003c/em\u003e vols 159\u0026ndash;160 1\u0026ndash;9 Preprint at \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1016/j.semcdb.2024.01.004\u003c/span\u003e\u003cspan address=\"10.1016/j.semcdb.2024.01.004\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eJames SA et al (2009) Repetitive sequence variation and dynamics in the ribosomal DNA array of Saccharomyces cerevisiae as revealed by whole-genome resequencing. 
Genome Res 19:626\u0026ndash;635\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eAquiles Sanchez J, Kim S-M, Huberman JA (1998) Ribosomal DNA Replication in the Fission Yeast, Schizosaccharomyces Pombe. Exp Cell Res 238\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSabouri N, McDonald KR, Webb CJ, Cristea IM, Zakian VA (2012) DNA replication through hard-to-replicate sites, including both highly transcribed RNA Pol II and Pol III genes, requires the S. pombe Pfh1 helicase. Genes Dev 26:581\u0026ndash;593\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWood V et al (2002) The genome sequence of Schizosaccharomyces pombe. Nature 415:871\u0026ndash;880\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eCherry JM et al (2012) Saccharomyces Genome Database: The genomics resource of budding yeast. Nucleic Acids Res 40\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHarris MA et al (2022) Fission stories: using PomBase to understand Schizosaccharomyces pombe biology. Genetics 220\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSela I, Wolf YI, Koonin EV (2016) Theory of prokaryotic genome evolution. \u003cem\u003eProc. Natl. Acad. Sci. U. S. A.\u003c/em\u003e 113, 11399\u0026ndash;11407\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003ePasero P, Marilley M (1993) Size variation of rDNA clusters in the yeasts Saccharomyces cerevisiae and Schizosaccharomyces pombe. Mol Gen Genet 236:448\u0026ndash;452\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eNurk S et al The complete sequence of a human genome. \u003cem\u003eScience (\u003c/em\u003e(1979)). 4453 (2022)). 4453 (2022)\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLander S et al (2001) Initial Sequencing and Analysis of the Human Genome International Human Genome Sequencing Consortium* The Sanger Centre: Beijing Genomics Institute/Human Genome Center. \u003cem\u003eNATURE\u003c/em\u003e vol. 
409 www.nature.com\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eDetinis Zur T et al (2025) Single-molecule toxicogenomics: Optical genome mapping of DNA-damage in nanochannel arrays. DNA Repair (Amst). 146\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eM\u0026uuml;ller V et al (2019) Enzyme-free optical DNA mapping of the human genome using competitive binding. Nucleic Acids Res 47\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003ePersson F, Tegenfeldt JO (2010) DNA in nanochannels-directly visualizing genomic information. Chem Soc Rev 39:985\u0026ndash;999\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eFrykholm K, M\u0026uuml;ller V, Kk S, Dorfman KD, Westerlund F (2022) DNA in nanochannels: theory and applications. \u003cem\u003eQuarterly Reviews of Biophysics\u003c/em\u003e vol. 55 Preprint at \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1017/S0033583522000117\u003c/span\u003e\u003cspan address=\"10.1017/S0033583522000117\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSriram KK, Persson F (2024) and F. J. and B. J. P. and T. J. O. and W. F. Fluorescence Microscopy of Nanochannel-Confined DNA. in \u003cem\u003eSingle Molecule Analysis: Methods and Protocols\u003c/em\u003e (ed. Heller Iddo and Dulin, D. and P. E. J. G.) 175\u0026ndash;202Springer US, New York, NY. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1007/978-1-0716-3377-9_9\u003c/span\u003e\u003cspan address=\"10.1007/978-1-0716-3377-9_9\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eNyblom M et al (2023) Strain-level bacterial typing directly from patient samples using optical DNA mapping. 
Commun Med 3\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eDvirnas A, Lin Y-L (2021) Ildev.v.0.5.3. Preprint at \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.5281/zenodo.5718208\u003c/span\u003e\u003cspan address=\"10.5281/zenodo.5718208\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eYeh M et al (2016) C.-C. Barcelona,. Matrix Profile I: All Pairs Similarity Joins for Time Series: A Unifying View that Includes Motifs, Discords and Shapelets. in \u003cem\u003eIEEE 16th International Conference on Data Mining (ICDM)\u003c/em\u003e 1317\u0026ndash;1322 \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1109/ICDM.2016.0179\u003c/span\u003e\u003cspan address=\"10.1109/ICDM.2016.0179\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":true,"hideJournal":false,"highlight":"","institution":"","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"
[email protected]","identity":"nature-portfolio","isNatureJournal":true,"hasQc":false,"allowDirectSubmit":false,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"","title":"Nature Portfolio","twitterHandle":"","acdcEnabled":false,"dfaEnabled":false,"editorialSystem":"ejp","reportingPortfolio":"","inReviewEnabled":true,"inReviewRevisionsEnabled":false},"keywords":"","lastPublishedDoi":"10.21203/rs.3.rs-9608945/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-9608945/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003eLarge repetitive elements and complex chromosomal organization continue to challenge accurate assembly of eukaryotic genomes. Optical genome mapping (OGM) provides long-range genomic information by imaging individual DNA molecules. While most OGM protocols rely on sparse enzymatic labeling that restricts resolution in poorly labeled or structurally complex regions, dense labeling strategies generate continuous fluorescence intensity profiles that offer a complementary representation of genome structure.\u003c/p\u003e \u003cp\u003eHere we extend our Dense Optical Genome Mapping Assembly (DOGMA) pipeline, previously used for bacterial genomes, to eukaryotic genomes using a competitive binding (CB)-based dense OGM protocol that produces continuous intensity profiles, reflecting local AT/GC-content, along individual DNA molecules. We analyzed long DNA molecules from two model eukaryotes, \u003cem\u003eSaccharomyces cerevisiae\u003c/em\u003e and \u003cem\u003eSchizosaccharomyces pombe\u003c/em\u003e, with compact, yet structurally rich genomes. CB-based OGM captures chromosome-spanning molecules and large repetitive arrays (\u0026gt;\u0026thinsp;500 kbp), while also revealing genome-scale compositional features that influence barcode similarity and assembly behavior. 
By explicitly accounting for these features inherent to eukaryotic genomes, DOGMA reconstructs genome-wide optical maps while avoiding collapse of non-adjacent repetitive loci. The resulting assemblies identify structural variations and repetitive arrays that are incompletely represented in reference genomes. These results establish a framework for scalable dense OGM to increasingly complex genomes.\u003c/p\u003e","manuscriptTitle":"Complete de novo assembly of yeast genomes using enzyme-free, dense optical genome mapping","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2026-05-09 01:12:27","doi":"10.21203/rs.3.rs-9608945/v1","editorialEvents":[],"status":"published","journal":{"display":true,"email":"
[email protected]","identity":"nature-communications","isNatureJournal":true,"hasQc":false,"allowDirectSubmit":false,"externalIdentity":"NCOMMS","sideBox":"Learn more about [Nature Communications](http://www.nature.com/ncomms/)","snPcode":"","submissionUrl":"https://mts-ncomms.nature.com/","title":"Nature Communications","twitterHandle":"","acdcEnabled":true,"dfaEnabled":true,"editorialSystem":"ejp","reportingPortfolio":"Nature Communications","inReviewEnabled":true,"inReviewRevisionsEnabled":false}}],"origin":"","ownerIdentity":"205529b2-3f6e-4764-a610-37131187dbd2","owner":[],"postedDate":"May 9th, 2026","published":true,"recentEditorialEvents":[{"type":"reviewerAgreed","content":"This content is not available.","date":"2026-05-07T21:16:53+00:00","index":1,"fulltext":"This content is not available."},{"type":"reviewersInvited","content":"5","date":"2026-05-07T16:25:40+00:00","index":"","fulltext":""},{"type":"editorAssigned","content":"","date":"2026-05-05T08:33:19+00:00","index":"","fulltext":""},{"type":"checksComplete","content":"","date":"2026-05-04T16:48:07+00:00","index":"","fulltext":""},{"type":"submitted","content":"Nature Communications","date":"2026-05-04T13:26:27+00:00","index":"","fulltext":""}],"rejectedJournal":[],"revision":"","amendment":"","status":"under-review","subjectAreas":[{"id":67733535,"name":"Biological sciences/Genetics/Genomics"},{"id":67733536,"name":"Physical sciences/Nanoscience and technology/Nanobiotechnology"},{"id":67733537,"name":"Biological sciences/Biological techniques/Genetic techniques"}],"tags":[],"updatedAt":"2026-05-09T01:12:27+00:00","versionOfRecord":[],"versionCreatedAt":"2026-05-09 
01:12:27","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-9608945","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-9608945","identity":"rs-9608945","version":["v1"]},"buildId":"XKTyCvWXoU3ODBz1xrDgd","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}