Diploid telomere-to-telomere assemblies reveal hierarchical satellite architectures and heritable MHC structural diversity in macaques

doi:10.21203/rs.3.rs-9105354/v1

2026 · doi:10.21203/rs.3.rs-9105354/v1

preprint OA: closed CC-BY-4.0

Seiya Imoto, Takuya Yamamoto, Yusuke Nakamura, Shokichi Takahama, and 20 more

This is a preprint; it has not been peer reviewed by a journal.

https://doi.org/10.21203/rs.3.rs-9105354/v1

This work is licensed under a CC BY 4.0 License. Status: Under Review. Version 1 posted.

Abstract

The crab-eating macaque (Macaca fascicularis), a key biomedical model, exhibits substantial inter-individual variability. However, its genomic architecture remains incompletely resolved due to extensive structural complexity and reliance on haploid reference assemblies.
Here, we generated 24 diploid telomere-to-telomere (T2T) haplotypes from 12 male and female individuals representing three geographic populations, including a pedigree, enabling systematic interrogation of genome architecture at diploid resolution with geographic diversity and familial inheritance. We identified previously uncharacterized large inter-chromosomal repeat clusters with haplotype-specific organization that are conserved among Old World monkeys. In addition, 24 completely contiguous major histocompatibility complex (MHC) haplotypes reveal extensive inter- and intra-individual variation in gene copy number and structural organization of the class IA, IB, and II regions. Together, these findings demonstrate that diploid T2T assemblies are essential for accurately capturing structural and immunogenetic diversity in non-human primates and provide a genomic framework for precision immunogenomics in this widely used biomedical model.

Biological sciences/Genetics/Genomics; Biological sciences/Genetics/Immunogenetics; Biological sciences/Immunology/Immunogenetics

1. Introduction

The crab-eating macaque (Macaca fascicularis) is an indispensable non-human primate model for a broad range of biomedical research, including vaccine development, immunology, and drug safety evaluation. Owing to its close immunological similarity to humans, it plays a central role in infectious disease research and in investigations of inter-individual immunological variability [1,2]. Despite its widespread use, substantial differences in immune responses and disease susceptibility among individuals exposed to identical stimuli have long been recognized, yet the genomic basis underlying this variability remains poorly understood [3]. A major contributor to this scientific gap lies in the structural limitations of previously available reference genomes.
Most conventional reference assemblies have been predominantly haploid-based and are therefore unable to fully resolve highly polymorphic and tandemly repetitive regions, particularly immune-related loci such as the major histocompatibility complex (MHC). In these regions, distinct haplotypes were often represented as mosaicked or collapsed structures, with structural diversity frequently dismissed as stochastic noise [4]. As a result, biologically meaningful structural variations are averaged, simplified or obscured, fundamentally constraining our understanding of individual-level genomic diversity.

Recent advances in genome assembly technologies have begun to transform primate genomics. The completion of the human telomere-to-telomere (T2T) genome [5], which for the first time resolved complete sequences including centromeres and telomeres, marked a pivotal milestone. Parallel efforts generating complete macaque genome sequences [6] and T2T analyses across ape lineages [7] have further indicated the importance of complex structural variations in shaping phenotypic diversity. However, the systematic characterization of diploid-level structural diversity across multiple individuals, as well as how such diversity is transmitted across generations, remains largely unexplored [8,9].

The Tsukuba Primate Research Center (TPRC) at the National Institutes of Biomedical Innovation, Health and Nutrition (NIBN) maintains a large colony of specific-pathogen-free (SPF) crab-eating macaques (approximately 2,000 monkeys) with seven-generation pedigree information. This resource represents an internationally important primate platform for research on infectious diseases, aging, immunity, and some inherited disorders and their models [10,11,12,13,14,15], highlighting the necessity of precise and complete genomic characterization. Leveraging this unique resource, we generated 24 diploid telomere-to-telomere haplotypes from 12 individuals representing three geographic origins.
By integrating ultra-long, high-fidelity, and chromatin-conformation sequencing data, we resolved both homologous chromosomes at the T2T scale. This diploid framework reveals previously unrecognized hierarchical repeat architectures, defines their population-level diversification, and establishes a comprehensive atlas of MHC structural diversity with direct pedigree-based validation of inheritance. Together, our findings demonstrate that diploid T2T resolution is essential for accurately capturing immunogenetic diversity in this key biomedical primate model.

2. Results

2-1. Results: Diploid T2T Assembly of Diverse Macaque Cohorts

To comprehensively capture genome-wide structural diversity at diploid resolution in crab-eating macaques, we performed de novo contig assemblies achieving telomere-to-telomere (T2T) contiguity for 12 individuals originating from three geographic regions: Indonesia (n = 3), Malaysia (n = 3), and the Philippines (n = 6) (Fig. 1a, Supplementary Table 1). The Philippine cohort included a father–offspring pedigree (one father and three offspring), enabling the incorporation of both geographic diversity and pedigree structure into the study design. For each individual, PacBio HiFi reads, Oxford Nanopore Technologies (ONT) ultra-long reads, and Pore-C data were integrated (Supplementary Table 2). Assembly and haplotype phasing were conducted using hifiasm and Verkko [16,17]. The resulting haplotype-resolved contig assemblies exhibited high intrinsic continuity prior to reference-guided scaffolding. Contig N50 values ranged from 142 to 179 Mb, with total assembly sizes of 2.9–3.1 Gb (Fig. 1b), indicating near chromosome-scale contiguity achieved through de novo assembly alone. Chromosome-scale scaffolding was subsequently performed using RagTag with the existing T2T-MFA8v1.1 assembly as a structural guide, yielding 24 haplotype-resolved genome assemblies.
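As a point of reference for the continuity metrics reported above, the following minimal Python sketch computes N50, NG50, and auN from a list of contig lengths. It assumes the standard definitions: N50 is the contig length at which the cumulative sum of length-sorted contigs reaches half the assembly size, NG50 uses a fixed genome size instead of the assembly size, and auN is the length-weighted mean contig length. The function names and the toy lengths are illustrative and are not taken from the authors' pipeline.

```python
from typing import Optional, Sequence


def nx(lengths: Sequence[int], fraction: float = 0.5,
       genome_size: Optional[int] = None) -> int:
    """Return Nx (or NGx when genome_size is given) for a set of contig lengths.

    Contigs are sorted from longest to shortest and accumulated until the
    running total reaches `fraction` of the assembly size (or genome size).
    Returns 0 if the contigs never reach the target, which can happen for NGx.
    """
    denominator = genome_size if genome_size is not None else sum(lengths)
    target = fraction * denominator
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= target:
            return length
    return 0


def aun(lengths: Sequence[int]) -> float:
    """Area under the Nx curve: sum of squared lengths over total length."""
    total = sum(lengths)
    return sum(length * length for length in lengths) / total


if __name__ == "__main__":
    toy_contigs = [50, 30, 20]  # hypothetical lengths, arbitrary units
    print("N50:", nx(toy_contigs))
    print("NG50 (genome size 200):", nx(toy_contigs, genome_size=200))
    print("auN:", aun(toy_contigs))
```

Unlike N50, auN changes smoothly when a single contig is split or joined, which makes it a convenient summary when comparing the two haplotypes of one individual.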
Reference-guided scaffolding introduced only limited scaffold-level padding and did not substantially alter underlying contig continuity. Balanced haplotype reconstruction and structural fidelity To assess the haplotype balance, we compared the scaffold auN metrics between paternal and maternal haplotypes (hap1 and hap2) for each individual. We observed strong concordance (R = 0.86), with no evidence of systematic degradation in either haplotype (Fig. 1e), indicating balanced reconstruction. Genome-wide structural comparisons using SyRI further demonstrated high overall synteny with MFA8v1.1 while successfully preserving haplotype-specific structural configurations (Fig. 1c, d, f, g Supplemental Fig 1a-e) [18,19, 20, 21]. Together, these results indicate that diploid reconstruction achieved T2T-level continuity comparable to existing references while maintaining structural integrity suitable for downstream interrogation of genomic architectures. Collectively, the 24 diploid T2T haplotypes generated in this study constituted a highly contiguous and reproducible genomic framework, providing a robust foundation for subsequent analyses of structural diversity and detailed characterization of complex loci, including immune-related regions. 2-2. Assembly graph entanglement reflects shared high-identity satellite arrays Although our diploid T2T assemblies preserved high overall synteny with the existing references, specific regions consistently exhibited complex connectivity patterns within the assembly graph. These structures were characterized by the convergence of sequences originating from multiple chromosomes and could not be adequately represented within the conventional linear reference framework (Fig. 2a). Such entanglements arise when highly similar repeat arrays located on different chromosomes cannot be completely disentangled during graph construction, resulting in multi-edge convergence onto shared nodes. 
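One simple way to screen an assembly graph for this kind of convergence is to count the links attached to each node. The sketch below assumes the graph is available in GFA 1 format (tab-separated `S` segment records and `L` link records) and uses a hypothetical degree threshold, `min_degree`. It illustrates the idea only and is not the extraction procedure used in this study.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def link_degrees(gfa_lines: Iterable[str]) -> Dict[str, int]:
    """Count how many link (L) records touch each segment in a GFA 1 graph."""
    degree: Dict[str, int] = defaultdict(int)
    for line in gfa_lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "S" and len(fields) >= 2:
            degree[fields[1]] += 0  # register segments that have no links
        elif fields[0] == "L" and len(fields) >= 5:
            # L <from> <from_orient> <to> <to_orient> <overlap>
            degree[fields[1]] += 1
            degree[fields[3]] += 1
    return dict(degree)


def candidate_tangle_nodes(degrees: Dict[str, int], min_degree: int = 4) -> List[str]:
    """Return segments whose link count meets a (hypothetical) threshold."""
    return sorted(name for name, d in degrees.items() if d >= min_degree)


if __name__ == "__main__":
    # Toy graph: four segments converge on one shared node named "hub".
    toy_gfa = [
        "H\tVN:Z:1.0",
        "S\thub\t*", "S\ta\t*", "S\tb\t*", "S\tc\t*", "S\td\t*",
        "L\ta\t+\thub\t+\t0M",
        "L\tb\t+\thub\t+\t0M",
        "L\tc\t+\thub\t+\t0M",
        "L\td\t+\thub\t+\t0M",
    ]
    print(candidate_tangle_nodes(link_degrees(toy_gfa)))
```

A degree count alone does not show that the converging edges come from different chromosomes; that would require chromosome assignments for the flanking nodes, which are not modeled here.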
Comparable graph complexities have been reported in satellite-rich regions during telomere-to-telomere assembly of the human genome, underscoring the intrinsic difficulty of resolving high-copy satellite DNA [5,16,17]. To determine the sequence basis of these entanglements, we systematically extracted and characterized the repeat units underlying the convergent graph nodes. Tangled-balls define a hierarchical satellite architecture built from three principal repeat units Detailed sequence decomposition revealed that these entangled graph regions did not represent the amorphous accumulations of repeat DNAs. Instead, in the representative diploid assembly examined here, they formed a highly organized hierarchical satellite architecture, which we collectively term “tangled-balls”. Dot-plot analyses and repeat decomposition identified three principal repeat units that constitute the core elements of this architecture: an 81-bp L-core, an 80-bp S-core, and a larger ~1.4-kb repeat block (L-extension) (Fig. 2b,c). The 80-bp and 81-bp units share a conserved GC-rich internal motif with >90% sequence identity; however, full-length alignments revealed only moderate overall identity (62.1%), supporting their classification as distinct but closely related satellite subtypes. RepeatMasker annotation classified these sequences as SATR1v-related satellites, whereas dot-plot analyses and periodicity assessments demonstrated repeat structures that differ from canonical SATR1/SATR2 arrays (Fig. 2b,c) [22]. Mapping of repeat units within tangled-ball loci revealed that L-core arrays frequently co-occur with the ~1.4-kb L-extension repeat blocks, forming composite repeat clusters (Fig. 2d). This organization gives rise to two higher-order configurations: a large tangled-ball unit, consisting of alternating L-core arrays and L-extension blocks, and a small tangled-ball unit, composed primarily of S-core repeats (Fig. 2e). 
These observations indicate that tangled-balls represent a hierarchical satellite architecture in which short repeat units (80–81 bp) and longer repeat blocks (~1.4 kb) combine to form higher-order composite repeat arrays. Haplotype-specific orientation polymorphism revealed by diploid resolution In addition to defining the repeat composition, diploid resolution enabled the assessment of repeat orientation relative to the telomeric direction. Analysis of repeat organization uncovered a striking structural dimorphism between homologous chromosomes. Genome-wide mapping of repeat density further revealed that tangled-ball units are not randomly distributed but instead exhibit chromosome-specific enrichment patterns. Notably, L- and S-ball structures were markedly enriched in subtelomeric regions and display a distribution pattern distinct from canonical SATR1/SATR1v/SATR2 repeats (Fig. 2f, and Supplementary Fig 2). Within this representative diploid genome, the repeat density was not uniform, forming discrete localized peaks along specific chromosomal intervals. These focal enrichments suggest that tangled-ball arrays are organized into discrete chromosomal configurations rather than representing diffuse repeat accumulations. Chromosome-11–specific subtype and orthogonal cytogenetic validation In addition to these multi-chromosomal tangled-balls, we identified a chromosome-11–specific repeat cluster composed predominantly of an 80-bp SATR1v-related subtype (chr11-core). Global alignments indicated that the chr11-core shared higher sequence identity with the 81-bp L-core (74.7%) than with the 80-bp S-core (56.7%), supporting the existence of structurally related but distinct satellite subclasses. Unlike the multi-chromosomal tangled-balls, the chr11-specific cluster was confined exclusively to chromosome 11 and did not generate interchromosomal connections within the assembly graph. 
Comparative genomic alignment with the human T2T reference (CHM13) revealed that the chromosome-11–specific chr11-core cluster was localized within a previously described large-scale human–macaque inversion spanning the region between DPPA3, ETV6, and DDX11 (Fig. 2g). The whole-region alignment clearly delineated an orientation switch corresponding to this inversion, with the chr11-core repeat cluster positioned in close proximity to the inferred inversion boundary. Examination of local sequence features across this region revealed higher densities of single-nucleotide variants and small insertions/deletions within the inverted segment relative to the flanking intervals. In parallel, segmental duplication (SD) coverage was increased and local sequence identity reduced in the vicinity of the chr11-core array. Together, these observations indicate that the hierarchical satellite cluster resides within a structurally dynamic genomic environment. The spatial coincidence of the chr11-core satellite cluster with the inferred inversion boundary interval highlights a positional association between the hierarchical repeat architecture and regions exhibiting structural rearrangements between the human and macaque genomes.

To validate the physical existence and chromosomal distribution of these repeat units, we performed fluorescence in situ hybridization (FISH) using probes specific to the ~1.4-kb, 81-bp, and 80-bp repeats (Fig. 2h). The signal distributions across chromosomes were highly concordant with the copy number estimates derived from our T2T assemblies. A strong positive correlation was observed between the assembly-based copy number estimates and the corresponding FISH signal intensities (R = 0.78, Fig. 2i). This consistent relationship across diverse repeat classes provides orthogonal empirical support for the accuracy of our T2T-based copy number inferences, even within these structurally complex regions.
Dual-color FISH further demonstrated that L-core and S-core units co-localized but exhibited variable spatial arrangements between homologous chromosomes, consistent with the orientation polymorphism identified in diploid assemblies. Collectively, these results demonstrate that tangled-ball structures are composed of a small number of clearly definable repeat units, which form diverse combinations and arrangements across chromosomes. Although largely invisible in conventional linear reference genomes, the integration of the diploid T2T assembly with cytogenetic validation enables their systematic and accurate characterization. 2-3. Population-level conservation and geographical divergence of tangled-ball architectures Evolutionary distribution of tangled-ball architectures across primates To assess the evolutionary distribution of tangled-ball architectures, we examined their presence in existing primate reference genomes. Structured satellite clusters corresponding to tangled-balls were identifiable in the M. fascicularis (MFA8v1.1) and M. mulatta (MMU8v2.0) assemblies (Fig. 3a). In contrast, comparable hierarchical arrangements were not detected in the human T2T-CHM13 assembly, indicating a lineage-specific structural differentiation between macaques and humans. Systematic prevalence across 24 haplotypes Analysis of 12 individuals (24 haplotypes) confirmed the widespread presence of tangled-ball structures across all three geographic origins (Indonesia, Malaysia, and the Philippines) (Fig. 3b, Supplementary Fig 3a). The chromosome-11–associated inversion at the ETV6 locus, linked to the chr11-core cluster, was consistently observed in all haplotypes examined (Fig. 3c, Supplementary Fig 3b), supporting its stability as a species-wide structural feature. Haplotype-resolved RepeatMasker profiling revealed that repeat composition is largely conserved at the chromosomal scale but varies in its degree of inter-individual divergence. 
Certain interspersed elements (e.g., LINEs and SINEs) differed across chromosomes yet exhibited relatively limited variability between individuals. In contrast, satellite repeats and retroposon families showed marked inter-individual heterogeneity, as evidenced by heatmap and coefficient-of-variation analyses. (Supplementary Fig. 3c, d) Quantification of haplotype-specific orientation polymorphism To quantify repeat orientation, we defined a directionality index based on the relative genomic start positions of the L-core and S-core arrays along each chromosome arm, measured from the telomere-proximal end. Specifically, the index was calculated as (starting position of L-core – starting position of S-core). Negative values therefore indicate the canonical telomere-proximal configuration in which the L-core precedes the S-core (L-core → S-core), whereas positive values indicate the reversed configuration (S-core → L-core) (Fig. 3d, e Supplementary Fig 3e). Haplotypes exhibiting start-position offsets exceeding ±2 standard deviations from the chromosome-specific median were detected across multiple chromosomes (Fig. 3e). This observation indicates that the orientation polymorphism is accompanied by substantial shifts in the relative L-core and S-core positioning, rather than representing minor coordinate variations. Consistent with the assembly-based inference, dual-color FISH analyses confirmed that both L-core → S-core and S-core → L-core configurations can be observed cytogenetically, with reversed arrangements detected on homologous chromosomes in selected individuals (Supplementary Fig. 3f). This independent validation reinforced the robustness of diploid T2T reconstruction in resolving fine-scale orientation polymorphisms within repetitive genomic regions. 
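The directionality index and the ±2 SD criterion described above can be sketched as follows. This is a minimal illustration under stated assumptions: start positions are already expressed as distances from the telomere-proximal end of the chromosome arm, the sample standard deviation is used (the text does not say which estimator was applied), and the haplotype labels and coordinates are invented for the example.

```python
from statistics import median, stdev
from typing import Dict, List


def directionality_index(l_core_start: int, s_core_start: int) -> int:
    """Index = (start of L-core) - (start of S-core), telomere-proximal coordinates."""
    return l_core_start - s_core_start


def orientation(index: float) -> str:
    """Negative index: L-core precedes S-core; positive index: reversed order."""
    if index < 0:
        return "L-core -> S-core"
    if index > 0:
        return "S-core -> L-core"
    return "undetermined"


def flag_outliers(indices: Dict[str, float], n_sd: float = 2.0) -> List[str]:
    """Flag haplotypes deviating more than n_sd SDs from the per-chromosome median.

    `indices` maps a haplotype label to its index on ONE chromosome, so the
    median and SD are chromosome-specific. Needs at least two haplotypes.
    """
    values = list(indices.values())
    center = median(values)
    spread = stdev(values)  # assumption: sample SD
    return sorted(h for h, v in indices.items() if abs(v - center) > n_sd * spread)


if __name__ == "__main__":
    # Hypothetical indices for ten haplotypes on a single chromosome.
    toy = {f"hap{i}": -100.0 for i in range(9)}
    toy["hap9"] = 900.0
    for hap in flag_outliers(toy):
        print(hap, orientation(toy[hap]))
```

Centering on the median rather than the mean keeps a single reversed haplotype from shifting the reference point it is compared against.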
Fine-scale sequence divergence and geographical stratification

Although the higher-order architecture of the tangled-balls was conserved, the sequence-level variation within the L-core unit revealed measurable geographic stratification (Fig. 3f, Supplementary Fig. 3g, h). Haplotypes derived from the Philippines formed a cohesive cluster distinct from the Indonesian and Malaysian cohorts, indicating regional differentiation in satellite sequence composition despite the conservation of the overall architectural organization. Together, these findings demonstrate that tangled-ball architectures are conserved genomic features within macaques that nevertheless exhibit population-level sequence diversification and haplotype-specific structural variability.

2-4. Results: The Comprehensive Landscape of MHC Diversity and its Heritability

High-resolution architecture of 24 Mafa-MHC haplotypes

Genome-wide comparison of T2T haplotypes (Fig. 3) revealed that structural diversity was not uniformly distributed across the genome but instead accumulated within specific genomic regions. Among these, the major histocompatibility complex (MHC), which plays a central role in immune responses, emerged as the locus exhibiting the most pronounced haplotype-level structural diversity. Comprehensive structural comparisons of this locus at diploid, haplotype-resolved resolution have remained limited. Using 24 haplotypes derived from 12 individuals assembled by T2T sequencing, we systematically analyzed the MHC region, defined as the interval spanning GABBR1 to KIFC1. IPD-registered Mafa-MHC genes [23, 24, 25] were mapped onto each haplotype, enabling hierarchical dissection of the MHC architecture at the levels of genomic compartments, gene blocks, and individual genes (Fig. 4). At the compartment level, the overall Class IA–IB–II organization was conserved across all haplotypes; however, the physical lengths of individual compartments differed by approximately 0.5–1 Mb (Fig.
4a, Supplementary Fig. 4a). No significant correlation was observed between compartment sizes (Supplementary Fig. 4b), indicating that major MHC components varied independently. Thus, the MHC is composed of multiple independent architectural modules, each undergoing distinct structural variations. At the gene-block level, the order and relative arrangement of the gene blocks differed substantially among haplotypes (Fig. 4a, Supplementary Fig. 4c). These differences reflect not only copy-number changes but also block-scale rearrangements contributing to structural diversification. No clear population-wide clustering of global structural patterns was detected, although locus-specific trends were observed at certain genes (e.g., Mafa-A block; Kruskal–Wallis test, Supplementary Fig. 4c, d). At the individual gene level, marked copy-number polymorphisms and presence–absence variations were observed in Class I genes such as Mafa-A and Mafa-B as well as in non-classical genes including Mafa-E. In contrast, Mafa-DRA was conserved across all haplotypes, whereas Mafa-DRB and DP/DQ loci displayed substantial copy-number variation (Fig. 4b). Even genes shared across all haplotypes showed variations in copy number, genomic position, and gene length (Fig. 4b, c). Together, these findings demonstrate that MHC architecture exhibits multi-layered structural heterogeneity organized across three hierarchical levels. Structural differences previously collapsed or obscured in haploid reference assemblies were resolved here as haplotype-specific architectural frameworks.

Pedigree validation: Stable inheritance of complex architectures

To determine whether this structural diversity represents stable biological inheritance, we analyzed father–offspring transmission in a pedigree comprising one father and three offspring from different mothers. Landscape visualization (Fig.
4a, 4d) showed that one paternal haplotype was structurally preserved in each offspring with high fidelity. Consistently, SyRI analysis identified strictly syntenic regions between the corresponding paternal and inherited haplotypes, with no evidence of misassembly or artifactual rearrangement (Supplementary Fig 4). This high-resolution alignment, enabled by the gapless T2T assembly, confirmed the absence of even minor structural discrepancies. In contrast, comparisons involving non-inherited haplotypes exhibited local inversions and duplications (Supplementary Fig 4e). At the allelic level, gene calls were 100% concordant up to eight-digit allelic resolution (Fig. 4d, Supplementary Fig 4f), demonstrating high-fidelity intergenerational transmission of MHC haplotypes despite their structural complexity. These results establish that the observed structural heterogeneity is not an assembly artifact but represents stable, heritable haplotype-level diversity. This work provides a foundation for dissecting how complex immunogenetic landscapes shape individual immune responses.

3. Discussion

The completion of 24 diploid telomere-to-telomere (T2T) reference genomes established a foundational resource for understanding the structural diversity in the crab-eating macaque, a key non-human primate model in biomedical research. Although recent efforts have generated complete macaque genomes and extended T2T sequencing across primate lineages [6, 7], our results demonstrate that diploid-level resolution is essential for accurately capturing the genomic complexity of Macaca fascicularis. By resolving both haplotypes in 12 individuals, we showed that genomic features previously collapsed or dismissed as assembly “noise” instead represent substantial structural non-equivalence between homologous chromosomes.
This observation parallels the advances in human diploid genomics, where haplotype-resolved assemblies have become indispensable for the comprehensive characterization of complex structural variations [19, 20]. A clear example of such previously hidden complexity is the “tangled-balls” architecture identified in this study. Manual deconvolution identified a hierarchical repeat organization composed of 80-bp, 81-bp, and ~1.4-kb repeat units, representing a previously unrecognized layer of genome architecture. The haplotype-specific differences in repeat composition, copy number, and spatial arrangement, independently validated by FISH, indicated that these regions possess structural flexibility that was invisible in haploid or collapsed references. The differential positioning of repeat clusters between homologous chromosomes further underscores the extent of genomic diversity that becomes apparent only through complete diploid resolution. Genome-wide comparisons at diploid T2T resolution further demonstrate that structural diversity is not uniformly distributed but concentrates within discrete genomic regions. Among them, the major histocompatibility complex (MHC) stands out as the locus exhibiting the most pronounced haplotype-level divergence. These observations reinforce the long-standing view that immune-related loci are major repositories of sequence and structural variations, while also highlighting the limitations of previous reference genomes in resolving such complexity. Resolving the Mafa-MHC at diploid T2T resolution establishes an essential reference for precision immunogenomics. While prior high-quality macaque assemblies improved structural completeness [6], continuous reconstruction of fully resolved diploid MHC haplotypes at long-read T2T resolution across multiple geographically diverse individuals had not previously been achieved at this scale.
Consequently, systematic evaluation of structural non-equivalence between homologous chromosomes has remained limited. Although sequencing analysis of MHC-homozygous Mauritian macaques yielded the first complete Mafa-MHC sequence [26], the strong founder effect in this population constrains its representation of natural structural diversity and inheritance patterns [27, 28, 29]. In contrast, crab-eating macaques display high genetic diversity, defined population structure, and variable degrees of introgression from rhesus macaques, complicating interpretation of MHC variation in existing genomic resources [29, 30, 31, 32]. By analyzing individuals from multiple geographic origins and resolving MHC haplotypes, our study enabled a direct comparison of MHC structural diversity without excessive simplification. This framework preserves biologically meaningful diversity while maintaining experimental reproducibility, thereby enhancing the translational relevance of this model. Notably, structural diversification across MHC regions is asymmetric. Certain loci, such as Mafa-DRA, remained structurally conserved across haplotypes, consistent with their essential roles in antigen presentation. In contrast, Mafa-DRB genes and several Class I loci exhibited extensive copy-number and positional variability. This diversification supports a modular model of MHC organization, in which independent architectural units evolve under differential selective constraints, balancing structural conservation with adaptive flexibility. Notably, the non-classical class I molecule MHC-E (Mafa-E) exhibits both copy-number variation and population-specific differences, consistent with geographically variable selective pressures. The extensive variability observed at the Mafa-DRB locus closely parallels recent human HLA pangenome studies, in which the DRB region emerges as a hotspot of copy-number diversification and haplotype-specific architectural remodeling [20].
This convergence across species suggests that DRB structural plasticity may represent an evolutionarily conserved feature of primate MHC organization. These findings align with emerging human HLA pangenome analyses emphasizing the importance of haplotype-specific architectures in shaping immune responses and disease susceptibility [20]. Importantly, non-human primate models uniquely allow the experimental validation of immunogenetic hypotheses that are difficult to test directly in humans. MHC-E ( Mafa-E in M. fascicularis ), a non-classical class I molecule with relatively low allelic polymorphism [33], mediates protection against SIV infection in CMV-vectored vaccine models [34,35] and has been implicated in diverse infectious and immunological contexts [36, 37, 38,39]. Yet in our dataset, Mafa-E displayed copy-number and population-level variations, underscoring that structural diversification can occur even at loci with limited sequence diversity. Our diploid T2T assemblies provide a framework to interrogate the immunological impact of such variation. Crucially, pedigree-based validation has demonstrated that these complex MHC architectures are inherited as stable, discrete units rather than representing assembly artifacts. This confirmed both the technical accuracy of the diploid T2T assemblies and the biological reality of MHC structural heterogeneity. Previous pedigree studies have focused primarily on allelic segregation and have lacked the resolution to evaluate inheritance of continuous MHC architectures. Our findings therefore establish a direct link between the complete structural haplotypes and heritable immunogenetic variation. Collectively, this study demonstrates that a transition from haploid to diploid T2T reference genomes is essential for accurately capturing immunogenetic diversity in the crab-eating macaques. 
By transforming previously collapsed representations into fully resolved haplotype frameworks, this resource provides a foundation for linking genomic structure to immune function, disease susceptibility, and therapeutic responses. As primate genomics advances toward a pangenomic paradigm, diploid T2T references will be indispensable for both fundamental immunology and translational biomedical research.

Materials and Methods

Animal experiments, sample collections

Twelve cynomolgus macaques (Macaca fascicularis; 4–20 years old) were included. Cohort 1 comprised nine animals from three geographic origins (Indonesia, Malaysia and the Philippines; n = 3 each). Cohort 2 comprised four Philippine-origin animals (one father and three offspring from different mothers), with one female included in both cohorts. Peripheral blood mononuclear cells (PBMCs) were isolated from EDTA-anticoagulated blood by Ficoll density-gradient centrifugation and cryopreserved in FBS containing 10% DMSO until use. Animals were housed at the Tsukuba Primate Research Center (NIBN) and confirmed to be seronegative for SIV, simian type D retrovirus, simian T cell lymphotropic virus, simian foamy virus, Epstein–Barr virus, cytomegalovirus and B virus. All procedures were approved by the institutional animal ethics committee (approval no. DS21-2R14).

DNA extraction and sequencing

High- and ultra-high-molecular-weight (HMW and UHMW, respectively) genomic DNA was extracted from PBMCs using a Monarch HMW DNA Extraction Kit (New England Biolabs, T3050L), following the Oxford Nanopore–recommended protocols to preserve DNA length. DNA integrity and fragment size distributions were assessed prior to library preparation. Oxford Nanopore libraries were prepared using ligation-based protocols with modified handling to minimize DNA shearing for ultra-long-read sequencing (Ultra-Long DNA Sequencing Kit V14, SQK-ULK114, for ultra-long reads; Ligation Sequencing Kit V14, SQK-LSK114, for Pore-C).
Ultra-long-read and Pore-C libraries were sequenced on the PromethION 48 platform (MinKNOW version 24.02) using R10.4.1 flow cells (FLO-PRO114). Raw signal data (POD5 format) were retained for downstream basecalling. For Pore-C, in situ chromatin crosslinking, restriction digestion (NlaIII), proximity ligation and adapter ligation were performed as previously described. In addition, high-fidelity (HiFi) long-read data were generated on the PacBio Revio system by a commercial provider (Takara Bio Inc., Japan). Further experimental details and sequencing metrics are provided in the Supplementary Information.

Diploid haplotype-resolved genome assembly

Oxford Nanopore basecalling
Ultra-long Oxford Nanopore (ONT) sequencing data were basecalled from raw POD5 signal files using Dorado (v0.9.1) with the super-accuracy (sup) model. Basecalled reads were generated in BAM format and used as ultra-long ONT reads for downstream analyses.

ONT read error correction
Ultra-long ONT reads were error-corrected using hifiasm (v0.25.0-r726) in ONT correction mode (-e --ont --write-ec). Corrected reads were exported for downstream assembly.

Graph-based diploid assembly using Verkko
Diploid genome assembly was performed using Verkko (v2.2). Four data types were provided: raw Oxford Nanopore ultra-long reads, raw PacBio HiFi reads, error-corrected ONT ultra-long reads, and Pore-C proximity ligation reads. To maximize graph connectivity, the error-corrected ONT ultra-long reads were combined with the PacBio HiFi reads and used during the initial assembly graph construction. This combined read set, together with raw ONT ultra-long reads for repeat resolution and Pore-C proximity ligation reads for haplotype phasing, was integrated into the Verkko pipeline. No external reference genome was used during contig construction. No reference-guided correction, sequence replacement, or short-read polishing was performed prior to scaffolding.
Long-range chromatin contact information from Pore-C reads was used during graph resolution to assist haplotype phasing. Assembly graphs were generated by Verkko and resolved into phased haplotypes using the default pipeline.

Reference-guided chromosome scaffolding
Chromosome-scale scaffolding was performed using RagTag (v2.1.0) with the T2T-MFA8v1.1 assembly as a structural guide. Reference-guided scaffolding was used exclusively for contig ordering and orientation to enable consistent chromosome naming and coordinate alignment across individuals. No sequence replacement, structural correction, or gap filling from the reference genome was performed. Gaps introduced during scaffolding were represented by stretches of “N” characters. The contig-level sequence content remained unchanged from the de novo assembly stage. This workflow ensures that contig-level sequences are fully reference-independent, and that reference information is introduced only at the scaffolding stage for coordinate standardization.

Assembly quality assessment
Assembly continuity was evaluated at the contig level using N50, NG50 and auN metrics. Gene completeness was assessed using compleasm (v0.2.7) with the primate_odb12 lineage dataset. Genome-wide structural concordance with T2T-MFA8v1.1 was assessed using SyRI (v1.7.0) to confirm large-scale synteny and preservation of haplotype-specific structural features. Across all haplotypes, the continuity and completeness metrics were comparable to those of T2T-MFA8v1.1, supporting the suitability of these assemblies for diploid structural analyses rather than as replacement reference genomes. NG statistics were calculated relative to the haploid genome size of T2T-MFA8v1.1 (3,060,038,958 bp).

Identification and characterization of hierarchical satellite (“tangled-ball”) architectures

Assembly graph inspection and extraction of entangled nodes
Assembly graphs generated by Verkko were exported in the GFA format.
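A GFA export of this kind can be screened programmatically for segments whose links reach contigs assigned to two or more chromosomes, which is the criterion this subsection uses for high-degree nodes. The following is a minimal Python sketch under stated assumptions, not the authors' pipeline: it assumes GFA version 1 with tab-separated `S` and `L` records, and the function names, the toy graph and the segment-to-chromosome mapping `node_to_chrom` are all hypothetical.

```python
def parse_gfa_links(gfa_text):
    """Build an undirected adjacency map {segment: set(neighbours)} from GFA1 text."""
    adjacency = {}
    for line in gfa_text.splitlines():
        fields = line.split("\t")
        if fields[0] == "S" and len(fields) >= 2:
            # Register the segment even if no link ever mentions it.
            adjacency.setdefault(fields[1], set())
        elif fields[0] == "L" and len(fields) >= 5:
            # L <from> <from_orient> <to> <to_orient> <overlap>
            src, dst = fields[1], fields[3]
            adjacency.setdefault(src, set()).add(dst)
            adjacency.setdefault(dst, set()).add(src)
    return adjacency


def multi_chromosome_nodes(adjacency, node_to_chrom, min_chroms=2):
    """Return {segment: sorted chromosomes} for segments whose direct neighbours
    are assigned to at least `min_chroms` distinct chromosomes."""
    flagged = {}
    for node, neighbours in adjacency.items():
        chroms = {node_to_chrom[n] for n in neighbours if n in node_to_chrom}
        if len(chroms) >= min_chroms:
            flagged[node] = sorted(chroms)
    return flagged


# Toy graph (hypothetical): utig1 links to segments placed on chr1 and chr11.
EXAMPLE_GFA = "\n".join([
    "S\tutig1\t*",
    "S\tutig2\t*",
    "S\tutig3\t*",
    "S\tutig4\t*",
    "L\tutig1\t+\tutig2\t+\t0M",
    "L\tutig1\t+\tutig3\t-\t0M",
    "L\tutig3\t+\tutig4\t+\t0M",
])
# Hypothetical chromosome assignment of segments (utig1 itself is unassigned).
EXAMPLE_ASSIGNMENT = {"utig2": "chr1", "utig3": "chr11", "utig4": "chr11"}

high_degree = multi_chromosome_nodes(parse_gfa_links(EXAMPLE_GFA), EXAMPLE_ASSIGNMENT)
```

In this toy graph only `utig1` is flagged, because its neighbours sit on two different chromosomes. How segments were assigned to chromosomes in the study is not specified here, so the mapping is left as an input.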
Graph visualization was performed using Bandage (v0.9.0). To systematically identify the graph regions exhibiting multi-chromosomal convergence, we quantified the node degree within the assembly graph. Nodes connected to contigs originating from two or more distinct chromosomes were defined as high-degree nodes. The corresponding genomic sequences were extracted from the phased contigs for downstream analysis.

Repeat unit delineation
The extracted sequences were analyzed for internal periodicity using tandem repeat detection and self-alignment analyses. Putative repeat units were defined based on consistent internal periodicity, local sequence homogeneity, and recurrent occurrence across multiple high-degree graph regions. This procedure identified three principal repeat units, an 81-bp unit (L-core), an 80-bp unit (S-core) and a 1.4-kb unit (L-extension), together with a chromosome-11–specific 80-bp subtype (chr11-core). Repeat boundaries were determined based on conserved motif transitions observed at multiple independent loci.

Genome-wide detection of repeat units
The consensus sequences of the L-core, S-core, and chr11-core were used as queries for genome-wide searches using the Biostrings package in R (v4.3.2), with the matchPattern and vmatchPattern functions and the maximum mismatch parameters set as follows. For the short core units (80–81 bp), pattern matching allowed up to five mismatches, including insertions and deletions, corresponding to approximately 6% sequence divergence. For the longer L-extension (~1.4 kb), a proportional threshold was applied, allowing up to 88 mismatches (~6% divergence relative to the unit length), including insertions and deletions, to maintain a consistent divergence tolerance across repeat classes. Copy numbers were calculated per chromosome and per haplotype.
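The edit tolerance described above can be illustrated with a small approximate matcher. This is a minimal Python sketch of semi-global edit-distance matching (Sellers' algorithm), not the Biostrings code used in the study. The function names, the toy sequences and the 1,400-bp length assumed for the ~1.4-kb L-extension are illustrative assumptions.

```python
def divergence_fraction(max_edits, unit_length):
    """Edit budget expressed as a fraction of the repeat unit length."""
    return max_edits / unit_length


def approx_match_ends(pattern, text, max_edits):
    """Return (end, edits) for every 1-based text position where some substring
    ending there matches `pattern` within `max_edits` edits
    (substitutions, insertions and deletions)."""
    m = len(pattern)
    # Column for the empty text prefix: aligning pattern[:i] to nothing costs i.
    prev = list(range(m + 1))
    hits = []
    for end, base in enumerate(text, start=1):
        # Row 0 stays 0 so that a match may start anywhere in the text.
        curr = [0] * (m + 1)
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == base else 1
            curr[i] = min(
                prev[i - 1] + cost,  # match or substitution
                prev[i] + 1,         # extra base in the text
                curr[i - 1] + 1,     # pattern base missing from the text
            )
        if curr[m] <= max_edits:
            hits.append((end, curr[m]))
        prev = curr
    return hits


# Thresholds quoted in the text: 5 edits for an 81-bp core (5 / 81 ≈ 0.062) and
# 88 edits for the L-extension, assumed here to be exactly 1,400 bp (≈ 0.063).
core_divergence = divergence_fraction(5, 81)
extension_divergence = divergence_fraction(88, 1400)

# Toy query and subjects (hypothetical sequences, not the real consensus units).
exact_hits = approx_match_ends("ACGTACGT", "TTTACGTACGTTTT", 0)
one_deletion_hits = approx_match_ends("ACGTACGT", "TTTACGACGTTTT", 1)
```

Because every qualifying end position is reported, a single repeat copy can yield several adjacent hits once edits are allowed. A copy-number count would therefore also need to merge overlapping hits; how this was handled in the study is not stated in this section.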
Repeat annotation and comparison with canonical satellite families
Repeat annotation was performed using RepeatMasker (v4.2.3) with the Dfam 3.9 database. Annotated SATR1-, SATR1v-, and SATR2-related sequences were extracted for comparative analysis.

MHC structural annotation
For each haplotype, GABBR1 and KIFC1 on chromosome 4 were aligned using Minimap2 (v2.30-r1287), and the genomic interval flanked by these two genes was extracted and defined as the MHC region. Reference alleles (n = 3299) were obtained from the Immuno Polymorphism Database for non-human primate MHC (IPD-MHC NHP, release 3.16.0.0 (2026-01), build 231). Sequences were restricted to Mafa alleles and aligned to the extracted region using Minimap2. Manual curation was performed to define gene boundaries and structural blocks and to assess allele length variation and copy number. Structural rearrangements within the MHC region were further evaluated using SyRI based on whole-region alignments. Detailed annotation criteria, allele filtering parameters, copy-number determination methods and structural comparison parameters are provided in the Supplementary Information.

Fluorescence in situ hybridization (FISH)
Fluorescence in situ hybridization (FISH) was performed by a commercial service provider (Chromosome Science Labo Inc., Japan) to validate chromosomal rearrangements and structural variations. The cryopreserved cells were cultured and synchronized prior to metaphase arrest, followed by hypotonic treatment and methanol–acetic acid fixation to prepare chromosome spreads. Replication R-banding was generated using Hoechst 33258 staining and ultraviolet exposure. Probes were designed based on the specified target sequences. Oligonucleotide probes (<100 bp) were directly labeled, whereas longer probes were synthesized and labeled by nick translation. Probe mixtures were denatured on chromosome spreads and hybridized overnight.
After stringency washes, the slides were counterstained with DAPI and imaged using a cytogenetics workstation with a 100× objective. Detailed probe sequences, hybridization conditions and signal quantification criteria are provided in the Supplementary Information.

Declarations

Data availability
The data generated and/or analyzed in this study are available from the corresponding author upon reasonable request. The values for all data points in the graphs are reported in the Supporting Data Values file.

Acknowledgement
We thank our laboratory colleagues for their excellent technical support. We also thank HAMRI Co., Ltd. and the Corporation for Production and Research of Laboratory Primates for their support with animal experiments. The super-computing resource was provided by the Human Genome Center, the Institute of Medical Science, the University of Tokyo. This work was also supported by the supplementary budget of the National Institutes of Biomedical Innovation, Health and Nutrition (NIBN), Japan Society for the Promotion of Science Grant-in-Aid for Scientific Research (B) (JP24K02498), and the Japan Agency for Medical Research and Development (26fk0410071, JP223fa627007, JP223fa627005).

References
1. Fennessey, C.M. & Keele, B.F. Using nonhuman primates to model HIV transmission. Curr Opin HIV AIDS 8, 280–287 (2013). doi:10.1097/COH.0b013e328361cfff
2. Zhang, G.Q. et al. Characterization of the major histocompatibility complex class I A alleles in cynomolgus macaques of Vietnamese origin. Tissue Antigens 80, 494–501 (2012). doi:10.1111/tan.12024
3. Yan, G. et al. Genome sequencing and comparison of two nonhuman primate animal models, the cynomolgus and Chinese rhesus macaques. Nat Biotechnol 29, 1019–1023 (2011). doi:10.1038/nbt.1992
4. Shiina, T. et al. Rapid evolution of major histocompatibility complex class I genes in primates generates new disease alleles in humans via hitchhiking diversity. Genetics 173, 1555–1570 (2006). doi:10.1534/genetics.106.057034
5. Nurk, S. et al. The complete sequence of a human genome. Science 376, 44–53 (2022). doi:10.1126/science.abj6987
6. Zhang, S. et al. Integrated analysis of the complete sequence of a macaque genome. Nature 640, 714–721 (2025). doi:10.1038/s41586-025-08596-w
7. Yoo, D. et al. Complete sequencing of ape genomes. Nature 641, 401–418 (2025). doi:10.1038/s41586-025-08816-3
8. Hoyt, S.J. et al. From telomere to telomere: The transcriptional and epigenetic state of human repeat elements. Science 376, eabk3112 (2022). doi:10.1126/science.abk3112
9. Altemose, N. et al. Complete genomic and epigenetic maps of human centromeres. Science 376, eabl4178 (2022). doi:10.1126/science.abl4178
10. Yasutomi, Y. Establishment of specific pathogen-free macaque colonies in Tsukuba Primate Research Center of Japan for AIDS research. Vaccine 28, B75–B77 (2010).
11. Munesue, Y. et al. Cynomolgus macaque model of neuronal ceroid lipofuscinosis type 2 disease. Exp Neurol 363, 114381 (2023).
12. Ikeda, Y. et al. Discovery of a cynomolgus monkey family with retinitis pigmentosa. Invest Ophthalmol Vis Sci 59, 826–830 (2018).
13. Uchida, A. et al. Non-human primate model of amyotrophic lateral sclerosis with cytoplasmic mislocalization of TDP-43. Brain 135, 833–846 (2012).
14. Okabayashi, S. et al. Diabetes mellitus accelerates Aβ pathology in brain accompanied by enhanced GAβ generation in nonhuman primates. PLoS One 10, e0117362 (2015).
15. Koinuma, S. et al. Aging induces abnormal accumulation of Aβ in extracellular vesicle-rich fractions in nonhuman primate brain. Neurobiol Aging 106, 268–281 (2021).
16. Cheng, H. et al. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat Methods 18, 170–175 (2021). doi:10.1038/s41592-020-01056-
17. Rautiainen, M. et al. Telomere-to-telomere assembly of diploid chromosomes with Verkko. Nat Biotechnol 41, 1474–1482 (2023). doi:10.1038/s41587-023-01662-6
18. Goel, M. et al. SyRI: finding genomic rearrangements and local sequence differences from whole-genome assemblies. Genome Biol 20, 277 (2019). doi:10.1186/s13059-019-1911-0
19. Liao, W.W. et al. A draft human pangenome reference. Nature 617, 312–324 (2023). doi:10.1038/s41586-023-05896-x
20. Logsdon, G.A. et al. Complex genetic variation in nearly complete human genomes. Nature 644, 430–441 (2025). doi:10.1038/s41586-025-09140-6
21. Manni, M. et al. BUSCO: Assessing genomic data quality and beyond. Curr Protoc 1, e323 (2021). doi:10.1002/cpz1.323
22. Storer, J. et al. The Dfam community resource of transposable element families, sequence models, and genome annotations. Mob DNA 12, 2 (2021). doi:10.1186/s13100-020-00230-y
23. Robinson, J. et al. The IPD and IMGT/HLA database: allele variant databases. Nucleic Acids Res 43, D423–D431 (2015). doi:10.1093/nar/gku1161
24. Maccari, G. et al. IPD-MHC 2.0: an improved inter-species database for the study of the major histocompatibility complex. Nucleic Acids Res 45, D860–D864 (2017). doi:10.1093/nar/gkw1050
25. Maccari, G. et al. The 2024 IPD-MHC database update: a comprehensive resource for major histocompatibility complex studies. Nucleic Acids Res 53, D457–D461 (2025). doi:10.1093/nar/gkae932
26. Karl, J.A. et al. Complete sequencing of a cynomolgus macaque major histocompatibility complex haplotype. Genome Res 33, 448–462 (2023). doi:10.1101/gr.277429.122
27. Osada, N. et al. Whole-genome sequencing of six Mauritian Cynomolgus macaques (Macaca fascicularis) reveals a genome-wide pattern of polymorphisms under extreme population bottleneck. Genome Biol Evol 7, 821–830 (2015). doi:10.1093/gbe/evv033
28. Ogawa, L.M. & Vallender, E.J. Genetic substructure in cynomolgus macaques (Macaca fascicularis) on the island of Mauritius. BMC Genomics 15, 748 (2014). doi:10.1186/1471-2164-15-748
29. Shiina, T. & Blancher, A. The Cynomolgus Macaque MHC polymorphism in experimental medicine. Cells 8, 978 (2019). doi:10.3390/cells8090978
30. De Groot, N.G. et al. Dynamic evolution of MHC haplotypes in cynomolgus macaques of different geographic origins. Immunogenetics 74, 409–429 (2022). doi:10.1007/s00251-021-01249-y
31. Shortreed, C.G. et al. Characterization of 100 extended major histocompatibility complex haplotypes in Indonesian cynomolgus macaques. Immunogenetics 72, 225–239 (2020). doi:10.1007/s00251-020-01159-5
32. Doxiadis, G.G. et al. Extensive sharing of MHC class II alleles between rhesus and cynomolgus macaques. Immunogenetics 58, 259–268 (2006). doi:10.1007/s00251-006-0083-8
33. Knapp, L.A., Cadavid, L.F. & Watkins, D.I. The MHC-E locus is the most well conserved of all known primate class I histocompatibility genes. J Immunol 160, 189–196 (1998).
34. Hansen, S.G. et al. Broadly targeted CD8+ T cell responses restricted by major histocompatibility complex E. Science 351, 714–720 (2016). doi:10.1126/science.aac9475
35. Malouli, D. et al. Cytomegaloviral determinants of CD8+ T cell programming and RhCMV/SIV vaccine efficacy. Sci Immunol 6, eabg5413 (2021). doi:10.1126/sciimmunol.abg5413
36. Hansen, S.G. et al. Prevention of tuberculosis in rhesus macaques by a cytomegalovirus-based vaccine. Nat Med 24, 130–143 (2018). doi:10.1038/nm.4473
37. Murugesan, G. et al. Viral sequence determines HLA-E-restricted T cell recognition of hepatitis B surface antigen. Nat Commun 15, 10126 (2024). doi:10.1038/s41467-024-54378-9
38. Vietzen, H. et al. Ineffective control of Epstein-Barr-virus-induced autoimmunity increases the risk for multiple sclerosis. Cell 186, 5705–5718.e13 (2023). doi:10.1016/j.cell.2023.11.015
39. Iyer, R.F. et al. CD8+ T cell targeting of tumor antigens presented by HLA-E. Sci Adv 10, eadm7515 (2024). doi:10.1126/sciadv.adm7515

Additional Declarations
There is NO Competing Interest.
Supplementary Files
FigureSuppl202600312.pdf: Supplemental Figures 1–4, Tables 1 and 2
{"props":{"pageProps":{"initialData":{"identity":"rs-9105354","acceptedTermsAndConditions":true,"allowDirectSubmit":false,"archivedVersions":[],"articleType":"Article","associatedPublications":[],"authors":[{"id":623800002,"identity":"c0bf3a34-2576-48bd-bd7d-9dc1798d416f","order_by":0,"name":"Seiya Imoto","email":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAZAAAAAyAQMAAABI0h/eAAAABlBMVEX///8AAABVwtN+AAAACXBIWXMAAA7EAAAOxAGVKw4bAAAA8ElEQVRIiWNgGAWjYBACNjBZIQHjJ0BpHrxaGBsYzkC1HCBGCwNIC2MbA7oWPIBP7PDxBx/nWUTLN/AYMH+oSEvsZ2B++IFB5g5uh0mnJTbO3CaRu+EAjwHDgTM5iTMb2IwlGHie4dGSY9jMC9LCwGP+42BbReKGAwxmQL8cJqBljkTufKDDGCBa2L8RoaVBIrfhAFhLDlALDyFb0hJnzjgGdNhhtgKGM2fSjGc28xRLJODxi/zs5AMfPtTU5c5vb97AUFGRLNvP3r7xw8ce3CGGAMwQyrEBxEjsOUCEFiiwh1A/SNAyCkbBKBgFwx0AADz2Uf2QK5YeAAAAAElFTkSuQmCC","orcid":"https://orcid.org/0000-0002-2989-308X","institution":"The University of 
Tokyo","correspondingAuthor":true,"prefix":"","firstName":"Seiya","middleName":"","lastName":"Imoto","suffix":""},{"id":623800003,"identity":"8d720bd7-cc91-449b-a279-bd68e518a386","order_by":1,"name":"Takuya Yamamoto","email":"","orcid":"https://orcid.org/0000-0003-3753-1211","institution":"National Institutes of Biomedical Innovation, Health and Nutrition","correspondingAuthor":false,"prefix":"","firstName":"Takuya","middleName":"","lastName":"Yamamoto","suffix":""},{"id":623800004,"identity":"d66db59a-4e8c-490c-ad4f-d33282a620a8","order_by":2,"name":"Yusuke Nakamura","email":"","orcid":"","institution":"National Institute of Biomedical Innovation, Health and Nutrition (NIBN)","correspondingAuthor":false,"prefix":"","firstName":"Yusuke","middleName":"","lastName":"Nakamura","suffix":""},{"id":623800005,"identity":"30c4ae63-52b5-4bd9-bd7f-f777bacaccf5","order_by":3,"name":"Shokichi Takahama","email":"","orcid":"","institution":"National Institute of Biomedical Innovation, Health and Nutrition (NIBN)","correspondingAuthor":false,"prefix":"","firstName":"Shokichi","middleName":"","lastName":"Takahama","suffix":""},{"id":623800006,"identity":"aa22678c-2e69-49f5-8081-f830132b52ea","order_by":4,"name":"Kazuma Kiyotani","email":"","orcid":"https://orcid.org/0000-0002-9236-9061","institution":"National Institutes of Biomedical Innovation, Health and Nutrition (NIBN)","correspondingAuthor":false,"prefix":"","firstName":"Kazuma","middleName":"","lastName":"Kiyotani","suffix":""},{"id":623800007,"identity":"b6dcc372-5a25-4fdb-8e69-1bdd3513e424","order_by":5,"name":"Toyomasa Katagiri","email":"","orcid":"","institution":"National Institutes of Biomedical Innovation, Health and Nutrition (NIBN)","correspondingAuthor":false,"prefix":"","firstName":"Toyomasa","middleName":"","lastName":"Katagiri","suffix":""},{"id":623800008,"identity":"79a0a6c9-1eba-4ff2-b041-6182a7d66588","order_by":6,"name":"Emiko Urano","email":"","orcid":"","institution":"Tsukuba Primate Research Center, 
National Institutes of Biomedical Innovation, Health and Nutrition (NIBN)","correspondingAuthor":false,"prefix":"","firstName":"Emiko","middleName":"","lastName":"Urano","suffix":""},{"id":623800009,"identity":"3d5fac09-10fa-4220-b396-334699ef0772","order_by":7,"name":"Nobuhiro Shimozawa","email":"","orcid":"","institution":"Tsukuba Primate Research Center, National Institutes of Biomedical Innovation, Health and Nutrition (NIBN)","correspondingAuthor":false,"prefix":"","firstName":"Nobuhiro","middleName":"","lastName":"Shimozawa","suffix":""},{"id":623800010,"identity":"e17e2582-1097-4587-9ff5-6a56ea6ab176","order_by":8,"name":"Naohide Ageyama","email":"","orcid":"","institution":"Tsukuba Primate Research Center, National Institutes of Biomedical Innovation, Health and Nutrition (NIBN)","correspondingAuthor":false,"prefix":"","firstName":"Naohide","middleName":"","lastName":"Ageyama","suffix":""},{"id":623800011,"identity":"09a1b04b-9444-4b6a-aab7-6e13f2c0e983","order_by":9,"name":"Yasuhiro Yasutomi","email":"","orcid":"","institution":"Tsukuba Primate Research Center, National Institutes of Biomedical Innovation, Health and Nutrition (NIBN)","correspondingAuthor":false,"prefix":"","firstName":"Yasuhiro","middleName":"","lastName":"Yasutomi","suffix":""},{"id":623800012,"identity":"89d456f3-bdcf-4dca-812a-8e2eaf446125","order_by":10,"name":"Kotoe Katayama","email":"","orcid":"https://orcid.org/0000-0002-3966-1231","institution":"The University of Tokyo","correspondingAuthor":false,"prefix":"","firstName":"Kotoe","middleName":"","lastName":"Katayama","suffix":""},{"id":623800013,"identity":"f5c3d4ad-1219-478d-b4cd-bce1fcd3bba7","order_by":11,"name":"Yoshimasa Ono","email":"","orcid":"","institution":"Institute of Medical Science, University of Tokyo","correspondingAuthor":false,"prefix":"","firstName":"Yoshimasa","middleName":"","lastName":"Ono","suffix":""},{"id":623800014,"identity":"81d427bb-5434-4cc6-8c2f-0fec92c19788","order_by":12,"name":"Noriaki 
Sato","email":"","orcid":"","institution":"Institute of Medical Science, University of Tokyo","correspondingAuthor":false,"prefix":"","firstName":"Noriaki","middleName":"","lastName":"Sato","suffix":""},{"id":623800015,"identity":"e08f9bdd-aa14-4a9d-a29c-77102605ce30","order_by":13,"name":"Takayoshi Hyugaji","email":"","orcid":"","institution":"M\u0026D Data Science Center, Institute of Integrated Research, Institute of Science Tokyo","correspondingAuthor":false,"prefix":"","firstName":"Takayoshi","middleName":"","lastName":"Hyugaji","suffix":""},{"id":623800016,"identity":"c5359a50-f03f-4e36-a3aa-debdb898cb07","order_by":14,"name":"Hiroko Tanaka","email":"","orcid":"https://orcid.org/0000-0001-9634-8922","institution":"Tokyo Medical and Dental University","correspondingAuthor":false,"prefix":"","firstName":"Hiroko","middleName":"","lastName":"Tanaka","suffix":""},{"id":623800017,"identity":"028c569b-85f6-465e-bf38-cfa4d0ba715b","order_by":15,"name":"Takanori Hasegawa","email":"","orcid":"https://orcid.org/0000-0001-7251-9950","institution":"Tokyo Medical and Dental University","correspondingAuthor":false,"prefix":"","firstName":"Takanori","middleName":"","lastName":"Hasegawa","suffix":""},{"id":623800018,"identity":"e0ebf764-d2cf-4d05-b33c-1279ae843328","order_by":16,"name":"Satoru Miyano","email":"","orcid":"https://orcid.org/0000-0002-1753-6616","institution":"M\u0026D Data Science Center, Institute of Integrated Research, Institute of Science Tokyo","correspondingAuthor":false,"prefix":"","firstName":"Satoru","middleName":"","lastName":"Miyano","suffix":""},{"id":623800019,"identity":"2d9bd13d-36c7-44cc-87c5-90363b857575","order_by":17,"name":"Nicholas Ong","email":"","orcid":"","institution":"Oxford Nanopore Technologies plc ","correspondingAuthor":false,"prefix":"","firstName":"Nicholas","middleName":"","lastName":"Ong","suffix":""},{"id":623800020,"identity":"d95610d6-c75e-42d5-a063-c98470f8d14a","order_by":18,"name":"Lin 
Yang","email":"","orcid":"","institution":"Oxford Nanopore Technologies plc ","correspondingAuthor":false,"prefix":"","firstName":"Lin","middleName":"","lastName":"Yang","suffix":""},{"id":623800021,"identity":"89b8f941-f773-498f-9848-56a7fde23a45","order_by":19,"name":"Miles Benton","email":"","orcid":"https://orcid.org/0000-0003-3442-965X","institution":"Oxford Nanopore Technologies plc ","correspondingAuthor":false,"prefix":"","firstName":"Miles","middleName":"","lastName":"Benton","suffix":""},{"id":623800022,"identity":"c6db699c-3324-438a-8bf9-44bf1f989fe1","order_by":20,"name":"Lihye Kim","email":"","orcid":"","institution":"Oxford Nanopore Technologies plc ","correspondingAuthor":false,"prefix":"","firstName":"Lihye","middleName":"","lastName":"Kim","suffix":""},{"id":623800023,"identity":"cc92ba42-9974-4a4f-9b30-5159cf0a4d3d","order_by":21,"name":"Ratna Sariyatun","email":"","orcid":"","institution":"National Institutes of Biomedical Innovation, Health and Nutrition","correspondingAuthor":false,"prefix":"","firstName":"Ratna","middleName":"","lastName":"Sariyatun","suffix":""},{"id":623800024,"identity":"7ac013a6-1ecc-4479-88b8-3dbb8ace7adf","order_by":22,"name":"Yuta Nagatsuka","email":"","orcid":"","institution":"National Institutes of Biomedical Innovation, Health and Nutrition","correspondingAuthor":false,"prefix":"","firstName":"Yuta","middleName":"","lastName":"Nagatsuka","suffix":""},{"id":623800025,"identity":"e36fe614-a58a-453a-a0e3-a4878b61d3fa","order_by":23,"name":"Takuto Nogimori","email":"","orcid":"https://orcid.org/0000-0002-6011-9631","institution":"National Institutes of Biomedical Innovation, Health and Nutrition","correspondingAuthor":false,"prefix":"","firstName":"Takuto","middleName":"","lastName":"Nogimori","suffix":""}],"badges":[],"createdAt":"2026-03-12 
13:25:26","currentVersionCode":1,"declarations":"","doi":"10.21203/rs.3.rs-9105354/v1","doiUrl":"https://doi.org/10.21203/rs.3.rs-9105354/v1","draftVersion":[],"editorialEvents":[],"editorialNote":"","failedWorkflow":false,"files":[{"id":107403652,"identity":"9f4f4db4-2ee2-46dc-804a-2e14e8eb65ec","added_by":"auto","created_at":"2026-04-21 08:00:02","extension":"png","order_by":1,"title":"Figure 1","display":"","copyAsset":false,"role":"figure","size":379131,"visible":true,"origin":"","legend":"\u003cp\u003eDiploid T2T assembly of 24 crab-eating macaque haplotypes. a, Study design and assembly workflow. Twelve individuals from three geographic origins (Indonesia, \u0026nbsp;Malaysia, and the Philippines), including one father–offspring pair, were sequenced using PacBio HiFi, and \u0026nbsp;ONT ultra-long and Pore-C reads. Contig assembly and haplotype phasing were performed with Hifiasm and \u0026nbsp;Verkko, followed by chromosome-scale scaffolding with RagTag using T2T-MFA8v1.1 as a structural guide. b, Contig-level assembly continuity metrics for 24 haplotypes prior to reference-guided scaffolding. Contig \u0026nbsp;N50 values range from 142 to 179 Mb, with total assembly sizes of 2.9–3.1 Gb. c, Representative Genome-wide structural comparison between representative haplotypes and T2T MFA8v1.1 using SyRI, illustrating overall synteny and preservation of haplotype-specific structural \u0026nbsp;configurations. d, Cumulative coverage curves (NG50 and NG90) for haplotype-resolved de novo contig assemblies prior to \u0026nbsp;reference-guided scaffolding. Contigs were ordered by decreasing length, and cumulative assembly span \u0026nbsp;(Gbp) was plotted against contig length (Mbp). NG50 and NG90 were calculated relative to the haploid \u0026nbsp;genome size of T2T-MFA8v1.1 (3,060,038,958 bp). The distributions demonstrate chromosome-scale \u0026nbsp;contiguity achieved through reference-independent assembly. 
e, Concordance of scaffold auN metrics between haplotype1 (hap1) and haplotype2 (hap2) haplotypes \u0026nbsp;(Pearson’s R = 0.86), indicating balanced haplotype reconstruction. f, Genome-wide structural concordance between haplotype assemblies and T2T-MFA8v1.1. Whole-genome \u0026nbsp;alignments were performed using SyRI, and the proportion of genomic sequence classified as syntenic \u0026nbsp;versus non-syntenic is shown as a percentage of total aligned length for each haplotype. Across individuals \u0026nbsp;and geographic origins, the vast majority of genomic sequence is syntenic relative to T2T-MFA8v1.1, \u0026nbsp;indicating high large-scale structural concordance while retaining haplotype-specific structural variation. g, Average BUSCO completeness (Complete = single-copy + duplicated genes) for each haplotype assembly, \u0026nbsp;grouped by geographic origin and sex. The completeness exceeds 99% across all assemblies\u003c/p\u003e","description":"","filename":"1.png","url":"https://assets-eu.researchsquare.com/files/rs-9105354/v1/8d3a3e443b19896316b2d61d.png"},{"id":107488043,"identity":"867ceda5-6412-4ed9-9230-f01a9cee3753","added_by":"auto","created_at":"2026-04-22 02:43:22","extension":"png","order_by":2,"title":"Figure 2","display":"","copyAsset":false,"role":"figure","size":1589638,"visible":true,"origin":"","legend":"\u003cp\u003eDeep characterization of repeat unit of “tangled-balls”. a, Assembly graph visualization highlighting high-degree nodes generated by convergence of contigs \u0026nbsp;originating from multiple chromosomes. These graph entanglements reflect unresolved connections among \u0026nbsp;highly similar repeat arrays. b, Dot-plot comparisons of repeat units composing tangled-balls, including the L-core (81 bp), S-core (80 \u0026nbsp;bp), chr11-core (80 bp), and representative SATR1/SATR2-related sequences. Distinct periodicity patterns \u0026nbsp;differentiate tangled-ball units from canonical satellite arrays. 
c, Self-alignment of the ~1.4-kb L-extension repeat, illustrating its mosaic organization and internal repeat \u0026nbsp;periodicity. d, Linear organization of repeat units within representative tangled-ball loci, showing the arrangement of \u0026nbsp;the L-core, S-core, and L-extension components. e, Schematic models of large and small tangled-ball configurations derived from the repeat unit \u0026nbsp;composition and orientation. f, Genome-wide density distribution of tangled-ball repeat units across chromosomes for representative \u0026nbsp;haplotypes, , demonstrating non-random chromosomal enrichment. g, Localization of the chromosome-11–specific chr11-core cluster within a human–macaque inversion \u0026nbsp;boundary near the ETV6 locus. h, Fluorescence in situ hybridization (FISH) using probes targeting the L-extension, L-core, S-core, and \u0026nbsp;chr11-core repeat units, validating the chromosomal localization inferred from the assembly. i, Correlation between the assembly-based repeat copy number and FISH signal intensity (R = 0.78), \u0026nbsp;supporting the quantitative accuracy of T2T-based copy-number inference. 
j, Dual-color FISH analysis demonstrating co-localization of L-core and S-core units and variable spatial \u0026nbsp;organization between homologous chromosomes, consistent with haplotype-specific orientation \u0026nbsp;polymorphism.\u003c/p\u003e","description":"","filename":"2.png","url":"https://assets-eu.researchsquare.com/files/rs-9105354/v1/4ccac6ba2a99fbbbac1d3ad8.png"},{"id":107489325,"identity":"1d8c4bf1-caec-4cd9-9cec-2e281accfba7","added_by":"auto","created_at":"2026-04-22 02:47:23","extension":"png","order_by":3,"title":"Figure 3","display":"","copyAsset":false,"role":"figure","size":610891,"visible":true,"origin":"","legend":"\u003cp\u003eGenomic scale similarity and diversity of T2T-haplotypes of three different origins of \u0026nbsp;macaques a, Total copy number of L-extension (~1.4 kb), L-core (81 bp), S-core (80 bp), and chr11-core (80 bp) \u0026nbsp;repeat units across the reference assemblies. Hierarchical satellite clusters corresponding to tangled balls were detected in M. fascicularis (MFA8v1.1) and M. mulatta (MMU8v2.0) but were absent in the \u0026nbsp;human T2T-CHM13 assembly. b, Chromosome-level distribution of repeat unit copy numbers across 24 haplotypes from Indonesia, \u0026nbsp;Malaysia, and the Philippines, demonstrating the widespread presence of tangled-ball structures. c, Comparative genomic alignment of the chromosome-11 region showing conservation of a large-scale \u0026nbsp;inversion associated with the chr11-core cluster across all haplotypes. d, Directionality index quantifying the orientation of L-core and S-core arrays across chromosomes. \u0026nbsp;Negative values indicate canonical L-core → S-core orientation; positive values indicate reversed S core → L-core configuration. e, Clustering of haplotypes based on L-core sequence variants allowing up to five mismatches. \u0026nbsp;Heatmap representation of variant presence across individuals grouped by geographic origins. 
f, Unsupervised hierarchical clustering of L-core variant profiles, revealing geographic stratification, \u0026nbsp;including cohesive clustering of the Philippine individuals.\u003c/p\u003e","description":"","filename":"3.png","url":"https://assets-eu.researchsquare.com/files/rs-9105354/v1/549311ec3395fb0c1e901db8.png"},{"id":107403656,"identity":"61193cb3-07c9-4045-bdeb-7866b1b485eb","added_by":"auto","created_at":"2026-04-21 08:00:02","extension":"png","order_by":4,"title":"Figure 4","display":"","copyAsset":false,"role":"figure","size":673945,"visible":true,"origin":"","legend":"\u003cp\u003eComparative landscape of 24 Mafa-MHC haplotypes. a, Structural alignment of Mafa-MHC regions across 24 haplotypes. Alignment of MHC regions centered \u0026nbsp;on Class I (Mafa-A and \u0026nbsp;Mafa-B) and Class II loci. Class IA, Class IB and Class II compartments are \u0026nbsp;indicated by shared vertical guide lines on the right, common to all rows. Gene blocks defined at the \u0026nbsp;locus level (for example, Mafa-A, Mafa-B, and DRA) are shown as background-colored rectangles, and \u0026nbsp;individual genes are represented as dark horizontal bars. Allelic digits are not displayed in this panel. \u0026nbsp;Haplotypes were arranged by individual using facet panels; facet strip colors denote geographic \u0026nbsp;origins. b, Diploid-resolved variation in Mafa-MHC gene copy number. Heatmap of haplotype-specific gene copy \u0026nbsp;numbers at the locus-unit level (for example, Mafa-A1, Mafa-B, DRA1) identified by diploid T2T \u0026nbsp;assembly. Numerical labels indicate copy number, color intensity reflects the same value, and grey \u0026nbsp;denotes absence. c, Haplotype-specific variation in Mafa-MHC gene length. Heatmap of the genomic span for each locus \u0026nbsp;unit (for example, Mafa-A1, Mafa-B, DRA1) across haplotypes. 
Values represent gene length in base pairs \u0026nbsp;(bp), highlighting structural polymorphisms affecting locus size and organization. d, Conservation of the paternal haplotype structure among the three paternal half-siblings. Chromosome-scale alignment of one father and three offspring (from different mothers). Only the haplotype shared \u0026nbsp;between the father and each offspring is color-highlighted, whereas the other haplotype is shown in grey. \u0026nbsp;Four-digit allele resolution is displayed. The pedigree is shown on the left. Individuals are separated by \u0026nbsp;facet panels, with facet strip colors indicating their geographic origins.\u003c/p\u003e","description":"","filename":"4.png","url":"https://assets-eu.researchsquare.com/files/rs-9105354/v1/41c2ddbb3a05a5807f1092f1.png"},{"id":107490112,"identity":"c9461c7e-6692-4cc0-9b7c-6b202fe1f00c","added_by":"auto","created_at":"2026-04-22 02:50:14","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":3745681,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-9105354/v1/0f990854-0abf-4423-a930-af0665a7bc34.pdf"},{"id":107403653,"identity":"46fc13d8-c170-4d25-bd2e-0beb7d8cd2b7","added_by":"auto","created_at":"2026-04-21 08:00:02","extension":"pdf","order_by":1,"title":"","display":"","copyAsset":false,"role":"supplement","size":2513850,"visible":true,"origin":"","legend":"\u003cp\u003eSupplemental Figure1-4,table1,table2\u003c/p\u003e","description":"","filename":"FigureSuppl202600312.pdf","url":"https://assets-eu.researchsquare.com/files/rs-9105354/v1/82420b3cf0d0fdaf937f52df.pdf"}],"financialInterests":"There is \u003cb\u003eNO\u003c/b\u003e Competing Interest.","formattedTitle":"Diploid telomere-to-telomere assemblies reveal hierarchical satellite architectures and heritable MHC structural diversity in macaques","fulltext":[{"header":"1. 
Introduction","content":"\u003cp\u003eThe crab-eating macaque (\u003cem\u003eMacaca fascicularis\u003c/em\u003e) is an indispensable non-human primate model for a broad range of biomedical research, including vaccine development, immunology, and drug safety evaluation. Owing to its close immunological similarity to humans, it plays a central role in infectious disease research and in investigations of inter-individual immunological variability [1,2]. Despite its widespread use, substantial differences in immune responses and disease susceptibility among individuals exposed to identical stimuli have long been recognized, yet the genomic basis underlying this variability remains poorly understood [3].\u003c/p\u003e\n\u003cp\u003eA major contributor to this scientific gap lies in the structural limitations of previously available reference genomes. Most conventional reference assemblies have been predominantly haploid-based and are therefore unable to fully resolve highly polymorphic and tandemly repetitive regions, particularly immune-related loci such as the major histocompatibility complex (MHC). In these regions, distinct haplotypes were often represented as mosaicked or collapsed structures, with structural diversity frequently dismissed as stochastic noise [4]. As a result, biologically meaningful structural variations are averaged, simplified or obscured, fundamentally constraining our understanding of individual-level genomic diversity.\u003c/p\u003e\n\u003cp\u003eRecent advances in genome assembly technologies have begun to transform primate genomics. The completion of the human telomere-to-telomere (T2T) genome [5], which for the first time resolved complete sequences including centromeres and telomeres, marked a pivotal milestone. Parallel efforts generating complete macaque genome sequences [6] and T2T analyses across ape lineages [7] have further indicated the importance of complex structural variations in shaping phenotypic diversity. 
However, the systematic characterization of diploid-level structural diversity across multiple individuals, as well as how such diversity is transmitted across generations, remains largely unexplored [8,9].\u003c/p\u003e\n\u003cp\u003eThe Tsukuba Primate Research Center (TPRC) at the National Institutes of Biomedical Innovation, Health and Nutrition (NIBN) maintains a large colony of specific-pathogen-free (SPF) crab-eating macaques (approximately 2,000 monkeys) with seven-generation pedigree information. This resource represents an internationally important primate platform for research on infectious diseases, aging, immunity, and some inherited disorders and their models [10,11,12,13,14,15], highlighting the necessity of precise and complete genomic characterization.\u003c/p\u003e\n\u003cp\u003eLeveraging this unique resource, we generated 24 diploid telomere-to-telomere haplotypes from 12 individuals representing three geographic origins. By integrating ultra-long, high-fidelity, and chromatin-conformation sequencing data, we resolved both homologous chromosomes at the T2T scale. This diploid framework reveals previously unrecognized hierarchical repeat architectures, defines their population-level diversification, and establishes a comprehensive atlas of MHC structural diversity with direct pedigree-based validation of inheritance. Together, our findings demonstrate that diploid T2T resolution is essential for accurately capturing immunogenetic diversity in this key biomedical primate model.\u003c/p\u003e"},{"header":"2. Results","content":"\u003cp\u003e\u003cstrong\u003e2-1. 
Results: Diploid T2T Assembly of Diverse Macaque Cohorts\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eTo comprehensively capture genome-wide structural diversity at diploid resolution in crab-eating macaques, we performed de novo contig assemblies achieving telomere-to-telomere (T2T) contiguity for 12 individuals originating from three geographic regions: Indonesia (n = 3), Malaysia (n = 3), and the Philippines (n = 6) (Fig. 1a, Supplementary Table 1). The Philippine cohort included a father and three of his offspring, enabling the incorporation of both geographic diversity and pedigree structure into the study design. For each individual, PacBio HiFi reads, Oxford Nanopore Technologies (ONT) ultra-long reads, and Pore-C data were integrated (Supplementary Table 2). Assembly and haplotype phasing were conducted using Hifiasm and Verkko [16,17].\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eThe resulting haplotype-resolved contig assemblies exhibited high intrinsic continuity prior to reference-guided scaffolding. Contig N50 values ranged from 142 to 179 Mb, with total assembly sizes of 2.9–3.1 Gb (Fig. 1b), indicating near chromosome-scale contiguity achieved through de novo assembly alone. Importantly, these metrics reflect contig-level continuity prior to reference-guided scaffolding.\u003c/p\u003e\n\u003cp\u003eChromosome-scale scaffolding was subsequently performed using RagTag with the existing T2T-MFA8v1.1 assembly as a structural guide, yielding 24 haplotype-resolved genome assemblies. Reference-guided scaffolding introduced only limited scaffold-level padding and did not substantially alter underlying contig continuity.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eBalanced haplotype reconstruction and structural fidelity\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eTo assess haplotype balance, we compared the scaffold auN metrics between paternal and maternal haplotypes (hap1 and hap2) for each individual. 
We observed strong concordance (R = 0.86), with no evidence of systematic degradation in either haplotype (Fig. 1e), indicating balanced reconstruction. Genome-wide structural comparisons using SyRI further demonstrated high overall synteny with MFA8v1.1 while successfully preserving haplotype-specific structural configurations (Fig. 1c, d, f, g, Supplemental Fig 1a-e) [18,19, 20, 21]. Together, these results indicate that diploid reconstruction achieved T2T-level continuity comparable to existing references while maintaining structural integrity suitable for downstream interrogation of genomic architectures. Collectively, the 24 diploid T2T haplotypes generated in this study constituted a highly contiguous and reproducible genomic framework, providing a robust foundation for subsequent analyses of structural diversity and detailed characterization of complex loci, including immune-related regions.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e2-2. Assembly graph entanglement reflects shared high-identity satellite arrays\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eAlthough our diploid T2T assemblies preserved high overall synteny with the existing references, specific regions consistently exhibited complex connectivity patterns within the assembly graph. These structures were characterized by the convergence of sequences originating from multiple chromosomes and could not be adequately represented within the conventional linear reference framework (Fig. 2a). Such entanglements arise when highly similar repeat arrays located on different chromosomes cannot be completely disentangled during graph construction, resulting in multi-edge convergence onto shared nodes. Comparable graph complexities have been reported in satellite-rich regions during telomere-to-telomere assembly of the human genome, underscoring the intrinsic difficulty of resolving high-copy satellite DNA [5,16,17]. 
To determine the sequence basis of these entanglements, we systematically extracted and characterized the repeat units underlying the convergent graph nodes.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eTangled-balls define a hierarchical satellite architecture built from three principal repeat units\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003e\u003cbr\u003e\u0026nbsp;Detailed sequence decomposition revealed that these entangled graph regions did not represent amorphous accumulations of repeat DNA. Instead, in the representative diploid assembly examined here, they formed a highly organized hierarchical satellite architecture, which we collectively term “tangled-balls”.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eDot-plot analyses and repeat decomposition identified three principal repeat units that constitute the core elements of this architecture: an 81-bp L-core, an 80-bp S-core, and a larger ~1.4-kb repeat block (L-extension) (Fig. 2b,c). The 80-bp and 81-bp units share a conserved GC-rich internal motif with \u0026gt;90% sequence identity; however, full-length alignments revealed only moderate overall identity (62.1%), supporting their classification as distinct but closely related satellite subtypes. RepeatMasker annotation classified these sequences as SATR1v-related satellites, whereas dot-plot analyses and periodicity assessments demonstrated repeat structures that differ from canonical SATR1/SATR2 arrays (Fig. 2b,c) [22]. Mapping of repeat units within tangled-ball loci revealed that L-core arrays frequently co-occur with the ~1.4-kb L-extension repeat blocks, forming composite repeat clusters (Fig. 2d). This organization gives rise to two higher-order configurations: a large tangled-ball unit, consisting of alternating L-core arrays and L-extension blocks, and a small tangled-ball unit, composed primarily of S-core repeats (Fig. 2e). 
These observations indicate that tangled-balls represent a hierarchical satellite architecture in which short repeat units (80–81 bp) and longer repeat blocks (~1.4 kb) combine to form higher-order composite repeat arrays.\u003c/p\u003e\n\u003cp\u003e\u003cbr\u003e\u003cstrong\u003eHaplotype-specific orientation polymorphism revealed by diploid resolution\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003e\u003cbr\u003e\u0026nbsp;In addition to defining the repeat composition, diploid resolution enabled the assessment of repeat orientation relative to the telomeric direction. Analysis of repeat organization uncovered a striking structural dimorphism between homologous chromosomes. Genome-wide mapping of repeat density further revealed that tangled-ball units are not randomly distributed but instead exhibit chromosome-specific enrichment patterns. Notably, L- and S-ball structures were markedly enriched in subtelomeric regions and displayed a distribution pattern distinct from canonical SATR1/SATR1v/SATR2 repeats (Fig. 2f and Supplementary Fig 2). Within this representative diploid genome, the repeat density was not uniform, forming discrete localized peaks along specific chromosomal intervals. These focal enrichments suggest that tangled-ball arrays are organized into discrete chromosomal configurations rather than representing diffuse repeat accumulations.\u003c/p\u003e\n\u003cp\u003e\u003cbr\u003e\u003cstrong\u003eChromosome-11–specific subtype and orthogonal cytogenetic validation\u003cbr\u003e\u0026nbsp;\u003c/strong\u003e\u003cbr\u003eIn addition to these multi-chromosomal tangled-balls, we identified a chromosome-11–specific repeat cluster composed predominantly of an 80-bp SATR1v-related subtype (chr11-core). Global alignments indicated that the chr11-core shared higher sequence identity with the 81-bp L-core (74.7%) than with the 80-bp S-core (56.7%), supporting the existence of structurally related but distinct satellite subclasses. 
Unlike the multi-chromosomal tangled-balls, the chr11-specific cluster was confined exclusively to chromosome 11 and did not generate interchromosomal connections within the assembly graph. Comparative genomic alignment with the human T2T reference (CHM13) revealed that the chromosome-11–specific chr11-core cluster was localized within a previously described large-scale human–macaque inversion spanning the region between \u003cem\u003eDPPA3\u003c/em\u003e, \u003cem\u003eETV6\u003c/em\u003e, and\u0026nbsp;\u003cem\u003eDDX11\u003c/em\u003e (Fig. 2g). The whole-region alignment clearly delineated an orientation switch corresponding to this inversion, with the chr11-core repeat cluster positioned in close proximity to the inferred inversion boundary. Examination of local sequence features across this region revealed higher densities of single-nucleotide variants and small insertions/deletions within the inverted segment relative to the flanking intervals. In parallel, segmental duplication (SD) coverage was increased and local sequence identity reduced in the vicinity of the chr11-core array. Together, these observations indicate that the hierarchical satellite cluster resides within a structurally dynamic genomic environment. The spatial coincidence of the chr11-core satellite cluster with the inferred inversion boundary interval highlights a positional association between the hierarchical repeat architecture and regions exhibiting structural rearrangements between the human and macaque genomes.\u003cbr\u003e\u0026nbsp;To validate the physical existence and chromosomal distribution of these repeat units, we performed fluorescence in situ hybridization (FISH) using probes specific to the ~1.4-kb, 81-bp, and 80-bp repeats (Fig. 2h). The signal distributions across chromosomes were highly concordant with the copy number estimates derived from our T2T assemblies. 
A strong positive correlation was observed between the assembly-based copy number estimates and the corresponding FISH signal intensities (R = 0.78, Fig. 2i). This consistent relationship across diverse repeat classes provides orthogonal empirical support for the accuracy of our T2T-based copy number inferences, even within these structurally complex regions. Dual-color FISH further demonstrated that L-core and S-core units co-localized but exhibited variable spatial arrangements between homologous chromosomes, consistent with the orientation polymorphism identified in diploid assemblies.\u003cbr\u003e\u0026nbsp;Collectively, these results demonstrate that tangled-ball structures are composed of a small number of clearly definable repeat units, which form diverse combinations and arrangements across chromosomes. Although largely invisible in conventional linear reference genomes, the integration of the diploid T2T assembly with cytogenetic validation enables their systematic and accurate characterization.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e2-3. Population-level conservation and geographical divergence of tangled-ball architectures\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eEvolutionary distribution of tangled-ball architectures across primates\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eTo assess the evolutionary distribution of tangled-ball architectures, we examined their presence in existing primate reference genomes. Structured satellite clusters corresponding to tangled-balls were identifiable in the \u003cem\u003eM. fascicularis\u003c/em\u003e (MFA8v1.1) and \u003cem\u003eM. mulatta\u003c/em\u003e (MMU8v2.0) assemblies (Fig. 3a). 
In contrast, comparable hierarchical arrangements were not detected in the human T2T-CHM13 assembly, indicating a lineage-specific structural differentiation between macaques and humans.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eSystematic prevalence across 24 haplotypes\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eAnalysis of 12 individuals (24 haplotypes) confirmed the widespread presence of tangled-ball structures across all three geographic origins (Indonesia, Malaysia, and the Philippines) (Fig. 3b, Supplementary Fig 3a). The chromosome-11–associated inversion at the ETV6 locus, linked to the chr11-core cluster, was consistently observed in all haplotypes examined (Fig. 3c, Supplementary Fig 3b), supporting its stability as a species-wide structural feature.\u003c/p\u003e\n\u003cp\u003eHaplotype-resolved RepeatMasker profiling revealed that repeat composition is largely conserved at the chromosomal scale but varies in its degree of inter-individual divergence. Certain interspersed elements (e.g., LINEs and SINEs) differed across chromosomes yet exhibited relatively limited variability between individuals. In contrast, satellite repeats and retroposon families showed marked inter-individual heterogeneity, as evidenced by heatmap and coefficient-of-variation analyses (Supplementary Fig. 3c, d).\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eQuantification of haplotype-specific orientation polymorphism\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eTo quantify repeat orientation, we defined a directionality index based on the relative genomic start positions of the L-core and S-core arrays along each chromosome arm, measured from the telomere-proximal end. Specifically, the index was calculated as (starting position of L-core – starting position of S-core). 
Negative values therefore indicate the canonical telomere-proximal configuration in which the L-core precedes the S-core (L-core → S-core), whereas positive values indicate the reversed configuration (S-core → L-core) (Fig. 3d, e, Supplementary Fig 3e). Haplotypes exhibiting start-position offsets exceeding ±2 standard deviations from the chromosome-specific median were detected across multiple chromosomes (Fig. 3e). This observation indicates that the orientation polymorphism is accompanied by substantial shifts in the relative L-core and S-core positioning, rather than representing minor coordinate variations. Consistent with the assembly-based inference, dual-color FISH analyses confirmed that both L-core → S-core and S-core → L-core configurations can be observed cytogenetically, with reversed arrangements detected on homologous chromosomes in selected individuals (Supplementary Fig. 3f). This independent validation reinforced the robustness of diploid T2T reconstruction in resolving fine-scale orientation polymorphisms within repetitive genomic regions.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eFine-scale sequence divergence and geographical stratification\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eAlthough the higher-order architecture of the tangled-balls was conserved, the sequence-level variation within the L-core unit revealed measurable geographic stratification.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003e3f, Supplementary Fig 3g, h). 
Haplotypes derived from the Philippines formed a cohesive cluster distinct from the Indonesian and Malaysian cohorts, indicating regional differentiation in satellite sequence composition despite the conservation of the overall architectural organization.\u003c/p\u003e\n\u003cp\u003e\u003cbr\u003e\u0026nbsp;Together, these findings demonstrate that tangled-ball architectures are conserved genomic features within macaques that nevertheless exhibit population-level sequence diversification and haplotype-specific structural variability.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e2-4. Results: The Comprehensive Landscape of MHC Diversity and its Heritability\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e* High-resolution architecture of 24 Mafa-MHC haplotypes\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eGenome-wide comparison of T2T haplotypes (Fig. 3) revealed that structural diversity was not uniformly distributed across the genome, but instead accumulated within specific genomic regions. Among these, the major histocompatibility complex (MHC), which plays a central role in immune responses, emerged as the locus exhibiting the most pronounced haplotype-level structural diversity.\u0026nbsp;Comprehensive structural comparisons of this region at diploid, haplotype-resolved resolution have remained limited.\u003c/p\u003e\n\u003cp\u003eUsing 24 haplotypes derived from 12 individuals assembled by T2T sequencing, we systematically analyzed the MHC region, defined as the interval spanning \u003cem\u003eGABBR1\u003c/em\u003e to \u003cem\u003eKIFC1\u003c/em\u003e. IPD-registered Mafa-MHC genes [23, 24, 25] were mapped onto each haplotype, enabling hierarchical dissection of the MHC architecture at the levels of genomic compartments, gene blocks, and individual genes (Fig. 
4).\u003c/p\u003e\n\u003cp\u003eAt the compartment level, the overall Class IA–IB–II organization was conserved across all haplotypes; however, the physical lengths of individual compartments differed by approximately 0.5–1 Mb (Fig. 4a, Supplementary Fig. 4a). No significant correlation was observed between compartment sizes (Supplementary Fig. 4b), indicating that major MHC components varied independently. Thus, the MHC is composed of multiple independent architectural modules, each undergoing distinct structural variations.\u003c/p\u003e\n\u003cp\u003eAt the gene-block level, the order and relative arrangement of the gene blocks differed substantially among haplotypes (Fig. 4a, Supplementary Fig. 4c). These differences reflect not only copy-number changes but also block-scale rearrangements contributing to structural diversification. No clear population-wide clustering of global structural patterns was detected, although a locus-specific variation trend was observed at certain genes (e.g., \u003cem\u003eMafa-A\u003c/em\u003e block; Kruskal–Wallis test, Supplementary Fig. 4c, d).\u003c/p\u003e\n\u003cp\u003eAt the individual gene level, marked copy-number polymorphisms and presence–absence variations were observed in Class I genes such as \u003cem\u003eMafa-A\u003c/em\u003e and \u003cem\u003eMafa-B\u003c/em\u003e as well as in non-classical genes including \u003cem\u003eMafa-E\u003c/em\u003e. In contrast,\u003cem\u003e\u0026nbsp;Mafa-DRA\u0026nbsp;\u003c/em\u003ewas conserved across all haplotypes, whereas \u003cem\u003eMafa-DRB\u003c/em\u003e and \u003cem\u003eDP/DQ\u003c/em\u003e loci displayed substantial copy-number variation (Fig. 4b). Even genes shared across all haplotypes showed variations in copy number, genomic position, and gene length (Fig. 4b, c).\u003c/p\u003e\n\u003cp\u003eTogether, these findings demonstrate that MHC architecture exhibits multi-layered structural heterogeneity organized across three hierarchical levels. 
Structural differences previously collapsed or obscured in haploid reference assemblies were resolved here as haplotype-specific architectural frameworks.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e* Pedigree validation: Stable inheritance of complex architectures\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eTo determine whether this structural diversity represents stable biological inheritance, we analyzed father–offspring transmission in a pedigree comprising one father and three offspring from different mothers. Landscape visualization (Fig. 4a, 4d) showed that one paternal haplotype was structurally preserved in each offspring with high fidelity.\u003c/p\u003e\n\u003cp\u003eConsistently, SyRI analysis identified strictly syntenic regions between the corresponding paternal and inherited haplotypes, with no evidence of misassembly or artifactual rearrangement (Supplementary Fig 4). This high-resolution alignment, enabled by the gapless T2T assembly, confirmed the absence of even minor structural discrepancies. In contrast, comparisons involving non-inherited haplotypes exhibited local inversions and duplications (Supplementary Fig 4e).\u003c/p\u003e\n\u003cp\u003eAt the allelic level, gene concordance was 100% up to eight-digit allelic resolution (Fig. 4d, Supplementary Fig 4f), demonstrating high-fidelity intergenerational transmission of MHC haplotypes despite their structural complexity.\u003c/p\u003e\n\u003cp\u003eThese results establish that the observed structural heterogeneity is not an assembly artifact but represents stable, inheritable haplotype-level diversity. This work provides a foundation for dissecting how complex immunogenetic landscapes shape individual immune responses.\u003c/p\u003e"},{"header":"3. 
Discussion","content":"\u003cp\u003eThe completion of 24 diploid telomere-to-telomere (T2T) reference genomes established a foundational resource for understanding the structural diversity in the crab-eating macaque, a key non-human primate model in biomedical research. Although recent efforts have generated complete macaque genomes and extended T2T sequencing across primate lineages [6, 7], our results demonstrate that diploid-level resolution is essential for accurately capturing the genomic complexity of \u003cem\u003eMacaca fascicularis\u003c/em\u003e. By resolving both haplotypes in 12 individuals, we showed that genomic features previously collapsed or dismissed as assembly “noise” instead represent substantial structural non-equivalence between homologous chromosomes. This observation parallels the advances in human diploid genomics, where haplotype-resolved assemblies have become indispensable for the comprehensive characterization of complex structural variations [19, 20].\u003c/p\u003e\n\u003cp\u003eA clear example of such previously hidden complexity is the “tangled-balls” architecture identified in this study. Manual deconvolution identified a hierarchical repeat organization composed of 80-bp, 81-bp, and ~1.4-kb repeat units, representing a previously unrecognized layer of genome architecture. The haplotype-specific differences in repeat composition, copy number, and spatial arrangement, independently validated by FISH, indicated that these regions possess the structural flexibility that was invisible in haploid or collapsed references. The differential positioning of repeat clusters between homologous chromosomes further underscores the extent of genomic diversity that becomes apparent only through complete diploid resolution.\u003c/p\u003e\n\u003cp\u003eGenome-wide comparisons at diploid T2T resolution further demonstrate that structural diversity is not uniformly distributed but concentrates within discrete genomic regions. 
Among them, the major histocompatibility complex (MHC) stands out as the locus exhibiting the most pronounced haplotype-level divergence. These observations reinforce the long-standing view that immune-related loci are major repositories of sequence and structural variations, while also highlighting the limitations of previous reference genomes in resolving such complexity.\u003c/p\u003e\n\u003cp\u003eResolving the \u003cem\u003eMafa\u003c/em\u003e-MHC at diploid T2T resolution establishes an essential reference for precision immunogenomics. While prior high-quality macaque assemblies improved structural completeness [6], continuous reconstruction of fully resolved diploid MHC haplotypes at long-read T2T resolution across multiple geographically diverse individuals had not previously been achieved at this scale. Consequently, systematic evaluation of structural non-equivalence between homologous chromosomes remains limited. Although sequencing analysis of MHC-homozygous Mauritian macaques yielded the first complete \u003cem\u003eMafa\u003c/em\u003e-MHC sequence [26], the strong founder effect in this population constrains its representation of natural structural diversity and inheritance patterns [27, 28, 29]. In contrast, crab-eating macaques display high genetic diversity, defined population structure, and variable degrees of introgression from rhesus macaques, complicating interpretation of MHC variation in existing genomic resources [29, 30, 31, 32].\u003c/p\u003e\n\u003cp\u003eBy analyzing individuals from multiple geographic origins and resolving MHC haplotypes, our study enabled a direct comparison of MHC structural diversity without excessive simplification. This framework preserves biologically meaningful diversity while maintaining experimental reproducibility, thereby enhancing the translational relevance of this model.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eNotably, structural diversification across MHC regions is asymmetric. 
Certain loci, such as \u003cem\u003eMafa-DRA\u003c/em\u003e, remained structurally conserved across haplotypes, consistent with their essential roles in antigen presentation. In contrast, \u003cem\u003eMafa-DRBs\u003c/em\u003e and several Class I loci exhibited extensive copy-number and positional variability. This diversification supports a modular model of MHC organization, in which independent architectural units evolve under differential selective constraints, balancing structural conservation with adaptive flexibility. Notably, the non-classical class I molecule \u003cem\u003eMHC-E\u003c/em\u003e (\u003cem\u003eMafa-E\u003c/em\u003e) exhibits both copy-number variation and population-specific differences, consistent with geographically variable selective pressures.\u003c/p\u003e\n\u003cp\u003eThe extensive variability observed at the \u003cem\u003eMafa-DRB\u003c/em\u003e locus closely parallels recent human HLA pangenome studies, in which the DRB region emerges as a hotspot of copy-number diversification and haplotype-specific architectural remodeling [20]. This convergence across species suggests that DRB structural plasticity may represent an evolutionarily conserved feature of primate MHC organization.\u003c/p\u003e\n\u003cp\u003eThese findings align with emerging human HLA pangenome analyses emphasizing the importance of haplotype-specific architectures in shaping immune responses and disease susceptibility [20].\u003c/p\u003e\n\u003cp\u003eImportantly, non-human primate models uniquely allow the experimental validation of immunogenetic hypotheses that are difficult to test directly in humans. MHC-E (\u003cem\u003eMafa-E\u0026nbsp;\u003c/em\u003ein\u003cem\u003e\u0026nbsp;M. fascicularis\u003c/em\u003e), a non-classical class I molecule with relatively low allelic polymorphism [33], mediates protection against SIV infection in CMV-vectored vaccine models [34,35] and has been implicated in diverse infectious and immunological contexts [36, 37, 38,39]. 
Yet in our dataset, \u003cem\u003eMafa-E\u003c/em\u003e displayed copy-number and population-level variations, underscoring that structural diversification can occur even at loci with limited sequence diversity. Our diploid T2T assemblies provide a framework to interrogate the immunological impact of such variation.\u003c/p\u003e\n\u003cp\u003eCrucially, pedigree-based validation demonstrated that these complex MHC architectures are inherited as stable, discrete units rather than representing assembly artifacts. This confirmed both the technical accuracy of the diploid T2T assemblies and the biological reality of MHC structural heterogeneity. Previous pedigree studies have focused primarily on allelic segregation and have lacked the resolution to evaluate inheritance of continuous MHC architectures. Our findings therefore establish a direct link between complete structural haplotypes and heritable immunogenetic variation.\u003c/p\u003e\n\u003cp\u003eCollectively, this study demonstrates that a transition from haploid to diploid T2T reference genomes is essential for accurately capturing immunogenetic diversity in crab-eating macaques. By transforming previously collapsed representations into fully resolved haplotype frameworks, this resource provides a foundation for linking genomic structure to immune function, disease susceptibility, and therapeutic responses. As primate genomics advances toward a pangenomic paradigm, diploid T2T references will be indispensable for both fundamental immunology and translational biomedical research.\u003c/p\u003e"},{"header":"Materials and Methods","content":"\u003cp\u003e\u003cstrong\u003eAnimal experiments and sample collection.\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eTwelve cynomolgus macaques (\u003cem\u003eMacaca fascicularis\u003c/em\u003e; 4–20 years old) were included. Cohort 1 comprised nine animals from three geographic origins (Indonesia, Malaysia and the Philippines; n = 3 each). 
Cohort 2 comprised four Philippine-origin animals (one father and three offspring from different mothers), with one female included in both cohorts. Peripheral blood mononuclear cells (PBMCs) were isolated from EDTA-anticoagulated blood by Ficoll density-gradient centrifugation and cryopreserved in FBS containing 10% DMSO until use.\u003c/p\u003e\n\u003cp\u003eAnimals were housed at the Tsukuba Primate Research Center (NIBN) and confirmed to be seronegative for SIV, simian type D retrovirus, simian T cell lymphotropic virus, simian foamy virus, Epstein–Barr virus, cytomegalovirus and B virus. All procedures were approved by the institutional animal ethics committee (approval no. DS21-2R14).\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eDNA extraction and sequencing:\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eHigh- and ultra-high-molecular-weight (HMW and UHMW, respectively) genomic DNA was extracted from PBMCs using a Monarch HMW DNA Extraction Kit (New England Biolabs, T3050L), following the Oxford Nanopore–recommended protocols to preserve DNA length. DNA integrity and fragment size distributions were assessed prior to library preparation. Oxford Nanopore libraries were prepared using ligation-based protocols with modified handling to minimize DNA shearing for ultra-long-read sequencing (Ultra-Long DNA Sequencing Kit V14, SQK-ULK114, for ultra-long reads; Ligation Sequencing Kit V14, SQK-LSK114, for Pore-C). Ultra-long-read and Pore-C libraries were sequenced on the PromethION 48 platform (MinKNOW version 24.02) using R10.4.1 flow cells (FLO-PRO114). Raw signal data (POD5 format) were retained for downstream basecalling. For Pore-C, in situ chromatin crosslinking, restriction digestion (NlaIII), proximity ligation and adapter ligation were performed as previously described. In addition, high-fidelity (HiFi) long-read data were generated on the PacBio Revio system by a commercial provider (Takara Bio Inc., Japan). 
Further experimental details and sequencing metrics are provided in the Supplementary Information.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eDiploid haplotype-resolved genome assembly\u003cbr\u003e\u003c/strong\u003e\u003cu\u003eOxford Nanopore basecalling\u003cbr\u003e\u003c/u\u003eUltra-long Oxford Nanopore (ONT) sequencing data were basecalled from raw POD5 signal files using Dorado (v0.9.1) with the super-accuracy (sup) model. Basecalled reads were generated in BAM format and used as ultra-long ONT reads for downstream analyses\u003cu\u003e.\u003cbr\u003e\u0026nbsp;\u003cbr\u003e\u0026nbsp;ONT read error correction\u0026nbsp;\u003c/u\u003e\u003c/p\u003e\n\u003cp\u003eUltra-long ONT reads were error-corrected using hifiasm (v0.25.0-r726) in ONT correction mode (-e --ont --write-ec). Corrected reads were exported for downstream assembly.\u003cbr\u003e\u0026nbsp;\u003cbr\u003e\u0026nbsp;\u003cu\u003eGraph-based diploid assembly using Verkko\u003c/u\u003e\u003c/p\u003e\n\u003cp\u003eDiploid genome assembly was performed using Verkko (v2.2). Four data types were provided: raw Oxford Nanopore ultra-long reads, raw PacBio HiFi reads, error-corrected ONT ultra-long reads, and Pore-C proximity ligation reads. To maximize graph connectivity, the error-corrected ONT ultra-long reads were combined with the PacBio HiFi reads and used during the initial assembly graph construction. This combined read set, together with raw ONT ultra-long reads for repeat resolution and Pore-C proximity ligation reads for haplotype phasing, was integrated into the Verkko pipeline. No external reference genome was used during contig construction. No reference-guided correction, sequence replacement, or short-read polishing was performed prior to scaffolding. Long-range chromatin contact information from Pore-C reads was used during graph resolution to assist haplotype phasing. 
Assembly graphs were generated by Verkko and resolved into phased haplotypes using the default pipeline.\u003c/p\u003e\n\u003cp\u003e\u003cu\u003eReference-guided chromosome scaffolding\u003c/u\u003e\u003c/p\u003e\n\u003cp\u003eChromosome-scale scaffolding was performed using RagTag (v2.1.0) with the T2T-MFA8v1.1 assembly as a structural guide. Reference-guided scaffolding was used exclusively for contig ordering and orientation to enable consistent chromosome naming and coordinate alignment across individuals. No sequence replacement, structural correction, or gap filling from the reference genome was performed. Gaps introduced during scaffolding were represented by stretches of “N” characters. The contig-level sequence content remained unchanged from the de novo assembly stage. This workflow ensures that contig-level sequences are fully reference-independent, and that reference information is introduced only at the scaffolding stage for coordinate standardization.\u003c/p\u003e\n\u003cp\u003e\u003cbr\u003e\u0026nbsp;\u003cstrong\u003eAssembly quality assessment\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eAssembly continuity was evaluated at the contig level using N50, NG50 and auN metrics.\u003c/p\u003e\n\u003cp\u003eGene completeness was assessed using compleasm (v0.2.7) with the primate_odb12 lineage dataset. Genome-wide structural concordance with T2T-MFA8v1.1 was assessed using SyRI (v1.7.0) to confirm large-scale synteny and preservation of haplotype-specific structural features. Across all haplotypes, the continuity and completeness metrics were comparable to those of T2T-MFA8v1.1, supporting their suitability for diploid structural analyses rather than serving as replacement reference genomes. 
NG statistics were calculated relative to the haploid genome size of T2T-MFA8v1.1 (3,060,038,958 bp).\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eIdentification and characterization of hierarchical satellite (“tangled-ball”) architectures\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003e\u003cu\u003eAssembly graph inspection and extraction of entangled nodes\u003c/u\u003e\u003c/p\u003e\n\u003cp\u003eAssembly graphs generated by Verkko were exported in the GFA format. Graph visualization was performed using Bandage (v0.9.0).\u003c/p\u003e\n\u003cp\u003eTo systematically identify the graph regions exhibiting multi-chromosomal convergence, we quantified the node degree within the assembly graph. Nodes connected to contigs originating from two or more distinct chromosomes were defined as high-degree nodes. The corresponding genomic sequences were extracted from the phased contigs for downstream analysis.\u003cbr\u003e\u003cstrong\u003eRepeat unit delineation\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe extracted sequences were analyzed for internal periodicity using tandem repeat detection and self-alignment analyses. Putative repeat units were defined based on consistent internal periodicity, local sequence homogeneity, and recurrent occurrence across multiple high-degree graph regions. This procedure identified three principal repeat cores: an 81-bp unit (L-core), an 80-bp unit (S-core), a 1.4-kb unit (L-extension) and a chromosome-11–specific 80-bp subtype (chr11-core). Repeat boundaries were determined based on conserved motif transitions observed at multiple independent loci.\u003cbr\u003e\u0026nbsp;\u003cbr\u003e\u0026nbsp;\u003cstrong\u003eGenome-wide detection of repeat units\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe consensus sequences of the L-core, S-core, and chr11-core were used as queries for genome-wide searches using the Biostrings package in R (v4.3.2). 
Pattern matching was performed using the matchPattern and vmatchPattern functions, with the maximum mismatch parameters set per repeat class. For the short core units (80–81 bp), up to five mismatches, including insertions and deletions, were allowed, corresponding to approximately 6% sequence divergence. For the longer L-extension (~1.4 kb), a proportional mismatch threshold was applied, allowing up to 88 mismatches (~6% divergence relative to the unit length), including insertions and deletions, to maintain consistency in the divergence tolerance across repeat classes. Copy numbers were calculated per chromosome and per haplotype.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eRepeat annotation and comparison with canonical satellite families\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eRepeat annotation was performed using RepeatMasker (v4.2.3) with the Dfam 3.9 database. Annotated SATR1-, SATR1v-, and SATR2-related sequences were extracted for comparative analysis.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eMHC structural annotation\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eFor each haplotype, \u003cem\u003eGABBR1\u003c/em\u003e and \u003cem\u003eKIFC1\u003c/em\u003e on chromosome 4 were aligned using Minimap2 (v2.30-r1287), and the flanking genomic interval was extracted and defined as the MHC region. Reference alleles (n = 3299) were obtained from the Immuno Polymorphism Non-Human Primates MHC Database (IPD-MHC NHP, release 3.16.0.0 (2026-01), build 231). Sequences were restricted to Mafa alleles and aligned to the extracted region using Minimap2. Manual curation was performed to define gene boundaries and structural blocks and to assess allele length variation and copy number. Structural rearrangements within the MHC region were further evaluated using SyRI based on whole-region alignments. 
Detailed annotation criteria, allele filtering parameters, copy-number determination methods and structural comparison parameters are provided in the Supplementary Information.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eFluorescence in situ hybridization (FISH)\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eFluorescence in situ hybridization (FISH) was performed by a commercial service provider (Chromosome Science Labo Inc., Japan) to validate chromosomal rearrangements and structural variations. Cryopreserved cells were cultured and synchronized prior to metaphase arrest, followed by hypotonic treatment and methanol–acetic acid fixation to prepare chromosome spreads. Replication R-banding was generated using Hoechst 33258 staining and ultraviolet exposure. Probes were designed based on the specified target sequences. Oligonucleotide probes (\u0026lt;100 bp) were directly labeled, whereas longer probes were synthesized and labeled by nick translation. Probe mixtures were denatured on chromosome spreads and hybridized overnight. After stringency washes, the slides were counterstained with DAPI and imaged using a cytogenetics workstation with a 100× objective. Detailed probe sequences, hybridization conditions and signal quantification criteria are provided in the Supplementary Information.\u003c/p\u003e"},{"header":"Declarations","content":"\u003cp\u003e\u003cstrong\u003eData availability\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe data generated and/or analyzed in this study are available from the corresponding author upon reasonable request. The values for all the data points in the graphs are reported in the Supporting Data Values file.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAcknowledgements\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eWe thank our laboratory colleagues for their excellent technical support. We also thank HAMRI Co., Ltd. 
and the Corporation for Production and Research of Laboratory Primates for their support with animal experiments. Supercomputing resources were provided by the Human Genome Center, the Institute of Medical Science, the University of Tokyo. This work was also supported by the supplementary budget of the National Institutes of Biomedical Innovation, Health and Nutrition (NIBN), Japan Society for the Promotion of Science Grant-in-Aid for Scientific Research (B) (JP24K02498), and the Japan Agency for Medical Research and Development (26fk0410071, JP223fa627007, JP223fa627005).\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\n\u003cli\u003eFennessey, C.M. \u0026amp; Keele, B.F. Using nonhuman primates to model HIV transmission.\u0026nbsp;\u003cem\u003eCurr Opin HIV AIDS\u003c/em\u003e. \u003cstrong\u003e8\u003c/strong\u003e, 280-287 (2013). doi:10.1097/COH.0b013e328361cfff\u003c/li\u003e\n\u003cli\u003eZhang, G.Q. et al. Characterization of the major histocompatibility complex class I A alleles in cynomolgus macaques of Vietnamese origin.\u0026nbsp;\u003cem\u003eTissue Antigens\u003c/em\u003e. \u003cstrong\u003e80\u003c/strong\u003e, 494-501 (2012). doi:10.1111/tan.12024\u003c/li\u003e\n\u003cli\u003eYan, G. et al.\u0026nbsp;Genome sequencing and comparison of two nonhuman primate animal models, the cynomolgus and Chinese rhesus macaques.\u0026nbsp;\u003cem\u003eNat Biotechnol.\u003c/em\u003e\u003cstrong\u003e29\u003c/strong\u003e, 1019\u0026ndash;1023 (2011). doi:10.1038/nbt.1992\u003c/li\u003e\n\u003cli\u003eShiina, T. et al. Rapid evolution of major histocompatibility complex class I genes in primates generates new disease alleles in humans via hitchhiking diversity.\u0026nbsp;\u003cem\u003eGenetics\u003c/em\u003e\u003cstrong\u003e173\u003c/strong\u003e, 1555-1570 (2006). doi:10.1534/genetics.106.057034\u003c/li\u003e\n\u003cli\u003eNurk, S. et al. 
The complete sequence of a human genome.\u0026nbsp;\u003cem\u003eScience\u003c/em\u003e\u003cstrong\u003e376\u003c/strong\u003e, 44-53 (2022). doi:10.1126/science.abj6987\u003c/li\u003e\n\u003cli\u003eZhang, S. et al.\u0026nbsp;Integrated analysis of the complete sequence of a macaque genome.\u0026nbsp;\u003cem\u003eNature\u003c/em\u003e\u0026nbsp;\u003cstrong\u003e640\u003c/strong\u003e, 714\u0026ndash;721 (2025). doi:10.1038/s41586-025-08596-w\u003c/li\u003e\n\u003cli\u003eYoo, D. et al.\u0026nbsp;Complete sequencing of ape genomes.\u0026nbsp;\u003cem\u003eNature\u003c/em\u003e\u0026nbsp;\u003cstrong\u003e641\u003c/strong\u003e, 401\u0026ndash;418 (2025). doi:10.1038/s41586-025-08816-3\u003c/li\u003e\n\u003cli\u003eHoyt, S.J.\u0026nbsp;et al. From telomere to telomere: The transcriptional and epigenetic state of human repeat elements. \u003cem\u003eScience \u003c/em\u003e\u003cstrong\u003e376\u003c/strong\u003e, eabk3112 (2022). doi:10.1126/science.abk3112\u003c/li\u003e\n\u003cli\u003eAltemose, N.\u0026nbsp;\u003cem\u003eet al.\u003c/em\u003e Complete genomic and epigenetic maps of human centromeres. \u003cem\u003eScience \u003c/em\u003e\u003cstrong\u003e376\u003c/strong\u003e, eabl4178 (2022). doi:10.1126/science.abl4178\u003c/li\u003e\n\u003cli\u003eYasutomi, Y. Establishment of specific pathogen-free macaque colonies in Tsukuba Primate Research Center of Japan for AIDS research. \u003cem\u003eVaccine\u003c/em\u003e. \u003cstrong\u003e28\u003c/strong\u003e, B75\u0026ndash;B77 (2010).\u003c/li\u003e\n\u003cli\u003eMunesue, Y. et al. Cynomolgus macaque model of neuronal ceroid lipofuscinosis type 2 disease. \u003cem\u003eExp Neurol\u003c/em\u003e. \u003cstrong\u003e363\u003c/strong\u003e, 114381 (2023).\u003c/li\u003e\n\u003cli\u003eIkeda, Y. et al. Discovery of a cynomolgus monkey family with retinitis pigmentosa.\u003cem\u003e Invest Ophthalmol Vis Sci\u003c/em\u003e. 
\u003cstrong\u003e59\u003c/strong\u003e, 826\u0026ndash;830 (2018).\u003c/li\u003e\n\u003cli\u003eUchida, A. et al. Non-human primate model of amyotrophic lateral sclerosis with cytoplasmic mislocalization of TDP-43. \u003cem\u003eBrain\u003c/em\u003e\u003cstrong\u003e 135\u003c/strong\u003e, 833\u0026ndash;846 (2012).\u003c/li\u003e\n\u003cli\u003eOkabayashi, S. et al. Diabetes mellitus accelerates A\u0026beta; pathology in brain accompanied by enhanced GA\u0026beta; generation in nonhuman primates. \u003cem\u003ePLoS One\u003c/em\u003e\u003cstrong\u003e10\u003c/strong\u003e, e0117362 (2015).\u003c/li\u003e\n\u003cli\u003eKoinuma, S. et al. Aging induces abnormal accumulation of A\u0026beta; in extracellular vesicle-rich fractions in nonhuman primate brain. \u003cem\u003eNeurobiol Aging\u003c/em\u003e\u003cstrong\u003e106\u003c/strong\u003e, 268\u0026ndash;281 (2021).\u003c/li\u003e\n\u003cli\u003eCheng, H. et al. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm.\u0026nbsp;\u003cem\u003eNat Methods\u003c/em\u003e. \u003cstrong\u003e18\u003c/strong\u003e, 170-175 (2021). doi:10.1038/s41592-020-01056-\u003c/li\u003e\n\u003cli\u003eRautiainen, M. et al. Telomere-to-telomere assembly of diploid chromosomes with Verkko.\u0026nbsp;\u003cem\u003eNat Biotechnol\u003c/em\u003e. \u003cstrong\u003e41\u003c/strong\u003e, 1474-1482 (2023). doi:10.1038/s41587-023-01662-6\u003c/li\u003e\n\u003cli\u003eGoel, M. et al. SyRI: finding genomic rearrangements and local sequence differences from whole-genome assemblies.\u0026nbsp;\u003cem\u003eGenome Biol\u003c/em\u003e. \u003cstrong\u003e20\u003c/strong\u003e, 277 (2019). doi:10.1186/s13059-019-1911-0\u003c/li\u003e\n\u003cli\u003eLiao, W.W.\u0026nbsp;et al\u003cem\u003e.\u003c/em\u003eA draft human pangenome reference.\u0026nbsp;\u003cem\u003eNature\u003c/em\u003e\u0026nbsp;\u003cstrong\u003e617\u003c/strong\u003e, 312\u0026ndash;324 (2023). 
doi:10.1038/s41586-023-05896-x\u003c/li\u003e\n\u003cli\u003eLogsdon, G.A.\u0026nbsp;et al\u003cem\u003e.\u003c/em\u003e\u0026nbsp;Complex genetic variation in nearly complete human genomes.\u0026nbsp;\u003cem\u003eNature\u003c/em\u003e\u0026nbsp;\u003cstrong\u003e644\u003c/strong\u003e, 430\u0026ndash;441 (2025). doi:10.1038/s41586-025-09140-6\u003c/li\u003e\n\u003cli\u003eManni, M. et al. BUSCO: Assessing genomic data quality and beyond.\u0026nbsp;\u003cem\u003eCurr Protoc\u003c/em\u003e. \u003cstrong\u003e1\u003c/strong\u003e, e323 (2021). doi:10.1002/cpz1.323\u003c/li\u003e\n\u003cli\u003eStorer, J. et al. The Dfam community resource of transposable element families, sequence models, and genome annotations. \u003cem\u003eMob DNA\u003c/em\u003e. \u003cstrong\u003e12\u003c/strong\u003e, 2 (2021). doi:10.1186/s13100-020-00230-y\u003c/li\u003e\n\u003cli\u003eRobinson, J. et al. The IPD and IMGT/HLA database: allele variant databases. \u003cem\u003eNucleic Acids Res\u003c/em\u003e. \u003cstrong\u003e43\u003c/strong\u003e, D423-D431 (2015). doi:10.1093/nar/gku1161\u003c/li\u003e\n\u003cli\u003eMaccari, G. et al. IPD-MHC 2.0: an improved inter-species database for the study of the major histocompatibility complex. \u003cem\u003eNucleic Acids Res\u003c/em\u003e. \u003cstrong\u003e45\u003c/strong\u003e, D860-D864 (2017). doi:10.1093/nar/gkw1050\u003c/li\u003e\n\u003cli\u003eMaccari, G. et al. The 2024 IPD-MHC database update: a comprehensive resource for major histocompatibility complex studies. \u003cem\u003eNucleic Acids Res\u003c/em\u003e. \u003cstrong\u003e53\u003c/strong\u003e, D457-D461 (2025). doi:10.1093/nar/gkae932\u003c/li\u003e\n\u003cli\u003eKarl, J.A. et al. Complete sequencing of a cynomolgus macaque major histocompatibility complex haplotype. \u003cem\u003eGenome Res\u003c/em\u003e. \u003cstrong\u003e33\u003c/strong\u003e, 448-462 (2023). doi:10.1101/gr.277429.122\u003c/li\u003e\n\u003cli\u003eOsada, N. et al. 
Whole-genome sequencing of six Mauritian Cynomolgus macaques (Macaca fascicularis) reveals a genome-wide pattern of polymorphisms under extreme population bottleneck. \u003cem\u003eGenome Biol Evol\u003c/em\u003e. \u003cstrong\u003e7\u003c/strong\u003e, 821-830 (2015). doi:10.1093/gbe/evv033\u003c/li\u003e\n\u003cli\u003eOgawa, L.M. \u0026amp; Vallender, E.J. Genetic substructure in cynomolgus macaques (Macaca fascicularis) on the island of Mauritius. \u003cem\u003eBMC Genomics\u003c/em\u003e\u003cstrong\u003e15\u003c/strong\u003e, 748 (2014). doi:10.1186/1471-2164-15-748\u003c/li\u003e\n\u003cli\u003eShiina, T. \u0026amp; Blancher, A. The Cynomolgus Macaque MHC polymorphism in experimental medicine. \u003cem\u003eCells\u003c/em\u003e\u003cstrong\u003e8\u003c/strong\u003e, 978 (2019). doi:10.3390/cells8090978\u003c/li\u003e\n\u003cli\u003eDe Groot, N.G. et al. Dynamic evolution of MHC haplotypes in cynomolgus macaques of different geographic origins. \u003cem\u003eImmunogenetics\u003c/em\u003e\u003cstrong\u003e74\u003c/strong\u003e, 409-429 (2022). doi:10.1007/s00251-021-01249-y\u003c/li\u003e\n\u003cli\u003eShortreed, C.G. et al. Characterization of 100 extended major histocompatibility complex haplotypes in Indonesian cynomolgus macaques. \u003cem\u003eImmunogenetics\u003c/em\u003e\u003cstrong\u003e72\u003c/strong\u003e, 225-239 (2020). doi:10.1007/s00251-020-01159-5\u003c/li\u003e\n\u003cli\u003eDoxiadis, G.G. et al. Extensive sharing of MHC class II alleles between rhesus and cynomolgus macaques. \u003cem\u003eImmunogenetics\u003c/em\u003e\u003cstrong\u003e58\u003c/strong\u003e, 259-268 (2006). doi:10.1007/s00251-006-0083-8\u003c/li\u003e\n\u003cli\u003eKnapp, L.A. Cadavid, L.F. \u0026amp; Watkins, D.I. The MHC-E locus is the most well conserved of all known primate class I histocompatibility genes. \u003cem\u003eJ Immunol\u003c/em\u003e. \u003cstrong\u003e160\u003c/strong\u003e, 189-196 (1998).\u003c/li\u003e\n\u003cli\u003eHansen, S.G. et al. 
Broadly targeted CD8⁺ T cell responses restricted by major histocompatibility complex E. \u003cem\u003eScience\u003c/em\u003e\u003cstrong\u003e351\u003c/strong\u003e, 714-720 (2016). doi:10.1126/science.aac9475\u003c/li\u003e\n\u003cli\u003eMalouli, D. et al. Cytomegaloviral determinants of CD8\u003csup\u003e+\u003c/sup\u003e T cell programming and RhCMV/SIV vaccine efficacy. \u003cem\u003eSci Immunol\u003c/em\u003e. \u003cstrong\u003e6\u003c/strong\u003e, eabg5413 (2021). doi:10.1126/sciimmunol.abg5413\u003c/li\u003e\n\u003cli\u003eHansen, S.G. et al. Prevention of tuberculosis in rhesus macaques by a cytomegalovirus-based vaccine. \u003cem\u003eNat Med\u003c/em\u003e. \u003cstrong\u003e24\u003c/strong\u003e, 130-143 (2018). doi:10.1038/nm.4473\u003c/li\u003e\n\u003cli\u003eMurugesan, G. et al. Viral sequence determines HLA-E-restricted T cell recognition of hepatitis B surface antigen. \u003cem\u003eNat Commun\u003c/em\u003e. \u003cstrong\u003e15\u003c/strong\u003e, 10126 (2024). doi:10.1038/s41467-024-54378-9\u003c/li\u003e\n\u003cli\u003eVietzen, H. et al. Ineffective control of Epstein-Barr-virus-induced autoimmunity increases the risk for multiple sclerosis. \u003cem\u003eCell\u003c/em\u003e. \u003cstrong\u003e186\u003c/strong\u003e, 5705-5718.e13 (2023). doi:10.1016/j.cell.2023.11.015\u003c/li\u003e\n\u003cli\u003eIyer, R.F. et al. CD8\u003csup\u003e+\u003c/sup\u003e T cell targeting of tumor antigens presented by HLA-E. \u003cem\u003eSci Adv\u003c/em\u003e. \u003cstrong\u003e10\u003c/strong\u003e, eadm7515 (2024). 
doi:10.1126/sciadv.adm7515\u003c/li\u003e\n\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":true,"hideJournal":false,"highlight":"","institution":"","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"[email protected]","identity":"nature-portfolio","isNatureJournal":true,"hasQc":false,"allowDirectSubmit":false,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"","title":"Nature Portfolio","twitterHandle":"","acdcEnabled":false,"dfaEnabled":false,"editorialSystem":"ejp","reportingPortfolio":"","inReviewEnabled":true,"inReviewRevisionsEnabled":false},"keywords":"","lastPublishedDoi":"10.21203/rs.3.rs-9105354/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-9105354/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"The crab-eating macaque (Macaca fascicularis), a key biomedical model, exhibits substantial inter-individual variability. However, its genomic architecture remains incompletely resolved due to extensive structural complexity and reliance on haploid reference assemblies. Here, we generated 24 diploid telomere-to-telomere (T2T) haplotypes from 12 male and female individuals representing three geographic populations including a pedigree, enabling systematic interrogation of genome architecture at diploid resolution with geographic diversity and familial inheritance. We identified previously uncharacterized large inter-chromosomal repeat clusters with haplotype-specific organization that are conserved among old world monkeys. 
In addition, twenty-four completely contiguous major histocompatibility complex (MHC) haplotypes reveal extensive inter or intra individual variations in gene copy number and structural organization of the class-IA, IB, and II regions. Together, these findings demonstrate that diploid T2T assemblies are essential for accurately capturing structural and immunogenetic diversity in non-human primates and provide a genomic framework for precision immunogenomics in this widely used biomedical model.","manuscriptTitle":"Diploid telomere-to-telomere assemblies reveal hierarchical satellite architectures and heritable MHC structural diversity in macaques","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2026-04-21 07:59:56","doi":"10.21203/rs.3.rs-9105354/v1","editorialEvents":[],"status":"published","journal":{"display":true,"email":"[email protected]","identity":"nature-communications","isNatureJournal":true,"hasQc":false,"allowDirectSubmit":false,"externalIdentity":"NCOMMS","sideBox":"Learn more about [Nature Communications](http://www.nature.com/ncomms/)","snPcode":"","submissionUrl":"https://mts-ncomms.nature.com/","title":"Nature Communications","twitterHandle":"","acdcEnabled":true,"dfaEnabled":true,"editorialSystem":"ejp","reportingPortfolio":"Nature Communications","inReviewEnabled":true,"inReviewRevisionsEnabled":false}}],"origin":"","ownerIdentity":"95f76970-97d5-4ccc-91dd-5c5f84edc814","owner":[],"postedDate":"April 21st, 2026","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"under-review","subjectAreas":[{"id":66389824,"name":"Biological sciences/Genetics/Genomics"},{"id":66389825,"name":"Biological sciences/Genetics/Immunogenetics"},{"id":66389826,"name":"Biological sciences/Immunology/Immunogenetics"}],"tags":[],"updatedAt":"2026-04-29T09:42:45+00:00","versionOfRecord":[],"versionCreatedAt":"2026-04-21 
07:59:56","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-9105354","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-9105354","identity":"rs-9105354","version":["v1"]},"buildId":"XKTyCvWXoU3ODBz1xrDgd","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}

License: CC-BY-4.0