A haplotype-resolved, chromosome-scale genome assembly and annotation for Carya glabra (pignut hickory; Juglandaceae)

2025 · doi:10.64898/2025.12.19.695579

preprint OA: closed CC-BY-NC-ND-4.0

Abstract

Carya glabra (2n = 4x = 64), also known as pignut hickory, is a widely distributed species in the walnut family (Juglandaceae). Native to the central and eastern United States and southeastern Canada, C. glabra plays an important ecological role as a common upland forest species; it is closely related to several economically valuable nut trees, including C. illinoinensis (pecan). A deeper understanding of the genetics of C. glabra is essential for studying its evolutionary history and biology, with potential implications for agricultural improvement of pecan. Here, we present the first nuclear genome assembly and annotation of C. glabra. The assembly is chromosome-level and phased, representing the first assembled polyploid genome in the genus Carya. A total of 64 pseudochromosomes were assembled and phased into four haplotypes. The haplotype A assembly spans 600.4 Mb, comprises 55.0% repetitive sequences, and contains 30,947 protein-coding genes, with a BUSCO completeness score of 97.7%. Functional annotation assigned 94.3% of haplotype A genes to gene families, and 79.7% and 86.3% of genes were annotated with Gene Ontology terms and protein domains, respectively; 635 putative plant disease resistance genes were found in haplotype A. The other three haplotypes exhibited similarly high-quality annotation metrics. Our genomic analyses also suggest that C. glabra is an autotetraploid. Comparative genomic analyses revealed high collinearity among the four haplotypes of C. glabra and the published genomes of three other Carya species, although structural variation among the genomes of these species was identified. In addition, we provide an improved chloroplast genome assembly and the first mitochondrial genome for C. glabra. Importantly, most members of the research team are undergraduate students; the sequenced individual is located in McCarty Woods, a Conservation Area on the University of Florida campus. This work highlights the value of genome assembly efforts as powerful tools for teaching genomics and supporting conservation initiatives. This first high-quality reference genome for C. glabra provides a valuable resource for studying Carya, a genus of significant ecological and economic importance.

Keywords

autopolyploid; campus genome initiative; chloroplast genome; chromosome-level genome; comparative genomics; conservation; genome annotation; haplotype-resolved; mitochondrial genome; undergraduate training

bioRxiv preprint doi: https://doi.org/10.64898/2025.12.19.695579; this version posted December 23, 2025. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.

Article summary

Carya glabra (pignut hickory) is a common upland forest species in North America. This species is a member of the walnut family (Juglandaceae), which includes many economically important nut trees. Here, we present the first nuclear genome assembly and annotation of C. glabra. The assembly is chromosome-level and phased. The haplotype A assembly contains 30,947 protein-coding genes, with a BUSCO completeness score of 97.7%. Our genomic analyses suggest that C. glabra is an autopolyploid. We also provide chloroplast and mitochondrial genome assemblies. This nuclear genome provides a valuable resource for studying Carya, a genus of significant ecological and economic importance.

Introduction

Carya glabra (2n = 4x = 64) (Juglandaceae; walnut family), commonly known as pignut hickory, is a widespread species in the central and eastern United States and southeastern Canada, ranging from Ontario southward to central Florida (Fig. 1a; POWO 2025). Pignut hickory is a slow-growing, deciduous tree that typically reaches 20–30 meters in height and 30–100 centimeters in diameter (Tirmenstein 1991). The species is monoecious, bearing staminate catkins and pistillate flowers that appear in spikes (Tirmenstein 1991). Carya possesses an accessory fruit; a pear-shaped nut is enclosed in a four-valved husk (of bracts). The fruit remains green until maturity, turning brown as it ripens (Fig. 1a; Smalley 1990).

The species is an ecological dominant in dry upland forests (Smalley 1990). In addition, the nuts are rich in crude fat and are consumed by a variety of wildlife, including squirrels, birds, foxes, rabbits, and raccoons (Smalley 1990). The wood of C. glabra is heavy and strong, making it ideal for tool handles and mallets, and it is also commonly used as fuelwood (Smalley 1990; Tirmenstein 1991). Pignut hickory also shows potential value for restoration of disturbed sites, as it has been reported to recolonize abandoned strip mines (Hardt and Forman 1989).

Carya comprises 19 species with an intercontinentally disjunct distribution (POWO 2025). In Asia, the genus is native to India, China, and countries in Southeast Asia, while in North America it occurs in eastern Canada, central and eastern United States, and Mexico (POWO 2025). Phylogenetic analyses support two monophyletic groups within the genus, corresponding to the primary geographic distributions (Asia and North America) (Zhang et al. 2013; Xi et al. 2022; Zhang et al. 2024b). According to molecular age estimation and biogeographic analyses, Carya in North America dates to the early Paleocene (Zhang et al. 2013). Its earliest confirmed occurrence is evidenced by fossil fruits from the late Eocene (Manchester 1999). The highest species diversification rate of the North America clade occurred around 10.1 million years ago (Ma) during the late Miocene, suggesting that C. glabra or its ancestor likely emerged around this time (Zhang et al. 2013). At least six North American Carya species, including C. glabra, are tetraploid (2n = 4x = 64) (Woodworth 1930; Stone 1961; Zhang et al. 2013), whereas all Asian species investigated are diploid (Grauke 2016). The North America clade showed a higher diversification rate than the Asia clade, which may be attributed to the polyploid nature of many North American species (Zhang et al. 2013).

Recent phylogenetic studies indicate that the closest relative of C. glabra may be C. texana, which is also a tetraploid (Huang et al. 2019; Xi et al. 2022). Based on plastome data, other close relatives include C. palmeri (2n = 2x = 32) and some but not all populations of C. illinoinensis (2n = 2x = 32) (Xi et al. 2022). In contrast, phylogenetic analyses, using approximately 10× resequencing data relative to the C. cathayensis genome, indicate that the clade containing C. glabra and C. texana is sister to another tetraploid species, C. tomentosa (Huang et al. 2019). Notable reported examples of natural hybridization involving C. glabra include the hybrid Carya × demareei Palmer, which arose from a cross between C. glabra and diploid C. cordiformis (Sutton and Crowley 2020). Furthermore, the overlapping geographical ranges of C. glabra and tetraploid C. ovalis have led to frequent hybridization between those two species (Coder 2023).

Carya includes two species that are commercially cultivated nut trees: C. illinoinensis (pecan) and C. cathayensis (Chinese hickory) (Grauke 2016). In the United States, pecan production exceeded 120,000 metric tons in 2024, with a value of $468 million (USDA-NASS 2025). To date, genome assemblies have been reported for three Carya species – C. illinoinensis (Huang et al. 2019; Lovell et al. 2021; Xiao et al. 2021), C. cathayensis (Huang et al. 2019; Zhang et al. 2024b), and C. sinensis (Zhang et al. 2024b) – all of which are diploid.

In this study, we assembled and annotated the first nuclear genome of tetraploid Carya glabra. This chromosome-level, phased genome represents the first polyploid genome reported within the genus. The reference genome of C. glabra should enable novel research in the economically important genus Carya, with broad applications in both agriculture and evolutionary biology. The sequenced individual is located in McCarty Woods, a designated Conservation Area and quiet oasis at the center of the University of Florida (UF) campus (Fig. 1b). Most of the researchers involved in this project are undergraduate students enrolled in a Course-based Undergraduate Research Experience (CURE) class at UF (Fig. 1c). As part of the American Campus Tree Genomes (ACTG) project (https://www.hudsonalpha.org/actg), this work highlights the potential of genome assembly projects to support conservation efforts and enhance hands-on genomics education.

Materials & Methods

Sample collection

Fresh leaf and axillary bud tissues were collected from a Carya glabra individual in the McCarty Woods Conservation Area, located centrally on the UF campus. An herbarium voucher for this plant was deposited in the Florida Museum of Natural History Herbarium (FLAS). The collected tissues were immediately frozen in liquid nitrogen.

DNA isolation and sequencing

Carya glabra leaf tissue was sent to the HudsonAlpha Institute for Biotechnology (Huntsville, AL, USA) for DNA isolation and subsequent sequencing. High-molecular-weight DNA was extracted using the Nanobind Plant Nuclei Big DNA Kit (Circulomics-PacBio, Menlo Park, CA, USA). Isolated DNA was sheared with Megaruptor (Diagenode, Denville, NJ, USA), and fragments with a size of approximately 25 kb were selected using BluePippin (Sage Science, Beverly, MA, USA). Size-selected DNA was used to construct the PacBio sequencing library using the SMRTbell Express Template Prep Kit 2.0 (PacBio, Menlo Park, CA, USA). The library was then sequenced on two SMRT Cells on a PacBio Revio system at HudsonAlpha to generate High-Fidelity (HiFi) reads.

In addition, an Omni-C library was constructed using flash-frozen leaf material following the Dovetail Genomics protocol (Dovetail Genomics, Scotts Valley, CA, USA). The library was sequenced on one S4 flow cell of the Illumina NovaSeq 6000 system (Illumina, San Diego, CA, USA) at HudsonAlpha to generate paired-end 150-bp reads. Basic statistics of PacBio HiFi data and Omni-C data were assessed using SeqKit2 (v.2.4.0; Shen et al. 2024).

RNA isolation and sequencing

Leaf and axillary bud tissues from the same C. glabra individual used for DNA isolation were collected and flash-frozen in liquid nitrogen. RNA was extracted from each tissue (leaf and axillary bud) using a modified CTAB method (Jordon-Thaden et al. 2015). RNA quality was assessed using a Bioanalyzer at the Interdisciplinary Center for Biotechnology Research (ICBR), UF (Gainesville, FL, USA). Two strand-specific (i.e., directional) RNA-seq libraries were prepared, and the libraries were sequenced on the Illumina NovaSeq X platform to generate paired-end 151-bp reads at ICBR. The statistics of the RNA-seq data were calculated using SeqKit2, and the raw reads were filtered using fastp (v.0.23.4; Chen et al. 2018) with default parameters.

Chloroplast and mitochondrial genome assembly and annotation

Both organellar genomes were simultaneously assembled from PacBio HiFi reads using Oatk (v1.0; Zhou et al. 2025). Oatk’s plastome assembly graph was simplified and circularized using Bandage (v.0.8.1; Wick et al. 2015), and the resulting assembly was annotated using the web application GeSeq (https://chlorobox.mpimp-golm.mpg.de/geseq.html; Tillich et al. 2017). The plastome annotation was further curated by comparing GeSeq’s annotation with the well-annotated Nicotiana tabacum chloroplast genome (NCBI accession number: NC_001879), as well as three published Carya glabra chloroplast genomes (BK061156; OR099205; NC_067504) (Luo et al. 2021; Xi et al. 2022; Liu et al. 2025). The chloroplast genomes were first aligned using MAFFT (v.7.490) with default parameters in Geneious Prime (2025.2.2; https://www.geneious.com). The annotation was then manually inspected and curated. Ambiguous transfer RNA (tRNA) annotations were further validated using BLAST searches in the PlantRNA 2.0 database (http://plantrna.ibmp.cnrs.fr/; Cognat et al. 2022).

Oatk’s mitochondrial assembly graph could not be resolved into a single circular chromosome without excluding graph segments. Therefore, two circular contigs were inferred from the graph and saved as separate chromosomes using Bandage. These two mitochondrial chromosomes were annotated with the web application PMGA (http://47.96.249.172:16084/annotate.html; Li et al. 2025) using the three databases available in the program. Additionally, we searched plastome and mitochondrial proteins using Captus (v.1.6.1; Ortiz et al. 2023). The four annotation tracks, one from Captus and three from PMGA (each corresponding to one of the three PMGA databases), were checked against each other for consistency. In case of discrepancies, only the best annotation was retained (i.e., the one that includes start and stop codons whenever possible and is the longest and/or most frequently observed).

Following manual curation, the edited GenBank files were exported from Geneious Prime and then uploaded to OGDRAW (v.1.3.1; Greiner et al. 2019) to generate the final chloroplast and mitochondrial genome annotation maps using the default parameters, except that the “tidy up annotation” box was checked.

Nuclear genome profiling

Jellyfish (v2.3.0; Marçais and Kingsford 2011) was used to count k-mers and generate a k-mer histogram (k-mer size: 21) from the HiFi reads. The k-mer histogram was then imported into GenomeScope 2.0 (http://genomescope.org/genomescope2.0/; Ranallo-Benavidez et al. 2020) to infer nuclear genome characteristics, including monoploid genome size and heterozygosity, with default parameters except that the ploidy level was set to 4.

Nuclear genome assembly

Hifiasm (v.0.19.9; Cheng et al. 2021) was used to perform de novo assembly with default parameters. Both HiFi reads and Omni-C reads were used as input data. Given the polyploid nature of the Carya glabra genome, the unitig assembly from hifiasm, which contained the genomic information from all four haplotypes, was used for downstream analyses.

To scaffold the unitigs, bwa-mem2 (v.2.2.1; Vasimuddin et al. 2019) was first used to align the Omni-C reads to the unitig assembly. The resulting alignments were then analyzed with the hic_qc pipeline from Phase Genomics (Seattle, WA, USA) to assess the overall quality of the Omni-C library. Then, YaHS (v.1.1; Zhou et al. 2023) was used to perform the scaffolding process with default parameters.

Next, using the Hi-C alignment file as input, the ‘juicer pre’ tool from YaHS and Juicer (v.1.22.01; Durand et al. 2016) were used to generate the Hi-C contact map. We then manually curated the assembly by examining the Hi-C contact map using Juicebox Assembly Tools (v.1.11.08; Dudchenko et al. 2018). Misjoin and inversion errors were manually corrected, and the orientation of chromosomes was also curated to match the published Carya illinoinensis genome (Lovell et al. 2021). After all edits, the final genome assembly was generated using the ‘juicer post’ tool from YaHS.
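To make the k-mer profiling step described under Nuclear genome profiling more concrete, the following minimal Python sketch counts k-mers in a few toy reads and tabulates how many distinct k-mers occur once, twice, and so on. This is the kind of histogram that a k-mer counter produces and GenomeScope 2.0 takes as input. It is an illustration only, not the Jellyfish implementation. The function names (`canonical`, `kmer_histogram`) and the toy reads are hypothetical, k is reduced from 21 to 3 for readability, and treating a k-mer and its reverse complement as one (canonical counting) is an assumption that the text does not specify.

```python
from collections import Counter

# Translation table for complementing DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer: str) -> str:
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def kmer_histogram(reads, k):
    """Map each multiplicity to the number of distinct canonical k-mers seen that often."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Skip k-mers containing ambiguous bases such as N.
            if set(kmer) <= set("ACGT"):
                counts[canonical(kmer)] += 1
    histogram = Counter(counts.values())
    return dict(sorted(histogram.items()))


# Toy reads (hypothetical); real input would be the HiFi reads with k = 21.
toy_reads = ["AAAAC", "AAACG"]
hist = kmer_histogram(toy_reads, k=3)
for multiplicity, n_distinct in hist.items():
    print(f"{multiplicity}\t{n_distinct}")
```

In a real tetraploid data set, the peaks of such a histogram at roughly one, two, three, and four times the base coverage are what the profiling model fits.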
A dot plot was generated using the web application D-GENIES (https://dgenies.toulouse.inra.fr/; Cabanettes and Klopp 2018) to compare the Carya illinoinensis genome with the assembled C. glabra genome. To assign scaffolds to chromosomes, the C. glabra scaffolds were renamed according to their alignment with the C. illinoinensis chromosomes. The four copies of each chromosome in C. glabra were labeled A, B, C, and D in descending order of length. Each set of 16 chromosomes with the same label (e.g., Chr01A, Chr02A, …, Chr16A) was grouped and referred to as a haplotype (e.g., haplotype A). The 64 chromosomes were therefore assigned to four haplotypes (A, B, C, and D). It is important to note that this haplotype assignment is artificial and does not necessarily reflect a biological haplotype, since each haplotype set may represent a mixture of chromosomes originating from different gametes. For each haplotype set, genome completeness was estimated using benchmarking universal single-copy orthologs (BUSCO, v.5.3.0) with the eudicots_odb10 database (Manni et al. 2021).

Nuclear genome annotation

To annotate repeat sequences, for each haplotype of the chromosome-level genome assembly, EDTA (v.2.1.0; Ou et al. 2019) was used for de novo transposable element (TE) annotation. Using the TE library generated by EDTA, RepeatMasker (v.4.1.7; Smit et al. 2013-2015) was used to identify additional repeat elements and to softmask the genome (with repeat elements written in lowercase).

For gene annotation, BRAKER3 (v.3.0.8; Gabriel et al. 2024) was used to predict protein-coding genes using the RNA-seq data from the leaf and axillary bud tissues from C. glabra and protein evidence from model species (Table S1). Various BRAKER3 parameter settings were tested using the haplotype A genome (Table S2). The setting that resulted in the highest BUSCO score (using the eudicots_odb10 database) was applied to annotate all other haplotypes (i.e., B, C, and D). After the initial annotation, gene models meeting any of the following criteria were filtered out using AGAT (v.1.4.2; Dainat 2022): (1) presence of a premature stop codon; (2) absence of a start and/or stop codon; or (3) an open reading frame (ORF) length of ≤100 amino acids or ≤50 amino acids. The genes were named in accordance with the guidelines proposed by Cannon et al. (2025).

Functional annotation was performed using the web application TRAPID 2.0 (Bucchini et al. 2021), with the PLAZA 4.5 dicots database (Van Bel et al. 2018) as the reference and the rosids clade selected for the similarity search. All parameters were set to default, except that “input sequences are CDS” was selected.

Lastly, Circos (v.0.69-9; Krzywinski et al. 2009) was used to visualize the genome and the associated genetic features, including gene and TE densities along the chromosomes.

Comparative genomic analyses

Genome-level synteny analysis was performed using GENESPACE (v.1.3.1; Lovell et al. 2022) to compare the four Carya glabra haplotypes with chromosome-level genome assemblies from three other Carya species: C. cathayensis (Zhang et al. 2024b), C. illinoinensis (Lovell et al. 2021), and C. sinensis (Zhang et al. 2024b).

Identification of putative disease resistance genes

Because disease resistance is a key trait for pecan improvement, plant disease resistance genes (R genes) in the Carya glabra genome were predicted using the DRAGO 2 pipeline (with default parameters) from the Plant Resistance Genes database (PRGdb 3.0) (Osuna-Cruz et al. 2018). Using the same pipeline, R genes were also identified in three other Carya species with assembled genomes: C. illinoinensis, C. cathayensis, and C. sinensis. In addition, we focused particularly on resistance to Phylloxera – aphid-like insects that induce gall formation in pecan. A major quantitative trait locus (QTL) associated with phylloxera resistance was identified by Lovell et al. (2021) in C. illinoinensis. Using the primary assembly of C. illinoinensis cv. ‘Lakota’ as the reference, this QTL is located on chromosome 16 (positions 1521681 to 2392040), between genes CiLak.16G012100 and CiLak.16G019000 (Lovell et al. 2021). Syntenic regions in C. glabra corresponding to this QTL were detected and visualized using MCScan from JCVI (v.1.2.10) (Tang et al. 2024). Within these syntenic regions, putative R genes were identified across all four C. glabra haplotypes.
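As a small illustration of how a QTL interval such as the one quoted above can be used to select candidate genes on the reference, the Python sketch below keeps every gene whose coordinates overlap the interval on chromosome 16. It covers only the coordinate filter on the C. illinoinensis reference, not the MCScan synteny step that transfers the region to C. glabra. The QTL boundaries are taken from the text; the gene records, their identifiers, the chromosome label "Chr16", and the assumption of 1-based inclusive coordinates are hypothetical.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    gene_id: str
    chrom: str
    start: int  # assumed 1-based, inclusive
    end: int    # assumed 1-based, inclusive


# QTL boundaries as quoted in the text; the chromosome label is an assumption.
QTL_CHROM = "Chr16"
QTL_START = 1_521_681
QTL_END = 2_392_040


def overlaps(gene: Gene, chrom: str, start: int, end: int) -> bool:
    """True if the gene shares at least one base with the closed interval [start, end]."""
    return gene.chrom == chrom and gene.start <= end and gene.end >= start


# Hypothetical gene records, for illustration only.
hypothetical_genes = [
    Gene("hypo_gene_1", "Chr16", 1_400_000, 1_410_000),  # upstream of the QTL
    Gene("hypo_gene_2", "Chr16", 1_520_000, 1_523_000),  # straddles the left boundary
    Gene("hypo_gene_3", "Chr16", 2_000_000, 2_004_500),  # fully inside the QTL
    Gene("hypo_gene_4", "Chr15", 2_000_000, 2_004_500),  # same coordinates, other chromosome
]

in_qtl = [g.gene_id for g in hypothetical_genes
          if overlaps(g, QTL_CHROM, QTL_START, QTL_END)]
qtl_length = QTL_END - QTL_START + 1  # interval length under the inclusive assumption

print(in_qtl)
print(qtl_length)
```

The overlap test deliberately keeps genes that straddle a boundary; a stricter containment test would exclude them.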

Results

Statistics of sequence data

The basic statistics of the raw sequence data are summarized in Table 1. PacBio HiFi reads were generated on two SMRT cells, yielding a total of 79.1 gigabases (Gb) of data (44.1 Gb from one cell and 35.0 Gb from the other cell) (Table 1). In total, 5.3 million HiFi reads were obtained, with an average read length of 15.0 kilobases (kb). The proportions of bases with quality scores greater than 20 (Q20) and 30 (Q30) were 97.7% and 94.5%, respectively. The sequencing coverage, calculated by dividing the total number of bases by the monoploid genome size (1x), was 131.7× (Table 1). Given that Carya glabra is a tetraploid and comprises four haplotypes, the coverage per haplotype was therefore 32.9×.

For Omni-C data, a total of 264.5 million reads (derived from paired-end sequencing of 132.3 million DNA fragments) were generated, and the total number of bases was 39.7 Gb (Table 1). The Q20 and Q30 quality scores were 98.7% and 96.4%, respectively. Sequencing coverage was 66.1×, corresponding to 16.5× per haplotype in the tetraploid genome.

RNAs extracted from leaf and axillary bud tissues were of high quality, with RNA Integrity Number (RIN) scores of 7.1 and 7.2, respectively. For RNA-seq data, 161.6 million reads (from paired-end sequencing of 80.3 million fragments) were generated from the leaf tissue, and the Q20 and Q30 quality scores were 99.0% and 96.1%, respectively (Table 1). We also generated 148.8 million reads from the axillary bud tissue, and the Q20 and Q30 scores were 99.0% and 96.0%, respectively.
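The coverage figures above follow from a single division, which the short Python sketch below reproduces. The text does not state which monoploid genome size was used as the denominator. The sketch assumes the haplotype A assembly size of 600.4 Mb (reported in the abstract and Table 2), because that value reproduces the stated coverages. The function name `coverage` and the constant names are hypothetical.

```python
# Assumption: the monoploid (1x) size used for coverage is the haplotype A
# assembly size (600.4 Mb); the text does not say so explicitly.
MONOPLOID_SIZE_MB = 600.4
PLOIDY = 4  # C. glabra is tetraploid


def coverage(total_bases_gb: float, genome_size_mb: float) -> float:
    """Sequencing depth = total sequenced bases / genome size (same units)."""
    return total_bases_gb * 1000 / genome_size_mb


hifi_cov = coverage(79.1, MONOPLOID_SIZE_MB)   # PacBio HiFi, 79.1 Gb
omnic_cov = coverage(39.7, MONOPLOID_SIZE_MB)  # Omni-C, 39.7 Gb

print(f"HiFi:   {hifi_cov:.1f}x total, {hifi_cov / PLOIDY:.1f}x per haplotype")
print(f"Omni-C: {omnic_cov:.1f}x total, {omnic_cov / PLOIDY:.1f}x per haplotype")
```

Using the GenomeScope 2.0 estimate of 515.4 Mb as the denominator instead would give roughly 153× for the HiFi data, which does not match the reported 131.7×.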
Chloroplast and mitochondrial genome assembly and annotation

The chloroplast genome of Carya glabra is 160,839 bp in length and has the typical quadripartite structure (Fig. 2). The genome is composed of a pair of inverted repeat (IR) regions (i.e., IRA and IRB; 26,006 bp in length for each region), a large single-copy (LSC) region (90,041 bp), and a small single-copy (SSC) region (18,786 bp) (Fig. 2). A total of 113 unique genes, including 79 protein-coding genes, 30 tRNA genes, and 4 rRNA genes, were annotated (Fig. 2). A detailed list of these genes, along with their functional categories and genomic locations, is provided in Table S3. The GC contents of LSC, SSC, and IR regions were 33.7%, 29.9%, and 42.6%, respectively.

The two mitochondrial chromosomes are 493,063 bp and 147,309 bp in length (Fig. 3). The larger chromosome (mtChr1) also presents a quadripartite structure, in which two inverted repeats (mtIR) of 2,760 bp each separate a small single-copy (mtSSC) region (135,915 bp) from a large single-copy (mtLSC) region (351,628 bp). The smaller chromosome (mtChr2) is mostly redundant with mtChr1, consisting of one of the mtIRs, the entire mtSSC, 1,795 bp of the mtLSC, and a unique segment of 6,839 bp. A total of 42 protein-coding genes, 23 tRNA genes, and 3 rRNA genes were annotated in the mitochondrial genome (Table S4). Of these, 15 were annotated as functional plastome-derived genes (5 protein-coding genes and 10 tRNA genes) (Table S4). We additionally identified 15 nonfunctional plastome genes: six were complete but contained premature stop codons, and nine were only fragmentary. All plastome-derived genes were located inside several sequence segments with varying lengths and degrees of conservation, as measured by their sequence identity to the chloroplast assembly (Table S5). Most notably, two large segments contained multiple functional plastome genes: the first segment (15,031 bp, 99.2% identity) contained the trnA-UGC, trnI-CAU, trnL-CAA, and trnV-GAC genes, and the second segment (2,137 bp, 84.3% identity) contained the psaJ, rpl20, and rpl33 genes (Table S5).

Nuclear genome profiling

Based on k-mer frequency analysis of the unassembled HiFi reads, GenomeScope 2.0 estimated the monoploid genome size as 515.4 Mb, with a heterozygosity value of 4.9% and repetitive sequences accounting for 38.5% of the genome. The frequencies of the heterozygous forms aaab and aabb were 3.2% and 1.4%, respectively. The resulting k-mer spectrum is shown in Fig. 4. The four major peaks, corresponding to k-mers present in one to four copies, are characteristic of an autotetraploid genome.

Nuclear genome assembly and annotation

The initial unitig assembly generated by hifiasm comprised 2,856 unitigs with an N50 of 7.5 Mb. A dot plot comparing this unitig assembly with one set of chromosomes from the Carya illinoinensis genome revealed that each C. illinoinensis region corresponded to four unitigs, confirming the tetraploid nature of the C. glabra genome and indicating that the unitig assembly incorporated genomic sequences from all four haplotypes (Fig. S1). The complete BUSCO score for the unitig assembly was 98.9%, consisting of 1.0% single-copy and 97.9% duplicated BUSCOs; the high proportion of complete and duplicated BUSCOs reflects that sequences from all haplotypes were represented in the assembly.

Next, the unitigs were scaffolded by YaHS using the Omni-C data. Based on the hic_qc analysis, the Omni-C library was considered “sufficient”, showing high proportions of long-distance and inter-unitig contacts (Table S6). The initial YaHS scaffolding resulted in 2,584 scaffolds with an N50 of 36.9 Mb, including 62 scaffolds longer than 10 Mb. Examination of the Hi-C contact map, along with the dot plot comparing the Carya illinoinensis genome with the initial YaHS scaffolds, revealed several scaffolding errors, including two misjoins and an inversion error, which were corrected manually using Juicebox (Fig. S2). In addition, Juicebox was used to reorient several scaffolds to match the chromosome orientations of C. illinoinensis.

After manual curation, the final assembly contained 64 scaffolds longer than 10 Mb, accounting for 94.8% of the total assembled sequences (2,319.4 Mb out of 2,445.8 Mb) and corresponding to the expected chromosome number of the Carya glabra genome (Fig. 5). Hereafter, we refer to these 64 scaffolds as pseudo-chromosomes (or simply chromosomes for brevity). Each pseudo-chromosome was named according to its syntenic similarity with the C. illinoinensis genome based on the dot plot (Fig. 5c) and was assigned to haplotypes (A through D) based on descending length. It is important to note that this haplotype assignment is artificial and does not necessarily reflect true biological haplotypes (see Materials and Methods). The monoploid genome (1x) sizes for haplotypes A, B, C, and D were 600.4 Mb, 585.2 Mb, 574.3 Mb, and 559.4 Mb, respectively (Table 2). In addition, the complete BUSCO scores for the assembled genomes were 97.8%, 97.6%, 96.8%, and 95.4% for haplotypes A, B, C, and D, respectively (Table 2). Detailed statistics for each chromosome are provided in Table S7.

Repetitive sequences accounted for the majority of the Carya glabra genome (Table 2; Table S8). In haplotypes A, B, C, and D, 55.0%, 54.4%, 54.0%, and 53.8% of the genomic sequences were classified as repetitive regions, respectively (Table 2). Specifically, retrotransposons comprised 24.7-27.2% of the genome across the four haplotypes, and DNA transposons represented 19.4-21.5% of the genome (Table S8). In addition, simple repeats (duplications of short DNA motifs; microsatellites) accounted for 1.2-1.3% of the genome.

For protein-coding gene prediction, several BRAKER3 settings were tested using the haplotype A genome as the reference (Table S2). The combination that used RNA-seq data from C. glabra and protein evidence from 14 model species – followed by filtering out gene models ≤50 amino acids – produced the highest BUSCO score (97.7%) (Table S2). Therefore, the same setting was used to annotate the genes from haplotypes B, C, and D.

A total of 30,947 genes were predicted for haplotype A, with an average CDS length of 1,241 bp (Table 2; Table S9). For haplotypes B, C, and D, the number of predicted protein-coding genes ranged from 30,110 to 31,087 (Table 2). The average CDS length ranged from 1,239 bp to 1,254 bp (Table S9). All haplotypes had an average of 5.0 exons per gene, and the average gene length varied between 4,364 bp and 4,460 bp (Table S9).
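The organellar and nuclear sizes reported in the Results so far can be cross-checked with simple arithmetic. The Python sketch below recomputes the chloroplast and mitochondrial chromosome lengths from their component regions, and the share of the assembly placed in the 64 pseudo-chromosomes. All numbers are taken from the text; the variable names are hypothetical, and the small difference between the summed haplotype sizes and the reported 2,319.4 Mb is assumed to come from rounding each haplotype to 0.1 Mb.

```python
# Chloroplast: LSC + SSC + two inverted repeats (lengths in bp, from the text).
cp_total = 90_041 + 18_786 + 2 * 26_006

# Mitochondrial chromosome 1: mtLSC + mtSSC + two mtIR copies.
mt_chr1 = 351_628 + 135_915 + 2 * 2_760

# Mitochondrial chromosome 2: one mtIR + entire mtSSC + part of mtLSC + unique segment.
mt_chr2 = 2_760 + 135_915 + 1_795 + 6_839

# Nuclear haplotype sizes in Mb (Table 2 values quoted in the text).
haplotype_mb = {"A": 600.4, "B": 585.2, "C": 574.3, "D": 559.4}
haplotype_sum_mb = sum(haplotype_mb.values())

# Share of the total assembly placed in the 64 pseudo-chromosomes.
placed_percent = 2319.4 / 2445.8 * 100

print(cp_total)                    # chloroplast length in bp
print(mt_chr1, mt_chr2)            # mitochondrial chromosome lengths in bp
print(round(haplotype_sum_mb, 1))  # summed haplotype sizes in Mb
print(round(placed_percent, 1))    # percent of assembly in pseudo-chromosomes
```

Each recomputed organellar length matches the reported total exactly, and the placed fraction rounds to the reported 94.8%.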
TRAPID annotation assigned gene family information to 94.3% of the predicted genes in haplotype A, with 79.7% and 86.3% of genes annotated with Gene Ontology (GO) terms and protein domains, respectively (Table S9). The core gene family completeness score in TRAPID was 0.982 (at a conservation threshold of 0.9), further supporting the high completeness of the predicted gene models. Similarly, haplotypes B, C, and D showed high annotation rates: 93.9–94.7% of genes were assigned to gene families, and 85.9–86.5% were annotated with protein domains (Table S9). All haplotypes also exhibited high BUSCO completeness scores based on the annotated genes, ranging from 94.9% to 97.7% (Table 2).

Comparative genomic analysis

Synteny analysis was performed among the four haplotypes of Carya glabra and the haploid genomes of C. cathayensis, C. illinoinensis, and C. sinensis, revealing high overall collinearity among the genomes (Fig. 6). However, several structural variants were also identified. For example, an inversion on chromosome 16 was detected between the C. sinensis and C. illinoinensis genomes (indicated by green circle 1 in Fig. 6). Another inversion on chromosome 11 was observed between C. illinoinensis and all four haplotypes of C. glabra (green circle 2); this inversion was also evident in the corresponding dot plot (Fig. 5c). Furthermore, structural variation was found among the four C. glabra haplotypes.
For instance, between haplotypes B and C, the synteny analysis showed an inversion on chromosome 3, which was also detected in the dot plot (Fig. 5c; green circle 3 in Fig. 6).

Disease resistance genes in C. glabra

Plant disease resistance genes (R genes) were predicted across the four haplotypes. Specifically, we focused on four major classes of R genes: CNL [containing the coiled-coil domain, the nucleotide-binding site (NBS) domain, and the leucine-rich repeat (LRR) domain], TNL (containing the Toll-interleukin receptor-like domain, the NBS domain, and the LRR domain), RLP [receptor-like protein, containing the transmembrane (TM) domain and the LRR domain], and RLK (receptor-like kinase, containing the TM domain, the LRR domain, and the kinase domain). In haplotype A, we identified 625 putative R genes from these four classes, including 56 CNL, 39 TNL, 214 RLP, and 316 RLK class genes (Table S10). For haplotypes B, C, and D, 638, 655, and 608 putative R genes were annotated, respectively (Table S10). In addition, we identified 724, 685, and 800 putative R genes in the primary assemblies of C. illinoinensis, C. sinensis, and C. cathayensis, respectively (Table S10).

The syntenic regions in C. glabra corresponding to the major QTL for phylloxera resistance in C. illinoinensis were identified on chromosome 16 (Fig. 7). Within these syntenic regions, 8, 10, 11, and 8 R genes were detected in haplotypes A, B, C, and D, respectively (Fig. 7; Table S11). Syntenic gene pairs between the five R genes annotated in the primary assembly of C. illinoinensis cv. ‘Lakota’ and their counterparts in C. glabra are highlighted in the synteny plot (Fig. 7). Among the 37 C. glabra R genes located in these syntenic regions, 30 belong to the TNL class, while 3 and 4 belong to the RLP and RLK classes, respectively (Table S11).
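The four R-gene classes defined in this section differ only in domain composition, so the definitions can be written as a simple rule. The Python sketch below is a hedged illustration, not the prediction tool used in this study: the domain labels ('CC', 'TIR', 'NBS', 'LRR', 'TM', 'KINASE'), the function names, and the precedence of TIR over CC when both are present are all assumptions introduced here.

```python
from collections import Counter


def classify_r_gene(domains):
    """Return 'CNL', 'TNL', 'RLK', 'RLP', or None for a set of domain labels.

    Labels are hypothetical: 'CC', 'TIR', 'NBS', 'LRR', 'TM', 'KINASE'.
    """
    d = set(domains)
    if "LRR" not in d:
        return None  # all four classes considered here include an LRR domain
    if "NBS" in d:
        if "TIR" in d:
            return "TNL"  # TIR takes precedence over CC (an assumption)
        if "CC" in d:
            return "CNL"
        return None
    if "TM" in d:
        # RLK is the RLP architecture (TM + LRR) plus a kinase domain.
        return "RLK" if "KINASE" in d else "RLP"
    return None


def tally_classes(gene_domains):
    """Count genes per class; gene_domains maps gene ID -> iterable of labels."""
    counts = Counter()
    for domains in gene_domains.values():
        cls = classify_r_gene(domains)
        if cls is not None:
            counts[cls] += 1
    return counts
```

As a consistency check on the reported numbers, the four class counts for haplotype A sum to the stated total: 56 + 39 + 214 + 316 = 625.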

Discussion

Carya glabra organellar genomes

The chloroplast genome size in Juglandaceae ranges from 158,223 bp to 161,713 bp (Liu et al. 2025). Three Carya glabra chloroplast genomes have been published to date (Luo et al. 2021; Xi et al. 2022; Liu et al. 2025), with sizes ranging from 160,645 bp to 160,652 bp. In the present study, the assembled chloroplast genome of C. glabra is 160,839 bp in length (Fig. 2), very similar to the published C. glabra chloroplast genomes and within the size range observed across other species of Juglandaceae.

A total of 109, 113, and 114 unique genes were annotated in previously published C. glabra chloroplast genomes with NCBI accession numbers OR099205, NC_067504, and BK061156, respectively. In our study, 113 unique genes were identified, including 79 protein-coding genes, 30 tRNA genes, and 4 rRNA genes (Fig. 2; Table S3). The additional gene reported in accession BK061156 is ycf15, a functionally uncharacterized gene that is also absent from the well-annotated Nicotiana tabacum chloroplast genome (NC_001879). Through manual curation, we identified several misannotated and missing genes in previously reported C. glabra chloroplast genomes (summarized in Table S12). For example, additional copies of the tRNA genes trnA-UGC and trnM-CAU were misannotated in BK061156; two protein-coding genes, atpB and rpoB, were missing from OR099205; and the first exons of petB, petD, and rpl16 were absent from NC_067504. All such potential annotation errors were manually corrected in the present study. Together, these results suggest that, although several C. glabra chloroplast genomes have been published, our assembly and annotation represent the most complete and accurate version to date.
Compared with chloroplast genomes, few plant mitochondrial genome assemblies have been reported, primarily because of the high structural complexity of plant mitogenomes (Palmer and Herbon 1988; Møller et al. 2021; Wu et al. 2022; Wang et al. 2024). Only a few mitochondrial genomes have been published for species of Juglandaceae, and the available mitogenomes show substantial variation in structure and gene content. Chen et al. (2024) assembled the first mitochondrial genome of Carya illinoinensis: the single circular genome is 495.2 kb in length and contains 37 protein-coding genes, 24 tRNA genes, and 3 rRNA genes. The Juglans regia (Juglandaceae) mitogenome consists of three circular chromosomes and includes 39 protein-coding genes, 47 tRNA genes, and 5 rRNA genes (Ye et al. 2024). The Juglans mandshurica mitochondrial genome includes two chromosomes and has 38 protein-coding genes, 20 tRNA genes, and 3 rRNA genes (Su et al. 2023). In Carya glabra, the mitogenome includes two chromosomes (493.1 kb and 147.3 kb in length), and we identified 42 protein-coding genes, 23 tRNA genes, and 3 rRNA genes (Fig. 3; Table S4). Although mitogenomes are generally highly variable, the C. glabra mitochondrial genome is broadly comparable with other published Juglandaceae mitogenomes.

The varying sizes and identities of the plastome segments detected in the C. glabra mitochondrial genome suggest multiple transfer events occurring at different times (Table S5).
In future studies, it would be interesting to compare these transferred segments with the chloroplast and mitochondrial genomes of congeneric species.

Nuclear genomes in Carya

We assembled and annotated the first nuclear genome of Carya glabra (Fig. 5). The assembly is chromosome-level and haplotype-resolved, representing the first assembled polyploid genome in the genus. Furthermore, GenomeScope 2.0 predicted that Carya glabra is an autotetraploid based on the pattern of nucleotide heterozygosity levels: the frequency of the heterozygous aaab genotype was higher than that of the aabb genotype (3.2% versus 1.4%), a pattern characteristic of autopolyploids (Ranallo-Benavidez et al. 2020). Additionally, the k-mer spectrum showing four major peaks (Fig. 4), along with the high similarity among the four copies of each chromosome compared to the C. illinoinensis genome based on the dot plot (Fig. 5c), further supports the inference that C. glabra is an autotetraploid.

In terms of genomic composition, 53.8–55.0% of the Carya glabra genome consists of repetitive sequences, with slight variation among haplotypes (Table 2). Similar, but lower, proportions of repetitive content have been reported in other Carya species. Lovell et al. (2021) found that 49.7% of the C. illinoinensis genome consists of repetitive sequences, and Zhang et al. (2024b) reported repeat fractions of 43.5% and 50.1% for the genomes of C. sinensis and C. cathayensis, respectively (Table 2).

We predicted more than 30,000 protein-coding genes for each Carya glabra haplotype (Table 2).
BUSCO completeness scores were high across all haplotypes, with haplotype A having a BUSCO score of 97.7%. The number of genes predicted in Carya glabra is broadly comparable to, though lower than, those reported for other Carya species (Table 2). Lovell et al. (2021) annotated 32,267 genes in C. illinoinensis, and Zhang et al. (2024b) identified 35,370 and 36,722 genes in C. sinensis and C. cathayensis, respectively.

Several non-mutually exclusive factors may explain the differences in gene count among Carya genomes. First, the annotation pipeline can affect the number of predicted genes. Weisman et al. (2022) found that applying different annotation methods to the same genome can lead to the identification of genes unique to each method. In this study, we used BRAKER3 for gene annotation, whereas PASA (Haas et al. 2003) and FGENESH (Salamov et al. 2020) were used to annotate the C. illinoinensis genome (Lovell et al. 2021). Zhang et al. (2024b) used PASA, AUGUSTUS (Stanke et al. 2006), and GeneWise (Birney et al. 2004) to annotate the C. sinensis and C. cathayensis genomes. Second, the diversity and number of tissues represented in the RNA-seq data can affect annotation completeness, and sampling from multiple tissues is recommended (Salzberg 2019; Kress et al. 2022; Vuruputoor et al. 2023). Our annotations were supported by RNA-seq data from two tissues (leaf and axillary bud), whereas Lovell et al. (2021) used RNA-seq data from a larger number of tissues, including leaf, catkin, and dormant and swelling buds. Lastly, the lower gene count in C. glabra may reflect its polyploid nature. Genome fractionation and gene loss are common following polyploid formation (Langham et al. 2004; Leitch and Bennett 2004; Freeling 2009; Soltis et al. 2015; Van de Peer et al. 2017; Wendel et al. 2018), although fractionation as originally defined (Freeling 2009) does not strictly apply to an autopolyploid, which lacks distinct parental subgenomes. Indeed, the relatively smaller monoploid (1x) genome size of C. glabra (e.g., 600.4 Mb for haplotype A and smaller for the other haplotypes) compared with diploid Carya species (e.g., 674.3 Mb for C. illinoinensis) may result from gene loss following polyploidy in C. glabra.

In summary, the C. glabra genome assembly and annotation presented in this study are of high quality, with metrics comparable to, or surpassing (based on the BUSCO completeness score of the haplotype A annotation; Table 2), those of published genomes from other Carya species.

Potential practical applications of the Carya glabra genome assembly

The Carya glabra genome assembly provides a valuable resource for identifying candidate genes that may facilitate breeding programs in pecan (C. illinoinensis) and Chinese hickory (C. cathayensis). Notably, we identified over 600 putative disease resistance genes (R genes) in each haplotype of C. glabra (Table S10). A similar, but higher, number of R genes was identified in other Carya species: C. illinoinensis, C. sinensis, and C. cathayensis have 724, 685, and 800 R genes, respectively (Table S10). We focused particularly on a genomic region syntenic to a major QTL associated with phylloxera resistance in C. illinoinensis. Several aphid-like insects from the genus Phylloxera infest pecan and induce gall formation, which can cause defoliation and significantly reduce yield (Hedin et al. 1985; Andersen and Mizell III 1987). Lovell et al. (2021) identified a single major QTL underlying this trait, and several candidate R genes containing LRR domains were annotated within this QTL. In the syntenic region in C. glabra, we identified 8, 10, 11, and 8 R genes in haplotypes A, B, C, and D, respectively (Fig. 7; Table S11). These candidate genes provide an additional genetic resource that could facilitate engineering efforts to improve phylloxera resistance in pecan.

Polyploidy plays an important role in plant breeding (Udall and Wendel 2006; Sattler et al. 2016), and polyploids often exhibit an advantageous stress response relative to diploids (Bomblies 2020; Fox et al. 2020; Van de Peer et al. 2021; Tossi et al. 2022). Future studies examining stress response in Carya glabra and its closely related diploid species (e.g., C. palmeri and C. illinoinensis) could provide valuable insights into the effect of polyploidy on stress tolerance in Carya, information that may inform future strategies for improving pecan and Chinese hickory.

Genome assembly and annotation as tools for conservation and teaching genomics

McCarty Woods is a 2.9-acre (11,735.9 m²) designated Conservation Area located at the heart of the UF campus (Fig. 1b). Representing part of the southernmost extent of deciduous forest in eastern North America, McCarty Woods contains more than 100 native plant species, including Carya glabra (Sharman 2024). Although McCarty Woods is designated as a Conservation Area, its central location on the UF campus has made it a recurring target for development.
In 2021, a campaign led by botanists at the Florida Museum of Natural History, together with students and community members, successfully halted proposed development plans, and efforts to advocate for long-term protection and restoration of the Woods are ongoing.

In collaboration with the ACTG project, the McCarty Woods Genome Project was launched in 2024 (Sharman 2024). By sequencing the first genomes of iconic trees growing in the Woods, the project aims to “immortalize” these individuals and provide reference genomes that will guide future research and applications involving these species. These genomic resources strengthen the case for preserving the Conservation Area status of McCarty Woods and underscore its significant value for research and education. The reference genome of Carya glabra presented in this study is the first genome produced by the McCarty Woods Genome Project, with others in progress (e.g., Quercus michauxii).

A Course-based Undergraduate Research Experience (CURE) class was offered at UF in Spring 2025 as part of the McCarty Woods Genome Project (Fig. 1c). Teaching materials and data analysis pipelines from the ACTG project (Harkess 2022; Yocca et al. 2024; Zhang et al. 2024a) were incorporated into the course, providing undergraduate students with hands-on experience in genome assembly and annotation of Carya glabra. By combining real-world data with active learning, the course engaged students from eight departments (Biology, Biomedical Engineering, Chemistry, Computer & Information Science & Engineering, English, Entomology and Nematology, Mechanical and Aerospace Engineering, and Statistics) and emphasized programming, collaboration, critical thinking, and scientific writing.
Bioinformatic code generated through the course is publicly available on GitLab (https://gitlab.com/shengchenshan/bot4935-plant-genome-assembly-and-annotation), and lecture slides are available on Zenodo (https://doi.org/10.5281/zenodo.17969442). In summary, the course provided students with insight into the process of scientific research and the role of genomics in biological sciences, highlighting the value of genome assembly and annotation in training the next generation of biological scientists and bioinformaticians.

Future directions

The Carya glabra nuclear genome assembly provides an important tool for investigating the roles of polyploidy and hybridization in genome evolution in Carya. Several intriguing evolutionary questions remain. When did C. glabra undergo its most recent whole-genome duplication? Phylogenetic studies suggest that its closest relative is C. texana, which is also a tetraploid (Huang et al. 2019; Xi et al. 2022). Did these two species share an ancestral polyploidization event prior to divergence, or did they experience independent whole-genome duplication events? If the latter is the case, what is the diploid ancestor of Carya glabra? Are there undetected diploid populations of C. glabra? What environmental factors may have contributed to the success of genome doubling in these lineages?

The possibility of gene flow between C. glabra and pecan (C. illinoinensis), which is a diploid, also merits investigation. Plastome-based phylogenetic analyses have shown that C. glabra is closely related to a specific C. illinoinensis cultivar, ‘87MX3-2.11’ (Xi et al. 2022). If introgression involving C. glabra and pecan occurred, it may provide novel opportunities for pecan breeding and the potential transfer of beneficial traits from C. glabra into this economically important crop.

Table 1. Basic statistics of the raw sequence data from Carya glabra.

| Statistic | PacBio HiFi | Omni-C | RNA-seq (leaf) | RNA-seq (axillary bud) |
|---|---|---|---|---|
| Total bases (Gb) | 79.1 | 39.7 | 24.3 | 22.5 |
| Total read number (million) | 5.3 | 264.5 | 160.6 | 148.8 |
| Average read length (bp) | 15,035.2 | 150.0 | 151.0 | 151.0 |
| Coverage* | 131.7× | 66.1× | - | - |

Note: *Sequencing coverage was calculated by dividing the total number of bases by the assembled monoploid (1x) genome size (600.4 Mb for haplotype A, as described in the nuclear genome assembly section).

Table 2. Assembly statistics and genomic features of the Carya glabra genome and other published genomes of Carya species.

| Genome statistics | C. glabra (4x) Hap. A | C. glabra (4x) Hap. B | C. glabra (4x) Hap. C | C. glabra (4x) Hap. D | C. illinoinensis (2x)¹ | C. sinensis (2x) | C. cathayensis (2x) |
|---|---|---|---|---|---|---|---|
| Monoploid (1x) genome size (Mb) | 600.4 | 585.2 | 574.3 | 559.4 | 674.3 | 623.2 | 698.1 |
| N50 (Mb) | 39.6 | 37.7 | 36.8 | 36.2 | 44.7 | 38.9 | 43.5 |
| Repeat sequences (%) | 55.0 | 54.4 | 54.0 | 53.8 | 49.7 | 43.5 | 50.1 |
| Predicted protein-coding genes | 30,947 | 31,087 | 30,369 | 30,110 | 32,267 | 35,370 | 36,722 |
| Complete BUSCO (%), assembly | 97.8 | 97.6 | 96.8 | 95.4 | 98.1 | 96.9 | 97.0 |
| Complete BUSCO (%), annotation | 97.7 | 97.1 | 96.5 | 94.9 | 96.3 | 94.8 | 95.8 |
| Reference | Current work | Current work | Current work | Current work | Lovell et al. 2021 | Zhang et al. 2024b | Zhang et al. 2024b |

Note: ¹The statistics are from C. illinoinensis cv. ‘Pawnee’. Hap.: haplotype.

Fig. 1. Carya glabra (pignut hickory) on the campus of the University of Florida. (a) The C. glabra individual sequenced in this study; the inset highlights the fruits and compound leaves. (b) Location of the C. glabra individual (indicated by the red pin) in McCarty Woods on the University of Florida campus. (c) Most members of the research team in front of the C. glabra tree; most are undergraduate researchers. Photo credits: (a) Shengchen Shan; (b) John Rouse; (c) Erin L. Grady.

Fig. 2. Annotated chloroplast genome of Carya glabra. The outermost circle shows the annotated genes, color-coded according to their functional categories (legend displayed in the figure center). Genes on the inside of the circle are transcribed clockwise, whereas those on the outside are transcribed counterclockwise. Intron-containing genes are marked with an asterisk (*). The inner circle indicates the four structural regions of the chloroplast genome: the large single-copy, the small single-copy, and the two inverted repeat regions (A and B). The innermost grey graph represents the GC content, with the grey reference line marking the 50% threshold. The figure is modified from the OGDRAW output.

Fig. 3. Annotated mitochondrial genome of Carya glabra shown as two conformations labeled mtChr1 and mtChr2. The outermost circle shows the annotated genes, color-coded according to their functional categories (legend displayed at bottom center). Genes on the inside of the circle are transcribed clockwise, whereas those on the outside are transcribed counterclockwise. The innermost grey graph represents the GC content, with the grey reference line marking the 50% threshold. Chromosomes are not drawn to scale. The figure is modified from the OGDRAW output.

Fig. 4. K-mer spectrum of Carya glabra. The plot illustrates the distribution of k-mer frequencies (i.e., counts of unique k-mers; y-axis) across different coverage depths (x-axis) in the entire HiFi dataset. The leftmost error peak, representing the large number of low-coverage unique k-mers, results from sequencing errors. Peaks 1, 2, 3, and 4 correspond to k-mers present in one, two, three, and four copies, respectively, within the tetraploid genome. The coverages for peaks 1, 2, 3, and 4 are 34.2×, 68.4×, 102.6×, and 136.8×, respectively. The high-coverage “hump”, indicated by the arrow, represents k-mers derived from repetitive regions.
K-mer size: 21. The figure is modified from the GenomeScope 2.0 output.

Fig. 5. The chromosome-level assembly of the Carya glabra (4x) nuclear genome. (a) Circos plot of the 16 chromosomes from haplotype A of the Carya glabra genome. Chromosome lengths are given in Mb. The densities of various genomic features in 100-kb sliding windows across the chromosomes are shown on four tracks (A: genes; B: transposons; C: copia; D: gypsy). (b) The Hi-C contact map of the nuclear genome assembly. (c) The dot plot comparing one set of chromosomes from Carya illinoinensis (2x) and the four sets of chromosomes from C. glabra.

Fig. 6. Syntenic map (riparian plot) of homologous regions among the four haplotypes of Carya glabra and the haploid genomes of C. cathayensis, C. sinensis, and C. illinoinensis. The chromosomes are scaled by gene rank order. Among the structural variants identified, three are highlighted: green circle 1 marks an inversion on chromosome 16 between C. sinensis and C. illinoinensis; green circle 2 indicates an inversion between C. illinoinensis and haplotype A of C. glabra on chromosome 11; green circle 3 indicates an inversion between C. glabra haplotypes B and C on chromosome 3.
.CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted December 23, 2025. ; https://doi.org/10.64898/2025.12.19.695579doi: bioRxiv preprint 29 Fig. 7. Synteny between Carya illinoinensis cv. ‘Lakota’ and the four Carya glabra haplotypes at the major quantitative trait locus (QTL) associated with phylloxera resistance. QTL mapping in C. illinoinensis by Lovell et al. (2021) identified a single large QTL peak on chromosome 16. Within this QTL on the primary assembly of C. illinoinensis cv. ‘Lakota’, five putative plant disease resistance genes (R genes) containing the leucine-rich repeat (LRR) domain were annotated (indicated by arrowheads). In the corresponding syntenic region of C. glabra, chromosome 16C contains 11 putative R genes – the highest count among the four haplotypes – with each gene labeled by name. The syntenic regions on chromosomes 16A, 16B, and 16D contain 8, 10, and 8 R genes, respectively. Syntenic gene pairs are connected by the ribbons, with those linking to the 11 R genes on C. glabra chromosome 16C highlighted in red. Note that not all R genes on chromosomes 16A, 16B, and 16D are reciprocal best hits with R genes on chromosome 16C; therefore, these are not connected with red ribbons in the plot. Genes are depicted as boxes, with blue representing genes on the positive strand and green representing genes on the negative strand. Chromosome segments are not drawn to scale. .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted December 23, 2025. 
; https://doi.org/10.64898/2025.12.19.695579doi: bioRxiv preprint 30 Data availability Raw data generated in this project, including PacBio HiFi, Omni-C, and RNA-seq, are deposited in NCBI under BioProject PRJNA1373287. The four haplotypes of the nuclear genome assembly are available under BioProject PRJNA1376128–PRJNA1376131. The nuclear genome annotation and organellar genomes are available at Zenodo (https://doi.org/10.5281/zenodo.17969322). All codes and scripts are available at: https://gitlab.com/shengchenshan/bot4935-plant-genome-assembly-and-annotation. Acknowledgments The authors acknowledge Matthew A. Gitzendanner, Andre S. Chanderbali, and Lawrence Oshins from the University of Florida Research Computing team for their technical assistance and support. We also appreciate the helpful discussions with Rhett M. Rautsaw and Shujun Ou on PacBio sequencing and transposable element annotation, respectively. Computational resources were provided by HiPerGator, the University of Florida supercomputer. Funding This work was supported by US National Science Foundation grants IOS-1923234 and DEB- 2043478 to DES and PSS, DBI-2320251 to PSS and DES, IOS-PGRP CAREER-223930 to AH, and the University of Florida. Author contributions DES, PSS, AH, SS, and EMO designed the project. SS, EMO, PSS, DES, AH, BK, AO, GS, BS, RT, AT, EL, BP, TR, LS, GV, LW, and HZ contributed to data analysis and interpretation. SS, EMO, DES, PSS, HZ, AH, BK, AO, GS, BS, RT, AT, BP, TR, MHR, and GV wrote the manuscript. All authors reviewed and approved the manuscript. Conflicts of interest The authors declare no conflict of interest. .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted December 23, 2025. 
Supplementary materials

Fig. S1. Dot plot comparing one set of chromosomes from Carya illinoinensis (2x) with the unitig assembly of Carya glabra (4x).
Fig. S2. Manual curation of the YaHS scaffolding output using Juicebox.
Table S1. Protein evidence used for nuclear genome annotation.
Table S2. Statistics of gene models predicted under different BRAKER3 parameter settings for Carya glabra haplotype A.
Table S3. Annotated genes in the Carya glabra chloroplast genome.
Table S4. Annotated genes in the Carya glabra mitochondrial genome.
Table S5. Chloroplast-derived segments in the Carya glabra mitochondrial genome.
Table S6. Omni-C library quality control report from Phase Genomics’ hic_qc pipeline.
Table S7. Lengths (in Mb) of the 64 assembled pseudo-chromosomes of Carya glabra.
Table S8. Summary of repetitive element annotation in Carya glabra.
Table S9. Statistics of finalized gene models predicted for four haplotypes from Carya glabra.
Table S10. Four major classes of plant disease resistance genes (R genes) identified in Carya glabra and three other Carya species with assembled genomes.
Table S11. Putative Carya glabra plant disease resistance genes (R genes) identified in the syntenic regions corresponding to the major quantitative trait locus (QTL) associated with phylloxera resistance in Carya illinoinensis.
Table S12. Misannotated and missing genes in previously published Carya glabra chloroplast genomes.

Literature cited

Andersen PC, Mizell III RF. 1987. Physiological effects of galls induced by Phylloxera notabilis (Homoptera: Phylloxeridae) on pecan foliage. Environmental Entomology. 16(1):264–268.
Birney E, Clamp M, Durbin R. 2004. GeneWise and genomewise. Genome Research. 14:988–995.
Bomblies K. 2020. When everything changes at once: finding a new normal after genome duplication. Proceedings of the Royal Society B. 287(1939):20202154.
Bucchini F, Del Cortona A, Kreft Ł, Botzki A, Van Bel M, Vandepoele K. 2021. TRAPID 2.0: a web application for taxonomic and functional analysis of de novo transcriptomes. Nucleic Acids Research. 49(17):e101.
Cabanettes F, Klopp C. 2018. D-GENIES: dot plot large genomes in an interactive, efficient and simple way. PeerJ. 6:e4958.
Cannon EK, Molik DC, Wright AJ, Zhang H, Honaas L, Chougule K, Dyer S. 2025. Guidelines for gene and genome assembly nomenclature. Genetics. 229(3):iyaf006.
Chen S, Zhou Y, Chen Y, Gu J. 2018. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics. 34(17):i884–i890.
Chen Y, Wang W, Zhang S, Zhao Y, Feng L, Zhu C. 2024. Assembly and analysis of the complete mitochondrial genome of Carya illinoinensis to provide insights into the conserved sequences of tRNA genes. Scientific Reports. 14:28571.
Cheng H, Concepcion GT, Feng X, Zhang H, Li H. 2021. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nature Methods. 18(2):170–175.
Coder KD. 2023. Native hickories of Georgia I: History & genetic relationships. University of Georgia, Warnell School of Forestry & Natural Resources. [accessed 2025 November 20]; WSFNR-23-24A.
Cognat V, Pawlak G, Pflieger D, Drouard L. 2022. PlantRNA 2.0: an updated database dedicated to tRNAs of photosynthetic eukaryotes. The Plant Journal. 112(4):1112–1119.
Dainat J. 2022. Another Gtf/Gff Analysis Toolkit (AGAT): Resolve interoperability issues and accomplish more with your annotations. In Plant and Animal Genome XXIX Conference, San Diego, CA, USA.
Danecek P, Bonfield JK, Liddle J, Marshall J, Ohan V, Pollard MO, Whitwham A, Keane T, McCarthy SA, Davies RM, et al. 2021. Twelve years of SAMtools and BCFtools. GigaScience. 10(2):giab008.
Dudchenko O, Shamim MS, Batra SS, Durand NC, Musial NT, Mostofa R, Pham M, Glenn St Hilaire B, Yao W, Stamenova E, et al. 2018. The Juicebox Assembly Tools module facilitates de novo assembly of mammalian genomes with chromosome-length scaffolds for under $1000. bioRxiv. 254797.
Duncan WH, Duncan MB. 1988. Trees of the southeastern United States. Athens (GA): The University of Georgia Press.
Durand NC, Shamim MS, Machol I, Rao SS, Huntley MH, Lander ES, Aiden EL. 2016. Juicer provides a one-click system for analyzing loop-resolution Hi-C experiments. Cell Systems. 3(1):95–98.
Fox DT, Soltis DE, Soltis PS, Ashman TL, Van de Peer Y. 2020. Polyploidy: a biological force from cells to ecosystems. Trends in Cell Biology. 30(9):688–694.
Freeling M. 2009. Bias in plant gene content following different sorts of duplication: tandem, whole-genome, segmental, or by transposition. Annual Review of Plant Biology. 60:433–453.
Gabriel L, Brůna T, Hoff KJ, Ebel M, Lomsadze A, Borodovsky M, Stanke M. 2024. BRAKER3: Fully automated genome annotation using RNA-seq and protein evidence with GeneMark-ETP, AUGUSTUS, and TSEBRA. Genome Research. 34:769–777.
Grauke LJ, Wood BW, Harris MK. 2016. Crop vulnerability: Carya. HortScience. 51(6):653–663.
Greiner S, Lehwark P, Bock R. 2019. OrganellarGenomeDRAW (OGDRAW) version 1.3.1: Expanded toolkit for the graphical visualization of organellar genomes. Nucleic Acids Research. 47(W1):W59–W64.
Haas BJ, Delcher AL, Mount SM, Wortman JR, Smith Jr RK, Hannick LI, Maiti R, Ronning CM, Rusch DB, Town CD, et al. 2003. Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Research. 31(19):5654–5666.
Hardt RA, Forman RT. 1989. Boundary form effects on woody colonization of reclaimed surface mines. Ecology. 70(5):1252–1260.
Harkess A. 2022. The American Campus Tree Genomes documentation; [accessed 2025 November 21]. https://actg-wgaa.readthedocs.io/en/latest.
Hedin PA, Neel WW, Burks ML, Grimley E. 1985. Evaluation of plant constituents associated with pecan phylloxera gall formation. Journal of Chemical Ecology. 11(4):473–484.
Huang Y, Xiao L, Zhang Z, Zhang R, Wang Z, Huang C, Huang R, Luan Y, Fan T, Wang J, et al. 2019. The genomes of pecan and Chinese hickory provide insights into Carya evolution and nut nutrition. GigaScience. 8(5):giz036.
Kress WJ, Soltis DE, Kersey PJ, Wegrzyn JL, Leebens-Mack JH, Gostel MR, Liu X, Soltis PS. 2022. Green plant genomes: What we know in an era of rapidly expanding opportunities. Proceedings of the National Academy of Sciences, USA. 119(4):e2115640118.
Krzywinski M, Schein J, Birol I, Connors J, Gascoyne R, Horsman D, Jones SJ, Marra MA. 2009. Circos: an information aesthetic for comparative genomics. Genome Research. 19(9):1639–1645.
Langham RJ, Walsh J, Dunn M, Ko C, Goff SA, Freeling M. 2004. Genomic duplication, fractionation and the origin of regulatory novelty. Genetics. 166(2):935–945.
Li H, Durbin R. 2009. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics. 25(14):1754–1760.
Li J, Ni Y, Lu Q, Chen H, Liu C. 2025. PMGA: A plant mitochondrial genome annotator. Plant Communications. 6(3):101191.
Liu Y, Chen K, Wang L, Yu X, Xu C, Suo Z, Zhou S, Shi S, Dong W. 2025. Assembly-free reads accurate identification (AFRAID) approach outperforms other methods of DNA barcoding in the walnut family (Juglandaceae). Plant Diversity. 47(1):115–126.
Lovell JT, Bentley NB, Bhattarai G, Jenkins JW, Sreedasyam A, Alarcon Y, Bock C, Boston LB, Carlson J, Cervantes K, et al. 2021. Four chromosome scale genomes and a pan-genome annotation to accelerate pecan tree breeding. Nature Communications. 12(1):4125.
Lovell JT, Sreedasyam A, Schranz ME, Wilson M, Carlson JW, Harkess A, Emms D, Goodstein DM, Schmutz J. 2022. GENESPACE tracks regions of interest and gene copy number variation across multiple genomes. eLife. 11:e78526.
Luo J, Chen J, Guo W, Yang Z, Lim KJ, Wang Z. 2021. Reassessment of Annamocarya sinesis (Carya sinensis) taxonomy through concatenation and coalescence phylogenetic analysis. Plants. 11(1):52.
Manchester SR. 1999. Biogeographical relationships of North American tertiary floras. Annals of the Missouri Botanical Garden. 86(2):472–522.
Manni M, Berkeley MR, Seppey M, Zdobnov EM. 2021. BUSCO: assessing genomic data quality and beyond. Current Protocols. 1:e323.
Marçais G, Kingsford C. 2011. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics. 27(6):764–770.
Møller IM, Rasmusson AG, Van Aken O. 2021. Plant mitochondria – past, present and future. The Plant Journal. 108(4):912–959.
Ortiz EM, Höwener A, Shigita G, Raza M, Maurin O, Zuntini A, Forest F, Baker WJ, Schaefer H. 2023. A novel phylogenomics pipeline reveals complex pattern of reticulate evolution in Cucurbitales. bioRxiv. 564367.
Osuna-Cruz CM, Paytuvi-Gallart A, Di Donato A, Sundesha V, Andolfo G, Aiese Cigliano R, Sanseverino W, Ercolano MR. 2018. PRGdb 3.0: a comprehensive platform for prediction and analysis of plant disease resistance genes. Nucleic Acids Research. 46(D1):D1197–D1201.
Ou S, Su W, Liao Y, Chougule K, Agda JR, Hellinga AJ, Lugo CS, Elliott TA, Ware D, Peterson T, et al. 2019. Benchmarking transposable element annotation methods for creation of a streamlined, comprehensive pipeline. Genome Biology. 20:275.
Palmer JD, Herbon LA. 1988. Plant mitochondrial DNA evolved rapidly in structure, but slowly in sequence. Journal of Molecular Evolution. 28:87–97.
POWO. 2025. Plants of the World Online; [accessed 2025 November 30]. https://powo.science.kew.org/.
Ranallo-Benavidez TR, Jaron KS, Schatz MC. 2020. GenomeScope 2.0 and Smudgeplot for reference-free profiling of polyploid genomes. Nature Communications. 11:1432.
Robinson JT, Turner D, Durand NC, Thorvaldsdóttir H, Mesirov JP, Aiden EL. 2018. Juicebox.js provides a cloud-based visualization system for Hi-C data. Cell Systems. 6(2):256–258.
Salamov AA, Solovyev VV. 2000. Ab initio gene finding in Drosophila genomic DNA. Genome Research. 10:516–522.
Salzberg SL. 2019. Next-generation genome annotation: We still struggle to get it right. Genome Biology. 20:92.
Sattler MC, Carvalho CR, Clarindo WR. 2016. The polyploidy and its key role in plant breeding. Planta. 243(2):281–296.
Sharman S. 2024. Using genomics to immortalize and protect McCarty Woods on UF Campus; [accessed 2025 November 21]. https://www.hudsonalpha.org/using-genomics-to-immortalize-and-protect-mccarty-woods.
Shen W, Sipos B, Zhao L. 2024. SeqKit2: A Swiss army knife for sequence and alignment processing. iMeta. 3(3):e191.
Smalley GW. 1990. Carya glabra (Mill.) Sweet pignut hickory. In: Burns RM, Honkala BH, editors. Silvics of North America (Volume 2, Hardwoods). Washington, DC: U.S. Department of Agriculture, Forest Service. p. 198–204.
Smit AFA, Hubley R, Green P. 2013–2015. RepeatMasker Open-4.0. http://www.repeatmasker.org.
Soltis PS, Marchant DB, Van de Peer Y, Soltis DE. 2015. Polyploidy and genome evolution in plants. Current Opinion in Genetics & Development. 35:119–125.
Stanke M, Schöffmann O, Morgenstern B, Waack S. 2006. Gene prediction in eukaryotes with a generalized hidden Markov model that uses hints from external sources. BMC Bioinformatics. 7:62.
Stone DE. 1961. Ploidal level and stomatal size in the American hickories. Brittonia. 13:293–302.
Su X, Liu Q, Guo H, Hu D, Liu D, Wang Z, Zhang P. 2023. Deciphering the mitochondrial genome of Juglans mandshurica (Juglandaceae). Mitochondrial DNA Part B. 8(2):249–254.
Sutton J, Crowley D. 2020. Carya hybrids. Trees and Shrubs Online. https://treesandshrubsonline.org/articles/carya/carya-hybrids.
Tang H, Krishnakumar V, Zeng X, Xu Z, Taranto A, Lomas JS, Zhang Y, Huang Y, Wang Y, Yim WC, et al. 2024. JCVI: A versatile toolkit for comparative genomics analysis. iMeta. 3(4):e211.
Tillich M, Lehwark P, Pellizzer T, Ulbricht-Jones ES, Fischer A, Bock R, Greiner S. 2017. GeSeq – versatile and accurate annotation of organelle genomes. Nucleic Acids Research. 45(W1):W6–W11.
Tirmenstein DA. 1991. Carya glabra, pignut hickory; [accessed 2025 November 21]. https://research.fs.usda.gov/feis/species-reviews/cargla.
Tossi VE, Martínez Tosar LJ, Laino LE, Iannicelli J, Regalado JJ, Escandón AS, Baroli I, Causin HF, Pitta-Álvarez SI. 2022. Impact of polyploidy on plant tolerance to abiotic and biotic stresses. Frontiers in Plant Science. 13:869423.
Udall JA, Wendel JF. 2006. Polyploidy and crop improvement. Crop Science. 46(S1):S3–S14.
USDA-NASS. 2025. Noncitrus Fruits and Nuts: 2024 Summary. United States Department of Agriculture, National Agricultural Statistics Service, Washington, DC, USA. https://esmis.nal.usda.gov/sites/default/release-files/zs25x846c/mc87rn20c/w37656321/ncit0525.pdf.
Van Bel M, Diels T, Vancaester E, Kreft L, Botzki A, Van de Peer Y, Coppens F, Vandepoele K. 2018. PLAZA 4.0: An integrative resource for functional, evolutionary and comparative plant genomics. Nucleic Acids Research. 46(D1):D1190–D1196.
Van de Peer Y, Ashman TL, Soltis PS, Soltis DE. 2021. Polyploidy: an evolutionary and ecological force in stressful times. The Plant Cell. 33(1):11–26.
Van de Peer Y, Mizrachi E, Marchal K. 2017. The evolutionary significance of polyploidy. Nature Reviews Genetics. 18:411–424.
Vasimuddin M, Misra S, Li H, Aluru S. 2019. Efficient architecture-aware acceleration of BWA-MEM for multicore systems. Paper presented at: IPDPS 2019. IEEE International Parallel and Distributed Processing Symposium; Rio de Janeiro, Brazil.
Vuruputoor VS, Monyak D, Fetter KC, Webster C, Bhattarai A, Shrestha B, Zaman S, Bennett J, McEvoy SL, Caballero M, et al. 2023. Welcome to the big leaves: Best practices for improving genome annotation in non-model plant genomes. Applications in Plant Sciences. 11(4):e11533.
Wang J, Kan S, Liao X, Zhou J, Tembrock LR, Daniell H, Jin S, Wu Z. 2024. Plant organellar genomes: Much done, much more to do. Trends in Plant Science. 29(7):754–769.
Weisman CM, Murray AW, Eddy SR. 2022. Mixing genome annotation methods in a comparative analysis inflates the apparent number of lineage-specific genes. Current Biology. 32(12):2632–2639.
Wendel JF, Lisch D, Hu G, Mason AS. 2018. The long and short of doubling down: Polyploidy, epigenetics, and the temporal dynamics of genome fractionation. Current Opinion in Genetics & Development. 49:1–7.
Wick RR, Schultz MB, Zobel J, Holt KE. 2015. Bandage: Interactive visualization of de novo genome assemblies. Bioinformatics. 31(20):3350–3352.
Woodworth RH. 1930. Meiosis of microsporogenesis in the Juglandaceae. American Journal of Botany. 17(9):863–869.
Wu ZQ, Liao XZ, Zhang XN, Tembrock LR, Broz A. 2022. Genomic architectural variation of plant mitochondria – A review of multichromosomal structuring. Journal of Systematics and Evolution. 60(1):160–168.
Xi J, Lv S, Zhang W, Zhang J, Wang K, Guo H, Hu J, Yang Y, Wang J, Xia G, et al. 2022. Comparative plastomes of Carya species provide new insights into the plastomes evolution and maternal phylogeny of the genus. Frontiers in Plant Science. 13:990064.
Xiao L, Yu M, Zhang Y, Hu J, Zhang R, Wang J, Guo H, Zhang H, Guo X, Deng T, et al. 2021. Chromosome-scale assembly reveals asymmetric paleo-subgenome evolution and targets for the acceleration of fungal resistance breeding in the nut crop, pecan. Plant Communications. 2(6):100247.
Ye H, Liu H, Li H, Lei D, Gao Z, Zhou H, Zhao P. 2024. Complete mitochondrial genome assembly of Juglans regia unveiled its molecular characteristics, genome evolution, and phylogenetic implications. BMC Genomics. 25:894.
Yocca A, Akinyuwa M, Bailey N, Cliver B, Estes H, Guillemette A, Hasannin O, Hutchison J, Jenkins W, Kaur I, et al. 2024. A chromosome-scale assembly for ‘d’Anjou’ pear. G3: Genes, Genomes, Genetics. 14(3):jkae003.
Zhang H, Ko I, Eaker A, Haney S, Khuu N, Ryan K, Appleby AB, Hoffmann B, Landis H, Pierro KA, et al. 2024a. A haplotype-resolved, chromosome-scale genome for Malus domestica Borkh. ‘WA 38’. G3: Genes, Genomes, Genetics. 14(12):jkae222.
Zhang JB, Li RQ, Xiang XG, Manchester SR, Lin L, Wang W, Wen J, Chen ZD. 2013. Integrated fossil and molecular data reveal the biogeographic diversification of the eastern Asian-eastern North American disjunct hickory genus (Carya Nutt.). PLoS One. 8(7):e70449.
Zhang WP, Ding YM, Cao Y, Li P, Yang Y, Pang XX, Bai WN, Zhang DY. 2024b. Uncovering ghost introgression through genomic analysis of a distinct eastern Asian hickory species. The Plant Journal. 119(3):1386–1399.
Zhou C, McCarthy SA, Durbin R. 2023. YaHS: yet another Hi-C scaffolding tool. Bioinformatics. 39(1):btac808.
Zhou C, Brown M, Blaxter M, Darwin Tree of Life Project Consortium, McCarthy SA, Durbin R. 2025. Oatk: A de novo assembly tool for complex plant organelle genomes. Genome Biology. 26:235.

License: CC-BY-NC-ND-4.0