Genomic Variation Landscape and Population Genetic Analysis of Camellia longistyla Based on Whole-Genome Resequencing

Authorea preprint, 18 March 2026, V1. This is a preprint and has not been peer reviewed. Data may be preliminary.

Authors: Fengchan Wu (ORCID 0009-0003-6251-3426), Binyang Zhao, Peiyu Xi, Na Ran, Yulin Guo, Chunyan Guo, and Anding Li

https://doi.org/10.22541/au.177383017.79260276/v1

Abstract

Camellia longistyla is a species endemic to Guizhou Province with significant ornamental value and potential economic value for oil production. To elucidate its genetic background and to guide scientific conservation and sustainable utilization, this study applied whole-genome resequencing, for the first time, to 98 individuals from three natural populations in Chishui City and Leishan County.
Using high-quality SNP markers, we systematically analyzed its genomic variation characteristics, population genetic structure, and genetic diversity levels. A total of 61,370,270 high-quality SNPs were identified; they were distributed relatively evenly across chromosomes but formed distinct variation hotspots. Population structure analysis divided all samples into two main subpopulations (Pop60 and Pop38) that corresponded exactly to their geographical origins, indicating that geographical isolation is likely the key factor driving population differentiation. Genetic diversity was moderately low overall and unbalanced between the two subpopulations: the larger Chishui population (Pop60) exhibited lower genetic diversity than the Leishan population (Pop38), suggesting a higher risk of genetic diversity loss in the former. This study provides the first genome-level insights into the genetic structure and diversity status of C. longistyla, offering scientific evidence for formulating differentiated conservation strategies, identifying priority conservation units, and supporting future germplasm innovation and breeding research.
Fengchan Wu 1, Binyang Zhao 2, Peiyu Xi 1, Na Ran 1, Yulin Guo 1, Chunyan Guo 3 and Anding Li 1,*

1 Guizhou Institute of Biology, Guizhou Academy of Sciences, Guiyang, Guizhou Province, China 550009
2 Zunyi City Forestry Research Institute, Zunyi City Forestry Bureau, Zunyi, China 563000
3 Guizhou Botanical Garden, Guizhou Academy of Sciences, Guiyang, China 550001
* Correspondence: Anding Li

Keywords: Camellia longistyla; Population structure; Genetic diversity; Whole genome resequencing

Introduction

Camellia longistyla belongs to the genus Camellia (section Camellia) within the family Theaceae. It is endemic to Guizhou Province (Zhou & Chen, 1983) and is currently found only in the Chishui Alsophila Natural Reserve and the southeastern part of the Leigongshan National Nature Reserve, in evergreen broad-leaved mixed forests at altitudes of 950–1400 m. The species has weak natural regeneration capacity and a limited population size, and it is therefore classified as an endangered plant requiring urgent protection (Liu et al., 2016). As an evergreen broad-leaved tree, C. longistyla combines a graceful tree form, showy flowers, and high ornamental value with outstanding economic traits (Zou, 2000). Its fruit is relatively large, with a single-fruit weight of approximately 49.61 g and a thousand-seed weight of up to 3289.7 g, considerably higher than those of Camellia oleifera. Its oil content reaches 43.93%, with oil quality approaching national tea oil standards, and it is rich in amino acids. These traits indicate excellent potential for oil utilization and comprehensive exploitation, making it an important specialty oil and ornamental tree species (Liu et al., 2018).
Genetic diversity is shaped over the long-term evolution of a species and is a core component of biodiversity, directly affecting a species' adaptive potential, evolutionary capacity, and value for sustainable utilization (Wang et al., 2014). For forest trees, assessing genetic diversity not only helps reveal their evolutionary history and population structure but also provides a scientific basis for germplasm resource conservation, genetic improvement, and elite variety breeding (Huan et al., 2019). With the development of molecular biology techniques, molecular marker technologies based on high-throughput sequencing have become effective tools for germplasm resource research. As the most common type of genetic variation, SNPs have been widely applied in natural population studies to investigate genetic diversity and population genetic structure in plants such as pigeonpea (Varshney et al., 2017), Albizia odoratissima on Hainan Island (Li, Ji, Yang, Xu, & Guan, 2025), Eucommia ulmoides (Qing et al., 2022), Rhododendron bailiense (Luo et al., 2025), apple (Luo, Evans, Norelli, Zhang, & Peace, 2020), and teak (Patturaj, Manikantan, Veerasamy, Elias, & Ramasamy, 2025). Zhao et al. analyzed a total of 112,072 SNPs and conducted phylogenetic analysis, principal component analysis, and population structure analysis to investigate the genetic diversity and geographical distribution characteristics of cultivated tea plants on the Guizhou Plateau (Zhao et al., 2022). The results showed significant differences between ancient tea germplasm from the Yangtze River Basin and the Pearl River Basin, and the clustering relationships among three populations supported this distinction. Niu et al. used 79,016 SNPs to perform population structure analysis on 415 tea plant materials, identifying four groups: pure wild type, mixed wild type, ancient landraces, and modern landraces (Zhao et al., 2022). Zhou et al. developed 29 SNP markers for wild resources of Camellia chekiangoleosa to explore its genetic relationship with two wild related species. The results showed that the genetic diversity of the three Camellia species was generally low, but two markers exhibited high discriminative power and could serve as important molecular tools for species differentiation. Genetic differentiation analysis indicated that C. chekiangoleosa was more closely related to Camellia crassissima, and the phylogenetic tree also supported C. chekiangoleosa and C. crassissima forming independent branches (Zhou et al., 2025).

Although C. longistyla possesses significant ecological, economic, and ornamental value, knowledge of its genetic background and resource distribution remains very limited, and no study has reported its genetic diversity at the whole-genome level. This study therefore uses whole-genome resequencing, for the first time, to sequence C. longistyla populations from different natural distribution sites. Through genome-wide SNP marker analysis, we characterize its population genetic structure, diversity levels, and differentiation, and combine these results with population genetics methods to delineate conservation units. The study provides a theoretical foundation for the scientific conservation, germplasm resource preservation, and sustainable breeding utilization of this rare and endemic species, and offers a methodological reference for genetic studies of other Camellia species.

Materials and methods

Experimental Materials

The C. longistyla materials for this study were collected from three natural distribution sites, all located in the subtropical humid climate zone of China. The sampling locations were: Wangxiangtai, Jinshagou Alsophila Natural Reserve, Chishui City (county-level city), Guizhou Province (A); Erlangba, Chishui City, Guizhou Province (B); and Datangxiang, Leigongshan, Leishan County, Guizhou Province (C).
The sampling sites spanned longitudes from 105°58′45″E to 108°02′46″E and latitudes from 26°17′58″N to 28°32′37″N, with altitudes ranging from 1039.8 to 1369.7 m (Table 1). A total of 98 germplasm individuals were collected: 40 samples (Sample IDs: 1-40) from site A, 20 samples (Sample IDs: 41-64, excluding 42, 43, 44, 51) from site B, and 38 samples (Sample IDs: 65-104, excluding 6, 30) from site C. Young leaves from each plant were collected, immediately placed in cryotubes, flash-frozen in liquid nitrogen, and stored at -80°C for subsequent sequencing and variant detection.

Fig. 1 Geographic locations of the C. longistyla populations sampled in this study.

Table 1 Collection location information for the C. longistyla samples.

| Population | Sample locality | Longitude (E) | Latitude (N) | Altitude (m) |
|---|---|---|---|---|
| A | Wangxiangtai, Jinshagou Alsophila Natural Reserve, Chishui City (county-level city), Guizhou Province | 106°01′19″E | 28°27′12″N | 1369.65 |
| B | Erlangba, Chishui City, Guizhou Province | 105°58′45″E | 28°32′37″N | 1305.29 |
| C | Datangxiang, Leigongshan, Leishan County, Guizhou Province | 108°02′46″E | 26°17′58″N | 1039.78 |

2.2. Data Analysis

2.2.1. DNA Extraction, Library Construction, and Sequencing

Genomic DNA from C. longistyla samples was extracted using the modified CTAB method (Doyle, 1987). DNA integrity was assessed by 0.75% agarose gel electrophoresis, purity was checked with a NanoDrop One spectrophotometer (requiring an OD260/280 ratio between 1.8 and 2.2), and precise quantification was performed using a Qubit 3.0 fluorometer (requiring a concentration >50 ng/μl, total amount […]). […] a Covaris ultrasonic disruptor; fragments of 200-400 bp were selected using magnetic beads, followed by end repair, A-tailing, adapter ligation, library purification, and PCR amplification to complete library construction. Preliminary quantification and insert size detection were performed using Qubit 2.0 and Agilent 2100, respectively, and the effective library concentration was accurately quantified by Q-PCR.
Qualified libraries were sequenced on the DNBSEQ platform, using rolling circle replication to prepare DNA nanoballs and combinatorial Probe-Anchor Synthesis for paired-end sequencing. Raw data generated from sequencing were converted to FASTQ format after base calling. Quality control and filtering were performed using fastp (version 0.20.1; default parameters) (Chen, Zhou, Chen, & Gu, 2018), with the following criteria: removing adapter sequences; trimming polyG/polyX tails of length ≥10 bp; cropping low-quality regions using a sliding window; and removing reads with more than 5 N bases, with more than 40% low-quality bases (quality value <15, the fastp default), or shorter than the fastp default minimum length of 15 bp. The remaining reads were retained as Clean Reads for subsequent analysis.

2.2.2. Sequence Alignment, Variant Detection, and Annotation

Clean Reads from each sample were aligned to the reference genome using BWA (parameters: 'mem -R') (Alexander, Novembre, & Lange, 2009). SAM files were converted to sorted BAM files using SAMtools (parameters: 'sort'), and PCR duplicates were removed using 'markdup -r'. Alignment rates and coverage were summarized using Python scripts (Alexander, Novembre, & Lange, 2009; Li et al., 2009). Variant detection was performed on the BAM files following the GATK Best Practices workflow. First, preliminary variant calling was conducted using the HaplotypeCaller module (GATK v4.2.5.0; minimum mapping quality set to 20). Base quality score recalibration was then performed using the BaseRecalibrator and ApplyBQSR modules, with the preliminary calls serving as the known variant set. HaplotypeCaller was run again on the recalibrated data and combined with GenotypeGVCFs for joint genotyping. The resulting variant set underwent hard filtering using the VariantFiltration module, and SNPs and InDels were separated using GATK's SelectVariants module (Depristo, Banks, Poplin, Garimella, & Daly, 2011).
The SNPs obtained were then stringently filtered. vcftools (parameters: '--max-missing 0.9 --maf 0.01 --min-alleles 2 --max-alleles 2') was used to remove sites with a missing rate above 10%, sites with a minor allele frequency below 0.01, and non-biallelic sites; PLINK (v1.90b6.21; parameter: '--hwe 1e-6') was then used to remove sites deviating from Hardy-Weinberg equilibrium, ultimately yielding 61,370,270 high-quality SNPs (Danecek et al., 2011). SNP functional annotation was performed using ANNOVAR (Kai, Mingyao, & Hakon, 2010), and genome-wide marker density was visualized using the CMplot package in R (Li Lin Yin, 2017).

2.2.3. Population Genetics Analysis

Population genetic analysis was conducted on the high-quality SNP dataset described above. The genetic relationship matrix among samples and the principal component analysis were calculated using VCF2PCACluster, and results were visualized in R. A neighbor-joining tree was constructed from the genetic relationship matrix using MEGA11 and refined with the iTOL online tool (Sudhir, Glen, Michael, Christina, & Koichiro, 2018). Linkage disequilibrium decay was calculated using PopLDdecay (window size: 500 kb). Population structure analysis was performed using ADMIXTURE with preset K values from 1 to 9, and the optimal K was chosen as the value that minimized the cross-validation error. Based on geographical distribution, the two adjacent populations from Chishui City were merged into subpopulation Pop60, and the remaining independent population was designated subpopulation Pop38 (Alexander, Novembre, & Lange, 2009). Genome-wide nucleotide diversity was calculated using vcftools (parameters: '--window-pi 100000 --window-pi-step 10000'), that is, in 100 kb windows with 10 kb steps. Genetic differentiation coefficients (Fst) between subpopulations were calculated using vcftools (parameters: '--fst-window-size 100000 --fst-window-step 10000') with the same window settings.
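The windowed statistics described above can be illustrated with a minimal sketch. It assumes diploid genotypes coded as alternate-allele counts (0, 1, 2) with no missing calls, and it uses Nei's estimator Fst = (Ht - Hs) / Ht rather than the Weir and Cockerham estimator that vcftools reports, so it will not reproduce vcftools output. The function names are hypothetical and are not part of any tool used in this study.

```python
def allele_freq(genotypes):
    """Alternate-allele frequency from diploid genotypes coded 0/1/2."""
    return sum(genotypes) / (2 * len(genotypes))


def site_pi(genotypes):
    """Per-site nucleotide diversity for one biallelic SNP,
    with the n/(n-1) small-sample correction (n = sampled chromosomes)."""
    n = 2 * len(genotypes)
    p = allele_freq(genotypes)
    return n / (n - 1) * 2 * p * (1 - p)


def window_pi(sites, window_bp):
    """Windowed pi: sum of per-site diversity divided by window length.
    Invariant positions contribute zero, so only SNPs are passed in."""
    return sum(site_pi(g) for g in sites) / window_bp


def nei_fst(geno_a, geno_b):
    """Nei's Fst = (Ht - Hs) / Ht for one SNP and two populations.
    Simplified stand-in, NOT the Weir-Cockerham estimator of vcftools."""
    pa, pb = allele_freq(geno_a), allele_freq(geno_b)
    na, nb = len(geno_a), len(geno_b)
    # Mean within-population expected heterozygosity, weighted by sample size.
    hs = (na * 2 * pa * (1 - pa) + nb * 2 * pb * (1 - pb)) / (na + nb)
    # Expected heterozygosity of the pooled sample.
    p_tot = (na * pa + nb * pb) / (na + nb)
    ht = 2 * p_tot * (1 - p_tot)
    return 0.0 if ht == 0 else (ht - hs) / ht


if __name__ == "__main__":
    # Toy data: two SNPs genotyped in four individuals within a 100 kb window.
    toy_window = [[0, 1, 2, 1], [0, 0, 1, 0]]
    print(window_pi(toy_window, window_bp=100_000))
    # Toy data: one SNP in two populations with different allele frequencies.
    print(nei_fst([0, 0, 1], [2, 2, 1]))
```

Sliding the window in 10 kb steps, as in the settings above, would amount to calling `window_pi` on the SNPs that fall inside each successive 100 kb interval.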
Additionally, observed heterozygosity, expected heterozygosity, and minor allele frequency for the two subpopulations and the overall sample were calculated using PLINK, and the statistics were compiled using Python scripts.

Results

Quality Control and Alignment Results

High-throughput sequencing of the 98 samples yielded an average of 287,278,263 Raw Reads per sample. After quality control filtering, each sample retained an average of 287,271,802 Clean Reads. The average base quality values of the raw sequencing data were 97.10% for Q20 and 91.42% for Q30. When the filtered Clean Reads were aligned to the reference genome (2.73 Gb), the alignment rate for the 98 samples ranged from 92.75% to 99.62%, with a mean mapped rate of 99.22% (proportion of reads aligned to the reference genome). Per-sample sequencing depth ranged from 9.43× to 21.47×, with a mean depth of 14.41×. Sequencing coverage for each sample ranged from 74.29% to 83.75%; the coverage at 1×, 5×, 10×, 15×, 20×, and 30× was 80.41%, 59.71%, 43.41%, 18.93%, 22.10%, and 8.95%, respectively.

Genomic Variation Identification and Mapping of Distribution

Visualization of the raw variant dataset across the genome indicated that SNPs and InDels were relatively evenly distributed across chromosomes (Fig. 2A). The number of SNPs detected per sample ranged from 27,156,397 to 43,382,987, with a total of 319,934,173 SNPs detected. In single samples, the number of transitions (Ti) ranged from 21,632,619 to 34,695,413 and the number of transversions (Tv) ranged from 5,327,241 to 8,687,574, giving Ti/Tv ratios of 2.73 to 4.20. The number of InDels detected per sample ranged from 1,680,856 to 2,973,933, with a total of 41,184,838 detected. InDel lengths were mainly concentrated between ±1 and ±2 bp (Fig. 2C).
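The Ti/Tv classification used above can be sketched as follows. This is a minimal illustration, assuming SNPs are supplied as (reference, alternate) base pairs; the function names and the toy input are hypothetical and do not come from the study's scripts.

```python
PURINES = frozenset("AG")
PYRIMIDINES = frozenset("CT")


def is_transition(ref, alt):
    """Transition: purine<->purine (A<->G) or pyrimidine<->pyrimidine (C<->T)."""
    pair = {ref.upper(), alt.upper()}
    return len(pair) == 2 and (pair <= PURINES or pair <= PYRIMIDINES)


def is_transversion(ref, alt):
    """Transversion: any purine<->pyrimidine substitution."""
    pair = {ref.upper(), alt.upper()}
    valid = len(pair) == 2 and pair <= (PURINES | PYRIMIDINES)
    return valid and not is_transition(ref, alt)


def titv_ratio(snps):
    """Ti/Tv ratio for an iterable of (ref, alt) single-base substitutions."""
    snps = list(snps)
    ti = sum(is_transition(r, a) for r, a in snps)
    tv = sum(is_transversion(r, a) for r, a in snps)
    if tv == 0:
        raise ValueError("no transversions: Ti/Tv is undefined")
    return ti / tv


if __name__ == "__main__":
    # Toy input: two transitions (A>G, C>T) and one transversion (A>C).
    toy_snps = [("A", "G"), ("C", "T"), ("A", "C")]
    print(titv_ratio(toy_snps))
```

Because the minimum and maximum Ti and Tv counts reported above need not come from the same sample, per-sample ratios have to be computed from each sample's own counts rather than from the range endpoints.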
Fig. 2 A. Genome-wide variation distribution (a: GC content; b: gene count; c: SNP count; d: insertion count; e: deletion count; GWHBGBN00000001 represents chromosome 1, and so on). B. Pie chart of SNP mutation types in the raw variant dataset. C. Length distribution of InDel variants in the raw variant dataset.

Annotation of SNP and InDel Variations

Annotation of the single-nucleotide polymorphisms (SNPs) and insertions/deletions (InDels) (Fig. 3A) showed that intergenic regions harbored the largest number of variants, accounting for more than 80% of all variants genome-wide, far more than any other annotation category. Non-coding regions therefore carry the vast majority of genomic variation. While these variants may influence the regulation of gene expression, they are unlikely to alter protein structure directly. Intronic variants were the second most abundant, accounting for about 10% of whole-genome variants, approximately 1/8 of the intergenic count. Introns are non-coding regions within genes; variants in them may affect splice site recognition and cause aberrant splicing, which can produce abnormal mRNA and thereby affect protein translation and function. Exonic variants numbered about 1/5 of intronic variants. Because they can directly change the amino acid sequence of a protein, and hence its structure and function, exonic variants are expected to have the greatest per-variant influence on gene function and on organismal genetic diversity, and they are a key focus for subsequent research. Variants located in exons were further classified by functional impact, with results shown in Fig. 3B.
Nonsynonymous mutations were the most abundant type, with approximately 5.90×10⁶ across the whole genome, followed by synonymous mutations, with approximately 3.25×10⁶. The numbers of frameshift, nonframeshift, stoploss, and stopgain mutations were all relatively low, collectively accounting for less than 10% of total variants.

Fig. 3 Annotation of SNP and InDel positions within the genome (A. Number of variants in each genomic region; B. Variants in exons).

Genomic Distribution of High-Quality SNP Variations

After further quality control of the SNPs in the raw variant dataset, a final set of 61,370,270 high-quality SNPs was obtained (Table 2). Chromosome 1 contained the most variants (5,595,382), followed by chromosome 3 (4,985,582). Chromosome 14 had the fewest, with 2,932,223 high-quality SNPs. Variants were relatively evenly distributed across all chromosomes, with an average density of 42.04 bp/SNP. However, each chromosome exhibited some variant hotspot regions (Fig. 4), which may reflect meiotic recombination hotspots, selective pressure, gene families, genetic drift, or regions involved in gene expression regulation. The density of variants in these regions may have important implications for plant genetic diversity, adaptive evolution, and gene function.
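The totals quoted above can be cross-checked against the per-chromosome values reported in Table 2. The sketch below transcribes those values and recomputes the total SNP count and the mean density; the reported 42.04 bp/SNP appears to correspond to the unweighted mean of the per-chromosome densities. The implied chromosome lengths rest on the assumption that density equals chromosome length divided by SNP count; they are derived here for illustration and are not values stated in the paper.

```python
# Values transcribed from Table 2: chromosome -> (SNP count, density in bp/SNP).
TABLE2 = {
    "GWHBGBN00000001": (5_595_382, 38.64),
    "GWHBGBN00000002": (4_967_315, 39.33),
    "GWHBGBN00000003": (4_985_582, 40.98),
    "GWHBGBN00000004": (4_523_957, 45.63),
    "GWHBGBN00000005": (4_118_848, 46.53),
    "GWHBGBN00000006": (3_943_707, 41.39),
    "GWHBGBN00000007": (4_644_934, 40.60),
    "GWHBGBN00000008": (4_257_773, 43.51),
    "GWHBGBN00000009": (4_210_930, 42.66),
    "GWHBGBN00000010": (3_836_577, 42.10),
    "GWHBGBN00000011": (3_434_797, 40.50),
    "GWHBGBN00000012": (3_180_670, 47.03),
    "GWHBGBN00000013": (3_499_156, 43.38),
    "GWHBGBN00000014": (2_932_223, 37.50),
    "GWHBGBN00000015": (3_238_419, 40.85),
}

# Sum of per-chromosome SNP counts.
total_snps = sum(count for count, _ in TABLE2.values())

# Unweighted mean of the per-chromosome densities.
mean_density = sum(density for _, density in TABLE2.values()) / len(TABLE2)

# Assumption: density = chromosome length / SNP count, so count * density
# gives the chromosome length implied by the table (not a reported value).
implied_length_mb = {
    chrom: count * density / 1e6 for chrom, (count, density) in TABLE2.items()
}

if __name__ == "__main__":
    print(f"total SNPs: {total_snps:,}")
    print(f"unweighted mean density: {mean_density:.2f} bp/SNP")
    for chrom, mb in implied_length_mb.items():
        print(f"{chrom}: ~{mb:.1f} Mb implied")
```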
Table 2 Number and density of high-quality SNPs per chromosome

| Chromosome | SNP number | Density (bp/SNP) |
|---|---|---|
| GWHBGBN00000001 | 5,595,382 | 38.64 |
| GWHBGBN00000002 | 4,967,315 | 39.33 |
| GWHBGBN00000003 | 4,985,582 | 40.98 |
| GWHBGBN00000004 | 4,523,957 | 45.63 |
| GWHBGBN00000005 | 4,118,848 | 46.53 |
| GWHBGBN00000006 | 3,943,707 | 41.39 |
| GWHBGBN00000007 | 4,644,934 | 40.60 |
| GWHBGBN00000008 | 4,257,773 | 43.51 |
| GWHBGBN00000009 | 4,210,930 | 42.66 |
| GWHBGBN00000010 | 3,836,577 | 42.10 |
| GWHBGBN00000011 | 3,434,797 | 40.50 |
| GWHBGBN00000012 | 3,180,670 | 47.03 |
| GWHBGBN00000013 | 3,499,156 | 43.38 |
| GWHBGBN00000014 | 2,932,223 | 37.50 |
| GWHBGBN00000015 | 3,238,419 | 40.85 |
| Total | 61,370,270 | 42.04 |

Fig. 4 Chromosome-wide density map of high-quality SNPs

Population Structure Analysis of Camellia longistyla

Cluster analysis showed that the 60 samples collected from sites A and B in Chishui City clustered almost entirely into one group, distinctly separated from the 38 samples from Leishan County. Overall, the 98 samples could therefore be clustered into two subpopulations (Fig. 5A). Principal component analysis based on high-quality SNPs (Fig. 5E) and population structure analysis (Fig. 5B, C) both indicated that the 98 C. longistyla samples were clearly divided into two subpopulations. In subsequent analyses, Pop60 represents subpopulation 1 (all 60 samples from collection sites A and B), Pop38 represents subpopulation 2 (all 38 samples from provenance C), and Pop98 represents the entire set of 98 C. longistyla samples. Linkage disequilibrium analysis of the three populations showed that LD decay in the C. longistyla genome was very rapid (Fig. 5D), with a very low degree of linkage. This is partly attributable to the heterozygous genome resulting from outcrossing, and it also suggests frequent recombination within the genome, which would generate higher genomic genetic diversity.

Fig. 5 A. Neighbour-joining tree constructed based on genetic distance.
B. Cross-validation (CV) error for K values ranging from 1 to 9. C. ADMIXTURE results for 98 C. longistyla individuals based on the SNP dataset, showing population structure for K = 1 to 9. D. Linkage disequilibrium (LD) decay of C. longistyla. E. Principal component analysis of the C. longistyla samples.

Genetic Relationships among Camellia longistyla Individuals

Based on the high-quality SNP dataset, this study analyzed the genetic relationships among the 98 C. longistyla samples. The genetic similarity coefficient (GSC) between samples ranged from 0.81 to 0.95, with a mean of 0.86. Of all sample pairs, 99.98% (4752 pairs) had a genetic similarity between 0.8 and 0.9, and 0.02% (1 pair) had a genetic similarity between 0.9 and 1.0. The lowest GSC (0.81) was observed between samples 76 and 99, and the highest (0.95) between samples 46 and 48. A genetic distance matrix among the 98 C. longistyla materials was then calculated from the genetic relationship matrix. The heatmap of this matrix (Fig. 6), arranged by genetic distance and group membership, showed that materials within the same subpopulation had relatively small genetic distances, while materials from different subpopulations had relatively large genetic distances. This indicates clear genetic differentiation between subpopulations, potentially resulting from geographical isolation, genetic drift, or differential adaptive evolution.

Fig. 6 Genetic distance matrix heat map of 98 C. longistyla samples

Genetic Diversity Analysis of Camellia longistyla Populations

Genetic diversity statistics (Table 3) showed that, for the 98 C. longistyla materials overall, nucleotide diversity (π) was 4.207×10⁻³, polymorphism information content (PIC) was 0.154, and Shannon's information index, gene diversity index, and effective number of alleles were 0.304, 0.178, and 1.257, respectively. The observed heterozygosity and expected heterozygosity for this population were 0.178 and 0.604, respectively. These results indicate that the 98 C. longistyla samples possess some genomic genetic diversity, but that its level is relatively low. A likely reason is the narrow habitat range of C. longistyla, which limits the accumulation of genomic genetic diversity and underscores the importance of assessing it in this species. Comparing the two subpopulations revealed that the overall genetic diversity of Pop60 was lower than that of Pop38. Specifically, Pop60's nucleotide diversity was 0.298×10⁻³ lower than Pop38's, and its PIC was 0.011 lower. For every parameter calculated in this study, Pop60 was consistently lower than Pop38. Thus, despite containing more individuals, Pop60 exhibited lower genomic genetic diversity. This may indicate a more severe loss of genetic diversity in the C. longistyla population in Chishui City, which warrants attention and prioritized conservation efforts.

Table 3 Genetic diversity of C. longistyla populations

| Population | π (×10⁻³) | PIC | I | Nei | Ne | Ho | He | MAF |
|---|---|---|---|---|---|---|---|---|
| Pop98 | 4.207 | 0.154 | 0.304 | 0.178 | 1.257 | 0.178 | 0.604 | 0.113 |
| Pop60 | 4.009 | 0.144 | 0.288 | 0.170 | 1.249 | 0.169 | 0.706 | 0.109 |
| Pop38 | 4.307 | 0.156 | 0.309 | 0.183 | 1.268 | 0.182 | 0.732 | 0.118 |

Discussion

Variant Identification

This study provides the first systematic assessment of genetic variation in C. longistyla based on whole-genome resequencing.
The data obtained lay a solid foundation for in-depth analysis of its population genetic structure, evolutionary history, and the genetic basis of important traits. Sequencing data quality is the fundamental prerequisite for the accuracy of subsequent analyses. In this study, the average Clean Reads Q30 value for the 98 samples after quality control was 91.42%, the average alignment rate was 99.24%, the average sequencing depth was 14.41×, and the GC content (39.53%) was highly consistent with the reference genome. These metrics indicate that the sequencing data are of high quality, uniformly covered, and reliable, comparable to high-quality resequencing studies in species such as Sapindus (Liu et al., 2025) and Macadamia integrifolia (Li et al., 2024). The data therefore meet the requirements for genome-wide variant detection and analysis and support the identification of high-confidence genetic variants.

This study identified a very large number of raw SNPs and InDels, which were relatively uniformly distributed across chromosomes. Notably, the maximum ratio of transitions (Ti) to transversions (Tv) among single nucleotide variants was as high as 4.20. This ratio is considerably higher than most previously reported values, such as those in studies on Robinia pseudoacacia. Existing theory suggests that a higher Ti/Tv ratio is often associated with lower levels of genetic differentiation and specific evolutionary pressures (Wang et al., 2025). In cultivated or anthropogenically influenced plants, intense artificial selection might preferentially retain synonymous mutations or transitions in conserved regions that have minimal impact on protein structure and function, leading to an increased Ti/Tv ratio (Wang et al., 2025). Given the currently known narrow distribution of C. longistyla, the species may have undergone selection in a specific geographical environment during speciation; the very high Ti/Tv ratio observed in this study likely reflects this evolutionary history. The length distribution of InDels, predominantly ±1-2 bp, agrees with the general pattern of InDel variation in plant genomes and provides a basis for developing them as auxiliary molecular markers (Muñoz-Espinoza, Genova, Sánchez, Correa, & Hinrichsen, 2020).

Functional annotation of variants revealed that the vast majority (>80%) were located in intergenic regions, followed by intronic regions (~10%), with exonic variants comprising only a small proportion. This distribution pattern is highly consistent with findings in species such as Robinia pseudoacacia (Wang et al., 2025) and Sapindus (Liu et al., 2025), and conforms to the general rule in eukaryotic genomes that coding sequences are relatively conserved while non-coding sequences accumulate more variation. Although few in number, exonic variants are key sites that directly affect protein amino acid sequences and may consequently be associated with phenotypic traits. Among these, nonsynonymous mutations were approximately 1.8 times more abundant than synonymous mutations, suggesting that the C. longistyla population may have experienced relatively complex selective pressures during evolution, with some nonsynonymous mutations potentially being retained and contributing to its phenotypic diversity. The rarity of stopgain, stoploss, and frameshift mutations, which can have major impacts on gene function, reflects the role of purifying selection in maintaining the functional integrity of core genes.

Through stringent quality control of the raw variants, we obtained over 61 million high-quality SNPs, with relatively uniform densities across chromosomes but with distinct "hotspot regions".
These SNP-enriched areas may correspond to recombination hotspots, key loci under positive or balancing selection, repeat-rich regions, or non-coding regions with important regulatory functions. In resequencing studies of tree species such as Tectona grandis (Patturaj, Manikantan, Veerasamy, Elias, & Ramasamy, 2025) and Macadamia integrifolia (Li et al., 2024), similar hotspot regions have often been found to be associated with species adaptation and the genetic regulation of important economic traits. The potential variant hotspots in the C. longistyla genome are therefore key target regions for subsequent analyses of population genetic differentiation, assessment of linkage disequilibrium decay, and genome-wide association studies to locate candidate genes for important traits.

Genetic Diversity Analysis and Population Structure

This study reveals, for the first time, the population genetic structure of C. longistyla based on genome-wide SNP markers. Integrating the results of cluster analysis, principal component analysis (PCA), and population structure analysis, the 98 samples were clearly divided into two subpopulations: Pop60, mainly comprising materials from the two collection sites in Chishui City, and Pop38, comprising materials from Leishan County. This genetic division aligns closely with the geographical origins of the samples, indicating that geographical isolation may be the key factor driving population differentiation in C. longistyla. This finding is consistent with studies on species such as Sapindus (Liu et al., 2025) and Robinia pseudoacacia (Li, Ji, Yang, Xu, & Guan, 2025), where population structures were also closely related to geographical distribution. PCA results showed clear separation of the two subpopulations in principal component space, further confirming substantial genetic differentiation between them.
Notably, although the Pop60 subpopulation includes materials from two adjacent collection sites (A and B), these are genetically highly similar and do not form further substructure. This suggests that within a relatively small geographical range (e.g., within Chishui City), gene flow is relatively unimpeded and no significant genetic barriers have yet formed. In contrast, the geographical distance and potential physical barriers (such as mountains and rivers) between Chishui City and Leishan County likely restrict gene flow, leading to the formation of two genetically independent subpopulations (Zhao, Fan, Yin, Sun, & Ge, 2019). This finding underscores the core role of geographical isolation in shaping the genetic structure of endangered narrow-range plants (Sobel, Chen, Watt, & Rausher, 2010).

Linkage disequilibrium (LD) decayed very rapidly across the C. longistyla genome. Rapid LD decay is typical of outcrossing species and is usually associated with larger effective population sizes and higher historical recombination rates. By contrast, a study of Albizia odoratissima on Hainan Island found that island populations exhibited increased LD due to restricted gene flow (Li, Ji, Yang, Xu, & Guan, 2025). The rapid LD decay observed here therefore supports, at the genomic level, an outcrossing breeding system in C. longistyla.

Genetic diversity is the foundation for species adaptation and evolution. This study assessed the genetic diversity parameters of C. longistyla overall and for its two subpopulations (Table 5). Overall, metrics such as nucleotide diversity (π = 4.207×10⁻³), observed heterozygosity (Ho = 0.178), and polymorphism information content (PIC = 0.154) for C.
longistyla are at moderate to low levels compared to some endangered woody plants (e.g., Rhododendron bailiense, π = 0.2489) (Luo et al., 2025), but significantly higher than those of certain species with extremely low genetic diversity (e.g., Rhododendron huadingense) (Chen, 2016). This level reflects the genetic limitations faced by this species due to its narrow distribution range and endangered status. Widespread species typically possess higher genetic diversity, while narrowly distributed species affected by habitat fragmentation often experience reduced diversity due to genetic drift and inbreeding (Luo et al., 2025).

More importantly, genetic diversity is markedly unbalanced between the two subpopulations. The Pop38 subpopulation exhibited significantly higher values for all genetic diversity parameters than the Pop60 subpopulation, despite the latter having a larger sample size. This phenomenon of "high quantity but low diversity" in Pop60 warrants significant concern. In contrast, the higher genetic diversity of the Pop38 subpopulation, especially its observed heterozygosity approaching or even partially exceeding expected values, suggests that this population may have maintained a relatively good outcrossing state and adaptive potential (Sobel, Chen, Watt, & Rausher, 2010).

Genetic distance and kinship analysis based on high-quality SNPs provided individual-level evidence supporting the population structure described above. The mean genetic similarity coefficient (GSC) among materials was relatively high (0.86), and the vast majority of material pairs (99.98%) had GSCs between 0.8 and 0.9, indicating a relatively conserved intraspecific genetic background and close genetic relationships among individuals in C. longistyla. This bears some resemblance to the low genetic variation observed in species primarily propagated asexually or under high-intensity selection, such as Robinia pseudoacacia (Wang et al., 2025).
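The diversity parameters compared above can be made concrete with a small worked sketch. It computes observed heterozygosity (Ho), expected heterozygosity (He = 2pq), and PIC for biallelic SNPs (PIC = 1 - (p² + q²) - 2p²q²) from a matrix of 0/1/2 alternate-allele counts. The function name `diversity_stats` and the toy genotypes are hypothetical, missing genotypes are assumed absent, and the software and settings actually used in this study may differ.

```python
import numpy as np

def diversity_stats(genotypes):
    """Mean Ho, He and PIC over biallelic SNPs.

    genotypes: samples x SNPs array of alternate-allele counts (0, 1 or 2),
    assumed to contain no missing values.
    """
    g = np.asarray(genotypes, dtype=float)
    n_samples = g.shape[0]
    p = g.sum(axis=0) / (2 * n_samples)  # alternate-allele frequency per SNP
    q = 1.0 - p
    ho = (g == 1).mean(axis=0)           # share of heterozygous samples
    he = 2 * p * q                       # Hardy-Weinberg expectation
    pic = 1 - (p ** 2 + q ** 2) - 2 * p ** 2 * q ** 2
    return {
        "Ho": float(ho.mean()),
        "He": float(he.mean()),
        "PIC": float(pic.mean()),
    }

# Toy example: 4 samples x 3 SNPs (the third SNP is monomorphic)
toy = np.array([
    [0, 1, 2],
    [1, 1, 2],
    [1, 0, 2],
    [2, 1, 2],
])
stats = diversity_stats(toy)
print(stats)
```

Applying the same function separately to the rows belonging to each subpopulation would give the kind of per-group comparison discussed in the text. Monomorphic sites, as in the third toy SNP, pull all three averages down, so whether such sites are filtered beforehand affects the reported values.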
The genetic distance heatmap clearly displayed a block structure characterized by "small intra-subpopulation distances, large inter-subpopulation distances", intuitively confirming the genetic differentiation between Pop60 and Pop38. The largest genetic distance between the two subpopulations occurred between samples 76 and 99; these are the most differentiated individuals between the two geographical populations and provide excellent material for studying adaptive differentiation.

Conclusion

This study provides the first systematic analysis of the genetic structure, linkage disequilibrium characteristics, and diversity of C. longistyla based on genome-wide SNP markers. The results clearly divided the 98 accessions into two genetic subpopulations, Chishui (Pop60) and Leishan (Pop38), with geographical isolation identified as the primary driving force. Linkage disequilibrium decayed very rapidly, supporting the outcrossing nature of this species at the molecular level. The overall genetic diversity of the species was moderately low and severely unevenly distributed: the larger Chishui population exhibited significantly lower genetic diversity than the Leishan population, indicating higher genetic vulnerability and priority for conservation. In summary, this study establishes a crucial genetic foundation for the conservation of this endangered endemic species. We recommend implementing ex situ conservation for both subpopulations, with focused intervention in the Chishui population to promote gene flow. Furthermore, the data generated provide valuable resources for subsequent gene discovery and molecular breeding efforts.
Funding: This research was funded by the Guizhou Provincial Forestry Scientific Research Project "Research on Key Technologies for Conservation of Germplasm Resources of Camellia longistyla, a Unique Oil Tree Species in Guizhou" (Qianlinkehe [2022] No. 02); Scientific and Technological Innovation Talent Team Building of Guizhou Institute of Biology (grant No. 2024002); the Guizhou Provincial Characteristic Forestry Industry Project "Research on Reproductive Regulation Mechanism and Elite Germplasm Resource Exploration of Camellia oleifera with Short Styles in Weining"; and the 2024 Central Financial Subsidy Project for Cultivation of Improved Forest Tree Varieties from the National Forest Tree Germplasm Repository of Plateau Plants, Longli State-owned Forest Farm, Guizhou Province.

Data Availability Statement

https://ngdc.cncb.ac.cn/gsa/s/w3A5m1jV

References

Alexander, D. H., Novembre, J., & Lange, K. (2009). Fast model-based estimation of ancestry in unrelated individuals. Genome Research, 19(9), 1655-1664.
Chen, S., Zhou, Y., Chen, Y., & Gu, J. (2018). fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics, 34(17), i884-i890.
Danecek, P., Auton, A., Abecasis, G., Albers, C. A., Banks, E., Depristo, M. A.,… Sherry, S. T. (2011). The variant call format and VCFtools. Bioinformatics, 27(15), 2156-2158.
Depristo, M. A., Banks, E., Poplin, R., Garimella, K. V., & Daly, M. J. (2011). A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nature Genetics, 43(5), 491-498.
Doyle, J. (1987). A rapid DNA isolation procedure for small quantities of fresh leaf tissue. Phytochem Bull, 19.
Huan, L., Jinpu, W., Ting, Y., Weixue, M., Bo, S., Tuo, Y.,… Wangsheng, L. (2019). Molecular digitization of a botanical garden: high-depth whole-genome sequencing of 689 vascular plant species from the Ruili Botanical Garden. GigaScience (4), 4.
Kai, W., Mingyao, L., & Hakon, H. (2010).
ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Research (16), e164.
Li, Z., Ji, Q., Yang, Y., Xu, M., & Guan, Y. (2025). Low genetic diversity and weak population structure of Albizia odoratissima on Hainan Island. BMC Plant Biology.
Li, Z., Wu, C., Ma, J., Geng, J., Tao, L., He, X., & Gong, L. (2024). Genetic diversity analysis of macadamia germplasm in China based on whole-genome resequencing. Tree Genetics & Genomes, 20(3). https://doi.org/10.1007/s11295-024-01648-8
Lilinyin (2017). CMplot: Circle Manhattan Plot.
Liu, J. M., Zhao, G. C., Zheng, Y. L., Xu, Y. Y., Wang, M. Z., Li, L.,… Salojaervi, J. (2025). Genetic diversity and adaptive evolutionary history of Sapindus in China: insights from whole-genome resequencing of 100 representative individuals. Plant Biotechnology Journal, 23(7), 2485-2500. https://doi.org/10.1111/pbi.70058
Luo, F., Evans, K., Norelli, J. L., Zhang, Z., & Peace, C. (2020). Prospects for achieving durable disease resistance with elite fruit quality in apple breeding. Tree Genetics & Genomes.
Luo, J., Yuan, C. J., Wang, H. D., Zhang, J. H., Chen, J., He, S.,… Luo, D. L. (2025). Study on the Genetic Diversity Characteristics of the Endemic Plant Rhododendron bailiense in Guizhou, China Based on SNP Molecular Markers. Ecology and Evolution, 15(2). https://doi.org/10.1002/ece3.70966
Muñoz-Espinoza, C., Genova, A. D., Sánchez, A., Correa, J., & Hinrichsen, P. (2020). Identification of SNPs and InDels associated with berry size in table grapes integrating genetic and transcriptomic approaches. BMC Plant Biology, 20(1), 365.
Niu, S., Song, Q., Koiwa, H., Qiao, D., Zhao, D., Chen, Z.,… Wen, X. (2019). Genetic diversity, linkage disequilibrium, and population structure analysis of the tea plant (Camellia sinensis) from an origin center, Guizhou plateau, using genome-wide SNPs developed by genotyping-by-sequencing. BMC Plant Biology, 19.
Patturaj, M., Manikantan, A., Veerasamy, S., Elias, A. A., & Ramasamy, Y.
(2025). Whole genome resequencing unveils population structure and wood trait associations for Indian teak germplasm. Tree Genetics & Genomes, 21(2). https://doi.org/10.1007/s11295-025-01691-z
Qing, J., Meng, Y., He, F., Du, Q., Zhong, J., Du, H.,… Wang, L. (2022). Whole genome re-sequencing reveals the genetic diversity and evolutionary patterns of Eucommia ulmoides. Molecular Genetics and Genomics, 297(2), 485-494. https://doi.org/10.1007/s00438-022-01864-8
Sobel, J. M., Chen, G. F., Watt, L. R., & Rausher, D. W. S. (2010). The biology of speciation. Evolution, 64(2), 295-315.
Sudhir, K., Glen, S., Michael, L., Christina, K., & Koichiro, T. (2018). MEGA X: Molecular Evolutionary Genetics Analysis across computing platforms. Molecular Biology & Evolution (6), 6.
Varshney, R. K., Saxena, R. K., Upadhyaya, H. D., Khan, A. W., Yu, Y., Kim, C.,… An, S. (2017). Whole-genome resequencing of 292 pigeonpea accessions identifies genomic regions associated with domestication and agronomic traits. Nature Genetics.
Wang, H., Ma, Y., Wang, R., Zang, D., Yu, X., Li, J.,… Zang, F. (2025). Genetic structure analysis and core germplasm construction of Robinia pseudoacacia and its closely related species based on SNP. BMC Plant Biology, 25(1).
Wang, S., Wong, D., Forrest, K., Allen, A., Chao, S., Huang, B. E.,… Int, W. G. S. (2014). Characterization of polyploid wheat genomic diversity using a high-density 90 000 single nucleotide polymorphism array. Plant Biotechnology Journal, 12(6), 787-796. https://doi.org/10.1111/pbi.12183
Zhao, Y. P., Fan, G., Yin, P. P., Sun, S., & Ge, S. (2019). Resequencing 545 ginkgo genomes across the world reveals the evolutionary history of the living fossil. Nature Communications, 10(1).
Zhao, Z., Song, Q., Bai, D., Niu, S., He, Y., Qiao, D.,… Li, F. (2022). Population structure analysis to explore genetic diversity and geographical distribution characteristics of cultivated-type tea plant in Guizhou Plateau. BMC Plant Biology, 22(1).
https://doi.org/10.1186/s12870-022-03438-7
Zhou, P., Huang, B., Huang, J., Xu, L. A., Wen, Q., & Pyhäjärvi, T. (2025). Genetic differentiation and associated climatic variables between Camellia chekiangoleosa and its wild relatives revealed by SNP markers. Industrial Crops & Products, 234.
Chen, Z. H. (2016). Population characteristics and conservation genetics of the rare and endemic plant Rhododendron huadingense [Master's thesis], Hangzhou Normal University.
Liu, H. Y., Wang, J. W., Hong, J., Fan, Z. W., Tang, S. H., & Zou, T. C. (2018). Amino acid and fatty acid composition of seeds of five wild Camellia species in Guizhou. Guihaia, 38(02), 169-179.
Zhou, H., & Chen, X. C. (1983). Camellia longistyla: Another new Camellia species from Guizhou. Guizhou Forestry Science and Technology, (03), 35-36.
Zou, T. C. (2000). Study on germplasm resource utilization of Camellia flavida and Camellia longistyla. Guizhou Science, (03), 209-215.

Information & Authors

Version history: Version 1, 18 March 2026
Copyright: This work is licensed under a Non Exclusive No Reuse License.
Keywords: ecosystem genetics plants sequencing

Authors and affiliations:
Fengchan Wu (ORCID 0009-0003-6251-3426), Guizhou Academy of Sciences
Binyang Zhao, Zunyi City Forestry Bureau
Peiyu Xi, Guizhou Academy of Sciences
Na Ran, Guizhou Academy of Sciences
Yulin Guo, Guizhou Academy of Sciences
Chunyan Guo, Guizhou Academy of Sciences
Anding Li, Guizhou Academy of Sciences

Citation: Fengchan Wu, Binyang Zhao, Peiyu Xi, et al. Genomic Variation Landscape and Population Genetic Analysis of Camellia longistyla Based on Whole-Genome Resequencing.
Authorea. 18 March 2026. DOI: https://doi.org/10.22541/au.177383017.79260276/v1