Are genetically defined “metapopulations” self-evident in YHRD?

doi:10.64898/2026.02.07.704579

2026 · doi:10.64898/2026.02.07.704579

preprint OA: closed CC-BY-ND-4.0

Abstract

In forensic genetics, the evidential value of a match between the Y-chromosomal short tandem repeat (Y-STR) profiles of a trace and a suspect is typically quantified by the frequency of the profile in a population database, particularly the Y-chromosomal Haplotype Reference Database (YHRD). However, for this approach of obtaining a ‘match probability’ to be valid, the database population must be representative of all plausible alternative trace donors in a given case. Since appropriately defining such a ‘suspect population’ can be difficult, YHRD highlights so-called ‘metapopulations’ that comprise profiles from different, geographically dispersed populations with presumed shared ancestry. We investigated whether such metapopulations are self-evident in the current version of YHRD. To this end, we performed classical cluster analysis using allele dissimilarity as a measure of pairwise distance between Y-STR profiles. Our analyses revealed only a weak genetic structure in YHRD, the extent of which was inversely proportional to the respective marker mutation rate. This suggests that YHRD cannot be divided into clearly distinguishable subgroups based solely on the genetic information it contains, at least not into subgroups that would correspond closely to the metapopulations highlighted in the database itself. If profile frequencies in metapopulations are to continue to be equated with match probabilities, then a clearer definition of metapopulations and a better justification of their use in forensics are needed.

Keywords

Y-STR, Population database, Metapopulation, Cluster analysis, Genetic co-ancestry, Match probability

∗Corresponding author: Tóra Oluffa Stenberg Olsen. Email address: [email protected]

Preprint submitted to bioRxiv February 7, 2026; this version posted February 10, 2026. https://doi.org/10.64898/2026.02.07.704579. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-ND 4.0 International license.

1. Introduction

The analysis of Y-chromosomal short tandem repeat (Y-STR) markers plays an important role in forensic genetics, particularly in cases, such as sexual assaults, where the biological trace of interest contains an unbalanced mixture of a major female and a minor male DNA component. Due to their male specificity, single-source Y-STR profiles can often still be reliably determined from mixtures even when the ratio of male to female genetic material is minimal [1, 2]. Once a suspect has been identified in cases like the above, his Y-STR profile can be determined as well and compared to the trace profile. While a mismatch between the two profiles normally excludes the suspect from being a donor, matches are more difficult to interpret. This is because, in court, the forensic genetic expert must quantify the strength of evidence of a match, usually in the form of the probability with which such a match would also be observed between the trace profile and the profile of another male. The question of how best to obtain this so-called ‘match probability’ and, in particular, what “another” should mean in this context has occupied forensic geneticists for more than two decades [3–5].
The most commonly used approach to determining a match probability equates the latter with the frequency of the Y-STR profile of interest in a suitable group of plausible alternative donors - also referred to as the ‘suspect population’ [5, 6]. Although various mathematically sound methods have been proposed to estimate a haplotype frequency from a representative sample of a suspect population [7–9], the main concern to date has been how to adequately define this population itself. Notably, the DNA commission of the International Society for Forensic Genetics (ISFG) recommended use of the geographical location of the crime scene as a “leading criterion” for selecting the “right” suspect population [2]. The ISFG proposal is certainly meaningful because one of the key criteria for the plausibility of donorship is spatial proximity to the place where the trace was recovered. It is undoubtedly reasonable to imply that most plausible alternative donors lived near the crime scene at the time. However, with very few exceptions, estimating the population frequency of a Y-STR profile requires a sample or database that is both sufficiently large and sufficiently representative of the population in question. ‘Representative’ here takes a statistical definition and means that the sample or database adequately captures all relevant population characteristics. The largest database established for this purpose is the Y-chromosomal Haplotype Reference Database (YHRD), which emerged in 2004 from a series of previous efforts in Europe [10]. Since then, YHRD has been continuously extended and currently contains approximately 350,000 Y-STR profiles from all over the world, comprising between eight and 27 Y-STR markers. Despite these tremendous efforts, even a database the size of YHRD cannot cover every case-relevant geographical target region at a level (village, city, state, province, etc.) desirable for forensic casework. 
Moreover, humans have undergone a multitude of demographic processes that led to the formation, and subsequent reorganization, of local populations at varying levels of intra- and inter-group genetic differentiation. To allow forensic experts to take this fact into account when using YHRD, the database curators decided to highlight “metapopulations” in YHRD. These are defined as combinations of originally separate, regional subsets of the data that were supposedly connected by past migration or gene flow [10]. The stated goal of the endeavour has been to group together “geographically dispersed human population samples with shared genetic ancestry” [2], using geographic, linguistic, and demographic bonds as proxies for (male) genetic relatedness.

In the present study, we aimed to assess the extent to which the above-mentioned objective was achieved. To this end, we investigated (i) whether clusters of Y-STR profiles are evident in YHRD that indicate subsets with markedly more recent co-ancestry, (ii) to what extent the degree of clustering observed for different marker combinations depends upon their mutation rates, and (iii) whether the Y-STR profile clusters thus identified overlap with the metapopulations defined in YHRD.

Ideally, our study would have been based upon an ab initio reconstruction of the genealogy of the Y-STR profiles included in YHRD, using software tools publicly available for this purpose. However, the reconstruction of coalescence trees is not only computationally intensive, but its results also critically depend upon the underlying evolutionary model and the choice of model parameters [11].
Therefore, we performed classical cluster analysis using marker-wise allelic dissimilarity as an (agnostic) measure of the pair-wise distance between Y-STR profiles. Although this approach may not perfectly reproduce the results of a coalescence-based analysis, it should still yield trees that adequately reflect the true genealogy of the Y-STR profiles included in YHRD.

2. Methods

2.1. Data and data analyses

All data analyses were carried out using R [12] version 4.5.0 and the packages rforensicbatwing [13] version 1.3.1, umap [14] version 0.2.10.0, Rtsne [15–17] version 0.17, cluster [18] version 2.1.8, fields [19] version 16.3.1, LEA [20] version 3.16.0, tidyverse [21] version 2.0.0, pals [22] version 1.10, dendextend [23] version 1.19.0, fpc [24] version 2.2.13, ggpubr [25] version 0.6.0, seriation [26, 27] version 1.5.8, data.tree [28] version 1.1.0, igraph [29, 30] version 2.1.4, ggraph [31] version 2.2.1, and cowplot [32] version 1.1.3. The respective code is publicly available online [33].

We used data from YHRD [34] in accordance with a project proposal previously submitted to the database curators (https://yhrd.org/pages/Projects/P5). Currently, YHRD defines seven major metapopulations, namely African, Afro-Asiatic, Native American, Australian Aboriginal, East Asian, Eskimo-Aleut, and Eurasian, in addition to one ‘Admixed’ metapopulation covering populations with pronounced admixture [2, 34]. Some of these YHRD metapopulations are further divided into subgroups (Fig. 1) that YHRD also refers to as ‘metapopulations’, but from which we distinguish the original metapopulations by adding the term ‘major’ to the latter, if required. A more detailed description of the geographical range of the YHRD metapopulations can be found in Table 2 in [2].

We generated four sets of Y-STR loci henceforth referred to as ‘kits’. Each kit consisted of eight Y-STR loci that were selected based upon their mutation rates (Table 1).
We refer to these kits as ‘fast’, ‘medium’, ‘mixed’, and ‘slow’ depending on the mutation rates of the loci included in the respective kit. Only integer-valued alleles were considered here because distances to and between intermediate alleles are difficult to define in accordance with evolutionary proximity. The numbers of Y-STR profiles for each YHRD metapopulation and each kit are provided in Supplementary Table S.1. Except for the analysis explained in Section 2.3, the number of Y-STR profiles was limited to a maximum of 100 per YHRD metapopulation for each marker kit to ensure a balanced ancestry of the resulting sample. The Y-STR profiles were selected randomly from the YHRD metapopulation under consideration.

Fig. 1: Overview of YHRD metapopulations. Colors correspond to the major YHRD metapopulations defined in [34], including the ‘Admixed’ metapopulation. Node labels: All; Eurasian; Afro-Asiatic; Admixed; East Asian; Native American; African; Eskimo Aleut; Australian Aboriginal; Caucasian; Altaic; European; Indo-Iranian; Indian; Uralic-Yukaghir; Semitic; Berber; Cushitic; Japanese; Sino-Tibetan; Korean; Tai-Kadai; Austronesian; Austro-Asiatic; Indo-Pacific; Dravidian; Sub-Saharan African; African American; South-Eastern European; Western European; Eastern European; Chinese (Han); Tibeto-Burman.

Table 1: Mutation rate-dependent definition of Y-STR marker kits.

Kit | Mutation rate (a) | Loci | Number of Y-STR profiles (b) | Number of YHRD metapopulations
Fast | 4.36 × 10^-3 – 1.40 × 10^-2 | DYS627, DYS576, DYS518, DYS449, DYS570, DYS458, DYS439, DYS460 | 102,253 | 25
Medium | 2.34 × 10^-3 – 4.34 × 10^-3 | DYS481, DYS549, DYS456, DYS533, DYS635, DYS389I, YGATAH4, DYS391 | 102,880 | 27
Mixed | 3.95 × 10^-4 – 1.40 × 10^-2 | DYS627, DYS576, DYS518, DYS449, DYS438, DYS392, DYS448, DYS437 | 103,129 | 25
Slow | 3.95 × 10^-4 – 2.24 × 10^-3 | DYS438, DYS392, DYS393, DYS437, DYS643, DYS448, DYS390, DYS19 | 107,748 | 27

(a) Mutation rates were taken from www.yhrd.org. (b) Numbers include only Y-STR profiles with integer-valued alleles.

2.2. Spatial autocorrelation

We examined the degree of spatial autocorrelation between Y-STR profiles and their sampling locations, using Moran’s I [35]. In particular, we repeated the analysis by Roewer et al. [36] to assess whether similar levels of autocorrelation as observed in early YHRD also characterise the data analysed in the present study. First, we calculated pair-wise Great Circle Distances (in kilometres) between sampling locations from the respective longitude and latitude data. Y-STR profile pairs were then grouped into distance classes defined as adjacent 250 kilometres-wide intervals, with one additional class comprising all pairs sampled at the same location. We then calculated Moran’s I for each class, setting the required spatial weights for Y-STR profile pairs equal to unity, if the profile pair fell into that class, and equal to zero otherwise. While a value of I close to plus one indicates similarity of Y-STR profiles in a given distance class, and a value of I close to minus one indicates dissimilarity, I = 0 means a lack of correlation in repeat number between Y-STR profiles in the respective class [35].

2.3. Distance measures

The possible structuring of the YHRD data was to be analysed by multi-dimensional scaling (MDS) and hierarchical clustering, which requires a pair-wise distance measure between Y-STR profiles. For this, we considered the following three quantities:

$$d_1 = \sum_{j=1}^{l} -m_j \log(\mu_j) \tag{1}$$

$$d_2 = \sum_{j=1}^{l} -m_j \, \frac{1}{\log(1 - \mu_j)} \tag{2}$$

$$d_3 = \sum_{j=1}^{l} \left( \log(m_j + 1) - \log\left(2\mu_j + \frac{1}{N_e}\right) \right) \cdot \mathbb{1}[m_j > 0] \tag{3}$$

Here, $l$ is the number of loci, $\mu_j$ denotes the mutation rate at the $j$th locus, $m_j$ is the absolute difference in repeat number, at the $j$th locus, between the two Y-STR profiles of interest, and $N_e$ denotes the effective size of the underlying population. We considered effective population sizes of 1,000, 5,000, and 10,000, respectively [37]. All three quantities satisfy the properties of distance measures (Appendix A).

The rationale of all three distance measures was to consider the respective mutation rates when weighting the allelic differences. Hence, we introduced locus-specific weights that depended differently upon each locus-specific mutation rate. More precisely, d1 weights the repeat difference by $-\log(\mu_j)$ so that loci with a low mutation rate contribute disproportionally to the overall profile dissimilarity. The second distance measure, d2, weights the repeat difference with $-1/\log(1 - \mu_j)$, which has a similar effect to d1 but gives even more weight to loci with lower mutation rates. The third measure, d3, additionally considers the estimated time to the most recent common ancestor of the two profiles of interest.
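The three quantities in Eqs. (1) to (3) can be illustrated with a minimal Python sketch. This is not the authors' implementation (their analyses were carried out in R). The function names, the example profiles, and the mutation rates below are hypothetical and chosen only to show how the locus-specific weights enter each sum.

```python
import math


def repeat_differences(profile_a, profile_b):
    """Absolute per-locus difference in repeat number (m_j in Eqs. 1-3)."""
    return [abs(a - b) for a, b in zip(profile_a, profile_b)]


def d1(profile_a, profile_b, mu):
    """Eq. (1): each repeat difference is weighted by -log(mu_j)."""
    diffs = repeat_differences(profile_a, profile_b)
    return sum(-m * math.log(u) for m, u in zip(diffs, mu))


def d2(profile_a, profile_b, mu):
    """Eq. (2): each repeat difference is weighted by -1 / log(1 - mu_j)."""
    diffs = repeat_differences(profile_a, profile_b)
    return sum(-m / math.log(1.0 - u) for m, u in zip(diffs, mu))


def d3(profile_a, profile_b, mu, n_e):
    """Eq. (3): log(m_j + 1) - log(2 * mu_j + 1 / N_e), summed over loci with m_j > 0."""
    diffs = repeat_differences(profile_a, profile_b)
    return sum(
        math.log(m + 1) - math.log(2.0 * u + 1.0 / n_e)
        for m, u in zip(diffs, mu)
        if m > 0  # indicator 1[m_j > 0]
    )


# Hypothetical three-locus profiles and mutation rates (illustration only).
profile_x = [14, 30, 17]
profile_y = [14, 31, 15]
mutation_rates = [0.002, 0.004, 0.01]

print(d1(profile_x, profile_y, mutation_rates))
print(d2(profile_x, profile_y, mutation_rates))
print(d3(profile_x, profile_y, mutation_rates, n_e=1000))
```

In this toy example only the second and third locus differ, so the first locus contributes nothing to any of the three sums, whatever its mutation rate.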
Thus, d3 subtracts $\log(2\mu_j + 1/N_e)$ from $\log(m_j + 1)$ to simultaneously account for the divergence of profiles expected from mutation and drift. A more detailed justification of d3 is provided in Appendix B. In addition to d1 to d3, we also considered the Euclidean and Manhattan distance.

All distance measures used in the present study were computational compromises that likely resulted in some loss of ancestry information. Therefore, we compared each measure to BATWING [38] estimates of the total length of the coalescent branches between two Y-STR profiles, as described in [13]. Unfortunately, due to high computational costs, BATWING estimates were not suitable as distance measures themselves. For each marker kit, we instead selected 30 Y-STR profiles at random, estimated the lengths of the coalescent branches for all possible profile pairs, and repeated this procedure five times. The other distance measures were then related to the BATWING estimates in each repetition by way of Spearman’s rank correlation coefficient.

2.4. Dimensionality reduction

To be able to visualise whether and how Y-STR profiles from YHRD group into clusters, we first employed three methods of dimensionality reduction, namely MDS, Uniform Manifold Approximation and Projection (UMAP), and t-Distributed Stochastic Neighbour Embedding (t-SNE) [16, 17, 39–41]. The three methods differ in how they maintain the overall structure of distance in the data during dimensionality reduction. While MDS aims to preserve all pairwise distances between Y-STR profiles as much as possible, t-SNE focuses on the maintenance of local relationships. Therefore, t-SNE often produces well-separated clusters, but only inadequately captures their global relationships. UMAP lies somewhere in between and typically yields similar results to t-SNE regarding local structure, but tends to infer the global structure somewhat better than t-SNE, although still less reliably than MDS.

2.5. Clustering

For each marker kit, the corresponding Y-STR profiles were subjected to hierarchical clustering with one of five possible agglomeration methods: ‘average’, ‘single’, ‘complete’, ‘Ward’, and ‘weighted’ [42]. The selection of the agglomeration method was based upon the respective agglomeration coefficient, ac. This figure lies between zero and unity and relates the average dissimilarity of a Y-STR profile to the first cluster it joins to the dissimilarity between the last two merged clusters [18]. An ac value close to unity indicates a data structure where data points join clusters early, relative to the final merger, thereby suggesting the presence of well-defined and tight clusters. It should be noted that ac tends to increase with an increasing number of observations and therefore should not be compared between samples of different sizes.
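To make the definition of ac concrete, the following Python sketch computes it with a naive agglomerative procedure. It is not the R code used in the study, and it assumes the textbook definition: one minus the height at which an observation is first merged, divided by the height of the final merger, averaged over all observations. Only ‘single’, ‘complete’, and ‘average’ linkage are covered. The function names and the one-dimensional toy data are hypothetical.

```python
def cluster_distance(a, b, dist, linkage):
    """Dissimilarity between clusters a and b (lists of observation indices)."""
    pairwise = [dist[i][j] for i in a for j in b]
    if linkage == "single":
        return min(pairwise)
    if linkage == "complete":
        return max(pairwise)
    return sum(pairwise) / len(pairwise)  # "average" linkage


def agglomerative_coefficient(dist, linkage="average"):
    """Mean over observations of 1 - (first-merge height / final-merge height)."""
    n = len(dist)
    clusters = {i: [i] for i in range(n)}
    first_merge = [0.0] * n  # height at which each observation first joins a cluster
    final_height = 0.0
    while len(clusters) > 1:
        keys = sorted(clusters)
        # Closest pair of current clusters; ties are broken by the smaller indices.
        height, a, b = min(
            (cluster_distance(clusters[p], clusters[q], dist, linkage), p, q)
            for idx, p in enumerate(keys)
            for q in keys[idx + 1:]
        )
        for key in (a, b):
            if len(clusters[key]) == 1:
                first_merge[clusters[key][0]] = height
        clusters[a] = clusters[a] + clusters.pop(b)
        final_height = height
    if final_height == 0:
        return 0.0  # fewer than two observations, or all observations identical
    return sum(1.0 - h / final_height for h in first_merge) / n


def line_distances(points):
    """Hypothetical helper: absolute differences between points on a line."""
    return [[abs(p - q) for q in points] for p in points]


tight = agglomerative_coefficient(line_distances([0, 1, 2, 100, 101, 102]))
even = agglomerative_coefficient(line_distances([0, 1, 2, 3, 4, 5]))
print(tight, even)
```

The two well-separated groups yield a coefficient much closer to unity than the evenly spaced points, which is the behaviour described above. The naive search over all cluster pairs is only practical for small samples.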
For a given marker kit, we performed hierarchical clustering with the agglomeration method that yielded the largest ac and with the distance measure that correlated most strongly with the coalescent tree branch lengths estimated before with BATWING. The number of clusters present in the resulting tree was determined using the elbow method [43]. By plotting a given tree height against the number of sub-clusters present at that height, the elbow method identifies a point (the ‘elbow’) where the decrease in height per added cluster slows down significantly. This property suggests that adding more clusters beyond the elbow only slightly increases the similarity of Y-STR profiles within clusters.

The stability of the inferred clusters was assessed by bootstrapping using the Jaccard index [24, 44–46] as a measure of cluster overlap. For this purpose, 100 bootstrap samples were drawn ignoring multiple drawings of one and the same Y-STR profile. Each bootstrap sample was then subjected to the same hierarchical cluster analysis as the original data and decomposed into the same number of clusters as originally inferred for the respective marker kit. For each original cluster, the most similar bootstrap cluster was then identified through maximisation of the pairwise Jaccard index. Finally, these maximum Jaccard indices were averaged over all bootstrap runs to yield a measure of stability of each original cluster.

In addition to hierarchical clustering, we also performed STRUCTURE [47] analyses as a means to group Y-STR profiles. STRUCTURE does not aim to maximise Y-STR profile similarity within groups, but rather maximises the overall probability of each profile being assigned to a specific group by varying the profile frequencies within groups. STRUCTURE analyses were performed in R as described in [48, 49].
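The bootstrap-based stability measure described earlier in this section for the hierarchical clusters can be sketched in Python as follows. The sketch covers only the final step, the averaging of maximum Jaccard indices, and takes the bootstrap clusterings as given. It assumes that each original cluster is first restricted to the profiles actually drawn in a bootstrap sample; this is an assumption about the procedure and is not stated in the text. All names and the toy data are hypothetical.

```python
def jaccard(a, b):
    """Jaccard index of two sets: size of intersection over size of union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def cluster_stability(original_clusters, bootstrap_runs):
    """Average, over bootstrap runs, of the maximum Jaccard index per original cluster.

    original_clusters: list of sets of profile identifiers.
    bootstrap_runs: list of clusterings, each a non-empty list of sets of identifiers.
    """
    stability = []
    for cluster in original_clusters:
        best_per_run = []
        for run in bootstrap_runs:
            sampled = set().union(*run)  # profiles present in this bootstrap sample
            restricted = set(cluster) & sampled  # assumption: compare on sampled profiles only
            best_per_run.append(max(jaccard(restricted, c) for c in run))
        stability.append(sum(best_per_run) / len(best_per_run))
    return stability


# Toy example with six profile identifiers and two bootstrap runs.
original = [{1, 2, 3}, {4, 5, 6}]
runs = [
    [{1, 2, 3}, {4, 5}],     # profile 6 was not drawn in this run
    [{1, 2}, {3, 4, 5, 6}],  # profile 3 ends up in the other cluster
]
print(cluster_stability(original, runs))
```

A value near one means that an original cluster reappears almost unchanged in the bootstrap clusterings, while values around or below 0.5 point to clusters that are not reproduced reliably.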
The marker kit-specific numbers of groups were again determined using the elbow method [43], but this time employing a cross-entropy criterion [49] calculated for between one and 15 groups.

3. Results

3.1. Spatial autocorrelation

Spatial autocorrelation analysis with Moran’s I index revealed that the current relationship between Y-STR profile characteristics and sampling location resembled that reported for an early version of YHRD [36]. Thus, for each marker kit, Y-STR profiles tended to be more similar (i.e. Moran’s I > 0) when sampled at nearby rather than distant locations, and increasing the Great Circle Distance consistently decreased the spatial autocorrelation (Fig. 2).

Fig. 2: Spatial autocorrelation (Moran’s I index) analysis of Y-STR profiles. Axes: Great Circle Distance (km) against Moran’s I index, shown per marker kit (Fast, Medium, Mixed, Slow). For the definition of marker kits, see main text and Table 1.

3.2. Distance measures

To select the most appropriate of five possible distance measures for further analysis (see Section 2.3), each distance measure was compared to the coalescent tree branch lengths estimated with BATWING [13, 38], assuming an effective population size of 1,000 (Fig. 3; for results obtained with effective population sizes of 5,000 and 10,000, see Supplementary Figs. S.1 and S.2, respectively). Benchmarking against BATWING was performed in five repetitions, separately for each marker kit.

We selected d1 as the most appropriate distance measure for further analysis for all marker kits because it consistently yielded the strongest correlation with the coalescent tree branch lengths (Fig. 3).
Interestingly, the results obtained for d1 were very similar to those obtained for the Manhattan distance. This suggests that additional weighting of the absolute allele difference for a given Y-STR by minus the logarithm of the corresponding mutation rate (see Section 2.3) did not significantly affect distance measure performance for the marker kits considered.

3.3. Dimensionality reduction

Employing distance measure d1, the Y-STR profiles were next subjected to MDS analysis separately for each marker kit (Fig. 4). If Y-STR profiles were significantly more similar within than between YHRD metapopulations, then Y-STR profiles from the same YHRD metapopulation should ‘group’ in an MDS plot and somehow distinguish themselves from other YHRD metapopulations. However, visual inspection of the marker kit-specific MDS plots provided no evidence that this was true for the major YHRD metapopulations (Fig. 4). Only in a few cases, mainly for slowly mutating markers, was a certain clustering noticeable (e.g. for major YHRD metapopulations African or Eskimo Aleut). In all other cases, the Y-STR profiles from a given major YHRD metapopulation were approximately evenly distributed throughout the MDS plot of the entire data set. Similar results were obtained using UMAP and t-SNE analyses for dimensionality reduction (Supplementary Figs. S.3 and S.4), and when considering certain subgroups of the major YHRD metapopulations separately (Supplementary Figs. S.5-S.7).

Fig. 3: Benchmarking of pairwise Y-STR profile distance measures against BATWING. For each pair of Y-STR profiles, the respective distance is plotted against the coalescent branch length estimated with BATWING, assuming an effective population size of 1,000. ρ̂: average Spearman’s rank correlation coefficient, taken over five repetitions of the analysis. For the definition of marker kits, see main text and Table 1.

3.4. Clustering

3.4.1. Hierarchical clustering

A suitable agglomeration method for hierarchical clustering was selected for each marker kit by maximisation of the respective agglomeration coefficient ac (see Section 2.5). Of the five agglomeration methods considered (‘average’, ‘single’, ‘complete’, ‘Ward’, and ‘weighted’), the Ward method consistently yielded both the highest and the least variable ac values for the different marker kits (Table 2). Therefore, hierarchical clustering was performed applying the Ward method to distance measure d1.

Fig. 4: Multidimensional scaling (MDS) analysis of Y-STR profiles. In each column, the Y-STR profiles from one major YHRD metapopulation are highlighted in black. MDS1 (MDS2): first (second) MDS coordinate; n: number of Y-STR profiles included. For the definition of marker kits, see main text and Table 1.

Table 2: Agglomeration coefficients (ac) obtained with different agglomeration methods.

Marker kit (a) | Average | Single | Complete | Ward | Weighted
Fast | 0.89 | 0.80 | 0.93 | 0.99 | 0.90
Medium | 0.93 | 0.92 | 0.96 | 0.99 | 0.95
Mixed | 0.91 | 0.84 | 0.95 | 0.99 | 0.93
Slow | 0.97 | 0.97 | 0.98 | 1.00 | 0.98

(a) For the definition of marker kits, see main text and Table 1.
For each marker kit, we determined the height of the hierarchical cluster dendrogram at which a given residual number of clusters was reduced by one by the subsequent merger. Applying the elbow method [43] to a graphical representation of these residual cluster numbers and merger dendrogram heights (Fig. 5), we assigned six, four, seven, and five structurally evident clusters to the Y-STR profiles for the fast, medium, mixed, and slow mutating marker kit, respectively. The corresponding cluster dendrograms are provided in Supplementary Fig. S.8.

Fig. 5: Residual cluster number and merger dendrogram height during hierarchical clustering. k: number of clusters present as identified by the elbow method (also highlighted in orange); grey bars: decrease in dendrogram height when the residual cluster number is increased by one. For the definition of marker kits, see main text and Table 1.

The stability of the clusters identified in this way was assessed by bootstrapping (100 bootstrap samples per marker kit), using the Jaccard index as a measure of cluster overlap. Several clusters were characterised by an average maximum Jaccard index, taken over the bootstrap samples, around or below 0.5 (Table 3), which means that, at best, half of the Y-STR profiles from the cluster were, on average, also grouped together in bootstrap clusters. Furthermore, the inferred clusters proved to be most stable for the slowly mutating marker kit, which seems plausible since the allelic similarity and historical relatedness of Y-STR profiles are more likely to be correlated when mutation rates are small.

Table 3: Average maximum Jaccard indices obtained during bootstrapping-based assessment of cluster stability.

Marker kit (a) | Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4 | Cluster 5 | Cluster 6 | Cluster 7
Fast | 0.70 | 0.61 | 0.53 | 0.53 | 0.47 | 0.27 |
Medium | 0.66 | 0.62 | 0.50 | 0.44 | | |
Mixed | 0.65 | 0.55 | 0.50 | 0.45 | 0.43 | 0.41 | 0.29
Slow | 0.76 | 0.75 | 0.74 | 0.64 | 0.48 | |

(a) For the definition of marker kits, see main text and Table 1. Note: Clusters were numbered post hoc according to cluster stability.

We next determined how the Y-STR profiles from a given major YHRD metapopulation were distributed over the marker kit-specific clusters identified (Fig. 6). With the exception of slowly mutating markers, the clusters showed only little major YHRD metapopulation specificity or, in cases where a cluster dominated one particular metapopulation, it also did so for another, historically distant major YHRD metapopulation. These results suggest that major YHRD metapopulations would be difficult to recover by agnostic clustering of Y-STR profiles alone, at least for medium and fast mutating markers. For slowly mutating markers, in contrast, some major YHRD metapopulations overlapped significantly with specific clusters and, with the exception of Eurasians, at least 50% of each major YHRD metapopulation belonged to only a single cluster.

Fig. 6: Distribution of major YHRD metapopulations over identified Y-STR profile clusters. The height of each bar corresponds to the fraction of Y-STR profiles in a given major YHRD metapopulation that belongs to a certain cluster. For the definition of marker kits, see main text and Table 1.
Moreover, significant overlaps in cluster membership as, for example, between Eskimo Aleut and Native Americans usually appeared to be historically meaningful.

3.4.2. STRUCTURE analysis

Y-STR profiles were also grouped by STRUCTURE [47] analysis as described in [48, 49]. To determine the number of Y-STR profile groups present, we applied the elbow method [43] to a cross-entropy criterion [49] each time calculated for a fixed group number of between one and 15. Seven, three, nine, and ten groups were identified this way for the fast, medium, mixed, and slow marker kit, respectively. Each Y-STR profile was then assigned to the group with the highest probability of membership for that profile. Line plots of the STRUCTURE results (Supplementary Fig. S.9) highlighted that many Y-STR profiles were characterised by a large probability of membership in one group. In particular, ca. 79%, 77%, 69%, and 69% of YHRD profiles belonged to one group with probability ≥ 0.9 for the fast, medium, mixed, and slow marker kit, respectively.

Fig. 7: Grouping of Y-STR profiles by STRUCTURE analysis. The height of each bar corresponds to the fraction of profiles in a major YHRD metapopulation that belongs to a certain STRUCTURE-derived group. For the definition of marker kits, see main text and Table 1.

Similar to hierarchical clustering, STRUCTURE yielded groupings of Y-STR profiles that, in most cases, only poorly matched the definition of major YHRD metapopulations (Fig. 7). Only a few major YHRD metapopulations had more than 50% of their Y-STR profiles assigned to one and the same group.
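The assignment and summary steps described above can be sketched in a few lines of Python. The sketch assumes that a matrix of group membership probabilities is already available from a STRUCTURE-type analysis. It is an illustration, not the authors' R code, and the function names, probabilities, and metapopulation labels are hypothetical.

```python
from collections import Counter, defaultdict


def assign_groups(membership):
    """For each profile, return (group index, probability) of its most probable group."""
    return [max(enumerate(probs), key=lambda gp: gp[1]) for probs in membership]


def confident_fraction(membership, threshold=0.9):
    """Fraction of profiles whose highest membership probability is >= threshold."""
    assigned = assign_groups(membership)
    return sum(1 for _, p in assigned if p >= threshold) / len(assigned)


def dominant_group_share(membership, labels):
    """Per metapopulation label: share of its profiles assigned to its most common group."""
    counts = defaultdict(Counter)
    for (group, _), label in zip(assign_groups(membership), labels):
        counts[label][group] += 1
    return {label: max(c.values()) / sum(c.values()) for label, c in counts.items()}


# Hypothetical membership probabilities for four profiles and two groups.
membership = [[0.95, 0.05], [0.60, 0.40], [0.10, 0.90], [0.20, 0.80]]
labels = ["A", "A", "B", "A"]  # hypothetical metapopulation labels

print(confident_fraction(membership))            # share of profiles with probability >= 0.9
print(dominant_group_share(membership, labels))  # values above 0.5 indicate a dominant group
```

A dominant-group share above 0.5 for a label corresponds to the situation in which more than half of a metapopulation's profiles fall into one and the same group.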
In view of the large number of groups inferred for the mixed and slow marker kits, we extended the analysis for these marker kits so as to also cover the predefined subgroups of major YHRD metapopulations (Supplementary Fig. S.10). Since STRUCTURE assigns Y-STR profiles to groups only with a certain probability, we additionally repeated the STRUCTURE analyses only for profiles that had a group membership probability ≥ 0.9 (Supplementary Fig. S.11). Both analyses yielded similar results to the original.

4. Discussion

When equating the pairwise distance between Y-STR profiles to the sum of marker-wise allele differences, classical cluster analysis suggests only weak inherent structure in the data currently included in YHRD. Furthermore, some of the clusters identified in our study proved unstable when judged by the Jaccard indices obtained from comprehensive bootstrapping. Notably, clusters derived for Y-STR profiles comprising slowly mutating markers exhibited greater stability than others, a pattern consistent with previous observations for much smaller numbers of markers [50–52]. Also consistent with other studies [37], the most pronounced differences in YHRD were observed between Y-STR profiles of African and non-African origin, while Eurasian profiles tended to be most similar to one another.

Our results suggest that YHRD cannot be divided into clearly distinguishable subsamples that correspond strongly to the metapopulations highlighted by YHRD itself, at least not by drawing solely upon the Y-STR profiles themselves. On the contrary, the relationship between Y-STR haplotype and metapopulation membership proved to be highly variable and dependent upon the markers considered. The metapopulations highlighted in YHRD are therefore not well-defined genetic units. This finding does not call into question the concept of metapopulations per se, but rather suggests that the human genetic clock of the Y-chromosome has ticked too differently from cultural and geographical clocks for the traces of these clocks to run synchronously.

What does this mean for forensic practice, support of which is one of the essential - if not the most essential - concerns of YHRD? Some of us have previously pointed out that equating database frequency estimates with match probabilities in forensic casework requires that the database, or at least the part of it that is being used, represents a meaningful case-specific suspect population [5].
At first glance, it seems obvious that, to avoid circular reasoning, the definition of this population should not be based primarily upon genetic considerations. Instead, it is generally accepted that the group of alternative trace donors is first narrowed down using non-genetic information (location, time, cultural context, witness statements, etc.) and only then characterized in detail genetically [2]. In this second step, it can of course be useful to favor genetic markers that distinguish the suspect population from other groups of males.

These considerations inevitably lead to the question of whether and under what circumstances a YHRD metapopulation can be considered a plausible suspect population. First of all, “plausible” in this context must mean that the definition of the metapopulation applies to the suspect population in question almost unchanged. Furthermore, this applicability must be fully accepted by all parties involved in the case, including the defense. In our view, the whole approach is therefore only viable if the criteria used to define YHRD metapopulations are made fully transparent in advance to all stakeholders, which, to our knowledge, has not yet been the case.

In view of the general problems associated with the forensic use of population database information [5], alternative approaches to evaluate trace-donor matches of Y-STR profiles have been proposed, including evolutionary simulation of potential matches [4] and calculation of the exact match probability inside and outside the suspect pedigree [53].
However, it will certainly take some time before such novel approaches can be incorporated into forensic casework, and the current practice of estimating profile frequencies using databases is therefore likely to be maintained for a while.

If this is the case, then the genetic information contained in YHRD must at least be used to select metapopulations, to serve as suspect populations, in a way that is fair to the suspects while preserving the evidentiary value of the database as much as possible. A two-step approach is conceivable for this. First, the YHRD metapopulation with the (likely) strongest Y-chromosomal similarity to the suspect is identified after restricting the Y-STR profile to the slowly mutating markers. Developing a suitable method for such a selection admittedly requires further research. Be that as it may, if the selection is consistent with the non-genetic evidence for a particular suspect population, it can also be considered fair to the suspect. “Fair” here refers to the principle of “in dubio pro reo” (presumption of innocence) because a suspect profile generated for any other set of Y-STR markers should also be quite frequent, if not most frequent, in the selected metapopulation. In the second step, the frequency of the (much more suspect-specific) Y-STR profile comprising moderately to rapidly mutating markers is estimated in the selected YHRD metapopulation and, if this is to be practiced despite theoretical reservations, equated to the match probability. In the vast majority of cases, the persuasiveness of the result of the second step should be similar to that of using the entire profile. Furthermore, since mutations at different Y-STRs are statistically independent, the use of different markers for the initial identification and subsequent genetic characterization of a suspect population would also help to avoid circular reasoning when evaluating trace-donor profile matches.
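To make the data flow of the two-step approach concrete, the following sketch runs both steps on made-up data. It rests on strong assumptions: the text leaves the selection method of step one open, so the rule used here (pick the metapopulation in which the suspect's slow-marker haplotype is most frequent) is only a stand-in, the frequency estimate is a plain count ratio, and the marker names `S1`, `S2`, `F1`, `F2`, the metapopulation names, and all function names are hypothetical.

```python
def haplotype_frequency(profiles, markers, target):
    """Relative frequency, among `profiles`, of the target's alleles at `markers`."""
    if not profiles:
        return 0.0
    key = tuple(target[m] for m in markers)
    hits = sum(1 for p in profiles if tuple(p[m] for m in markers) == key)
    return hits / len(profiles)


def two_step(database, suspect, slow_markers, fast_markers):
    """database maps a metapopulation name to a list of profiles (dicts)."""
    # Step 1 (stand-in rule, not taken from the paper): select the metapopulation
    # in which the suspect's slow-marker haplotype is most frequent.
    selected = max(
        database,
        key=lambda name: haplotype_frequency(database[name], slow_markers, suspect),
    )
    # Step 2: estimate the frequency of the fast-marker haplotype in that selection.
    freq = haplotype_frequency(database[selected], fast_markers, suspect)
    return selected, freq


# Made-up toy database with two slow (S1, S2) and two fast (F1, F2) markers.
DATABASE = {
    "MetaA": [
        {"S1": 14, "S2": 12, "F1": 30, "F2": 21},
        {"S1": 14, "S2": 12, "F1": 31, "F2": 21},
        {"S1": 13, "S2": 12, "F1": 32, "F2": 21},
        {"S1": 14, "S2": 12, "F1": 30, "F2": 21},
    ],
    "MetaB": [
        {"S1": 15, "S2": 11, "F1": 30, "F2": 21},
        {"S1": 14, "S2": 12, "F1": 29, "F2": 20},
        {"S1": 15, "S2": 11, "F1": 28, "F2": 22},
    ],
}
SUSPECT = {"S1": 14, "S2": 12, "F1": 30, "F2": 21}

print(two_step(DATABASE, SUSPECT, ["S1", "S2"], ["F1", "F2"]))
```

Because the two steps use disjoint marker sets, the selection in step one does not reuse the information that enters the frequency estimate in step two, which is the separation the text argues for.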
Should the genetic and non-genetic information point to very different suspect populations in step one, we recommend applying the approach to both selections and presenting both results in court.

Although our fundamental reservations about the approach taken so far for obtaining match probabilities remain, and although YHRD offers only limited coverage for many populations, particularly non-Eurasians [54], we believe that a change to the procedure as proposed here could represent a significant improvement in terms of its scientific and legal irrefutability.

Appendix A. Distance properties of distance measures d1, d2, and d3

Any valid distance measure, $d$, must fulfil the following properties for Y-STR profiles $x_1$, $x_2$, and $x_3$:

1. $d(x_1, x_2) = 0 \Leftrightarrow x_1 = x_2$ (Reflexivity).
2. $d(x_1, x_2) \geq 0$ (Non-negativity).
3. $d(x_1, x_2) = d(x_2, x_1)$ (Symmetry).
4. $d(x_1, x_2) + d(x_1, x_3) \geq d(x_2, x_3)$ (Triangle inequality).

Measure d1

For $d_1$, the first three properties follow from the fact that (i) $\log(\mu_j) < 0$ since $\mu_j \in (0, 1)$, (ii) $m_j \geq 0$ and $m_j = 0$ for all $j$ if and only if the two Y-STR profiles in question are identical, and (iii) $m_j$ is itself symmetric. Let $m_j^{\{i,k\}}$ denote the absolute allelic difference between $x_i$ and $x_k$ at the $j$th locus. Then

\[
\begin{aligned}
d_1(x_1, x_2) + d_1(x_1, x_3)
&= \sum_{j=1}^{l} -m_j^{\{1,2\}} \log(\mu_j) + \sum_{j=1}^{l} -m_j^{\{1,3\}} \log(\mu_j) \\
&= \sum_{j=1}^{l} -\left(m_j^{\{1,2\}} + m_j^{\{1,3\}}\right) \log(\mu_j) \\
&\geq \sum_{j=1}^{l} -m_j^{\{2,3\}} \log(\mu_j) = d_1(x_2, x_3).
\end{aligned}
\]

The inequality uses $-\log(\mu_j) > 0$ together with the triangle inequality for absolute differences, $m_j^{\{2,3\}} \leq m_j^{\{1,2\}} + m_j^{\{1,3\}}$. Thus, the triangle inequality also holds for $d_1$.

Measure d2

Measure $d_2$ satisfies the first three properties of a distance measure for the same reasons as $d_1$ (see above).
The triangle inequality holds for $d_2$ because

\[
\begin{aligned}
d_2(x_1, x_2) + d_2(x_1, x_3)
&= \sum_{j=1}^{l} -m_j^{\{1,2\}} \frac{1}{\log(1 - \mu_j)} + \sum_{j=1}^{l} -m_j^{\{1,3\}} \frac{1}{\log(1 - \mu_j)} \\
&= \sum_{j=1}^{l} -\left(m_j^{\{1,2\}} + m_j^{\{1,3\}}\right) \frac{1}{\log(1 - \mu_j)} \\
&\geq \sum_{j=1}^{l} -m_j^{\{2,3\}} \frac{1}{\log(1 - \mu_j)} = d_2(x_2, x_3).
\end{aligned}
\]

Measure d3

Reflexivity and symmetry hold for $d_3$ by definition. Further, even for rapidly mutating Y-STRs, it appears reasonable to assume that $N_e \geq 1/(1 - 2\mu_j)$. Hence, $2\mu_j + 1/N_e \leq 1$ and $m_j + 1 \geq 1$, which implies non-negativity. Finally, the triangle inequality also holds for $d_3$ because if $m_j^{\{i,k\}} = 0$ for any $j$, $i$, and $k$, then $m_j^{\{i,p\}} = m_j^{\{k,p\}}$ for any $p$, and if $m_j^{\{i,k\}} > 0$ for all $j$, $i$, and $k$, then

\[
\begin{aligned}
d_3(x_1, x_2) + d_3(x_1, x_3)
&= \sum_{j=1}^{l} \left( \log\left(m_j^{\{1,2\}} + 1\right) - \log\left(2\mu_j + \frac{1}{N_e}\right) \right)
 + \sum_{j=1}^{l} \left( \log\left(m_j^{\{1,3\}} + 1\right) - \log\left(2\mu_j + \frac{1}{N_e}\right) \right) \\
&\geq \sum_{j=1}^{l} \left( \log\left(m_j^{\{1,2\}} + 1\right) + \log\left(m_j^{\{1,3\}} + 1\right) - \log\left(2\mu_j + \frac{1}{N_e}\right) \right) \\
&= \sum_{j=1}^{l} \left( \log\left(m_j^{\{1,2\}} m_j^{\{1,3\}} + m_j^{\{1,2\}} + m_j^{\{1,3\}} + 1\right) - \log\left(2\mu_j + \frac{1}{N_e}\right) \right) \\
&\geq \sum_{j=1}^{l} \left( \log\left(m_j^{\{1,2\}} + m_j^{\{1,3\}} + 1\right) - \log\left(2\mu_j + \frac{1}{N_e}\right) \right) \\
&\geq \sum_{j=1}^{l} \left( \log\left(m_j^{\{2,3\}} + 1\right) - \log\left(2\mu_j + \frac{1}{N_e}\right) \right) = d_3(x_2, x_3).
\end{aligned}
\]

Appendix B. Motivation of distance measure d3

The following considerations are based upon [55, 56]. Let $x_1$ and $x_2$ be two single-locus Y-STR profiles for which their coalescence time $T$ should be estimated by some meaningful value $\hat{T}$. Let $m_i$, for $i \in \{1, 2\}$, denote the number of mutations that $x_i$ has undergone since the most recent common ancestor of $x_1$ and $x_2$. As an estimate $\hat{T}$, we want to derive the posterior expectation of $T$ given the allele difference between $x_1$ and $x_2$. To this end, we assume the following prior distribution for $T$:

\[
T \sim \mathrm{Exp}\left(\frac{1}{N_e}\right),
\]

where $N_e$ is the effective population size [56].
If $\mu$ denotes the mutation rate, then the conditional distribution of $m_i$ given $T$ equals

\[
m_i \mid T \sim \mathrm{Poisson}(\mu T) \quad \text{for } i \in \{1, 2\}.
\]

Assuming that $m_1$ and $m_2$ are independent given $T$, their sum $m_1 + m_2$ follows a $\mathrm{Poisson}(2\mu T)$ distribution. Hence, the posterior distribution of $T$ is given by

\[
\begin{aligned}
P(T \mid m_1 + m_2)
&= \frac{P(m_1 + m_2 \mid T)\, P(T)}{P(m_1 + m_2)} \propto P(m_1 + m_2 \mid T)\, P(T) \\
&= \frac{(2\mu T)^{m_1 + m_2}\, e^{-2\mu T}}{(m_1 + m_2)!} \cdot \frac{1}{N_e}\, e^{-\frac{1}{N_e} T}
\propto T^{m_1 + m_2}\, e^{-\left(2\mu + \frac{1}{N_e}\right) T}.
\end{aligned}
\]

This means that $T \mid m_1 + m_2 \sim \Gamma\left(m_1 + m_2 + 1,\; 2\mu + \frac{1}{N_e}\right)$ with (posterior) expectation

\[
E[T \mid m_1 + m_2] = \frac{m_1 + m_2 + 1}{2\mu + \frac{1}{N_e}}.
\]

In case of $l$ loci, the likelihood equals the product of the locus-specific terms so that the posterior expectation now equals

\[
\hat{T} = \prod_{j=1}^{l} \frac{m_{1j} + m_{2j} + 1}{2\mu_j + \frac{1}{N_e}}.
\]

To avoid numerical overflow, we will take

\[
\log \hat{T} = \sum_{j=1}^{l} \left( \log\left(m_{1j} + m_{2j} + 1\right) - \log\left(2\mu_j + \frac{1}{N_e}\right) \right)
\]

and hence, we propose distance measure

\[
d_3(x_1, x_2) = \sum_{j=1}^{l} \left( \log\left(m_{1j} + m_{2j} + 1\right) - \log\left(2\mu_j + \frac{1}{N_e}\right) \right) \mathbb{1}\left[m_{1j} + m_{2j} > 0\right].
\]

Pragmatically, we set $m_{1j} + m_{2j}$ equal to the absolute allelic difference between $x_1$ and $x_2$ at the $j$th locus, thereby knowingly disregarding the possibility of (undetectable) backwards mutations.
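The distance properties discussed in Appendix A can also be checked numerically. The sketch below is an illustration under stated assumptions rather than the authors' implementation: the formulas for d1 and d2 are read off from the summands in the Appendix A derivations, d3 follows the definition given in Appendix B, profiles are represented as tuples of repeat counts, and the mutation rates, effective population size, and function names are hypothetical choices that satisfy the condition 2µ_j + 1/Ne ≤ 1.

```python
import math
import random


def allele_diffs(x, y):
    """Absolute allelic difference m_j at each locus."""
    return [abs(a - b) for a, b in zip(x, y)]


def d1(x, y, mu):
    """Sum over loci of -m_j * log(mu_j)."""
    return sum(-m * math.log(u) for m, u in zip(allele_diffs(x, y), mu))


def d2(x, y, mu):
    """Sum over loci of -m_j / log(1 - mu_j)."""
    return sum(-m / math.log(1.0 - u) for m, u in zip(allele_diffs(x, y), mu))


def d3(x, y, mu, ne):
    """Sum over loci with m_j > 0 of log(m_j + 1) - log(2*mu_j + 1/ne)."""
    total = 0.0
    for m, u in zip(allele_diffs(x, y), mu):
        if m > 0:
            total += math.log(m + 1) - math.log(2.0 * u + 1.0 / ne)
    return total


def is_metric_on(profiles, dist, tol=1e-9):
    """Check the four distance properties on all triples of `profiles`."""
    for a in profiles:
        if abs(dist(a, a)) > tol:
            return False
        for b in profiles:
            dab = dist(a, b)
            if dab < -tol or abs(dab - dist(b, a)) > tol:
                return False
            if a != b and dab <= tol:
                return False
            for c in profiles:
                if dab + dist(a, c) < dist(b, c) - tol:
                    return False
    return True


rng = random.Random(0)
MU = [rng.uniform(1e-4, 1e-2) for _ in range(5)]  # hypothetical mutation rates
NE = 1000.0                                       # hypothetical effective population size
PROFILES = [tuple(rng.randint(10, 16) for _ in range(5)) for _ in range(12)]

print(is_metric_on(PROFILES, lambda a, b: d1(a, b, MU)))
print(is_metric_on(PROFILES, lambda a, b: d2(a, b, MU)))
print(is_metric_on(PROFILES, lambda a, b: d3(a, b, MU, NE)))
```

A check on random profiles cannot replace the proofs, but it is a quick way to catch a sign or indexing mistake when implementing the measures.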

References

[1] Lutz Roewer. Y-chromosome short tandem repeats in forensics—Sexing, profiling, and matching male DNA. WIREs Forensic Science, 1(4):e1336, 2019. doi:10.1002/wfs2.1336.
[2] Lutz Roewer, Mikkel Meyer Andersen, Jack Ballantyne, John M. Butler, Amke Caliebe, Daniel Corach, Maria Eugenia D’Amato, Leonor Gusmão, Yiping Hou, Peter de Knijff, Walther Parson, Mechthild Prinz, Peter M. Schneider, Duncan Taylor, Marielle Vennemann, and Sascha Willuweit. DNA commission of the International Society of Forensic Genetics (ISFG): Recommendations on the interpretation of Y-STR results in forensic analysis. Forensic Sci. Int. Genet., 48:102308, 2020. doi:10.1016/j.fsigen.2020.102308.
[3] C. H. Brenner. Fundamental problem of forensic mathematics - The evidential value of a rare haplotype. Forensic Sci. Int. Genet., 4(5):281–291, 2010. doi:10.1016/j.fsigen.2009.10.013.
[4] Mikkel M. Andersen and David J. Balding. How convincing is a matching Y-chromosome profile? PLOS Genetics, 13(11):1–16, 2017. doi:10.1371/journal.pgen.1007028.
[5] Amke Caliebe and Michael Krawczak. Match probabilities for Y-chromosomal profiles: A paradigm shift. Forensic Sci. Int. Genet., 37:200–203, 2018. doi:10.1016/j.fsigen.2018.08.009.
[6] David J. Balding and Richard A. Nichols. DNA profile match probability calculation: how to allow for population stratification, relatedness, database selection and single bands. Forensic Sci. Int., 64(2):125–140, 1994. doi:10.1016/0379-0738(94)90222-4.
[7] Mikkel Meyer Andersen and David J. Balding. Assessing the Forensic Value of DNA Evidence from Y Chromosomes and Mitogenomes. Genes, 12(8), 2021. doi:10.3390/genes12081209.
[8] L. Roewer, M. Kayser, P. de Knijff, K. Anslinger, A. Betz, A. Caglià, D. Corach, S. Füredi, L. Henke, M. Hidding, H.J. Kärgel, R. Lessig, M. Nagy, V.L. Pascali, W. Parson, B. Rolf, C. Schmitt, R. Szibor, J. Teifel-Greding, and M. Krawczak. A new method for the evaluation of matches in non-recombining genomes: application to Y-chromosomal short tandem repeat (STR) haplotypes in European males. Forensic Sci. Int., 114(1):31–43, 2000. doi:10.1016/S0379-0738(00)00287-5.
[9] Mikkel Meyer Andersen, Amke Caliebe, Arne Jochens, Sascha Willuweit, and Michael Krawczak. Estimating trace-suspect match probabilities for singleton Y-STR haplotypes using coalescent theory. Forensic Sci. Int. Genet., 7(2):264–271, 2013. doi:10.1016/j.fsigen.2012.11.004.
[10] Sascha Willuweit and Lutz Roewer. Y chromosome haplotype reference database (YHRD): Update. Forensic Sci. Int. Genet., 1(2):83–87, 2007. doi:10.1016/j.fsigen.2007.01.017.
[11] Sven Gundlach, Olaf Junge, Lars Wienbrandt, Michael Krawczak, and Amke Caliebe. Comparison of Markov Chain Monte Carlo Software for the Evolutionary Analysis of Y-Chromosomal Microsatellite Data. Computational and Structural Biotechnology Journal, 17:1082–1090, 2019. doi:10.1016/j.csbj.2019.07.014.
[12] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2024. URL https://www.R-project.org/.
[13] Mikkel Meyer Andersen and Ian J. Wilson. rforensicbatwing: BATWING for Calculating Forensic Trace-Suspect Match Probabilities, 2018. URL https://github.com/mikldk/rforensicbatwing. R package version 1.3.1, commit d1585bfe1211b18693ce8d0e2dc0ae73a6ddeba5.
[14] Tomasz Konopka. umap: Uniform Manifold Approximation and Projection, 2023. URL https://CRAN.R-project.org/package=umap. R package version 0.2.10.0.
[15] Jesse H. Krijthe. Rtsne: T-Distributed Stochastic Neighbor Embedding using Barnes-Hut Implementation, 2015. URL https://github.com/jkrijthe/Rtsne. R package version 0.17.
[16] L.J.P. van der Maaten and G.E. Hinton. Visualizing High-Dimensional Data Using t-SNE. Journal of Machine Learning Research, 9(86):2579–2605, 2008.
[17] L.J.P. van der Maaten. Accelerating t-SNE using Tree-Based Algorithms. Journal of Machine Learning Research, 15(93):3221–3245, 2014.
[18] Martin Maechler, Peter Rousseeuw, Anja Struyf, Mia Hubert, and Kurt Hornik. cluster: Cluster Analysis Basics and Extensions, 2024. URL https://CRAN.R-project.org/package=cluster.
[19] Douglas Nychka, Reinhard Furrer, John Paige, and Stephan Sain. fields: Tools for spatial data, 2021. URL https://github.com/dnychka/fieldsRPackage. R package version 16.3.1.
[20] Eric Frichot and Olivier Francois. LEA: an R package for Landscape and Ecological Association studies. Methods in Ecology and Evolution, 2015. URL http://membres-timc.imag.fr/Olivier.Francois/lea.html.
[21] Hadley Wickham, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, Alex Hayes, Lionel Henry, Jim Hester, Max Kuhn, Thomas Lin Pedersen, Evan Miller, Stephan Milton Bache, Kirill Müller, Jeroen Ooms, David Robinson, Dana Paige Seidel, Vitalie Spinu, Kohske Takahashi, Davis Vaughan, Claus Wilke, Kara Woo, and Hiroaki Yutani. Welcome to the tidyverse. Journal of Open Source Software, 4(43):1686, 2019. doi:10.21105/joss.01686.
[22] Kevin Wright. pals: Color Palettes, Colormaps, and Tools to Evaluate Them, 2025. URL https://CRAN.R-project.org/package=pals. R package version 1.10.
[23] Tal Galili. dendextend: an R package for visualizing, adjusting, and comparing trees of hierarchical clustering. Bioinformatics, 2015. doi:10.1093/bioinformatics/btv428.
[24] Christian Hennig. fpc: Flexible Procedures for Clustering, 2024. URL https://CRAN.R-project.org/package=fpc. R package version 2.2-13.
[25] Alboukadel Kassambara. ggpubr: 'ggplot2' Based Publication Ready Plots, 2023. URL https://CRAN.R-project.org/package=ggpubr. R package version 0.6.0.
[26] Michael Hahsler, Kurt Hornik, and Christian Buchta. Getting things in order: An introduction to the R package seriation. Journal of Statistical Software, 25(3):1–34, 2008. doi:10.18637/jss.v025.i03.
[27] Michael Hahsler, Christian Buchta, and Kurt Hornik. seriation: Infrastructure for Ordering Objects Using Seriation, 2025. URL https://CRAN.R-project.org/package=seriation. R package version 1.5.8.
[28] Christoph Glur. data.tree: General Purpose Hierarchical Data Structure, 2023. URL https://CRAN.R-project.org/package=data.tree. R package version 1.1.0.
[29] Gábor Csárdi and Tamás Nepusz. The igraph software package for complex network research. InterJournal, Complex Systems:1695, 2006. URL https://igraph.org.
[30] Gábor Csárdi, Tamás Nepusz, Vincent Traag, Szabolcs Horvát, Fabio Zanini, Daniel Noom, and Kirill Müller. igraph: Network Analysis and Visualization in R, 2026. URL https://CRAN.R-project.org/package=igraph. R package version 2.1.4.
[31] Thomas Lin Pedersen. ggraph: An Implementation of Grammar of Graphics for Graphs and Networks, 2024. URL https://CRAN.R-project.org/package=ggraph. R package version 2.2.1.
[32] Claus O. Wilke. cowplot: Streamlined Plot Theme and Plot Annotations for 'ggplot2', 2024. URL https://CRAN.R-project.org/package=cowplot. R package version 1.1.3.
[33] Tóra Oluffa Stenberg Olsen. Code for reproducing the results. doi:10.6084/m9.figshare.31286275.
[34] S. Willuweit and L. Roewer. The new Y Chromosome Haplotype Reference Database. Forensic Sci. Int. Genet., 15:43–48, 2015. doi:10.1016/j.fsigen.2014.11.024.
[35] P. A. P. Moran. The Interpretation of Statistical Maps. Journal of the Royal Statistical Society. Series B (Methodological), 10(2):243–251, 1948.
[36] Lutz Roewer, Peter J. P. Croucher, Sascha Willuweit, Tim T. Lu, Manfred Kayser, Rüdiger Lessig, Peter de Knijff, Mark A. Jobling, Chris Tyler-Smith, and Michael Krawczak. Signature of recent historical events in the European Y-chromosomal STR haplotype distribution. Hum. Genet., 116:279–291, 2005. doi:10.1007/s00439-004-1201-z.
[37] Hongyang Xu, Chuan-Chao Wang, Rukesh Shrestha, Ling-Xiang Wang, Manfei Zhang, Yungang He, Judith R. Kidd, Kenneth K. Kidd, Li Jin, and Hui Li. Inferring population structure and demographic history using Y-STR data from worldwide populations. Mol. Genet. Genomics, 290:141–150, 2015. doi:10.1007/s00438-014-0903-8.
[38] Ian J. Wilson, Michael E. Weale, and David J. Balding. Inferences from DNA Data: Population Histories, Evolutionary Processes and Forensic Match Probabilities. Journal of the Royal Statistical Society Series A: Statistics in Society, 166(2):155–188, 2003. doi:10.1111/1467-985X.00264.
[39] J. C. Gower. Some Distance Properties of Latent Root and Vector Methods Used in Multivariate Analysis. Biometrika, 53(3/4):325–338, 1966. doi:10.2307/2333639.
[40] K. V. Mardia. Some properties of clasical multi-dimesional scaling. Communications in Statistics - Theory and Methods, 7(13):1233–1241, 1978. doi:10.1080/03610927808827707.
[41] J. Healy and L. McInnes. Uniform manifold approximation and projection. Nat Rev Methods Primers, 4(82), 2024. doi:10.1038/s43586-024-00363-x.
[42] G. N. Lance and W. T. Williams. A General Theory of Classificatory Sorting Strategies: 1. Hierarchical Systems. The Computer Journal, 9(4):373–380, 1967. doi:10.1093/comjnl/9.4.373.
[43] R.L. Thorndike. Who belongs in the family? Psychometrika, 18(4):267–276, 1953. doi:10.1007/BF02289263.
[44] Paul Jaccard. Distribution de la Flore Alpine dans le Bassin des Dranses et dans quelques régions voisines. Bulletin de la Societe Vaudoise des Sciences Naturelles, 37:241–272, 1901. doi:10.5169/seals-266440.
[45] Christian Hennig. Cluster-wise assessment of cluster stability. Computational Statistics & Data Analysis, 52(1):258–271, 2007. doi:10.1016/j.csda.2006.11.025.
[46] Christian Hennig. Dissolution point and isolation robustness: Robustness criteria for general cluster analysis methods. Journal of Multivariate Analysis, 99(6):1154–1176, 2008. doi:10.1016/j.jmva.2007.07.002.
[47] Jonathan K Pritchard, Matthew Stephens, and Peter Donnelly. Inference of Population Structure Using Multilocus Genotype Data. Genetics, 155(2):945–959, 2000. doi:10.1093/genetics/155.2.945.
[48] O. François. Running Structure-like Population Genetic Analyses with R, 2016. R Tutorials in Population Genetics. U. Grenoble-Alpes.
[49] Eric Frichot, François Mathieu, Théo Trouillon, Guillaume Bouchard, and Olivier François. Fast and Efficient Estimation of Individual Ancestry Coefficients. Genetics, 196(4):973–983, 2014. doi:10.1534/genetics.113.160572.
[50] Wentao Shi, Qasim Ayub, Mark Vermeulen, Rong guang Shao, Sofia Zuniga, Kristiaan van der Gaag, Peter de Knijff, Manfred Kayser, Yali Xue, and Chris Tyler-Smith. A Worldwide Survey of Human Male Demographic History Based on Y-SNP and Y-STR Data from the HGDP–CEPH Populations. Molecular Biology and Evolution, 27(2):385–393, 2010. doi:10.1093/molbev/msp243.
[51] Amalia Diaz-Lacava, Maja Walier, Sascha Willuweit, Thomas F. Wienker, Rolf Fimmers, Max P. Baur, and Lutz Roewer. Geostatistical inference of main Y-STR-haplotype groups in Europe. Forensic Sci. Int. Genet., 5(2):91–94, 2011. doi:10.1016/j.fsigen.2010.09.010. Haploid DNA markers in Forensic Genetics.
[52] Josephine Purps, Sabine Siegert, Sascha Willuweit, Marion Nagy, Cíntia Alves, Renato Salazar, Sheila M.T. Angustia, Lorna H. Santos, Katja Anslinger, Birgit Bayer, Qasim Ayub, Wei Wei, Yali Xue, Chris Tyler-Smith, Miriam Baeta Bafalluy, Begoña Martínez-Jarreta, Balazs Egyed, Beate Balitzki, Sibylle Tschumi, David Ballard, Denise Syndercombe Court, Xinia Barrantes, Gerhard Bäßler, Tina Wiest, Burkhard Berger, Harald Niederstätter, Walther Parson, Carey Davis, Bruce Budowle, Helen Burri, Urs Borer, Christoph Koller, Elizeu F. Carvalho, Patricia M. Domingues, Wafaa Takash Chamoun, Michael D. Coble, Carolyn R. Hill, Daniel Corach, Mariela Caputo, Maria E. D’Amato, Sean Davison, Ronny Decorte, Maarten H.D. Larmuseau, Claudio Ottoni, Olga Rickards, Di Lu, Chengtao Jiang, Tadeusz Dobosz, Anna Jonkisz, William E. Frank, Ivana Furac, Christian Gehrig, Vincent Castella, Branka Grskovic, Cordula Haas, Jana Wobst, Gavrilo Hadzic, Katja Drobnic, Katsuya Honda, Yiping Hou, Di Zhou, Yan Li, Shengping Hu, Shenglan Chen, Uta-Dorothee Immel, Rüdiger Lessig, Zlatko Jakovski, Tanja Ilievska, Anja E. Klann, Cristina Cano García, Peter de Knijff, Thirsa Kraaijenbrink, Aikaterini Kondili, Penelope Miniati, Maria Vouropoulou, Lejla Kovacevic, Damir Marjanovic, Iris Lindner, Issam Mansour, Mouayyad Al-Azem, Ansar El Andari, Miguel Marino, Sandra Furfuro, Laura Locarno, Pablo Martín, Gracia M. Luque, Antonio Alonso, Luís Souto Miranda, Helena Moreira, Natsuko Mizuno, Yasuki Iwashima, Rodrigo S. Moura Neto, Tatiana L.S. Nogueira, Rosane Silva, Marina Nastainczyk-Wulf, Jeanett Edelmann, Michael Kohl, Shengjie Nie, Xianping Wang, Baowen Cheng, Carolina Núñez, Marian Martínez de Pancorbo, Jill K. Olofsson, Niels Morling, Valerio Onofri, Adriano Tagliabracci, Horolma Pamjav, Antonia Volgyi, Gusztav Barany, Ryszard Pawlowski, Agnieszka Maciejewska, Susi Pelotti, Witold Pepinski, Monica Abreu-Glowacka, Christopher Phillips, Jorge Cárdenas, Danel Rey-Gonzalez, Antonio Salas, Francesca Brisighelli, Cristian Capelli, Ulises Toscanini, Andrea Piccinini, Marilidia Piglionica, Stefania L. Baldassarra, Rafal Ploski, Magdalena Konarzewska, Emila Jastrzebska, Carlo Robino, Antti Sajantila, Jukka U. Palo, Evelyn Guevara, Jazelyn Salvador, Maria Corazon De Ungria, Jae Joseph Russell Rodriguez, Ulrike Schmidt, Nicola Schlauderer, Pekka Saukko, Peter M. Schneider, Miriam Sirker, Kyoung-Jin Shin, Yu Na Oh, Iulia Skitsa, Alexandra Ampati, Tobi-Gail Smith, Lina Solis de Calvit, Vlastimil Stenzl, Thomas Capal, Andreas Tillmar, Helena Nilsson, Stefania Turrina, Domenico De Leo, Andrea Verzeletti, Venusia Cortellini, Jon H. Wetton, Gareth M. Gwynne, Mark A. Jobling, Martin R. Whittle, Denilce R. Sumita, Paulina Wolańska-Nowak, Rita Y.Y. Yong, Michael Krawczak, Michael Nothnagel, and Lutz Roewer. A global analysis of Y-chromosomal haplotype diversity for 23 STR loci. Forensic Sci. Int. Genet., 12:12–23, 2014. doi:10.1016/j.fsigen.2014.04.008.
[53] Amke Caliebe, Dion Zandstra, Arwin Ralf, Manfred Kayser, and Michael Krawczak. A novel mathematical framework for pedigree-based calculation of Y-STR match probabilities. Sci Rep, 15:14651, 2025. doi:10.1038/s41598-025-98644-2.
[54] Rita Costa, Jennifer Fadoni, António Amorim, and Laura Cainé. Y-STR Databases-Application in Sexual Crimes. Genes, 16(5):484, 2025. doi:10.3390/genes16050484.
[55] Simon Tavaré, David J. Balding, R.C. Griffiths, and Peter Donnelly. Inferring Coalescence Times From DNA Sequence Data. Genetics, 145(2):505–518, 1997. doi:10.1093/genetics/145.2.505.
[56] Bruce Walsh. Estimating the Time to the Most Recent Common Ancestor for the Y chromosome or Mitochondrial DNA for a Pair of Individuals. Genetics, 158(2):897–912, 2001. doi:10.1093/genetics/158.2.897.



License: CC-BY-ND-4.0