Keywords
Y-STR, Population database, Metapopulation, Cluster analysis, Genetic co-ancestry, Match probability
1. Introduction
The analysis of Y-chromosomal short tandem repeat (Y-STR) markers plays an important role in forensic genetics,
particularly in cases, such as sexual assaults, where the biological trace of interest contains an unbalanced mixture of
a major female and a minor male DNA component. Due to their male specificity, single-source Y-STR profiles can
often still be reliably determined from mixtures even when the ratio of male to female genetic material is minimal
[1, 2].
∗Corresponding author. Email address: [email protected] (Tóra Oluffa Stenberg Olsen)
Preprint submitted to bioRxiv February 7, 2026
bioRxiv preprint doi: https://doi.org/10.64898/2026.02.07.704579; this version posted February 10, 2026. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-ND 4.0 International license.
Once a suspect has been identified in cases like the above, his Y-STR profile can be determined as well and
compared to the trace profile. While a mismatch between the two profiles normally excludes the suspect from being
a donor, matches are more difficult to interpret. This is because, in court, the forensic genetic expert must quantify
the strength of evidence of a match, usually in the form of the probability with which such a match would also be
observed between the trace profile and the profile of another male. The question of how best to obtain this so-called
‘match probability’ and, in particular, what “another” should mean in this context has occupied forensic geneticists
for more than two decades [3–5].
The most commonly used approach to determining a match probability equates the latter with the frequency of
the Y-STR profile of interest in a suitable group of plausible alternative donors - also referred to as the ‘suspect
population’ [5, 6]. Although various mathematically sound methods have been proposed to estimate a haplotype
frequency from a representative sample of a suspect population [7–9], the main concern to date has been how to
adequately define this population itself. Notably, the DNA commission of the International Society for Forensic
Genetics (ISFG) recommended use of the geographical location of the crime scene as a “leading criterion” for selecting
the “right” suspect population [2].
The ISFG proposal is certainly meaningful because one of the key criteria for the plausibility of donorship is
spatial proximity to the place where the trace was recovered. It is undoubtedly reasonable to assume that most plausible
alternative donors lived near the crime scene at the time. However, with very few exceptions, estimating the population
frequency of a Y-STR profile requires a sample or database that is both sufficiently large and sufficiently representative
of the population in question. ‘Representative’ is used here in the statistical sense, meaning that the sample or database
adequately captures all relevant population characteristics. The largest database established for this purpose is the
Y-chromosomal Haplotype Reference Database (YHRD), which emerged in 2004 from a series of previous efforts
in Europe [10]. Since then, YHRD has been continuously extended and currently contains approximately 350,000
Y-STR profiles from all over the world, comprising between eight and 27 Y-STR markers.
Despite these tremendous efforts, even a database the size of YHRD cannot cover every case-relevant geographical
target region at a level (village, city, state, province, etc.) desirable for forensic casework. Moreover, humans have
undergone a multitude of demographic processes that led to the formation, and subsequent reorganization, of local
populations at varying levels of intra- and inter-group genetic differentiation. To allow forensic experts to take this
fact into account when using YHRD, the database curators decided to highlight “metapopulations” in YHRD. These
are defined as combinations of originally separate, regional subsets of the data that were supposedly connected by past
migration or gene flow [10]. The stated goal of the endeavour has been to group together “geographically dispersed
human population samples with shared genetic ancestry” [2], using geographic, linguistic, and demographic bonds as
proxies for (male) genetic relatedness.
In the present study, we aimed to assess the extent to which the above-mentioned objective was achieved. To this
end, we investigated
(i) whether clusters of Y-STR profiles are evident in YHRD that indicate subsets with markedly more recent co-ancestry,
(ii) to what extent the degree of clustering observed for different marker combinations depends upon their mutation
rates, and
(iii) whether the Y-STR profile clusters thus identified overlap with the metapopulations defined in YHRD.
Ideally, our study would have been based upon an ab initio reconstruction of the genealogy of the Y-STR profiles
included in YHRD, using software tools publicly available for this purpose. However, the reconstruction of
coalescence trees is not only computationally intensive, but its results also critically depend upon the underlying evolutionary
model and the choice of model parameters [11]. Therefore, we performed classical cluster analysis using
marker-wise allelic dissimilarity as an (agnostic) measure of the pair-wise distance between Y-STR profiles. Although
this approach may not perfectly reproduce the results of a coalescence-based analysis, it should still yield trees that
adequately reflect the true genealogy of the Y-STR profiles included in YHRD.
2. Methods
2.1. Data and data analyses
All data analyses were carried out using R [12] version 4.5.0 and the packages rforensicbatwing [13] version
1.3.1, umap [14] version 0.2.10.0, Rtsne [15–17] version 0.17, cluster [18] version 2.1.8, fields [19] version
16.3.1, LEA [20] version 3.16.0, tidyverse [21] version 2.0.0, pals [22] version 1.10, dendextend [23] version
1.19.0, fpc [24] version 2.2.13, ggpubr [25] version 0.6.0, seriation [26, 27] version 1.5.8, data.tree [28]
version 1.1.0, igraph [29, 30] version 2.1.4, ggraph [31] version 2.2.1, and cowplot [32] version 1.1.3. The
respective code is publicly available online [33].
We used data from YHRD [34] in accordance with a project proposal previously submitted to the database curators
(https://yhrd.org/pages/Projects/P5). Currently, YHRD defines seven major metapopulations, namely African,
Afro-Asiatic, Native American, Australian Aboriginal, East Asian, Eskimo-Aleut, and Eurasian, in addition to one
‘Admixed’ metapopulation covering populations with pronounced admixture [2, 34]. Some of these YHRD metapopulations
are further divided into subgroups (Fig. 1) that YHRD also refers to as ‘metapopulations’, but from which
we distinguish the original metapopulations by adding the term ‘major’ to the latter, if required. A more detailed
description of the geographical range of the YHRD metapopulations can be found in Table 2 in [2].
We generated four sets of Y-STR loci henceforth referred to as ‘kits’. Each kit consisted of eight Y-STR loci
that were selected based upon their mutation rates (Table 1). We refer to these kits as ‘fast’, ‘medium’, ‘mixed’,
[Figure 1: tree diagram of the YHRD metapopulation hierarchy. Root node: All. Major metapopulations: Eurasian, Afro-Asiatic, Admixed, East Asian, Native American, African, Eskimo Aleut, Australian Aboriginal. Subgroup labels: Caucasian, Altaic, European, Indo-Iranian, Indian, Uralic-Yukaghir, Semitic, Berber, Cushitic, Japanese, Sino-Tibetan, Korean, Tai-Kadai, Austronesian, Austro-Asiatic, Indo-Pacific, Dravidian, Sub-Saharan African, African American, South-Eastern European, Western European, Eastern European, Chinese (Han), Tibeto-Burman.]
Fig. 1: Overview of YHRD metapopulations. Colors correspond to the major YHRD metapopulations defined in [34], including the ‘Admixed’
metapopulation.
Table 1: Mutation rate-dependent definition of Y-STR marker kits.
Kit | Mutation rate (a) | Loci | Number of Y-STR profiles (b) | Number of YHRD metapopulations
Fast | 4.36 × 10^−3 to 1.40 × 10^−2 | DYS627, DYS576, DYS518, DYS449, DYS570, DYS458, DYS439, DYS460 | 102,253 | 25
Medium | 2.34 × 10^−3 to 4.34 × 10^−3 | DYS481, DYS549, DYS456, DYS533, DYS635, DYS389I, YGATAH4, DYS391 | 102,880 | 27
Mixed | 3.95 × 10^−4 to 1.40 × 10^−2 | DYS627, DYS576, DYS518, DYS449, DYS438, DYS392, DYS448, DYS437 | 103,129 | 25
Slow | 3.95 × 10^−4 to 2.24 × 10^−3 | DYS438, DYS392, DYS393, DYS437, DYS643, DYS448, DYS390, DYS19 | 107,748 | 27
(a) Mutation rates were taken from www.yhrd.org. (b) Numbers include only Y-STR profiles with integer-valued alleles.
and ‘slow’ depending on the mutation rates of the loci included in the respective kit. Only integer-valued alleles
were considered here because distances to and between intermediate alleles are difficult to define in accordance with
evolutionary proximity. The numbers of Y-STR profiles for each YHRD metapopulation and each kit are provided in
Supplementary Table S.1.
Except for the analysis explained in Section 2.3, the number of Y-STR profiles was limited to a maximum of 100
per YHRD metapopulation for each marker kit to ensure a balanced ancestry of the resulting sample. The Y-STR
profiles were selected randomly from the YHRD metapopulation under consideration.
2.2. Spatial autocorrelation
We examined the degree of spatial autocorrelation between Y-STR profiles and their sampling locations, using
Moran’s I [35]. In particular, we repeated the analysis by Roewer et al. [36] to assess whether similar levels of
autocorrelation as observed in early YHRD also characterise the data analysed in the present study. First, we calculated
pair-wise Great Circle Distances (in kilometres) between sampling locations from the respective longitude and latitude
data. Y-STR profile pairs were then grouped into distance classes defined as adjacent 250-kilometre-wide intervals,
with one additional class comprising all pairs sampled at the same location. We then calculated Moran’s I for each
class, setting the required spatial weights for Y-STR profile pairs equal to unity, if the profile pair fell into that class,
and equal to zero otherwise. While a value of I close to plus one indicates similarity of Y-STR profiles in a given
distance class, and a value of I close to minus one indicates dissimilarity, I = 0 means a lack of correlation in repeat
number between Y-STR profiles in the respective class [35].
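The class-wise calculation described above can be illustrated with a minimal Python sketch. It is not the code used in the study (which was written in R [33]) and rests on several assumptions made only for illustration: Moran's I is computed for the repeat numbers at a single locus, distances use the haversine formula with a mean Earth radius of 6,371 km, and class k covers distances in ((k − 1) · 250, k · 250] km. The function names are hypothetical.

```python
import math
from itertools import combinations

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch


def great_circle_km(p, q):
    """Haversine distance in km between two (latitude, longitude) pairs in degrees."""
    lat1, lon1 = map(math.radians, p)
    lat2, lon2 = map(math.radians, q)
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def distance_class(p, q, width_km=250.0):
    """Class 0: same sampling location; class k >= 1: distance in ((k-1)*width, k*width]."""
    if p == q:
        return 0
    return max(1, math.ceil(great_circle_km(p, q) / width_km))


def morans_i_by_class(repeats, locations, width_km=250.0):
    """Moran's I per distance class, with weight 1 inside the class and 0 otherwise."""
    n = len(repeats)
    mean = sum(repeats) / n
    dev = [x - mean for x in repeats]
    total_ss = sum(d * d for d in dev)
    if total_ss == 0:
        raise ValueError("all repeat numbers are identical; Moran's I is undefined")
    cross, pairs = {}, {}
    for i, j in combinations(range(n), 2):
        c = distance_class(locations[i], locations[j], width_km)
        cross[c] = cross.get(c, 0.0) + dev[i] * dev[j]
        pairs[c] = pairs.get(c, 0) + 1
    # I_c = (n / W_c) * sum_{i != j} w_ij * dev_i * dev_j / total_ss. The factor 2 from
    # counting ordered pairs cancels between W_c and the cross-product sum.
    return {c: n * cross[c] / (pairs[c] * total_ss) for c in sorted(pairs)}


# Toy data: two profiles per location, identical within and different between locations.
toy = morans_i_by_class([10, 10, 12, 12],
                        [(0.0, 0.0), (0.0, 0.0), (0.0, 1.0), (0.0, 1.0)])
```

In this toy example, the same-location class yields I = 1 and the first 250 km class yields I = −1, matching the interpretation of the two extremes given above.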
2.3. Distance measures
The possible structuring of the YHRD data was analysed by multi-dimensional scaling (MDS) and hierarchical
clustering, both of which require a pair-wise distance measure between Y-STR profiles. For this, we considered the
following three quantities:
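\[ d_1 = \sum_{j=1}^{l} -m_j \log(\mu_j), \qquad (1) \]
\[ d_2 = \sum_{j=1}^{l} -m_j \, \frac{1}{\log(1 - \mu_j)}, \qquad (2) \]
\[ d_3 = \sum_{j=1}^{l} \left( \log(m_j + 1) - \log\left( 2\mu_j + \frac{1}{N_e} \right) \right) \cdot \mathbf{1}[m_j > 0]. \qquad (3) \]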
Here, l is the number of loci, µ_j denotes the mutation rate at the jth locus, m_j is the absolute difference in repeat
number, at the jth locus, between the two Y-STR profiles of interest, and N_e denotes the effective size of the underlying
population. We considered effective population sizes of 1,000, 5,000, and 10,000, respectively [37]. All three
quantities satisfy the properties of distance measures (Appendix A).
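For illustration, Eqs. (1) to (3) can be written as a short Python sketch. This is not the study's R code [33]; the function names are hypothetical, profiles are assumed to be equal-length tuples of integer repeat numbers, and the natural logarithm is assumed because the text does not state the base.

```python
import math


def _check(a, b, mu):
    """Both profiles and the mutation-rate vector must cover the same loci."""
    if not (len(a) == len(b) == len(mu)):
        raise ValueError("profiles and mutation rates must have the same length")


def d1(a, b, mu):
    """Eq. (1): repeat differences weighted by -log(mu_j)."""
    _check(a, b, mu)
    return sum(-abs(x - y) * math.log(m) for x, y, m in zip(a, b, mu))


def d2(a, b, mu):
    """Eq. (2): repeat differences weighted by -1 / log(1 - mu_j)."""
    _check(a, b, mu)
    return sum(-abs(x - y) / math.log(1.0 - m) for x, y, m in zip(a, b, mu))


def d3(a, b, mu, n_e):
    """Eq. (3): log(m_j + 1) - log(2 * mu_j + 1 / N_e), summed over differing loci only."""
    _check(a, b, mu)
    total = 0.0
    for x, y, m in zip(a, b, mu):
        diff = abs(x - y)
        if diff > 0:  # indicator 1[m_j > 0]
            total += math.log(diff + 1) - math.log(2 * m + 1 / n_e)
    return total


# Toy example with three loci; the mutation rates are made up for illustration.
profile_a = (14, 10, 13)
profile_b = (15, 10, 11)
rates = (0.01, 0.002, 0.004)
example = (d1(profile_a, profile_b, rates),
           d2(profile_a, profile_b, rates),
           d3(profile_a, profile_b, rates, n_e=1000))
```

Because the weights depend only on the locus, all three functions are symmetric in the two profiles and return zero for identical profiles.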
The rationale of all three distance measures was to consider the respective mutation rates when weighting the
allelic differences. Hence, we introduced locus-specific weights that depended differently upon each locus-specific
mutation rate. More precisely, d1 weights the repeat difference by −log(µ_j) so that loci with a low mutation rate
contribute disproportionately to the overall profile dissimilarity. The second distance measure, d2, weights the repeat
difference by −1/log(1 − µ_j), which has a similar effect to d1 but gives even more weight to loci with lower mutation
rates. The third measure, d3, additionally considers the estimated time to the most recent common ancestor of the two
profiles of interest. Thus, d3 subtracts log(2µ_j + 1/N_e) from log(m_j + 1) to simultaneously account for the divergence of
profiles expected from mutation and drift. A more detailed justification of d3 is provided in Appendix B. In addition
to d1, d2, and d3, we also considered the Euclidean and Manhattan distances.
All distance measures used in the present study were computational compromises that likely resulted in some loss
of ancestry information. Therefore, we compared each measure to BATWING [38] estimates of the total length of the
coalescent branches between two Y-STR profiles, as described in [13]. Unfortunately, due to high computational costs,
BATWING estimates were not suitable as distance measures themselves. For each marker kit, we instead selected 30
Y-STR profiles at random, estimated the lengths of the coalescent branches for all possible profile pairs, and repeated
this procedure five times. The other distance measures were then related to the BATWING estimates in each repetition
by way of Spearman’s rank correlation coefficient.
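The rank-based comparison can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the study's R code [33]: ties receive average ranks, Spearman's coefficient is computed as the Pearson correlation of the ranks, and the function names are hypothetical. In practice, the two input lists would hold a candidate distance and the BATWING branch-length estimate for the same profile pairs.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the average of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson correlation of the average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long sequences with at least two values")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("rank correlation is undefined for constant input")
    return cov / math.sqrt(vx * vy)


# Any monotone relationship gives a coefficient of 1, regardless of its shape.
rho = spearman_rho([1, 2, 3, 4], [1, 4, 9, 100])
```

A rank-based coefficient suits this benchmark because a distance measure only needs to order profile pairs in the same way as the coalescent branch lengths, not to be proportional to them.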
2.4. Dimensionality reduction
To be able to visualise whether and how Y-STR profiles from YHRD group into clusters, we first employed three
methods of dimensionality reduction, namely MDS, Uniform Manifold Approximation and Projection (UMAP), and
t-Distributed Stochastic Neighbour Embedding (t-SNE) [16, 17, 39–41]. The three methods differ in how they maintain
the overall structure of distance in the data during dimensionality reduction. While MDS aims to preserve all pairwise
distances between Y-STR profiles as much as possible, t-SNE focuses on the maintenance of local relationships.
Therefore, t-SNE often produces well-separated clusters, but only inadequately captures their global relationships.
UMAP lies somewhere in between and typically yields similar results to t-SNE regarding local structure, but tends to
infer the global structure somewhat better than t-SNE, although still less reliably than MDS.
2.5. Clustering
For each marker kit, the corresponding Y-STR profiles were subjected to hierarchical clustering with one of five
possible agglomeration methods: ‘average’, ‘single’, ‘complete’, ‘Ward’, and ‘weighted’ [42]. The selection of the
agglomeration method was based upon the respective agglomeration coefficient, ac. This figure lies between zero and
unity and relates the average dissimilarity of a Y-STR profile to the first cluster it joins to the dissimilarity between
the last two merged clusters [18]. An ac value close to unity indicates a data structure where data points join clusters
early, relative to the final merger, thereby suggesting the presence of well-defined and tight clusters. It should be noted
that ac tends to increase with an increasing number of observations and therefore should not be compared between
samples of different sizes.
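To make the definition of ac concrete, the following Python sketch computes it for a small distance matrix using a naive agglomerative procedure. It is an illustration only: the study used the R package cluster [18], whereas this sketch is a plain re-implementation that supports only the ‘average’, ‘single’, and ‘complete’ methods (not ‘Ward’ or ‘weighted’) and is far too slow for large samples. The function name is hypothetical.

```python
def agglomerative_coefficient(dist, linkage="average"):
    """Mean over observations of 1 - (height of first merge / height of final merge)."""
    combine = {
        "average": lambda ds: sum(ds) / len(ds),
        "single": min,
        "complete": max,
    }[linkage]
    n = len(dist)
    if n < 2:
        raise ValueError("need at least two observations")
    clusters = {i: [i] for i in range(n)}  # cluster id -> member observations
    first_merge = [None] * n               # height at which each observation first joins
    next_id, height = n, 0.0
    while len(clusters) > 1:
        ids = sorted(clusters)
        # Find the closest pair of clusters under the chosen linkage.
        height, a, b = min(
            (combine([dist[i][j] for i in clusters[a] for j in clusters[b]]), a, b)
            for pos, a in enumerate(ids)
            for b in ids[pos + 1:]
        )
        for c in (a, b):
            if len(clusters[c]) == 1:      # a singleton joins its first cluster here
                first_merge[clusters[c][0]] = height
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        next_id += 1
    if height == 0:
        raise ValueError("all observations are identical; ac is undefined")
    return sum(1 - h / height for h in first_merge) / n


# Two tight pairs far apart: every point joins early relative to the final merger.
points = [0, 1, 10, 11]
toy_dist = [[abs(x - y) for y in points] for x in points]
ac_average = agglomerative_coefficient(toy_dist, "average")
```

In the toy example, each point first merges at height 1 and the final merger occurs at height 10 under average linkage, which gives ac = 0.9.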
For a given marker kit, we performed hierarchical clustering with the agglomeration method that yielded the largest
ac and with the distance measure that correlated most strongly with the coalescent tree branch lengths estimated before
with BATWING. The number of clusters present in the resulting tree was determined using the elbow method [43].
By plotting a given tree height against the number of sub-clusters present at that height, the elbow method identifies
a point (the ‘elbow’) where the decrease in height per added cluster slows down significantly. This property suggests
that adding more clusters beyond the elbow only slightly increases the similarity of Y-STR profiles within clusters.
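The elbow can also be located numerically. The Python sketch below uses one common heuristic, the largest second difference of the height curve, which is an assumption of this illustration and not necessarily the procedure applied in the study, where the elbow may have been read off the plot. The function name and the toy heights are hypothetical.

```python
def elbow_k(heights):
    """Pick the cluster number at the sharpest bend of the height curve.

    `heights` maps a cluster number k to the dendrogram height at which k clusters
    are reduced to k - 1 by the next merger; it must cover consecutive values of k.
    """
    ks = sorted(heights)
    if len(ks) < 3 or ks != list(range(ks[0], ks[-1] + 1)):
        raise ValueError("need heights for at least three consecutive values of k")

    def bend(k):
        # Drop in height gained by going from k - 1 to k clusters, minus the drop
        # gained by going from k to k + 1 clusters.
        return (heights[k - 1] - heights[k]) - (heights[k] - heights[k + 1])

    # Only interior values of k have both neighbours; ties resolve to the smallest k.
    return max(ks[1:-1], key=bend)


# Toy curve: the height falls steeply up to k = 4 and flattens afterwards.
toy_heights = {2: 100, 3: 60, 4: 30, 5: 28, 6: 27, 7: 26}
k_elbow = elbow_k(toy_heights)
```

For the toy curve, the bend is largest at k = 4, after which each additional cluster reduces the height by only one or two units.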
The stability of the inferred clusters was assessed by bootstrapping using the Jaccard index [24, 44–46] as a
measure of cluster overlap. For this purpose, 100 bootstrap samples were drawn ignoring multiple drawings of one
and the same Y-STR profile. Each bootstrap sample was then subjected to the same hierarchical cluster analysis as
the original data and decomposed into the same number of clusters as originally inferred for the respective marker kit.
For each original cluster, the most similar bootstrap cluster was then identified through maximisation of the pairwise
Jaccard index. Finally, these maximum Jaccard indices were averaged over all bootstrap runs to yield a measure of
stability of each original cluster.
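The bootstrap procedure can be sketched in Python as follows. The sketch is an illustration rather than the study's implementation, which relied on the R package fpc [24]. Two assumptions are made: each original cluster is restricted to the profiles actually drawn in a bootstrap sample before the Jaccard index is computed, and the clustering step is passed in as a hypothetical callable `cluster_fn(items, k)`. The helper `split_at_largest_gaps` is a toy one-dimensional stand-in for hierarchical clustering.

```python
import random


def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B| of two sets (0.0 if both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_stability(original_clusters, cluster_fn, items, n_boot=100, seed=0):
    """Average, over bootstrap samples, of the maximum Jaccard index per original cluster."""
    rng = random.Random(seed)
    k = len(original_clusters)
    totals = [0.0] * k
    for _ in range(n_boot):
        # Draw with replacement; the set ignores multiple drawings of the same item.
        sample = {rng.choice(items) for _ in items}
        boot_clusters = cluster_fn(sorted(sample), k)
        for idx, orig in enumerate(original_clusters):
            restricted = set(orig) & sample
            totals[idx] += max(jaccard(restricted, bc) for bc in boot_clusters)
    return [t / n_boot for t in totals]


def split_at_largest_gaps(sorted_items, k):
    """Toy 1-D clustering (hypothetical helper): cut at the k - 1 largest gaps."""
    if k <= 1 or len(sorted_items) <= 1:
        return [set(sorted_items)]
    cuts = sorted(range(1, len(sorted_items)),
                  key=lambda i: sorted_items[i] - sorted_items[i - 1],
                  reverse=True)[:k - 1]
    clusters, start = [], 0
    for cut in sorted(cuts):
        clusters.append(set(sorted_items[start:cut]))
        start = cut
    clusters.append(set(sorted_items[start:]))
    return clusters


# Two well-separated groups of 20 values each are recovered in every bootstrap run.
toy_items = list(range(20)) + list(range(1000, 1020))
toy_clusters = [set(range(20)), set(range(1000, 1020))]
stability = cluster_stability(toy_clusters, split_at_largest_gaps, toy_items, n_boot=50, seed=1)
```

With clearly separated groups, both stability values equal 1; values around or below 0.5 indicate that a cluster is not reproduced reliably when the data are resampled.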
In addition to hierarchical clustering, we also performed STRUCTURE [47] analyses as a means to group Y-STR
profiles. STRUCTURE does not aim to maximise Y-STR profile similarity within groups, but rather maximises the
overall probability of each profile being assigned to a specific group by varying the profile frequencies within groups.
STRUCTURE analyses were performed in R as described in [48, 49]. The marker kit-specific numbers of groups
were again determined using the elbow method [43], but this time employing a cross-entropy criterion [49] calculated
for between one and 15 groups present.
3. Results
3.1. Spatial autocorrelation
Spatial autocorrelation analysis with Moran’s I index revealed that the current relationship between Y-STR profile
characteristics and sampling location resembled that reported for an early version of YHRD [36]. Thus, for each
marker kit, Y-STR profiles tended to be more similar (i.e. Moran’s I > 0) when sampled at nearby rather than distant
locations, and increasing the Great Circle Distance consistently decreased the spatial autocorrelation (Fig. 2).
3.2. Distance measures
To select the most appropriate of five possible distance measures for further analysis (see Section 2.3), each
distance measure was compared to the coalescent tree branch lengths estimated with BATWING [13, 38], assuming an
effective population size of 1,000 (Fig. 3; for results obtained with effective population sizes of 5,000 and 10,000, see
Supplementary Figs. S.1 and S.2, respectively). Benchmarking against BATWING was performed in five repetitions,
separately for each marker kit.
[Figure 2: line plot of Moran’s I index (y-axis, ticks at 0.0 and 0.1) against Great Circle Distance in km (x-axis, 0 to 12,000), with one line per marker kit (Fast, Medium, Mixed, Slow).]
Fig. 2: Spatial autocorrelation (Moran’s I index) analysis of Y-STR profiles. For the definition of marker kits, see main text and Table 1.
We selected d1 as the most appropriate distance measure for further analysis for all marker kits because it consistently
yielded the strongest correlation with the coalescent tree branch lengths (Fig. 3). Interestingly, the results
obtained for d1 were very similar to those obtained for the Manhattan distance. This suggests that additional weighting
of the absolute allele difference for a given Y-STR by minus the logarithm of the corresponding mutation rate (see
Section 2.3) did not significantly affect distance measure performance for the marker kits considered.
3.3. Dimensionality reduction
Employing distance measure d1, the Y-STR profiles were next subjected to MDS analysis separately for each
marker kit (Fig. 4). If Y-STR profiles were significantly more similar within than between YHRD metapopulations,
then Y-STR profiles from the same YHRD metapopulation should group together in an MDS plot and be visibly separated
from those of other YHRD metapopulations. However, visual inspection of the marker kit-specific MDS plots
provided no evidence that this was true for the major YHRD metapopulations (Fig. 4). Only in a few cases, mainly
for slowly mutating markers, was a certain clustering noticeable (e.g. for major YHRD metapopulations African
or Eskimo Aleut). In all other cases, the Y-STR profiles from a given major YHRD metapopulation were approximately
evenly distributed throughout the MDS plot of the entire data set. Similar results were obtained using UMAP
and t-SNE analyses for dimensionality reduction (Supplementary Figs. S.3 and S.4), and when considering certain
subgroups of the major YHRD metapopulations separately (Supplementary Figs. S.5-S.7).
Fig. 3: Benchmarking of pairwise Y-STR profile distance measures against BATWING. For each pair of Y-STR profiles, the respective distance
is plotted against the coalescent branch length estimated with BATWING, assuming an effective population size of 1,000. ρ̂: average Spearman’s
rank correlation coefficient, taken over five repetitions of the analysis. For the definition of marker kits, see main text and Table 1.
3.4. Clustering
3.4.1. Hierarchical clustering
A suitable agglomeration method for hierarchical clustering was selected for each marker kit by maximisation
of the respective agglomeration coefficient ac (see Section 2.5). Of the five agglomeration methods considered (‘average’,
‘single’, ‘complete’, ‘Ward’, and ‘weighted’), the Ward method consistently yielded both the highest and
the least variable ac values for the different marker kits (Table 2). Therefore, hierarchical clustering was performed
applying the Ward method to distance measure d1.
Fig. 4: Multidimensional scaling (MDS) analysis of Y-STR profiles. In each column, the Y-STR profiles from one major YHRD metapopulation
are highlighted in black. MDS1 (MDS2): first (second) MDS coordinate; n: number of Y-STR profiles included. For the definition of marker kits,
see main text and Table 1.
Table 2: Agglomeration coefficients (ac) obtained with different agglomeration methods.
Marker kit (a) | Average | Single | Complete | Ward | Weighted
Fast | 0.89 | 0.80 | 0.93 | 0.99 | 0.90
Medium | 0.93 | 0.92 | 0.96 | 0.99 | 0.95
Mixed | 0.91 | 0.84 | 0.95 | 0.99 | 0.93
Slow | 0.97 | 0.97 | 0.98 | 1.00 | 0.98
(a) For the definition of marker kits, see main text and Table 1.
For each marker kit, we determined the height of the hierarchical cluster dendrogram at which a given residual
number of clusters was reduced by one by the subsequent merger. Applying the elbow method [43] to a graphical
representation of these residual cluster numbers and merger dendrogram heights (Fig. 5), we assigned six, four, seven,
and five structurally evident clusters to the Y-STR profiles for the fast, medium, mixed, and slow mutating marker kit,
respectively. The corresponding cluster dendrograms are provided in Supplementary Fig. S.8.
The stability of the clusters identified in this way was assessed by bootstrapping (100 bootstrap samples per
marker kit), using the Jaccard index as a measure of cluster overlap. Several clusters were characterised by an average
Fig. 5: Residual cluster number and merger dendrogram height during hierarchical clustering. k: number of clusters present as identified by the
elbow method (also highlighted in orange); grey bars: decrease in dendrogram height when the residual cluster number is increased by one. For
the definition of marker kits, see main text and Table 1.
Table 3: Average maximum Jaccard indices obtained during bootstrapping-based assessment of cluster stability.
Marker kit (a) | Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4 | Cluster 5 | Cluster 6 | Cluster 7
Fast | 0.70 | 0.61 | 0.53 | 0.53 | 0.47 | 0.27 | -
Medium | 0.66 | 0.62 | 0.50 | 0.44 | - | - | -
Mixed | 0.65 | 0.55 | 0.50 | 0.45 | 0.43 | 0.41 | 0.29
Slow | 0.76 | 0.75 | 0.74 | 0.64 | 0.48 | - | -
(a) For the definition of marker kits, see main text and Table 1. Note: Structural clusters were
numbered post hoc according to cluster stability.
maximum Jaccard index, taken over the bootstrap samples, around or below 0.5 (Table 3), which means that, at best,
half of the Y-STR profiles from the cluster were, on average, also grouped together in bootstrap clusters. Furthermore,
the inferred clusters proved to be most stable for the slowly mutating marker kit, which seems plausible since the
allelic similarity and historical relatedness of Y-STR profiles are more likely to be correlated when mutation rates are
small.
We next determined how the Y-STR profiles from a given major YHRD metapopulation were distributed over the
marker kit-specific clusters identified (Fig. 6). With the exception of slowly mutating markers, the clusters showed
little major YHRD metapopulation specificity or, in cases where a cluster dominated one particular metapopulation,
it also did so for another, historically distant major YHRD metapopulation. These results suggest that major
YHRD metapopulations would be difficult to recover by agnostic clustering of Y-STR profiles alone, at least for
Fig. 6: Distribution of major YHRD metapopulations over identified Y-STR profile clusters. The height of each bar corresponds to the fraction of
Y-STR profiles in a given major YHRD metapopulation that belongs to a certain cluster. For the definition of marker kits, see main text and Table
1.
medium and fast mutating markers. For slowly mutating markers, in contrast, some major YHRD metapopulations
overlapped significantly with specific clusters and, with the exception of Eurasians, at least 50% of each major YHRD
metapopulation belonged to only a single cluster. Moreover, significant overlaps in cluster membership, for example
between Eskimo Aleut and Native Americans, usually appeared to be historically meaningful.
3.4.2. STRUCTURE analysis
Y-STR profiles were also grouped by STRUCTURE [47] analysis as described in [48, 49]. To determine the
number of Y-STR profile groups present, we applied the elbow method [43] to a cross-entropy criterion [49] each time
calculated for a fixed group number of between one and 15. Seven, three, nine, and ten groups were identified this way
Fig. 7: Grouping of Y-STR profiles by STRUCTURE analysis. The height of each bar corresponds to the fraction of profiles in a major YHRD
metapopulation that belongs to a certain STRUCTURE-derived group. For the definition of marker kits, see main text and Table 1.
for the fast, medium, mixed, and slow marker kit, respectively. Each Y-STR profile was then assigned to the group
with the highest probability of membership for that profile. Line plots of the STRUCTURE results (Supplementary
Fig. S.9) highlighted that many Y-STR profiles were characterised by a large probability of membership in one group.
In particular, ca. 79%, 77%, 69%, and 69% of YHRD profiles belonged to one group with probability ≥ 0.9 for the
fast, medium, mixed, and slow marker kit, respectively.
Similar to hierarchical clustering, STRUCTURE yielded groupings of Y-STR profiles that, in most cases, only
poorly matched the definition of major YHRD metapopulations (Fig. 7). Only a few major YHRD metapopulations
had more than 50% of their Y-STR profiles assigned to one and the same group. In view of the large number of
groups inferred for the mixed and slow marker kits, we extended the analysis for these marker kits so as to also cover
the predefined subgroups of major YHRD metapopulations (Supplementary Fig. S.10). Since STRUCTURE assigns
Y-STR profiles to groups only with a certain probability, we additionally repeated the STRUCTURE analyses only
for profiles that had a group membership probability ≥ 0.9 (Supplementary Fig. S.11). Both analyses yielded similar
results to the original.
4. Discussion
When equating the pairwise distance between Y-STR profiles to the sum of marker-wise allele differences, classical
cluster analysis suggests only weak inherent structure in the data currently included in YHRD. Furthermore,
some of the clusters identified in our study proved unstable when judged by the Jaccard indices obtained from comprehensive
bootstrapping. Notably, clusters derived for Y-STR profiles comprising slowly mutating markers exhibited
greater stability than others, a pattern consistent with previous observations for much smaller numbers of markers
[50–52]. Also consistent with other studies [37], the most pronounced differences in YHRD were observed between
Y-STR profiles of African and non-African origin, while Eurasian profiles tended to be most similar to one another.
Our results suggest that YHRD cannot be divided into clearly distinguishable subsamples that correspond strongly
to the metapopulations highlighted by YHRD itself, at least not by drawing solely upon the Y-STR profiles themselves.
On the contrary, the relationship between Y-STR haplotype and metapopulation membership proved to be highly
variable and dependent upon the markers considered. The metapopulations highlighted in YHRD are therefore not
well-defined genetic units. This finding does not call into question the concept of metapopulations per se, but rather
suggests that the genetic clock of the human Y chromosome has ticked too differently from the cultural and geographical
clocks for the traces they have left to coincide.
What does this mean for forensic practice, support of which is one of the essential - if not the most essential -
concerns of YHRD? Some of us have previously pointed out that equating database frequency estimates with match
probabilities in forensic casework requires that the database, or at least the part of it that is being used, represents a
meaningful case-specific suspect population [5]. At first glance, it seems obvious that the definition of this population
should not be based primarily upon genetic considerations to avoid circular reasoning. Instead, it is generally accepted
that the group of alternative trace donors is first narrowed down using non-genetic information (location, time, cultural
context, witness statements, etc.) and only then characterized in detail genetically [2]. In this second step, it can of
course be useful to favor genetic markers that distinguish the suspect population from other groups of males.
These considerations inevitably lead to the question of whether and under what circumstances a YHRD metapop-
ulation can be considered a plausible suspect population. First of all, “plausible” in this context must mean that the
definition of the metapopulation applies to the suspect population in question almost unchanged. Furthermore, this
applicability must be fully accepted by all parties involved in the case, including the defense. In our view, the whole
approach is therefore only viable if the criteria used to define YHRD metapopulations are made fully transparent in
advance to all stakeholders - which, to our knowledge, has not yet been the case.
In view of the general problems associated with the forensic use of population database information [5], alternative
approaches to evaluate trace-donor matches of Y-STR profiles have been proposed, including evolutionary simulation
of potential matches [4] and calculation of the exact match probability inside and outside the suspect pedigree [53].
However, it will certainly take some time before such novel approaches can be incorporated into forensic casework,
and the current practice of estimating profile frequencies using databases is therefore likely to be maintained for a
while. If this is the case, then the genetic information contained in YHRD must at least be used to select metapopula-
tions - to serve as suspect populations - in a way that is fair to the suspects while preserving the evidentiary value of
the database as much as possible. A two-step approach is conceivable for this.
First, the YHRD metapopulation with the (likely) strongest Y-chromosomal similarity to the suspect is identified
after restricting the Y-STR profile to the slowly mutating markers. Developing a suitable method for such a selection
admittedly requires further research. Be that as it may, if the selection is consistent with the non-genetic evidence for
a particular suspect population, it can also be considered fair to the suspect. “Fair” here refers to the principle of “in
dubio pro reo” (presumption of innocence) because a suspect profile generated for any other set of Y-STR markers
should also be quite frequent, if not most frequent, in the selected metapopulation. In the second step, the frequency
of the (much more suspect-specific) Y-STR profile comprising moderately to rapidly mutating markers is estimated in
the selected YHRD metapopulation and, if this is to be practiced despite theoretical reservations, equated to the match
probability. In the vast majority of cases, the persuasiveness of the result of the second step should be similar to that
of using the entire profile. Furthermore, since mutations at different Y-STRs are statistically independent, the use of
different markers for the initial identification and subsequent genetic characterization of a suspect population would
also help to avoid circular reasoning when evaluating trace-donor profile matches. Should the genetic and non-genetic
information point to very different suspect populations in step one, we recommend applying the approach to both
selections and presenting both results in court.
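The two steps can be sketched as follows. This is an illustrative sketch only: the text above leaves the step-one selection method open, so the rule used here (choosing the metapopulation in which the suspect's slow-marker sub-profile is most frequent) is a placeholder assumption, as are the marker names, the toy database, and all function names. Step two uses a plain relative frequency without any correction for sampling uncertainty.

```python
from typing import Dict, List, Sequence, Tuple

Profile = Dict[str, int]  # marker name -> allele (repeat count)


def sub_profile(profile: Profile, markers: Sequence[str]) -> Tuple[int, ...]:
    """Restrict a profile to the given markers."""
    return tuple(profile[m] for m in markers)


def relative_frequency(target: Profile, database: List[Profile],
                       markers: Sequence[str]) -> float:
    """Share of database profiles that match the target on the given markers."""
    if not database:
        return 0.0
    key = sub_profile(target, markers)
    hits = sum(1 for p in database if sub_profile(p, markers) == key)
    return hits / len(database)


def two_step(suspect: Profile, metapops: Dict[str, List[Profile]],
             slow: Sequence[str], fast: Sequence[str]) -> Tuple[str, float]:
    # Step 1 (placeholder rule, an assumption): select the metapopulation in
    # which the slow-marker sub-profile of the suspect is most frequent.
    selected = max(metapops,
                   key=lambda name: relative_frequency(suspect, metapops[name], slow))
    # Step 2: frequency of the fast-marker sub-profile in that metapopulation.
    return selected, relative_frequency(suspect, metapops[selected], fast)


# Hypothetical markers: S1, S2 slowly mutating; F1, F2 rapidly mutating.
slow, fast = ("S1", "S2"), ("F1", "F2")
metapops = {
    "A": [
        {"S1": 14, "S2": 13, "F1": 20, "F2": 31},
        {"S1": 14, "S2": 13, "F1": 21, "F2": 31},
        {"S1": 14, "S2": 13, "F1": 20, "F2": 31},
        {"S1": 15, "S2": 13, "F1": 19, "F2": 30},
    ],
    "B": [
        {"S1": 16, "S2": 12, "F1": 20, "F2": 31},
        {"S1": 14, "S2": 13, "F1": 22, "F2": 29},
        {"S1": 16, "S2": 12, "F1": 18, "F2": 30},
    ],
}
suspect = {"S1": 14, "S2": 13, "F1": 20, "F2": 31}
print(two_step(suspect, metapops, slow, fast))  # ('A', 0.5)
```

Because the two steps draw on disjoint marker sets, the profile information used to select the metapopulation is not reused when the frequency is estimated.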
Although our fundamental reservations about the approach taken so far for obtaining match probabilities remain,
and although YHRD offers only limited coverage for many populations, particularly non-Eurasian ones [54], we believe
that a change to the procedure as proposed here could represent a significant improvement in terms of its scientific
and legal irrefutability.
Appendix A. Distance properties of distance measures d1, d2, and d3
Any valid distance measure, d, must fulfil the following properties for Y-STR profiles x1, x2, and x3:
1. d(x1, x2) = 0 ⇔ x1 = x2 (Reflexivity).
2. d(x1, x2) ≥ 0 (Non-negativity).
3. d(x1, x2) = d(x2, x1) (Symmetry).
4. d(x1, x2) + d(x1, x3) ≥ d(x2, x3) (Triangle inequality).
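The four properties can also be checked empirically on a finite sample of profiles. The following Python sketch does so for an arbitrary candidate measure; the helper names and the example profiles are hypothetical. Passing such a check on a sample is of course no proof, which is why the properties are derived analytically below, but a failure exposes an invalid measure immediately.

```python
from itertools import permutations
from typing import Callable, Sequence, Tuple

Profile = Tuple[int, ...]


def is_valid_distance(d: Callable[[Profile, Profile], float],
                      profiles: Sequence[Profile], tol: float = 1e-12) -> bool:
    """Check properties 1 to 4 for all pairs and triples of the given profiles."""
    for x in profiles:
        for y in profiles:
            dxy = d(x, y)
            if dxy < -tol:                       # 2. non-negativity
                return False
            if (abs(dxy) <= tol) != (x == y):    # 1. reflexivity
                return False
            if abs(dxy - d(y, x)) > tol:         # 3. symmetry
                return False
    for x1, x2, x3 in permutations(profiles, 3):
        if d(x1, x2) + d(x1, x3) < d(x2, x3) - tol:  # 4. triangle inequality
            return False
    return True


def manhattan(x: Profile, y: Profile) -> float:
    """Sum of absolute allele differences."""
    return float(sum(abs(a - b) for a, b in zip(x, y)))


def squared(x: Profile, y: Profile) -> float:
    """Sum of squared allele differences (violates the triangle inequality)."""
    return float(sum((a - b) ** 2 for a, b in zip(x, y)))


sample = [(14, 13), (15, 13), (16, 13), (14, 15)]  # hypothetical two-marker profiles
print(is_valid_distance(manhattan, sample))  # True
print(is_valid_distance(squared, sample))    # False: 1 + 1 < 4 for (15,13), (14,13), (16,13)
```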
Measure d1
For d1, the first three properties follow from the fact that (i) $\log(\mu_j) < 0$ since $\mu_j \in (0, 1)$, (ii) $m_j \geq 0$ and $m_j = 0$
for all j if and only if the two Y-STR profiles in question are identical, and (iii) $m_j$ is itself symmetric.
Let $m_j^{\{i,k\}}$ denote the absolute allelic difference between $x_i$ and $x_k$ at the jth locus. Then
\begin{align*}
d_1(x_1, x_2) + d_1(x_1, x_3) &= \sum_{j=1}^{l} -m_j^{\{1,2\}} \log(\mu_j) + \sum_{j=1}^{l} -m_j^{\{1,3\}} \log(\mu_j) \\
&= \sum_{j=1}^{l} -\left( m_j^{\{1,2\}} + m_j^{\{1,3\}} \right) \log(\mu_j) \\
&\geq \sum_{j=1}^{l} -m_j^{\{2,3\}} \log(\mu_j) \\
&= d_1(x_2, x_3).
\end{align*}
Thus, the triangle inequality also holds for d1.
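The inequality step in this derivation rests on two facts: the absolute allelic differences themselves satisfy $m_j^{\{2,3\}} \leq m_j^{\{1,2\}} + m_j^{\{1,3\}}$ at every locus, and each weight $-\log(\mu_j)$ is positive. As a minimal sketch, d1 in the form used above can be computed as follows; the mutation rates and profiles are hypothetical values chosen for illustration.

```python
import math
from typing import Sequence


def d1(x: Sequence[int], y: Sequence[int], mu: Sequence[float]) -> float:
    """d1 = sum_j -m_j * log(mu_j), with m_j the absolute allelic difference."""
    if not (len(x) == len(y) == len(mu)):
        raise ValueError("profiles and mutation rates must have equal length")
    return sum(-abs(a - b) * math.log(m) for a, b, m in zip(x, y, mu))


# Hypothetical per-locus mutation rates: one slow and one faster marker.
mu = (0.001, 0.01)
x1, x2, x3 = (14, 30), (15, 30), (14, 28)

print(d1(x1, x2, mu))  # one step at the slow marker: -log(0.001), about 6.91
print(d1(x1, x3, mu))  # two steps at the faster marker: -2 * log(0.01), about 9.21
# Triangle inequality for this triple (here it holds with equality).
print(d1(x1, x2, mu) + d1(x1, x3, mu) >= d1(x2, x3, mu) - 1e-12)  # True
```

Note that a single-step difference at the slowly mutating marker contributes more to d1 than a single-step difference at the faster marker, because $-\log(\mu_j)$ grows as $\mu_j$ shrinks.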
Measure d2
Measure d2 satisfies the first three properties of a distance measure for the same reasons as d1 (see above). The
triangle inequality holds for d2 because
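\begin{align*}
d_2(x_1, x_2) + d_2(x_1, x_3) &= \sum_{j=1}^{l} -m_j^{\{1,2\}} \frac{1}{\log(1 - \mu_j)} + \sum_{j=1}^{l} -m_j^{\{1,3\}} \frac{1}{\log(1 - \mu_j)} \\
&= \sum_{j=1}^{l} -\left( m_j^{\{1,2\}} + m_j^{\{1,3\}} \right) \frac{1}{\log(1 - \mu_j)} \\
&\geq \sum_{j=1}^{l} -m_j^{\{2,3\}} \frac{1}{\log(1 - \mu_j)} \\
&= d_2(x_2, x_3).
\end{align*}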
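For comparison, d2 in the form used in this derivation weights each allelic difference by $-1/\log(1 - \mu_j)$, which is close to $1/\mu_j$ for small mutation rates. The sketch below, again with hypothetical mutation rates and profiles, makes this weighting explicit; `math.log1p` is used only for numerical accuracy at small $\mu_j$.

```python
import math
from typing import Sequence


def d2(x: Sequence[int], y: Sequence[int], mu: Sequence[float]) -> float:
    """d2 = sum_j -m_j / log(1 - mu_j), with m_j the absolute allelic difference."""
    if not (len(x) == len(y) == len(mu)):
        raise ValueError("profiles and mutation rates must have equal length")
    return sum(-abs(a - b) / math.log1p(-m) for a, b, m in zip(x, y, mu))


# Per-locus weight of a single-step difference for two hypothetical rates.
for rate in (0.001, 0.01):
    weight = d2((14,), (15,), (rate,))
    print(rate, weight, 1 / rate)  # the weight is slightly below 1 / rate

# Triangle inequality on a hypothetical triple of two-marker profiles.
mu = (0.001, 0.01)
x1, x2, x3 = (14, 30), (15, 30), (14, 28)
print(d2(x1, x2, mu) + d2(x1, x3, mu) >= d2(x2, x3, mu) - 1e-9)  # True
```

Compared with the logarithmic weights of d1, the weights of d2 differ far more strongly between slowly and rapidly mutating markers.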
Measure d3
Reflexivity and symmetry hold for d3 by definition. Further, even for rapidly mutating Y-STRs, it appears reasonable
to assume that $N_e \geq 1/(1 - 2\mu_j)$. Hence, $2\mu_j + 1/N_e \leq 1$ and $m_j + 1 \geq 1$, which implies non-negativity. Finally,
the triangle inequality also holds for d3 because if $m_j^{\{i,k\}} = 0$ for any j, i, and k, then $m_j^{\{i,p\}} = m_j^{\{k,p\}}$ for any p, and if
$m_j^{\{i,k\}} > 0$ for all j, i, and k, then
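\begin{align*}
d_3(x_1, x_2) + d_3(x_1, x_3) &= \sum_{j=1}^{l} \left( \log\left( m_j^{\{1,2\}} + 1 \right) - \log\left( 2\mu_j + \frac{1}{N_e} \right) \right) + \sum_{j=1}^{l} \left( \log\left( m_j^{\{1,3\}} + 1 \right) - \log\left( 2\mu_j + \frac{1}{N_e} \right) \right) \\
&\geq \sum_{j=1}^{l} \left( \log\left( m_j^{\{1,2\}} + 1 \right) + \log\left( m_j^{\{1,3\}} + 1 \right) - \log\left( 2\mu_j + \frac{1}{N_e} \right) \right) \\
&= \sum_{j=1}^{l} \left( \log\left( m_j^{\{1,2\}} m_j^{\{1,3\}} + m_j^{\{1,2\}} + m_j^{\{1,3\}} + 1 \right) - \log\left( 2\mu_j + \frac{1}{N_e} \right) \right) \\
&\geq \sum_{j=1}^{l} \left( \log\left( m_j^{\{1,2\}} + m_j^{\{1,3\}} + 1 \right) - \log\left( 2\mu_j + \frac{1}{N_e} \right) \right) \\
&\geq \sum_{j=1}^{l} \left( \log\left( m_j^{\{2,3\}} + 1 \right) - \log\left( 2\mu_j + \frac{1}{N_e} \right) \right) \\
&= d_3(x_2, x_3).
\end{align*}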
Appendix B. Motivation of distance measure d3
The following considerations are based upon [55, 56]. Let x1 and x2 be two single-locus Y-STR profiles whose
coalescence time T is to be estimated by some meaningful value $\hat{T}$. Let $m_i$, for $i \in \{1, 2\}$, denote the number
of mutations that $x_i$ has undergone since the most recent common ancestor of x1 and x2.
As an estimate $\hat{T}$, we want to derive the posterior expectation of T given the allele difference between x1 and x2.
To this end, we assume the following prior distribution for T:
\[
T \sim \mathrm{Exp}\left( \frac{1}{N_e} \right),
\]
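where $N_e$ is the effective population size [56]. If $\mu$ denotes the mutation rate, then the conditional distribution of $m_i$
given T equals
\[
m_i \mid T \sim \mathrm{Poisson}(\mu T)
\]
for $i \in \{1, 2\}$. Hence, the posterior distribution of T is given by
\begin{align*}
P(T \mid m_1 + m_2) &= \frac{P(m_1 + m_2 \mid T) P(T)}{P(m_1 + m_2)} \\
&\propto P(m_1 + m_2 \mid T) P(T) \\
&= \frac{(2\mu T)^{m_1 + m_2} e^{-2\mu T}}{(m_1 + m_2)!} \cdot \frac{1}{N_e} e^{-\frac{1}{N_e} T} \\
&\propto T^{m_1 + m_2} e^{-\left( 2\mu + \frac{1}{N_e} \right) T}.
\end{align*}
This means that $T \mid m_1 + m_2 \sim \Gamma\left( m_1 + m_2 + 1, 2\mu + \frac{1}{N_e} \right)$ with (posterior) expectation
\[
E[T \mid m_1 + m_2] = \frac{m_1 + m_2 + 1}{2\mu + \frac{1}{N_e}}.
\]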
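As a numerical sanity check of this closed form, the posterior expectation can be compared with a direct integration of the unnormalised posterior density $T^{m_1 + m_2} e^{-(2\mu + 1/N_e) T}$. The following Python sketch uses a simple trapezoidal rule; the parameter values, the integration range, and the function names are assumptions made for illustration only.

```python
import math


def posterior_mean_closed_form(m: int, mu: float, n_e: float) -> float:
    """Mean of Gamma(m + 1, 2*mu + 1/N_e), with m = m1 + m2."""
    return (m + 1) / (2 * mu + 1 / n_e)


def posterior_mean_numeric(m: int, mu: float, n_e: float,
                           upper: float, steps: int) -> float:
    """Trapezoidal estimate of E[T | m] from the unnormalised posterior."""
    rate = 2 * mu + 1 / n_e
    h = upper / steps
    num = den = 0.0
    for i in range(steps + 1):
        t = i * h
        w = 0.5 if i in (0, steps) else 1.0   # trapezoid end-point weights
        dens = t ** m * math.exp(-rate * t)   # unnormalised posterior density
        num += w * t * dens
        den += w * dens
    return num / den


# Hypothetical values: allele difference 2, mu = 0.002, N_e = 1000.
m, mu, n_e = 2, 0.002, 1000.0
print(posterior_mean_closed_form(m, mu, n_e))                            # approximately 600
print(posterior_mean_numeric(m, mu, n_e, upper=20000.0, steps=20000))    # approximately 600
```

The upper integration limit is chosen large enough that the neglected tail of the Gamma density is negligible for these parameter values.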
In case of l loci, the likelihood equals the product of the locus-specific terms so that the posterior expectation now
equals
\[
\hat{T} = \prod_{j=1}^{l} \frac{m_{1j} + m_{2j} + 1}{2\mu_j + \frac{1}{N_e}}.
\]
To avoid numerical overflow, we will take
\[
\log\left( \hat{T} \right) = \sum_{j=1}^{l} \left[ \log\left( m_{1j} + m_{2j} + 1 \right) - \log\left( 2\mu_j + \frac{1}{N_e} \right) \right]
\]
and hence, we propose distance measure
\[
d_3(x_1, x_2) = \sum_{j=1}^{l} \left( \log\left( m_{1j} + m_{2j} + 1 \right) - \log\left( 2\mu_j + \frac{1}{N_e} \right) \right) \mathbf{1}[m_{1j} + m_{2j} > 0].
\]
Pragmatically, we set m1 j + m2 j equal to the absolute allelic difference between x1 and x2 at the jth locus, thereby
knowingly disregarding the possibility of (undetectable) backwards mutations.
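A minimal Python sketch of d3 under this pragmatic convention is given below, together with an exhaustive check of the triangle inequality on a small grid of profiles. The mutation rates, the effective population size, and the allele grid are hypothetical, and the function names are ours; the check complements, but does not replace, the derivation in Appendix A.

```python
import math
from itertools import product
from typing import Sequence


def d3(x: Sequence[int], y: Sequence[int], mu: Sequence[float], n_e: float) -> float:
    """d3 with m_1j + m_2j set to the absolute allelic difference at locus j."""
    if not (len(x) == len(y) == len(mu)):
        raise ValueError("profiles and mutation rates must have equal length")
    total = 0.0
    for a, b, m in zip(x, y, mu):
        if 2 * m + 1 / n_e > 1:
            raise ValueError("assumption 2*mu_j + 1/N_e <= 1 is violated")
        diff = abs(a - b)
        if diff > 0:  # indicator: loci with identical alleles contribute nothing
            total += math.log(diff + 1) - math.log(2 * m + 1 / n_e)
    return total


def triangle_holds_on_grid(mu: Sequence[float], n_e: float, alleles=range(4)) -> bool:
    """Check d3(a,b) + d3(a,c) >= d3(b,c) for all triples on a small allele grid."""
    profiles = list(product(alleles, repeat=len(mu)))
    return all(
        d3(a, b, mu, n_e) + d3(a, c, mu, n_e) >= d3(b, c, mu, n_e) - 1e-12
        for a in profiles for b in profiles for c in profiles
    )


mu, n_e = (0.001, 0.01), 1000.0           # hypothetical values
print(d3((14, 30), (14, 30), mu, n_e))    # identical profiles: 0.0
print(d3((14, 30), (16, 30), mu, n_e))    # log(3) - log(0.003) = log(1000), about 6.91
print(triangle_holds_on_grid(mu, n_e))    # True
```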
References
in forensic analysis. Forensic Sci. Int. Genet., 48:102308, 2020. doi:10.1016/j.fsigen.2020.102308.
[3] C. H. Brenner. Fundamental problem of forensic mathematics - The evidential value of a rare haplotype. Forensic Sci. Int. Genet., 4(5):
281–291, 2010. doi:10.1016/j.fsigen.2009.10.013.
[4] Mikkel M. Andersen and David J. Balding. How convincing is a matching Y-chromosome profile? PLOS Genetics, 13(11):1–16, 2017.
doi:10.1371/journal.pgen.1007028.
[5] Amke Caliebe and Michael Krawczak. Match probabilities for Y-chromosomal profiles: A paradigm shift. Forensic Sci. Int. Genet., 37:
200–203, 2018. doi:10.1016/j.fsigen.2018.08.009.
[6] David J. Balding and Richard A. Nichols. DNA profile match probability calculation: how to allow for population stratification, relatedness,
database selection and single bands. Forensic Sci. Int., 64(2):125–140, 1994. doi:10.1016/0379-0738(94)90222-4.
[7] Mikkel Meyer Andersen and David J. Balding. Assessing the Forensic Value of DNA Evidence from Y Chromosomes and Mitogenomes.
Genes, 12(8), 2021. doi:10.3390/genes12081209.
[8] L. Roewer, M. Kayser, P. de Knijff, K. Anslinger, A. Betz, A. Caglià, D. Corach, S. Füredi, L. Henke, M. Hidding, H.J. Kärgel, R. Lessig,
M. Nagy, V.L. Pascali, W. Parson, B. Rolf, C. Schmitt, R. Szibor, J. Teifel-Greding, and M. Krawczak. A new method for the evaluation of
matches in non-recombining genomes: application to Y-chromosomal short tandem repeat (STR) haplotypes in European males. Forensic
Sci. Int., 114(1):31–43, 2000. doi:10.1016/S0379-0738(00)00287-5.
[9] Mikkel Meyer Andersen, Amke Caliebe, Arne Jochens, Sascha Willuweit, and Michael Krawczak. Estimating trace-suspect match probabili-
ties for singleton Y-STR haplotypes using coalescent theory. Forensic Sci. Int. Genet., 7(2):264–271, 2013. doi:10.1016/j.fsigen.2012.11.004.
[10] Sascha Willuweit and Lutz Roewer. Y chromosome haplotype reference database (YHRD): Update. Forensic Sci. Int. Genet., 1(2):83–87,
2007. doi:10.1016/j.fsigen.2007.01.017.
[11] Sven Gundlach, Olaf Junge, Lars Wienbrandt, Michael Krawczak, and Amke Caliebe. Comparison of Markov Chain Monte Carlo Software
for the Evolutionary Analysis of Y-Chromosomal Microsatellite Data. Computational and Structural Biotechnology Journal, 17:1082–1090,
2019. doi:10.1016/j.csbj.2019.07.014.
[12] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2024.
URL https://www.R-project.org/.
[13] Mikkel Meyer Andersen and Ian J. Wilson. rforensicbatwing: BATWING for Calculating Forensic Trace-Suspect Match
Probabilities, 2018. URL https://github.com/mikldk/rforensicbatwing. R package version 1.3.1, commit
d1585bfe1211b18693ce8d0e2dc0ae73a6ddeba5.
[14] Tomasz Konopka. umap: Uniform Manifold Approximation and Projection, 2023. URL https://CRAN.R-project.org/package=umap.
R package version 0.2.10.0.
[15] Jesse H. Krijthe. Rtsne: T-Distributed Stochastic Neighbor Embedding using Barnes-Hut Implementation, 2015. URL https://github.
com/jkrijthe/Rtsne. R package version 0.17.
[16] L.J.P. van der Maaten and G.E. Hinton. Visualizing High-Dimensional Data Using t-SNE. Journal of Machine Learning Research, 9(86):
2579–2605, 2008.
[17] L.J.P. van der Maaten. Accelerating t-SNE using Tree-Based Algorithms. Journal of Machine Learning Research, 15(93):3221–3245, 2014.
[18] Martin Maechler, Peter Rousseeuw, Anja Struyf, Mia Hubert, and Kurt Hornik. cluster: Cluster Analysis Basics and Extensions, 2024. URL
https://CRAN.R-project.org/package=cluster.
[19] Douglas Nychka, Reinhard Furrer, John Paige, and Stephan Sain. fields: Tools for spatial data, 2021. URL https://github.com/
dnychka/fieldsRPackage. R package version 16.3.1.
[20] Eric Frichot and Olivier Francois. LEA: an R package for Landscape and Ecological Association studies. Methods in Ecology and Evolution,
2015. URL http://membres-timc.imag.fr/Olivier.Francois/lea.html.
[21] Hadley Wickham, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, Alex
Hayes, Lionel Henry, Jim Hester, Max Kuhn, Thomas Lin Pedersen, Evan Miller, Stephan Milton Bache, Kirill Müller, Jeroen Ooms, David
Robinson, Dana Paige Seidel, Vitalie Spinu, Kohske Takahashi, Davis Vaughan, Claus Wilke, Kara Woo, and Hiroaki Yutani. Welcome to
the tidyverse. Journal of Open Source Software, 4(43):1686, 2019. doi:10.21105/joss.01686.
[22] Kevin Wright. pals: Color Palettes, Colormaps, and Tools to Evaluate Them, 2025. URL https://CRAN.R-project.org/package=
pals. R package version 1.10.
[23] Tal Galili. dendextend: an R package for visualizing, adjusting, and comparing trees of hierarchical clustering. Bioinformatics, 2015.
doi:10.1093/bioinformatics/btv428.
[24] Christian Hennig. fpc: Flexible Procedures for Clustering, 2024. URL https://CRAN.R-project.org/package=fpc. R package version
2.2-13.
[25] Alboukadel Kassambara. ggpubr: ’ggplot2’ Based Publication Ready Plots, 2023. URL https://CRAN.R-project.org/package=
ggpubr. R package version 0.6.0.
[26] Michael Hahsler, Kurt Hornik, and Christian Buchta. Getting things in order: An introduction to the R package seriation. Journal of Statistical
Software, 25(3):1–34, 2008. doi:10.18637/jss.v025.i03.
[27] Michael Hahsler, Christian Buchta, and Kurt Hornik. seriation: Infrastructure for Ordering Objects Using Seriation, 2025. URL https:
//CRAN.R-project.org/package=seriation. R package version 1.5.8.
[28] Christoph Glur. data.tree: General Purpose Hierarchical Data Structure, 2023. URL https://CRAN.R-project.org/package=data.
tree. R package version 1.1.0.
[29] Gábor Csárdi and Tamás Nepusz. The igraph software package for complex network research. InterJournal, Complex Systems:1695, 2006.
URL https://igraph.org.
[30] Gábor Csárdi, Tamás Nepusz, Vincent Traag, Szabolcs Horvát, Fabio Zanini, Daniel Noom, and Kirill Müller. igraph: Network Analysis and
Visualization in R, 2026. URL https://CRAN.R-project.org/package=igraph. R package version 2.1.4.
[31] Thomas Lin Pedersen. ggraph: An Implementation of Grammar of Graphics for Graphs and Networks, 2024. URL https://CRAN.
R-project.org/package=ggraph. R package version 2.2.1.
[32] Claus O. Wilke. cowplot: Streamlined Plot Theme and Plot Annotations for ’ggplot2’, 2024. URL https://CRAN.R-project.org/
package=cowplot. R package version 1.1.3.
[33] Tóra Oluffa Stenberg Olsen. Code for reproducing the results. doi:10.6084/m9.figshare.31286275.
[34] S. Willuweit and L. Roewer. The new Y Chromosome Haplotype Reference Database. Forensic Sci. Int. Genet., 15:43–48, 2015.
doi:10.1016/j.fsigen.2014.11.024.
[35] P. A. P. Moran. The Interpretation of Statistical Maps. Journal of the Royal Statistical Society. Series B (Methodological), 10(2):243–251,
1948.
[36] Lutz Roewer, Peter J. P. Croucher, Sascha Willuweit, Tim T. Lu, Manfred Kayser, Rüdiger Lessig, Peter de Knijff, Mark A. Jobling, Chris
Tyler-Smith, and Michael Krawczak. Signature of recent historical events in the European Y-chromosomal STR haplotype distribution. Hum.
Genet., 116:279–291, 2005. doi:10.1007/s00439-004-1201-z.
[37] Hongyang Xu, Chuan-Chao Wang, Rukesh Shrestha, Ling-Xiang Wang, Manfei Zhang, Yungang He, Judith R. Kidd, Kenneth K. Kidd,
Li Jin, and Hui Li. Inferring population structure and demographic history using Y-STR data from worldwide populations. Mol. Genet.
Genomics, 290:141–150, 2015. doi:10.1007/s00438-014-0903-8.
[38] Ian J. Wilson, Michael E. Weale, and David J. Balding. Inferences from DNA Data: Population Histories, Evolutionary Processes and
Forensic Match Probabilities. Journal of the Royal Statistical Society Series A: Statistics in Society, 166(2):155–188, 2003. doi:10.1111/1467-
985X.00264.
[39] J. C. Gower. Some Distance Properties of Latent Root and Vector Methods Used in Multivariate Analysis. Biometrika, 53(3/4):325–338,
1966. doi:10.2307/2333639.
[40] K. V. Mardia. Some properties of clasical multi-dimesional scaling. Communications in Statistics - Theory and Methods, 7(13):1233–1241,
1978. doi:10.1080/03610927808827707.
[41] J. Healy and L. McInnes. Uniform manifold approximation and projection. Nat Rev Methods Primers, 4(82), 2024. doi:10.1038/s43586-
024-00363-x.
[42] G. N. Lance and W. T. Williams. A General Theory of Classificatory Sorting Strategies: 1. Hierarchical Systems. The Computer Journal, 9
(4):373–380, 1967. doi:10.1093/comjnl/9.4.373.
[43] R.L. Thorndike. Who belongs in the family? Psychometrika, 18(4):267–276, 1953. doi:10.1007/BF02289263.
[44] Paul Jaccard. Distribution de la Flore Alpine dans le Bassin des Dranses et dans quelques régions voisines. Bulletin de la Societe Vaudoise
des Sciences Naturelles, 37:241–272, 1901. doi:10.5169/seals-266440.
[45] Christian Hennig. Cluster-wise assessment of cluster stability. Computational Statistics & Data Analysis, 52(1):258–271, 2007.
doi:10.1016/j.csda.2006.11.025.
[46] Christian Hennig. Dissolution point and isolation robustness: Robustness criteria for general cluster analysis methods. Journal of Multivariate
Analysis, 99(6):1154–1176, 2008. doi:10.1016/j.jmva.2007.07.002.
[47] Jonathan K Pritchard, Matthew Stephens, and Peter Donnelly. Inference of Population Structure Using Multilocus Genotype Data. Genetics,
155(2):945–959, 2000. doi:10.1093/genetics/155.2.945.
[48] O. François. Running Structure-like Population Genetic Analyses with R, 2016. R Tutorials in Population Genetics. U. Grenoble-Alpes.
[49] Eric Frichot, François Mathieu, Théo Trouillon, Guillaume Bouchard, and Olivier François. Fast and Efficient Estimation of Individual
Ancestry Coefficients. Genetics, 196(4):973–983, 2014. doi:10.1534/genetics.113.160572.
[50] Wentao Shi, Qasim Ayub, Mark Vermeulen, Rong guang Shao, Sofia Zuniga, Kristiaan van der Gaag, Peter de Knijff, Manfred Kayser,
Yali Xue, and Chris Tyler-Smith. A Worldwide Survey of Human Male Demographic History Based on Y-SNP and Y-STR Data from the
HGDP–CEPH Populations. Molecular Biology and Evolution, 27(2):385–393, 2010. doi:10.1093/molbev/msp243.
[51] Amalia Diaz-Lacava, Maja Walier, Sascha Willuweit, Thomas F. Wienker, Rolf Fimmers, Max P. Baur, and Lutz Roewer. Geostatistical
inference of main Y-STR-haplotype groups in Europe. Forensic Sci. Int. Genet., 5(2):91–94, 2011. doi:10.1016/j.fsigen.2010.09.010. Haploid
DNA markers in Forensic Genetics.
[52] Josephine Purps, Sabine Siegert, Sascha Willuweit, Marion Nagy, Cíntia Alves, Renato Salazar, Sheila M.T. Angustia, Lorna H. Santos, Katja
Anslinger, Birgit Bayer, Qasim Ayub, Wei Wei, Yali Xue, Chris Tyler-Smith, Miriam Baeta Bafalluy, Begoña Martínez-Jarreta, Balazs Egyed,
Beate Balitzki, Sibylle Tschumi, David Ballard, Denise Syndercombe Court, Xinia Barrantes, Gerhard Bäßler, Tina Wiest, Burkhard Berger,
Harald Niederstätter, Walther Parson, Carey Davis, Bruce Budowle, Helen Burri, Urs Borer, Christoph Koller, Elizeu F. Carvalho, Patricia M.
Domingues, Wafaa Takash Chamoun, Michael D. Coble, Carolyn R. Hill, Daniel Corach, Mariela Caputo, Maria E. D’Amato, Sean Davison,
Ronny Decorte, Maarten H.D. Larmuseau, Claudio Ottoni, Olga Rickards, Di Lu, Chengtao Jiang, Tadeusz Dobosz, Anna Jonkisz, William E.
Frank, Ivana Furac, Christian Gehrig, Vincent Castella, Branka Grskovic, Cordula Haas, Jana Wobst, Gavrilo Hadzic, Katja Drobnic, Katsuya
Honda, Yiping Hou, Di Zhou, Yan Li, Shengping Hu, Shenglan Chen, Uta-Dorothee Immel, Rüdiger Lessig, Zlatko Jakovski, Tanja Ilievska,
Anja E. Klann, Cristina Cano García, Peter de Knijff, Thirsa Kraaijenbrink, Aikaterini Kondili, Penelope Miniati, Maria Vouropoulou, Lejla
Kovacevic, Damir Marjanovic, Iris Lindner, Issam Mansour, Mouayyad Al-Azem, Ansar El Andari, Miguel Marino, Sandra Furfuro, Laura
Locarno, Pablo Martín, Gracia M. Luque, Antonio Alonso, Luís Souto Miranda, Helena Moreira, Natsuko Mizuno, Yasuki Iwashima, Rodrigo
S. Moura Neto, Tatiana L.S. Nogueira, Rosane Silva, Marina Nastainczyk-Wulf, Jeanett Edelmann, Michael Kohl, Shengjie Nie, Xianping
Wang, Baowen Cheng, Carolina Núñez, Marian Martínez de Pancorbo, Jill K. Olofsson, Niels Morling, Valerio Onofri, Adriano Tagliabracci,
Horolma Pamjav, Antonia Volgyi, Gusztav Barany, Ryszard Pawlowski, Agnieszka Maciejewska, Susi Pelotti, Witold Pepinski, Monica
Abreu-Glowacka, Christopher Phillips, Jorge Cárdenas, Danel Rey-Gonzalez, Antonio Salas, Francesca Brisighelli, Cristian Capelli, Ulises
Toscanini, Andrea Piccinini, Marilidia Piglionica, Stefania L. Baldassarra, Rafal Ploski, Magdalena Konarzewska, Emila Jastrzebska, Carlo
Robino, Antti Sajantila, Jukka U. Palo, Evelyn Guevara, Jazelyn Salvador, Maria Corazon De Ungria, Jae Joseph Russell Rodriguez, Ulrike
Schmidt, Nicola Schlauderer, Pekka Saukko, Peter M. Schneider, Miriam Sirker, Kyoung-Jin Shin, Yu Na Oh, Iulia Skitsa, Alexandra Ampati,
Tobi-Gail Smith, Lina Solis de Calvit, Vlastimil Stenzl, Thomas Capal, Andreas Tillmar, Helena Nilsson, Stefania Turrina, Domenico De
Leo, Andrea Verzeletti, Venusia Cortellini, Jon H. Wetton, Gareth M. Gwynne, Mark A. Jobling, Martin R. Whittle, Denilce R. Sumita,
Paulina Wolańska-Nowak, Rita Y.Y. Yong, Michael Krawczak, Michael Nothnagel, and Lutz Roewer. A global analysis of Y-chromosomal
haplotype diversity for 23 STR loci. Forensic Sci. Int. Genet., 12:12–23, 2014. doi:10.1016/j.fsigen.2014.04.008.
[53] Amke Caliebe, Dion Zandstra, Arwin Ralf, Manfred Kayser, and Michael Krawczak. A novel mathematical framework for pedigree-based
calculation of Y-STR match probabilities. Sci Rep, 15:14651, 2025. doi:10.1038/s41598-025-98644-2.
[54] Rita Costa, Jennifer Fadoni, António Amorim, and Laura Cainé. Y-STR Databases-Application in Sexual Crimes. Genes, 16(5):484, 2025.
doi:10.3390/genes16050484.
[55] Simon Tavaré, David J. Balding, R.C. Griffiths, and Peter Donnelly. Inferring Coalescence Times From DNA Sequence Data. Genetics, 145
(2):505–518, 1997. doi:10.1093/genetics/145.2.505.
[56] Bruce Walsh. Estimating the Time to the Most Recent Common Ancestor for the Y chromosome or Mitochondrial DNA for a Pair of
Individuals. Genetics, 158(2):897–912, 2001. doi:10.1093/genetics/158.2.897.