Estimating mutation rates between evolutionarily related sequences is a central problem in molecular
evolution. Due to the rapid expansion of datasets, modern methods avoid costly alignment and instead focus
on comparing sketches of sets of constituent k-mers. While these methods perform well on many sequences,
they are not robust to highly repetitive sequences such as centromeres. In this paper, we present three new
estimators that are robust to the presence of repeats. The estimators are applicable in different settings, based
on whether they need count information from zero, one, or both of the sequences. We evaluate our estimators
empirically using highly repetitive alpha satellite sequences. Our estimators each perform best in their class
and our strongest estimator outperforms all other tested estimators. Our software is open-source and freely
available on https://github.com/medvedevgroup/Accurate_repeat-aware_kmer_based_estimator.
Key words: mutation rates, substitution rate estimation, centromere sequences
1. Introduction
Estimating mutation rates between evolutionarily related
sequences has long been a central problem in molecular evolution,
originating well before the advent of large-scale genomics. Early
quantitative methods concentrated on amino acid substitution
rates, such as the PAM matrices introduced by Dayhoff (1969)
and the BLOSUM matrices developed by Henikoff and Henikoff
(1992). These methodologies, along with later profile-based hidden
Markov models (Durbin, 2013) continue to serve as the benchmark
when high-quality alignments can be obtained.
Nevertheless, the rapid growth of sequencing has made
computationally intensive alignment-based pipelines increasingly
infeasible in the modern era. As a result, alignment-free methods
that characterize sequences using low-cost summary statistics have
become essential (Song et al., 2014; Zielezinski et al., 2017; Rathore
and Kashyap, 2026). Most of these techniques are based on
sketches of k-mer spectra. Widely used tools such as Mash (Ondov
et al., 2016) and Skmer (Sarmashghi et al., 2019), along with
more recent sketch-adjusted approaches including Sylph (Shaw
and Yu, 2024) and FracMinHash-based methods (Hera et al., 2024,
2023; Hera and Koslicki, 2025; Hera et al., 2025; Shaw and Yu,
2023; Wu et al., 2025), enable rapid construction of whole-genome
phylogenies, efficient metagenomic screening, and the estimation
of millions of pairwise point-mutation rates in minutes rather than
days.
Nearly all alignment-free approaches are derived based on
the assumption that most k-mers above a certain k-mer size
(e.g. k ≥ 19) occur only once in a sequence. However, recent
advances in sequencing technology are leading to more abundantly
available highly repetitive sequences. For example, the recent
telomere-to-telomere human assembly contains fully assembled
chromosome centromeres, which are alpha satellite DNA composed
of 171-bp monomers that are further arranged into higher-order
repeats (Logsdon et al., 2024). Unfortunately, most estimators are
not robust to repeat-rich sequences and methods to analyze the
mutation rates between such sequences remain in their infancy.
We can categorize the space of k-mer-based estimators based
on the type of information they use. In the absence of repeats, it
suffices to consider the set of k-mers present in the sequences and
ignore their occurrence counts. We call these type of estimators
as Presence-Presence, because they rely on presence/absence
information for both the source string s and the mutated string
t. Such estimators are especially useful in the setting where
occurrence counts are not readily available, such as raw sequencing
data. In the presence of repeats, however, occurrence counts
become an important signal. In the Presence-Count setting, an
estimator is restricted to presence/absence information for s but is
allowed to use occurrence counts oft. This can occur if for example
s is unassembled sequencing data while t is an assembly. The
Count-Count setting is the most powerful, allowing the estimator
to use information about counts in both s and t, but is limited to
applications such as when both s and t are assembled.
©The Author 2022. Published by Oxford University Press. All rights reserved. For permissions, please e-mail:
[email protected] 1
.CC-BY 4.0 International licensemade available under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is
The copyright holder for this preprintthis version posted April 5, 2026. ; https://doi.org/10.64898/2026.04.01.715966doi: bioRxiv preprint
2 Wu and Medvedev
Name Knowledge ofk-mers ins Knowledge ofk-mers int Formula
ˆqpp Presence/absence Presence/absence |sp(t)\sp(s)|
L
ˆqpc Presence/absence Counts
X
τ∈sp(t)\sp(s)
occ(τ, t)
L
ˆqcc Counts Counts ˆqpc + (1−ˆrpc)k−1 · ˆrpc
3L ·
X
τ∈sp(s)
occ(τ, s)·d 1(τ, s)
Table 1. Our contributed estimators. Here, s in arbitrary string with L = |s| − k + 1; t is the result of applying a mutation process to s with substitution
rate r; d1(τ, s) is the number of k-mers in the spectrum of s that are at a Hamming distance of one to τ. We use q as shorthand for 1 − (1 − r)k, e.g.
an estimator ˆq implicitly defines ˆr = 1 − (1 − ˆq)1/k.
The widely known Mash estimator falls in the Presence-
Presence category, but, as we show in this paper, is not repeat-
robust. There are two repeat-robust k-mer-based estimators that
we are aware of. The first is from our previous work (Wu et al.,
2025), which, as we will describe in Sec. 7, falls roughly in
between the Presence-Presence and the Presence-Count setting.
The second is a weighted-intersection-based estimator, which falls
in the Count-Count setting. This is a natural estimator that has
been mentioned in Rhie et al. (2020).
In this paper, we present three new estimators (Table 1): ˆqpp,
for the Presence-Presence setting, ˆ qpc, for the Presence-Count
setting, and ˆqcc, for the Count-Count setting. One of our main
insights is that the number of newly created k-mers is more
sensitive to the presence of repeats than the number of k-mers
that remain shared (or, as reflected in the title, we treat novel
k-mers as a gift to make use of).
We evaluate our estimators empirically using various types
of sequences, including the alpha satellite centromeric region
from a human chromosome. We evaluate estimator error across
a wide range of mutation rates and k-mer values. Moreover,
we demonstrate how our estimators can be combined with
FracMinHash sketching without systematically effecting bias.
Ultimately, each of our estimators performs best in their class,
with ˆqcc outperforming estimators in all classes. Finally, we show
how our estimators can be used on real data by applying them to
compute the ANI (Average Nucleotide Identity), which is widely
used to measure genomic distance in taxonomic analysis.
2. Preliminaries
Let s be a string and let k > 0 be the k-mer size. We let |s|
denote the number of nucleotides in s. We assume in this paper
that |s| ≥ k. We use L to denote the number of k-mers in s, i.e.
L = |s| − k + 1. Let sp (s) be the set of all distinct k-mers in s,
also called the spectrum of s. Let occ(τ, s) denote the number of
copies of k-mer τ in string s. Let di(τ, s) denote the number of
k-mers in sp(s) with Hamming distance i to k-mer τ. The Jaccard
similarity between two sets A and B is J(A, B) ≜ |A∩B|
|A∪B| .
We will consider the simple substitution mutation model as in
previous work (Blanca et al., 2022). Given a parameter 0≤r≤1
and a strings, the model generates an equal-length string where,
independently, the character at each position is unchanged from
swith probability 1−rand changed to one of the three other
nucleotides with probabilityr/3.
Our goal in this paper is to estimate the mutation ratergiven
observations aboutsandt. We assume that, at a minimum, we
can observeL,sp(s), andsp(t). Depending on the category of the
estimator, we may also observeocc(τ, s) and/orocc(τ, t), for allτ.
As a shorthand notation, we defineq≜1−(1−r) k; intuitively,
qis the probability that ak-mer is mutated. In this work and
others, estimators forrare usually derived by first obtaining an
estimator ˆqforqand then computing the estimator ˆrusing the
inverse formula ˆr= 1−(1−ˆq) 1/k.We take the same approach in
this paper, where we will explicitly define an estimator forqand
leave the definition ofrimplicit using the above formula.
Thebiasof an estimator ˆqis defined asE[ˆq]−q. An unbiased
estimator will on average return the correct value. In the case of
mutation rate estimators, it is usually easier to derive the bias of ˆq
rather than ˆr. Even when there is a closed-form expression for the
bias of ˆq, it does not lead to a closed-form expression for the bias
of ˆr, because ˆris not linear in ˆq. In this paper, we will derive the
bias of ourqestimators when possible. However, we will ultimately
rely on experimental results to judge the bias of our estimators.
In this paper, we will take themethod-of-momentsapproach
to derive our estimators (Wasserman, 2013). We first decide on
a random variable to observe, e.g.I pp =|sp(s)∩sp(t)|. We
then derive its expectation (possibly approximating it under some
assumptions), e.g.E[I pp]≈L(1−q). We then take the observed
value (denoted byI pp
obs), plug it into the expectation formula, and
solve the formula forqto get the estimate. In our example, ˆqis the
solution to the equationI pp
obs =L(1−ˆq), which is ˆq= 1−I pp
obs/L.
3. The Presence-Presence Setting
In this section we consider the setting where the only thing
we know aboutsandtare their spectra andL, i.e. no count
information. We propose the following estimator
ˆqpp = N pp
L ,
whereN pp =|sp(t)\sp(s)|is the number of new distinctk-mers
generated whensmutates intot. We will use the notation whereby
a superscript of “pp” indicates the Presence-Presence setting,
“pc” indicates the Presence-Count setting, and “cc” indicates
the Count-Count setting. In this section, we explain how ˆq pp is
derived and compare it to other known estimators for the Presence-
Presence setting. We do not attempt to derive the bias of ˆq pp, as
it is technically complicated.
Consider thek-span model, which is to assume that
.CC-BY 4.0 International licensemade available under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is
The copyright holder for this preprintthis version posted April 5, 2026. ; https://doi.org/10.64898/2026.04.01.715966doi: bioRxiv preprint
Estimators of substitution rates 3
1.sdoes not contain anyk-mer that occur more than once, and
2. newk-mers generated after mutations are distinct from each
other and from thek-mers ins.
Let us defineI pp ≜|sp(s)∩sp(t)|and use the shorthandJ≜
J(sp(s), sp(t)). In thek-span model, we have thatE[I pp] =L(1−
q) and the method-of-moments approach gives the estimator
ˆq= L−I pp
obs
L .(1)
This estimator was introduced in Wu et al. (2025). Furthermore,
in this model,J=I pp/(2L − I pp) and as a result we have that
1−J
1+J = L−I pp
L . Thus, we can equivalently write
ˆq= 1−J obs
1 +J obs
.(2)
This is an improved version of the Mash estimator (Ondov et al.,
2016), described in Sarmashghi et al. (2019) and in Appendix A.6
of Belbasi et al. (2022). Finally, in this model we have thatN pp +
I pp =L, so we can equivalently write
ˆq= N pp
obs
L ,(3)
which corresponds the definition of our new estimator ˆq pp. The
three versions of ˆqhave the same derivation and are algebraically
equivalent in thek-span model. However, when applied to data
violating thek-span assumptions, we will see that three versions
produce different results and ˆqpp outperforms the others (Sec. 7).
The intuition for ˆqpp is based on considering what happens
when assumption 1 is violated, i.e. there are repeats. Letτbe a
k-mer with at least two occurrences insand letνbe ak-mer that
appears exactly once ins. Every mutation leads to a newk-mer in
N pp, regardless of whether it happens in an occurrence ofτorν.
On the other hand, a mutation in one of the copies ofτdoes not
effectI pp while a mutation inνincreasesN pp by one. This makes
Eq. 1 and Eq. 2, which rely onI pp, less accurate than Eq.3.
4. The Presence-Count Setting
In this section, we consider the Presence-Count setting and derive
our estimator ˆqpc and its bias. Given two stringssandt, we define
N pc ≜
X
τ∈sp(t)\sp(s)
occ(τ, t).
Intuitively,N pc is the number of newk-mers generated when
smutates intot. We will derive ˆq pc by applying a method-of-
moments approach toN pc. Let us first define
Rj(τ, s)≜d j(τ, s)(1−r) k−j
r
3
j
andR(τ, s)≜
kX
j=1
Rj(τ, s)
Rj(τ, s) is the probability thatτmutates to ak-mer that is ins
and has a Hamming distance ofjtoτ.R(τ, s) is the probability
thatτmutates to ak-mer that is insbut is different fromτ. Note
that we definedR j andRin terms ofrsince it is visually clearer,
but we can equivalently think of them as functions ofq. We can
now deriveE[N pc].
Lemma 1Letsbe a string and lettbe a string generated from
susing the mutation process parameterized byr. Then
E[N pc] =Lq−
X
τ∈sp(s)
occ(τ, s)R(τ, s).(4)
ProofLetHD(τ, ν) denote the Hamming distance betweenk-mers
τandν. For 1≤i≤L, lets i be thek-mer starting at position
iofs. LetX i be an indicator random variable representing the
event thats i mutated to ak-mer that does not appear ins, i.e.,
ti /∈s. We can expressN pc as a sumX is as follows.
N pc =
X
τ∈sp(t)\sp(s)
occ(τ, t) =
LX
i=1
1[t i /∈sp(s)] =
LX
i=1
Xi,
where1is the indicator function for an event. Then,
E[N pc] =
LX
i=1
Pr[Xi = 1]
=
LX
i=1
(1−Pr[X i = 0])
=L−
LX
i=1
X
τ∈sp(s)
Pr[ti =τ]
=L−
LX
i=1
Pr[ti =s i] +
X
τ∈sp(s)\s i
Pr[ti =τ]
=L−
LX
i=1
1−q+
X
τ∈sp(s)\s i
Pr[ti =τ]
=Lq−
LX
i=1
X
τ∈sp(s)\s i
Pr[ti =τ]
=Lq−
LX
i=1
kX
j=1
X
τ∈sp(s)\s i
s. t.
HD(τ,t i)=j
Pr[ti =τ]
=Lq−
LX
i=1
kX
j=1
dj(si, s)(1−r) k−j(r/3)j
=Lq−
LX
i=1
R(si, s)
=Lq−
X
τ∈sp(s)
occ(τ, s)R(τ, s)
□
A straightforward method-of-moments estimator based on
Eq. 4 would be to observe the value ofN pc (denoted asN pc
obs), let
f(r) =Lq− P
τ∈sp(s) occ(τ, s)R(τ, s), and use numerical methods
to solve the equationN pc
obs =f(r) forr. However, there is no
guarantee for the uniqueness of the solution. Furthermore, it would
be time-consuming and superfluous to computeR(τ, s) for allτ.
Instead, we derive an estimator based on an approximation.
Consider the two terms of Eq. 4. The first term is the expected
number of positions whosek-mer mutated. This may over-count
N pc, and the second term corrects this by accounting for the
possibility that a position mutates but to something that is already
.CC-BY 4.0 International licensemade available under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is
The copyright holder for this preprintthis version posted April 5, 2026. ; https://doi.org/10.64898/2026.04.01.715966doi: bioRxiv preprint
4 Wu and Medvedev
ins. As we expect the first term to dominate, we approximate
E[N pc]≈Lq, leading to the estimator
ˆqpc = N pc
obs
L .
Unlike ˆqpp, this formula accounts for the possibility that two
occurrences of ak-mer insmutate to the same newk-mer in
t. The bias of ˆqpc follows immediately from Lemma 1:
Theorem 1The bias ofˆq pc is
E[ˆqpc]−q=−
kX
i=1
X
τ∈sp(s)
occ(τ, s)Ri(τ, s)
L .
Note that the bias is always negative, meaning that, on average,
ˆqpc underestimates the true value. Moreover, the bias is a sum ofk
negative terms, which leads to the intuition of reducing the bias by
including one of the terms inside the estimator itself—something
we pursue in the next section.
5. The Count-Count Setting
In this section, we first present a novel Count-Count estimator ˆrcc
and then describe an alternative Count-Count estimator proposed
by Rhie et al. (2020) in the context of assembly quality assessment.
5.1. New Estimator ˆqcc
Let us build on ˆ qpc by trying to include the first term of the
bias (i.e. −P
τ occ(τ, s)R1(τ, s)/L) into the estimator itself. With
the method-of-moments approach, we would need to solve the
following equation for r:
N pc
obs = Lq − (1 − r)k−1 · r
3 ·
X
τ∈s
occ(τ, s)·d 1(τ, s).(5)
Unfortunately, we cannot solve this equation analytically and we
cannot guarantee that a numerical method would produce a unique
solution. Instead, we will first compute the ˆrpc estimator and then
plug into the right hand side of Eq.5. Our estimator is then defined
as
ˆqcc = N pc
obs
L + (1−ˆrpc)k−1 · ˆrpc
3L ·
X
τ∈sp(s)
occ(τ, s)·d 1(τ, s)
While this approach of plugging in one estimator in order to derive
another one makes it challenging to prove anything, we will see
that it performs extraordinarily well in practice.
To compute ˆqcc, we need to computed 1(τ, s) for allτ∈sp(s).
We do this in a straightforward two-pass algorithm using a hash
table, resulting in runtime linear inL. There are more advanced
ways of doing this (e.g. using suffix arrays), but we were not
concerned with optimizing runtime since it was below one second.
5.2. Estimator mentioned in Rhie et al. (2020)
Rhie et al. (2020) mention an estimator that we recapitulate here
to show how it fits in our framework. It is designed in the spirit
of Eq. 1 by relying on the intersection size but also integrating
count information. Consider the size of the weighted intersection
between thek-mers ofsandt:
I cc ≜
X
τ∈sp(s)∪sp(t)
min(occ(τ, s), occ(τ, t)).
Let us assume that when a mutation occurs, the newk-mer is not
ins. Before any mutations,I cc =L. When ak-merτmutates
to anotherk-merτ ′, it increasesocc(τ ′, t) by one and decreases
occ(τ, t) by one. Sinceτ ′ is not ins,occ(τ ′, t) does not contribute
anything toI cc. Therefore, the only effect of a mutation onI cc
is to decrease it by one. By linearity of expectation, we can add
the probability of this happening for each of theL(non-distinct)
k-mers and get thatE[I cc]≈L(1−q). The method-of-moments
approach then gives the estimator
ˆqwi ≜1−I cc/L.
6. Combination With Sketching
A major reason why many estimators are based onk-mers
rather than on the full sequences is that it makes them easily
amenable to sketching. Sketching is a powerful technique that can
make it possible to quickly compute all-pairs estimates on large
datasets (Ondov et al., 2016). AFracMinHash sketchsp θ(s) of a
sequencesis defined as the subset ofsp(s) that includes only the
k-mers that map below a pre-defined thresholdθ, under a fixed
random hash function (Hera et al., 2023).
In this section, we will present a modification of ˆq pp, ˆqpc and
ˆqcc that work on data sketched with FracMinHash. Instead of
observingN pp orN pc, we now only observe them restricted to
the sketchedk-mers. Formally, let us define
N pp
θ =|sp θ(t)\sp θ(s)|andN pc
θ =
X
τ∈sp θ (t)\spθ (s)
occ(τ, t).
Intuitively, we expect each of these quantities to decrease by a
factor ofθrelative to their non-sketched versions. Based on this
intuition, we define ˆqθ
pp and ˆqθ
pc as ˆqθ
pp = N pp
θ
θL and ˆqθ
pc = N pc
θ
θL .
Formalizing our intuition, we prove that sketching does not effect
the bias.
Theorem 2The biases ofˆq θ
pp andˆqθ
pc are
E[ˆqθ
pp]−q=E[ˆq pp]−qandE[ˆq θ
pc]−q=E[ˆq pc]−q.
ProofFirst, we show thatE[N pc
θ ] =θ·E[N pc]. As before, letX i
be an indicator random variable representing the event thats i
mutated to ak-mer that does not appear ins. LetY τ be a binary
random variable and it is 1 ifk-merτhashes to less thanθ. By
the linearity, we have
E[N pc
θ ] =
LX
i=1
Pr[Ysi = 1∩X i = 1] =
LX
i=1
Pr[Ysi]·Pr[X i = 1]
=
LX
i=1
θ·Pr[X i = 1] =θ·E[N pc]
Then,
E[ˆqθ
pc]−q=E
N pc
θ
θ·L
−q= E[N pc]
L −q=E[ˆq pc]−q
.CC-BY 4.0 International licensemade available under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is
The copyright holder for this preprintthis version posted April 5, 2026. ; https://doi.org/10.64898/2026.04.01.715966doi: bioRxiv preprint
Estimators of substitution rates 5
The proof for the bias of ˆqpp is similar and is omitted.□
Because ˆqθ
pc shows the same bias as ˆqpc, we use the same idea of
Sec. 5.1 to obtain a stronger estimator ˆqθ
cc by partially correcting
the bias of ˆqθ
pc, i.e.
ˆqθ
cc = ˆqθ
pc + (1−ˆrθ
pc)k−1 · ˆrθ
pc
3L ·
X
τ∈sp(s)
occ(τ, s)·d 1(τ, s).
Note thatP
τ∈s occ(τ, s)·d 1(τ, s)/Lneeds to be precomputed
prior to sketching. It does not increase the space needed for the
sketch as it becomes just a single constant that needs to be stored.
Because of the variance introduced by sketching, the observed
quantity can exceedθLand then any of our three estimators can
exceed 1. We therefore cap both estimators so that they never
return greater than 1.
7. Empirical Results
We aim to evaluate the accuracy of our three novel estimators in
relation to each other as well as to the other estimators mentioned
in this paper. We also show an application to real data, where
our estimators can be used to predict the ANI between various
genomes. Our software is available on GitHub (Wu and Medvedev,
2025).
7.1. Datasets
We use four different base sequences with various levels of
repetitiveness, summarized in Table S1. We use these sequences as
representative examples spanning different levels of repetitiveness.
Our main text evaluation is focused on the D-hardest sequence,
which is a 100kbp-long sequences of alpha satellite DNA extracted
from the human T2T chr21 centromere. We use a value ofk= 30
for our evaluation on this sequence, as consistent with previous
analyses (Wu et al., 2025), leading to 3,987 distinct 30-mers. Over
70% of the distinctk-mers in D-hardest occur more than once and
itsk-mers have on average at least one otherk-mer at a Hamming
distance of one. D-hardest violates both assumption 1 (i.e. because
it has repeats) and assumption 2 (i.e. because it has pairs ofk-
mers with small Hamming distances, ak-mer can mutate into one
that is already ins). Since the differences between the estimators
are more pronounced on this sequence, our main text focuses on
D-hardest. The results on the three other sequences are presented
in Sec. A of the Supplementary; they are all consistent with our
findings on D-hardest but with less pronounced differences on less
repetitive sequences.
7.2. Evaluation Metrics
We focus our evaluation on the estimators’ accuracy, measuring
both their bias and variance. We do not perform a runtime or
memory analysis because they each complete in less than a second
in total on all of the four datasets and use negligible memory.
First, we benchmark each estimator on the four datasets using
their default values ofk, i.e. the values chosen in Wu et al. (2025)
as most suitable for their analysis. We vary the mutation rater
from 0.001 to 0.251 and we show the distribution of ˆrfor 100
mutation-process simulation replicates for eachr(e.g. Fig. 1).
These experiments give a fine-grained separate view of the bias
and variance.
Second, we vary bothkandrand, for each (k, r) pair, compute
the average relative absolute error: 1
n
Pn
i=1
|ˆri−r|
r (e.g. Fig. 2 and
Table 2). We usen= 100 replicates for each (k, r) pair. This
error combines the bias and variance into one metric, enabling
us to easily visualize accuracy in two dimensions. Note that the
two types of benchmarks emphasize different aspects of estimator
performance.
As observed in our previous work (Wu et al., 2025), whenr
and/orkbecome sufficiently large, allk-mers mutate with very
high probability, causing all tested estimators to return the value
1. We refer to this unstable behavior asblow-upand it manifests
as high error in the top-right corner of the heatmaps.
7.3. Presence-Presence Setting
We compare ˆrpp with the two other estimators in this category:
the estimator defined by Eq. 1, which we refer to as ˆrobl, and the
estimator defined by Eq. 2, which we refer to as ˆrmash (this is the
widely used Mash estimator with a binomial correction). Fig. 1A
shows that fork= 30, ˆrpp dominates ˆrmash across nearly all tested
mutation rates; ˆrpp dominates ˆrobl at lower values ofrand has
similar variance and bias forr >0.161.
Fig. 2A presents heatmaps of estimator accuracy across a wide
range ofkandr. As expected, all estimators exhibit blow-up
behavior for sufficiently largekandr. However, ˆr pp performs
substantially better than ˆrmash and ˆrobl at smaller mutation rates.
7.4. Presence-Count and Count-Count Setting
For the Presence-Count setting, ˆr pc is the only estimator that
we are aware of, while for the Count-Count setting, we have
our new estimator ˆrcc and the ˆrwi estimator. Fig. 1B shows the
performance of these estimators withk= 30. and Fig. 2B shows
the performance across a wide range ofkandr. Table 2 shows
specific values of errors under different settings ofkandr.
First, we note that all estimators in these settings have smaller
or similar bias and error than the ˆrpp estimator, underscoring the
general power of using count information.
Second, in the Count-Count setting, ˆr cc outperforms ˆrwi for
bothk= 30 and more broadly across most tested (k, r) values.
Although ˆrcc and ˆrwi have access to the samek-mer count
information, ˆrcc achieves nearly unbiased estimation for all tested
values ofratk= 30.
Third, in the Presence-Count setting, ˆr pc does not have a
direct competitor, so we compare it against ˆr pp, ˆrcc, and ˆrwi.
Compared to the ˆrpp estimator of the Presence-Presence setting,
ˆrpc explicitly accounts for the event that multiplek-mers can
mutate into the same novelk-mer, resulting in a smaller bias than
ˆrpp. The estimator ˆrwi further addresses certain cases in which a
k-mer mutates into anotherk-mer already present in the original
sequence and, consequently, exhibits a slightly smaller bias than
ˆrpc forr <0.051. Nevertheless, ˆr pc and ˆrwi have very similar
performance. Compared to ˆrcc, ˆrpc does not perform as well, which
is expected since ˆrcc is designed to specifically offset some of the
bias of ˆrpc.
7.5. Comparison Against Estimator From Wu et al. (2025)
In a previous work, we tackled a similar problem and developed a
single repeat-robust estimator (Wu et al., 2025). We will refer to
it as ˆrwu here. It was the firstk-mer-based estimator evaluated for
accuracy in highly repetitive settings, showing robustness in these
settings compared to ˆqobl. It does not neatly fit into the categories
here, as it uses the abundance histogram ofs, i.e. the histogram of
k-mer occurrence counts. While this is based on the count of the
.CC-BY 4.0 International licensemade available under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is
The copyright holder for this preprintthis version posted April 5, 2026. ; https://doi.org/10.64898/2026.04.01.715966doi: bioRxiv preprint
6 Wu and Medvedev
(A)
0.001 0.101 0.201
r
-0.04
0.00
0.04
0.08
Error (r r)
rpp
rmash
robl
(B)
0.001 0.101 0.201
r
-0.03
-0.02
-0.01
0.00
0.01
0.02
0.03
0.04
0.05
Error (r r)
rpc
rwi
rcc
(C)
0.001 0.061 0.121 0.181 0.241
r
-0.03
-0.02
-0.01
0.00
0.01
0.02
0.03
Error (r r)
rwu
rpp
rpc
rcc
Fig. 1: Comparison of estimator accuracy on D-hardest, withk= 30 and mutation raterfrom 0.001 to 0.251 with step size 0.02. For
each value ofr, we show box plots over 100 mutation replicates.(A)Comparison of ˆr pp against other presence–presence estimators.(B)
Comparison of ˆrpc against ˆrwi and ˆrcc.(C)Comparison of our three novel estimators against each other and against the ˆr wu estimator
of Wu et al. (2025).
.CC-BY 4.0 International licensemade available under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is
The copyright holder for this preprintthis version posted April 5, 2026. ; https://doi.org/10.64898/2026.04.01.715966doi: bioRxiv preprint
Estimators of substitution rates 7
(A)
8 16 24 32
k
0.001 0.121 0.241 0.361
r
r_pp
8 16 24 32
k
r_mash
8 16 24 32
k
r_obl
0.0
0.2
0.4
0.6
0.8
1.0
Error
(B)
8 16 24 32
k
0.001 0.121 0.241 0.361
r
r_pc
8 16 24 32
k
r_wi
8 16 24 32
k
r_cc
0.00
0.02
0.04
0.06
0.08
0.10
Error
Fig. 2: The estimators accuracy on D-hardest as a function of bothkandr. Each cell shows the average relative absolute error of 100
replicates. Note that the scale of the heatmap is different between the top and bottom panels. Moreover, the errors are capped at 1.0
(for the panel A) and 0.10 (for the panel B), e.g., all errors greater than 1.0 are shown as 1.0 in the top panel.
Table 2.Errors of estimators under different parameter settings. Each cell shows the average relative absolute error of 100 replicates. Bold values highlight
the lowest error in the respective class, while red/italic values highlight the lowest error overall. Underline denotes estimators we introduce in this paper.
Theˆrwu estimator falls in between the Presence-Presence and Presence-Count categories.
r= 0.001r= 0.010r= 0.100
Estimator Category k= 16k= 24 k = 32 k= 16k= 24 k = 32 k= 16k= 24 k = 32
ˆrobl
Presence-Presence
202 130 94 19 12 8.7 1.1 .46 .18
ˆrmash 14 11 8.9 6.5 4.6 3.5 .64 .20.013
ˆrpp .089.083.083 .14 .095 .073 .23 .097.044
ˆrwu .88 .73 .67 .30 .20 .20 .37 .13 .037
ˆrpc Presence-Count .084 .084 .083 .035 .029 .025 .032 .018 .017
ˆrwi Count-Count .082.084.081 .030 .027 .024 .028 .017 .017
ˆrcc .083 .085.081 .027 .024 .023 .010 .011.014
k-mers ofs, it is only a summary and can be approximated using
a related species or a related type of sequence. In our framework,
therefore, ˆrwu lies somewhere between the Presence-Presence and
the Count-Presence setting. Fig. 1C compares ˆr wu against our
three estimators fork= 30. Table 2 and Fig. S1 show the error
across different settings ofkandr.
Despite relying on less information, ˆr pp is overall a better
estimator than ˆrwu. The relative bias depends onr: ˆr pp has
slightly smaller bias for 0.001≤r≤0.091 and has slightly larger
bias for 0.101≤r≤0.231 (Fig. 1C). However, ˆr wu often has more
variance than ˆrpp, as reflected in both Fig. 1C and the bigger
relative absolute errors shown in Table 2 and Fig. S1.
The improvement of ˆr pc over ˆrwu is more stark, as ˆr pc
consistently has lower bias than ˆrwu (Fig. 1C) and has consistently
lower overall error across all values ofkandr(Table 2 and Fig. S1).
The superior performance of ˆr pc relative to ˆrwu highlights the
benefits of accounting for novelk-mers in the estimate formula, as
ˆrwu does not take into account anyk-mers intthat are not ins.
7.6. Combination With Sketching Techniques
In Sec. 6, we proved that sketching does change the bias of ˆq pp
and ˆqpc. Here, we evaluate empirically the bias of all our three
estimators. Fig. S2 shows the performance of ˆr θ
pp, ˆrθ
pc, and ˆrθ
cc
usingθ∈ {0.1,0.01}. We observe that sketching does not introduce
any systematic bias, in any of the three estimators. As expected
with sketching, we observe increased variance forθ= 0.1 and even
larger variance forθ= 0.01. Overall, our results confirm that our
estimators can be applied naturally to sketched data, with the
obvious caveat that smaller sketches will lead to larger variance of
the estimator.
.CC-BY 4.0 International licensemade available under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is
The copyright holder for this preprintthis version posted April 5, 2026. ; https://doi.org/10.64898/2026.04.01.715966doi: bioRxiv preprint
8 Wu and Medvedev
Table 3. Results on the ANI benchmark. The number of uncomputable
pairs is the number of pairs for which the estimator either returns 0 or
does not return anything. Pearson R value (Pearson) and mean absolute
error (MAE) are shown for a subset of pairs with ANI> 85%, to avoid
penalizing for uncomputable pairs.
Tool n. uncomputable Pearson MAE
skani 144 0.9929 0.28
FastANI 128 0.9968 0.19
Mash 11 0.9923 0.40
Sourmash 5 0.9909 0.54
1 - ˆrθ
pc 3 0.9904 0.47
1 - ˆrθ
cc 3 0.9904 0.477.7. ANI Estimation on Real Genomes
To evaluate the applicability of our estimators on real data, we
use 1−ˆr θ
pc and 1−ˆr θ
cc to estimate ANI between real genomes.
We use the slow but accurate alignment-based OrthoANIu (Yoon
et al., 2017) to compute the ground truth, consistent with previous
work (Shaw and Yu, 2023).
We evaluate using a dataset from Hera et al. (2023). They
first chose ten representative genomes from the Genome Taxonomy
Database, including seven bacterial and three archaeal species. For
each representative genome, they extract additional genomes from
along the evolutionary path to the root, selecting an additional
three non-representative genomes at each taxonomic rank on the
path up the tree. They construct pairs for comparison by matching
each representative genome with the genomes selected along its
evolutionary path up the tree. We further filtered out three pairs
that had accession numbers that were not currently available. The
resulting dataset contains 189 pairs and covers a wide range of
ANI from∼60% to 100%.
We benchmark our estimators against those of of Mash (Ondov
et al., 2016), sourmash (Irber et al., 2024), FastANI (Jain et al.,
2018), and skani (Shaw and Yu, 2023). FastANI and skani rely
on mapping techniques with seeds; we run them with default
parameters. Sourmash and Mash are both similar to our estimators
in that they rely onk-mer sketches. We therefore setk= 19 for our
tools and sourmash and Mash. Sourmash also uses FracMinHash,
so we setθ= 0.01 for both sourmash and our estimators. Mash
uses MinHash sketching instead of FracMinHash, so we set the
size of its sketch to be 30,000 to roughly match the sketch sizes
obtained with FracMinHash ofθ= 0.01. Aside from this, we use
the default parameters for sourmash and Mash. Table S2 in the
Supplementary gives the full commands and data used.
Fig. S3 shows the estimator results relative to the ground truth.
For some pairs at ANI<85%, some of the estimators either do not
report an estimate or report 0; we refer to these asuncomputable
pairs. In order to make a fair comparison, we measure both the
number of uncomputable pairs and the accuracy of the predictions
at ANI>85% (Table 3).
The most accurate estimators at high ANI levels are skani
and FastANI; however, they are not able to estimate ANI for
more than 100 pairs at lower ANI levels. On the other hand, our
estimators are the most comprehensive and are able to compute
all except three pairs. Overall, there is a general pattern that more
comprehensive estimators are less accurate at high ANI levels,
making the choice of the best estimator a trade-off.
8. Discussion and Conclusion
In this paper, we studied the problem of estimating the mutation
rate of a process that transforms an arbitrary stringsinto a string
tby introducing substitutions at rater. We focused on estimators
that are based onk-mers, as they can easily be combined with
sketching to make them scalable to large datasets. We observed
that various estimators for this problem, including our own, can be
categorized according to whether they have access tok-mer counts
or to only presence/absence information. We presented three novel
estimators, ˆqpp, ˆqpc, and ˆqcc, summarized in Table 1. On highly
repetitive data, ˆqpp and ˆqcc perform best in their category, with
ˆqcc outperforming all tested estimators in all categories. The ˆq pp
estimator is desirable when no count information is available,
while the ˆqpc estimator offers a middle ground trade-off between
accuracy and the amount of information required from the original
and mutated sequences.
The main insight behind our ˆq pp and ˆqpc estimators is that
in a repeat setting, it is important to count thek-mers that are
newly created intand do not appear ins. We reflect this in the
title, i.e.,novelk-mers are agiftthat we must make use of. In
many cases, such as ˆq mash and ˆqwu, the formula relies on the
observed number of sharedk-mers. This works fine in the absence
of repeats or mutations resulting in spurious matches, as there
is a one-to-one correspondence betweenk-mers removed from the
intersection and newly createdk-mers. However, when there are
repeats, a mutation in a repetitivek-merτonly destroys one copy
ofτand does not removeτfrom the count of sharedk-mers. On
the other hand, it does add to the count of newk-mers, asτ ′ is
added as a newk-mer tot. We show that these types of events
hurt the performance ˆqmash and ˆqwu, though the effect on ˆqwu is
compensated by its use of counts. Our ˆqpp and ˆqpc estimators, on
the other hand, are fully based on the count of newk-mers, giving
them a performance advantage.
In the Count-Count setting, ˆq wi already accounts for this
insight; in order to out-perform it, we add some accounting for
the possibility that ak-mer mutates into ak-mer that is already
ins. Consider the possibility that ak-merτinsmutates to a
k-merυwhile ak-merυinsmutates intoτ. The likelihood of this
is not directly related to repeats, as it can happen in a repeat-free
genome; it is related to twok-mers inshaving a small Hamming
distance between them. The ˆq wi estimator does not model the
chance of this event happening. Our estimator ˆq cc models this
event happening for the case that the Hamming distance is one;
we believe this accounts for its improved performance.
Our mutation model is naturally an idealized version of reality
and does not explicitly account for factors such as ploidy or indels.
Nevertheless, our estimators can be used as part of downstream
tools. For example, the Merqury tool (Rhie et al., 2020) computes
the quality of a candidate assembly by comparing itsk-mer content
to thek-mer content of the sequencing data used to validate it. The
score that Merqury reports is closely related to our ˆrpc estimator,
as it essentially treats the assembly as a mutated stringtfrom
the original genomes. Thek-mer counts oftare given by the
assembly; to approximate the presence-absence spectrum ofs,
Merqury uses higher-copyk-mers from the read set. Note that
what we are describing here differs from the ˆrwi estimator, which
is also alluded to in Rhie et al. (2020).
A natural question arising from our work is how to choosekand
θ. Guidance on selectingkis partially addressed by our heatmaps,
which display estimator accuracy across a grid of (k, r) values for
.CC-BY 4.0 International licensemade available under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is
The copyright holder for this preprintthis version posted April 5, 2026. ; https://doi.org/10.64898/2026.04.01.715966doi: bioRxiv preprint
Estimators of substitution rates 9
both repetitive and typical genomic settings. These plots allow
users to identify the stable operating regime for their sequence
type. Additionally, in our previous work (Wu et al., 2025), we
introducedP empty as a heuristic stability criterion: given sequence
lengthL, mutation rater, andk,P empty quantifies the probability
that ak-mer has no surviving copies after mutation, providing a
principled threshold below which the estimator is expected to be
unstable. The choice ofθ(the sampling rate for sketching) has
been widely studied in the sketching literature, though not for
our new estimators in particular. Our analysis here (e.g. Fig. S2)
can give a rough guidance for choosingθ, but a more rigorous
analysis of sketch estimator stability in repetitive regions remains
an interesting problem.
Our insights open the door for the continued improvement of
estimators. One particular direction is the Count-Presence setting
which we have not discussed in this paper. Note that this setting
is not symmetric to the Presence-Count setting; e.g. the counts in
sare a deterministic variable while the counts intare a random
variable. It remains an open problem to determine if estimators
in the Count-Presence setting can outperform estimators in the
Presence-Count setting. As more repetitive sequences become
available, we expect future work to uncover new insights to further
drive the improved quality of mutation rate estimators.
Acknowledgments:We thank Antonio Blanca for useful
discussions. This material is based upon work supported by
the National Science Foundation under Grants No. DBI2138585
and OAC1931531. Research reported in this publication was
supported by the National Institute Of General Medical Sciences
of the National Institutes of Health under Award Number
R01GM146462. The content is solely the responsibility of the
authors and does not necessarily represent the official views of
the National Institutes of Health.