{"paper_id":"15bf0fc0-e714-4c00-88bc-dedff2e427ae","body_text":"Journal Title Here,2022, 1–3\ndoi: DOI HERE\nThe gift of novelty: repeat-robustk-mer-based\nestimators of mutation rates\nHaonan Wu 1 and Paul Medvedev 1,2,3†\n1Department of Computer Science and Engineering, 2Department of Biochemistry and Molecular Biology\nand 3Huck Institutes of the Life Sciences, Pennsylvania State University\n†\nCorresponding author: pzm11@psu.edu\nAbstract\nEstimating mutation rates between evolutionarily related sequences is a central problem in molecular\nevolution. Due to the rapid expansion of datasets, modern methods avoid costly alignment and instead focus\non comparing sketches of sets of constituent k-mers. While these methods perform well on many sequences,\nthey are not robust to highly repetitive sequences such as centromeres. In this paper, we present three new\nestimators that are robust to the presence of repeats. The estimators are applicable in different settings, based\non whether they need count information from zero, one, or both of the sequences. We evaluate our estimators\nempirically using highly repetitive alpha satellite sequences. Our estimators each perform best in their class\nand our strongest estimator outperforms all other tested estimators. Our software is open-source and freely\navailable on https://github.com/medvedevgroup/Accurate_repeat-aware_kmer_based_estimator.\nKey words: mutation rates, substitution rate estimation, centromere sequences\n1. Introduction\nEstimating mutation rates between evolutionarily related\nsequences has long been a central problem in molecular evolution,\noriginating well before the advent of large-scale genomics. Early\nquantitative methods concentrated on amino acid substitution\nrates, such as the PAM matrices introduced by Dayhoff (1969)\nand the BLOSUM matrices developed by Henikoff and Henikoff\n(1992). These methodologies, along with later profile-based hidden\nMarkov models (Durbin, 2013) continue to serve as the benchmark\nwhen high-quality alignments can be obtained.\nNevertheless, the rapid growth of sequencing has made\ncomputationally intensive alignment-based pipelines increasingly\ninfeasible in the modern era. As a result, alignment-free methods\nthat characterize sequences using low-cost summary statistics have\nbecome essential (Song et al., 2014; Zielezinski et al., 2017; Rathore\nand Kashyap, 2026). Most of these techniques are based on\nsketches of k-mer spectra. Widely used tools such as Mash (Ondov\net al., 2016) and Skmer (Sarmashghi et al., 2019), along with\nmore recent sketch-adjusted approaches including Sylph (Shaw\nand Yu, 2024) and FracMinHash-based methods (Hera et al., 2024,\n2023; Hera and Koslicki, 2025; Hera et al., 2025; Shaw and Yu,\n2023; Wu et al., 2025), enable rapid construction of whole-genome\nphylogenies, efficient metagenomic screening, and the estimation\nof millions of pairwise point-mutation rates in minutes rather than\ndays.\nNearly all alignment-free approaches are derived based on\nthe assumption that most k-mers above a certain k-mer size\n(e.g. k ≥ 19) occur only once in a sequence. However, recent\nadvances in sequencing technology are leading to more abundantly\navailable highly repetitive sequences. For example, the recent\ntelomere-to-telomere human assembly contains fully assembled\nchromosome centromeres, which are alpha satellite DNA composed\nof 171-bp monomers that are further arranged into higher-order\nrepeats (Logsdon et al., 2024). Unfortunately, most estimators are\nnot robust to repeat-rich sequences and methods to analyze the\nmutation rates between such sequences remain in their infancy.\nWe can categorize the space of k-mer-based estimators based\non the type of information they use. In the absence of repeats, it\nsuffices to consider the set of k-mers present in the sequences and\nignore their occurrence counts. We call these type of estimators\nas Presence-Presence, because they rely on presence/absence\ninformation for both the source string s and the mutated string\nt. Such estimators are especially useful in the setting where\noccurrence counts are not readily available, such as raw sequencing\ndata. In the presence of repeats, however, occurrence counts\nbecome an important signal. In the Presence-Count setting, an\nestimator is restricted to presence/absence information for s but is\nallowed to use occurrence counts oft. This can occur if for example\ns is unassembled sequencing data while t is an assembly. The\nCount-Count setting is the most powerful, allowing the estimator\nto use information about counts in both s and t, but is limited to\napplications such as when both s and t are assembled.\n©The Author 2022. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com 1\n.CC-BY 4.0 International licensemade available under a \n(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is \nThe copyright holder for this preprintthis version posted April 5, 2026. ; https://doi.org/10.64898/2026.04.01.715966doi: bioRxiv preprint \n\n2 Wu and Medvedev\nName Knowledge ofk-mers ins Knowledge ofk-mers int Formula\nˆqpp Presence/absence Presence/absence |sp(t)\\sp(s)|\nL\nˆqpc Presence/absence Counts\nX\nτ∈sp(t)\\sp(s)\nocc(τ, t)\nL\nˆqcc Counts Counts ˆqpc + (1−ˆrpc)k−1 · ˆrpc\n3L ·\nX\nτ∈sp(s)\nocc(τ, s)·d 1(τ, s)\nTable 1. Our contributed estimators. Here, s in arbitrary string with L = |s| − k + 1; t is the result of applying a mutation process to s with substitution\nrate r; d1(τ, s) is the number of k-mers in the spectrum of s that are at a Hamming distance of one to τ. We use q as shorthand for 1 − (1 − r)k, e.g.\nan estimator ˆq implicitly defines ˆr = 1 − (1 − ˆq)1/k.\nThe widely known Mash estimator falls in the Presence-\nPresence category, but, as we show in this paper, is not repeat-\nrobust. There are two repeat-robust k-mer-based estimators that\nwe are aware of. The first is from our previous work (Wu et al.,\n2025), which, as we will describe in Sec. 7, falls roughly in\nbetween the Presence-Presence and the Presence-Count setting.\nThe second is a weighted-intersection-based estimator, which falls\nin the Count-Count setting. This is a natural estimator that has\nbeen mentioned in Rhie et al. (2020).\nIn this paper, we present three new estimators (Table 1): ˆqpp,\nfor the Presence-Presence setting, ˆ qpc, for the Presence-Count\nsetting, and ˆqcc, for the Count-Count setting. One of our main\ninsights is that the number of newly created k-mers is more\nsensitive to the presence of repeats than the number of k-mers\nthat remain shared (or, as reflected in the title, we treat novel\nk-mers as a gift to make use of).\nWe evaluate our estimators empirically using various types\nof sequences, including the alpha satellite centromeric region\nfrom a human chromosome. We evaluate estimator error across\na wide range of mutation rates and k-mer values. Moreover,\nwe demonstrate how our estimators can be combined with\nFracMinHash sketching without systematically effecting bias.\nUltimately, each of our estimators performs best in their class,\nwith ˆqcc outperforming estimators in all classes. Finally, we show\nhow our estimators can be used on real data by applying them to\ncompute the ANI (Average Nucleotide Identity), which is widely\nused to measure genomic distance in taxonomic analysis.\n2. Preliminaries\nLet s be a string and let k > 0 be the k-mer size. We let |s|\ndenote the number of nucleotides in s. We assume in this paper\nthat |s| ≥ k. We use L to denote the number of k-mers in s, i.e.\nL = |s| − k + 1. Let sp (s) be the set of all distinct k-mers in s,\nalso called the spectrum of s. Let occ(τ, s) denote the number of\ncopies of k-mer τ in string s. Let di(τ, s) denote the number of\nk-mers in sp(s) with Hamming distance i to k-mer τ. The Jaccard\nsimilarity between two sets A and B is J(A, B) ≜ |A∩B|\n|A∪B| .\nWe will consider the simple substitution mutation model as in\nprevious work (Blanca et al., 2022). Given a parameter 0≤r≤1\nand a strings, the model generates an equal-length string where,\nindependently, the character at each position is unchanged from\nswith probability 1−rand changed to one of the three other\nnucleotides with probabilityr/3.\nOur goal in this paper is to estimate the mutation ratergiven\nobservations aboutsandt. We assume that, at a minimum, we\ncan observeL,sp(s), andsp(t). Depending on the category of the\nestimator, we may also observeocc(τ, s) and/orocc(τ, t), for allτ.\nAs a shorthand notation, we defineq≜1−(1−r) k; intuitively,\nqis the probability that ak-mer is mutated. In this work and\nothers, estimators forrare usually derived by first obtaining an\nestimator ˆqforqand then computing the estimator ˆrusing the\ninverse formula ˆr= 1−(1−ˆq) 1/k.We take the same approach in\nthis paper, where we will explicitly define an estimator forqand\nleave the definition ofrimplicit using the above formula.\nThebiasof an estimator ˆqis defined asE[ˆq]−q. An unbiased\nestimator will on average return the correct value. In the case of\nmutation rate estimators, it is usually easier to derive the bias of ˆq\nrather than ˆr. Even when there is a closed-form expression for the\nbias of ˆq, it does not lead to a closed-form expression for the bias\nof ˆr, because ˆris not linear in ˆq. In this paper, we will derive the\nbias of ourqestimators when possible. However, we will ultimately\nrely on experimental results to judge the bias of our estimators.\nIn this paper, we will take themethod-of-momentsapproach\nto derive our estimators (Wasserman, 2013). We first decide on\na random variable to observe, e.g.I pp =|sp(s)∩sp(t)|. We\nthen derive its expectation (possibly approximating it under some\nassumptions), e.g.E[I pp]≈L(1−q). We then take the observed\nvalue (denoted byI pp\nobs), plug it into the expectation formula, and\nsolve the formula forqto get the estimate. In our example, ˆqis the\nsolution to the equationI pp\nobs =L(1−ˆq), which is ˆq= 1−I pp\nobs/L.\n3. The Presence-Presence Setting\nIn this section we consider the setting where the only thing\nwe know aboutsandtare their spectra andL, i.e. no count\ninformation. We propose the following estimator\nˆqpp = N pp\nL ,\nwhereN pp =|sp(t)\\sp(s)|is the number of new distinctk-mers\ngenerated whensmutates intot. We will use the notation whereby\na superscript of “pp” indicates the Presence-Presence setting,\n“pc” indicates the Presence-Count setting, and “cc” indicates\nthe Count-Count setting. In this section, we explain how ˆq pp is\nderived and compare it to other known estimators for the Presence-\nPresence setting. We do not attempt to derive the bias of ˆq pp, as\nit is technically complicated.\nConsider thek-span model, which is to assume that\n.CC-BY 4.0 International licensemade available under a \n(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is \nThe copyright holder for this preprintthis version posted April 5, 2026. ; https://doi.org/10.64898/2026.04.01.715966doi: bioRxiv preprint \n\nEstimators of substitution rates 3\n1.sdoes not contain anyk-mer that occur more than once, and\n2. newk-mers generated after mutations are distinct from each\nother and from thek-mers ins.\nLet us defineI pp ≜|sp(s)∩sp(t)|and use the shorthandJ≜\nJ(sp(s), sp(t)). In thek-span model, we have thatE[I pp] =L(1−\nq) and the method-of-moments approach gives the estimator\nˆq= L−I pp\nobs\nL .(1)\nThis estimator was introduced in Wu et al. (2025). Furthermore,\nin this model,J=I pp/(2L − I pp) and as a result we have that\n1−J\n1+J = L−I pp\nL . Thus, we can equivalently write\nˆq= 1−J obs\n1 +J obs\n.(2)\nThis is an improved version of the Mash estimator (Ondov et al.,\n2016), described in Sarmashghi et al. (2019) and in Appendix A.6\nof Belbasi et al. (2022). Finally, in this model we have thatN pp +\nI pp =L, so we can equivalently write\nˆq= N pp\nobs\nL ,(3)\nwhich corresponds the definition of our new estimator ˆq pp. The\nthree versions of ˆqhave the same derivation and are algebraically\nequivalent in thek-span model. However, when applied to data\nviolating thek-span assumptions, we will see that three versions\nproduce different results and ˆqpp outperforms the others (Sec. 7).\nThe intuition for ˆqpp is based on considering what happens\nwhen assumption 1 is violated, i.e. there are repeats. Letτbe a\nk-mer with at least two occurrences insand letνbe ak-mer that\nappears exactly once ins. Every mutation leads to a newk-mer in\nN pp, regardless of whether it happens in an occurrence ofτorν.\nOn the other hand, a mutation in one of the copies ofτdoes not\neffectI pp while a mutation inνincreasesN pp by one. This makes\nEq. 1 and Eq. 2, which rely onI pp, less accurate than Eq.3.\n4. The Presence-Count Setting\nIn this section, we consider the Presence-Count setting and derive\nour estimator ˆqpc and its bias. Given two stringssandt, we define\nN pc ≜\nX\nτ∈sp(t)\\sp(s)\nocc(τ, t).\nIntuitively,N pc is the number of newk-mers generated when\nsmutates intot. We will derive ˆq pc by applying a method-of-\nmoments approach toN pc. Let us first define\nRj(τ, s)≜d j(τ, s)(1−r) k−j\n\u0010 r\n3\n\u0011j\nandR(τ, s)≜\nkX\nj=1\nRj(τ, s)\nRj(τ, s) is the probability thatτmutates to ak-mer that is ins\nand has a Hamming distance ofjtoτ.R(τ, s) is the probability\nthatτmutates to ak-mer that is insbut is different fromτ. Note\nthat we definedR j andRin terms ofrsince it is visually clearer,\nbut we can equivalently think of them as functions ofq. We can\nnow deriveE[N pc].\nLemma 1Letsbe a string and lettbe a string generated from\nsusing the mutation process parameterized byr. Then\nE[N pc] =Lq−\nX\nτ∈sp(s)\nocc(τ, s)R(τ, s).(4)\nProofLetHD(τ, ν) denote the Hamming distance betweenk-mers\nτandν. For 1≤i≤L, lets i be thek-mer starting at position\niofs. LetX i be an indicator random variable representing the\nevent thats i mutated to ak-mer that does not appear ins, i.e.,\nti /∈s. We can expressN pc as a sumX is as follows.\nN pc =\nX\nτ∈sp(t)\\sp(s)\nocc(τ, t) =\nLX\ni=1\n1[t i /∈sp(s)] =\nLX\ni=1\nXi,\nwhere1is the indicator function for an event. Then,\nE[N pc] =\nLX\ni=1\nPr[Xi = 1]\n=\nLX\ni=1\n(1−Pr[X i = 0])\n=L−\nLX\ni=1\nX\nτ∈sp(s)\nPr[ti =τ]\n=L−\nLX\ni=1\n\nPr[ti =s i] +\nX\nτ∈sp(s)\\s i\nPr[ti =τ]\n\n\n=L−\nLX\ni=1\n\n1−q+\nX\nτ∈sp(s)\\s i\nPr[ti =τ]\n\n\n=Lq−\nLX\ni=1\nX\nτ∈sp(s)\\s i\nPr[ti =τ]\n=Lq−\nLX\ni=1\nkX\nj=1\nX\nτ∈sp(s)\\s i\ns. t.\nHD(τ,t i)=j\nPr[ti =τ]\n=Lq−\nLX\ni=1\nkX\nj=1\ndj(si, s)(1−r) k−j(r/3)j\n=Lq−\nLX\ni=1\nR(si, s)\n=Lq−\nX\nτ∈sp(s)\nocc(τ, s)R(τ, s)\n□\nA straightforward method-of-moments estimator based on\nEq. 4 would be to observe the value ofN pc (denoted asN pc\nobs), let\nf(r) =Lq− P\nτ∈sp(s) occ(τ, s)R(τ, s), and use numerical methods\nto solve the equationN pc\nobs =f(r) forr. However, there is no\nguarantee for the uniqueness of the solution. Furthermore, it would\nbe time-consuming and superfluous to computeR(τ, s) for allτ.\nInstead, we derive an estimator based on an approximation.\nConsider the two terms of Eq. 4. The first term is the expected\nnumber of positions whosek-mer mutated. This may over-count\nN pc, and the second term corrects this by accounting for the\npossibility that a position mutates but to something that is already\n.CC-BY 4.0 International licensemade available under a \n(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is \nThe copyright holder for this preprintthis version posted April 5, 2026. ; https://doi.org/10.64898/2026.04.01.715966doi: bioRxiv preprint \n\n4 Wu and Medvedev\nins. As we expect the first term to dominate, we approximate\nE[N pc]≈Lq, leading to the estimator\nˆqpc = N pc\nobs\nL .\nUnlike ˆqpp, this formula accounts for the possibility that two\noccurrences of ak-mer insmutate to the same newk-mer in\nt. The bias of ˆqpc follows immediately from Lemma 1:\nTheorem 1The bias ofˆq pc is\nE[ˆqpc]−q=−\nkX\ni=1\nX\nτ∈sp(s)\nocc(τ, s)Ri(τ, s)\nL .\nNote that the bias is always negative, meaning that, on average,\nˆqpc underestimates the true value. Moreover, the bias is a sum ofk\nnegative terms, which leads to the intuition of reducing the bias by\nincluding one of the terms inside the estimator itself—something\nwe pursue in the next section.\n5. The Count-Count Setting\nIn this section, we first present a novel Count-Count estimator ˆrcc\nand then describe an alternative Count-Count estimator proposed\nby Rhie et al. (2020) in the context of assembly quality assessment.\n5.1. New Estimator ˆqcc\nLet us build on ˆ qpc by trying to include the first term of the\nbias (i.e. −P\nτ occ(τ, s)R1(τ, s)/L) into the estimator itself. With\nthe method-of-moments approach, we would need to solve the\nfollowing equation for r:\nN pc\nobs = Lq − (1 − r)k−1 · r\n3 ·\nX\nτ∈s\nocc(τ, s)·d 1(τ, s).(5)\nUnfortunately, we cannot solve this equation analytically and we\ncannot guarantee that a numerical method would produce a unique\nsolution. Instead, we will first compute the ˆrpc estimator and then\nplug into the right hand side of Eq.5. Our estimator is then defined\nas\nˆqcc = N pc\nobs\nL + (1−ˆrpc)k−1 · ˆrpc\n3L ·\nX\nτ∈sp(s)\nocc(τ, s)·d 1(τ, s)\nWhile this approach of plugging in one estimator in order to derive\nanother one makes it challenging to prove anything, we will see\nthat it performs extraordinarily well in practice.\nTo compute ˆqcc, we need to computed 1(τ, s) for allτ∈sp(s).\nWe do this in a straightforward two-pass algorithm using a hash\ntable, resulting in runtime linear inL. There are more advanced\nways of doing this (e.g. using suffix arrays), but we were not\nconcerned with optimizing runtime since it was below one second.\n5.2. Estimator mentioned in Rhie et al. (2020)\nRhie et al. (2020) mention an estimator that we recapitulate here\nto show how it fits in our framework. It is designed in the spirit\nof Eq. 1 by relying on the intersection size but also integrating\ncount information. Consider the size of the weighted intersection\nbetween thek-mers ofsandt:\nI cc ≜\nX\nτ∈sp(s)∪sp(t)\nmin(occ(τ, s), occ(τ, t)).\nLet us assume that when a mutation occurs, the newk-mer is not\nins. Before any mutations,I cc =L. When ak-merτmutates\nto anotherk-merτ ′, it increasesocc(τ ′, t) by one and decreases\nocc(τ, t) by one. Sinceτ ′ is not ins,occ(τ ′, t) does not contribute\nanything toI cc. Therefore, the only effect of a mutation onI cc\nis to decrease it by one. By linearity of expectation, we can add\nthe probability of this happening for each of theL(non-distinct)\nk-mers and get thatE[I cc]≈L(1−q). The method-of-moments\napproach then gives the estimator\nˆqwi ≜1−I cc/L.\n6. Combination With Sketching\nA major reason why many estimators are based onk-mers\nrather than on the full sequences is that it makes them easily\namenable to sketching. Sketching is a powerful technique that can\nmake it possible to quickly compute all-pairs estimates on large\ndatasets (Ondov et al., 2016). AFracMinHash sketchsp θ(s) of a\nsequencesis defined as the subset ofsp(s) that includes only the\nk-mers that map below a pre-defined thresholdθ, under a fixed\nrandom hash function (Hera et al., 2023).\nIn this section, we will present a modification of ˆq pp, ˆqpc and\nˆqcc that work on data sketched with FracMinHash. Instead of\nobservingN pp orN pc, we now only observe them restricted to\nthe sketchedk-mers. Formally, let us define\nN pp\nθ =|sp θ(t)\\sp θ(s)|andN pc\nθ =\nX\nτ∈sp θ (t)\\spθ (s)\nocc(τ, t).\nIntuitively, we expect each of these quantities to decrease by a\nfactor ofθrelative to their non-sketched versions. Based on this\nintuition, we define ˆqθ\npp and ˆqθ\npc as ˆqθ\npp = N pp\nθ\nθL and ˆqθ\npc = N pc\nθ\nθL .\nFormalizing our intuition, we prove that sketching does not effect\nthe bias.\nTheorem 2The biases ofˆq θ\npp andˆqθ\npc are\nE[ˆqθ\npp]−q=E[ˆq pp]−qandE[ˆq θ\npc]−q=E[ˆq pc]−q.\nProofFirst, we show thatE[N pc\nθ ] =θ·E[N pc]. As before, letX i\nbe an indicator random variable representing the event thats i\nmutated to ak-mer that does not appear ins. LetY τ be a binary\nrandom variable and it is 1 ifk-merτhashes to less thanθ. By\nthe linearity, we have\nE[N pc\nθ ] =\nLX\ni=1\nPr[Ysi = 1∩X i = 1] =\nLX\ni=1\nPr[Ysi]·Pr[X i = 1]\n=\nLX\ni=1\nθ·Pr[X i = 1] =θ·E[N pc]\nThen,\nE[ˆqθ\npc]−q=E\n\u0014 N pc\nθ\nθ·L\n\u0015\n−q= E[N pc]\nL −q=E[ˆq pc]−q\n.CC-BY 4.0 International licensemade available under a \n(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is \nThe copyright holder for this preprintthis version posted April 5, 2026. ; https://doi.org/10.64898/2026.04.01.715966doi: bioRxiv preprint \n\nEstimators of substitution rates 5\nThe proof for the bias of ˆqpp is similar and is omitted.□\nBecause ˆqθ\npc shows the same bias as ˆqpc, we use the same idea of\nSec. 5.1 to obtain a stronger estimator ˆqθ\ncc by partially correcting\nthe bias of ˆqθ\npc, i.e.\nˆqθ\ncc = ˆqθ\npc + (1−ˆrθ\npc)k−1 · ˆrθ\npc\n3L ·\nX\nτ∈sp(s)\nocc(τ, s)·d 1(τ, s).\nNote thatP\nτ∈s occ(τ, s)·d 1(τ, s)/Lneeds to be precomputed\nprior to sketching. It does not increase the space needed for the\nsketch as it becomes just a single constant that needs to be stored.\nBecause of the variance introduced by sketching, the observed\nquantity can exceedθLand then any of our three estimators can\nexceed 1. We therefore cap both estimators so that they never\nreturn greater than 1.\n7. Empirical Results\nWe aim to evaluate the accuracy of our three novel estimators in\nrelation to each other as well as to the other estimators mentioned\nin this paper. We also show an application to real data, where\nour estimators can be used to predict the ANI between various\ngenomes. Our software is available on GitHub (Wu and Medvedev,\n2025).\n7.1. Datasets\nWe use four different base sequences with various levels of\nrepetitiveness, summarized in Table S1. We use these sequences as\nrepresentative examples spanning different levels of repetitiveness.\nOur main text evaluation is focused on the D-hardest sequence,\nwhich is a 100kbp-long sequences of alpha satellite DNA extracted\nfrom the human T2T chr21 centromere. We use a value ofk= 30\nfor our evaluation on this sequence, as consistent with previous\nanalyses (Wu et al., 2025), leading to 3,987 distinct 30-mers. Over\n70% of the distinctk-mers in D-hardest occur more than once and\nitsk-mers have on average at least one otherk-mer at a Hamming\ndistance of one. D-hardest violates both assumption 1 (i.e. because\nit has repeats) and assumption 2 (i.e. because it has pairs ofk-\nmers with small Hamming distances, ak-mer can mutate into one\nthat is already ins). Since the differences between the estimators\nare more pronounced on this sequence, our main text focuses on\nD-hardest. The results on the three other sequences are presented\nin Sec. A of the Supplementary; they are all consistent with our\nfindings on D-hardest but with less pronounced differences on less\nrepetitive sequences.\n7.2. Evaluation Metrics\nWe focus our evaluation on the estimators’ accuracy, measuring\nboth their bias and variance. We do not perform a runtime or\nmemory analysis because they each complete in less than a second\nin total on all of the four datasets and use negligible memory.\nFirst, we benchmark each estimator on the four datasets using\ntheir default values ofk, i.e. the values chosen in Wu et al. (2025)\nas most suitable for their analysis. We vary the mutation rater\nfrom 0.001 to 0.251 and we show the distribution of ˆrfor 100\nmutation-process simulation replicates for eachr(e.g. Fig. 1).\nThese experiments give a fine-grained separate view of the bias\nand variance.\nSecond, we vary bothkandrand, for each (k, r) pair, compute\nthe average relative absolute error: 1\nn\nPn\ni=1\n|ˆri−r|\nr (e.g. Fig. 2 and\nTable 2). We usen= 100 replicates for each (k, r) pair. This\nerror combines the bias and variance into one metric, enabling\nus to easily visualize accuracy in two dimensions. Note that the\ntwo types of benchmarks emphasize different aspects of estimator\nperformance.\nAs observed in our previous work (Wu et al., 2025), whenr\nand/orkbecome sufficiently large, allk-mers mutate with very\nhigh probability, causing all tested estimators to return the value\n1. We refer to this unstable behavior asblow-upand it manifests\nas high error in the top-right corner of the heatmaps.\n7.3. Presence-Presence Setting\nWe compare ˆrpp with the two other estimators in this category:\nthe estimator defined by Eq. 1, which we refer to as ˆrobl, and the\nestimator defined by Eq. 2, which we refer to as ˆrmash (this is the\nwidely used Mash estimator with a binomial correction). Fig. 1A\nshows that fork= 30, ˆrpp dominates ˆrmash across nearly all tested\nmutation rates; ˆrpp dominates ˆrobl at lower values ofrand has\nsimilar variance and bias forr >0.161.\nFig. 2A presents heatmaps of estimator accuracy across a wide\nrange ofkandr. As expected, all estimators exhibit blow-up\nbehavior for sufficiently largekandr. However, ˆr pp performs\nsubstantially better than ˆrmash and ˆrobl at smaller mutation rates.\n7.4. Presence-Count and Count-Count Setting\nFor the Presence-Count setting, ˆr pc is the only estimator that\nwe are aware of, while for the Count-Count setting, we have\nour new estimator ˆrcc and the ˆrwi estimator. Fig. 1B shows the\nperformance of these estimators withk= 30. and Fig. 2B shows\nthe performance across a wide range ofkandr. Table 2 shows\nspecific values of errors under different settings ofkandr.\nFirst, we note that all estimators in these settings have smaller\nor similar bias and error than the ˆrpp estimator, underscoring the\ngeneral power of using count information.\nSecond, in the Count-Count setting, ˆr cc outperforms ˆrwi for\nbothk= 30 and more broadly across most tested (k, r) values.\nAlthough ˆrcc and ˆrwi have access to the samek-mer count\ninformation, ˆrcc achieves nearly unbiased estimation for all tested\nvalues ofratk= 30.\nThird, in the Presence-Count setting, ˆr pc does not have a\ndirect competitor, so we compare it against ˆr pp, ˆrcc, and ˆrwi.\nCompared to the ˆrpp estimator of the Presence-Presence setting,\nˆrpc explicitly accounts for the event that multiplek-mers can\nmutate into the same novelk-mer, resulting in a smaller bias than\nˆrpp. The estimator ˆrwi further addresses certain cases in which a\nk-mer mutates into anotherk-mer already present in the original\nsequence and, consequently, exhibits a slightly smaller bias than\nˆrpc forr <0.051. Nevertheless, ˆr pc and ˆrwi have very similar\nperformance. Compared to ˆrcc, ˆrpc does not perform as well, which\nis expected since ˆrcc is designed to specifically offset some of the\nbias of ˆrpc.\n7.5. Comparison Against Estimator From Wu et al. (2025)\nIn a previous work, we tackled a similar problem and developed a\nsingle repeat-robust estimator (Wu et al., 2025). We will refer to\nit as ˆrwu here. It was the firstk-mer-based estimator evaluated for\naccuracy in highly repetitive settings, showing robustness in these\nsettings compared to ˆqobl. It does not neatly fit into the categories\nhere, as it uses the abundance histogram ofs, i.e. the histogram of\nk-mer occurrence counts. While this is based on the count of the\n.CC-BY 4.0 International licensemade available under a \n(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is \nThe copyright holder for this preprintthis version posted April 5, 2026. ; https://doi.org/10.64898/2026.04.01.715966doi: bioRxiv preprint \n\n6 Wu and Medvedev\n(A)\n0.001 0.101 0.201\nr\n-0.04\n0.00\n0.04\n0.08\nError (r r)\nrpp\nrmash\nrobl\n(B)\n0.001 0.101 0.201\nr\n-0.03\n-0.02\n-0.01\n0.00\n0.01\n0.02\n0.03\n0.04\n0.05\nError (r r)\nrpc\nrwi\nrcc\n(C)\n0.001 0.061 0.121 0.181 0.241\nr\n-0.03\n-0.02\n-0.01\n0.00\n0.01\n0.02\n0.03\nError (r r)\nrwu\nrpp\nrpc\nrcc\nFig. 1: Comparison of estimator accuracy on D-hardest, withk= 30 and mutation raterfrom 0.001 to 0.251 with step size 0.02. For\neach value ofr, we show box plots over 100 mutation replicates.(A)Comparison of ˆr pp against other presence–presence estimators.(B)\nComparison of ˆrpc against ˆrwi and ˆrcc.(C)Comparison of our three novel estimators against each other and against the ˆr wu estimator\nof Wu et al. (2025).\n.CC-BY 4.0 International licensemade available under a \n(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is \nThe copyright holder for this preprintthis version posted April 5, 2026. ; https://doi.org/10.64898/2026.04.01.715966doi: bioRxiv preprint \n\nEstimators of substitution rates 7\n(A)\n8 16 24 32\nk\n0.001 0.121 0.241 0.361\nr\nr_pp\n8 16 24 32\nk\nr_mash\n8 16 24 32\nk\nr_obl\n0.0\n0.2\n0.4\n0.6\n0.8\n1.0\nError\n(B)\n8 16 24 32\nk\n0.001 0.121 0.241 0.361\nr\nr_pc\n8 16 24 32\nk\nr_wi\n8 16 24 32\nk\nr_cc\n0.00\n0.02\n0.04\n0.06\n0.08\n0.10\nError\nFig. 2: The estimators accuracy on D-hardest as a function of bothkandr. Each cell shows the average relative absolute error of 100\nreplicates. Note that the scale of the heatmap is different between the top and bottom panels. Moreover, the errors are capped at 1.0\n(for the panel A) and 0.10 (for the panel B), e.g., all errors greater than 1.0 are shown as 1.0 in the top panel.\nTable 2.Errors of estimators under different parameter settings. Each cell shows the average relative absolute error of 100 replicates. Bold values highlight\nthe lowest error in the respective class, while red/italic values highlight the lowest error overall. Underline denotes estimators we introduce in this paper.\nTheˆrwu estimator falls in between the Presence-Presence and Presence-Count categories.\nr= 0.001r= 0.010r= 0.100\nEstimator Category k= 16k= 24 k = 32 k= 16k= 24 k = 32 k= 16k= 24 k = 32\nˆrobl\nPresence-Presence\n202 130 94 19 12 8.7 1.1 .46 .18\nˆrmash 14 11 8.9 6.5 4.6 3.5 .64 .20.013\nˆrpp .089.083.083 .14 .095 .073 .23 .097.044\nˆrwu .88 .73 .67 .30 .20 .20 .37 .13 .037\nˆrpc Presence-Count .084 .084 .083 .035 .029 .025 .032 .018 .017\nˆrwi Count-Count .082.084.081 .030 .027 .024 .028 .017 .017\nˆrcc .083 .085.081 .027 .024 .023 .010 .011.014\nk-mers ofs, it is only a summary and can be approximated using\na related species or a related type of sequence. In our framework,\ntherefore, ˆrwu lies somewhere between the Presence-Presence and\nthe Count-Presence setting. Fig. 1C compares ˆr wu against our\nthree estimators fork= 30. Table 2 and Fig. S1 show the error\nacross different settings ofkandr.\nDespite relying on less information, ˆr pp is overall a better\nestimator than ˆrwu. The relative bias depends onr: ˆr pp has\nslightly smaller bias for 0.001≤r≤0.091 and has slightly larger\nbias for 0.101≤r≤0.231 (Fig. 1C). However, ˆr wu often has more\nvariance than ˆrpp, as reflected in both Fig. 1C and the bigger\nrelative absolute errors shown in Table 2 and Fig. S1.\nThe improvement of ˆr pc over ˆrwu is more stark, as ˆr pc\nconsistently has lower bias than ˆrwu (Fig. 1C) and has consistently\nlower overall error across all values ofkandr(Table 2 and Fig. S1).\nThe superior performance of ˆr pc relative to ˆrwu highlights the\nbenefits of accounting for novelk-mers in the estimate formula, as\nˆrwu does not take into account anyk-mers intthat are not ins.\n7.6. Combination With Sketching Techniques\nIn Sec. 6, we proved that sketching does change the bias of ˆq pp\nand ˆqpc. Here, we evaluate empirically the bias of all our three\nestimators. Fig. S2 shows the performance of ˆr θ\npp, ˆrθ\npc, and ˆrθ\ncc\nusingθ∈ {0.1,0.01}. We observe that sketching does not introduce\nany systematic bias, in any of the three estimators. As expected\nwith sketching, we observe increased variance forθ= 0.1 and even\nlarger variance forθ= 0.01. Overall, our results confirm that our\nestimators can be applied naturally to sketched data, with the\nobvious caveat that smaller sketches will lead to larger variance of\nthe estimator.\n.CC-BY 4.0 International licensemade available under a \n(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is \nThe copyright holder for this preprintthis version posted April 5, 2026. ; https://doi.org/10.64898/2026.04.01.715966doi: bioRxiv preprint \n\n8 Wu and Medvedev\nTable 3. Results on the ANI benchmark. The number of uncomputable\npairs is the number of pairs for which the estimator either returns 0 or\ndoes not return anything. Pearson R value (Pearson) and mean absolute\nerror (MAE) are shown for a subset of pairs with ANI> 85%, to avoid\npenalizing for uncomputable pairs.\nTool n. uncomputable Pearson MAE\nskani 144 0.9929 0.28\nFastANI 128 0.9968 0.19\nMash 11 0.9923 0.40\nSourmash 5 0.9909 0.54\n1 - ˆrθ\npc 3 0.9904 0.47\n1 - ˆrθ\ncc 3 0.9904 0.477.7. ANI Estimation on Real Genomes\nTo evaluate the applicability of our estimators on real data, we\nuse 1−ˆr θ\npc and 1−ˆr θ\ncc to estimate ANI between real genomes.\nWe use the slow but accurate alignment-based OrthoANIu (Yoon\net al., 2017) to compute the ground truth, consistent with previous\nwork (Shaw and Yu, 2023).\nWe evaluate using a dataset from Hera et al. (2023). They\nfirst chose ten representative genomes from the Genome Taxonomy\nDatabase, including seven bacterial and three archaeal species. For\neach representative genome, they extract additional genomes from\nalong the evolutionary path to the root, selecting an additional\nthree non-representative genomes at each taxonomic rank on the\npath up the tree. They construct pairs for comparison by matching\neach representative genome with the genomes selected along its\nevolutionary path up the tree. We further filtered out three pairs\nthat had accession numbers that were not currently available. The\nresulting dataset contains 189 pairs and covers a wide range of\nANI from∼60% to 100%.\nWe benchmark our estimators against those of of Mash (Ondov\net al., 2016), sourmash (Irber et al., 2024), FastANI (Jain et al.,\n2018), and skani (Shaw and Yu, 2023). FastANI and skani rely\non mapping techniques with seeds; we run them with default\nparameters. Sourmash and Mash are both similar to our estimators\nin that they rely onk-mer sketches. We therefore setk= 19 for our\ntools and sourmash and Mash. Sourmash also uses FracMinHash,\nso we setθ= 0.01 for both sourmash and our estimators. Mash\nuses MinHash sketching instead of FracMinHash, so we set the\nsize of its sketch to be 30,000 to roughly match the sketch sizes\nobtained with FracMinHash ofθ= 0.01. Aside from this, we use\nthe default parameters for sourmash and Mash. Table S2 in the\nSupplementary gives the full commands and data used.\nFig. S3 shows the estimator results relative to the ground truth.\nFor some pairs at ANI<85%, some of the estimators either do not\nreport an estimate or report 0; we refer to these asuncomputable\npairs. In order to make a fair comparison, we measure both the\nnumber of uncomputable pairs and the accuracy of the predictions\nat ANI>85% (Table 3).\nThe most accurate estimators at high ANI levels are skani\nand FastANI; however, they are not able to estimate ANI for\nmore than 100 pairs at lower ANI levels. On the other hand, our\nestimators are the most comprehensive and are able to compute\nall except three pairs. Overall, there is a general pattern that more\ncomprehensive estimators are less accurate at high ANI levels,\nmaking the choice of the best estimator a trade-off.\n8. Discussion and Conclusion\nIn this paper, we studied the problem of estimating the mutation\nrate of a process that transforms an arbitrary stringsinto a string\ntby introducing substitutions at rater. We focused on estimators\nthat are based onk-mers, as they can easily be combined with\nsketching to make them scalable to large datasets. We observed\nthat various estimators for this problem, including our own, can be\ncategorized according to whether they have access tok-mer counts\nor to only presence/absence information. We presented three novel\nestimators, ˆqpp, ˆqpc, and ˆqcc, summarized in Table 1. On highly\nrepetitive data, ˆqpp and ˆqcc perform best in their category, with\nˆqcc outperforming all tested estimators in all categories. The ˆq pp\nestimator is desirable when no count information is available,\nwhile the ˆqpc estimator offers a middle ground trade-off between\naccuracy and the amount of information required from the original\nand mutated sequences.\nThe main insight behind our ˆq pp and ˆqpc estimators is that\nin a repeat setting, it is important to count thek-mers that are\nnewly created intand do not appear ins. We reflect this in the\ntitle, i.e.,novelk-mers are agiftthat we must make use of. In\nmany cases, such as ˆq mash and ˆqwu, the formula relies on the\nobserved number of sharedk-mers. This works fine in the absence\nof repeats or mutations resulting in spurious matches, as there\nis a one-to-one correspondence betweenk-mers removed from the\nintersection and newly createdk-mers. However, when there are\nrepeats, a mutation in a repetitivek-merτonly destroys one copy\nofτand does not removeτfrom the count of sharedk-mers. On\nthe other hand, it does add to the count of newk-mers, asτ ′ is\nadded as a newk-mer tot. We show that these types of events\nhurt the performance ˆqmash and ˆqwu, though the effect on ˆqwu is\ncompensated by its use of counts. Our ˆqpp and ˆqpc estimators, on\nthe other hand, are fully based on the count of newk-mers, giving\nthem a performance advantage.\nIn the Count-Count setting, ˆq wi already accounts for this\ninsight; in order to out-perform it, we add some accounting for\nthe possibility that ak-mer mutates into ak-mer that is already\nins. Consider the possibility that ak-merτinsmutates to a\nk-merυwhile ak-merυinsmutates intoτ. The likelihood of this\nis not directly related to repeats, as it can happen in a repeat-free\ngenome; it is related to twok-mers inshaving a small Hamming\ndistance between them. The ˆq wi estimator does not model the\nchance of this event happening. Our estimator ˆq cc models this\nevent happening for the case that the Hamming distance is one;\nwe believe this accounts for its improved performance.\nOur mutation model is naturally an idealized version of reality\nand does not explicitly account for factors such as ploidy or indels.\nNevertheless, our estimators can be used as part of downstream\ntools. For example, the Merqury tool (Rhie et al., 2020) computes\nthe quality of a candidate assembly by comparing itsk-mer content\nto thek-mer content of the sequencing data used to validate it. The\nscore that Merqury reports is closely related to our ˆrpc estimator,\nas it essentially treats the assembly as a mutated stringtfrom\nthe original genomes. Thek-mer counts oftare given by the\nassembly; to approximate the presence-absence spectrum ofs,\nMerqury uses higher-copyk-mers from the read set. Note that\nwhat we are describing here differs from the ˆrwi estimator, which\nis also alluded to in Rhie et al. (2020).\nA natural question arising from our work is how to choosekand\nθ. Guidance on selectingkis partially addressed by our heatmaps,\nwhich display estimator accuracy across a grid of (k, r) values for\n.CC-BY 4.0 International licensemade available under a \n(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is \nThe copyright holder for this preprintthis version posted April 5, 2026. ; https://doi.org/10.64898/2026.04.01.715966doi: bioRxiv preprint \n\nEstimators of substitution rates 9\nboth repetitive and typical genomic settings. These plots allow\nusers to identify the stable operating regime for their sequence\ntype. Additionally, in our previous work (Wu et al., 2025), we\nintroducedP empty as a heuristic stability criterion: given sequence\nlengthL, mutation rater, andk,P empty quantifies the probability\nthat ak-mer has no surviving copies after mutation, providing a\nprincipled threshold below which the estimator is expected to be\nunstable. The choice ofθ(the sampling rate for sketching) has\nbeen widely studied in the sketching literature, though not for\nour new estimators in particular. Our analysis here (e.g. Fig. S2)\ncan give a rough guidance for choosingθ, but a more rigorous\nanalysis of sketch estimator stability in repetitive regions remains\nan interesting problem.\nOur insights open the door for the continued improvement of\nestimators. One particular direction is the Count-Presence setting\nwhich we have not discussed in this paper. Note that this setting\nis not symmetric to the Presence-Count setting; e.g. the counts in\nsare a deterministic variable while the counts intare a random\nvariable. It remains an open problem to determine if estimators\nin the Count-Presence setting can outperform estimators in the\nPresence-Count setting. As more repetitive sequences become\navailable, we expect future work to uncover new insights to further\ndrive the improved quality of mutation rate estimators.\nAcknowledgments:We thank Antonio Blanca for useful\ndiscussions. This material is based upon work supported by\nthe National Science Foundation under Grants No. DBI2138585\nand OAC1931531. Research reported in this publication was\nsupported by the National Institute Of General Medical Sciences\nof the National Institutes of Health under Award Number\nR01GM146462. The content is solely the responsibility of the\nauthors and does not necessarily represent the official views of\nthe National Institutes of Health.\nReferences\nBelbasi, M., Blanca, A., Harris, R. S., Koslicki, D., and Medvedev,\nP. (2022). The minimizer Jaccard estimator is biased and\ninconsistent.Bioinformatics, 38(Supplement\n1):i169–i176.\nBlanca, A., Harris, R. S., Koslicki, D., and Medvedev, P.\n(2022). The statistics of k-mers from a sequence undergoing\na simple mutation process without spurious matches.Journal\nof Computational Biology, 29(2):155–168.\nDayhoff, M. O. (1969). Computer analysis of protein evolution.\nScientific American, 221(1):86–95.\nDurbin, R., editor (2013).Biological sequence analysis. Cambridge\nUniv. Press, Cambridge [u.a.], 17. print. edition.\nHenikoff, S. and Henikoff, J. G. (1992). Amino acid substitution\nmatrices from protein blocks.Proceedings of the National\nAcademy of Sciences, 89(22):10915–10919.\nHera, M. R. and Koslicki, D. (2025). Estimating similarity and\ndistance using FracMinHash.Algorithms for Molecular Biology,\n20(1):1–13.\nHera, M. R., Liu, S., Wei, W., Rodriguez, J. S., Ma, C., and\nKoslicki, D. (2024). Metagenomic functional profiling: to sketch\nor not to sketch?Bioinformatics, 40(Supplement\n2):ii165–ii173.\nHera, M. R., Medvedev, P., Koslicki, D., and Blanca, A.\n(2025). Estimation of substitution and indel rates via k-mer\nstatistics. In25th International Conference on Algorithms for\nBioinformatics (WABI 2025), volume 344, pages 16:1–16:15.\nSchloss Dagstuhl – Leibniz-Zentrum f¨ ur Informatik.\nHera, M. R., Pierce-Ward, N. T., and Koslicki, D. (2023). Deriving\nconfidence intervals for mutation rates across a wide range of\nevolutionary distances using FracMinHash.Genome research,\n33(7):1061–1068.\nIrber, L., Pierce-Ward, N. T., Abuelanin, M., Alexander, H.,\nAnant, A., Barve, K., Baumler, C., Botvinnik, O., Brooks, P.,\nDsouza, D., et al. (2024). sourmash v4: A multitool to quickly\nsearch, compare, and analyze genomic and metagenomic data\nsets.Journal of Open Source Software, 9(98):6830.\nJain, C., Rodriguez-R, L. M., Phillippy, A. M., Konstantinidis,\nK. T., and Aluru, S. (2018). High throughput ani analysis of 90k\nprokaryotic genomes reveals clear species boundaries.Nature\ncommunications, 9(1):1–8.\nLogsdon, G. A., Rozanski, A. N., Ryabov, F., Potapova, T.,\nShepelev, V. A., Catacchio, C. R., Porubsky, D., Mao, Y., Yoo,\nD., Rautiainen, M., et al. (2024). The variation and evolution\nof complete human centromeres.Nature, 629(8010):136–145.\nOndov, B. D., Treangen, T. J., Melsted, P., Mallonee, A. B.,\nBergman, N. H., Koren, S., and Phillippy, A. M. (2016).\nMash: fast genome and metagenome distance estimation using\nMinHash.Genome Biology, 17(1):132.\nRathore, S. P. S. and Kashyap, N. (2026). Estimators for\nsubstitution rates in genomes from read data.arXiv preprint\narXiv:2601.07546.\nRhie, A., Walenz, B. P., Koren, S., and Phillippy, A. M. (2020).\nMerqury: reference-free quality, completeness, and phasing\nassessment for genome assemblies.Genome Biology, 21(1).\nSarmashghi, S., Bohmann, K., Gilbert, M. T. P., Bafna, V., and\nMirarab, S. (2019). Skmer: assembly-free and alignment-free\nsample identification using genome skims.Genome Biology,\n20(1):1–20.\nShaw, J. and Yu, Y. W. (2023). Fast and robust metagenomic\nsequence comparison through sparse chaining with skani.Nature\nMethods, 20(11):1661–1665.\nShaw, J. and Yu, Y. W. (2024). Rapid species-level metagenome\nprofiling and containment estimation with sylph.Nature\nBiotechnology, pages 1–12.\nSong, K., Ren, J., Reinert, G., Deng, M., Waterman, M. S.,\nand Sun, F. (2014). New developments of alignment-free\nsequence comparison: measures, statistics and next-generation\nsequencing.Briefings in bioinformatics, 15(3):343–353.\nWasserman, L. (2013).All of statistics: a concise course in\nstatistical inference. Springer Science & Business Media.\nWu, H., Blanca, A., and Medvedev, P. (2025). Ak-mer-\nbased estimator of the substitution rate between repetitive\nsequences. In25th International Conference on Algorithms\nfor Bioinformatics (WABI 2025), volume 344 ofLeibniz\nInternational Proceedings in Informatics (LIPIcs), pages 20:1–\n20:20.\nWu, H. and Medvedev, P. (2025). Accurate repeat-aware\nkmer based estimator.https://github.com/medvedevgroup/\nAccurate_repeat-aware_kmer_based_estimator.\nYoon, S.-H., Ha, S.-m., Lim, J., Kwon, S., and Chun, J. (2017).\nA large-scale evaluation of algorithms to calculate average\nnucleotide identity.Antonie Van Leeuwenhoek, 110(10):1281–\n1286.\nZielezinski, A., Vinga, S., Almeida, J., and Karlowski,\nW. M. (2017). Alignment-free sequence comparison: benefits,\napplications, and tools.Genome biology, 18(1):186.\n.CC-BY 4.0 International licensemade available under a \n(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is \nThe copyright holder for this preprintthis version posted April 5, 2026. ; https://doi.org/10.64898/2026.04.01.715966doi: bioRxiv preprint \n\nEstimators of substitution rates 1\nA. Supplementary information for “The gift of novelty: repeat-robustk-mer-based estimators of mutation\nrates” by Haonan Wu and Paul Medvedev\nTable S1.Sequence properties of our four datasets. These sequences were originally used for evaluation by Wu et al. (2025). They are extracted from\nthe human T2T-CHM13v2.0 reference and made available to download directlyhttps://zenodo.org/records/18303511. The d1 column shows the\naverage number of neighbors that have a Hamming distance of 1, i.e. d1 ≜P\nτ∈sp(s) d1(τ, s)/|sp(s)|.\nName Defaultk L |sp(s)| d1 Type\nD-easy 20 100,000 98,786 0.06 arbitrary region\nD-medium 20 14,400 13,727 0.13 RBMY1A gene\nD-hard 10 2,264 1,199 0.77 simple repeat\nD-hardest 30 100,000 3,987 1.22 centromere\nTable S2.Commands used in our ANI benchmark pipeline. The dataset is available for download athttps://github.com/bluegenes/\n2022-focused-cani-comparisons/blob/main/gtdb-rs207.common-sp10-evolpaths.csv.\nMethod (version) Commands used (arguments in parentheses)\nskani (v0.2.2)skani sketch -t (threads) -o (ref db) (reference genomes);\nskani search -d (ref db) (query genome) -t (threads) -s (s).\nMash (v2.3)mash sketch -o (ref db/query.msh) -p (threads) -s (sketch size) -k (k) (reference genomes);\nmash dist -p (threads) (ref db/ref.msh) (query genome).\nF astANI (v1.33)fastANI -q (query genome) --rl (ref list.txt) -o (output.tsv) -t (threads).\nsourmash (v4.5.0)sourmash sketch dna -p k=(k),scaled=100 --output-dir (ref db) (reference genomes);\nsourmash sketch dna -p k=(k),scaled=100 -o (query.sig) (query genome);\nsourmash search (query.sig) (ref db/*.sig) -k (k) --max-containment\n--estimate-ani-ci -n 0 -t 0 -o (output.csv).\nANIu (v1.2)java -jar (OAU.jar) -n (threads) -f1 (genome1) -f2 (genome2) -u (usearch binary).\nFig. S1: Heatmaps showing the performance of the ˆr wu estimator on D-hardest. The left heatmap shows the error using a scale up to\n1.0, and the right heatmap shows the error using a finer resolution of a scale up to 0.1.\n.CC-BY 4.0 International licensemade available under a \n(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is \nThe copyright holder for this preprintthis version posted April 5, 2026. ; https://doi.org/10.64898/2026.04.01.715966doi: bioRxiv preprint \n\n2 Wu and Medvedev\nˆrpp\n ˆrpc\nˆrcc\nFig. S2: Effect of sketching on the accuracy of our three estimators, on D-hardest. Top left is ˆrθ\npp, top right is ˆrθ\npc, and bottom is ˆrθ\ncc. We\nvary mutation rates inr∈[0.001,0.111], using a step size of 0.01. Because sketching introduces additional variance, estimator blow-up\noccurs at smaller values ofr, and we therefore restrict our evaluation tor≤0.111.\n.CC-BY 4.0 International licensemade available under a \n(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is \nThe copyright holder for this preprintthis version posted April 5, 2026. ; https://doi.org/10.64898/2026.04.01.715966doi: bioRxiv preprint \n\nEstimators of substitution rates 3\nFig. S3: Performance of various estimators on the ANI benchmark. The dotted line represents a perfect predictor. For genome pairs\nwhere the estimator was unable to compute a value, we show the predicted value as 0.\nA.1. Comparison of all the estimators on D-easy, D-med, and D-hard\nIn the main text, we focused on the evaluation of the D-hardest dataset. For completeness, we also include this Supplementary section\nto show the results on the three easier datasets: D-easy, D-med, D-hard. These datasets were first introduced by Wu et al. (2025) to have\nprogressive levels of repetitiveness. D-easy is an arbitrarily chosen substring of chr6, with less than 1% ofk-mers being non-singleton.\nD-med is the sequence of the chrY RBMY1A1 gene, with approximately 3% ofk-mers being non-singleton. D-hard is a subsequence of\nD-med that is annotated as a simple repeat, with more than 40% ofk-mers being non-singletons. These datasets are summarized in\nTable S1. Fig. S4 shows the results for a singlekvalue and The results are consistent with our findings for D-hardest but the differences\nbecome less pronounced on less repetitive sequences. For D-easy and D-med, all the estimators perform relatively well though ˆr obl and\nˆrmash show more bias than the rest. For D-hard, the relative performance is similar to that on D-hardest. Fig. S5 shows the heatmaps\nfor a wide range ofkandr.\n.CC-BY 4.0 International licensemade available under a \n(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is \nThe copyright holder for this preprintthis version posted April 5, 2026. ; https://doi.org/10.64898/2026.04.01.715966doi: bioRxiv preprint \n\n4 Wu and Medvedev\n(A)\n0.001 0.031 0.061 0.091 0.121 0.151 0.181 0.211 0.241\nr\n-0.015\n-0.010\n-0.005\n0.000\n0.005\n0.010\n0.015\nError (r r)\nrobl\nrmash\nrwu\nrpp\nrpc\nrwi\nrcc\n(B)\n0.001 0.031 0.061 0.091 0.121 0.151 0.181 0.211 0.241\nr\n-0.04\n-0.02\n0.00\n0.02\n0.04\n0.06\nError (r r)\nrobl\nrmash\nrwu\nrpp\nrpc\nrwi\nrcc\n(C)\n0.001 0.031 0.061 0.091 0.121 0.151 0.181 0.211 0.241\nr\n-0.06\n-0.04\n-0.02\n0.00\n0.02\n0.04\n0.06\nError (r r)\nrobl\nrmash\nrwu\nrpp\nrpc\nrwi\nrcc\nFig. S4: Comparison of all estimators on the(A)D-easy,(B)D-med, and(C)D-hard datasets. Thekvalues used are as shown in\nTable S1, i.e. for D-easy,k= 20, for D-med,k= 20, and for D-hard,k= 10.\n.CC-BY 4.0 International licensemade available under a \n(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is \nThe copyright holder for this preprintthis version posted April 5, 2026. ; https://doi.org/10.64898/2026.04.01.715966doi: bioRxiv preprint \n\nEstimators of substitution rates 5\n0.001 0.121 0.241 0.361\nr\nr_obl\n0.001 0.121 0.241 0.361\nr\nr_mash\n0.001 0.121 0.241 0.361\nr\nr_wu\n0.001 0.121 0.241 0.361\nr\nr_pp\n0.001 0.121 0.241 0.361\nr\nr_pc\n0.001 0.121 0.241 0.361\nr\nr_wi\n8 16 24 32\nk\n0.001 0.121 0.241 0.361\nr\nr_cc\n0.0 0.2 0.4 0.6 0.8 1.0\nError\n0.001 0.061 0.151 0.241\nr\nr_obl\n0.001 0.061 0.151 0.241\nr\nr_mash\n0.001 0.061 0.151 0.241\nr\nr_wu\n0.001 0.061 0.151 0.241\nr\nr_pp\n0.001 0.061 0.151 0.241\nr\nr_pc\n0.001 0.061 0.151 0.241\nr\nr_wi\n8 16 24 32\nk\n0.001 0.061 0.151 0.241\nr\nr_cc\n0.0 0.2 0.4 0.6 0.8 1.0\nError\n0.001 0.061 0.151 0.241\nr\nr_obl\n0.001 0.061 0.151 0.241\nr\nr_mash\n0.001 0.061 0.151 0.241\nr\nr_wu\n0.001 0.061 0.151 0.241\nr\nr_pp\n0.001 0.061 0.151 0.241\nr\nr_pc\n0.001 0.061 0.151 0.241\nr\nr_wi\n8 12 18 24\nk\n0.001 0.061 0.151 0.241\nr\nr_cc\n0.0 0.2 0.4 0.6 0.8 1.0\nError\nFig. S5: Heatmap comparison of all estimators on the D-easy (left column), D-med (middle column), and D-hard datasets (right column).\n.CC-BY 4.0 International licensemade available under a \n(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is \nThe copyright holder for this preprintthis version posted April 5, 2026. ; https://doi.org/10.64898/2026.04.01.715966doi: bioRxiv preprint","source_license":"CC-BY-4.0","license_restricted":false}