Methods
to infer human population size history have typically focused on studying evolution more recent24
than 1-2 million years ago (Mya). The mean coalescence time in humans is ∼1Mya, indicating that there25
should be information within a present day genome sequence to probe more ancient time scales. Insights26
into population size history beyond this period would be particularly interesting, as it may elucidate the27
emergence of the genera Homo (∼3Mya [10, 11]) and Australopithicus (∼4.5Mya [12]), or even the ancestral28
human-chimpanzee population (∼6Mya [13]). The inference of N(t) from some methods, including PSMC,29
do extend to this period, however it is typically not discussed in the existing literature. This may be because30
it has until now been unclear how robust inference is in those time frames, and in particular how badly the31
model is affected when its assumptions are violated.32
1
.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted October 19, 2024. ; https://doi.org/10.1101/2024.10.17.618932doi: bioRxiv preprint
Another set of methods uses cross-species analyses, in this case of present day sequences of humans, chim-33
panzees, and gorillas, to analyse the way in which these populations diverged. Methods such as CoalHMM34
and its successors employ coalescent hidden Markov model approaches, in which the hidden states corre-35
spond to the order of coalescent events, as well sometimes as ancestral coalescence times [14, 15, 16, 13, 17].36
Typically the model parameters to be estimated are the divergence times and ancestral population sizes,37
which in turn are informative about the amount of incomplete lineage sorting (ILS) or gene flow. Some of38
these studies have suggested that the human-chimp speciation was “complex”, in that after initial popula-39
tion divergence there was a period of gene flow before final separation [18, 19, 13, 20], though others have40
suggested a clean split scenario [21, 22]. Many of these analyses make coarse assumptions about population41
size changes in the ancestral populations, and typically assume these were panmictic. Typically the size of42
the human-chimpanzee ancestral population is estimated to be very large, for example this is estimated as43
167,400 in [17] and 177,368 in [23], both of which are more than 8 times the long-term effective population44
size of humans or chimpanzee. It is plausible that the size of the human-chimpanzee ancestral population45
really was extremely large, however it may be that this population was structured, which leads to inflated46
estimates of population size when assuming panmixia [24]. In this article we examine the changes in size of47
the ancestral human/chimp population in higher resolution and with fewer assumptions, finding a similar48
magnitude to that inferred in recent time.49
50
So far existing methods have offered limited population genetic insight into human evolution between51
1-2Mya and the divergence of the ancestral human/chimpanzee population around 6Mya [14, 13, 25]. Here,52
we examine the potential to broaden the scope of coalescent-based population size history inference beyond53
two million years ago. Specifically, we discuss an ancient inferred peak in effective population size around 5-54
7Mya seen in PSMC analysis of human and chimpanzee genomes, but not in gorilla or orangutan. While this55
peak has been partially observed in several previous studies in humans and chimpanzees [2, 8, 9, 16, 26, 27],56
it has received little attention in the literature. We present a revised analysis of this peak using modern data57
and discuss its potential implications for understanding the evolutionary history of humans and other great58
apes. We take care to address potential model violations and other factors that may affect the reliability of59
these inferences. Overall, our study suggests that we may be able to significantly extend the applicability of60
coalescent-based inference much further into the past, generating new insights into the complex history of61
hominin evolution.62
3 Results63
3.1 An ancient peak in human and chimpanzee population size history64
We separately inferred an effective population size (N (t)) curve over time for 26 single diploid genome se-65
quences, each from a distinct population in the 1000 Genomes Project [28, 29, 27, 30] (1000GP) using PSMC66
(Figure 1a). Importantly, we extended previous similar analyses by fitting the model with fewer parameter67
constraints [31] and using the high coverage (30x) whole-genome sequencing data resource [30], allowing68
greater resolution at time scales >2Mya (see Methods). The resulting N(t) estimates start between 5 kya69
and end around 10Mya, which is deeper in the past than in previous studies [2, 8, 9, 26]. In all human samples70
two peaks can be seen, one around the time of appearance of modern humans approximately 200kya and a71
second older one, starting around 2Mya going backwards in time, reaching a maximum around 5.5Mya with72
2
.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted October 19, 2024. ; https://doi.org/10.1101/2024.10.17.618932doi: bioRxiv preprint
(a)
104
105
106
107
Years
0
10000
20000
30000
40000
50000
60000
70000
80000
N(t)
GBR
TSI
IBS
FIN
CEU
CHS
KHV
CHB
CDX
JPT
BEB
PJL
GIH
STU
ITU
PUR
CLM
PEL
MXL
ESN
YRI
MSL
GWD
ACB
ASW
LWK (b)
104
105
106
107
Years
0
10000
20000
30000
40000
50000
60000
70000
80000
N(t)
Bonobo
Nigeria-Cameroon Chimpanzee
Eastern Chimpanzee
(c)
104
105
106
107
Years
0
20000
40000
60000
80000
100000
120000
140000
N(t)
Eastern Lowland
Cross River
Western Lowland (d)
104
105
106
107
Years
0
20000
40000
60000
80000
100000
120000
140000
N(t)
Sumatran
Bornean
Figure 1: Inference from PSMC on humans, chimpanzees and bonobos, gorillas, and orangutans. (a)
Estimates of population size history on humans, using 26 populations from the 1000 Genomes project, using
one individual per population. (b) Estimates of population size history on chimpanzees and bonobos, using
8 Nigeria-Cameroon chimpanzees, 5 Eastern chimpanzees, and 11 bonobos. (c) Estimates of population size
history on gorillas, using 3 Eastern lowland individuals, 23 Western lowland individuals, and 1 Cross River
individual. (d) Estimates of the population size history on orangutans, using 5 Sumatran individuals and 5
Bornean individuals.
a population size of ∼37,500, then decreasing until around 10Ma. While the ancient peak is partially visible73
in previous whole-genome analyses [2, 9, 26], it is frequently truncated or represented with poor resolution74
because of the way time intervals were specified. 30 iterations of block bootstrapping reveals there is less75
variance in N(t) in ancient time than there is recent time (Figure A1).76
77
As the ancient peak occurs around the estimated period of human and chimpanzee species separation,78
it is natural to ask if a similar signal exists in chimpanzee population size history, or in other great apes79
in general. We therefore used PSMC to infer the population size history from 79 great ape genomes, as in80
[16] (see Methods). First, we recovered an ancient peak in chimpanzees and bonobos, overlapping the peak81
seen in the human analysis, though perhaps with its maximum slightly older (Figure 1b). This extends a82
similar signal further back in time, present in previous analyses but not discussed [16]. As with humans,83
our re-analysis produces the ancient peak more clearly because we avoid grouping old time intervals. We84
3
.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted October 19, 2024. ; https://doi.org/10.1101/2024.10.17.618932doi: bioRxiv preprint
note that the effective population size of the chimpanzee ancient peak is significantly higher than that of85
humans. The fact that the peaks observed in humans and chimpanzees align with the time frame of their86
estimated divergence prompts the question of whether these peaks signify a shared history between the two87
species. Indeed, as we discuss in the Appendix, correcting for mutation rate variations in time, as well as88
for variable generation times, in principle could align the two population size curves, both in their time89
period and their scale. However, uncertainty in these parameters, as well as in curve estimation, precludes90
a confident conclusion.91
92
For gorillas (Figure 1c) and orangutans (Figure 1d), we do not detect a discernible peak in the region93
between 5-10Mya. This is consistent with the fact that gorillas and orangutans genetically separated from94
humans and chimpanzees prior to the 5-7Mya time period, and so would not be expected to share their95
history at that time. Moreover, it supports the claim that the ancient peak is not an artefact of the method96
or of the data processing, as that would result in a peak with gorillas and orangutans as well.97
3.2 The fraction of uncoalesced genome as a function of time98
The power of PSMC to infer N(t) at 5.5Mya relies on there being sufficiently many places in the genome99
where the two input lineages coalesce more anciently than this time. To assess whether this is reasonable to100
expect, we note that under a simple panmictic model the observed range of heterozygosity in humans (e.g.,101
∼0.00069 for PEL to ∼0.001 for MSL in 1000GP) would correspond to a constant effective diploid population102
size of 13,800-20,000, using a mutation rate of 1.25e-8 per base pair per generation. Given an exponential103
distribution of coalescence times, and a generation time of 30 years, we expect a mean coalescence time104
of 828,000-1,200,000 years, and that the coalescence time will be older than 5.5Mya in 0.12%-1.01% of the105
genome, which would be sufficient to provide plenty of information for this period, in particular taking note106
that older regions of coalescence are correspondingly shorter than younger ones.107
108
As more direct confirmation from the data, we inferred the fraction of the genome that has not yet109
coalesced as a function of time for the 1000GP samples, using PSMC decoding (see Methods). At least 1%110
of the genome is estimated not to have coalesced by 5Mya, and approximately 0.1% by 10Mya (Figure 2a).111
Ancient regions tend to be shared among individuals (Figure 2b); e.g. positions with an inferred posterior112
mean TMRCA of over 3Mya show an average correlation of r2=0.21 across individuals (Figure A2). We113
note that further back in time, segments of shared ancestry become increasingly shorter because of longer114
exposure to recombination, and that these are scattered across the genome. For example we estimate that115
African individuals harbor roughly 4,000 non-contiguous segments whose coalescence time is older than 4Mya116
(Table 1; see Methods). We note that previous simulation studies [32, 33] and theoretical analysis [34] have117
shown that posterior mean estimates of true ancient coalescent times substantially older than the mean tend118
to be significantly downward biased, which may indicate an even older age for detected ancient regions.119
We repeated the analysis using an independent method, Relate [35], which does not rely on the PSMC120
assumptions and uses multiple sequences jointly; this also identifies sufficient coalescence material exists121
in this time period (see Methods and Figure A3). We further note that if we alter the time discretisation122
parameters in PSMC, then the inference of humanN(t) is almost consistent in each population until∼20Mya123
(Figure A4), at which point it begins to fray as likely the genome is fully coalesced.124
125
4
.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted October 19, 2024. ; https://doi.org/10.1101/2024.10.17.618932doi: bioRxiv preprint
(a)
0 2.5M 5M 7.5M 10M 12.5M 15M
Years
0.001%
0.01%
0.1%
1%
10%
100%
Percentage uncoalesced
GBR
TSI
IBS
FIN
CEU
CHS
KHV
CHB
CDX
JPT
BEB
PJL
GIH
STU
ITU
PUR
CLM
PEL
MXL
ESN
YRI
MSL
GWD
ACB
ASW
LWK
(b)
Figure 2: (a) Inferred fraction of uncoalesced genome as a function of time, for the 26 individuals in Figure
1a. (b) (Top) Posterior mean TMRCAs for each population, for 20Mb on chromosome 20. The bottom
panel shows the mean of the top panel.
3.3 Model violations and their effects126
As with all statistical frameworks, the PSMC method relies on specific modelling assumptions that may127
not hold true in reality. Deviations from these assumptions could have varying impacts on inference across128
different temporal scales, and could potentially account for some or all of the observed ancient peak. We129
therefore considered how departures from the modelling assumptions could plausibly affect inference of pop-130
ulation size history in the deep past.131
132
To this end, we considered the following possible issues. First, repeats may manifest as an artifactual133
heterozygous signal, leading to long segments with high heterozygosity, which then leads to an apparent134
5
.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted October 19, 2024. ; https://doi.org/10.1101/2024.10.17.618932doi: bioRxiv preprint
Population Sample 1Mya 2Mya 3Mya 4Mya 5Mya 6Mya
ESN HG03515 54,169 17,213 7,811 3,952 1,841 472
YRI NA18488 54,486 17,229 7,668 3,841 1,752 426
MSL HG03212 53,817 17,243 7,661 3,910 1,947 494
GWD HG02568 54,493 16,893 7,358 3,667 1,721 397
ACB HG01882 55,368 17,590 7,883 4,129 1,909 457
ASW NA19625 54,974 17,395 7,798 3,984 1,852 439
LWK NA19017 54,644 17,709 8,105 4,050 1,874 496
Table 1: The number of ancient non-contiguous segments inferred in different African populations from the
1000GP project. We count segments by looking for regions where the posterior probability of coalescence as
old or older than the time given in each column is greater than 0.9.
excess of ancient coalescence events. Second, variation in mutation rates across different genomic regions135
could cause some regions to have a higher density of mutations leading to an interpretation as having a more136
ancient TMRCA, and hence systematic biases in the inference of accurate evolutionary histories. Third,137
the human mutation rate is estimated to have slowed down over time [36, 37, 38, 39], while in PSMC it is138
assumed constant. Fourth, variability in recombination rates across the genome (in particular at recombi-139
nation hotspots) is also a departure from model assumptions [40, 41, 42, 43, 44]. Fifth, at longer timescales140
there may become a significant probability of recurrent mutations at the same site, which are not modelled141
in PSMC. Sixth, background selection, which has been estimated to be pervasive in humans [45, 46], has142
been shown to distort the inference of the coalescence rate [47, 48, 49]. Finally, balancing selection can also143
manifest as an excess in ancient coalescence events.144
145
We conducted an extensive set of simulation studies to analyse how these violations may affect inference,146
which we describe fully in the Appendix). We concluded that it is unlikely any of these issues or violations147
could cause a false ancient peak of the type we see in real data. Notably, we observed that if there are148
large variations in the mutation rate across the genome, then false inflations of inferred N(t) are observed.149
However, the degree of rate variation required to generate this is larger than the degree of this inferred in150
humans, and moreover their ancient inferred N(t) has high variance, contrasting the low variance we have151
shown in humans in the ∼5Mya peak.152
4 Discussion153
We have demonstrated the applicability of coalescent-based inference to examine deep evolutionary history154
(five million years ago and beyond) in the hominin lineage. Specifically, we highlighted and extended the155
analysis of a peak in ancient population size history inferred by PSMC, which is reliably estimated from156
human, chimpanzee and bonobo data, though not found in gorillas or orangutans. The human, chimpanzee157
and bonobo peaks span approximately the same time period, suggesting that they may reflect the same158
event prior to, or during, human-chimpanzee speciation. We discussed several ways in which the data may159
violate the model underlying our analysis, and concluded that none are likely to generate a false ancient peak.160
161
A straightforward interpretation of an ancient peak is that it reflects true changes in population size.162
Indeed, for humans, changes in inferred N(t) since ∼80kya have been associated with the out-of-Africa event163
followed by more recent recent population size increases [3, 6]. The ancient peak may similarly reflect ancient164
6
.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted October 19, 2024. ; https://doi.org/10.1101/2024.10.17.618932doi: bioRxiv preprint
hominin population expansion and decline in the late miocene and the pliocene. To the extent the same165
ancient peak exists in chimpanzees and bonobos, it may reflect common environmental conditions affecting166
similar evolutionary processes in the different species.167
168
However, there are other possible explanations. The separation and reconnection of local populations169
alters the amount of possible gene flow between them, which affects the inferred N(t) [2, 50, 24, 51]. In a170
structured population, the coalescence rate decreases relative to a panmictic population of the same size,171
thus the effective population size (taken here as the inverse of the coalescence rate) increases. Indeed, in172
[24] the authors show that the inferred human effective population size as inferred by PSMC, including the173
ancient peak, can be equally well obtained by a set of populations with varying migration rates between174
them without any changes in size. Therefore, it is possible that the ancient peak is a signature of ancient175
population structure or other departures from panmixia.176
177
One such possible structured event may be complex speciation between humans and chimpanzees, in178
which after an initial split the two populations rejoined or exchanged genes for a period of time. Previous179
publications have put forward evidence for this type of model [18, 19, 13, 20], although other analysis has180
favoured a model with a clean split [21, 22]. Innan and Watanabe proposed there was no strong support181
for gene flow after divergence, and that a clean split best fit the data [21]. This result was reinforced by182
Yamamichi, whose maximum likelihood model also favoured a clean split [22]. Patterson et al. proposed183
that gene flow did occur, because of differences in the estimated divergence time between the autosomes and184
X chromosome [18], although this approach has been criticised [52, 53, 22]. Using a likelihood ratio test185
considering divergence times across the genome, Yang rejected the null hypothesis of an absence of gene flow186
[19]. With an HMM that explicitly tests for immediate or prolonged separation, Mailund et al. proposed187
that a model with an extended period of gene flow better fits the data [13]. Galtier introduced Aphid, which188
distinguishes between discordant coalescent trees (ones that do not match the species trees) generated by189
ILS or gene flow [20]. It leverages the fact that multispecies coalescent trees affected by gene flow tend to190
have shorter branches, and coalescent trees affected by incomplete lineage sorting longer branches, than the191
average coalescent tree. This information provided strong support for a model in which there was ancient192
gene flow between human, chimpanzee, and even gorilla, after the initial human-chimpanzee split.193
194
If the peak we observed in chimpanzees and bonobos is indeed reliable and aligned with the ancient195
peak observed in humans, this may be consistent with a period of ongoing gene flow after initial human and196
chimpanzee separation. Qualitatively, there are other observations which make this type of model attrac-197
tive. Notably, CoalHMM [23] and TRAILS [17] assume a panmictic ancestral human-chimpanzee population198
and estimate its size as 177,368 and 167,400, respectively. Both of these estimates are more than 8 times199
the long-term effective population size of humans and chimpanzees (in both species, θ = 4N µ is roughly200
0.001, which assuming µ=1.25e-8 gives N=20,000), and are not well aligned with the PSMC inference in201
this period (Figure 1). A structured ancestral population could explain the large CoalHMM and TRAILS202
estimates, as falsely assuming panmixia can inflate estimates of population size [24]. Moreover, Aphid [20],203
which provides strong support for human-chimp-gorilla gene flow, estimates the size of the ancestral human-204
chimpanzee population to be of similar size to present day human, chimpanzee and bonobo populations.205
Additionally, chimpanzees and bonobos have been estimated to split cleanly around 2Mya [13], and PSMC206
does not infer a unique peak around this period. Finally, widespread structure in the ancestral populations of207
7
.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted October 19, 2024. ; https://doi.org/10.1101/2024.10.17.618932doi: bioRxiv preprint
humans, chimpanzees/bonobos, and gorillas, may plausibly explain why fossils such as Sahelanthropus from208
this period are hard to assign [54, 55, 56], and as a consequence morphological analyses can not confidently209
determine separation times [57].210
211
We examined the sensitivity of PSMC to various modelling violations such as variation in mutation212
and recombination rates, and selection. It will be useful to incorporate these as extensions to the PSMC213
model, modifying the transition and emission probabilities within the underlying HMM accordingly. Indeed,214
some work has already been done towards this goal [58, 59], and future modification could further address215
slowdown or genomic variability in mutation rate, recombination hotspots, and/or selection, so as to better216
understand the ancient speciation process in primates [17] and indeed other species.217
Acknowledgements218
We thank Aylwyn Scally for constructive comments. T.C. was funded by a Wellcome Postgraduate Stu-219
dentship 108864/B/15/Z, and R.D. and R.S. by Wellcome Investigator Award 207492/Z/17/Z. For the220
purpose of open access, the author has applied a CC BY public copyright licence to any Author Accepted221
Manuscript version arising from this submission.222
References223
[1] Brian Charlesworth. Effective population size and patterns of molecular evolution and variation.Nature224
Reviews Genetics, 10(3):195–205, 2009.225
[2] Heng Li and Richard Durbin. Inference of human population history from individual whole-genome226
sequences. Nature, 475(7357):493–496, 2011.227
[3] Stephan Schiffels and Richard Durbin. Inferring human population size and separation history from228
multiple genome sequences. Nature genetics, 46(8):919–925, 2014.229
[4] Sara Sheehan, Kelley Harris, and Yun S Song. Estimating variable effective population sizes from multi-230
ple genomes: a sequentially markov conditional sampling distribution approach. Genetics, 194(3):647–231
662, 2013.232
[5] Matthias Steinr¨ ucken, Joshua S Paul, and Yun S Song. A sequentially markov conditional sampling dis-233
tribution for structured populations with migration and recombination. Theoretical population biology,234
87:51–61, 2013.235
[6] Jack Kamm, Jonathan Terhorst, Richard Durbin, and Yun S Song. Efficiently inferring the demo-236
graphic history of many populations with allele count data. Journal of the American Statistical Asso-237
ciation, 115(531):1472–1487, 2020.238
[7] Donna Henderson, Sha Zhu, Christopher B Cole, and Gerton Lunter. Demographic inference from239
multiple whole genomes using a particle filter for continuous markov jump processes. Plos one ,240
16(3):e0247647, 2021.241
8
.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted October 19, 2024. ; https://doi.org/10.1101/2024.10.17.618932doi: bioRxiv preprint
[8] Matthias Meyer, Martin Kircher, Marie-Theres Gansauge, Heng Li, Fernando Racimo, Swapan Mallick,242
Joshua G Schraiber, Flora Jay, Kay Pr¨ ufer, Cesare De Filippo, et al. A high-coverage genome sequence243
from an archaic denisovan individual. Science, 338(6104):222–226, 2012.244
[9] Kay Pr¨ ufer, Fernando Racimo, Nick Patterson, Flora Jay, Sriram Sankararaman, Susanna Sawyer,245
Anja Heinze, Gabriel Renaud, Peter H Sudmant, Cesare De Filippo, et al. The complete genome246
sequence of a neanderthal from the altai mountains. Nature, 505(7481):43–49, 2014.247
[10] Brian Villmoare, William H Kimbel, Chalachew Seyoum, Christopher J Campisano, Erin N DiMaggio,248
John Rowan, David R Braun, J Ram´ on Arrowsmith, and Kaye E Reed. Early homo at 2.8 ma from249
ledi-geraru, afar, ethiopia. Science, 347(6228):1352–1355, 2015.250
[11] Erin N DiMaggio, Christopher J Campisano, John Rowan, Guillaume Dupont-Nivet, Alan L Deino,251
Faysal Bibi, Margaret E Lewis, Antoine Souron, Dominique Garello, Lars Werdelin, et al. Late pliocene252
fossiliferous sedimentary record and the environmental context of early homo from afar, ethiopia.253
Science, 347(6228):1355–1359, 2015.254
[12] Jason E Lewis, Carol V Ward, William H Kimbel, Casey L Kidney, Frank H Brown, Rhonda L Quinn,255
John Rowan, Ignacio A Lazagabaster, William J Sanders, Meave G Leakey, et al. A 4.3-million-year-256
old australopithecus anamensis mandible from ileret, east turkana, kenya, and its paleoenvironmental257
context. Journal of Human Evolution, 194:103579, 2024.258
[13] Thomas Mailund, Anders E Halager, Michael Westergaard, Julien Y Dutheil, Kasper Munch, Lars N259
Andersen, Gerton Lunter, Kay Pr¨ ufer, Aylwyn Scally, Asger Hobolth, et al. A new isolation with mi-260
gration model along complete genomes infers very different divergence processes among closely related261
great ape species. PLoS genetics, 8(12):e1003125, 2012.262
[14] Asger Hobolth, Ole F Christensen, Thomas Mailund, and Mikkel H Schierup. Genomic relationships263
and speciation times of human, chimpanzee, and gorilla inferred from a coalescent hidden markov264
model. PLoS genetics, 3(2):e7, 2007.265
[15] Julien Y Dutheil, Ganesh Ganapathy, Asger Hobolth, Thomas Mailund, Marcy K Uyenoyama, and266
Mikkel H Schierup. Ancestral population genomics: the coalescent hidden markov model approach.267
Genetics, 183(1):259–274, 2009.268
[16] Javier Prado-Martinez, Peter H Sudmant, Jeffrey M Kidd, Heng Li, Joanna L Kelley, Belen Lorente-269
Galdos, Krishna R Veeramah, August E Woerner, Timothy D O’connor, Gabriel Santpere, et al. Great270
ape genetic diversity and population history. Nature, 499(7459):471–475, 2013.271
[17] Iker Rivas-Gonz´ alez, Mikkel H Schierup, John Wakeley, and Asger Hobolth. Trails: Tree reconstruction272
of ancestry using incomplete lineage sorting. Plos Genetics , 20(2):e1010836, 2024.273
[18] Nick Patterson, Daniel J Richter, Sante Gnerre, Eric S Lander, and David Reich. Genetic evidence for274
complex speciation of humans and chimpanzees. Nature, 441(7097):1103–1108, 2006.275
[19] Ziheng Yang. A likelihood ratio test of speciation with gene flow using genomic sequence data. Genome276
biology and evolution , 2:200–211, 2010.277
[20] Nicolas Galtier. An approximate likelihood method reveals ancient gene flow between human, chim-278
panzee and gorilla. Peer Community Journal, 4, 2024.279
9
.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted October 19, 2024. ; https://doi.org/10.1101/2024.10.17.618932doi: bioRxiv preprint
[21] Hideki Innan and Hidemi Watanabe. The effect of gene flow on the coalescent time in the human-280
chimpanzee ancestral population. Molecular biology and evolution , 23(5):1040–1047, 2006.281
[22] Masato Yamamichi, Jun Gojobori, and Hideki Innan. An autosomal analysis gives no genetic evidence282
for complex speciation of humans and chimpanzees. Molecular biology and evolution , 29(1):145–156,283
2012.284
[23] Iker Rivas-Gonz´ alez, Marjolaine Rousselle, Fang Li, Long Zhou, Julien Y Dutheil, Kasper Munch,285
Yong Shao, Dongdong Wu, Mikkel H Schierup, and Guojie Zhang. Pervasive incomplete lineage sorting286
illuminates speciation and selection in primates. Science, 380(6648):eabn4409, 2023.287
[24] Olivier Mazet, Willy Rodr´ ıguez, Simona Grusea, Simon Boitard, and Loun` es Chikhi. On the impor-288
tance of being structured: instantaneous coalescence rates and human evolution—lessons for ancestral289
population size inference? Heredity, 116(4):362–371, 2016.290
[25] Naoyuki Takahata and Yoko Satta. Evolution of the primate lineage leading to modern humans:291
phylogenetic and demographic inferences from dna sequences. Proceedings of the National Academy of292
Sciences, 94(9):4811–4815, 1997.293
[26] Qiaomei Fu, Heng Li, Priya Moorjani, Flora Jay, Sergey M Slepchenko, Aleksei A Bondarev, Philip LF294
Johnson, Ayinuer Aximu-Petri, Kay Pr¨ ufer, Cesare De Filippo, et al. Genome sequence of a 45,000-295
year-old modern human from western siberia. Nature, 514(7523):445–449, 2014.296
[27] 1000 Genomes Project Consortium et al. A global reference for human genetic variation. Nature,297
526(7571):68, 2015.298
[28] 1000 Genomes Project Consortium et al. A map of human genome variation from population scale299
sequencing. Nature, 467(7319):1061, 2010.300
[29] 1000 Genomes Project Consortium et al. An integrated map of genetic variation from 1,092 human301
genomes. Nature, 491(7422):56, 2012.302
[30] Marta Byrska-Bishop, Uday S Evani, Xuefang Zhao, Anna O Basile, Haley J Abel, Allison A Regier,303
Andr´ e Corvelo, Wayne E Clarke, Rajeeva Musunuri, Kshithija Nagulapalli, et al. High-coverage304
whole-genome sequencing of the expanded 1000 genomes project cohort including 602 trios. Cell,305
185(18):3426–3440, 2022.306
[31] Leon Hilgers, Shenglin Liu, Axel Jensen, Thomas Brown, Trevor Cousins, Regev Schweiger, Katerina307
Guschanski, and Michael Hiller. Avoidable false psmc population size peaks occur across numerous308
studies. bioRxiv, pages 2024–06, 2024.309
[32] D´ ebora YC Brandt, Xinzhu Wei, Yun Deng, Andrew H Vaughn, and Rasmus Nielsen. Evalua-310
tion of methods for estimating coalescence times using ancestral recombination graphs. Genetics,311
221(1):iyac044, 2022.312
[33] Regev Schweiger and Richard Durbin. Ultrafast genome-wide inference of pairwise coalescence times.313
Genome Research, 33(7):1023–1031, 2023.314
[34] Yun S Song. Lecture notes on computational and mathematical population genetics. 2021.315
10
.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted October 19, 2024. ; https://doi.org/10.1101/2024.10.17.618932doi: bioRxiv preprint
[35] Leo Speidel, Marie Forest, Sinan Shi, and Simon R Myers. A method for genome-wide genealogy316
estimation for thousands of samples. Nature genetics, 51(9):1321–1329, 2019.317
[36] Priya Moorjani, Ziyue Gao, and Molly Przeworski. Human germline mutation and the erratic evolu-318
tionary clock. PLoS biology, 14(10):e2000744, 2016.319
[37] Aylwyn Scally and Richard Durbin. Revising the human mutation rate: implications for understanding320
human evolution. Nature Reviews Genetics, 13(10):745–753, 2012.321
[38] V Yi Soojin. Morris goodman’s hominoid rate slowdown: the importance of being neutral. Molecular322
phylogenetics and evolution, 66(2):569–574, 2013.323
[39] Manjusha Chintalapati and Priya Moorjani. Evolution of the mutation rate across primates. Current324
opinion in genetics & development, 62:58–64, 2020.325
[40] S Myers, C C A Spencer, A Auton, L Bottolo, C Freeman, P Donnelly, and G McVean. The distribution326
and causes of meiotic recombination in the human genome. Biochem. Soc. Trans., 34(Pt 4):526–530,327
2006.328
[41] International HapMap Consortium. The International HapMap Project. Nature, 426(6968):789–796,329
2003.330
[42] Thomas D Petes. Meiotic recombination hot spots and cold spots. Nature Reviews Genetics, 2(5):360–331
369, 2001.332
[43] KT Nishant, Chetan Kumar, and MRS Rao. Humhot: a database of human meiotic recombination333
hot spots. Nucleic acids research, 34(suppl 1):D25–D28, 2006.334
[44] Laure S´ egurel, Ellen Miranda Leffler, and Molly Przeworski. The case of the fickle fingers: how335
the prdm9 zinc finger protein specifies meiotic recombination hotspots in humans. PLoS biology ,336
9(12):e1001211, 2011.337
[45] Graham McVicker, David Gordon, Colleen Davis, and Phil Green. Widespread genomic signatures of338
natural selection in hominid evolution. PLoS genetics, 5(5):e1000471, 2009.339
[46] David A Murphy, Eyal Elyashiv, Guy Amster, and Guy Sella. Broad-scale variation in human genetic340
diversity levels is predicted by purifying selection on coding and non-coding elements. Elife, 12:e76065,341
2022.342
[47] Daniel R Schrider, Alexander G Shanku, and Andrew D Kern. Effects of linked selective sweeps on343
demographic inference and model selection. Genetics, 204(3):1207–1223, 2016.344
[48] Parul Johri, Kellen Riall, Hannes Becher, Laurent Excoffier, Brian Charlesworth, and Jeffrey D Jensen.345
The impact of purifying and background selection on the inference of population history: problems346
and prospects. Molecular biology and evolution , 38(7):2986–3003, 2021.347
[49] Simon Boitard, Armando Arredondo, Loun` es Chikhi, and Olivier Mazet. Heterogeneity in effective348
size across the genome: effects on the inverse instantaneous coalescence rate (iicr) and implications for349
demographic inference under linked selection. Genetics, 220(3):iyac008, 2022.350
11
.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted October 19, 2024. ; https://doi.org/10.1101/2024.10.17.618932doi: bioRxiv preprint
[50] Olivier Mazet, Willy Rodr´ ıguez, and Loun` es Chikhi. Demographic inference using genetic data from351
a single individual: Separating population size variation from population structure. Theoretical Pop-352
ulation Biology, 104:46–58, 2015.353
[51] Trevor Cousins, Aylwyn Scally, and Richard Durbin. A structured coalescent model reveals deep354
ancestral structure shared by all modern humans. bioRxiv, pages 2024–03, 2024.355
[52] John Wakeley. Complex speciation of humans and chimpanzees. Nature, 452(7184):E3–E4, 2008.356
[53] Daven C Presgraves and V Yi Soojin. Doubts about complex speciation between humans and chim-357
panzees. Trends in ecology & evolution , 24(10):533–540, 2009.358
[54] Michel Brunet, Franck Guy, David Pilbeam, Hassane Taisso Mackaye, Andossa Likius, Djimdoumal-359
baye Ahounta, Alain Beauvilain, C´ ecile Blondel, Herv´ e Bocherens, Jean-Renaud Boisserie, et al. A360
new hominid from the upper miocene of chad, central africa. Nature, 418(6894):145–151, 2002.361
[55] Milford H Wolpoff, Brigitte Senut, Martin Pickford, and John Hawks. Sahelanthropus362
or’sahelpithecus’? Nature, 419(6907):581–582, 2002.363
[56] Michel Brunet. Sahelanthropus or’sahelpithecus’? Nature, 419(6907):582–582, 2002.364
[57] Bernard Wood and Terry Harrison. The evolutionary context of the first hominins. Nature,365
470(7334):347–352, 2011.366
[58] Trevor Cousins, Daniel Tabin, Nick Patterson, David Reich, and Arun Durvasula. Accurate inference367
of population history in the presence of background selection. bioRxiv, 2024.368
[59] Gustavo V. Barroso, Nataˇ sa Puzovi´ c, and Julien Y Dutheil. Inference of recombination maps from a369
single pair of genomes and its application to ancient samples. PLoS genetics, 15(11):e1008449, 2019.370
[60] Petr Danecek, James K Bonfield, Jennifer Liddle, John Marshall, Valeriu Ohan, Martin O Pollard,371
Andrew Whitwham, Thomas Keane, Shane A McCarthy, Robert M Davies, et al. Twelve years of372
samtools and bcftools. Gigascience, 10(2):giab008, 2021.373
[61] Heng Li, Bob Handsaker, Alec Wysoker, Tim Fennell, Jue Ruan, Nils Homer, Gabor Marth, Goncalo374
Abecasis, Richard Durbin, and 1000 Genome Project Data Processing Subgroup. The sequence align-375
ment/map format and samtools. Bioinformatics, 25(16):2078–2079, 2009.376
[62] Heng Li. A statistical framework for SNP calling, mutation discovery, association mapping and popu-377
lation genetical parameter estimation from sequencing data. Bioinformatics, 27(21):2987–2993, 2011.378
[63] Mois` es Coll Maci` a, Laurits Skov, Benjamin Marco Peter, and Mikkel Heide Schierup. Different histori-379
cal generation intervals in human populations inferred from neanderthal fragment lengths and mutation380
signatures. Nature Communications, 12(1):5317, 2021.381
[64] Jared C Roach, Gustavo Glusman, Arian FA Smit, Chad D Huff, Robert Hubley, Paul T Shannon, Lee382
Rowen, Krishna P Pant, Nathan Goodman, Michael Bamshad, et al. Analysis of genetic inheritance383
in a family quartet by whole-genome sequencing. Science, 328(5978):636–639, 2010.384
12
.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted October 19, 2024. ; https://doi.org/10.1101/2024.10.17.618932doi: bioRxiv preprint
[65] Philip Awadalla, Julie Gauthier, Rachel A Myers, Ferran Casals, Fadi F Hamdan, Alexander R Griffing,385
M´ elanie Cˆ ot´ e, Edouard Henrion, Dan Spiegelman, Julien Tarabeux, et al. Direct measure of the de386
novo mutation rate in autism and schizophrenia cohorts. The American Journal of Human Genetics,387
87(3):316–324, 2010.388
[66] Raheleh Rahbari, Arthur Wuster, Sarah J Lindsay, Robert J Hardwick, Ludmil B Alexandrov, Saeed389
Al Turki, Anna Dominiczak, Andrew Morris, David Porteous, Blair Smith, et al. Timing, rates and390
spectra of human germline mutation. Nature genetics, 48(2):126–133, 2016.391
[67] Augustine Kong, Michael L Frigge, Gisli Masson, Soren Besenbacher, Patrick Sulem, Gisli Magnusson,392
Sigurjon A Gudjonsson, Asgeir Sigurdsson, Aslaug Jonasdottir, Adalbjorg Jonasdottir, et al. Rate of393
de novo mutations and the importance of father’s age to disease risk. Nature, 488(7412):471–475, 2012.394
[68] Jacob J Michaelson, Yujian Shi, Madhusudan Gujral, Hancheng Zheng, Dheeraj Malhotra, Xin Jin,395
Minghan Jian, Guangming Liu, Douglas Greer, Abhishek Bhandari, et al. Whole-genome sequencing396
in autism identifies hot spots for de novo germline mutation. Cell, 151(7):1431–1442, 2012.397
[69] 1000 Genomes Project Consortium et al. Variation in genome-wide mutation rates within and between398
human families. Nature genetics, 43(7):712–714, 2011.399
[70] Aylwyn Scally. The mutation rate in human evolution and demographic inference. Current opinion in400
genetics & development, 41:36–43, 2016.401
[71] Hans Ellegren, Nick GC Smith, and Matthew T Webster. Mutation rate variation in the mammalian402
genome. Current opinion in genetics & development , 13(6):562–568, 2003.403
[72] Alan Hodgkinson and Adam Eyre-Walker. Variation in the mutation rate across mammalian genomes.404
Nature reviews genetics, 12(11):756–766, 2011.405
[73] Thomas CA Smith, Peter F Arndt, and Adam Eyre-Walker. Large scale variation in the rate of406
germ-line de novo mutation, base composition, divergence and diversity in humans. PLoS genetics,407
14(3):e1007254, 2018.408
[74] Jerome Kelleher, Alison M Etheridge, and Gilean McVean. Efficient coalescent simulation and ge-409
nealogical analysis for large sample sizes. PLoS computational biology , 12(5):e1004842, 2016.410
[75] Laurent C Francioli, Paz P Polak, Amnon Koren, Androniki Menelaou, Sung Chun, Ivo Renkens,411
Genome of the Netherlands Consortium, Cornelia M van Duijn, Morris Swertz, Cisca Wijmenga, et al.412
Genome-wide patterns and properties of de novo mutations in humans. Nature genetics, 47(7):822–826,413
2015.414
[76] H´ akon J´ onsson, Patrick Sulem, Birte Kehr, Snaedis Kristmundsdottir, Florian Zink, Eirikur Hjartar-415
son, Marteinn T Hardarson, Kristjan E Hjorleifsson, Hannes P Eggertsson, Sigurjon Axel Gudjonsson,416
et al. Parental influence on human germline de novo mutations in 1,548 trios from iceland. Nature,417
549(7673):519–522, 2017.418
[77] Wendy SW Wong, Benjamin D Solomon, Dale L Bodian, Prachi Kothiyal, Greg Eley, Kathi C Huddle-419
ston, Robin Baker, Dzung C Thach, Ramaswamy K Iyer, Joseph G Vockley, et al. New observations420
on maternal age effect on germline de novo mutations. Nature communications, 7(1):10486, 2016.421
13
.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted October 19, 2024. ; https://doi.org/10.1101/2024.10.17.618932doi: bioRxiv preprint
[78] Miguel Rodriguez-Galindo, S` onia Casillas, Donate Weghorn, and Antonio Barbadilla. Germline de422
novo mutation rates on exons versus introns in humans. Nature communications, 11(1):3304, 2020.423
[79] Richard R Hudson. Properties of a neutral allele model with intragenic recombination. Theoretical424
population biology, 23(2):183–201, 1983.425
[80] S Myers, CCA Spencer, A Auton, L Bottolo, C Freeman, P Donnelly, and et G McVean. The distri-426
bution and causes of meiotic recombination in the human genome, 2006.427
[81] Richard A Gibbs, John W Belmont, Paul Hardenbol, Thomas D Willis, Fuli L Yu, HM Yang, Lan-Yang428
Ch’ang, Wei Huang, Bin Liu, Yan Shen, et al. The international hapmap project. 2003.429
[82] Simon Myers, Rory Bowden, Afidalina Tumian, Ronald E Bontrop, Colin Freeman, Tammie S MacFie,430
Gil McVean, and Peter Donnelly. Drive against hotspot motifs in primates implicates the prdm9 gene431
in meiotic recombination. Science, 327(5967):876–879, 2010.432
[83] Emil D Parvanov, Petko M Petkov, and Kenneth Paigen. Prdm9 controls activation of mammalian433
recombination hotspots. Science, 327(5967):835–835, 2010.434
[84] Kenneth Paigen and Petko M Petkov. Prdm9 and its role in genetic recombination. Trends in Genetics,435
34(4):291–300, 2018.436
[85] Corinne Grey, Fr´ ed´ eric Baudat, and Bernard de Massy. Prdm9, a driver of the genetic map. PLoS437
genetics, 14(8):e1007479, 2018.438
[86] International HapMap Consortium Altshuler David altshuler@ molbio. mgh. harvard. edu Donnelly439
Peter donnelly@ stats. ox. ac. uk. A haplotype map of the human genome. Nature, 437(7063):1299–440
1320, 2005.441
[87] International HapMap Consortium et al. A second generation human haplotype map of over 3.1 million442
snps. Nature, 449(7164):851, 2007.443
[88] TH Jukes. Evolution of protein molecules. Mammalian Protein Metabolism, 3, 1969.444
[89] Brian Charlesworth, MT Morgan, and Deborah Charlesworth. The effect of deleterious mutations on445
neutral molecular variation. Genetics, 134(4):1289–1303, 1993.446
[90] Magnus Nordborg, Brian Charlesworth, and Deborah Charlesworth. The effect of recombination on447
Background
selection. Genetics Research, 67(2):159–174, 1996.448
[91] Parul Johri, Susanne P Pfeifer, and Jeffrey D Jensen. Developing an evolutionary baseline model for449
humans: jointly inferring purifying selection with population history. Molecular biology and evolution ,450
40(5):msad100, 2023.451
[92] Jacob I Marsh and Parul Johri. Biases in arg-based inference of historical population size in populations452
experiencing selection. Molecular Biology and Evolution, page msae118, 2024.453
[93] Deborah Charlesworth. Balancing selection and its effects on sequences in nearby genome regions.454
PLoS genetics, 2(4):e64, 2006.455
14
.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted October 19, 2024. ; https://doi.org/10.1101/2024.10.17.618932doi: bioRxiv preprint
[94] KL Bubb, D Bovee, D Buckley, E Haugen, M Kibukawa, M Paddock, A Palmieri, S Subramanian,456
Y Zhou, R Kaul, et al. Scan of human genome reveals no new loci under ancient balancing selection.457
Genetics, 173(4):2165–2177, 2006.458
[95] Saurabh Asthana, Steffen Schmidt, and Shamil Sunyaev. A limited role for balancing selection. Trends459
in Genetics , 21(1):30–32, 2005.460
[96] Felix M Key, Jo˜ ao C Teixeira, Cesare de Filippo, and Aida M Andr´ es. Advantageous diversity main-461
tained by balancing selection in humans. Current opinion in genetics & development, 29:45–51, 2014.462
[97] Bora E Baysal, Elizabeth C Lawrence, and Robert E Ferrell. Sequence variation in human succinate463
dehydrogenase genes: evidence for long-term balancing selection on sdha. BMC biology, 5:1–14, 2007.464
[98] Aida M Andr´ es, Melissa J Hubisz, Amit Indap, Dara G Torgerson, Jeremiah D Degenhardt, Adam R465
Boyko, Ryan N Gutenkunst, Thomas J White, Eric D Green, Carlos D Bustamante, et al. Targets of466
balancing selection in the human genome. Molecular biology and evolution , 26(12):2755–2764, 2009.467
[99] Ellen M Leffler, Ziyue Gao, Susanne Pfeifer, Laure S´ egurel, Adam Auton, Oliver Venn, Rory Bowden,468
Ronald Bontrop, Jeffrey D Wall, Guy Sella, et al. Multiple instances of ancient balancing selection469
shared between humans and chimpanzees. Science, 339(6127):1578–1582, 2013.470
[100] Radoje Drmanac, Andrew B Sparks, Matthew J Callow, Aaron L Halpern, Norman L Burns, Bahram G471
Kermani, Paolo Carnevali, Igor Nazarenko, Geoffrey B Nilsen, George Yeung, et al. Human genome472
sequencing using unchained base reads on self-assembling dna nanoarrays. Science, 327(5961):78–81,473
2010.474
[101] Matthew D Rasmussen, Melissa J Hubisz, Ilan Gronau, and Adam Siepel. Genome-wide inference of475
ancestral recombination graphs. PLoS Genetics, 10(5):e1004342, 2014.476
[102] W James Kent, Charles W Sugnet, Terrence S Furey, Krishna M Roskin, Tom H Pringle, Alan M477
Zahler, and David Haussler. The human genome browser at ucsc. Genome research, 12(6):996–1006,478
2002.479
5 Methods480
5.1 Processing data481
We took high-coverage whole-genome-sequence cram files for one individual in each of the 26 populations482
from the 1000 Genomes project. These are aligned to GRCh38. The cram files were converted to bam483
and indexed with samtools [60, 61]. The genotype likelihoods were calculated with bcftools mpileup [62]484
by skipping alignments with mapping quality less than 20, skipping bases with base alignment quality485
less than 20, and setting the coefficient for downgrading mapping quality to 50. SNPs were called using486
bcftools and all indels were excluded. Variants were then designated as uncallable if the minimum map-487
ping quality was less than 20, the minimum consensus quality was less than 20, or the coverage was less488
than half or more than double the mean coverage. Finally, we designated all regions in the strict map-489
pability mask for GRCh38 ( ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000_490
genomes_project/working/20160622_genome_mask_GRCh38) as uncallable. Uncallable positions491
15
.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted October 19, 2024. ; https://doi.org/10.1101/2024.10.17.618932doi: bioRxiv preprint
are labelled as missing data in the HMM.492
493
We downloaded processed primate data from eichlerlab.gs.washington.edu/greatape/data/494
[16]. We took the VCF files and masked positions according to the given bed files, which described sites495
where coverage was less than 5 and regions that did not pass the quality filters as discussed in their paper.496
After filtering, only 49% of the genome was designated as callable. For the eastern lowland gorilla, we497
arbitrarily chose the individual labelled “Mkubwa”, and for the Nigeria-Cameroon chimpanzee we chose the498
individual labelled “Akwaya Jean”. We note that analysis on other individuals looked similar. For the gorilla499
we used a mutation rate per base pair per generation of 1.43e-08, and a generation time of 19 years; for the500
chimpanzee, a mutation rate per base pair per generation of 1.78e-08 and a generation time of 24 years were501
used [36].502
5.2 Mutation rates and generation times for humans and primates503
In humans we set the generation time as 29 [63] and the mutation rate per generation per base pair as504
µ=1.25e-08 [64, 65, 66, 67, 68, 69, 70]. The rates we set for the other great apes are taken from [39]. In505
chimpanzees and bonobos we set µ=1.78e-08 and generation time equal to 24 years. For Gorillas we set506
µ=1.42e-08 and generation time equal to 19 years. For Orangutans we set µ=2.03e-08 and generation time507
equal to 27.508
5.3 PSMC analysis509
We ran PSMC as embedded in cobraa, www.github.com/trevorcousins/cobraa. We used 64 time510
intervals and enabled the parameters for the effective population size in each to be inferred freely. This511
contrasts the default settings of previous implementations of the algorithm that force adjacent intervals to512
be the same, which has been shown to lead to fitting problems [31]. In the 1000GP data, we fixed the scaled513
mutation rate θ as 0.0008 which is roughly the mean across populations. For the great apes, we fixed the θ514
for gorilas as 0.0014, chimpanzees and bonobos as 0.001, and orangutans as 0.0014. We found that fixing this515
parameter reduces noise in the estimation of ancient N(t), and also forces time interval boundaries across516
individuals to align. The initial value for the ratio of mutation rate to recombination rate was set as 1.5,517
and the recombination rate was set to be inferred as part of the EM algorithm, which was iterated for 30518
iterations. The time interval boundaries are controlled by equation (1) where D is the number of discrete519
time boundaries (we used 64 throughout), and ω and Tmax control the spread of time interval boundaries.520
In Figure 1 we used ω=0.01 and Tmax=50; in Figure A4 we used ω=0.01 and Tmax=150.521
5.4 PSMC decoding522
Using the inferred effective population size parameters, scaled recombination rate, and scaled mutation rate523
(θ=0.0008), we used the PSMC decoding to get a posterior probability of coalescence at every position.524
To calculate the amount of uncoalesced genome over time, we integrated over the full posterior probabil-525
ity (Figure 2). To scale from time in coalescent units (equation (1)) to time in years, we used N=16,000526
(θ = 4 N µ= 0.0008 with µ=1.25e-08) then multiplied coalescent time by 2N µg , where g is the number of527
years per generations. To get a point estimate from PSMC (Figure 2b), we take the posterior mean at each528
position using the midpoint of the time interval boundaries.529
530
16
.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted October 19, 2024. ; https://doi.org/10.1101/2024.10.17.618932doi: bioRxiv preprint
To calculate the number of ancient non-contiguous segments inferred in different African populations531
(Table 1), we looked for regions where the posterior probability of coalescence as old or older than 2Mya,532
3Mya, or 4Mya, respectively, is greater than 0.9. We then only counted non-contiguous segments, which are533
all segments that are not adjacent.534
5.5 Nonparametric bootstrapping535
For a given individual, for each chromosome, we break the sequence up into segments of 5Mb, then reconstruct536
a new sequence by sampling with replacement from all 5Mb segments. These are then concatenated together537
to form a contiguous sequence, on which PSMC inference is performed using 30 iterations. This procedure538
is repeated 30 times for humans and 30 times for chimps.539
5.6 Relate540
We ran Relate [35] on all samples on the 1000GP for GBP, CHB, and YRI. We randomly selected 10 diploid541
sequences from each population. To calculate the fraction of uncoalesced genome we discretised time into542
intervals of 500 generations and computed a histogram based on Relate’s ARG. The inferred fraction of543
uncoalesced genome at 5Mya was roughly 0.3%, for each sample in each population.544
17
.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted October 19, 2024. ; https://doi.org/10.1101/2024.10.17.618932doi: bioRxiv preprint
6 Appendix545
6.1 Figures546
10k 100k 1M 10M
0
10000
20000
30000
40000
N(t)
GBR
10k 100k 1M 10M
Years
JPT
10k 100k 1M 10M
ESN
Figure A1: 30 iterations of block bootstraps on three different humans from the 1000GP project. We chopped
the genome into 5Mb windows, then for each bootstrap we randomly resampled these with replacement.
(a)
GBR
TSI
IBS
FIN
CEU
CHS
KHV
CHB
CDX
JPT
BEB
PJL
GIH
STU
ITU
PUR
CLM
PEL
MXL
ESN
YRI
MSL
GWD
ACB
ASW
LWK
TMRCA (Population 1)
GBR
TSI
IBS
FIN
CEU
CHS
KHV
CHB
CDX
JPT
BEB
PJL
GIH
STU
ITU
PUR
CLM
PEL
MXL
ESN
YRI
MSL
GWD
ACB
ASW
LWK TMRCA (Population 2)
0.0
0.1
0.2
0.3
0.4
Pearson
given TMRCA(Population 1) > 1Ma (b)
GBR
TSI
IBS
FIN
CEU
CHS
KHV
CHB
CDX
JPT
BEB
PJL
GIH
STU
ITU
PUR
CLM
PEL
MXL
ESN
YRI
MSL
GWD
ACB
ASW
LWK
TMRCA (Population 1)
GBR
TSI
IBS
FIN
CEU
CHS
KHV
CHB
CDX
JPT
BEB
PJL
GIH
STU
ITU
PUR
CLM
PEL
MXL
ESN
YRI
MSL
GWD
ACB
ASW
LWK TMRCA (Population 2)
0.0
0.1
0.2
0.3
0.4
Pearson
given TMRCA(Population 1) > 2Ma
(c)
GBR
TSI
IBS
FIN
CEU
CHS
KHV
CHB
CDX
JPT
BEB
PJL
GIH
STU
ITU
PUR
CLM
PEL
MXL
ESN
YRI
MSL
GWD
ACB
ASW
LWK
TMRCA (Population 1)
GBR
TSI
IBS
FIN
CEU
CHS
KHV
CHB
CDX
JPT
BEB
PJL
GIH
STU
ITU
PUR
CLM
PEL
MXL
ESN
YRI
MSL
GWD
ACB
ASW
LWK TMRCA (Population 2)
0.0
0.1
0.2
0.3
0.4
Pearson
given TMRCA(Population 1) > 3Ma (d)
GBR
TSI
IBS
FIN
CEU
CHS
KHV
CHB
CDX
JPT
BEB
PJL
GIH
STU
ITU
PUR
CLM
PEL
MXL
ESN
YRI
MSL
GWD
ACB
ASW
LWK
GBR
TSI
IBS
FIN
CEU
CHS
KHV
CHB
CDX
JPT
BEB
PJL
GIH
STU
ITU
PUR
CLM
PEL
MXL
ESN
YRI
MSL
GWD
ACB
ASW
LWK
0.0
0.1
0.2
0.3
0.4
Spearman
given TMRCA(Population 1) > 5Ma
Figure A2: The Pearson correlation between inferred TMRCAs, for all pairwise combinations of two popu-
lations from the 1000GP. We calculated these for chromosome 1. To obtain point estimates at each position,
we used the posterior mean. For each possible population pair, we look at positions where population 1
(x-axis) is larger than t and correlate these with the TMRCAs from population 2 (y-axis; note these are not
necessarily larger than t, so the matrices are not symmetric). The value of t for (a), (b), (c), and (d) is
1Mya, 2Mya, 3Mya, and 5Mya, respectively. The average Pearson r2 for (a), (b), (c), and (d) is 0.3, 0.26,
0.21, and 0.11, respectively, and all correlations have p-value <0.05 except for 72 combinations from (d).
18
.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted October 19, 2024. ; https://doi.org/10.1101/2024.10.17.618932doi: bioRxiv preprint
0 2.5 M 5 M 7.5 M 10 M 12.5 M 15 M
Years
1e-5%
1e-4%
1e-3%
0.01%
0.1%
1%
10%
100%Percentage Uncoalesced
YRI CHB GBR
Figure A3: The Relate inferred fraction of uncoalesced genome as a function of time, for 10 diploid samples
in the YRI, GBR and CHB populations from 1000GP.
(a)
104
105
106
107
Years
0
10000
20000
30000
40000
50000
60000
70000
80000
N(t)
GBR
TSI
IBS
FIN
CEU
CHS
KHV
CHB
CDX
BEB
PJL
GIH
STU
ITU
PUR
CLM
PEL
MXL
ESN
YRI
MSL
GWD
ACB
ASW
LWK
Figure A4: Using PSMC on humans with altered time interval boundaries ( ω=0.01 and Tmax=150) shows
relatively consistent inference beyond 10Ma. Across African samples, the estimated coalescence rate trajec-
tory up until ∼20Mya has relatively low variance.
19
.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted October 19, 2024. ; https://doi.org/10.1101/2024.10.17.618932doi: bioRxiv preprint
6.2 Analysis of PSMC model violations547
In this section, we analyse the effect of various model mispecifications on the inference from PSMC. We also548
analyse how the inferred coalescence times across the genome associate with various annotations, including549
for example repeat content, distance to coding sequence, strength of background selection, and recombination550
rate.551
6.2.1 Repeats as a cause of artifactual heterozygous signal552
(a)
0.5Ma - 0.75Ma
0.75Ma - 1.0Ma
1.0Ma - 1.25Ma
1.25Ma - 1.5Ma
1.5Ma - 1.75Ma
1.75Ma - 2.0Ma
2.0Ma - 2.25Ma
2.25Ma - 2.5Ma
2.5Ma - 2.75Ma
2.75Ma - 3.0Ma
3.0Ma - 3.25Ma
3.25Ma - 3.5Ma
3.5Ma - 3.75Ma
3.75Ma - 4.0Ma
4.0Ma - 4.25Ma
4.25Ma - 4.5Ma
4.5Ma - 4.75Ma
4.75Ma - 5.0Ma
5.0Ma - 5.25Ma
5.25Ma - 5.5Ma
5.5Ma - 5.75Ma
5.75Ma - 6.0Ma
6.0Ma - 6.25Ma
6.25Ma - 6.5Ma
6.5Ma - 6.75Ma
6.75Ma - 7.0Ma
7.0Ma - 7.25Ma
7.25Ma - 7.5Ma
7.5Ma - 7.75Ma
7.75Ma - 8.0Ma
8.0Ma - 8.25Ma
8.25Ma - 8.5Ma
Time
0
20
40
60
80
100Percentage in repeat
GBR
JPT
ESN (b)
0.5Ma - 0.75Ma
0.75Ma - 1.0Ma
1.0Ma - 1.25Ma
1.25Ma - 1.5Ma
1.5Ma - 1.75Ma
1.75Ma - 2.0Ma
2.0Ma - 2.25Ma
2.25Ma - 2.5Ma
2.5Ma - 2.75Ma
2.75Ma - 3.0Ma
3.0Ma - 3.25Ma
3.25Ma - 3.5Ma
3.5Ma - 3.75Ma
3.75Ma - 4.0Ma
4.0Ma - 4.25Ma
4.25Ma - 4.5Ma
4.5Ma - 4.75Ma
4.75Ma - 5.0Ma
5.0Ma - 5.25Ma
5.25Ma - 5.5Ma
5.5Ma - 5.75Ma
5.75Ma - 6.0Ma
6.0Ma - 6.25Ma
6.25Ma - 6.5Ma
6.5Ma - 6.75Ma
6.75Ma - 7.0Ma
7.0Ma - 7.25Ma
7.25Ma - 7.5Ma
7.5Ma - 7.75Ma
7.75Ma - 8.0Ma
8.0Ma - 8.25Ma
8.25Ma - 8.5Ma
Time
0.0
2.5
5.0
7.5
10.0
12.5
15.0
17.5
20.0Percentage in segmental duplication
GBR
JPT
ESN
Figure A5: The relationship between the posterior mean TMRCA and the repeat content ((a)) or segmental
duplications ((b)), on chromosome 1 for one individual from GBR, JPT, and ESN in the 1000GP. (a) The
reported Pearson r2 for each population is 0.73, 0.48, 0.69, with p-value <0.005 for each. The grey, dashed
line indicates the chromosome wide average repeat content. The reported Pearson r2 for GBR and ESN is
not significant, though for JPT it is -0.5 with p-value 0.003.
As noted in previous work [2], false heterozygotes caused by repeated regions or segmental duplications553
may lead to excessively long segments with high heterozygosity, which may lead to an excess of ancient554
coalescent events. To examine whether this might artefactually be creating the observed signal, we analysed555
the fraction of the genome in repeated regions and stratified this by inferred TMRCA. We observed that the556
repeat content increases as a function of TMRCA (regression slope ∼ 2%), and that the effect is significant557
(Pearson’s r2 ≈ 0.63, p-value <0.005; see Methods and Figure A5a). A similar analysis with segmental558
duplications revealed no significant correlation (Figure A5b). We re-ran our analysis with repeated regions559
and segmental duplications masked out, and observed that the ancient peak is still present, although when560
repeats are removed it exhibits less prominence (Figure A6a and Figure A6b, respectively). Thus we conclude561
that repeated regions or segmental duplications are unlikely to be the cause of the ancient peak.562
6.2.2 V ariable mutation rate across the genome563
The PSMC model assumes a constant mutation rate throughout the genome, despite ample evidence that564
the mutation rate varies significantly by genomic location [71, 72]. Unfortunately, variation in mutation rates565
across different genomic regions can lead to systematic biases in the estimation of TMRCAs. This is because566
20
.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted October 19, 2024. ; https://doi.org/10.1101/2024.10.17.618932doi: bioRxiv preprint
(a)
10k 100k 1M 10M
0
10000
20000
30000
40000
N(t)
GBR
ESN
CHB
10k 100k 1M 10M
Years
10k 100k 1M 10M
Figure A6: Running PSMC inference on human samples after masking out repeated regions (a), segmental
duplications (b), and CpG islands (c).
(a)
(b)
(c)
(d)
Figure A7: PSMC inference on simulations with a mutation rate that varies across the genome. The mutation
rate changes at disjoint intervals, whose length is exponentially distributed with rate L. In each interval, the
mutation rate is drawn from gamma distribution with mean 1.25e-08 and a coefficient of variation v. The
red, pink and purple lines show inference from PSMC on a simulation where the mutation rate varies, and
the gold lines show inference from PSMC on a simulation where the mutation rate is constant. (a) and (c)
v = 0.25 and L =10kb. (b) and (d) v = 0.15 and L=10kb (red), 100kb (pink), and 1Mb (purple).
21
.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted October 19, 2024. ; https://doi.org/10.1101/2024.10.17.618932doi: bioRxiv preprint
regions with faster mutation rates will have more mutations than otherwise, which PSMC will interpret as567
coming from a more ancient TMRCA, and vice versa. Consequently, overestimation and underestimation of568
TMRCAs can occur, which can confound efforts to infer accurate evolutionary histories.569
570
To test this effect, we performed a series of simulations of a spatially varying mutation rate that changes571
according to a gamma distribution, in line with the models proposed in [73]. We generated a mutation572
rate map as follows. First, we divided the genome into disjoint intervals, whose length is exponentially573
distributed according to a rate parameter, L. Then, for each interval, we drew a mutation rate from a574
gamma distribution whose mean is fixed to be 1.25e-08 [64, 65, 66, 67, 68, 69] with a coefficient of vari-575
ation, v. We simulated over L ∈ {10kb,100kb,1Mb} and v ∈ {0.15,0.25}. We simulated diploid genomes576
according to a coarse piecewise population size trajectory using msprime [74], then generated pointwise577
mutations according to the mutation rate map. We generated 5 replicates of this simulation and then in-578
ferred the population size history using PSMC. Finally, as a control, we generated an additional 5 replicates579
using the same procedure, but with a constant mutation rate of 1.25e-08, and infer N(t) using PSMC as well.580
581
This simulation study shows that if the variation in mutation rate is strong enough (v =0.15), and the582
rate of change along the genome is fast enough (L=10kb), PSMC detects a spurious peak in ancient time583
(Figure A7a). However, we note that qualitatively, the false peaks across different replicates seem to vary584
more in location and height than in the human inference, for which the ancient peaks are well aligned (Figure585
1a). Additionally, this simulation used v=0.25, which is larger than mutation rate model fitted by [73] for586
the datasets in [75, 76] (0.18 and 0.15, respectively) but smaller than in [77] (0.27). In a simulation with587
v=0.15, we do not consistently observe a second peak (Figure A7b) for any value of L. In a simulation with588
lower ancient N(t), v=0.25 does generate an artificial inflation of ancient N(t), but not a discernible peak589
(Figure A7c). A simulation with v = 0.15 and L=1Mb again can generate a false peak, though it is not590
consistent across replicates (purple line, Figure A7d); L=10kb or L=100kb with v=0.15 does not seem to591
have much of an effect as the inference is similar to the constant mutation rate inference.592
593
If the ancient peak was an artefact caused by variable mutation rates, we would expect to see a correlation594
between the mutation rate and the inferred TMRCA. To test this, we used de-novo mutation abundance595
counts obtained from trios [78] to generate a pointwise mutation rate map that depends on the local trinu-596
cleotide context (see Methods). The generated mutation map is positively correlated with SNP density as597
reported in 1000GP’s dbSNP (r2=0.09, p-value<1e-100, averaged across non-overlapping windows of 100bp),598
indicating positions with higher inferred rates have an elevated probability of mutations. We compared the599
variable mutation rates with the maximum inferred TMRCA across the 26 individuals in 1000GP, evaluated600
every 1kb. We observed a statistically insignificant Pearson’s correlation (r=0.002, p-value=0.29) and a sig-601
nificant negative Spearman’s correlation (rho=-0.04, p-value=3.7e-93) between TMRCA and mutation rate602
map. This suggests variable mutation rates do not play a significant role in causing the ancient peak.603
604
Finally, we explored the role of CpG islands in affecting N(t) inference. CpG islands are regions of605
DNA characterised by a high frequency of cytosine-guanine dinucleotides. They are often found near the606
transcription start site of genes and are associated with gene regulation. Due to their regulatory role, CpG607
islands have fewer mutations than expected by their high GC content, and therefore may contribute to608
mutation rate variability along the genome. The fraction of the genome in a CpG island is 0.007, and we609
22
.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted October 19, 2024. ; https://doi.org/10.1101/2024.10.17.618932doi: bioRxiv preprint
0.5Ma - 0.75Ma
0.75Ma - 1.0Ma
1.0Ma - 1.25Ma
1.25Ma - 1.5Ma
1.5Ma - 1.75Ma
1.75Ma - 2.0Ma
2.0Ma - 2.25Ma
2.25Ma - 2.5Ma
2.5Ma - 2.75Ma
2.75Ma - 3.0Ma
3.0Ma - 3.25Ma
3.25Ma - 3.5Ma
3.5Ma - 3.75Ma
3.75Ma - 4.0Ma
4.0Ma - 4.25Ma
4.25Ma - 4.5Ma
4.5Ma - 4.75Ma
4.75Ma - 5.0Ma
5.0Ma - 5.25Ma
5.25Ma - 5.5Ma
5.5Ma - 5.75Ma
5.75Ma - 6.0Ma
6.0Ma - 6.25Ma
6.25Ma - 6.5Ma
6.5Ma - 6.75Ma
6.75Ma - 7.0Ma
7.0Ma - 7.25Ma
7.25Ma - 7.5Ma
7.5Ma - 7.75Ma
7.75Ma - 8.0Ma
8.0Ma - 8.25Ma
8.25Ma - 8.5Ma
Time
0
1
2
3
4
5Percentage in CPG Island
GBR; r2=-0.26, p=0.15
IBS; r2=-0.45, p=8.97e-03
CEU; r2=-0.46, p=7.06e-03
KHV; r2=-0.07, p=0.69
CDX; r2=-0.06, p=0.74
BEB; r2=-0.23, p=0.19
GIH; r2=0.06, p=0.73
ITU; r2=-0.52, p=2.19e-03
CLM; r2=-0.39, p=0.027
MXL; r2=-0.6, p=2.58e-04
YRI; r2=-0.26, p=0.14
GWD; r2=-0.72, p=2.90e-06
ASW; r2=-0.16, p=0.37
Figure A8: The correlation between the posterior mean TMRCA and the fraction of regions in a CpG island,
on chromosome 1 for one individual from GBR, JPT, and ESN in the 1000GP. The Pearson r2 and p-value
are shown in the legend, although the relationship appears to be non-linear, and dominated by noise in
ancient TMRCA bins. The dashed line indicates the genome-wide fraction of regions in a CpG island.
stratify this by inferred TMRCA in Figure A8. The relationship between the two variables is unclear, so to610
test the effect on inference we masked out CpG islands and inferred N(t). We still observed an ancient peak611
(Figure A6c), and thus concluded that CpG islands are not its cause.612
6.2.3 V ariable mutation rate through time613
The PSMC model also assumes a constant mutation rate across past generations. While this is a sufficient614
approximation to allow inference in the last ∼50kya [26], it may not hold in more ancient times. In [26],615
the authors show that by using a yearly mutation rate of 0.38e-09 to 0.49e-09, the PSMC curves of modern616
humans and a 45,000-year-old individual from Siberia are well aligned. Assuming a generation time of 30617
years, this indicates that a per generation rate of 1.25e-08 is relatively stable in the last ∼50kya. Indeed, it618
has been suggested that the mutation rate has changed over the course of human evolution, also referred to619
as the “hominoid rate slowdown” [36, 37, 38, 39]. This is based on the observation that the yearly mutation620
rate estimated from human pedigree studies is almost half the rate inferred by considering observed differ-621
ences between human and chimpanzee genomes. The effect of a possible slowdown on PSMC estimates is622
unclear.623
624
To this end, we simulated 10 replicates of a diploid genome, arising from a constant population size, and625
a mutation rate that is fixed at 2.5e-08 at 10Mya then decreases to 1.25e-8 at present time (Figure A9a). To626
simulate a temporally changing mutation rate, we discretised time into D=64 segments with 65 time interval627
boundaries, τ = [τ1, . . . , τ64], where628
τi = ωexp
i
D log
1 + Tmax
ω
− 1
(1)
and the changes in mutation rate are piecewise constant in these intervals,µ = [µ1, . . . , µ64]. We simulate the629
coalescent process using msprime [74] with Hudson’s model of recombination [79], then utilise the memoryless630
23
.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted October 19, 2024. ; https://doi.org/10.1101/2024.10.17.618932doi: bioRxiv preprint
(a)
104
105
106
107
Years
1
1.5
2
2.5
(t) ( x 1e-08)
Contant mutation rate
Varying mutation rate
(b)
104
105
106
107
Years
0
20000
40000
60000
80000
100000
N(t)
Constant mutation rate
Varying mutation rate
Simulation (c)
104
105
106
107
Years
0
20000
40000
60000
80000
100000
N(t)
Constant mutation rate
Varying mutation rate
Simulation
(d)
104
105
106
107
Years
0
20000
40000
60000
80000
100000
N(t)
Constant mutation rate
Varying mutation rate
Simulation (e)
104
105
106
107
Years
0
20000
40000
60000
80000
100000Effective population sizes
Constant mutation rate
Varying mutation rate
Simulation
Figure A9: PSMC inference on simulations with a mutation rate that varies through time. (a) The mutation
rate model, µ(t): at 10Mya the mutation rate is 2.5e-08, then it slows down to 1.25e-08 at present. (b)
Inference from PSMC on a simulation with constant population size, where the mutation rate changes
through time according to the model in (a) in red, and a constant mutation rate (1.25e-8) in gold. (c) The
same simulation as in (b) but with a population undergoing size changes. (d) and (e) are the same as (b)
and (c), respectively, although the inferred N(t) on the simulation with a varying mutation rate (red) has
been multiplied through by µ(t) (red line in (a)).
property of the exponential distribution to add mutations at each position. With a coalescence time of t631
where τi ≤ t < τ i+1, the probability of a mutation arising on either lineage is:632
P (Mutation arises|t) = 1 − exp
−4N
i−1X
j=1
µj∆j − µi(t − τi)
(2)
where ∆j = τj+1 − τj.633
634
We used PSMC to infer an N(t) curve, and compared the results to N(t) curves inferred from simulations635
with a fixed mutation rate of 1.25e-8. We observed that PSMC infers an increasingly inflated N(t) in ancient636
times (Figure A9b). With the same mutation model, we also simulated changes in N(t) similar to that as637
24
.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted October 19, 2024. ; https://doi.org/10.1101/2024.10.17.618932doi: bioRxiv preprint
(a)
104
105
106
107
Years
0
10000
20000
30000
40000
50000
60000
N(t)
= /5
= /2
=
= 2
= 5
Simulation
(b)
104
105
106
107
Time
0
10000
20000
30000
40000
50000
60000
N(t)
With hotspots
Without hotspots (c)
104
105
106
107
Time
0
10000
20000
30000
40000
50000
60000
N(t)
With hotspots
Without hotspots
Figure A10: Inference from PSMC when its model of recombination is violated. (a) In a simulation with
constant recombination rate ρ, we deliberately fixed the recombination rate used by the PSMC algorithm
ˆρ at an incorrect value. The black line indicated the simulated N(t) and the coloured lines indicate the
various values of ˆρ, which are expressed relative to the simulated value. (b) and (c) Inference from PSMC
on simulations with changes in the recombination rate according to HapMap (red lines) and a constant
recombination rate (gold lines).
inferred in the ESN (Figure 1). Again, we observe the inference of N(t) from PSMC is increasingly inflated638
in ancient times, though the general shape of the trajectory is recovered (Figure A9c). We note that if the639
changes in mutation rate are known, we can simply scale the inference appropriately to correct the error640
(Figure A9d and e). Even though changes in the mutation rate over time can affect the estimation of N(t),641
it is unlikely that the observed ancient peak in humans is due to this, as producing a peak as an artefact642
of changing mutation rates would require several rapid and severe mutation rate fluctuations, which seems643
unlikely. We conclude that mutation rate variation through time is not a likely cause of an ancient peak, and644
note that this may explain the differences in ancient N(t) magnitudes in humans and chimpanzees (Figures645
1a and b).646
6.2.4 V ariable recombination rate across the genome647
The PSMC model also assumes a constant recombination rate across the genome, and uniformly through all648
past generations. However, recombination rate varies along the genome, with high rates at recombination649
hotspots and lower rates in different regions [80, 81, 42, 43, 44]. Looking back in time, the landscape of650
25
.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted October 19, 2024. ; https://doi.org/10.1101/2024.10.17.618932doi: bioRxiv preprint
0.5Ma - 0.75Ma
0.75Ma - 1.0Ma
1.0Ma - 1.25Ma
1.25Ma - 1.5Ma
1.5Ma - 1.75Ma
1.75Ma - 2.0Ma
2.0Ma - 2.25Ma
2.25Ma - 2.5Ma
2.5Ma - 2.75Ma
2.75Ma - 3.0Ma
3.0Ma - 3.25Ma
3.25Ma - 3.5Ma
3.5Ma - 3.75Ma
3.75Ma - 4.0Ma
4.0Ma - 4.25Ma
4.25Ma - 4.5Ma
4.5Ma - 4.75Ma
4.75Ma - 5.0Ma
5.0Ma - 5.25Ma
5.25Ma - 5.5Ma
5.5Ma - 5.75Ma
5.75Ma - 6.0Ma
6.0Ma - 6.25Ma
6.25Ma - 6.5Ma
6.5Ma - 6.75Ma
6.75Ma - 7.0Ma
7.0Ma - 7.25Ma
7.25Ma - 7.5Ma
7.5Ma - 7.75Ma
7.75Ma - 8.0Ma
8.0Ma - 8.25Ma
8.25Ma - 8.5Ma
Time
0
0.5
1
1.5
2
2.5
3
Recombination rate (x 1e-08)
GBR
JPT
ESN
Figure A11: The relationship between the posterior mean TMRCA and the HapMap recombination rate
map, on chromosome 1 for one individual from GBR, JPT, and ESN in the 1000GP. The reported Pearson
r2 for each population is 0.45, 0.41, and 0.50, with p-value <0.02 for each. However, clearly the relationship
is non-linear; it appears that recombination rate increases as does TMRCA up until ∼2Mya, after which
they become negatively correlated. This may come from model violations, in that PSMC assumes a uniform
rate of recombination, or that the recombination landscape changes over time [82]. The grey, dashed line
indicates the chromosome wide average recombination rate.
recombination is known to transform every few hundred thousand years [82, 83, 84, 85]. It is thus unclear651
how recombination rate variation affects N(t) inference.652
653
Previous work [2] has demonstrated that N(t) inference in PSMC is robust even under simulations654
including recombination hotspots. As additional confirmation, we tested the effect of mis-specifying the655
recombination rate. Denote the simulated scaled recombination rate as ρ; we fix PSMC’s recombination656
rate ˆρ as 0.2ρ, 0.5ρ, 1ρ, 2ρ, or 5 ρ, and infer N(t) (Figure A10a). We observed that PSMC is able to recover657
N(t) relatively accurately for all values of the recombination rate except ˆ ρ/ρ equal to 5. Finally, we saw658
a significant correlation between the local recombination probability (genetic map taken from HapMap [86,659
87]; downloaded from https://alkesgroup.broadinstitute.org/Eagle/) and inferred TMRCAs660
(Figure A11), though the relationship appears to be non-linear. We investigated this by simulating from661
this genetic map, and inferring an N(t) curve assuming a constant recombination rate. We observed that662
this type of model mis-specification does not significantly alter inference (Figure A10b and A10c). Given663
these observations, we believe it is unlikely that a spatially varying recombination rate could generate a fake664
ancient peak.665
6.2.5 Recurrent mutations666
PSMC assumes an infinite sites model, in which a site can only experience one mutation. In reality, some667
ancient sites will have experienced recurrent mutations. Under a Jukes-Cantor model of mutation [88], which668
assumes that each base pair mutates with uniform probability to another, with probability 1/3 a recurrent669
mutation will be revert to its ancestral state. These will be observed as a homozygote and therefore may be670
inferred as a lower TMRCA by PSMC. With probability 2/3 a recurrent mutation induces a distinct biallelic671
SNP, which is observed as a heterozygote. These biallelic SNPs will appear to PSMC as a younger TMRCA,672
26
.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted October 19, 2024. ; https://doi.org/10.1101/2024.10.17.618932doi: bioRxiv preprint
0.0Ma - 0.25Ma
0.25Ma - 0.5Ma
0.5Ma - 0.75Ma
0.75Ma - 1.0Ma
1.0Ma - 1.25Ma
1.25Ma - 1.5Ma
1.5Ma - 1.75Ma
1.75Ma - 2.0Ma
2.0Ma - 2.25Ma
2.25Ma - 2.5Ma
2.5Ma - 2.75Ma
2.75Ma - 3.0Ma
3.0Ma - 3.25Ma
3.25Ma - 3.5Ma
3.5Ma - 3.75Ma
3.75Ma - 4.0Ma
4.0Ma - 4.25Ma
4.25Ma - 4.5Ma
4.5Ma - 4.75Ma
4.75Ma - 5.0Ma
5.0Ma - 5.25Ma
5.25Ma - 5.5Ma
5.5Ma - 5.75Ma
5.75Ma - 6.0Ma
6.0Ma - 6.25Ma
6.25Ma - 6.5Ma
6.5Ma - 6.75Ma
6.75Ma - 7.0Ma
7.0Ma - 7.25Ma
7.25Ma - 7.5Ma
7.5Ma - 7.75Ma
7.75Ma - 8.0Ma
8.0Ma - 8.25Ma
8.25Ma - 8.5Ma
Time
0.70
0.75
0.80
0.85
0.90
0.95
1.00B value
GBR
JPT
ESN
Figure A12: The relationship between the posterior mean TMRCA and a high resolution map of background
selection [46], on chromosome 1 for one individual from GBR, JPT, and ESN in the 1000GP. The reported
Pearson r2 for each population is 0.67, 0.67, and 0.75, with p-value <1.3e-5 for each. In general this suggests
that B value increases with TMRCA (a linear line of best fit has been added to show this), though the
relationship appears to be non-linear. The grey, dashed line indicates the chromosome wide average B value.
so it is unlikely that these will significantly affect inference of ancient N(t). Moreover, it is generally a whole673
region of a chromosome that is informative about local TMRCA, rather than a single base pair. If a segment674
is old enough to experience a recurrent mutation, then there are likely many other neighbouring SNPs from675
which the TMRCA can be accurately inferred.676
677
Indeed, in almost all the simulations previously presented a Jukes-Cantor model [88] was used to generate678
mutations, in which recurrent mutations are allowed to occur. Generally, in simulations with uniform µ in679
time and space, we observe that inference of ancient N(t) is reasonably accurate in ancient time (Figures680
A7, A9, and A10).681
6.2.6 Background selection682