Deep coalescent history of the hominin lineage

doi:10.1101/2024.10.17.618932

Deep coalescent history of the hominin lineage

2024 · doi:10.1101/2024.10.17.618932

preprint OA: gold CC-BY-NC-4.0

📄 Open PDF Full text JSON View at publisher

Full text 90,697 characters · extracted from oa-pdf · 3 sections · click to expand

Methods

to infer human population size history have typically focused on studying evolution more recent24 than 1-2 million years ago (Mya). The mean coalescence time in humans is ∼1Mya, indicating that there25 should be information within a present day genome sequence to probe more ancient time scales. Insights26 into population size history beyond this period would be particularly interesting, as it may elucidate the27 emergence of the genera Homo (∼3Mya [10, 11]) and Australopithicus (∼4.5Mya [12]), or even the ancestral28 human-chimpanzee population (∼6Mya [13]). The inference of N(t) from some methods, including PSMC,29 do extend to this period, however it is typically not discussed in the existing literature. This may be because30 it has until now been unclear how robust inference is in those time frames, and in particular how badly the31 model is affected when its assumptions are violated.32 1 .CC-BY-NC 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted October 19, 2024. ; https://doi.org/10.1101/2024.10.17.618932doi: bioRxiv preprint Another set of methods uses cross-species analyses, in this case of present day sequences of humans, chim-33 panzees, and gorillas, to analyse the way in which these populations diverged. Methods such as CoalHMM34 and its successors employ coalescent hidden Markov model approaches, in which the hidden states corre-35 spond to the order of coalescent events, as well sometimes as ancestral coalescence times [14, 15, 16, 13, 17].36 Typically the model parameters to be estimated are the divergence times and ancestral population sizes,37 which in turn are informative about the amount of incomplete lineage sorting (ILS) or gene flow. Some of38 these studies have suggested that the human-chimp speciation was “complex”, in that after initial popula-39 tion divergence there was a period of gene flow before final separation [18, 19, 13, 20], though others have40 suggested a clean split scenario [21, 22]. Many of these analyses make coarse assumptions about population41 size changes in the ancestral populations, and typically assume these were panmictic. Typically the size of42 the human-chimpanzee ancestral population is estimated to be very large, for example this is estimated as43 167,400 in [17] and 177,368 in [23], both of which are more than 8 times the long-term effective population44 size of humans or chimpanzee. It is plausible that the size of the human-chimpanzee ancestral population45 really was extremely large, however it may be that this population was structured, which leads to inflated46 estimates of population size when assuming panmixia [24]. In this article we examine the changes in size of47 the ancestral human/chimp population in higher resolution and with fewer assumptions, finding a similar48 magnitude to that inferred in recent time.49 50 So far existing methods have offered limited population genetic insight into human evolution between51 1-2Mya and the divergence of the ancestral human/chimpanzee population around 6Mya [14, 13, 25]. Here,52 we examine the potential to broaden the scope of coalescent-based population size history inference beyond53 two million years ago. Specifically, we discuss an ancient inferred peak in effective population size around 5-54 7Mya seen in PSMC analysis of human and chimpanzee genomes, but not in gorilla or orangutan. While this55 peak has been partially observed in several previous studies in humans and chimpanzees [2, 8, 9, 16, 26, 27],56 it has received little attention in the literature. We present a revised analysis of this peak using modern data57 and discuss its potential implications for understanding the evolutionary history of humans and other great58 apes. We take care to address potential model violations and other factors that may affect the reliability of59 these inferences. Overall, our study suggests that we may be able to significantly extend the applicability of60 coalescent-based inference much further into the past, generating new insights into the complex history of61 hominin evolution.62 3 Results63 3.1 An ancient peak in human and chimpanzee population size history64 We separately inferred an effective population size (N (t)) curve over time for 26 single diploid genome se-65 quences, each from a distinct population in the 1000 Genomes Project [28, 29, 27, 30] (1000GP) using PSMC66 (Figure 1a). Importantly, we extended previous similar analyses by fitting the model with fewer parameter67 constraints [31] and using the high coverage (30x) whole-genome sequencing data resource [30], allowing68 greater resolution at time scales >2Mya (see Methods). The resulting N(t) estimates start between 5 kya69 and end around 10Mya, which is deeper in the past than in previous studies [2, 8, 9, 26]. In all human samples70 two peaks can be seen, one around the time of appearance of modern humans approximately 200kya and a71 second older one, starting around 2Mya going backwards in time, reaching a maximum around 5.5Mya with72 2 .CC-BY-NC 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted October 19, 2024. ; https://doi.org/10.1101/2024.10.17.618932doi: bioRxiv preprint (a) 104 105 106 107 Years 0 10000 20000 30000 40000 50000 60000 70000 80000 N(t) GBR TSI IBS FIN CEU CHS KHV CHB CDX JPT BEB PJL GIH STU ITU PUR CLM PEL MXL ESN YRI MSL GWD ACB ASW LWK (b) 104 105 106 107 Years 0 10000 20000 30000 40000 50000 60000 70000 80000 N(t) Bonobo Nigeria-Cameroon Chimpanzee Eastern Chimpanzee (c) 104 105 106 107 Years 0 20000 40000 60000 80000 100000 120000 140000 N(t) Eastern Lowland Cross River Western Lowland (d) 104 105 106 107 Years 0 20000 40000 60000 80000 100000 120000 140000 N(t) Sumatran Bornean Figure 1: Inference from PSMC on humans, chimpanzees and bonobos, gorillas, and orangutans. (a) Estimates of population size history on humans, using 26 populations from the 1000 Genomes project, using one individual per population. (b) Estimates of population size history on chimpanzees and bonobos, using 8 Nigeria-Cameroon chimpanzees, 5 Eastern chimpanzees, and 11 bonobos. (c) Estimates of population size history on gorillas, using 3 Eastern lowland individuals, 23 Western lowland individuals, and 1 Cross River individual. (d) Estimates of the population size history on orangutans, using 5 Sumatran individuals and 5 Bornean individuals. a population size of ∼37,500, then decreasing until around 10Ma. While the ancient peak is partially visible73 in previous whole-genome analyses [2, 9, 26], it is frequently truncated or represented with poor resolution74 because of the way time intervals were specified. 30 iterations of block bootstrapping reveals there is less75 variance in N(t) in ancient time than there is recent time (Figure A1).76 77 As the ancient peak occurs around the estimated period of human and chimpanzee species separation,78 it is natural to ask if a similar signal exists in chimpanzee population size history, or in other great apes79 in general. We therefore used PSMC to infer the population size history from 79 great ape genomes, as in80 [16] (see Methods). First, we recovered an ancient peak in chimpanzees and bonobos, overlapping the peak81 seen in the human analysis, though perhaps with its maximum slightly older (Figure 1b). This extends a82 similar signal further back in time, present in previous analyses but not discussed [16]. As with humans,83 our re-analysis produces the ancient peak more clearly because we avoid grouping old time intervals. We84 3 .CC-BY-NC 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted October 19, 2024. ; https://doi.org/10.1101/2024.10.17.618932doi: bioRxiv preprint note that the effective population size of the chimpanzee ancient peak is significantly higher than that of85 humans. The fact that the peaks observed in humans and chimpanzees align with the time frame of their86 estimated divergence prompts the question of whether these peaks signify a shared history between the two87 species. Indeed, as we discuss in the Appendix, correcting for mutation rate variations in time, as well as88 for variable generation times, in principle could align the two population size curves, both in their time89 period and their scale. However, uncertainty in these parameters, as well as in curve estimation, precludes90 a confident conclusion.91 92 For gorillas (Figure 1c) and orangutans (Figure 1d), we do not detect a discernible peak in the region93 between 5-10Mya. This is consistent with the fact that gorillas and orangutans genetically separated from94 humans and chimpanzees prior to the 5-7Mya time period, and so would not be expected to share their95 history at that time. Moreover, it supports the claim that the ancient peak is not an artefact of the method96 or of the data processing, as that would result in a peak with gorillas and orangutans as well.97 3.2 The fraction of uncoalesced genome as a function of time98 The power of PSMC to infer N(t) at 5.5Mya relies on there being sufficiently many places in the genome99 where the two input lineages coalesce more anciently than this time. To assess whether this is reasonable to100 expect, we note that under a simple panmictic model the observed range of heterozygosity in humans (e.g.,101 ∼0.00069 for PEL to ∼0.001 for MSL in 1000GP) would correspond to a constant effective diploid population102 size of 13,800-20,000, using a mutation rate of 1.25e-8 per base pair per generation. Given an exponential103 distribution of coalescence times, and a generation time of 30 years, we expect a mean coalescence time104 of 828,000-1,200,000 years, and that the coalescence time will be older than 5.5Mya in 0.12%-1.01% of the105 genome, which would be sufficient to provide plenty of information for this period, in particular taking note106 that older regions of coalescence are correspondingly shorter than younger ones.107 108 As more direct confirmation from the data, we inferred the fraction of the genome that has not yet109 coalesced as a function of time for the 1000GP samples, using PSMC decoding (see Methods). At least 1%110 of the genome is estimated not to have coalesced by 5Mya, and approximately 0.1% by 10Mya (Figure 2a).111 Ancient regions tend to be shared among individuals (Figure 2b); e.g. positions with an inferred posterior112 mean TMRCA of over 3Mya show an average correlation of r2=0.21 across individuals (Figure A2). We113 note that further back in time, segments of shared ancestry become increasingly shorter because of longer114 exposure to recombination, and that these are scattered across the genome. For example we estimate that115 African individuals harbor roughly 4,000 non-contiguous segments whose coalescence time is older than 4Mya116 (Table 1; see Methods). We note that previous simulation studies [32, 33] and theoretical analysis [34] have117 shown that posterior mean estimates of true ancient coalescent times substantially older than the mean tend118 to be significantly downward biased, which may indicate an even older age for detected ancient regions.119 We repeated the analysis using an independent method, Relate [35], which does not rely on the PSMC120 assumptions and uses multiple sequences jointly; this also identifies sufficient coalescence material exists121 in this time period (see Methods and Figure A3). We further note that if we alter the time discretisation122 parameters in PSMC, then the inference of humanN(t) is almost consistent in each population until∼20Mya123 (Figure A4), at which point it begins to fray as likely the genome is fully coalesced.124 125 4 .CC-BY-NC 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted October 19, 2024. ; https://doi.org/10.1101/2024.10.17.618932doi: bioRxiv preprint (a) 0 2.5M 5M 7.5M 10M 12.5M 15M Years 0.001% 0.01% 0.1% 1% 10% 100% Percentage uncoalesced GBR TSI IBS FIN CEU CHS KHV CHB CDX JPT BEB PJL GIH STU ITU PUR CLM PEL MXL ESN YRI MSL GWD ACB ASW LWK (b) Figure 2: (a) Inferred fraction of uncoalesced genome as a function of time, for the 26 individuals in Figure 1a. (b) (Top) Posterior mean TMRCAs for each population, for 20Mb on chromosome 20. The bottom panel shows the mean of the top panel. 3.3 Model violations and their effects126 As with all statistical frameworks, the PSMC method relies on specific modelling assumptions that may127 not hold true in reality. Deviations from these assumptions could have varying impacts on inference across128 different temporal scales, and could potentially account for some or all of the observed ancient peak. We129 therefore considered how departures from the modelling assumptions could plausibly affect inference of pop-130 ulation size history in the deep past.131 132 To this end, we considered the following possible issues. First, repeats may manifest as an artifactual133 heterozygous signal, leading to long segments with high heterozygosity, which then leads to an apparent134 5 .CC-BY-NC 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted October 19, 2024. ; https://doi.org/10.1101/2024.10.17.618932doi: bioRxiv preprint Population Sample 1Mya 2Mya 3Mya 4Mya 5Mya 6Mya ESN HG03515 54,169 17,213 7,811 3,952 1,841 472 YRI NA18488 54,486 17,229 7,668 3,841 1,752 426 MSL HG03212 53,817 17,243 7,661 3,910 1,947 494 GWD HG02568 54,493 16,893 7,358 3,667 1,721 397 ACB HG01882 55,368 17,590 7,883 4,129 1,909 457 ASW NA19625 54,974 17,395 7,798 3,984 1,852 439 LWK NA19017 54,644 17,709 8,105 4,050 1,874 496 Table 1: The number of ancient non-contiguous segments inferred in different African populations from the 1000GP project. We count segments by looking for regions where the posterior probability of coalescence as old or older than the time given in each column is greater than 0.9. excess of ancient coalescence events. Second, variation in mutation rates across different genomic regions135 could cause some regions to have a higher density of mutations leading to an interpretation as having a more136 ancient TMRCA, and hence systematic biases in the inference of accurate evolutionary histories. Third,137 the human mutation rate is estimated to have slowed down over time [36, 37, 38, 39], while in PSMC it is138 assumed constant. Fourth, variability in recombination rates across the genome (in particular at recombi-139 nation hotspots) is also a departure from model assumptions [40, 41, 42, 43, 44]. Fifth, at longer timescales140 there may become a significant probability of recurrent mutations at the same site, which are not modelled141 in PSMC. Sixth, background selection, which has been estimated to be pervasive in humans [45, 46], has142 been shown to distort the inference of the coalescence rate [47, 48, 49]. Finally, balancing selection can also143 manifest as an excess in ancient coalescence events.144 145 We conducted an extensive set of simulation studies to analyse how these violations may affect inference,146 which we describe fully in the Appendix). We concluded that it is unlikely any of these issues or violations147 could cause a false ancient peak of the type we see in real data. Notably, we observed that if there are148 large variations in the mutation rate across the genome, then false inflations of inferred N(t) are observed.149 However, the degree of rate variation required to generate this is larger than the degree of this inferred in150 humans, and moreover their ancient inferred N(t) has high variance, contrasting the low variance we have151 shown in humans in the ∼5Mya peak.152 4 Discussion153 We have demonstrated the applicability of coalescent-based inference to examine deep evolutionary history154 (five million years ago and beyond) in the hominin lineage. Specifically, we highlighted and extended the155 analysis of a peak in ancient population size history inferred by PSMC, which is reliably estimated from156 human, chimpanzee and bonobo data, though not found in gorillas or orangutans. The human, chimpanzee157 and bonobo peaks span approximately the same time period, suggesting that they may reflect the same158 event prior to, or during, human-chimpanzee speciation. We discussed several ways in which the data may159 violate the model underlying our analysis, and concluded that none are likely to generate a false ancient peak.160 161 A straightforward interpretation of an ancient peak is that it reflects true changes in population size.162 Indeed, for humans, changes in inferred N(t) since ∼80kya have been associated with the out-of-Africa event163 followed by more recent recent population size increases [3, 6]. The ancient peak may similarly reflect ancient164 6 .CC-BY-NC 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted October 19, 2024. ; https://doi.org/10.1101/2024.10.17.618932doi: bioRxiv preprint hominin population expansion and decline in the late miocene and the pliocene. To the extent the same165 ancient peak exists in chimpanzees and bonobos, it may reflect common environmental conditions affecting166 similar evolutionary processes in the different species.167 168 However, there are other possible explanations. The separation and reconnection of local populations169 alters the amount of possible gene flow between them, which affects the inferred N(t) [2, 50, 24, 51]. In a170 structured population, the coalescence rate decreases relative to a panmictic population of the same size,171 thus the effective population size (taken here as the inverse of the coalescence rate) increases. Indeed, in172 [24] the authors show that the inferred human effective population size as inferred by PSMC, including the173 ancient peak, can be equally well obtained by a set of populations with varying migration rates between174 them without any changes in size. Therefore, it is possible that the ancient peak is a signature of ancient175 population structure or other departures from panmixia.176 177 One such possible structured event may be complex speciation between humans and chimpanzees, in178 which after an initial split the two populations rejoined or exchanged genes for a period of time. Previous179 publications have put forward evidence for this type of model [18, 19, 13, 20], although other analysis has180 favoured a model with a clean split [21, 22]. Innan and Watanabe proposed there was no strong support181 for gene flow after divergence, and that a clean split best fit the data [21]. This result was reinforced by182 Yamamichi, whose maximum likelihood model also favoured a clean split [22]. Patterson et al. proposed183 that gene flow did occur, because of differences in the estimated divergence time between the autosomes and184 X chromosome [18], although this approach has been criticised [52, 53, 22]. Using a likelihood ratio test185 considering divergence times across the genome, Yang rejected the null hypothesis of an absence of gene flow186 [19]. With an HMM that explicitly tests for immediate or prolonged separation, Mailund et al. proposed187 that a model with an extended period of gene flow better fits the data [13]. Galtier introduced Aphid, which188 distinguishes between discordant coalescent trees (ones that do not match the species trees) generated by189 ILS or gene flow [20]. It leverages the fact that multispecies coalescent trees affected by gene flow tend to190 have shorter branches, and coalescent trees affected by incomplete lineage sorting longer branches, than the191 average coalescent tree. This information provided strong support for a model in which there was ancient192 gene flow between human, chimpanzee, and even gorilla, after the initial human-chimpanzee split.193 194 If the peak we observed in chimpanzees and bonobos is indeed reliable and aligned with the ancient195 peak observed in humans, this may be consistent with a period of ongoing gene flow after initial human and196 chimpanzee separation. Qualitatively, there are other observations which make this type of model attrac-197 tive. Notably, CoalHMM [23] and TRAILS [17] assume a panmictic ancestral human-chimpanzee population198 and estimate its size as 177,368 and 167,400, respectively. Both of these estimates are more than 8 times199 the long-term effective population size of humans and chimpanzees (in both species, θ = 4N µ is roughly200 0.001, which assuming µ=1.25e-8 gives N=20,000), and are not well aligned with the PSMC inference in201 this period (Figure 1). A structured ancestral population could explain the large CoalHMM and TRAILS202 estimates, as falsely assuming panmixia can inflate estimates of population size [24]. Moreover, Aphid [20],203 which provides strong support for human-chimp-gorilla gene flow, estimates the size of the ancestral human-204 chimpanzee population to be of similar size to present day human, chimpanzee and bonobo populations.205 Additionally, chimpanzees and bonobos have been estimated to split cleanly around 2Mya [13], and PSMC206 does not infer a unique peak around this period. Finally, widespread structure in the ancestral populations of207 7 .CC-BY-NC 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted October 19, 2024. ; https://doi.org/10.1101/2024.10.17.618932doi: bioRxiv preprint humans, chimpanzees/bonobos, and gorillas, may plausibly explain why fossils such as Sahelanthropus from208 this period are hard to assign [54, 55, 56], and as a consequence morphological analyses can not confidently209 determine separation times [57].210 211 We examined the sensitivity of PSMC to various modelling violations such as variation in mutation212 and recombination rates, and selection. It will be useful to incorporate these as extensions to the PSMC213 model, modifying the transition and emission probabilities within the underlying HMM accordingly. Indeed,214 some work has already been done towards this goal [58, 59], and future modification could further address215 slowdown or genomic variability in mutation rate, recombination hotspots, and/or selection, so as to better216 understand the ancient speciation process in primates [17] and indeed other species.217 Acknowledgements218 We thank Aylwyn Scally for constructive comments. T.C. was funded by a Wellcome Postgraduate Stu-219 dentship 108864/B/15/Z, and R.D. and R.S. by Wellcome Investigator Award 207492/Z/17/Z. For the220 purpose of open access, the author has applied a CC BY public copyright licence to any Author Accepted221 Manuscript version arising from this submission.222 References223 [1] Brian Charlesworth. Effective population size and patterns of molecular evolution and variation.Nature224 Reviews Genetics, 10(3):195–205, 2009.225 [2] Heng Li and Richard Durbin. Inference of human population history from individual whole-genome226 sequences. Nature, 475(7357):493–496, 2011.227 [3] Stephan Schiffels and Richard Durbin. Inferring human population size and separation history from228 multiple genome sequences. Nature genetics, 46(8):919–925, 2014.229 [4] Sara Sheehan, Kelley Harris, and Yun S Song. Estimating variable effective population sizes from multi-230 ple genomes: a sequentially markov conditional sampling distribution approach. Genetics, 194(3):647–231 662, 2013.232 [5] Matthias Steinr¨ ucken, Joshua S Paul, and Yun S Song. A sequentially markov conditional sampling dis-233 tribution for structured populations with migration and recombination. Theoretical population biology,234 87:51–61, 2013.235 [6] Jack Kamm, Jonathan Terhorst, Richard Durbin, and Yun S Song. Efficiently inferring the demo-236 graphic history of many populations with allele count data. Journal of the American Statistical Asso-237 ciation, 115(531):1472–1487, 2020.238 [7] Donna Henderson, Sha Zhu, Christopher B Cole, and Gerton Lunter. Demographic inference from239 multiple whole genomes using a particle filter for continuous markov jump processes. Plos one ,240 16(3):e0247647, 2021.241 8 .CC-BY-NC 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted October 19, 2024. ; https://doi.org/10.1101/2024.10.17.618932doi: bioRxiv preprint [8] Matthias Meyer, Martin Kircher, Marie-Theres Gansauge, Heng Li, Fernando Racimo, Swapan Mallick,242 Joshua G Schraiber, Flora Jay, Kay Pr¨ ufer, Cesare De Filippo, et al. A high-coverage genome sequence243 from an archaic denisovan individual. Science, 338(6104):222–226, 2012.244 [9] Kay Pr¨ ufer, Fernando Racimo, Nick Patterson, Flora Jay, Sriram Sankararaman, Susanna Sawyer,245 Anja Heinze, Gabriel Renaud, Peter H Sudmant, Cesare De Filippo, et al. The complete genome246 sequence of a neanderthal from the altai mountains. Nature, 505(7481):43–49, 2014.247 [10] Brian Villmoare, William H Kimbel, Chalachew Seyoum, Christopher J Campisano, Erin N DiMaggio,248 John Rowan, David R Braun, J Ram´ on Arrowsmith, and Kaye E Reed. Early homo at 2.8 ma from249 ledi-geraru, afar, ethiopia. Science, 347(6228):1352–1355, 2015.250 [11] Erin N DiMaggio, Christopher J Campisano, John Rowan, Guillaume Dupont-Nivet, Alan L Deino,251 Faysal Bibi, Margaret E Lewis, Antoine Souron, Dominique Garello, Lars Werdelin, et al. Late pliocene252 fossiliferous sedimentary record and the environmental context of early homo from afar, ethiopia.253 Science, 347(6228):1355–1359, 2015.254 [12] Jason E Lewis, Carol V Ward, William H Kimbel, Casey L Kidney, Frank H Brown, Rhonda L Quinn,255 John Rowan, Ignacio A Lazagabaster, William J Sanders, Meave G Leakey, et al. A 4.3-million-year-256 old australopithecus anamensis mandible from ileret, east turkana, kenya, and its paleoenvironmental257 context. Journal of Human Evolution, 194:103579, 2024.258 [13] Thomas Mailund, Anders E Halager, Michael Westergaard, Julien Y Dutheil, Kasper Munch, Lars N259 Andersen, Gerton Lunter, Kay Pr¨ ufer, Aylwyn Scally, Asger Hobolth, et al. A new isolation with mi-260 gration model along complete genomes infers very different divergence processes among closely related261 great ape species. PLoS genetics, 8(12):e1003125, 2012.262 [14] Asger Hobolth, Ole F Christensen, Thomas Mailund, and Mikkel H Schierup. Genomic relationships263 and speciation times of human, chimpanzee, and gorilla inferred from a coalescent hidden markov264 model. PLoS genetics, 3(2):e7, 2007.265 [15] Julien Y Dutheil, Ganesh Ganapathy, Asger Hobolth, Thomas Mailund, Marcy K Uyenoyama, and266 Mikkel H Schierup. Ancestral population genomics: the coalescent hidden markov model approach.267 Genetics, 183(1):259–274, 2009.268 [16] Javier Prado-Martinez, Peter H Sudmant, Jeffrey M Kidd, Heng Li, Joanna L Kelley, Belen Lorente-269 Galdos, Krishna R Veeramah, August E Woerner, Timothy D O’connor, Gabriel Santpere, et al. Great270 ape genetic diversity and population history. Nature, 499(7459):471–475, 2013.271 [17] Iker Rivas-Gonz´ alez, Mikkel H Schierup, John Wakeley, and Asger Hobolth. Trails: Tree reconstruction272 of ancestry using incomplete lineage sorting. Plos Genetics , 20(2):e1010836, 2024.273 [18] Nick Patterson, Daniel J Richter, Sante Gnerre, Eric S Lander, and David Reich. Genetic evidence for274 complex speciation of humans and chimpanzees. Nature, 441(7097):1103–1108, 2006.275 [19] Ziheng Yang. A likelihood ratio test of speciation with gene flow using genomic sequence data. Genome276 biology and evolution , 2:200–211, 2010.277 [20] Nicolas Galtier. An approximate likelihood method reveals ancient gene flow between human, chim-278 panzee and gorilla. Peer Community Journal, 4, 2024.279 9 .CC-BY-NC 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted October 19, 2024. ; https://doi.org/10.1101/2024.10.17.618932doi: bioRxiv preprint [21] Hideki Innan and Hidemi Watanabe. The effect of gene flow on the coalescent time in the human-280 chimpanzee ancestral population. Molecular biology and evolution , 23(5):1040–1047, 2006.281 [22] Masato Yamamichi, Jun Gojobori, and Hideki Innan. An autosomal analysis gives no genetic evidence282 for complex speciation of humans and chimpanzees. Molecular biology and evolution , 29(1):145–156,283 2012.284 [23] Iker Rivas-Gonz´ alez, Marjolaine Rousselle, Fang Li, Long Zhou, Julien Y Dutheil, Kasper Munch,285 Yong Shao, Dongdong Wu, Mikkel H Schierup, and Guojie Zhang. Pervasive incomplete lineage sorting286 illuminates speciation and selection in primates. Science, 380(6648):eabn4409, 2023.287 [24] Olivier Mazet, Willy Rodr´ ıguez, Simona Grusea, Simon Boitard, and Loun` es Chikhi. On the impor-288 tance of being structured: instantaneous coalescence rates and human evolution—lessons for ancestral289 population size inference? Heredity, 116(4):362–371, 2016.290 [25] Naoyuki Takahata and Yoko Satta. Evolution of the primate lineage leading to modern humans:291 phylogenetic and demographic inferences from dna sequences. Proceedings of the National Academy of292 Sciences, 94(9):4811–4815, 1997.293 [26] Qiaomei Fu, Heng Li, Priya Moorjani, Flora Jay, Sergey M Slepchenko, Aleksei A Bondarev, Philip LF294 Johnson, Ayinuer Aximu-Petri, Kay Pr¨ ufer, Cesare De Filippo, et al. Genome sequence of a 45,000-295 year-old modern human from western siberia. Nature, 514(7523):445–449, 2014.296 [27] 1000 Genomes Project Consortium et al. A global reference for human genetic variation. Nature,297 526(7571):68, 2015.298 [28] 1000 Genomes Project Consortium et al. A map of human genome variation from population scale299 sequencing. Nature, 467(7319):1061, 2010.300 [29] 1000 Genomes Project Consortium et al. An integrated map of genetic variation from 1,092 human301 genomes. Nature, 491(7422):56, 2012.302 [30] Marta Byrska-Bishop, Uday S Evani, Xuefang Zhao, Anna O Basile, Haley J Abel, Allison A Regier,303 Andr´ e Corvelo, Wayne E Clarke, Rajeeva Musunuri, Kshithija Nagulapalli, et al. High-coverage304 whole-genome sequencing of the expanded 1000 genomes project cohort including 602 trios. Cell,305 185(18):3426–3440, 2022.306 [31] Leon Hilgers, Shenglin Liu, Axel Jensen, Thomas Brown, Trevor Cousins, Regev Schweiger, Katerina307 Guschanski, and Michael Hiller. Avoidable false psmc population size peaks occur across numerous308 studies. bioRxiv, pages 2024–06, 2024.309 [32] D´ ebora YC Brandt, Xinzhu Wei, Yun Deng, Andrew H Vaughn, and Rasmus Nielsen. Evalua-310 tion of methods for estimating coalescence times using ancestral recombination graphs. Genetics,311 221(1):iyac044, 2022.312 [33] Regev Schweiger and Richard Durbin. Ultrafast genome-wide inference of pairwise coalescence times.313 Genome Research, 33(7):1023–1031, 2023.314 [34] Yun S Song. Lecture notes on computational and mathematical population genetics. 2021.315 10 .CC-BY-NC 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted October 19, 2024. ; https://doi.org/10.1101/2024.10.17.618932doi: bioRxiv preprint [35] Leo Speidel, Marie Forest, Sinan Shi, and Simon R Myers. A method for genome-wide genealogy316 estimation for thousands of samples. Nature genetics, 51(9):1321–1329, 2019.317 [36] Priya Moorjani, Ziyue Gao, and Molly Przeworski. Human germline mutation and the erratic evolu-318 tionary clock. PLoS biology, 14(10):e2000744, 2016.319 [37] Aylwyn Scally and Richard Durbin. Revising the human mutation rate: implications for understanding320 human evolution. Nature Reviews Genetics, 13(10):745–753, 2012.321 [38] V Yi Soojin. Morris goodman’s hominoid rate slowdown: the importance of being neutral. Molecular322 phylogenetics and evolution, 66(2):569–574, 2013.323 [39] Manjusha Chintalapati and Priya Moorjani. Evolution of the mutation rate across primates. Current324 opinion in genetics & development, 62:58–64, 2020.325 [40] S Myers, C C A Spencer, A Auton, L Bottolo, C Freeman, P Donnelly, and G McVean. The distribution326 and causes of meiotic recombination in the human genome. Biochem. Soc. Trans., 34(Pt 4):526–530,327 2006.328 [41] International HapMap Consortium. The International HapMap Project. Nature, 426(6968):789–796,329 2003.330 [42] Thomas D Petes. Meiotic recombination hot spots and cold spots. Nature Reviews Genetics, 2(5):360–331 369, 2001.332 [43] KT Nishant, Chetan Kumar, and MRS Rao. Humhot: a database of human meiotic recombination333 hot spots. Nucleic acids research, 34(suppl 1):D25–D28, 2006.334 [44] Laure S´ egurel, Ellen Miranda Leffler, and Molly Przeworski. The case of the fickle fingers: how335 the prdm9 zinc finger protein specifies meiotic recombination hotspots in humans. PLoS biology ,336 9(12):e1001211, 2011.337 [45] Graham McVicker, David Gordon, Colleen Davis, and Phil Green. Widespread genomic signatures of338 natural selection in hominid evolution. PLoS genetics, 5(5):e1000471, 2009.339 [46] David A Murphy, Eyal Elyashiv, Guy Amster, and Guy Sella. Broad-scale variation in human genetic340 diversity levels is predicted by purifying selection on coding and non-coding elements. Elife, 12:e76065,341 2022.342 [47] Daniel R Schrider, Alexander G Shanku, and Andrew D Kern. Effects of linked selective sweeps on343 demographic inference and model selection. Genetics, 204(3):1207–1223, 2016.344 [48] Parul Johri, Kellen Riall, Hannes Becher, Laurent Excoffier, Brian Charlesworth, and Jeffrey D Jensen.345 The impact of purifying and background selection on the inference of population history: problems346 and prospects. Molecular biology and evolution , 38(7):2986–3003, 2021.347 [49] Simon Boitard, Armando Arredondo, Loun` es Chikhi, and Olivier Mazet. Heterogeneity in effective348 size across the genome: effects on the inverse instantaneous coalescence rate (iicr) and implications for349 demographic inference under linked selection. Genetics, 220(3):iyac008, 2022.350 11 .CC-BY-NC 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted October 19, 2024. ; https://doi.org/10.1101/2024.10.17.618932doi: bioRxiv preprint [50] Olivier Mazet, Willy Rodr´ ıguez, and Loun` es Chikhi. Demographic inference using genetic data from351 a single individual: Separating population size variation from population structure. Theoretical Pop-352 ulation Biology, 104:46–58, 2015.353 [51] Trevor Cousins, Aylwyn Scally, and Richard Durbin. A structured coalescent model reveals deep354 ancestral structure shared by all modern humans. bioRxiv, pages 2024–03, 2024.355 [52] John Wakeley. Complex speciation of humans and chimpanzees. Nature, 452(7184):E3–E4, 2008.356 [53] Daven C Presgraves and V Yi Soojin. Doubts about complex speciation between humans and chim-357 panzees. Trends in ecology & evolution , 24(10):533–540, 2009.358 [54] Michel Brunet, Franck Guy, David Pilbeam, Hassane Taisso Mackaye, Andossa Likius, Djimdoumal-359 baye Ahounta, Alain Beauvilain, C´ ecile Blondel, Herv´ e Bocherens, Jean-Renaud Boisserie, et al. A360 new hominid from the upper miocene of chad, central africa. Nature, 418(6894):145–151, 2002.361 [55] Milford H Wolpoff, Brigitte Senut, Martin Pickford, and John Hawks. Sahelanthropus362 or’sahelpithecus’? Nature, 419(6907):581–582, 2002.363 [56] Michel Brunet. Sahelanthropus or’sahelpithecus’? Nature, 419(6907):582–582, 2002.364 [57] Bernard Wood and Terry Harrison. The evolutionary context of the first hominins. Nature,365 470(7334):347–352, 2011.366 [58] Trevor Cousins, Daniel Tabin, Nick Patterson, David Reich, and Arun Durvasula. Accurate inference367 of population history in the presence of background selection. bioRxiv, 2024.368 [59] Gustavo V. Barroso, Nataˇ sa Puzovi´ c, and Julien Y Dutheil. Inference of recombination maps from a369 single pair of genomes and its application to ancient samples. PLoS genetics, 15(11):e1008449, 2019.370 [60] Petr Danecek, James K Bonfield, Jennifer Liddle, John Marshall, Valeriu Ohan, Martin O Pollard,371 Andrew Whitwham, Thomas Keane, Shane A McCarthy, Robert M Davies, et al. Twelve years of372 samtools and bcftools. Gigascience, 10(2):giab008, 2021.373 [61] Heng Li, Bob Handsaker, Alec Wysoker, Tim Fennell, Jue Ruan, Nils Homer, Gabor Marth, Goncalo374 Abecasis, Richard Durbin, and 1000 Genome Project Data Processing Subgroup. The sequence align-375 ment/map format and samtools. Bioinformatics, 25(16):2078–2079, 2009.376 [62] Heng Li. A statistical framework for SNP calling, mutation discovery, association mapping and popu-377 lation genetical parameter estimation from sequencing data. Bioinformatics, 27(21):2987–2993, 2011.378 [63] Mois` es Coll Maci` a, Laurits Skov, Benjamin Marco Peter, and Mikkel Heide Schierup. Different histori-379 cal generation intervals in human populations inferred from neanderthal fragment lengths and mutation380 signatures. Nature Communications, 12(1):5317, 2021.381 [64] Jared C Roach, Gustavo Glusman, Arian FA Smit, Chad D Huff, Robert Hubley, Paul T Shannon, Lee382 Rowen, Krishna P Pant, Nathan Goodman, Michael Bamshad, et al. Analysis of genetic inheritance383 in a family quartet by whole-genome sequencing. Science, 328(5978):636–639, 2010.384 12 .CC-BY-NC 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted October 19, 2024. ; https://doi.org/10.1101/2024.10.17.618932doi: bioRxiv preprint [65] Philip Awadalla, Julie Gauthier, Rachel A Myers, Ferran Casals, Fadi F Hamdan, Alexander R Griffing,385 M´ elanie Cˆ ot´ e, Edouard Henrion, Dan Spiegelman, Julien Tarabeux, et al. Direct measure of the de386 novo mutation rate in autism and schizophrenia cohorts. The American Journal of Human Genetics,387 87(3):316–324, 2010.388 [66] Raheleh Rahbari, Arthur Wuster, Sarah J Lindsay, Robert J Hardwick, Ludmil B Alexandrov, Saeed389 Al Turki, Anna Dominiczak, Andrew Morris, David Porteous, Blair Smith, et al. Timing, rates and390 spectra of human germline mutation. Nature genetics, 48(2):126–133, 2016.391 [67] Augustine Kong, Michael L Frigge, Gisli Masson, Soren Besenbacher, Patrick Sulem, Gisli Magnusson,392 Sigurjon A Gudjonsson, Asgeir Sigurdsson, Aslaug Jonasdottir, Adalbjorg Jonasdottir, et al. Rate of393 de novo mutations and the importance of father’s age to disease risk. Nature, 488(7412):471–475, 2012.394 [68] Jacob J Michaelson, Yujian Shi, Madhusudan Gujral, Hancheng Zheng, Dheeraj Malhotra, Xin Jin,395 Minghan Jian, Guangming Liu, Douglas Greer, Abhishek Bhandari, et al. Whole-genome sequencing396 in autism identifies hot spots for de novo germline mutation. Cell, 151(7):1431–1442, 2012.397 [69] 1000 Genomes Project Consortium et al. Variation in genome-wide mutation rates within and between398 human families. Nature genetics, 43(7):712–714, 2011.399 [70] Aylwyn Scally. The mutation rate in human evolution and demographic inference. Current opinion in400 genetics & development, 41:36–43, 2016.401 [71] Hans Ellegren, Nick GC Smith, and Matthew T Webster. Mutation rate variation in the mammalian402 genome. Current opinion in genetics & development , 13(6):562–568, 2003.403 [72] Alan Hodgkinson and Adam Eyre-Walker. Variation in the mutation rate across mammalian genomes.404 Nature reviews genetics, 12(11):756–766, 2011.405 [73] Thomas CA Smith, Peter F Arndt, and Adam Eyre-Walker. Large scale variation in the rate of406 germ-line de novo mutation, base composition, divergence and diversity in humans. PLoS genetics,407 14(3):e1007254, 2018.408 [74] Jerome Kelleher, Alison M Etheridge, and Gilean McVean. Efficient coalescent simulation and ge-409 nealogical analysis for large sample sizes. PLoS computational biology , 12(5):e1004842, 2016.410 [75] Laurent C Francioli, Paz P Polak, Amnon Koren, Androniki Menelaou, Sung Chun, Ivo Renkens,411 Genome of the Netherlands Consortium, Cornelia M van Duijn, Morris Swertz, Cisca Wijmenga, et al.412 Genome-wide patterns and properties of de novo mutations in humans. Nature genetics, 47(7):822–826,413 2015.414 [76] H´ akon J´ onsson, Patrick Sulem, Birte Kehr, Snaedis Kristmundsdottir, Florian Zink, Eirikur Hjartar-415 son, Marteinn T Hardarson, Kristjan E Hjorleifsson, Hannes P Eggertsson, Sigurjon Axel Gudjonsson,416 et al. Parental influence on human germline de novo mutations in 1,548 trios from iceland. Nature,417 549(7673):519–522, 2017.418 [77] Wendy SW Wong, Benjamin D Solomon, Dale L Bodian, Prachi Kothiyal, Greg Eley, Kathi C Huddle-419 ston, Robin Baker, Dzung C Thach, Ramaswamy K Iyer, Joseph G Vockley, et al. New observations420 on maternal age effect on germline de novo mutations. Nature communications, 7(1):10486, 2016.421 13 .CC-BY-NC 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted October 19, 2024. ; https://doi.org/10.1101/2024.10.17.618932doi: bioRxiv preprint [78] Miguel Rodriguez-Galindo, S` onia Casillas, Donate Weghorn, and Antonio Barbadilla. Germline de422 novo mutation rates on exons versus introns in humans. Nature communications, 11(1):3304, 2020.423 [79] Richard R Hudson. Properties of a neutral allele model with intragenic recombination. Theoretical424 population biology, 23(2):183–201, 1983.425 [80] S Myers, CCA Spencer, A Auton, L Bottolo, C Freeman, P Donnelly, and et G McVean. The distri-426 bution and causes of meiotic recombination in the human genome, 2006.427 [81] Richard A Gibbs, John W Belmont, Paul Hardenbol, Thomas D Willis, Fuli L Yu, HM Yang, Lan-Yang428 Ch’ang, Wei Huang, Bin Liu, Yan Shen, et al. The international hapmap project. 2003.429 [82] Simon Myers, Rory Bowden, Afidalina Tumian, Ronald E Bontrop, Colin Freeman, Tammie S MacFie,430 Gil McVean, and Peter Donnelly. Drive against hotspot motifs in primates implicates the prdm9 gene431 in meiotic recombination. Science, 327(5967):876–879, 2010.432 [83] Emil D Parvanov, Petko M Petkov, and Kenneth Paigen. Prdm9 controls activation of mammalian433 recombination hotspots. Science, 327(5967):835–835, 2010.434 [84] Kenneth Paigen and Petko M Petkov. Prdm9 and its role in genetic recombination. Trends in Genetics,435 34(4):291–300, 2018.436 [85] Corinne Grey, Fr´ ed´ eric Baudat, and Bernard de Massy. Prdm9, a driver of the genetic map. PLoS437 genetics, 14(8):e1007479, 2018.438 [86] International HapMap Consortium Altshuler David altshuler@ molbio. mgh. harvard. edu Donnelly439 Peter donnelly@ stats. ox. ac. uk. A haplotype map of the human genome. Nature, 437(7063):1299–440 1320, 2005.441 [87] International HapMap Consortium et al. A second generation human haplotype map of over 3.1 million442 snps. Nature, 449(7164):851, 2007.443 [88] TH Jukes. Evolution of protein molecules. Mammalian Protein Metabolism, 3, 1969.444 [89] Brian Charlesworth, MT Morgan, and Deborah Charlesworth. The effect of deleterious mutations on445 neutral molecular variation. Genetics, 134(4):1289–1303, 1993.446 [90] Magnus Nordborg, Brian Charlesworth, and Deborah Charlesworth. The effect of recombination on447

Background

selection. Genetics Research, 67(2):159–174, 1996.448 [91] Parul Johri, Susanne P Pfeifer, and Jeffrey D Jensen. Developing an evolutionary baseline model for449 humans: jointly inferring purifying selection with population history. Molecular biology and evolution ,450 40(5):msad100, 2023.451 [92] Jacob I Marsh and Parul Johri. Biases in arg-based inference of historical population size in populations452 experiencing selection. Molecular Biology and Evolution, page msae118, 2024.453 [93] Deborah Charlesworth. Balancing selection and its effects on sequences in nearby genome regions.454 PLoS genetics, 2(4):e64, 2006.455 14 .CC-BY-NC 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted October 19, 2024. ; https://doi.org/10.1101/2024.10.17.618932doi: bioRxiv preprint [94] KL Bubb, D Bovee, D Buckley, E Haugen, M Kibukawa, M Paddock, A Palmieri, S Subramanian,456 Y Zhou, R Kaul, et al. Scan of human genome reveals no new loci under ancient balancing selection.457 Genetics, 173(4):2165–2177, 2006.458 [95] Saurabh Asthana, Steffen Schmidt, and Shamil Sunyaev. A limited role for balancing selection. Trends459 in Genetics , 21(1):30–32, 2005.460 [96] Felix M Key, Jo˜ ao C Teixeira, Cesare de Filippo, and Aida M Andr´ es. Advantageous diversity main-461 tained by balancing selection in humans. Current opinion in genetics & development, 29:45–51, 2014.462 [97] Bora E Baysal, Elizabeth C Lawrence, and Robert E Ferrell. Sequence variation in human succinate463 dehydrogenase genes: evidence for long-term balancing selection on sdha. BMC biology, 5:1–14, 2007.464 [98] Aida M Andr´ es, Melissa J Hubisz, Amit Indap, Dara G Torgerson, Jeremiah D Degenhardt, Adam R465 Boyko, Ryan N Gutenkunst, Thomas J White, Eric D Green, Carlos D Bustamante, et al. Targets of466 balancing selection in the human genome. Molecular biology and evolution , 26(12):2755–2764, 2009.467 [99] Ellen M Leffler, Ziyue Gao, Susanne Pfeifer, Laure S´ egurel, Adam Auton, Oliver Venn, Rory Bowden,468 Ronald Bontrop, Jeffrey D Wall, Guy Sella, et al. Multiple instances of ancient balancing selection469 shared between humans and chimpanzees. Science, 339(6127):1578–1582, 2013.470 [100] Radoje Drmanac, Andrew B Sparks, Matthew J Callow, Aaron L Halpern, Norman L Burns, Bahram G471 Kermani, Paolo Carnevali, Igor Nazarenko, Geoffrey B Nilsen, George Yeung, et al. Human genome472 sequencing using unchained base reads on self-assembling dna nanoarrays. Science, 327(5961):78–81,473 2010.474 [101] Matthew D Rasmussen, Melissa J Hubisz, Ilan Gronau, and Adam Siepel. Genome-wide inference of475 ancestral recombination graphs. PLoS Genetics, 10(5):e1004342, 2014.476 [102] W James Kent, Charles W Sugnet, Terrence S Furey, Krishna M Roskin, Tom H Pringle, Alan M477 Zahler, and David Haussler. The human genome browser at ucsc. Genome research, 12(6):996–1006,478 2002.479 5 Methods480 5.1 Processing data481 We took high-coverage whole-genome-sequence cram files for one individual in each of the 26 populations482 from the 1000 Genomes project. These are aligned to GRCh38. The cram files were converted to bam483 and indexed with samtools [60, 61]. The genotype likelihoods were calculated with bcftools mpileup [62]484 by skipping alignments with mapping quality less than 20, skipping bases with base alignment quality485 less than 20, and setting the coefficient for downgrading mapping quality to 50. SNPs were called using486 bcftools and all indels were excluded. Variants were then designated as uncallable if the minimum map-487 ping quality was less than 20, the minimum consensus quality was less than 20, or the coverage was less488 than half or more than double the mean coverage. Finally, we designated all regions in the strict map-489 pability mask for GRCh38 ( ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000_490 genomes_project/working/20160622_genome_mask_GRCh38) as uncallable. Uncallable positions491 15 .CC-BY-NC 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted October 19, 2024. ; https://doi.org/10.1101/2024.10.17.618932doi: bioRxiv preprint are labelled as missing data in the HMM.492 493 We downloaded processed primate data from eichlerlab.gs.washington.edu/greatape/data/494 [16]. We took the VCF files and masked positions according to the given bed files, which described sites495 where coverage was less than 5 and regions that did not pass the quality filters as discussed in their paper.496 After filtering, only 49% of the genome was designated as callable. For the eastern lowland gorilla, we497 arbitrarily chose the individual labelled “Mkubwa”, and for the Nigeria-Cameroon chimpanzee we chose the498 individual labelled “Akwaya Jean”. We note that analysis on other individuals looked similar. For the gorilla499 we used a mutation rate per base pair per generation of 1.43e-08, and a generation time of 19 years; for the500 chimpanzee, a mutation rate per base pair per generation of 1.78e-08 and a generation time of 24 years were501 used [36].502 5.2 Mutation rates and generation times for humans and primates503 In humans we set the generation time as 29 [63] and the mutation rate per generation per base pair as504 µ=1.25e-08 [64, 65, 66, 67, 68, 69, 70]. The rates we set for the other great apes are taken from [39]. In505 chimpanzees and bonobos we set µ=1.78e-08 and generation time equal to 24 years. For Gorillas we set506 µ=1.42e-08 and generation time equal to 19 years. For Orangutans we set µ=2.03e-08 and generation time507 equal to 27.508 5.3 PSMC analysis509 We ran PSMC as embedded in cobraa, www.github.com/trevorcousins/cobraa. We used 64 time510 intervals and enabled the parameters for the effective population size in each to be inferred freely. This511 contrasts the default settings of previous implementations of the algorithm that force adjacent intervals to512 be the same, which has been shown to lead to fitting problems [31]. In the 1000GP data, we fixed the scaled513 mutation rate θ as 0.0008 which is roughly the mean across populations. For the great apes, we fixed the θ514 for gorilas as 0.0014, chimpanzees and bonobos as 0.001, and orangutans as 0.0014. We found that fixing this515 parameter reduces noise in the estimation of ancient N(t), and also forces time interval boundaries across516 individuals to align. The initial value for the ratio of mutation rate to recombination rate was set as 1.5,517 and the recombination rate was set to be inferred as part of the EM algorithm, which was iterated for 30518 iterations. The time interval boundaries are controlled by equation (1) where D is the number of discrete519 time boundaries (we used 64 throughout), and ω and Tmax control the spread of time interval boundaries.520 In Figure 1 we used ω=0.01 and Tmax=50; in Figure A4 we used ω=0.01 and Tmax=150.521 5.4 PSMC decoding522 Using the inferred effective population size parameters, scaled recombination rate, and scaled mutation rate523 (θ=0.0008), we used the PSMC decoding to get a posterior probability of coalescence at every position.524 To calculate the amount of uncoalesced genome over time, we integrated over the full posterior probabil-525 ity (Figure 2). To scale from time in coalescent units (equation (1)) to time in years, we used N=16,000526 (θ = 4 N µ= 0.0008 with µ=1.25e-08) then multiplied coalescent time by 2N µg , where g is the number of527 years per generations. To get a point estimate from PSMC (Figure 2b), we take the posterior mean at each528 position using the midpoint of the time interval boundaries.529 530 16 .CC-BY-NC 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted October 19, 2024. ; https://doi.org/10.1101/2024.10.17.618932doi: bioRxiv preprint To calculate the number of ancient non-contiguous segments inferred in different African populations531 (Table 1), we looked for regions where the posterior probability of coalescence as old or older than 2Mya,532 3Mya, or 4Mya, respectively, is greater than 0.9. We then only counted non-contiguous segments, which are533 all segments that are not adjacent.534 5.5 Nonparametric bootstrapping535 For a given individual, for each chromosome, we break the sequence up into segments of 5Mb, then reconstruct536 a new sequence by sampling with replacement from all 5Mb segments. These are then concatenated together537 to form a contiguous sequence, on which PSMC inference is performed using 30 iterations. This procedure538 is repeated 30 times for humans and 30 times for chimps.539 5.6 Relate540 We ran Relate [35] on all samples on the 1000GP for GBP, CHB, and YRI. We randomly selected 10 diploid541 sequences from each population. To calculate the fraction of uncoalesced genome we discretised time into542 intervals of 500 generations and computed a histogram based on Relate’s ARG. The inferred fraction of543 uncoalesced genome at 5Mya was roughly 0.3%, for each sample in each population.544 17 .CC-BY-NC 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted October 19, 2024. ; https://doi.org/10.1101/2024.10.17.618932doi: bioRxiv preprint 6 Appendix545 6.1 Figures546 10k 100k 1M 10M 0 10000 20000 30000 40000 N(t) GBR 10k 100k 1M 10M Years JPT 10k 100k 1M 10M ESN Figure A1: 30 iterations of block bootstraps on three different humans from the 1000GP project. We chopped the genome into 5Mb windows, then for each bootstrap we randomly resampled these with replacement. (a) GBR TSI IBS FIN CEU CHS KHV CHB CDX JPT BEB PJL GIH STU ITU PUR CLM PEL MXL ESN YRI MSL GWD ACB ASW LWK TMRCA (Population 1) GBR TSI IBS FIN CEU CHS KHV CHB CDX JPT BEB PJL GIH STU ITU PUR CLM PEL MXL ESN YRI MSL GWD ACB ASW LWK TMRCA (Population 2) 0.0 0.1 0.2 0.3 0.4 Pearson given TMRCA(Population 1) > 1Ma (b) GBR TSI IBS FIN CEU CHS KHV CHB CDX JPT BEB PJL GIH STU ITU PUR CLM PEL MXL ESN YRI MSL GWD ACB ASW LWK TMRCA (Population 1) GBR TSI IBS FIN CEU CHS KHV CHB CDX JPT BEB PJL GIH STU ITU PUR CLM PEL MXL ESN YRI MSL GWD ACB ASW LWK TMRCA (Population 2) 0.0 0.1 0.2 0.3 0.4 Pearson given TMRCA(Population 1) > 2Ma (c) GBR TSI IBS FIN CEU CHS KHV CHB CDX JPT BEB PJL GIH STU ITU PUR CLM PEL MXL ESN YRI MSL GWD ACB ASW LWK TMRCA (Population 1) GBR TSI IBS FIN CEU CHS KHV CHB CDX JPT BEB PJL GIH STU ITU PUR CLM PEL MXL ESN YRI MSL GWD ACB ASW LWK TMRCA (Population 2) 0.0 0.1 0.2 0.3 0.4 Pearson given TMRCA(Population 1) > 3Ma (d) GBR TSI IBS FIN CEU CHS KHV CHB CDX JPT BEB PJL GIH STU ITU PUR CLM PEL MXL ESN YRI MSL GWD ACB ASW LWK GBR TSI IBS FIN CEU CHS KHV CHB CDX JPT BEB PJL GIH STU ITU PUR CLM PEL MXL ESN YRI MSL GWD ACB ASW LWK 0.0 0.1 0.2 0.3 0.4 Spearman given TMRCA(Population 1) > 5Ma Figure A2: The Pearson correlation between inferred TMRCAs, for all pairwise combinations of two popu- lations from the 1000GP. We calculated these for chromosome 1. To obtain point estimates at each position, we used the posterior mean. For each possible population pair, we look at positions where population 1 (x-axis) is larger than t and correlate these with the TMRCAs from population 2 (y-axis; note these are not necessarily larger than t, so the matrices are not symmetric). The value of t for (a), (b), (c), and (d) is 1Mya, 2Mya, 3Mya, and 5Mya, respectively. The average Pearson r2 for (a), (b), (c), and (d) is 0.3, 0.26, 0.21, and 0.11, respectively, and all correlations have p-value <0.05 except for 72 combinations from (d). 18 .CC-BY-NC 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted October 19, 2024. ; https://doi.org/10.1101/2024.10.17.618932doi: bioRxiv preprint 0 2.5 M 5 M 7.5 M 10 M 12.5 M 15 M Years 1e-5% 1e-4% 1e-3% 0.01% 0.1% 1% 10% 100%Percentage Uncoalesced YRI CHB GBR Figure A3: The Relate inferred fraction of uncoalesced genome as a function of time, for 10 diploid samples in the YRI, GBR and CHB populations from 1000GP. (a) 104 105 106 107 Years 0 10000 20000 30000 40000 50000 60000 70000 80000 N(t) GBR TSI IBS FIN CEU CHS KHV CHB CDX BEB PJL GIH STU ITU PUR CLM PEL MXL ESN YRI MSL GWD ACB ASW LWK Figure A4: Using PSMC on humans with altered time interval boundaries ( ω=0.01 and Tmax=150) shows relatively consistent inference beyond 10Ma. Across African samples, the estimated coalescence rate trajec- tory up until ∼20Mya has relatively low variance. 19 .CC-BY-NC 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted October 19, 2024. ; https://doi.org/10.1101/2024.10.17.618932doi: bioRxiv preprint 6.2 Analysis of PSMC model violations547 In this section, we analyse the effect of various model mispecifications on the inference from PSMC. We also548 analyse how the inferred coalescence times across the genome associate with various annotations, including549 for example repeat content, distance to coding sequence, strength of background selection, and recombination550 rate.551 6.2.1 Repeats as a cause of artifactual heterozygous signal552 (a) 0.5Ma - 0.75Ma 0.75Ma - 1.0Ma 1.0Ma - 1.25Ma 1.25Ma - 1.5Ma 1.5Ma - 1.75Ma 1.75Ma - 2.0Ma 2.0Ma - 2.25Ma 2.25Ma - 2.5Ma 2.5Ma - 2.75Ma 2.75Ma - 3.0Ma 3.0Ma - 3.25Ma 3.25Ma - 3.5Ma 3.5Ma - 3.75Ma 3.75Ma - 4.0Ma 4.0Ma - 4.25Ma 4.25Ma - 4.5Ma 4.5Ma - 4.75Ma 4.75Ma - 5.0Ma 5.0Ma - 5.25Ma 5.25Ma - 5.5Ma 5.5Ma - 5.75Ma 5.75Ma - 6.0Ma 6.0Ma - 6.25Ma 6.25Ma - 6.5Ma 6.5Ma - 6.75Ma 6.75Ma - 7.0Ma 7.0Ma - 7.25Ma 7.25Ma - 7.5Ma 7.5Ma - 7.75Ma 7.75Ma - 8.0Ma 8.0Ma - 8.25Ma 8.25Ma - 8.5Ma Time 0 20 40 60 80 100Percentage in repeat GBR JPT ESN (b) 0.5Ma - 0.75Ma 0.75Ma - 1.0Ma 1.0Ma - 1.25Ma 1.25Ma - 1.5Ma 1.5Ma - 1.75Ma 1.75Ma - 2.0Ma 2.0Ma - 2.25Ma 2.25Ma - 2.5Ma 2.5Ma - 2.75Ma 2.75Ma - 3.0Ma 3.0Ma - 3.25Ma 3.25Ma - 3.5Ma 3.5Ma - 3.75Ma 3.75Ma - 4.0Ma 4.0Ma - 4.25Ma 4.25Ma - 4.5Ma 4.5Ma - 4.75Ma 4.75Ma - 5.0Ma 5.0Ma - 5.25Ma 5.25Ma - 5.5Ma 5.5Ma - 5.75Ma 5.75Ma - 6.0Ma 6.0Ma - 6.25Ma 6.25Ma - 6.5Ma 6.5Ma - 6.75Ma 6.75Ma - 7.0Ma 7.0Ma - 7.25Ma 7.25Ma - 7.5Ma 7.5Ma - 7.75Ma 7.75Ma - 8.0Ma 8.0Ma - 8.25Ma 8.25Ma - 8.5Ma Time 0.0 2.5 5.0 7.5 10.0 12.5 15.0 17.5 20.0Percentage in segmental duplication GBR JPT ESN Figure A5: The relationship between the posterior mean TMRCA and the repeat content ((a)) or segmental duplications ((b)), on chromosome 1 for one individual from GBR, JPT, and ESN in the 1000GP. (a) The reported Pearson r2 for each population is 0.73, 0.48, 0.69, with p-value <0.005 for each. The grey, dashed line indicates the chromosome wide average repeat content. The reported Pearson r2 for GBR and ESN is not significant, though for JPT it is -0.5 with p-value 0.003. As noted in previous work [2], false heterozygotes caused by repeated regions or segmental duplications553 may lead to excessively long segments with high heterozygosity, which may lead to an excess of ancient554 coalescent events. To examine whether this might artefactually be creating the observed signal, we analysed555 the fraction of the genome in repeated regions and stratified this by inferred TMRCA. We observed that the556 repeat content increases as a function of TMRCA (regression slope ∼ 2%), and that the effect is significant557 (Pearson’s r2 ≈ 0.63, p-value <0.005; see Methods and Figure A5a). A similar analysis with segmental558 duplications revealed no significant correlation (Figure A5b). We re-ran our analysis with repeated regions559 and segmental duplications masked out, and observed that the ancient peak is still present, although when560 repeats are removed it exhibits less prominence (Figure A6a and Figure A6b, respectively). Thus we conclude561 that repeated regions or segmental duplications are unlikely to be the cause of the ancient peak.562 6.2.2 V ariable mutation rate across the genome563 The PSMC model assumes a constant mutation rate throughout the genome, despite ample evidence that564 the mutation rate varies significantly by genomic location [71, 72]. Unfortunately, variation in mutation rates565 across different genomic regions can lead to systematic biases in the estimation of TMRCAs. This is because566 20 .CC-BY-NC 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted October 19, 2024. ; https://doi.org/10.1101/2024.10.17.618932doi: bioRxiv preprint (a) 10k 100k 1M 10M 0 10000 20000 30000 40000 N(t) GBR ESN CHB 10k 100k 1M 10M Years 10k 100k 1M 10M Figure A6: Running PSMC inference on human samples after masking out repeated regions (a), segmental duplications (b), and CpG islands (c). (a) (b) (c) (d) Figure A7: PSMC inference on simulations with a mutation rate that varies across the genome. The mutation rate changes at disjoint intervals, whose length is exponentially distributed with rate L. In each interval, the mutation rate is drawn from gamma distribution with mean 1.25e-08 and a coefficient of variation v. The red, pink and purple lines show inference from PSMC on a simulation where the mutation rate varies, and the gold lines show inference from PSMC on a simulation where the mutation rate is constant. (a) and (c) v = 0.25 and L =10kb. (b) and (d) v = 0.15 and L=10kb (red), 100kb (pink), and 1Mb (purple). 21 .CC-BY-NC 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted October 19, 2024. ; https://doi.org/10.1101/2024.10.17.618932doi: bioRxiv preprint regions with faster mutation rates will have more mutations than otherwise, which PSMC will interpret as567 coming from a more ancient TMRCA, and vice versa. Consequently, overestimation and underestimation of568 TMRCAs can occur, which can confound efforts to infer accurate evolutionary histories.569 570 To test this effect, we performed a series of simulations of a spatially varying mutation rate that changes571 according to a gamma distribution, in line with the models proposed in [73]. We generated a mutation572 rate map as follows. First, we divided the genome into disjoint intervals, whose length is exponentially573 distributed according to a rate parameter, L. Then, for each interval, we drew a mutation rate from a574 gamma distribution whose mean is fixed to be 1.25e-08 [64, 65, 66, 67, 68, 69] with a coefficient of vari-575 ation, v. We simulated over L ∈ {10kb,100kb,1Mb} and v ∈ {0.15,0.25}. We simulated diploid genomes576 according to a coarse piecewise population size trajectory using msprime [74], then generated pointwise577 mutations according to the mutation rate map. We generated 5 replicates of this simulation and then in-578 ferred the population size history using PSMC. Finally, as a control, we generated an additional 5 replicates579 using the same procedure, but with a constant mutation rate of 1.25e-08, and infer N(t) using PSMC as well.580 581 This simulation study shows that if the variation in mutation rate is strong enough (v =0.15), and the582 rate of change along the genome is fast enough (L=10kb), PSMC detects a spurious peak in ancient time583 (Figure A7a). However, we note that qualitatively, the false peaks across different replicates seem to vary584 more in location and height than in the human inference, for which the ancient peaks are well aligned (Figure585 1a). Additionally, this simulation used v=0.25, which is larger than mutation rate model fitted by [73] for586 the datasets in [75, 76] (0.18 and 0.15, respectively) but smaller than in [77] (0.27). In a simulation with587 v=0.15, we do not consistently observe a second peak (Figure A7b) for any value of L. In a simulation with588 lower ancient N(t), v=0.25 does generate an artificial inflation of ancient N(t), but not a discernible peak589 (Figure A7c). A simulation with v = 0.15 and L=1Mb again can generate a false peak, though it is not590 consistent across replicates (purple line, Figure A7d); L=10kb or L=100kb with v=0.15 does not seem to591 have much of an effect as the inference is similar to the constant mutation rate inference.592 593 If the ancient peak was an artefact caused by variable mutation rates, we would expect to see a correlation594 between the mutation rate and the inferred TMRCA. To test this, we used de-novo mutation abundance595 counts obtained from trios [78] to generate a pointwise mutation rate map that depends on the local trinu-596 cleotide context (see Methods). The generated mutation map is positively correlated with SNP density as597 reported in 1000GP’s dbSNP (r2=0.09, p-value<1e-100, averaged across non-overlapping windows of 100bp),598 indicating positions with higher inferred rates have an elevated probability of mutations. We compared the599 variable mutation rates with the maximum inferred TMRCA across the 26 individuals in 1000GP, evaluated600 every 1kb. We observed a statistically insignificant Pearson’s correlation (r=0.002, p-value=0.29) and a sig-601 nificant negative Spearman’s correlation (rho=-0.04, p-value=3.7e-93) between TMRCA and mutation rate602 map. This suggests variable mutation rates do not play a significant role in causing the ancient peak.603 604 Finally, we explored the role of CpG islands in affecting N(t) inference. CpG islands are regions of605 DNA characterised by a high frequency of cytosine-guanine dinucleotides. They are often found near the606 transcription start site of genes and are associated with gene regulation. Due to their regulatory role, CpG607 islands have fewer mutations than expected by their high GC content, and therefore may contribute to608 mutation rate variability along the genome. The fraction of the genome in a CpG island is 0.007, and we609 22 .CC-BY-NC 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted October 19, 2024. ; https://doi.org/10.1101/2024.10.17.618932doi: bioRxiv preprint 0.5Ma - 0.75Ma 0.75Ma - 1.0Ma 1.0Ma - 1.25Ma 1.25Ma - 1.5Ma 1.5Ma - 1.75Ma 1.75Ma - 2.0Ma 2.0Ma - 2.25Ma 2.25Ma - 2.5Ma 2.5Ma - 2.75Ma 2.75Ma - 3.0Ma 3.0Ma - 3.25Ma 3.25Ma - 3.5Ma 3.5Ma - 3.75Ma 3.75Ma - 4.0Ma 4.0Ma - 4.25Ma 4.25Ma - 4.5Ma 4.5Ma - 4.75Ma 4.75Ma - 5.0Ma 5.0Ma - 5.25Ma 5.25Ma - 5.5Ma 5.5Ma - 5.75Ma 5.75Ma - 6.0Ma 6.0Ma - 6.25Ma 6.25Ma - 6.5Ma 6.5Ma - 6.75Ma 6.75Ma - 7.0Ma 7.0Ma - 7.25Ma 7.25Ma - 7.5Ma 7.5Ma - 7.75Ma 7.75Ma - 8.0Ma 8.0Ma - 8.25Ma 8.25Ma - 8.5Ma Time 0 1 2 3 4 5Percentage in CPG Island GBR; r2=-0.26, p=0.15 IBS; r2=-0.45, p=8.97e-03 CEU; r2=-0.46, p=7.06e-03 KHV; r2=-0.07, p=0.69 CDX; r2=-0.06, p=0.74 BEB; r2=-0.23, p=0.19 GIH; r2=0.06, p=0.73 ITU; r2=-0.52, p=2.19e-03 CLM; r2=-0.39, p=0.027 MXL; r2=-0.6, p=2.58e-04 YRI; r2=-0.26, p=0.14 GWD; r2=-0.72, p=2.90e-06 ASW; r2=-0.16, p=0.37 Figure A8: The correlation between the posterior mean TMRCA and the fraction of regions in a CpG island, on chromosome 1 for one individual from GBR, JPT, and ESN in the 1000GP. The Pearson r2 and p-value are shown in the legend, although the relationship appears to be non-linear, and dominated by noise in ancient TMRCA bins. The dashed line indicates the genome-wide fraction of regions in a CpG island. stratify this by inferred TMRCA in Figure A8. The relationship between the two variables is unclear, so to610 test the effect on inference we masked out CpG islands and inferred N(t). We still observed an ancient peak611 (Figure A6c), and thus concluded that CpG islands are not its cause.612 6.2.3 V ariable mutation rate through time613 The PSMC model also assumes a constant mutation rate across past generations. While this is a sufficient614 approximation to allow inference in the last ∼50kya [26], it may not hold in more ancient times. In [26],615 the authors show that by using a yearly mutation rate of 0.38e-09 to 0.49e-09, the PSMC curves of modern616 humans and a 45,000-year-old individual from Siberia are well aligned. Assuming a generation time of 30617 years, this indicates that a per generation rate of 1.25e-08 is relatively stable in the last ∼50kya. Indeed, it618 has been suggested that the mutation rate has changed over the course of human evolution, also referred to619 as the “hominoid rate slowdown” [36, 37, 38, 39]. This is based on the observation that the yearly mutation620 rate estimated from human pedigree studies is almost half the rate inferred by considering observed differ-621 ences between human and chimpanzee genomes. The effect of a possible slowdown on PSMC estimates is622 unclear.623 624 To this end, we simulated 10 replicates of a diploid genome, arising from a constant population size, and625 a mutation rate that is fixed at 2.5e-08 at 10Mya then decreases to 1.25e-8 at present time (Figure A9a). To626 simulate a temporally changing mutation rate, we discretised time into D=64 segments with 65 time interval627 boundaries, τ = [τ1, . . . , τ64], where628 τi = ωexp i D log 1 + Tmax ω − 1 (1) and the changes in mutation rate are piecewise constant in these intervals,µ = [µ1, . . . , µ64]. We simulate the629 coalescent process using msprime [74] with Hudson’s model of recombination [79], then utilise the memoryless630 23 .CC-BY-NC 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted October 19, 2024. ; https://doi.org/10.1101/2024.10.17.618932doi: bioRxiv preprint (a) 104 105 106 107 Years 1 1.5 2 2.5 (t) ( x 1e-08) Contant mutation rate Varying mutation rate (b) 104 105 106 107 Years 0 20000 40000 60000 80000 100000 N(t) Constant mutation rate Varying mutation rate Simulation (c) 104 105 106 107 Years 0 20000 40000 60000 80000 100000 N(t) Constant mutation rate Varying mutation rate Simulation (d) 104 105 106 107 Years 0 20000 40000 60000 80000 100000 N(t) Constant mutation rate Varying mutation rate Simulation (e) 104 105 106 107 Years 0 20000 40000 60000 80000 100000Effective population sizes Constant mutation rate Varying mutation rate Simulation Figure A9: PSMC inference on simulations with a mutation rate that varies through time. (a) The mutation rate model, µ(t): at 10Mya the mutation rate is 2.5e-08, then it slows down to 1.25e-08 at present. (b) Inference from PSMC on a simulation with constant population size, where the mutation rate changes through time according to the model in (a) in red, and a constant mutation rate (1.25e-8) in gold. (c) The same simulation as in (b) but with a population undergoing size changes. (d) and (e) are the same as (b) and (c), respectively, although the inferred N(t) on the simulation with a varying mutation rate (red) has been multiplied through by µ(t) (red line in (a)). property of the exponential distribution to add mutations at each position. With a coalescence time of t631 where τi ≤ t < τ i+1, the probability of a mutation arising on either lineage is:632 P (Mutation arises|t) = 1 − exp  −4N   i−1X j=1 µj∆j − µi(t − τi)     (2) where ∆j = τj+1 − τj.633 634 We used PSMC to infer an N(t) curve, and compared the results to N(t) curves inferred from simulations635 with a fixed mutation rate of 1.25e-8. We observed that PSMC infers an increasingly inflated N(t) in ancient636 times (Figure A9b). With the same mutation model, we also simulated changes in N(t) similar to that as637 24 .CC-BY-NC 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted October 19, 2024. ; https://doi.org/10.1101/2024.10.17.618932doi: bioRxiv preprint (a) 104 105 106 107 Years 0 10000 20000 30000 40000 50000 60000 N(t) = /5 = /2 = = 2 = 5 Simulation (b) 104 105 106 107 Time 0 10000 20000 30000 40000 50000 60000 N(t) With hotspots Without hotspots (c) 104 105 106 107 Time 0 10000 20000 30000 40000 50000 60000 N(t) With hotspots Without hotspots Figure A10: Inference from PSMC when its model of recombination is violated. (a) In a simulation with constant recombination rate ρ, we deliberately fixed the recombination rate used by the PSMC algorithm ˆρ at an incorrect value. The black line indicated the simulated N(t) and the coloured lines indicate the various values of ˆρ, which are expressed relative to the simulated value. (b) and (c) Inference from PSMC on simulations with changes in the recombination rate according to HapMap (red lines) and a constant recombination rate (gold lines). inferred in the ESN (Figure 1). Again, we observe the inference of N(t) from PSMC is increasingly inflated638 in ancient times, though the general shape of the trajectory is recovered (Figure A9c). We note that if the639 changes in mutation rate are known, we can simply scale the inference appropriately to correct the error640 (Figure A9d and e). Even though changes in the mutation rate over time can affect the estimation of N(t),641 it is unlikely that the observed ancient peak in humans is due to this, as producing a peak as an artefact642 of changing mutation rates would require several rapid and severe mutation rate fluctuations, which seems643 unlikely. We conclude that mutation rate variation through time is not a likely cause of an ancient peak, and644 note that this may explain the differences in ancient N(t) magnitudes in humans and chimpanzees (Figures645 1a and b).646 6.2.4 V ariable recombination rate across the genome647 The PSMC model also assumes a constant recombination rate across the genome, and uniformly through all648 past generations. However, recombination rate varies along the genome, with high rates at recombination649 hotspots and lower rates in different regions [80, 81, 42, 43, 44]. Looking back in time, the landscape of650 25 .CC-BY-NC 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted October 19, 2024. ; https://doi.org/10.1101/2024.10.17.618932doi: bioRxiv preprint 0.5Ma - 0.75Ma 0.75Ma - 1.0Ma 1.0Ma - 1.25Ma 1.25Ma - 1.5Ma 1.5Ma - 1.75Ma 1.75Ma - 2.0Ma 2.0Ma - 2.25Ma 2.25Ma - 2.5Ma 2.5Ma - 2.75Ma 2.75Ma - 3.0Ma 3.0Ma - 3.25Ma 3.25Ma - 3.5Ma 3.5Ma - 3.75Ma 3.75Ma - 4.0Ma 4.0Ma - 4.25Ma 4.25Ma - 4.5Ma 4.5Ma - 4.75Ma 4.75Ma - 5.0Ma 5.0Ma - 5.25Ma 5.25Ma - 5.5Ma 5.5Ma - 5.75Ma 5.75Ma - 6.0Ma 6.0Ma - 6.25Ma 6.25Ma - 6.5Ma 6.5Ma - 6.75Ma 6.75Ma - 7.0Ma 7.0Ma - 7.25Ma 7.25Ma - 7.5Ma 7.5Ma - 7.75Ma 7.75Ma - 8.0Ma 8.0Ma - 8.25Ma 8.25Ma - 8.5Ma Time 0 0.5 1 1.5 2 2.5 3 Recombination rate (x 1e-08) GBR JPT ESN Figure A11: The relationship between the posterior mean TMRCA and the HapMap recombination rate map, on chromosome 1 for one individual from GBR, JPT, and ESN in the 1000GP. The reported Pearson r2 for each population is 0.45, 0.41, and 0.50, with p-value <0.02 for each. However, clearly the relationship is non-linear; it appears that recombination rate increases as does TMRCA up until ∼2Mya, after which they become negatively correlated. This may come from model violations, in that PSMC assumes a uniform rate of recombination, or that the recombination landscape changes over time [82]. The grey, dashed line indicates the chromosome wide average recombination rate. recombination is known to transform every few hundred thousand years [82, 83, 84, 85]. It is thus unclear651 how recombination rate variation affects N(t) inference.652 653 Previous work [2] has demonstrated that N(t) inference in PSMC is robust even under simulations654 including recombination hotspots. As additional confirmation, we tested the effect of mis-specifying the655 recombination rate. Denote the simulated scaled recombination rate as ρ; we fix PSMC’s recombination656 rate ˆρ as 0.2ρ, 0.5ρ, 1ρ, 2ρ, or 5 ρ, and infer N(t) (Figure A10a). We observed that PSMC is able to recover657 N(t) relatively accurately for all values of the recombination rate except ˆ ρ/ρ equal to 5. Finally, we saw658 a significant correlation between the local recombination probability (genetic map taken from HapMap [86,659 87]; downloaded from https://alkesgroup.broadinstitute.org/Eagle/) and inferred TMRCAs660 (Figure A11), though the relationship appears to be non-linear. We investigated this by simulating from661 this genetic map, and inferring an N(t) curve assuming a constant recombination rate. We observed that662 this type of model mis-specification does not significantly alter inference (Figure A10b and A10c). Given663 these observations, we believe it is unlikely that a spatially varying recombination rate could generate a fake664 ancient peak.665 6.2.5 Recurrent mutations666 PSMC assumes an infinite sites model, in which a site can only experience one mutation. In reality, some667 ancient sites will have experienced recurrent mutations. Under a Jukes-Cantor model of mutation [88], which668 assumes that each base pair mutates with uniform probability to another, with probability 1/3 a recurrent669 mutation will be revert to its ancestral state. These will be observed as a homozygote and therefore may be670 inferred as a lower TMRCA by PSMC. With probability 2/3 a recurrent mutation induces a distinct biallelic671 SNP, which is observed as a heterozygote. These biallelic SNPs will appear to PSMC as a younger TMRCA,672 26 .CC-BY-NC 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted October 19, 2024. ; https://doi.org/10.1101/2024.10.17.618932doi: bioRxiv preprint 0.0Ma - 0.25Ma 0.25Ma - 0.5Ma 0.5Ma - 0.75Ma 0.75Ma - 1.0Ma 1.0Ma - 1.25Ma 1.25Ma - 1.5Ma 1.5Ma - 1.75Ma 1.75Ma - 2.0Ma 2.0Ma - 2.25Ma 2.25Ma - 2.5Ma 2.5Ma - 2.75Ma 2.75Ma - 3.0Ma 3.0Ma - 3.25Ma 3.25Ma - 3.5Ma 3.5Ma - 3.75Ma 3.75Ma - 4.0Ma 4.0Ma - 4.25Ma 4.25Ma - 4.5Ma 4.5Ma - 4.75Ma 4.75Ma - 5.0Ma 5.0Ma - 5.25Ma 5.25Ma - 5.5Ma 5.5Ma - 5.75Ma 5.75Ma - 6.0Ma 6.0Ma - 6.25Ma 6.25Ma - 6.5Ma 6.5Ma - 6.75Ma 6.75Ma - 7.0Ma 7.0Ma - 7.25Ma 7.25Ma - 7.5Ma 7.5Ma - 7.75Ma 7.75Ma - 8.0Ma 8.0Ma - 8.25Ma 8.25Ma - 8.5Ma Time 0.70 0.75 0.80 0.85 0.90 0.95 1.00B value GBR JPT ESN Figure A12: The relationship between the posterior mean TMRCA and a high resolution map of background selection [46], on chromosome 1 for one individual from GBR, JPT, and ESN in the 1000GP. The reported Pearson r2 for each population is 0.67, 0.67, and 0.75, with p-value <1.3e-5 for each. In general this suggests that B value increases with TMRCA (a linear line of best fit has been added to show this), though the relationship appears to be non-linear. The grey, dashed line indicates the chromosome wide average B value. so it is unlikely that these will significantly affect inference of ancient N(t). Moreover, it is generally a whole673 region of a chromosome that is informative about local TMRCA, rather than a single base pair. If a segment674 is old enough to experience a recurrent mutation, then there are likely many other neighbouring SNPs from675 which the TMRCA can be accurately inferred.676 677 Indeed, in almost all the simulations previously presented a Jukes-Cantor model [88] was used to generate678 mutations, in which recurrent mutations are allowed to occur. Generally, in simulations with uniform µ in679 time and space, we observe that inference of ancient N(t) is reasonably accurate in ancient time (Figures680 A7, A9, and A10).681 6.2.6 Background selection682

Background

selection (BGS) is a form of linked selection, where the removal of deleterious mutations reduces683 genetic diversity in the surrounding regions due to linkage [89, 90]. It has been demonstrated that BGS is684 pervasive throughout the human genome, and that this explains roughly 60% of the variance in diversity at685 the megabase scale [45, 46]. Many inference methods, however, assume that the genome evolves neutrally.686 This is problematic, as it has been shown that wrongly assuming the absence of selection means that infer-687 ence methods are not able to accurately reconstruct the demographic history [47, 49, 48].688 689 We correlated the inferred TMRCAs against a high resolution map that describes the strength of BGS690 across the human genome [46]. We saw a significant positive correlation (Pearson r2 ≈ 0.7, p-value<1e-5;691 Figure A12), which is consistent with models of BGS that model regions under strong linked selection as692 having lower effective population size [90]. To explore the effect of BGS on inference of N(t) in real data,693 the authors in [58] binned the genome of 10 YRI individuals into quintiles according the strength of BGS694 and ran PSMC in each (reproduced in Figure A13). In all quintiles except the one with strongest BGS, a695 clear second peak is seen in ancient time. The peak is not seen in the strongest BGS quintile likely because696 27 .CC-BY-NC 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted October 19, 2024. ; https://doi.org/10.1101/2024.10.17.618932doi: bioRxiv preprint 104 105 106 107 Time (years) 0 10000 20000 30000 40000 50000 60000Ne B 0.54-0.73 B 0.73-0.81 B 0.81-0.85 B 0.85-0.9 B 0.9-0.99 Figure A13: PSMC inference of N(t) in 10 YRI individuals from the 1000GP project, stratified into quintiles according to the strength of BGS (as inferred by [46]). B values indicate the local strength of BGS, with 1 being no reduction in genetic diversity due to selection and 0 being full removal of genetic diversity. This figure is reproduced with permission from [58]. 0.5Ma - 0.75Ma 0.75Ma - 1.0Ma 1.0Ma - 1.25Ma 1.25Ma - 1.5Ma 1.5Ma - 1.75Ma 1.75Ma - 2.0Ma 2.0Ma - 2.25Ma 2.25Ma - 2.5Ma 2.5Ma - 2.75Ma 2.75Ma - 3.0Ma 3.0Ma - 3.25Ma 3.25Ma - 3.5Ma 3.5Ma - 3.75Ma 3.75Ma - 4.0Ma 4.0Ma - 4.25Ma 4.25Ma - 4.5Ma 4.5Ma - 4.75Ma 4.75Ma - 5.0Ma 5.0Ma - 5.25Ma 5.25Ma - 5.5Ma 5.5Ma - 5.75Ma 5.75Ma - 6.0Ma 6.0Ma - 6.25Ma 6.25Ma - 6.5Ma 6.5Ma - 6.75Ma 6.75Ma - 7.0Ma 7.0Ma - 7.25Ma 7.25Ma - 7.5Ma 7.5Ma - 7.75Ma 7.75Ma - 8.0Ma 8.0Ma - 8.25Ma 8.25Ma - 8.5Ma Time 0 100kb 200kb 300kb 400kb 500kb 600kb 700kbDistance to CDS GBR JPT ESN Figure A14: The relationship between the posterior mean TMRCA and mean distance to coding sequence (CDS), on chromosome 1 for one individual from GBR, JPT, and ESN in the 1000GP. The reported Pearson r2 for each population is 0.63, 0.67, and 0.83, with p-value <0.0001 for each. In general this suggests that dcCDS increases with TMRCA (a linear line of best fit has been added to show this), although the relationship appears reversed from ∼3Mya to ∼4.5Ma. The grey, dashed line indicates the chromosome wide average distance to CDS. all of the input sequence has already coalesced. In none of the simulations with realistic parameters of BGS,697 as shown in [48, 91, 58, 92], does PSMC infer infer a false ancient peak. We thus find it implausible that698 widespread BGS could generate an ancient peak as observed in humans.699 6.2.7 Balancing selection700 Balancing selection (BLS) is a type of natural selection that maintains genetic diversity in a population by701 favouring the maintenance of multiple alleles. This type of selection can occur through various mechanisms,702 such as heterozygote advantage, frequency-dependent selection, or spatially varying selection [93]. BLS oper-703 28 .CC-BY-NC 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted October 19, 2024. ; https://doi.org/10.1101/2024.10.17.618932doi: bioRxiv preprint ating for a long time period will maintain advantageous polymorphism and result in an older TMRCA than704 expected under neutrality. Therefore, if BLS were sufficiently prevalent in the genome, this would manifest705 as enrichment of older coalescence times, which would increase inferred N(t) in the past.706 707 The prevalence of BLS in the human genome is unclear. Initially considered a rarity [94, 95] and over-708 looked, balanced polymorphisms have recently received renewed attention with several lines of evidence709 showing their relevance in human evolution [96, 97, 98]. Recently, hundreds of loci were implicated in possi-710 ble trans-species BLS, maintained since earlier than human-chimpanzee speciation [99]. By finding regions711 of ancient shared ancestry across 54 individuals from Complete Genomics [100], in [101] the authors suggest712 numerous regions that are under BLS. We calculated the correlation between inferred TMRCAs and distance713 to coding sequence (CDS), and found that ancient TMRCAs tend to be increasingly far away from CDS714 (Figure A14; Pearson r2 ≈ 0.71, p-value<0.0001). As BLS usually acts on or near functionally important715 parts of the genome, this makes it unlikely that the ancient TMRCAs underlying our ancient peak are driven716 by BLS and that this is the cause of the ancient peak.717 6.3 Methods718 6.3.1 Correlations with annotations719 We downloaded the the tracks for repeats, segmental duplications, and CpG islands from the UCSC Genome720 Browser https://genome.ucsc.edu/cgi-bin/hgTables [102]. The positions of coding sequence were721 obtained from Gencode ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_45/722 gencode.v45.chr_patch_hapl_scaff.basic.annotation.gff3.gz. We used the B-map as in-723 ferred by Murphy et al. [46] and lifted over from GRCh37 to GRCh38 [102].724 725 We took the posterior mean TMRCA across chromosome 1 for the 26 1000GP samples. We stratified726 the TMRCAs into windows of 250kya and analysed how various functional annotations correspond to these727 TMRCA windows, and ensured that positions in the analysis passed the strict mappability mask.728 29 .CC-BY-NC 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted October 19, 2024. ; https://doi.org/10.1101/2024.10.17.618932doi: bioRxiv preprint

Text is read by the "Ask this paper" AI Q&A widget below. Extraction quality varies by source — PMC NXML preserves structure cleanly, OA-HTML may include some navigation residue, and OA-PDF can have broken hyphenation. The publisher copy (via DOI) is the canonical version.

My notes (saved in your browser only)

⚙ Ask this paper AI returns verbatim quotes from the full text · source: oa-pdf ⓘ

Answers must be backed by verbatim quotes from this paper's full text. Hallucinated quotes are dropped automatically; if no verbatim passage answers the question, we say so. How this works

Citation neighborhood (no data yet)

We don't have any in-corpus citations linked to this paper yet. This is a recent paper (2024) — citers typically take a year or two to land, and the OpenAlex reference graph may still be filling in.

Source provenance

europepmc: last seen: 2026-05-20T01:45:00.602351+00:00
unpaywall: last seen: 2026-05-21T05:10:58.409756+00:00

License: CC-BY-NC-4.0