Accurate prediction of site- and amino-acid substitution rates with a mutation-selection model

doi:10.1101/2024.03.02.583099

Accurate prediction of site- and amino-acid substitution rates with a mutation-selection model

2024 · doi:10.1101/2024.03.02.583099

preprint OA: closed CC-BY-NC-ND-4.0

📄 Open PDF Full text JSON View at publisher

Full text 58,232 characters · extracted from oa-pdf · 10 sections · click to expand

Keywords

Amino acid site-rates, substitution matrices, mutation-selection model 7 .CC-BY-NC-ND 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted March 6, 2024. ; https://doi.org/10.1101/2024.03.02.583099doi: bioRxiv preprint 2

Abstract

8 9 The pattern of substitution s at sites in proteins provide s invaluable information about their 10 biophysical and functional importance and what selection pressures are acting at individual 11 sites. Amino acid site rates are typically estimated using phenomenological models in which 12 the sequence variability is described by rate factors that scale the overall substitution rate in a 13 protein to sites. In this study, we demonstrate that site rates can be calculated accurately from 14 amino acid sequences using a mutation-selection model in combination with a simple 15 nucleotide substitution model. The method performs better than the standard phylogenetic 16 approach on sequences generated by structure-based evolutionary dynamics simulations, 17 robustly estimates rates for shallow multiple sequence alignments , and can be rapidly 18 calculated also on larger sequence alignments . On natural sequence s, site rates from the 19 mutation-selection model are strongly correlated to rates calculated with the empirical Bayes 20 methods. The model provides a link between amino acid substitution rates and equilibrium 21 frequency distributions at sites in proteins. We show how an ensemble of equilibrium 22 frequency vectors can be used to represent the rate variation encoded in empirical amino acid 23 substitution matrices. This study demonstrates that a rapid and simple method can be developed 24 from the mutation -selection model to predict substitution rates from amino acid data, 25 complementing the standard phylogenetic approach. 26 .CC-BY-NC-ND 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted March 6, 2024. ; https://doi.org/10.1101/2024.03.02.583099doi: bioRxiv preprint 3

Introduction

27 28 Proteins evolve under several genetic, biophysical, and functional fitness constraints(Liberles, 29 et al. 2012) . This results in a substantial variation in amino acid substitution patterns at the 30 level of sites in proteins. Amino acid-based phylogenetic models primarily account for this 31 variation by introducing factors that scale the relative substitution rate at sites to the overall 32 mean substitution rate of a multiple sequence alignment (MSA)(Yang 1993; Arenas 2015) . 33 Such models hav e resulted in the improved inference of phylogenic trees (Le and Gascuel 34 2008), but the individual rate values themselves have also proven to be insightful, in particular 35 when mapped onto three -dimensional structures of proteins to characterize sequence 36 conservation(Pupko, et al. 2002a; Landau, et al. 2005) . Site-level data can report on the 37 contribution of sites to the stability, structure, and function of proteins and enables correlation 38 between substitution patterns and molecular properties(Echave, et al. 2016). 39 40 Because the evolutionary process proceeds through changes at the nucleotide level, protein 41 models are inherently phenomenological(Liberles, et al. 2012; Jones, et al. 2018). Furthermore, 42 there is significant variability of amino acid propensities at sites in proteins , which cannot be 43 fully accounted for by models that combine site rates with invariable equilibrium 44 frequencies(Echave, et al. 2016) . Incompatibilities between the generative process behind 45 protein sequence evolution and inference models used to analyze sequence data can sometimes 46 lead to biased and overconfident predictions(Jones, et al. 2018). By modeling the evolutionary 47 process at the codon level, the mutation -selection model(Halpern and Bruno 1998; Yang and 48 Nielsen 2008; Rodrigue, et al. 2010; Tamuri, et al. 2012) provides an approach to describe site-49 level heterogeneity with greater realism than protein -level models. Modeling substitution 50 processes at the level of codons and sites requires models with a large number of parameters, 51 .CC-BY-NC-ND 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted March 6, 2024. ; https://doi.org/10.1101/2024.03.02.583099doi: bioRxiv preprint 4 which results in computational as well as statistical challenges. Nonetheless, statistical 52 inference frameworks have successfully been implemented to derive site -based parameters 53 from sequence data (Yang and Nielsen 2008; Rodrigue, et al. 2010; Tamuri, et al. 2012) . 54 Mutation-selection models have primarily been used to detect signals of adaptive evolution in 55 genes or sites in genes(Tamuri, et al. 2014; Bloom 2017; Rodrigue, et al. 2020). 56 57 In the standa rd approach, the mutation-selection model is applied to nucleotide data. But in 58 many situations, it is preferable to analyze amino acid sequences. By combining the mutation-59 selection model with a nucleotide mutation model it is possible to predict site -based 60 substitution rates(Halpern and Bruno 1998; Moses, et al. 2003) based on predicted amino acid 61 equilibrium sequences. We show here that this can lead to accurate predictions of site rates 62 from amino acid sequences and improvements relative to the standard phylogenetic approach. 63 64 Average substitution rates for amino acids can also be predicted from the mutation -selection 65 model by condensing codon -level rates into amino acid instantaneous rate matrices (Norn, et 66 al. 2020). We demonstrate here that such matrices are very similar to empirical rate matrices 67 like WAG(Whelan and Goldman 2001) and LG(Le and Gascuel 2008). The mutation-selection 68 framework establishes a direct link between amino acid frequency distributions at sites and 69 protein substitution matrices. This connection has been made previously by Pollock and 70 Goldstein, who demonstrated that amino acid substitution rates could be constructed by 71 calculating a covariance matrix between site equilibrium frequency vectors (Goldstein and 72 Pollock 2016). Here we exploit the link between site frequency distributions and substitution 73 matrices to establish which distribution of site frequency vectors is encoded in empirical 74 matrices such as JTT(Jones, et al. 1992), WAG, and LG. 75 76 .CC-BY-NC-ND 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted March 6, 2024. ; https://doi.org/10.1101/2024.03.02.583099doi: bioRxiv preprint 5

Results

77 Amino acid rates from a mutation-selection model 78 The mutation-selection model is a codon-based model that describes the effect of protein-level 79 site-specific selection(Bruno 1996). Several simplifying assumptions are made to arrive at the 80 model. Selection pressures at sites are taken to be constant over all lineages, fixation is assumed 81 to be rapid compared to mutation, sites are independent, and the evolutionary process is 82 dominated by negative selection. With these premises the relative instantaneous rate between 83 (𝑞!") from codon u with amino acid i to codon v with amino acid j is modeled as the product 84 of the proposal rate of mutating codon u to v, 𝑝!", and the fixation probability, 𝑓!" , and an 85 arbitrary scaling constant k: 86 87 𝑞!" = 𝑘 ∙ 𝑝!" ∙ 𝑓!" [1] 88 89 The codon transition probability can be modeled with many approaches, but we assume here 90 that the fitness of a codon is only a function of the amino acid it encodes and that the rate of 91 mutation between codons can be modeled by the nucleotide mutation rate of the nucleotide 92 change separating connected codons (Norn, et al. 2020). Many nucleotide substitution models 93 can be utilized, the results here are obtained with the K80 model with one parameter ( 𝜅) 94 controlling the relative rate of transitions to transversions (Kimura 1980) . In addition, we 95 assign a rate for multi-nucleotide mutations ( 𝜌) accounting for insertions, deletions , and 96 tandem mutations that have been shown to play an important role in evolution(Reid and Loeb 97 1993; Harris and Nielsen 2014). With these parameters, the mutation proposal rates become: 98 99 𝑝 = * 1 + 𝜌 𝜅 + 𝜌 𝜌 𝑖𝑓 𝑡𝑟𝑎𝑛𝑠𝑣𝑒𝑟𝑠𝑖𝑜𝑛 𝑖𝑓 𝑡𝑟𝑎𝑛𝑠𝑖𝑡𝑖𝑜𝑛 𝑒𝑙𝑠𝑒 100 .CC-BY-NC-ND 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted March 6, 2024. ; https://doi.org/10.1101/2024.03.02.583099doi: bioRxiv preprint 6 101 By using the weak mutation model of Golding and Felsenstein (Golding and Felsenstein 1990), 102 the fixation probability 𝑓#$ can be approximated as: 103 104 𝑓!" = %& (!"#"$ !$#$" ) *+!$#$" !"#"$ [2] 105 106 Where pu/v is the equilibrium frequency of codon u/v and p u/v is the probability of mutation 107 between u/v. To estimate the codon equilibrium frequencies pu/v we first estimated the amino 108 acid frequencies from a multiple sequence alignment (MSA) using their maximum likelihood 109 estimates. The frequency of individual codons was then calculating by multiplying the amino 110 acid frequency by the relative frequencies of codons encoding for this amino acid. This relative 111 frequency can be estimated from genomic data. However, these do not correspond to proposal 112 frequencies since they are determined from sequences after selection. Instead, we calculated 113 the relative codon frequencies by solving for the codon equilibrium frequencies in an 114 instantaneous rate matrix w here each codon is assigned equal fitness and renormalize the 115 frequencies coding for the same amino acids to sum to 1. 116 117 The site-specific rate substitution rate 𝜇, (the total flux) is: 118 119 𝜇, = ∑ ∑ 𝛷#$ , $-## [3] 120 121 Where i and j are amino acid types and where 𝛷#$ , is the flux between i and j which can be 122 found by summing the flux between all connected codons u and v that encode for amino acid i 123 and j: 124 .CC-BY-NC-ND 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted March 6, 2024. ; https://doi.org/10.1101/2024.03.02.583099doi: bioRxiv preprint 7 [4] 125 We scale the site-specific rate to correspond to a n average of one neutral substitution by per 126 time-unit(Bruno 1996; Spielman and Wilke 2015). 127 128 Maximum-likelihood estimation of model parameters 129 There are two nuisance parameters in the model, 𝜅 and 𝜌. Following the approach in Norn et 130 al.(Norn, et al. 2020) we first construct a codon-level instantaneous rate matrix and aggregate 131 it into a protein -level protein rate matrix. To find optimal values of 𝜅 and 𝜌 we find the 132 combination of parameters that maximizes the likelihood in phylogenetic tree reconstruction 133 from a set of MSAs using the corresponding protein rate matrix. Substitution matrices were 134 computed from 221 pfam alignments used by Le and Gascuel to infer the LG matrix (Le and 135 Gascuel 2008) . The substitution matrix calculated using each set of parameters w as then 136 evaluated by inferring trees 59 TreeBase alignments, also taken from the study by Le and 137 Gascuel(Le and Gascuel 2008) . The s um log -likelihoods were used to identify the optimal 138 parameters. A total of 42 parameter combinations were evaluated in a grid search. 𝜅=1.4 and 139 𝜌=0.2 were identified as the optimal solution. The likelihood surface is very flat with respect 140 to the variation of 𝜅, whereas a clear optimum was found for 𝜌 (Supplementary Figure 1). 141 The resulting protein rate matrix can be compared to the empirical rates matrices JTT, WAG, 142 and LG. In Figure 1, the amino acid substitution rate values of the optimal rate matrix (referred 143 to as MutSel, with 𝜅=1.4 and 𝜌=0.2) are compared to the values of JTT, WAG, and LG on the 144 logarithmic scale. There is a strong correlation between the values in these matrices, with r2 145 values of 0.89, 0.83, and 0.79 for JTT, WAG, and LG, respectively. This demonstrates that the 146 relative average amino acid substitution rates are well predicted by the model (to the extent 147 that empirical matrices correspond to true values). However, there is a non -unit slope in the 148 Φi→ j L = π i Lqi→ j L = π v Lqvu L v∈j ∑ u∈i ∑ .CC-BY-NC-ND 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted March 6, 2024. ; https://doi.org/10.1101/2024.03.02.583099doi: bioRxiv preprint 8 correlation with the empirical matrices that is caused by a lower variation of amino acid rates 149 in the MutSel model. 150 151 Figure 1: Comparison of MutSel rate matrix with JTT(left), WAG(center) and LG(right). The 152 plot of amino -acid rates (qij) of WAG(left)/LG(right) and the same element in MutSel. Black 153 line, linear fit. 154 155 Two aspects could give rise to biased predictions. First, limited sampling of sequences means 156 that the equilibrium frequencies for rare amino acid variants will tend to zero values. This can 157 be combatted by introducing pseudo -counts, which are often used during remote homolog 158 detection(Henikoff and Henikoff 1994) . To avoid biasing results towards the empirical 159 matrices we compare to, we tested a simple approach for adding pseudo-counts using a Jeffreys 160 prior for a multinomial distribution . This , however, resulted in a significantly reduced 161 performance. The second aspect is that MSAs do not correspond to random samples of 162 sequences. This can be addressed with the reweighting of sequences before the calculation of 163 equilibrium frequencies. We tested a scheme that weight sequences based on a phylogenetic 164 tree(Stone and Sidow 2007). This resulted in a small, but consistent, reduction in performance. 165 166 167 168 .CC-BY-NC-ND 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted March 6, 2024. ; https://doi.org/10.1101/2024.03.02.583099doi: bioRxiv preprint 9 Simulation of protein sequence data 169 Having established values for the nuisance parameters of the model we now go on to evaluate 170 the capability of the mutation-selection model in estimating site-rates. The dataset used in the 171 validation was the result of evolutionary dynamics simulations. A benefit of this simulation 172 strategy is that the ground truth is known in this scenario. For four proteins ( Barnase, 173 Ribonuclease T1, Lysozyme, and protein inhibitor CI2) we simulated evolutionary trajectories 174 according to phylogenetic trees inferred from natural MSAs for these systems (data taken from 175 Norn et. al.(Norn, et al. 2024)). In the simulations, a recursive algorithm was used to “crawl” 176 the tree and at each new node an evolutionary dynamics trajectory is launched with a branch 177 length taken from the empirical tree. RosettaEvolve (Norn and Andre 2023) , part of Rosetta 178 macromolecular modeling package(Leman, et al. 2020), was used for protein structure -based 179 evolutionary dynamics simulations. RosettaEvolve is similar to other structure -based 180 evolutionary dynamics methods presented in the literature(Goldstein 2011; Pollock, et al. 2012; 181 Shah, et al. 2015; Jiang, et al. 2018) . Mutations are proposed at the nucleotide level and 182 correspond to either single-base pair mutations or whole-codon codon changes. The probability 183 of fixating a mutation is estimated using Kimura’s expression for fixation probability(Kimura 184 1962). The simulations are carried out with the assumption that neutral evolution is primarily 185 driven by thermodynamic stability, with a fitness function that equals the fraction of folded 186 protein(Williams, et al. 2006). The effect of mutations on stability is calculated using structure-187 based DDG calculations (Park, et al. 2016) . The sequence diversity generated in these 188 simulations is depending on the assumed stability of the protein. The stability was selected in 189 these simulations to result in sequence entropies of simulated alignments that is similar to the 190 values in the empirical MSA. The leaf-nodes were collected to create MSAs from which site -191 rates were predicted. Because the full evolutionary trajectories during the tree crawling is 192 recorded the true substitution rate is known for each site in a protein. 193 .CC-BY-NC-ND 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted March 6, 2024. ; https://doi.org/10.1101/2024.03.02.583099doi: bioRxiv preprint 10 194 The comparison of substitution rates estimated from the mutation -selection model and 195 substitutions counted per site counted from the trajectories are shown in the upper row in 196 Figure 2 for the four simulated proteins. 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 .CC-BY-NC-ND 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted March 6, 2024. ; https://doi.org/10.1101/2024.03.02.583099doi: bioRxiv preprint 11 217 .CC-BY-NC-ND 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted March 6, 2024. ; https://doi.org/10.1101/2024.03.02.583099doi: bioRxiv preprint 12 Figure 2: Comparison of rates predicted from sequence and the true substitution count. Left 218 column, comparison of site rates predicted from the mutation-selection model compared with 219 counted substitutions in evolutionary dynamics simulations based on the structure of Barnase 220 (PDBID: 1a2p), Ribonuclease T1 (PDBID: 1rn1), Lysozyme (PDBID: 1el1) and CI2 (PDBID: 221 2CI2). Black circles, individual site-rates. Black line, linear fit. Right column, a comparison of 222 site rates predicted using phylogenetic inference through LEISR(Spielman and Pond 2018) to 223 counted substitutions for 1a2p, 1rn1, 1el1, and 2ci2. Black circles, individual site-rates. Black 224 line, linear fit. 225 226 The rates predicted from the mutation-selection model have a strong linear correlation with the 227 counted substitutions from the evolutionary dynamics simulations. r 2 values range from 0.90-228 0.94 (Barnase: 0.93, Ribonuclease T1: 0.94, Lysozyme: 0.90, CI2: 0.94). We compared these 229

Results

with site rates inferred using Maximum likelihood calculations using LEISR(Spielman 230 and Kosakovsky Pond 2018) , which builds upon the methodology of the pioneering 231 rate4site(Pupko, et al. 2002b) , but has improved support for larger MSAs. LEISR also 232 accurately predicts site rates (bottom row Figure 2), with r 2 values ranging from 0.87 -0.92 233 (Barnase: 0.88, Ribonuclease T1: 0.92, Lysozyme: 0.87, CI2: 0.88). In contrast to the mutation-234 selection model the correlation counted substitutions deviates slightly from linear correlation. 235 236 The predictions from the mutation-selection model and LEISR were made with relatively large 237 MSAs (Barnase: 237 sequences, Ribonuclease T1: 306, Lysozyme: 537, CI2: 529). Because 238 the estimation of the equilibrium frequencies should become more uncertain as the number of 239 sequences decreases, we expect the performance of the site -rate prediction to deteriorate. To 240 investigate the correlation of accuracy of site-rate predictions as a function of the depth of the 241 sequence alignment we sampled smaller subsets of sequences and predicted site -rates from 242 .CC-BY-NC-ND 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted March 6, 2024. ; https://doi.org/10.1101/2024.03.02.583099doi: bioRxiv preprint 13 these. Figure 3 shows how the r2 between predicted rates and counted substitutions varies with 243 the depth of the sequence alignment. 244 245 Figure 3: Effect of depth of MSA on site-rate prediction. Average Pearson r2 between predicted 246 rates from the mutation-selection model (black) and LEISR (red) as a function of the number 247 of sequences in the alignment. Each point is an average of 5 random sets of sequences. 248 249 The results show that 50 or more sequences are required to reach the highest correlation with 250 the empirical substitution counts for the mutation -selection model . Nonetheless, with 30 251 sequences the r2 values are still around 0.8 demonstrating that accurate predictions can be made 252 with relatively few sequences. In these examples, LEISR requires more than 125 sequences to 253 reach similar accuracies and has consistently lower correlation across all sequence depths. 254 0 50 100 150 200 250 300 Alignment depth 0.2 0.4 0.6 0.8Average r2 1RN1 0 100 200 300 400 500 Alignment depth 0.2 0.4 0.6 0.8 Average r2 2CI2 0 50 100 150 200 Alignment depth 0.2 0.4 0.6 0.8 Average r2 1A2P 0 100 200 300 400 500 Alignment depth 0.2 0.4 0.6 0.8 Average r2 1EL1 .CC-BY-NC-ND 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted March 6, 2024. ; https://doi.org/10.1101/2024.03.02.583099doi: bioRxiv preprint 14 Comparison with rates inferred by maximum likelihood 255 Proteins evolve by an evolutionary mechanism that is considerably more complex than what is 256 encoded in simulated data. To investigate how the mutation -selection model predicted rates 257 from a set of 213 protein sequence alignments presented by Echave et. al. (Echave, et al. 2015). 258 This study included site rates estimated using the empirical Bayes rate4site method, using two 259 different substitution matrices, JC and JTT. In Figure 4 the correlation between rates estimated 260 from the mutation-selection model and the rates4sites are shown. 261 262 263 Figure 4: Correlation between rates from mutation -selection model and rates for site. Site -264 rates estimated from 213 MSAs using the mutation -selection model and rate4sites with two 265 different substitution matrices (JC and JTT, blue and orange respectively) and compared with 266 r2 correlation in swarmplot. Correlation between rates estimated using JC and JTT are shown 267 in green. 268 .CC-BY-NC-ND 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted March 6, 2024. ; https://doi.org/10.1101/2024.03.02.583099doi: bioRxiv preprint 15 269 The average r 2 correlation is 0. 74 to rates inferred with the JC matrix and 0. 66 to JTT. The 270 amino-acid JC model (Mayrose, et al. 2004) assigns an equal probability for all amino acid 271 mutations, which gives higher correlations with the mutation -selection. The choice of 272 substitution matrix has an impact on inferred rates (average r2 0.93), but this variation is smaller 273 than the variation between mutation -selection model and maximum likelihood. Nonetheless, 274 the mutation-selection model gives similar rates to empirical Bayes. 275 276 Deconstructing empirical substitution matrices into the distribution of equilibrium frequency 277 vectors 278 The mutation-selection model couples the distributions of equilibrium frequencies to the amino 279 acid substitution rates in instantaneous rate matrices. The site frequency vectors used to 280 construct an amino acid substitution matrix constitute an ensemble of site probability vectors. 281 Such ensembles will be unique for each empirical substitution matrix. In this section, we derive 282 such probability vectors for JTT, WAG , and LG by fitting a distribution that maximize s the 283 correlation between the mutation -selection matrix and the empirical matrix. Such 284 deconvolution can give insight into what the implicit assumptions about site rates have gone 285 into empirical rate matrices used in phylogenetic analysis. 286 287 Previous studies have demonstrated that the amino -acid frequency distribution at sites in 288 proteins is well approximated by the following expression (Porto, et al. 2005; Johnson and 289 Wilke 2020) if the frequencies (p) are ordered from large to small values (p1>p2…>p20): 290 291 𝜋# = .%&' ∑ .%('( [5] 292 293 .CC-BY-NC-ND 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted March 6, 2024. ; https://doi.org/10.1101/2024.03.02.583099doi: bioRxiv preprint 16 where i=1..20 and l controls the shape of the distribution of p-values. Large values of l results 294 in frequency profiles dominated by a few amino acids. We used the method presented in 295 Johnson and Wilke(Johnson and Wilke 2020) and determined a histogram of l-values in the 296 MSAs used to fit the LG matrix. The l-distribution is well -described by a gamma function 297 (Supplementary figure 2). We sampled 1000 values from this gamma l-distribution to create 298 an ensemble of frequency vectors. Equation 5 does not provide a unique mapping to amino 299 acid identities. We used a sampling approach to assign identities of amino acids to these 300 frequency vectors so that they approximately result ed in average frequencies of amino acids 301 that match ed the equilibrium frequency distribution in LG. From this ensemble of 1000 302 frequency vectors, a subset of 100 vectors were picked out that were used to calculate an 303 instantaneous rate matrix through the mutation -selection framework. The ensemble was 304 selected to maximize the correlation with the empirical matrices JTT, WAG , and LG. 305 Optimization was carried out by simulated annealing. The result of this procedure is an 306 ensemble of frequency distributions that maximizes the correlation to empirical matrices and 307 the result is shown in Figure 5. 308 309 310 .CC-BY-NC-ND 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted March 6, 2024. ; https://doi.org/10.1101/2024.03.02.583099doi: bioRxiv preprint 17 311 Figure 5: Fit of amino acid frequency distributions to empirical matrices. Upper row, plot of 312 amino-acid rates (q ij) of JTT( left)/WAG(center)/LG(right) and the same element in fitted 313 mutation-selection model. Black line, linear fit. Lower row, histogram of the effective number 314 of amino acids (Neff) in fit to JTT, WAG, and LG. 315 316 The fitted matrices are highly correlated to their empirical counterparts with r2 values of 0.95, 317 0.92, and 0.87 for JTT, WAG , and LG, respectively. To characterize the fitted frequency 318 distributions, the effective number of amino acids (neff) was calculated for each vector in the 319 ensemble. This is a convenient metric to characterize variability at sites in a protein. neff values 320 range between 1 and 20, where 1 means that only one type of amino acid is found at a site and 321 20 means that all 20 occur with equal probability. The distribution of effective amino numbers 322 for each matrix is presented in Figure 5. JTT is substantially different from WAG and LG, 323 with a broader distribution and more vectors corresponding to highly variable “sites” (higher 324 neff values). The average neff for JTT is 4.6. The distribution of WAG and LG have more in 325 common, although WAG has a higher average neff than LG (4.0 vs 3.4). This primarily stems 326 from the higher frequency of very invariable “sites” for WAG (low neff values). 327 .CC-BY-NC-ND 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted March 6, 2024. ; https://doi.org/10.1101/2024.03.02.583099doi: bioRxiv preprint 18

Discussion

328 The mutation-selection model provides a powerful framework for describing how the action of 329 mutation and selection combine to generate site variability in protein sequences. In this model, 330 site variability emerges as a consequence of the relative time -averaged fitness of codons and 331 individual amino acids. Phylogenetic models used in tree inference from protein sequences 332 typically assume that all sites evolve with the same equilibrium frequencies and site variability 333 is introduced by a phenomenological rate scale factor applied to each site. This restriction can 334 be loosened by allowing the equilibrium frequency distribution to vary at sites (Quang le, et al. 335 2008). However, the large number of parameters of such models coupled to their computational 336 complexity limits their utility. The performance of variable site frequency models in site -rate 337 predictions has also not been benchmarked. 338 339 We focus here on the prediction of substitution rates at individual sites of proteins from amino 340 acid sequence data and show that they can be calculated using the mutation -selection 341 framework with just two nuisance parameters (𝜅 and 𝜌). In practice, the model is insensitive to 342 the value of 𝜅 leaving a single modeling parameter. The main complication is the estimation 343 of equilibrium frequency values from MSAs. The amino acid distributions of leaf sequences 344 are not necessarily good estimators of equilibrium frequencies. For sequences with limited 345 divergence, this may become an issue, limiting the approach presented here. There is also an 346 issue with uncertainties in equilibrium frequency distributions due to limited statistical 347 sampling. However, based on experiments using a simulated data set, we showed that accurate 348 predictions of site rates can be predicted with as little as 20 sequences for two of the systems 349 studied here. This is surprising but can be explained because most sites are highly conserved 350 so equilibrium frequencies can be estimated with relative ly few sequences. Also, since site 351 rates are an aggregate of rates across many codons , errors in individual amino acid frequency 352 .CC-BY-NC-ND 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted March 6, 2024. ; https://doi.org/10.1101/2024.03.02.583099doi: bioRxiv preprint 19 values may be washed out in the model. Comparison with rate -scale factors predicted from 353 phylogenetic inference demonstrates some benefits of the mutation-selection model. Prediction 354 accuracy is better for all simulated systems and across all alignment depths and computations 355 can be carried out much more rapidly. Nonetheless, with enough sequences, the maximum 356 likelihood estimated rate factors are also accurate. Since the underlying data is based o n 357 nucleotide mutation trajectories, it is encouraging to see that both approach es result in good 358 predictions. 359 360 Because benchmarks with known rates are difficult to construct from empirical data, we have 361 used a simulation approach in this study. The evolutionary dynamics simulation generates 362 sequences that result from time - as well as context -dependent fluctuations of fitness. By 363 simulating evolutionary dynamics using a phylogenetic tree, sequences maintain similar 364 phylogenetic relationships as those found in natural sequences. The approach thus provides a 365 more realistic model of protein evolution than nucleotide or amino acid -based simulators . 366 However, one caveat is that the mutation model underlying the evolutionary dynamics 367 simulations shares similarities to the one used in the mutation -selection model, which could 368 give the mutation-selection model an advantage in the comparisons. In particular, the proposal 369 rates in the evolutionary dynamics simulations are also governed by a single nucleotide 370 mutation rate 𝜅 and a multi -nucleotide rate 𝜌. However, the parameter values used in the 371 evolutionary dynamics simulation were different from the values used to calculate the site rates 372 and we show that the value of 𝜅 does not influence the rates much. One factor not included in 373 the evolutionary dynamics simulation is selection at the codon level, which we know occurs in 374 natural sequence evolution. Such codon-fitness models could be introduced into the mutation-375 selection framework, further improvement could also be made by using a more complex, but 376 parameter-rich, model for nucleotide mutations. However, i t is not clear that this would 377 .CC-BY-NC-ND 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted March 6, 2024. ; https://doi.org/10.1101/2024.03.02.583099doi: bioRxiv preprint 20 improve the result given the fact that the input data is amino acid sequences that do not inform 378 on codon usage. With more parameters in the model, using a continuous method for maximum 379 likelihood optimization of parameters would be beneficial (Holder, et al. 2008). The accuracy 380 of site-rate predictions has been previously characterized by Scheffler et al. (Scheffler, et al. 381 2014). They found that rates predicted from phylogenetic inference were linearly correlated 382 with experimental values and that inference showed good convergence. However, one caveat 383 with their study is that rates were simulated with the same gamma-distributed rate factors used 384 in the inference model. 385 386 The matrix fitting result demonstrate s that mutation-selection model can be parametrized to 387 give rise to matrices that are very similar to the empirical rate matrices, with r 2 values going 388 up to 0.95 for JTT. This means that we can think of substitution matrices as being the result 389 of an averaging of an ensemble of site -specific rate matrices, each parametrized by a single 390 equilibrium frequency vector. Nonetheless, there appears to be an upper limit to the quality of 391 the fit because independent fitting experiments converge to similar correlation values. The 392 more complex the matrix (JTT<WAG<LG), the lower the correlation (same as the pattern for 393 the matrices calculated from natural sequences in Figure 1). LG, for example, differs from 394 WAG by being inferred by introducing site -rate scale factors in the maximum likelihood 395 estimation. Because of the vast number of parameters in the WAG and LG fitting and statistical 396 uncertainty in inference, it is not expected that mutation-selection model could explain the full 397 variation in the empirical matrices. 398 399 The fitted distributions of equilibrium frequencies can be used to understand some of the 400 implicit assumptions encoded in empirical matrices. JTT is constructed by the traditional 401

Method

of counting amino acid substitutions in MSAs (Jones, et al. 1992) . JTT is consistent 402 .CC-BY-NC-ND 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted March 6, 2024. ; https://doi.org/10.1101/2024.03.02.583099doi: bioRxiv preprint 21 with a broad ensemble of highly variable site frequency vectors. The introduction of site -403 specific rate factors in LG compared to LG results in a lower frequency of highly conserved 404 sites. By varying the ensemble of frequency vectors, it is possible to create a vast array of 405 substitution matrices with different properties. 406 407 In this study, we have demonstrated that substitution rates at sites in proteins can be calculated 408 by combining the mutation-selection model with a nucleotide mutation model and equilibrium 409 frequency predictions from protein sequence alignment. This was used by Halpern and 410 Bruno(Halpern and Bruno 1998) in the original mutation-selection study to characterize branch 411 lengths but has not been generally applied for site -rate predictions from amino acid data. The 412 approach enables fast and accurate site -rate predictions even with limited sequence data and 413 provides a link between the frequency distribution at sites in proteins and amino acid 414 substitution rates encoded in amino acid substitution matrices. 415 416

Materials and methods

417 Rate matrix calculations 418 The 3912 pfam alignments employed by Le and Gascuel to train the LG matrix(Le and Gascuel 419 2008) were used as the basis for calculating protein rate matrices. We selected a subset of 420 alignments with 30 or more sequences, leaving 221 MSAs. Amino acid frequencies at each site 421 were estimated by counting the occurrences in each column of the MSA, not counting gaps. 422 Sites with more than 50% gaps were discarded, although this filter had no substantial impact 423 on the result. To account for deficiencies in statistical sampling pseudo -counts are often 424 employed. When pseudo-counts were tested, we used a Jeffreys prior and added a count 0.5 to 425 each amino acid. To account for biased sampling of amino acid sequences we weighted the 426 sequence based on a phylogenetic tree. For each of the 221 alignments we use FastTree(Price, 427 .CC-BY-NC-ND 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted March 6, 2024. ; https://doi.org/10.1101/2024.03.02.583099doi: bioRxiv preprint 22 et al. 2010) to calculate phylogenetic trees. BranchManager(Stone and Sidow 2007) was then 428 utilized to create weights for each sequence based on the tree. A codon -based instantaneous 429 rate matrix was then constructed based on equation 1. By assigning equal fitness to codons in 430 the model we can solve for the equilibrium frequency distribution and normalize the 431 frequencies so that codons that code for the same amino acid sum to 1. Codon frequencies were 432 then calculated by multiplying the relative codon frequencies with the amino acid frequency 433 estimated from the MSA. Following the aggregation procedure in Yang et al.(Yang, et al. 1998) 434 and Norn et al. (Norn, et al. 2020) the codon -level instantaneous rate matrix can be by 435 condensed into a protein-level matrix. The instantaneous rate (𝑞#→$ , ) between two amino acids 436 (i, j) with codons uÎi and uÎi can be calculated as: 437 438 [6] 439 440 Where 𝜋 is an equilibrium frequency. By averaging the flux at many sites in many protein 441 MSAs we can calculate the mean instantaneous rate for amino acid i to j as: 442 443 𝑞#$ = 〈2&(,*+,- . 〉. 〈4&〉. = 5 2&(,*+,- . . 5 6& . . [7] 444 445 Before averaging, the site rates were normalized to correspond to one neutral substitution by 446 per time-unit(Bruno 1996; Spielman and Wilke 2015). Rate calculations take on the the order 447 of 10s. 448 449 450 qi→ j L = π v L π i L qvu L v∈j ∑ u∈i ∑ .CC-BY-NC-ND 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted March 6, 2024. ; https://doi.org/10.1101/2024.03.02.583099doi: bioRxiv preprint 23 Maximum Likelihood estimation of model parameters 451 A grid-based Maximum Likelihood search was carried out for the parameters, (𝜅 ∈ [1.4-2.6] 452 and 𝜌 ∈ [ 0-1.0]. For each combination 𝜅 and 𝜌 values we generated a protein rate matrix. The 453 rate matrix was used for the inference of phylogenetic trees, using the 59 TreeBase test 454 alignments utilized by Le and Gascuel in their study on the LG matrix(Le and Gascuel 2008). 455 IQTREE(Minh, et al. 2020) with MATRIX+FO+G4 (Where MATRIX is the matrix mutation 456 selection matrix) was used for tree inference. The likelihoods were summed up for all 59 457 alignments to identify the maximum likelihood solution. 458 459 𝜃? = argmax ∑ logL(𝑀𝑆𝐴#|𝑄(𝜅, 𝜌))7 #8* [8] 460 461 Evolutionary dynamics simulations 462 Evolutionary dynamics w ere simulated with RosettaEvolve, which is part of the Rosetta 463 macromolecular modeling package(Leman, et al. 2020). The data in this study was taken from 464 Norn et. al.(Norn, et al. 2024), which together with the study presenting RosettaEvolve (Norn 465 and Andre 2023) gives a complete methodological description. Phylogenetic t rees for 466 simulations were generated using RaXML (Stamatakis 2014) based on natural sequence 467 alignments of the four proteins simulated in the study. Tree-crawling started from the center of 468 the tree by simulating each individual branch of the tree in a recursive fashion. The leaf nodes 469 were then used for site-rate predictions. 470 471 Site-rate fitting of empirical matrices 472 The 3912 sequences in the pfam sequences used to fit the LG matrix was analyzed to extract a 473 histogram over l-values (equation 5) with scripts and methods presented by Johnson and 474 Wilke(Johnson and Wilke 2020) . A gamma distribution was fitted to the l-values, results 475 .CC-BY-NC-ND 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted March 6, 2024. ; https://doi.org/10.1101/2024.03.02.583099doi: bioRxiv preprint 24 shown in Supplementary Figure 2, using the fitdistr package in R(Team 2020). The resulting 476 gamma distribution was used to sample 1000 l-values. Equation 5 provides a list of sorted 477 frequency values from large to small. To assign identities to a frequency vector a simple 478 approximate sampling approach was used. For the first largest frequency value in the vector 479 amino acids are sampled according to the equilibrium frequencies of LG. Then the second 480 largest element is sampled, with frequency according to the equilibrium frequencies of LG 481 excluding the already selected amino acid. This is iterated until the 20th amino acid is assigned. 482 The resulting vectors have mean probabilities that differs slightly from LG. An ensemble of 483 these can selected to perfectly converge to the LG equilibrium frequencies. Since the ensemble 484 will any way be fitted in the next step, we skipped this step. Simulated annealing 485 (https://github.com/perrygeo/simanneal) was then used select a 100 site-frequency vectors out 486 of the ensemble of 1000 vectors to optimize the Pearson -correlation between a matrix 487 calculated from the ensemble and a reference matrix (JTT,WAG or LG). 488 489 Effective number of amino acids was calculated as: 490 491 𝑛.99 = 𝑒𝑥𝑝(− ∑ 𝜋#𝑙𝑛# 𝜋# ) [12] 492 493 Data availability 494 The data underlying this article are available inhttps://github.com/Andre-lab/mutsel-rate-495 manuscript.git. or will be shared on reasonable request to the corresponding author. 496 497 498 499 500 .CC-BY-NC-ND 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted March 6, 2024. ; https://doi.org/10.1101/2024.03.02.583099doi: bioRxiv preprint 25

References

501 Arenas M. 2015. Trends in substitution models of molecular evolution. Frontiers in genetics 502 6:319. 503 Bloom JD. 2017. Identification of positive selection in genes is greatly improved by using 504 experimentally informed site-specific models. Biology Direct 12. 505 Bruno WJ. 1996. Modeling residue usage in aligned protein sequences via maximum 506 likelihood. Molecular Biology and Evolution 13:1368-1374. 507 Echave J, Jackson EL, Wilke CO. 2015. Relationship between protein thermodynamic 508 constraints and variation of evolutionary rates among sites. Physical Biology 12:025002. 509 Echave J, Spielman SJ, Wilke CO. 2016. Causes of evolutionary rate variation among protein 510 sites. Nature Reviews Genetics 17:109-121. 511 Golding B, Felsenstein J. 1990. A Maximum -Likelihood Approach to the Detection of 512 Selection from a Phylogeny. Journal of Molecular Evolution 31:511-523. 513 Goldstein RA. 2011. The evolution and evolutionary consequences of marginal thermostability 514 in proteins. Proteins-Structure Function and Bioinformatics 79:1396-1407. 515 Goldstein RA, Pollock DD. 2016. The tangled bank of amino acids. Protein Science 25:1354-516 1362. 517 Halpern AL, Bruno WJ. 1998. Evolutionary distances for protein-coding sequences: modeling 518 site-specific residue frequencies. Molecular Biology and Evolution 15:910 917. 519 Harris K, Nielsen R. 2014. Error-prone polymerase activity causes multinucleotide mutations 520 in humans. Genome Research 24:1445-1454. 521 Henikoff S, Henikoff JG. 1994. Position -Based Sequence Weights. Journal of Molecular 522 Biology 243:574-578. 523 .CC-BY-NC-ND 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted March 6, 2024. ; https://doi.org/10.1101/2024.03.02.583099doi: bioRxiv preprint 26 Holder MT, Zwickl DJ, Dessimoz C. 2008. Evaluating the robustness of phylogenetic methods 524 to among -site variability in substitution processes. Philos Trans R Soc Lond B Biol Sci 525 363:4013-4021. 526 Jiang Q, Teufel AI, Jackson EL, Wilke CO. 2018. Beyond Thermodynamic Constraints: 527 Evolutionary Sampling Generates Realistic Protein Sequence Variation. Genetics 208:1387 -528 1395. 529 Johnson MM, Wilke CO. 2020. Site -Specific Amino Acid Distributions Follow a Universal 530 Shape. Journal of Molecular Evolution 88:731-741. 531 Jones CT, Youssef N, Susko E, Bielawski JP. 2018. Phenomenological Load on Model 532 Parameters Can Lead to False Biological Conclusions. Molecular Biology and Evolution 533 35:1473-1488. 534 Jones DT, Taylor WR, Thornton JM. 1992. The Rapid Generation of Mutation Data Matrices 535 from Protein Sequences. Computer Applications in the Biosciences 8:275-282. 536 Kimura M. 1962. On the probability of fixation of mutant genes in a population. Genetics 537 47:713-719. 538 Kimura M. 1980. A Simple Method for Estimating Evolutionary Rates of Base Substitutions 539 through Comparative Studies of Nucleotide -Sequences. Journal of Molecular Evolution 540 16:111-120. 541 Landau M, Mayrose I, Rosenberg Y, Glaser F, Martz E, Pupko T, Ben -Tal N. 2005. ConSurf 542 2005: the projection of evolutionary conservation scores of residues on protein structures. 543 Nucleic Acids Research 33:W299-W302. 544 Le SQ, Gascuel O. 2008. An improved general amino acid replacement matrix. Molecular 545 Biology and Evolution 25:1307-1320. 546 .CC-BY-NC-ND 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted March 6, 2024. ; https://doi.org/10.1101/2024.03.02.583099doi: bioRxiv preprint 27 Leman JK, Weitzner BD, Lewis SM, Adolf -Bryfogle J, Alam N, Alford RF, Aprahamian M, 547 Baker D, Barlow KA, Barth P, et al. 2020. Macromolecular modeling and design in Rosetta: 548 recent methods and frameworks. Nat Methods 17:665-680. 549 Liberles DA, Teichmann SA, Bahar I, Bastolla U, Bloom J, Bornberg -Bauer E, Colwell LJ, 550 Koning APJd, Dokholyan NV, Echave J, et al. 2012. The interface of protein structure, protein 551 biophysics, and molecular evolution. Protein science : a publication of the Protein Society 552 21:769 785. 553 Mayrose I, Graur D, Ben -Tal N, Pupko T. 2004. Comparison of site -specific rate-inference 554

Methods

for protein sequences: empirical Bayesian methods are superior. Molecular Biology 555 and Evolution 21:1781-1791. 556 Minh BQ, Schmidt HA, Chernomor O, Schrempf D, Woodhams MD, von Haeseler A, Lanfear 557 R. 2020. IQ -TREE 2: New Models and Efficient Methods for Phylogenetic Inference in the 558 Genomic Era. Molecular Biology and Evolution 37:1530-1534. 559 Moses AM, Chiang DY, Kellis M, Lander ES, Eisen MB. 2003. Position specific variation in 560 the rate of evolution in transcription factor binding sites. BMC Evol Biol 3:19. 561 Norn C, Andre I. 2023. Atomistic simulation of protein evolution reveals sequence covariation 562 and time -dependent fluctuations of site -specific substitution rates. PLoS Comput Biol 563 19:e1010262. 564 Norn C, André I, Theobald DL. 2020. A thermodynamic model of protein structure evolution 565 explains empirical amino acid rate matrices. bioRxiv:2020.2012.2002.408807. 566 Norn C, Oliveira F, Andre I. 2024. Improved prediction of site -rates from structure with 567 averaging across homologs. bioRxiv:2024.2002.2027.582061. 568 Park H, Bradley P, Greisen P, Jr., Liu Y, Mulligan VK, Kim DE, Baker D, DiMaio F. 2016. 569 Simultaneous Optimization of Biomolecular Energy Functions on Features from Small 570 Molecules and Macromolecules. J Chem Theory Comput 12:6201-6212. 571 .CC-BY-NC-ND 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted March 6, 2024. ; https://doi.org/10.1101/2024.03.02.583099doi: bioRxiv preprint 28 Pollock DD, Thiltgen G, Goldstein RA. 2012. Amino acid coevolution induces an evolutionary 572 Stokes shift. Proc Natl Acad Sci U S A 109:E1352-1359. 573 Porto M, Roman HE, Vendruscolo M, Bastolla U. 2005. Prediction of site-specific amino acid 574 distributions and limits of divergent evolutionary changes in protein sequences (vol 22, pg 630, 575 2005). Molecular Biology and Evolution 22:1156-1156. 576 Price MN, Dehal PS, Arkin AP. 2010. FastTree 2 --approximately maximum-likelihood trees 577 for large alignments. PLoS ONE 5:e9490. 578 Pupko T, Bell RE, Mayrose I, Glaser F, Ben -Tal N. 2002a. Rate4Site: an algorithmic tool for 579 the identification of functional regions in proteins by surface mapping of evolutionary 580 determinants within their homologues. Bioinformatics 18:S71-S77. 581 Pupko T, Bell RE, Mayrose I, Glaser F, Ben -Tal N. 2002b. Rate4Site: an algorithmic tool for 582 the identification of functional regions in proteins by surface mapping of evolutionary 583 determinants within their homologues. Bioinformatics 18 Suppl 1:S71-77. 584 Quang le S, Gascuel O, Lartillot N. 2008. Empirical profile mixture models for phylogenetic 585 reconstruction. Bioinformatics 24:2317-2323. 586 Reid TM, Loeb LA. 1993. Tandem Double Cc -]Tt Mutations Are Produced by Reactive 587 Oxygen Species. Proceedings of the National Academy of Sciences of the United States of 588 America 90:3904-3907. 589 Rodrigue N, Latrille T, Lartillot N. 2020. A Bayesian mutation -selection framework for 590 detecting site -specific adaptive evolution in protein -coding genes. Molecular Biology and 591 Evolution. 592 Rodrigue N, Philippe H, Lartillot N. 2010. Mutation -selection models of coding sequence 593 evolution with site -heterogeneous amino acid fitness profiles. Proc Natl Acad Sci U S A 594 107:4629-4634. 595 .CC-BY-NC-ND 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted March 6, 2024. ; https://doi.org/10.1101/2024.03.02.583099doi: bioRxiv preprint 29 Scheffler K, Murrell B, Pond SLK. 2014. On the Validity of Evolutionary Models with Site -596 Specific Parameters. PLoS ONE 9:e94534. 597 Shah P, McCandlish DM, Plotkin JB. 2015. Contingency and entrenchment in protein 598 evolution under purifying selection. Proceedings of the National Academy of Sciences of the 599 United States of America 112:7627-7627. 600 Spielman SJ, Kosakovsky Pond SL. 2018. Relative evolutionary rate inference in HyPhy with 601 LEISR. PeerJ 6:e4339. 602 Spielman SJ, Pond SLK. 2018. Relative evolutionary rate inference in HyPhy with LEISR. 603 PeerJ 6:e4339. 604 Spielman SJ, Wilke CO. 2015. Pyvolve: A Flexible Python Module for Simulating Sequences 605 along Phylogenies. PLoS ONE 10. 606 Stamatakis A. 2014. RAxML version 8: a tool for phylogenetic analysis and post -analysis of 607 large phylogenies. Bioinformatics 30:1312-1313. 608 Stone EA, Sidow A. 2007. Constructing a meaningful evolutionary average at the phylogenetic 609 center of mass. Bmc Bioinformatics 8. 610 Tamuri AU, dos Reis M, Goldstein RA. 2012. Estimating the Distribution of Selection 611 Coefficients from Phylogenetic Data Using Sitewise Mutation -Selection Models. Genetics 612 190:1101-1115. 613 Tamuri AU, Goldman N, Reis Md. 2014. A Penalized -Likelihood Method to Estimate the 614 Distribution of Selection Coefficients from Phylogenetic Data. Genetics 197:257-271. 615 Team RC. 2020. R: A Language and Environment for Statistical Computing. Vienna, Austria. 616 Whelan S, Goldman N. 2001. A general empirical model of protein evolution derived from 617 multiple protein families using a maximum -likelihood approach. Molecular Biology and 618 Evolution 18:691-699. 619 .CC-BY-NC-ND 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted March 6, 2024. ; https://doi.org/10.1101/2024.03.02.583099doi: bioRxiv preprint 30 Williams PD, Pollock DD, Blackburne BP, Goldstein RA. 2006. Assessing the accuracy of 620 ancestral protein reconstruction methods. PLOS Computational Biology 2:e69. 621 Yang Z, Nielsen R, Hasegawa M. 1998. Models of amino acid substitution and applications to 622 mitochondrial protein evolution. Molecular Biology and Evolution 15:1600-1611. 623 Yang ZH. 1993. Maximum-Likelihood-Estimation of Phylogeny from DNA-Sequences When 624 Substitution Rates Differ over Sites. Molecular Biology and Evolution 10:1396-1401. 625 Yang ZH, Nielsen R. 2008. Mutation -selection models of codon substitution and their use to 626 estimate selective strengths on codon usage. Molecular Biology and Evolution 25:568-579. 627 628 .CC-BY-NC-ND 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted March 6, 2024. ; https://doi.org/10.1101/2024.03.02.583099doi: bioRxiv preprint

Text is read by the "Ask this paper" AI Q&A widget below. Extraction quality varies by source — PMC NXML preserves structure cleanly, OA-HTML may include some navigation residue, and OA-PDF can have broken hyphenation. The publisher copy (via DOI) is the canonical version.

My notes (saved in your browser only)

⚙ Ask this paper AI returns verbatim quotes from the full text · source: oa-pdf ⓘ

Answers must be backed by verbatim quotes from this paper's full text. Hallucinated quotes are dropped automatically; if no verbatim passage answers the question, we say so. How this works

Citation neighborhood (no data yet)

We don't have any in-corpus citations linked to this paper yet. This is a recent paper (2024) — citers typically take a year or two to land, and the OpenAlex reference graph may still be filling in.

Source provenance

europepmc: last seen: 2026-05-20T01:45:00.602351+00:00
unpaywall: last seen: 2026-05-24T02:00:01.246996+00:00

License: CC-BY-NC-ND-4.0