Keywords
Amino acid site-rates, substitution matrices, mutation-selection model 7
.CC-BY-NC-ND 4.0 International licensemade available under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is
The copyright holder for this preprintthis version posted March 6, 2024. ; https://doi.org/10.1101/2024.03.02.583099doi: bioRxiv preprint
2
Abstract
8
9
The pattern of substitution s at sites in proteins provide s invaluable information about their 10
biophysical and functional importance and what selection pressures are acting at individual 11
sites. Amino acid site rates are typically estimated using phenomenological models in which 12
the sequence variability is described by rate factors that scale the overall substitution rate in a 13
protein to sites. In this study, we demonstrate that site rates can be calculated accurately from 14
amino acid sequences using a mutation-selection model in combination with a simple 15
nucleotide substitution model. The method performs better than the standard phylogenetic 16
approach on sequences generated by structure-based evolutionary dynamics simulations, 17
robustly estimates rates for shallow multiple sequence alignments , and can be rapidly 18
calculated also on larger sequence alignments . On natural sequence s, site rates from the 19
mutation-selection model are strongly correlated to rates calculated with the empirical Bayes 20
methods. The model provides a link between amino acid substitution rates and equilibrium 21
frequency distributions at sites in proteins. We show how an ensemble of equilibrium 22
frequency vectors can be used to represent the rate variation encoded in empirical amino acid 23
substitution matrices. This study demonstrates that a rapid and simple method can be developed 24
from the mutation -selection model to predict substitution rates from amino acid data, 25
complementing the standard phylogenetic approach. 26
.CC-BY-NC-ND 4.0 International licensemade available under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is
The copyright holder for this preprintthis version posted March 6, 2024. ; https://doi.org/10.1101/2024.03.02.583099doi: bioRxiv preprint
3
Introduction
27
28
Proteins evolve under several genetic, biophysical, and functional fitness constraints(Liberles, 29
et al. 2012) . This results in a substantial variation in amino acid substitution patterns at the 30
level of sites in proteins. Amino acid-based phylogenetic models primarily account for this 31
variation by introducing factors that scale the relative substitution rate at sites to the overall 32
mean substitution rate of a multiple sequence alignment (MSA)(Yang 1993; Arenas 2015) . 33
Such models hav e resulted in the improved inference of phylogenic trees (Le and Gascuel 34
2008), but the individual rate values themselves have also proven to be insightful, in particular 35
when mapped onto three -dimensional structures of proteins to characterize sequence 36
conservation(Pupko, et al. 2002a; Landau, et al. 2005) . Site-level data can report on the 37
contribution of sites to the stability, structure, and function of proteins and enables correlation 38
between substitution patterns and molecular properties(Echave, et al. 2016). 39
40
Because the evolutionary process proceeds through changes at the nucleotide level, protein 41
models are inherently phenomenological(Liberles, et al. 2012; Jones, et al. 2018). Furthermore, 42
there is significant variability of amino acid propensities at sites in proteins , which cannot be 43
fully accounted for by models that combine site rates with invariable equilibrium 44
frequencies(Echave, et al. 2016) . Incompatibilities between the generative process behind 45
protein sequence evolution and inference models used to analyze sequence data can sometimes 46
lead to biased and overconfident predictions(Jones, et al. 2018). By modeling the evolutionary 47
process at the codon level, the mutation -selection model(Halpern and Bruno 1998; Yang and 48
Nielsen 2008; Rodrigue, et al. 2010; Tamuri, et al. 2012) provides an approach to describe site-49
level heterogeneity with greater realism than protein -level models. Modeling substitution 50
processes at the level of codons and sites requires models with a large number of parameters, 51
.CC-BY-NC-ND 4.0 International licensemade available under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is
The copyright holder for this preprintthis version posted March 6, 2024. ; https://doi.org/10.1101/2024.03.02.583099doi: bioRxiv preprint
4
which results in computational as well as statistical challenges. Nonetheless, statistical 52
inference frameworks have successfully been implemented to derive site -based parameters 53
from sequence data (Yang and Nielsen 2008; Rodrigue, et al. 2010; Tamuri, et al. 2012) . 54
Mutation-selection models have primarily been used to detect signals of adaptive evolution in 55
genes or sites in genes(Tamuri, et al. 2014; Bloom 2017; Rodrigue, et al. 2020). 56
57
In the standa rd approach, the mutation-selection model is applied to nucleotide data. But in 58
many situations, it is preferable to analyze amino acid sequences. By combining the mutation-59
selection model with a nucleotide mutation model it is possible to predict site -based 60
substitution rates(Halpern and Bruno 1998; Moses, et al. 2003) based on predicted amino acid 61
equilibrium sequences. We show here that this can lead to accurate predictions of site rates 62
from amino acid sequences and improvements relative to the standard phylogenetic approach. 63
64
Average substitution rates for amino acids can also be predicted from the mutation -selection 65
model by condensing codon -level rates into amino acid instantaneous rate matrices (Norn, et 66
al. 2020). We demonstrate here that such matrices are very similar to empirical rate matrices 67
like WAG(Whelan and Goldman 2001) and LG(Le and Gascuel 2008). The mutation-selection 68
framework establishes a direct link between amino acid frequency distributions at sites and 69
protein substitution matrices. This connection has been made previously by Pollock and 70
Goldstein, who demonstrated that amino acid substitution rates could be constructed by 71
calculating a covariance matrix between site equilibrium frequency vectors (Goldstein and 72
Pollock 2016). Here we exploit the link between site frequency distributions and substitution 73
matrices to establish which distribution of site frequency vectors is encoded in empirical 74
matrices such as JTT(Jones, et al. 1992), WAG, and LG. 75
76
.CC-BY-NC-ND 4.0 International licensemade available under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is
The copyright holder for this preprintthis version posted March 6, 2024. ; https://doi.org/10.1101/2024.03.02.583099doi: bioRxiv preprint
5
Results
77
Amino acid rates from a mutation-selection model 78
The mutation-selection model is a codon-based model that describes the effect of protein-level 79
site-specific selection(Bruno 1996). Several simplifying assumptions are made to arrive at the 80
model. Selection pressures at sites are taken to be constant over all lineages, fixation is assumed 81
to be rapid compared to mutation, sites are independent, and the evolutionary process is 82
dominated by negative selection. With these premises the relative instantaneous rate between 83
(π!") from codon u with amino acid i to codon v with amino acid j is modeled as the product 84
of the proposal rate of mutating codon u to v, π!", and the fixation probability, π!" , and an 85
arbitrary scaling constant k: 86
87
π!" = π β π!" β π!" [1] 88
89
The codon transition probability can be modeled with many approaches, but we assume here 90
that the fitness of a codon is only a function of the amino acid it encodes and that the rate of 91
mutation between codons can be modeled by the nucleotide mutation rate of the nucleotide 92
change separating connected codons (Norn, et al. 2020). Many nucleotide substitution models 93
can be utilized, the results here are obtained with the K80 model with one parameter ( π
) 94
controlling the relative rate of transitions to transversions (Kimura 1980) . In addition, we 95
assign a rate for multi-nucleotide mutations ( π) accounting for insertions, deletions , and 96
tandem mutations that have been shown to play an important role in evolution(Reid and Loeb 97
1993; Harris and Nielsen 2014). With these parameters, the mutation proposal rates become: 98
99
π = *
1 + π
π
+ π
π
ππ π‘ππππ π£πππ πππ
ππ π‘ππππ ππ‘πππ
πππ π
100
.CC-BY-NC-ND 4.0 International licensemade available under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is
The copyright holder for this preprintthis version posted March 6, 2024. ; https://doi.org/10.1101/2024.03.02.583099doi: bioRxiv preprint
6
101
By using the weak mutation model of Golding and Felsenstein (Golding and Felsenstein 1990), 102
the fixation probability π#$ can be approximated as: 103
104
π!" =
%& (!"#"$
!$#$"
)
*+!$#$"
!"#"$
[2] 105
106
Where pu/v is the equilibrium frequency of codon u/v and p u/v is the probability of mutation 107
between u/v. To estimate the codon equilibrium frequencies pu/v we first estimated the amino 108
acid frequencies from a multiple sequence alignment (MSA) using their maximum likelihood 109
estimates. The frequency of individual codons was then calculating by multiplying the amino 110
acid frequency by the relative frequencies of codons encoding for this amino acid. This relative 111
frequency can be estimated from genomic data. However, these do not correspond to proposal 112
frequencies since they are determined from sequences after selection. Instead, we calculated 113
the relative codon frequencies by solving for the codon equilibrium frequencies in an 114
instantaneous rate matrix w here each codon is assigned equal fitness and renormalize the 115
frequencies coding for the same amino acids to sum to 1. 116
117
The site-specific rate substitution rate π, (the total flux) is: 118
119
π, = β β π·#$
,
$-## [3] 120
121
Where i and j are amino acid types and where π·#$
, is the flux between i and j which can be 122
found by summing the flux between all connected codons u and v that encode for amino acid i 123
and j: 124
.CC-BY-NC-ND 4.0 International licensemade available under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is
The copyright holder for this preprintthis version posted March 6, 2024. ; https://doi.org/10.1101/2024.03.02.583099doi: bioRxiv preprint
7
[4] 125
We scale the site-specific rate to correspond to a n average of one neutral substitution by per 126
time-unit(Bruno 1996; Spielman and Wilke 2015). 127
128
Maximum-likelihood estimation of model parameters 129
There are two nuisance parameters in the model, π
and π. Following the approach in Norn et 130
al.(Norn, et al. 2020) we first construct a codon-level instantaneous rate matrix and aggregate 131
it into a protein -level protein rate matrix. To find optimal values of π
and π we find the 132
combination of parameters that maximizes the likelihood in phylogenetic tree reconstruction 133
from a set of MSAs using the corresponding protein rate matrix. Substitution matrices were 134
computed from 221 pfam alignments used by Le and Gascuel to infer the LG matrix (Le and 135
Gascuel 2008) . The substitution matrix calculated using each set of parameters w as then 136
evaluated by inferring trees 59 TreeBase alignments, also taken from the study by Le and 137
Gascuel(Le and Gascuel 2008) . The s um log -likelihoods were used to identify the optimal 138
parameters. A total of 42 parameter combinations were evaluated in a grid search. π
=1.4 and 139
π=0.2 were identified as the optimal solution. The likelihood surface is very flat with respect 140
to the variation of π
, whereas a clear optimum was found for π (Supplementary Figure 1). 141
The resulting protein rate matrix can be compared to the empirical rates matrices JTT, WAG, 142
and LG. In Figure 1, the amino acid substitution rate values of the optimal rate matrix (referred 143
to as MutSel, with π
=1.4 and π=0.2) are compared to the values of JTT, WAG, and LG on the 144
logarithmic scale. There is a strong correlation between the values in these matrices, with r2 145
values of 0.89, 0.83, and 0.79 for JTT, WAG, and LG, respectively. This demonstrates that the 146
relative average amino acid substitution rates are well predicted by the model (to the extent 147
that empirical matrices correspond to true values). However, there is a non -unit slope in the 148
Ξ¦iβ j
L = Ο i
Lqiβ j
L = Ο v
Lqvu
L
vβj
β
uβi
β
.CC-BY-NC-ND 4.0 International licensemade available under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is
The copyright holder for this preprintthis version posted March 6, 2024. ; https://doi.org/10.1101/2024.03.02.583099doi: bioRxiv preprint
8
correlation with the empirical matrices that is caused by a lower variation of amino acid rates 149
in the MutSel model. 150
151
Figure 1: Comparison of MutSel rate matrix with JTT(left), WAG(center) and LG(right). The 152
plot of amino -acid rates (qij) of WAG(left)/LG(right) and the same element in MutSel. Black 153
line, linear fit. 154
155
Two aspects could give rise to biased predictions. First, limited sampling of sequences means 156
that the equilibrium frequencies for rare amino acid variants will tend to zero values. This can 157
be combatted by introducing pseudo -counts, which are often used during remote homolog 158
detection(Henikoff and Henikoff 1994) . To avoid biasing results towards the empirical 159
matrices we compare to, we tested a simple approach for adding pseudo-counts using a Jeffreys 160
prior for a multinomial distribution . This , however, resulted in a significantly reduced 161
performance. The second aspect is that MSAs do not correspond to random samples of 162
sequences. This can be addressed with the reweighting of sequences before the calculation of 163
equilibrium frequencies. We tested a scheme that weight sequences based on a phylogenetic 164
tree(Stone and Sidow 2007). This resulted in a small, but consistent, reduction in performance. 165
166
167
168
.CC-BY-NC-ND 4.0 International licensemade available under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is
The copyright holder for this preprintthis version posted March 6, 2024. ; https://doi.org/10.1101/2024.03.02.583099doi: bioRxiv preprint
9
Simulation of protein sequence data 169
Having established values for the nuisance parameters of the model we now go on to evaluate 170
the capability of the mutation-selection model in estimating site-rates. The dataset used in the 171
validation was the result of evolutionary dynamics simulations. A benefit of this simulation 172
strategy is that the ground truth is known in this scenario. For four proteins ( Barnase, 173
Ribonuclease T1, Lysozyme, and protein inhibitor CI2) we simulated evolutionary trajectories 174
according to phylogenetic trees inferred from natural MSAs for these systems (data taken from 175
Norn et. al.(Norn, et al. 2024)). In the simulations, a recursive algorithm was used to βcrawlβ 176
the tree and at each new node an evolutionary dynamics trajectory is launched with a branch 177
length taken from the empirical tree. RosettaEvolve (Norn and Andre 2023) , part of Rosetta 178
macromolecular modeling package(Leman, et al. 2020), was used for protein structure -based 179
evolutionary dynamics simulations. RosettaEvolve is similar to other structure -based 180
evolutionary dynamics methods presented in the literature(Goldstein 2011; Pollock, et al. 2012; 181
Shah, et al. 2015; Jiang, et al. 2018) . Mutations are proposed at the nucleotide level and 182
correspond to either single-base pair mutations or whole-codon codon changes. The probability 183
of fixating a mutation is estimated using Kimuraβs expression for fixation probability(Kimura 184
1962). The simulations are carried out with the assumption that neutral evolution is primarily 185
driven by thermodynamic stability, with a fitness function that equals the fraction of folded 186
protein(Williams, et al. 2006). The effect of mutations on stability is calculated using structure-187
based DDG calculations (Park, et al. 2016) . The sequence diversity generated in these 188
simulations is depending on the assumed stability of the protein. The stability was selected in 189
these simulations to result in sequence entropies of simulated alignments that is similar to the 190
values in the empirical MSA. The leaf-nodes were collected to create MSAs from which site -191
rates were predicted. Because the full evolutionary trajectories during the tree crawling is 192
recorded the true substitution rate is known for each site in a protein. 193
.CC-BY-NC-ND 4.0 International licensemade available under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is
The copyright holder for this preprintthis version posted March 6, 2024. ; https://doi.org/10.1101/2024.03.02.583099doi: bioRxiv preprint
10
194
The comparison of substitution rates estimated from the mutation -selection model and 195
substitutions counted per site counted from the trajectories are shown in the upper row in 196
Figure 2 for the four simulated proteins. 197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
.CC-BY-NC-ND 4.0 International licensemade available under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is
The copyright holder for this preprintthis version posted March 6, 2024. ; https://doi.org/10.1101/2024.03.02.583099doi: bioRxiv preprint
11
217
.CC-BY-NC-ND 4.0 International licensemade available under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is
The copyright holder for this preprintthis version posted March 6, 2024. ; https://doi.org/10.1101/2024.03.02.583099doi: bioRxiv preprint
12
Figure 2: Comparison of rates predicted from sequence and the true substitution count. Left 218
column, comparison of site rates predicted from the mutation-selection model compared with 219
counted substitutions in evolutionary dynamics simulations based on the structure of Barnase 220
(PDBID: 1a2p), Ribonuclease T1 (PDBID: 1rn1), Lysozyme (PDBID: 1el1) and CI2 (PDBID: 221
2CI2). Black circles, individual site-rates. Black line, linear fit. Right column, a comparison of 222
site rates predicted using phylogenetic inference through LEISR(Spielman and Pond 2018) to 223
counted substitutions for 1a2p, 1rn1, 1el1, and 2ci2. Black circles, individual site-rates. Black 224
line, linear fit. 225
226
The rates predicted from the mutation-selection model have a strong linear correlation with the 227
counted substitutions from the evolutionary dynamics simulations. r 2 values range from 0.90-228
0.94 (Barnase: 0.93, Ribonuclease T1: 0.94, Lysozyme: 0.90, CI2: 0.94). We compared these 229
Results
with site rates inferred using Maximum likelihood calculations using LEISR(Spielman 230
and Kosakovsky Pond 2018) , which builds upon the methodology of the pioneering 231
rate4site(Pupko, et al. 2002b) , but has improved support for larger MSAs. LEISR also 232
accurately predicts site rates (bottom row Figure 2), with r 2 values ranging from 0.87 -0.92 233
(Barnase: 0.88, Ribonuclease T1: 0.92, Lysozyme: 0.87, CI2: 0.88). In contrast to the mutation-234
selection model the correlation counted substitutions deviates slightly from linear correlation. 235
236
The predictions from the mutation-selection model and LEISR were made with relatively large 237
MSAs (Barnase: 237 sequences, Ribonuclease T1: 306, Lysozyme: 537, CI2: 529). Because 238
the estimation of the equilibrium frequencies should become more uncertain as the number of 239
sequences decreases, we expect the performance of the site -rate prediction to deteriorate. To 240
investigate the correlation of accuracy of site-rate predictions as a function of the depth of the 241
sequence alignment we sampled smaller subsets of sequences and predicted site -rates from 242
.CC-BY-NC-ND 4.0 International licensemade available under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is
The copyright holder for this preprintthis version posted March 6, 2024. ; https://doi.org/10.1101/2024.03.02.583099doi: bioRxiv preprint
13
these. Figure 3 shows how the r2 between predicted rates and counted substitutions varies with 243
the depth of the sequence alignment. 244
245
Figure 3: Effect of depth of MSA on site-rate prediction. Average Pearson r2 between predicted 246
rates from the mutation-selection model (black) and LEISR (red) as a function of the number 247
of sequences in the alignment. Each point is an average of 5 random sets of sequences. 248
249
The results show that 50 or more sequences are required to reach the highest correlation with 250
the empirical substitution counts for the mutation -selection model . Nonetheless, with 30 251
sequences the r2 values are still around 0.8 demonstrating that accurate predictions can be made 252
with relatively few sequences. In these examples, LEISR requires more than 125 sequences to 253
reach similar accuracies and has consistently lower correlation across all sequence depths. 254
0 50 100 150 200 250 300
Alignment depth
0.2
0.4
0.6
0.8Average r2
1RN1
0 100 200 300 400 500
Alignment depth
0.2
0.4
0.6
0.8
Average r2
2CI2
0 50 100 150 200
Alignment depth
0.2
0.4
0.6
0.8
Average r2
1A2P
0 100 200 300 400 500
Alignment depth
0.2
0.4
0.6
0.8
Average r2
1EL1
.CC-BY-NC-ND 4.0 International licensemade available under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is
The copyright holder for this preprintthis version posted March 6, 2024. ; https://doi.org/10.1101/2024.03.02.583099doi: bioRxiv preprint
14
Comparison with rates inferred by maximum likelihood 255
Proteins evolve by an evolutionary mechanism that is considerably more complex than what is 256
encoded in simulated data. To investigate how the mutation -selection model predicted rates 257
from a set of 213 protein sequence alignments presented by Echave et. al. (Echave, et al. 2015). 258
This study included site rates estimated using the empirical Bayes rate4site method, using two 259
different substitution matrices, JC and JTT. In Figure 4 the correlation between rates estimated 260
from the mutation-selection model and the rates4sites are shown. 261
262
263
Figure 4: Correlation between rates from mutation -selection model and rates for site. Site -264
rates estimated from 213 MSAs using the mutation -selection model and rate4sites with two 265
different substitution matrices (JC and JTT, blue and orange respectively) and compared with 266
r2 correlation in swarmplot. Correlation between rates estimated using JC and JTT are shown 267
in green. 268
.CC-BY-NC-ND 4.0 International licensemade available under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is
The copyright holder for this preprintthis version posted March 6, 2024. ; https://doi.org/10.1101/2024.03.02.583099doi: bioRxiv preprint
15
269
The average r 2 correlation is 0. 74 to rates inferred with the JC matrix and 0. 66 to JTT. The 270
amino-acid JC model (Mayrose, et al. 2004) assigns an equal probability for all amino acid 271
mutations, which gives higher correlations with the mutation -selection. The choice of 272
substitution matrix has an impact on inferred rates (average r2 0.93), but this variation is smaller 273
than the variation between mutation -selection model and maximum likelihood. Nonetheless, 274
the mutation-selection model gives similar rates to empirical Bayes. 275
276
Deconstructing empirical substitution matrices into the distribution of equilibrium frequency 277
vectors 278
The mutation-selection model couples the distributions of equilibrium frequencies to the amino 279
acid substitution rates in instantaneous rate matrices. The site frequency vectors used to 280
construct an amino acid substitution matrix constitute an ensemble of site probability vectors. 281
Such ensembles will be unique for each empirical substitution matrix. In this section, we derive 282
such probability vectors for JTT, WAG , and LG by fitting a distribution that maximize s the 283
correlation between the mutation -selection matrix and the empirical matrix. Such 284
deconvolution can give insight into what the implicit assumptions about site rates have gone 285
into empirical rate matrices used in phylogenetic analysis. 286
287
Previous studies have demonstrated that the amino -acid frequency distribution at sites in 288
proteins is well approximated by the following expression (Porto, et al. 2005; Johnson and 289
Wilke 2020) if the frequencies (p) are ordered from large to small values (p1>p2β¦>p20): 290
291
π# =
.%&'
β .%('(
[5] 292
293
.CC-BY-NC-ND 4.0 International licensemade available under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is
The copyright holder for this preprintthis version posted March 6, 2024. ; https://doi.org/10.1101/2024.03.02.583099doi: bioRxiv preprint
16
where i=1..20 and l controls the shape of the distribution of p-values. Large values of l results 294
in frequency profiles dominated by a few amino acids. We used the method presented in 295
Johnson and Wilke(Johnson and Wilke 2020) and determined a histogram of l-values in the 296
MSAs used to fit the LG matrix. The l-distribution is well -described by a gamma function 297
(Supplementary figure 2). We sampled 1000 values from this gamma l-distribution to create 298
an ensemble of frequency vectors. Equation 5 does not provide a unique mapping to amino 299
acid identities. We used a sampling approach to assign identities of amino acids to these 300
frequency vectors so that they approximately result ed in average frequencies of amino acids 301
that match ed the equilibrium frequency distribution in LG. From this ensemble of 1000 302
frequency vectors, a subset of 100 vectors were picked out that were used to calculate an 303
instantaneous rate matrix through the mutation -selection framework. The ensemble was 304
selected to maximize the correlation with the empirical matrices JTT, WAG , and LG. 305
Optimization was carried out by simulated annealing. The result of this procedure is an 306
ensemble of frequency distributions that maximizes the correlation to empirical matrices and 307
the result is shown in Figure 5. 308
309
310
.CC-BY-NC-ND 4.0 International licensemade available under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is
The copyright holder for this preprintthis version posted March 6, 2024. ; https://doi.org/10.1101/2024.03.02.583099doi: bioRxiv preprint
17
311
Figure 5: Fit of amino acid frequency distributions to empirical matrices. Upper row, plot of 312
amino-acid rates (q ij) of JTT( left)/WAG(center)/LG(right) and the same element in fitted 313
mutation-selection model. Black line, linear fit. Lower row, histogram of the effective number 314
of amino acids (Neff) in fit to JTT, WAG, and LG. 315
316
The fitted matrices are highly correlated to their empirical counterparts with r2 values of 0.95, 317
0.92, and 0.87 for JTT, WAG , and LG, respectively. To characterize the fitted frequency 318
distributions, the effective number of amino acids (neff) was calculated for each vector in the 319
ensemble. This is a convenient metric to characterize variability at sites in a protein. neff values 320
range between 1 and 20, where 1 means that only one type of amino acid is found at a site and 321
20 means that all 20 occur with equal probability. The distribution of effective amino numbers 322
for each matrix is presented in Figure 5. JTT is substantially different from WAG and LG, 323
with a broader distribution and more vectors corresponding to highly variable βsitesβ (higher 324
neff values). The average neff for JTT is 4.6. The distribution of WAG and LG have more in 325
common, although WAG has a higher average neff than LG (4.0 vs 3.4). This primarily stems 326
from the higher frequency of very invariable βsitesβ for WAG (low neff values). 327
.CC-BY-NC-ND 4.0 International licensemade available under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is
The copyright holder for this preprintthis version posted March 6, 2024. ; https://doi.org/10.1101/2024.03.02.583099doi: bioRxiv preprint
18
Discussion
328
The mutation-selection model provides a powerful framework for describing how the action of 329
mutation and selection combine to generate site variability in protein sequences. In this model, 330
site variability emerges as a consequence of the relative time -averaged fitness of codons and 331
individual amino acids. Phylogenetic models used in tree inference from protein sequences 332
typically assume that all sites evolve with the same equilibrium frequencies and site variability 333
is introduced by a phenomenological rate scale factor applied to each site. This restriction can 334
be loosened by allowing the equilibrium frequency distribution to vary at sites (Quang le, et al. 335
2008). However, the large number of parameters of such models coupled to their computational 336
complexity limits their utility. The performance of variable site frequency models in site -rate 337
predictions has also not been benchmarked. 338
339
We focus here on the prediction of substitution rates at individual sites of proteins from amino 340
acid sequence data and show that they can be calculated using the mutation -selection 341
framework with just two nuisance parameters (π
and π). In practice, the model is insensitive to 342
the value of π
leaving a single modeling parameter. The main complication is the estimation 343
of equilibrium frequency values from MSAs. The amino acid distributions of leaf sequences 344
are not necessarily good estimators of equilibrium frequencies. For sequences with limited 345
divergence, this may become an issue, limiting the approach presented here. There is also an 346
issue with uncertainties in equilibrium frequency distributions due to limited statistical 347
sampling. However, based on experiments using a simulated data set, we showed that accurate 348
predictions of site rates can be predicted with as little as 20 sequences for two of the systems 349
studied here. This is surprising but can be explained because most sites are highly conserved 350
so equilibrium frequencies can be estimated with relative ly few sequences. Also, since site 351
rates are an aggregate of rates across many codons , errors in individual amino acid frequency 352
.CC-BY-NC-ND 4.0 International licensemade available under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is
The copyright holder for this preprintthis version posted March 6, 2024. ; https://doi.org/10.1101/2024.03.02.583099doi: bioRxiv preprint
19
values may be washed out in the model. Comparison with rate -scale factors predicted from 353
phylogenetic inference demonstrates some benefits of the mutation-selection model. Prediction 354
accuracy is better for all simulated systems and across all alignment depths and computations 355
can be carried out much more rapidly. Nonetheless, with enough sequences, the maximum 356
likelihood estimated rate factors are also accurate. Since the underlying data is based o n 357
nucleotide mutation trajectories, it is encouraging to see that both approach es result in good 358
predictions. 359
360
Because benchmarks with known rates are difficult to construct from empirical data, we have 361
used a simulation approach in this study. The evolutionary dynamics simulation generates 362
sequences that result from time - as well as context -dependent fluctuations of fitness. By 363
simulating evolutionary dynamics using a phylogenetic tree, sequences maintain similar 364
phylogenetic relationships as those found in natural sequences. The approach thus provides a 365
more realistic model of protein evolution than nucleotide or amino acid -based simulators . 366
However, one caveat is that the mutation model underlying the evolutionary dynamics 367
simulations shares similarities to the one used in the mutation -selection model, which could 368
give the mutation-selection model an advantage in the comparisons. In particular, the proposal 369
rates in the evolutionary dynamics simulations are also governed by a single nucleotide 370
mutation rate π
and a multi -nucleotide rate π. However, the parameter values used in the 371
evolutionary dynamics simulation were different from the values used to calculate the site rates 372
and we show that the value of π
does not influence the rates much. One factor not included in 373
the evolutionary dynamics simulation is selection at the codon level, which we know occurs in 374
natural sequence evolution. Such codon-fitness models could be introduced into the mutation-375
selection framework, further improvement could also be made by using a more complex, but 376
parameter-rich, model for nucleotide mutations. However, i t is not clear that this would 377
.CC-BY-NC-ND 4.0 International licensemade available under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is
The copyright holder for this preprintthis version posted March 6, 2024. ; https://doi.org/10.1101/2024.03.02.583099doi: bioRxiv preprint
20
improve the result given the fact that the input data is amino acid sequences that do not inform 378
on codon usage. With more parameters in the model, using a continuous method for maximum 379
likelihood optimization of parameters would be beneficial (Holder, et al. 2008). The accuracy 380
of site-rate predictions has been previously characterized by Scheffler et al. (Scheffler, et al. 381
2014). They found that rates predicted from phylogenetic inference were linearly correlated 382
with experimental values and that inference showed good convergence. However, one caveat 383
with their study is that rates were simulated with the same gamma-distributed rate factors used 384
in the inference model. 385
386
The matrix fitting result demonstrate s that mutation-selection model can be parametrized to 387
give rise to matrices that are very similar to the empirical rate matrices, with r 2 values going 388
up to 0.95 for JTT. This means that we can think of substitution matrices as being the result 389
of an averaging of an ensemble of site -specific rate matrices, each parametrized by a single 390
equilibrium frequency vector. Nonetheless, there appears to be an upper limit to the quality of 391
the fit because independent fitting experiments converge to similar correlation values. The 392
more complex the matrix (JTT<WAG<LG), the lower the correlation (same as the pattern for 393
the matrices calculated from natural sequences in Figure 1). LG, for example, differs from 394
WAG by being inferred by introducing site -rate scale factors in the maximum likelihood 395
estimation. Because of the vast number of parameters in the WAG and LG fitting and statistical 396
uncertainty in inference, it is not expected that mutation-selection model could explain the full 397
variation in the empirical matrices. 398
399
The fitted distributions of equilibrium frequencies can be used to understand some of the 400
implicit assumptions encoded in empirical matrices. JTT is constructed by the traditional 401
Method
of counting amino acid substitutions in MSAs (Jones, et al. 1992) . JTT is consistent 402
.CC-BY-NC-ND 4.0 International licensemade available under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is
The copyright holder for this preprintthis version posted March 6, 2024. ; https://doi.org/10.1101/2024.03.02.583099doi: bioRxiv preprint
21
with a broad ensemble of highly variable site frequency vectors. The introduction of site -403
specific rate factors in LG compared to LG results in a lower frequency of highly conserved 404
sites. By varying the ensemble of frequency vectors, it is possible to create a vast array of 405
substitution matrices with different properties. 406
407
In this study, we have demonstrated that substitution rates at sites in proteins can be calculated 408
by combining the mutation-selection model with a nucleotide mutation model and equilibrium 409
frequency predictions from protein sequence alignment. This was used by Halpern and 410
Bruno(Halpern and Bruno 1998) in the original mutation-selection study to characterize branch 411
lengths but has not been generally applied for site -rate predictions from amino acid data. The 412
approach enables fast and accurate site -rate predictions even with limited sequence data and 413
provides a link between the frequency distribution at sites in proteins and amino acid 414
substitution rates encoded in amino acid substitution matrices. 415
416
Materials and methods
417
Rate matrix calculations 418
The 3912 pfam alignments employed by Le and Gascuel to train the LG matrix(Le and Gascuel 419
2008) were used as the basis for calculating protein rate matrices. We selected a subset of 420
alignments with 30 or more sequences, leaving 221 MSAs. Amino acid frequencies at each site 421
were estimated by counting the occurrences in each column of the MSA, not counting gaps. 422
Sites with more than 50% gaps were discarded, although this filter had no substantial impact 423
on the result. To account for deficiencies in statistical sampling pseudo -counts are often 424
employed. When pseudo-counts were tested, we used a Jeffreys prior and added a count 0.5 to 425
each amino acid. To account for biased sampling of amino acid sequences we weighted the 426
sequence based on a phylogenetic tree. For each of the 221 alignments we use FastTree(Price, 427
.CC-BY-NC-ND 4.0 International licensemade available under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is
The copyright holder for this preprintthis version posted March 6, 2024. ; https://doi.org/10.1101/2024.03.02.583099doi: bioRxiv preprint
22
et al. 2010) to calculate phylogenetic trees. BranchManager(Stone and Sidow 2007) was then 428
utilized to create weights for each sequence based on the tree. A codon -based instantaneous 429
rate matrix was then constructed based on equation 1. By assigning equal fitness to codons in 430
the model we can solve for the equilibrium frequency distribution and normalize the 431
frequencies so that codons that code for the same amino acid sum to 1. Codon frequencies were 432
then calculated by multiplying the relative codon frequencies with the amino acid frequency 433
estimated from the MSA. Following the aggregation procedure in Yang et al.(Yang, et al. 1998) 434
and Norn et al. (Norn, et al. 2020) the codon -level instantaneous rate matrix can be by 435
condensed into a protein-level matrix. The instantaneous rate (π#β$
, ) between two amino acids 436
(i, j) with codons uΓi and uΓi can be calculated as: 437
438
[6] 439
440
Where π is an equilibrium frequency. By averaging the flux at many sites in many protein 441
MSAs we can calculate the mean instantaneous rate for amino acid i to j as: 442
443
π#$ =
β©2&(,*+,-
. βͺ.
β©4&βͺ.
=
5 2&(,*+,-
.
.
5 6&
.
.
[7] 444
445
Before averaging, the site rates were normalized to correspond to one neutral substitution by 446
per time-unit(Bruno 1996; Spielman and Wilke 2015). Rate calculations take on the the order 447
of 10s. 448
449
450
qiβ j
L = Ο v
L
Ο i
L qvu
L
vβj
β
uβi
β
.CC-BY-NC-ND 4.0 International licensemade available under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is
The copyright holder for this preprintthis version posted March 6, 2024. ; https://doi.org/10.1101/2024.03.02.583099doi: bioRxiv preprint
23
Maximum Likelihood estimation of model parameters 451
A grid-based Maximum Likelihood search was carried out for the parameters, (π
β [1.4-2.6] 452
and π β [ 0-1.0]. For each combination π
and π values we generated a protein rate matrix. The 453
rate matrix was used for the inference of phylogenetic trees, using the 59 TreeBase test 454
alignments utilized by Le and Gascuel in their study on the LG matrix(Le and Gascuel 2008). 455
IQTREE(Minh, et al. 2020) with MATRIX+FO+G4 (Where MATRIX is the matrix mutation 456
selection matrix) was used for tree inference. The likelihoods were summed up for all 59 457
alignments to identify the maximum likelihood solution. 458
459
π? = argmax β logL(πππ΄#|π(π
, π))7
#8* [8] 460
461
Evolutionary dynamics simulations 462
Evolutionary dynamics w ere simulated with RosettaEvolve, which is part of the Rosetta 463
macromolecular modeling package(Leman, et al. 2020). The data in this study was taken from 464
Norn et. al.(Norn, et al. 2024), which together with the study presenting RosettaEvolve (Norn 465
and Andre 2023) gives a complete methodological description. Phylogenetic t rees for 466
simulations were generated using RaXML (Stamatakis 2014) based on natural sequence 467
alignments of the four proteins simulated in the study. Tree-crawling started from the center of 468
the tree by simulating each individual branch of the tree in a recursive fashion. The leaf nodes 469
were then used for site-rate predictions. 470
471
Site-rate fitting of empirical matrices 472
The 3912 sequences in the pfam sequences used to fit the LG matrix was analyzed to extract a 473
histogram over l-values (equation 5) with scripts and methods presented by Johnson and 474
Wilke(Johnson and Wilke 2020) . A gamma distribution was fitted to the l-values, results 475
.CC-BY-NC-ND 4.0 International licensemade available under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is
The copyright holder for this preprintthis version posted March 6, 2024. ; https://doi.org/10.1101/2024.03.02.583099doi: bioRxiv preprint
24
shown in Supplementary Figure 2, using the fitdistr package in R(Team 2020). The resulting 476
gamma distribution was used to sample 1000 l-values. Equation 5 provides a list of sorted 477
frequency values from large to small. To assign identities to a frequency vector a simple 478
approximate sampling approach was used. For the first largest frequency value in the vector 479
amino acids are sampled according to the equilibrium frequencies of LG. Then the second 480
largest element is sampled, with frequency according to the equilibrium frequencies of LG 481
excluding the already selected amino acid. This is iterated until the 20th amino acid is assigned. 482
The resulting vectors have mean probabilities that differs slightly from LG. An ensemble of 483
these can selected to perfectly converge to the LG equilibrium frequencies. Since the ensemble 484
will any way be fitted in the next step, we skipped this step. Simulated annealing 485
(https://github.com/perrygeo/simanneal) was then used select a 100 site-frequency vectors out 486
of the ensemble of 1000 vectors to optimize the Pearson -correlation between a matrix 487
calculated from the ensemble and a reference matrix (JTT,WAG or LG). 488
489
Effective number of amino acids was calculated as: 490
491
π.99 = ππ₯π(β β π#ππ# π# ) [12] 492
493
Data availability 494
The data underlying this article are available inhttps://github.com/Andre-lab/mutsel-rate-495
manuscript.git. or will be shared on reasonable request to the corresponding author. 496
497
498
499
500
.CC-BY-NC-ND 4.0 International licensemade available under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is
The copyright holder for this preprintthis version posted March 6, 2024. ; https://doi.org/10.1101/2024.03.02.583099doi: bioRxiv preprint
25
References
501
Arenas M. 2015. Trends in substitution models of molecular evolution. Frontiers in genetics 502
6:319. 503
Bloom JD. 2017. Identification of positive selection in genes is greatly improved by using 504
experimentally informed site-specific models. Biology Direct 12. 505
Bruno WJ. 1996. Modeling residue usage in aligned protein sequences via maximum 506
likelihood. Molecular Biology and Evolution 13:1368-1374. 507
Echave J, Jackson EL, Wilke CO. 2015. Relationship between protein thermodynamic 508
constraints and variation of evolutionary rates among sites. Physical Biology 12:025002. 509
Echave J, Spielman SJ, Wilke CO. 2016. Causes of evolutionary rate variation among protein 510
sites. Nature Reviews Genetics 17:109-121. 511
Golding B, Felsenstein J. 1990. A Maximum -Likelihood Approach to the Detection of 512
Selection from a Phylogeny. Journal of Molecular Evolution 31:511-523. 513
Goldstein RA. 2011. The evolution and evolutionary consequences of marginal thermostability 514
in proteins. Proteins-Structure Function and Bioinformatics 79:1396-1407. 515
Goldstein RA, Pollock DD. 2016. The tangled bank of amino acids. Protein Science 25:1354-516
1362. 517
Halpern AL, Bruno WJ. 1998. Evolutionary distances for protein-coding sequences: modeling 518
site-specific residue frequencies. Molecular Biology and Evolution 15:910 917. 519
Harris K, Nielsen R. 2014. Error-prone polymerase activity causes multinucleotide mutations 520
in humans. Genome Research 24:1445-1454. 521
Henikoff S, Henikoff JG. 1994. Position -Based Sequence Weights. Journal of Molecular 522
Biology 243:574-578. 523
.CC-BY-NC-ND 4.0 International licensemade available under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is
The copyright holder for this preprintthis version posted March 6, 2024. ; https://doi.org/10.1101/2024.03.02.583099doi: bioRxiv preprint
26
Holder MT, Zwickl DJ, Dessimoz C. 2008. Evaluating the robustness of phylogenetic methods 524
to among -site variability in substitution processes. Philos Trans R Soc Lond B Biol Sci 525
363:4013-4021. 526
Jiang Q, Teufel AI, Jackson EL, Wilke CO. 2018. Beyond Thermodynamic Constraints: 527
Evolutionary Sampling Generates Realistic Protein Sequence Variation. Genetics 208:1387 -528
1395. 529
Johnson MM, Wilke CO. 2020. Site -Specific Amino Acid Distributions Follow a Universal 530
Shape. Journal of Molecular Evolution 88:731-741. 531
Jones CT, Youssef N, Susko E, Bielawski JP. 2018. Phenomenological Load on Model 532
Parameters Can Lead to False Biological Conclusions. Molecular Biology and Evolution 533
35:1473-1488. 534
Jones DT, Taylor WR, Thornton JM. 1992. The Rapid Generation of Mutation Data Matrices 535
from Protein Sequences. Computer Applications in the Biosciences 8:275-282. 536
Kimura M. 1962. On the probability of fixation of mutant genes in a population. Genetics 537
47:713-719. 538
Kimura M. 1980. A Simple Method for Estimating Evolutionary Rates of Base Substitutions 539
through Comparative Studies of Nucleotide -Sequences. Journal of Molecular Evolution 540
16:111-120. 541
Landau M, Mayrose I, Rosenberg Y, Glaser F, Martz E, Pupko T, Ben -Tal N. 2005. ConSurf 542
2005: the projection of evolutionary conservation scores of residues on protein structures. 543
Nucleic Acids Research 33:W299-W302. 544
Le SQ, Gascuel O. 2008. An improved general amino acid replacement matrix. Molecular 545
Biology and Evolution 25:1307-1320. 546
.CC-BY-NC-ND 4.0 International licensemade available under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is
The copyright holder for this preprintthis version posted March 6, 2024. ; https://doi.org/10.1101/2024.03.02.583099doi: bioRxiv preprint
27
Leman JK, Weitzner BD, Lewis SM, Adolf -Bryfogle J, Alam N, Alford RF, Aprahamian M, 547
Baker D, Barlow KA, Barth P, et al. 2020. Macromolecular modeling and design in Rosetta: 548
recent methods and frameworks. Nat Methods 17:665-680. 549
Liberles DA, Teichmann SA, Bahar I, Bastolla U, Bloom J, Bornberg -Bauer E, Colwell LJ, 550
Koning APJd, Dokholyan NV, Echave J, et al. 2012. The interface of protein structure, protein 551
biophysics, and molecular evolution. Protein science : a publication of the Protein Society 552
21:769 785. 553
Mayrose I, Graur D, Ben -Tal N, Pupko T. 2004. Comparison of site -specific rate-inference 554
Methods
for protein sequences: empirical Bayesian methods are superior. Molecular Biology 555
and Evolution 21:1781-1791. 556
Minh BQ, Schmidt HA, Chernomor O, Schrempf D, Woodhams MD, von Haeseler A, Lanfear 557
R. 2020. IQ -TREE 2: New Models and Efficient Methods for Phylogenetic Inference in the 558
Genomic Era. Molecular Biology and Evolution 37:1530-1534. 559
Moses AM, Chiang DY, Kellis M, Lander ES, Eisen MB. 2003. Position specific variation in 560
the rate of evolution in transcription factor binding sites. BMC Evol Biol 3:19. 561
Norn C, Andre I. 2023. Atomistic simulation of protein evolution reveals sequence covariation 562
and time -dependent fluctuations of site -specific substitution rates. PLoS Comput Biol 563
19:e1010262. 564
Norn C, AndrΓ© I, Theobald DL. 2020. A thermodynamic model of protein structure evolution 565
explains empirical amino acid rate matrices. bioRxiv:2020.2012.2002.408807. 566
Norn C, Oliveira F, Andre I. 2024. Improved prediction of site -rates from structure with 567
averaging across homologs. bioRxiv:2024.2002.2027.582061. 568
Park H, Bradley P, Greisen P, Jr., Liu Y, Mulligan VK, Kim DE, Baker D, DiMaio F. 2016. 569
Simultaneous Optimization of Biomolecular Energy Functions on Features from Small 570
Molecules and Macromolecules. J Chem Theory Comput 12:6201-6212. 571
.CC-BY-NC-ND 4.0 International licensemade available under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is
The copyright holder for this preprintthis version posted March 6, 2024. ; https://doi.org/10.1101/2024.03.02.583099doi: bioRxiv preprint
28
Pollock DD, Thiltgen G, Goldstein RA. 2012. Amino acid coevolution induces an evolutionary 572
Stokes shift. Proc Natl Acad Sci U S A 109:E1352-1359. 573
Porto M, Roman HE, Vendruscolo M, Bastolla U. 2005. Prediction of site-specific amino acid 574
distributions and limits of divergent evolutionary changes in protein sequences (vol 22, pg 630, 575
2005). Molecular Biology and Evolution 22:1156-1156. 576
Price MN, Dehal PS, Arkin AP. 2010. FastTree 2 --approximately maximum-likelihood trees 577
for large alignments. PLoS ONE 5:e9490. 578
Pupko T, Bell RE, Mayrose I, Glaser F, Ben -Tal N. 2002a. Rate4Site: an algorithmic tool for 579
the identification of functional regions in proteins by surface mapping of evolutionary 580
determinants within their homologues. Bioinformatics 18:S71-S77. 581
Pupko T, Bell RE, Mayrose I, Glaser F, Ben -Tal N. 2002b. Rate4Site: an algorithmic tool for 582
the identification of functional regions in proteins by surface mapping of evolutionary 583
determinants within their homologues. Bioinformatics 18 Suppl 1:S71-77. 584
Quang le S, Gascuel O, Lartillot N. 2008. Empirical profile mixture models for phylogenetic 585
reconstruction. Bioinformatics 24:2317-2323. 586
Reid TM, Loeb LA. 1993. Tandem Double Cc -]Tt Mutations Are Produced by Reactive 587
Oxygen Species. Proceedings of the National Academy of Sciences of the United States of 588
America 90:3904-3907. 589
Rodrigue N, Latrille T, Lartillot N. 2020. A Bayesian mutation -selection framework for 590
detecting site -specific adaptive evolution in protein -coding genes. Molecular Biology and 591
Evolution. 592
Rodrigue N, Philippe H, Lartillot N. 2010. Mutation -selection models of coding sequence 593
evolution with site -heterogeneous amino acid fitness profiles. Proc Natl Acad Sci U S A 594
107:4629-4634. 595
.CC-BY-NC-ND 4.0 International licensemade available under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is
The copyright holder for this preprintthis version posted March 6, 2024. ; https://doi.org/10.1101/2024.03.02.583099doi: bioRxiv preprint
29
Scheffler K, Murrell B, Pond SLK. 2014. On the Validity of Evolutionary Models with Site -596
Specific Parameters. PLoS ONE 9:e94534. 597
Shah P, McCandlish DM, Plotkin JB. 2015. Contingency and entrenchment in protein 598
evolution under purifying selection. Proceedings of the National Academy of Sciences of the 599
United States of America 112:7627-7627. 600
Spielman SJ, Kosakovsky Pond SL. 2018. Relative evolutionary rate inference in HyPhy with 601
LEISR. PeerJ 6:e4339. 602
Spielman SJ, Pond SLK. 2018. Relative evolutionary rate inference in HyPhy with LEISR. 603
PeerJ 6:e4339. 604
Spielman SJ, Wilke CO. 2015. Pyvolve: A Flexible Python Module for Simulating Sequences 605
along Phylogenies. PLoS ONE 10. 606
Stamatakis A. 2014. RAxML version 8: a tool for phylogenetic analysis and post -analysis of 607
large phylogenies. Bioinformatics 30:1312-1313. 608
Stone EA, Sidow A. 2007. Constructing a meaningful evolutionary average at the phylogenetic 609
center of mass. Bmc Bioinformatics 8. 610
Tamuri AU, dos Reis M, Goldstein RA. 2012. Estimating the Distribution of Selection 611
Coefficients from Phylogenetic Data Using Sitewise Mutation -Selection Models. Genetics 612
190:1101-1115. 613
Tamuri AU, Goldman N, Reis Md. 2014. A Penalized -Likelihood Method to Estimate the 614
Distribution of Selection Coefficients from Phylogenetic Data. Genetics 197:257-271. 615
Team RC. 2020. R: A Language and Environment for Statistical Computing. Vienna, Austria. 616
Whelan S, Goldman N. 2001. A general empirical model of protein evolution derived from 617
multiple protein families using a maximum -likelihood approach. Molecular Biology and 618
Evolution 18:691-699. 619
.CC-BY-NC-ND 4.0 International licensemade available under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is
The copyright holder for this preprintthis version posted March 6, 2024. ; https://doi.org/10.1101/2024.03.02.583099doi: bioRxiv preprint
30
Williams PD, Pollock DD, Blackburne BP, Goldstein RA. 2006. Assessing the accuracy of 620
ancestral protein reconstruction methods. PLOS Computational Biology 2:e69. 621
Yang Z, Nielsen R, Hasegawa M. 1998. Models of amino acid substitution and applications to 622
mitochondrial protein evolution. Molecular Biology and Evolution 15:1600-1611. 623
Yang ZH. 1993. Maximum-Likelihood-Estimation of Phylogeny from DNA-Sequences When 624
Substitution Rates Differ over Sites. Molecular Biology and Evolution 10:1396-1401. 625
Yang ZH, Nielsen R. 2008. Mutation -selection models of codon substitution and their use to 626
estimate selective strengths on codon usage. Molecular Biology and Evolution 25:568-579. 627
628
.CC-BY-NC-ND 4.0 International licensemade available under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is
The copyright holder for this preprintthis version posted March 6, 2024. ; https://doi.org/10.1101/2024.03.02.583099doi: bioRxiv preprint
Text is read by the "Ask this paper" AI Q&A widget below.
Extraction quality varies by source β PMC NXML preserves structure
cleanly, OA-HTML may include some navigation residue, and OA-PDF can
have broken hyphenation. The publisher copy
(via DOI)
is the canonical version.