Abstract
Protein language models ( PLMs) capture features of protein three-dimensional structure from
amino acid sequences alone , with out requiring multiple sequence alignments (MSA). The
concepts of grammar and semantics from natural language have been shown to be applicable to
understanding functional properties of viral proteins like antigenicity. Here, we investigate how
these representations more broadly enable assessment of both observed and previously
unobserved variation due to mutation. Applied to SARS -CoV-2's spike protein we demonstrate
the PLM, ESM -2, has learned the sequence context within which variation occurs, capturing
evolutionary constraint. We demonstrate this recapitulates what conventionally requires MSA
data predicting both where variation does and does not accumulate across the protein. We show
the PLM-derived measures of amino acid change represent novel metrics as they do not correlate
strongly with classical metrics of change in sequences. Applied to SARS-CoV-2 variants across
the pandemic we demonstrate that ESM-2 representations encode the relatedness between
variants, i.e., their evolutionary history, as well as the distinct nature of variants of concern upon
their emergence, associated with shifts in receptor binding and antigenicity. This application of
ESM-2 can be used for characterising the evolutionary potential and possible epidemiological
.CC-BY-NC-ND 4.0 International licenseavailable under a
was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprint (whichthis version posted July 6, 2024. ; https://doi.org/10.1101/2024.07.05.602129doi: bioRxiv preprint
2
impact of both existing and emerging variation , with the potential to be applied to any protein
sequence, pathogen or otherwise.
Introduction
The conventional study of protein sequence variation requires the alignment of amino acid
sequence data. Alignments help identify where variation accumulates within sequences, revealing
the evolutionary constraints that dictate observed variation. While impactful missense mutations,
non-synonymous substitutions resulting in amino acid changes, in the human genome are
primarily linked to disease , in viruses amino acid changes are more often linked to adaptive
evolution1,2. Alternatively, these changes (and in fact the majority of possible missense mutations)
are deleterious resulting in, e.g., misfolding and a decrease in fitness or be “neutral” and have
little to no consequence.
Approaches for assessing the impact of mutations are premised on the availability of comparative
genome sequence data to assess features like the relative proportions of non -synonymous to
synonymous substitutions (dN/dS) or evolutionary conservation at individual sites. Similarly,
genome sequence-based surveillance methods usually require aligned real-time sequence data
to assess the relative growth rate of a pathogen lineage3,4. Experiments on novel variants can be
initiated when first detected, but they take time to complete.
Natural language processing (NLP) methods can be applied to biological sequences like DNA
and proteins5–7. Biological sequences are physical molecules represented by arbitrary characters.
This is convenient since these characters, representing nucleotides or amino acids, can be
processed in the same way as words in natural language. Hie. et al6 show how, much like natural
language, biological sequence context can be used to create informative embeddings that reflect
complex properties of the sequence. Specifically, concepts like grammar (the structural rules of a
language) and semantics (the meaning of the language) can be applied to changes in protein
amino acid sequences. The semantic difference between protein sequences can be measured
by computing distances between sequences within the embedding space. The grammar of a
sequence is captured by its probability, indicating how well it conforms to the rules of protein
formation and composition.
.CC-BY-NC-ND 4.0 International licenseavailable under a
was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprint (whichthis version posted July 6, 2024. ; https://doi.org/10.1101/2024.07.05.602129doi: bioRxiv preprint
3
Evolutionary scale model 2 (ESM-2)5,7 is a protein language model (PLM) trained on
approximately 65 million unique protein sequences5,7. Here, we demonstrate that ESM-2 can infer
fundamental properties that classically require a multiple sequence alignment. Applied to SARS-
CoV-2 w e show that the grammaticality and semantic score metrics reveal characteristic
properties of the SARS-CoV-2 spike protein sequence (Figure 1). We further assess how to use
these metrics for variant and sequence horizon scanning . The metrics, produced from a single
protein sequence using pre-trained models, bypass the need for alignment data. We make a case
for their future application in characterising proteins from emerging pathogens and assessing the
constraints acting on protein sequences.
Figure 1. Schematic summary of our methodology . Deep mutational scanning involves taking a
sequence and mutating every position to each of the possible alternative amino acids. These are
then passed to the PLM to produce embeddings and logits which can be used for downstream tasks
or to produce metric s like relative grammaticality or the semantic score. Identifying epistatic
interactions involves reverting the mutations from a SARS-CoV-2 variant ( here BA.1) and measuring
the effect this has on the other likelihoods . Embeddings and log-likelihoods (logi ts) can be used for
.CC-BY-NC-ND 4.0 International licenseavailable under a
was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprint (whichthis version posted July 6, 2024. ; https://doi.org/10.1101/2024.07.05.602129doi: bioRxiv preprint
4
surveillance by producing metrics over timescales for newly emergent protein sequences of concern,
or by looking at evolutionary trajectories such as with evo -velocity.
Results
PLMs capture evolutionary potential
We first assess how well ESM-2 reflects the consequences of protein structure as a constraint on
sequence variation. Specifically , does the model assign appropriate values for relative
grammaticality (the log-likelihood ratio between the mutation and the reference amino acid) and
semantic score (the distance between sequence embedding s, e.g. , mutant and wild -type) at
different sites in the protein sequence? To do this analysis we utilised an in silico deep mutational
scanning (DMS) approach8 using embeddings for every possible single substitution in the SARS-
CoV-2 spike protein. In vitro deep mutational scans involve experimentally substituting each site
in a protein with every other amino acid and measuring the substitutions’ effect on phenotype. We
do this computationally by embedding each of these substituted sequences using ESM-2 and
then calculating grammaticality and semantic scores for each sequence.
The ESM-2 scores correlate well with the structur al subunits of the spike (Figure 2). Using the
DMS approach to produce relative grammaticality and semantic scores, we observe that the
protein has two main regions that broadly correspond to the two subunits of spike protein S1 and
S2 (Figure 2B and C). Relative grammaticalities decrease when substitutions are introduced in
the S2 indicating that the region is less tolerant to change compared to S1 . We observe a
statistically significant difference between the mean relative grammaticalities of the S1 and S2 (p-
value = 2.168e-157, Mann-Whitney U test). Spike is a trimer with the core of the S2 subunit being
formed by three closely interacting alpha helices from each monomer. As such, changes in this
region could disrupt the formation of a stable and functional spike trimer by disrupting the inter-
monomer interactions. On the other hand, in the S1 subunit, mutations in the N-terminal Domain
(NTD) and receptor-binding domain (RBD) have higher relative grammaticalities. Despite being
important to protein function, these regions need to be sufficiently flexible to facilitate receptor
binding interactions and accommodate mutations to evade host immunity. This mirrors sequence
variation from alignments. Specifically, using entropy, a measure of site-specific variation
computed from multiple sequence alignments, there is significantly higher entropy observed at
sites in S1 compared to S2 (Figure 2A), with each region found to be significantly different ( p-
.CC-BY-NC-ND 4.0 International licenseavailable under a
was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprint (whichthis version posted July 6, 2024. ; https://doi.org/10.1101/2024.07.05.602129doi: bioRxiv preprint
5
value = 1.121e-10, Mann-Whitney U test). This demonstrates that ESM-2 has correctly identified
S1 as a more variable region than S2 (an analysis based on permutations of a single sequence)
and indicates that ESM-2 has learnt this property about spike. The semantic rank also increases
in the NTD region of the S1 (Figure 2A and B), indicating that changes here may produce large
structural changes and are more likely to be accommodated due to higher relative
grammaticalities across the region; presumably reflecting properties of the NTD, i.e., it is a region
on the surface of the protein and contains many antigenic epitopes 9. These results echo prior
Results
that show PLMs can be used as effective predictors of protein conservation10. If ESM can
model regional constraints, it may also be able to identify epistatic interactions that are linked with
protein structural constraints.
PLMs identify epistatic interaction sites
Another consequence of protein structural constraint is that compensatory mutations are often
required before a change can be accommodated. Such epistatic interactions can be due to direct
or indirect mechanisms linked to residues in close contact in the protein structure conformation
and/or protein stability 11. We present a novel approach for assessing epistasis, where we find
putative epistatic interactions among sites by identifying where changes in the amino acid at one
site cause significant changes in amino acid likelihoods at another. Language models compute
likelihoods for each amino acid in a sequence based on the context of the rest of the sequence.
This means that the likelihoods depend on the other amino acids in the sequence and will change
if other positions mutate. Positions that are unaffected by the mutated site would be expected to
change minimally, whereas those affected may change their probabilities to be lower or higher
depending on whether an epistatic interaction is antagonistic or synergistic, respectively. The
BA.1 Omicron, variant of concern (VOC) is defined by over 30 u nique substitutions in the spike
protein sequence. By reverting these mutations from the BA.1 mutation to the reference sequence
amino acid, we can measure the effect this reversion has on the probabilities of all other BA.1
amino acids computed by the mod el. We did this for each of the mutated sites in BA.1 and
highlight the interactions of three key substitutions E484A, D614G and N969K 12–15. These
mutations were selected due to their effects on antigenicity, conformational changes and
presence in the S2 domain.
The E484A substitution in the RBD of spike is involved in modulating binding to ACE -2 receptor
and is a good example of a substitution with localised effect (Figure 3A). Prior variants Beta and
Gamma contained the E484K substitution and this helped the variants escape several RBD-
.CC-BY-NC-ND 4.0 International licenseavailable under a
was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprint (whichthis version posted July 6, 2024. ; https://doi.org/10.1101/2024.07.05.602129doi: bioRxiv preprint
6
Figure 2. (A) Line graph of the protein sequence alignment entropy at each position in the SARS -
CoV-2 spike. S1 contains the majority of the sites with high entropy, while S2 contains only a few.
(B) Scatter plot of spike protein DMS. Relative grammaticality is shown on the y -axis, with the amino
.CC-BY-NC-ND 4.0 International licenseavailable under a
was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprint (whichthis version posted July 6, 2024. ; https://doi.org/10.1101/2024.07.05.602129doi: bioRxiv preprint
7
acid position on the x -axis. Points are coloured on the semantic rank of each change. (C) Average
relative grammaticality at each position on the spike protein plotted on three structures, the full Spike
protein, the spike monomer, and the spike receptor binding domain (RBD) bound to the ACE -2
receptor in purple.
neutralising antibodies16,17. A, D, G and K substitutions at 484 all confer resistance to human
convalescent sera16,18. E484A shows decreased immune evasiveness relative to E484K, however
interactions between N501Y and Q498R mutations recover this 13. Reversion of the mutation
changes amino acid likelihoods primarily within the RBD ( Figure 3A). RBD positions 488, 485,
and 486 have the largest absolute changes in probability. Q498R, 488 and 486 reduce their
likelihoods upon E484As reversion, indicating that the mutation is synergistic with these positions.
486 increases and thus appears to be antagonistic. S1 reversions have shorter distances to affected
positions in sequence space and in three-dimensional space (Supplementary Figure 3) likely due
to it being the subunit comprising most of the accessible surfaces of the protein. More ‘central’
changes have a greater chance to affect more amino acids, causing bigger changes and having
knock on effects in at more distant locations. While 501Y is not identified, position 500 decreases
its likelihood further suggesting synergistic interactions with this region.
The D614G substitution emerged early during the pandemic 19, is now present in all circulating
lineages and is an example of a substitution with broader effect (Figure 3A). D614G has two main
effects: (1) to remove a hydrogen bond between 614 and 859 on the adjacent spike monomer,
and (2) at 647 to contribute to the structure of the C-terminal domain. 614G increases the stability
of the spike protein12 as well as the infectivity of the virus at the cost of increased susceptibility to
neutralising antibodies12,20. Its reversion affected 45 other positions ( Supplementary Figure 4),
nine of which are in the RBD, with a further 30 in S1 and 6 in S2. While 859 and 647 were not
affected positions, 615, 617 and 649 did change their likelihoods. 649 is an internal amino acid
located behind 614 while 615 is in direct contact with the site and 617 is slightly downstream. 617
and 615 both reduced their likelihoods, indicating that the D614G acquisition was synergistic with
those sites. 649 increase its likelihood by 3.7% on reversion suggesting an antagonistic effect
with 614G.
The N969K reversion affected 46 positions on the protein ( Figure 3A, Supplementary Figure 3
and Supplementary Figure 4) and has been linked to the change in Omicron entry route14. N969K
forms electrostatic contacts with Q755 on the adjacent protomer in the pre-fusion state21, as well
as interacting with and displacing the HR2 backbone in the post -fusion spike structure15. It has
.CC-BY-NC-ND 4.0 International licenseavailable under a
was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprint (whichthis version posted July 6, 2024. ; https://doi.org/10.1101/2024.07.05.602129doi: bioRxiv preprint
8
long-range interactions both in linear sequence and protein structure context ( Supplementary
Figure 3A) and has the greatest number of affected sites compared with the other BA.1 mutations.
Position 76 had a likelihood drop of 21% while position 964 had a 15.9% decrease, indicating a
strong synergistic effect with the N969K substitution. N969K is positioned in the middle of a loop
connecting two alpha helices, with 976 and 964 positioned at the end of the first helix and the
start of the second. The large likelihood shift indicates strong synergy with the N969K mutation
and could be as y et undiscovered epistatic positions. Both D614G and N969K had positions of
interaction on other monomers, however ESM was only given a single monomer, so the lack other
monomers in the embedding may prevent the detection of these interactions.
Position 330 was found to decrease its likelihood in these three reversions, but also in several
other mutations in BA.1. This prompted an investigation into whether certain sites were more
prone to fluctuations than others, and whether these sites were of functional importance. We
identified positions that significantly changed their likelihoods in each DMS mutant, before filtering
for sites that occur across many different mutations and investigating their functional relevance to
the protein. Figure 3C shows several sites in the protein that consistently fluctuate their
likelihoods in response to changes elsewhere. Strikingly, positions 499 , 723, 1087 appear to be
affected by substitutions across the whole protein while other positions such as 330 tend to have
effects at closer ranges. The amino acids in the NTD and RBD that consistently change their
likelihoods are mostly prolines and cystines, while elsewhere there is a larger spectrum of amino
acid types, eight of which contain aromatic side chains (F and Y) (Figure 3D).
Since likelihood changes across the protein identify epistatic and structural effects, we can
produce whole sequence grammaticalities that measure the effect of a mutation on the whole of
the sequence ( Figure 3A and B). This relative sequence grammaticality is the product of the
mutated sequence logits (a pseudo -log likelihood)22. While some mutations are deemed to be
unlikely by the protein language model, selection pressure may allow these changes to propagate
if they are beneficial. We see in Figure 3D a similar distribution to (a), however many more
positions are now positive indicating that the epistatic effect of the mutation has made the
sequence more likely than the reference. Several important substitutions that have been observed
in VOCs are among these including D614G, E484K/A, K417N, P681R and more.
.CC-BY-NC-ND 4.0 International licenseavailable under a
was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprint (whichthis version posted July 6, 2024. ; https://doi.org/10.1101/2024.07.05.602129doi: bioRxiv preprint
9
.CC-BY-NC-ND 4.0 International licenseavailable under a
was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprint (whichthis version posted July 6, 2024. ; https://doi.org/10.1101/2024.07.05.602129doi: bioRxiv preprint
10
Figure 3. (A) Monomeric structures showing the changes in probabilities for three example
substitutions in SARS -CoV-2's spike protein : E484A, D614G and N969K. The mutation site is
coloured yellow, blue sites increase in probability while red sites decrease. Mutation probabilities
were only shown if they were outside two standard deviations of the mean change. (B) Relative
sequence gramm aticalities, the product of each amino acid likelihood rather than just the mutations,
against the amino acid position. Amino acids are coloured on the semantic rank, which is a ranking
of the semantic scores of all positions from highest to lowest semantic score. (C) The significantly
changed logits across the whole DMS were identified, with positions that were repeatedly identified
(called critical sites) as being affected by the in silico mutations counted. These were then mapped
onto the spike structure and coloured on their domains. (D) Amino acids of each consistently affected
Reference
residues from the DMS data. The NTD and RBD contain mostly Prolines(P) and
Cystines(C), while the rest of spike has a wider distribution of amino acids.
Each of these mutations make the sequence less likely compared to Wuhan -Hu-1 when
measured with the relative grammaticality. D614G has a negative relative grammaticality of -3.07,
yet its sequence grammaticality is positive at 0.105, indicating that the overall sequence is fitter
with 614G than 614D based on the model. E484A has a positive relative sequence grammaticality
of 0.151 while the mutation has a negative grammaticality of -2.53. N969K remains negative (-
11.31 relative grammaticality, -2.90 relative sequence grammaticality ), indicating that it still
appears to reduce sequence fitness.
PLMs provide new metrics for assessing mutational impact
An important question is the relationship between the PLM metrics and conventional metrics, i.e.,
does grammaticality and semantic score represent a novel measure of change? Extensive high-
throughput experimental investigations have revealed various aspects of spike biology. We
sought to correlate these measurements with PLM metrics, to better understand what
characteristics are captured by the model. We calculated spearman's rank correlations between
the embedding metrics and experimental and computational metrics. We then used the full
embeddings and logits as well as score combinations to fit linear and support vector regressions
(SVR) to each metric and again scored using a spearman's rank correlation.
Experimental scores were used from 3 in vitro DMS studies, 2 of which were performed on the
RBD while the other was performed on a full spike protein. “Escape”, “Entry” and “Binding” were
determined from a full spike DMS by Dadonaite et al. (2023)23 and correspond to: immune escape
from human sera, cell entry, and binding affinity to soluble ACE -2 respectively. “RBD Escape”
.CC-BY-NC-ND 4.0 International licenseavailable under a
was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprint (whichthis version posted July 6, 2024. ; https://doi.org/10.1101/2024.07.05.602129doi: bioRxiv preprint
11
was produced by Yisimayi et al. (2024)24 using an RBD only DMS. The “ Wuhan” and “ Variant”
binding and expression scores are from an RBD only DMS from Starr et al. (2022)25 where binding
to ACE-2 and expression of the protein were measured across five spikes. The “Variant” score is
made from an average of both measurements across all five spikes.
Neither grammaticality nor the semantic score correlated with the full spike experimental DMS
scores; with coefficients below 0.2 for “Binding” and “Escape”. “Entry” was better correlated with
the PLM scores, with correlation coefficients greater than +0.3 and less than -0.3 for
grammaticality and semantic score respectively. “Escape” correlation coefficients were also below
0.2 for all metrics both for the full spike and the RBD. The semantic score showed strong er
negative correlations with RBD based DMS r esults of around -0.3 (Figure 4A). Fitted linear
regressions improve the correlation coefficient for the “RBD Escape” with the mean embedding
and the Wuhan-Hu-1 “RBD Binding” and “RBD Expression” when using sequence logits (Figure
4B). However, linear models failed to improve much upon metric correlations likely due to their
inability to exploit non -linear relationships in the high -dimensional representations . SVR
regressions improve substantially here, with a correlation coefficient between 0.58 and 0.73 for
all experimental measurements aside from “Escape” and “Binding”, although both improve their
coefficients over the metric and linear model correlations (Figure 4C).
We also used several computational metrics for analysing protein sequences. “ Entropy” was
calculated using an alignment of related Sarbecovirus spike sequences. The crystal structure of
the spike protein was used to calculate “ Accessibility”, “β-Factor” and “ΔΔG”. “ Accessibility”
measures how accessible to the surface of the protein a position is and was calculated with BEpro.
“β-Factor” comes from the 6VXX PDB file and is related to the flexibility of the site. “ΔΔG” was
calculated from FoldX and measures the free energy and relates to protein stability. “PSST”
calculated from Delta Blast and “ESST” calculated from CRESENDO and are measures of
mutational likelihood26.
The semantic score showed strong negative correlations coefficients of < -0.4 with the
computational likelihood metrics “ESST” and “PSST”. Grammaticality scores performed similarly
or slightly worse for the probability metrics although the scores were inversely correlated
compared to the semantic scores. “β-Factor” had the best correlation with relative grammaticality
and relative sequence grammaticality, with both having correlation coefficients greater than 0.4.
.CC-BY-NC-ND 4.0 International licenseavailable under a
was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprint (whichthis version posted July 6, 2024. ; https://doi.org/10.1101/2024.07.05.602129doi: bioRxiv preprint
12
This indicates that grammaticalities (which align well with conservation) may also be a reasonable
proxy for assessing protein flexibility.
Figure 4. (A) Spearman's Rank correlations between the semantic score, grammaticality and relative
sequence grammaticality and traditional metrics. Bars not present in a metric category means the
correlation was not found to be significant after a Bonferroni corre ction. (B) Spearman's Rank
correlations between the language model metric and the traditional metric. Each pair was fitted using
a grid search and a linear regression model, with 5 -fold cross validation. Bars represent the mean
.CC-BY-NC-ND 4.0 International licenseavailable under a
was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprint (whichthis version posted July 6, 2024. ; https://doi.org/10.1101/2024.07.05.602129doi: bioRxiv preprint
13
of the correlations, with the error bar +/ - 1 standard deviation of the correlations. Bars not present in
a metric category means the correlation was not found to be significant after a Bonferroni
correction. (C) Spearman's Rank correlations with a support vector regression model using an RBF
kernel and 5 -fold cross validation.
Both Linear and SVR regressions drastically improve correlations for the computational metrics.
The Linear model produced correlation coefficients >0.5 for all measurements except “Entropy”,
although this might be expected since it’s a site-specific metric rather than mutation specific like
DMS. “Accessibility” and “β-Factor” with both logits and embeddings achie ve very strong
correlation coefficients of >0.7 while “ΔΔG”, also improves with a co-efficient of >0.5 for logits and
embedding features. The SVR regression improves all of the correlation coefficients across the
board except for “ Entropy”. “PSST” experiences the greatest increase from ~0.5 to ~0.8 using
logits and SVR.
We observe that the PLM scores (semantic score, relative grammaticality and relative sequence
grammaticality) weakly correlate with all the tested experimentally or computationally derived
traditional metrics. However, at least one of the language model me trics significantly correlated
with all the other metrics ( Figure 4A). These experiments show that embedding metrics are not
simply an alternative way to describe existing metrics.
When a significant correlation was achieved, linear regression predictions for logits and
embeddings either match or drastically outperform correlations using the model metrics, or their
combinations ( Figure 4B). Fitting a SVR regression results in more consistent and higher
correlation coefficients across the board, with only “Binding”, “Escape” and “Entropy” achieving
correlations of less than 0.5 with logits and embeddings as features ( Figure 4C). Score
combinations show little improvement over individual scores.
Language models recapitulate evolutionary relationships between related
sequences
The earliest known SARS -CoV-2 sequence for each PANGO lineage w as extracted from the
global data (retrieved from GISAID) and their spike protein sequences embedded using ESM-25,7.
The evo-velocity package was then used to produce a n evo-velocity UMAP for the sequence
embeddings. Evo -velocity assigns a putative directionality between the embeddings that
describes the flow of evolution through the UMAP embedding space. Firstly, evo-velocity27 shows
.CC-BY-NC-ND 4.0 International licenseavailable under a
was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprint (whichthis version posted July 6, 2024. ; https://doi.org/10.1101/2024.07.05.602129doi: bioRxiv preprint
14
that PLM embeddings can distinguish between meaningfully different spike proteins ( Figure 5A
and B. Secondly, it illustrates that embeddings can help to understand the evolutionary landscape
of spike and the directionality of its evolution.
The evo-velocity accurately describes the evolution of SARS-CoV-2 moving from the earlier ‘non-
VOC’ clusters into VOC clusters, consistent with their evolution during the pandemic (Figure 5A
and B). Omicron and Delta form distinct clusters while Gamma forms a concentrated cluster on
the fringes of the non-VOC sequences. Beta and Alpha are less homogenous, and recombinants
tend to fall with their parental lineages (primarily Omicrons). Sarbecoviruses (which include bat
and pangolin viruses related to SARS -CoV-2) form another distinct cluster close to the early
SARS-CoV-2 sequences. These are also identified as the earliest sequences by the evo-velocity
pseudotime algorithm (Figure 5B and Supplementary Figure 5). Evo-velocity first uses diffusion
analysis to identify root and endpoint sequences before estimating the order of evolution using a
pseudotime simulation (Figure 5B and Supplementary Figure 5). Pseudotime analysis achieved
a significant (p-value = 3.24e -296) spearman's rank correlation of 0.86 against sampling time,
confirming that the model has inferred the evolution of the sequences in the correct order. Evo-
velocity recapitulates the topology of the representative sequence phylogeny in Figure 5C, with
early VOCs more closely related to non-VOC sequences than Omicrons and recombinants. This
shows PLM embeddings capture meaningful representations that differentiate between distinct
sequences. The congruence between the phylogenetic tree topology and the evo-velocity derived
structure provides further confirmation that this method captures the evolutionary history of the
spike protein.
PLM metrics also differentiate between distinct SARS-CoV-2 spike proteins. Using the semantic
score or the relative grammaticality , a non-VOC, early VOC (Alpha, Beta, Gamma and Delta),
and an Omicron cluster can be identified (Figure 5 and Supplementary Figure 7). Unlike the UMAP
(Figure 5A and Figure 5B) which is a dimensionality reduction and projection from high -
dimensional space, grammaticality and the semantic score are much simpler to produce. Despite
this, they still recapitulate much of the information shown in the UMAP.
.CC-BY-NC-ND 4.0 International licenseavailable under a
was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprint (whichthis version posted July 6, 2024. ; https://doi.org/10.1101/2024.07.05.602129doi: bioRxiv preprint
15
Figure 5. (A) UMAP of initial spike sequence embeddings for SARS -CoV-2 PANGO lineages and a
selection of other known Sarbecovirus spike sequences. Each lineage is represented by one spike
embedding. Points are coloured on VOC classification. Arrows represent the evo -velocity through
.CC-BY-NC-ND 4.0 International licenseavailable under a
was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprint (whichthis version posted July 6, 2024. ; https://doi.org/10.1101/2024.07.05.602129doi: bioRxiv preprint
16
the embedding space, which shows a “directionality” of evolution. (B) shows the sequences coloured
by pseudotime inferred using sequence embedding probabilities to order sequences in time using
an inferred root and an endpoint. (C) Shows an unrooted nucleo tide phylogenetic tree of the spike
sequences, coloured again by VOC. (D) shows the spike protein sequences plotted using their
sample date and semantic score coloured consistently to Figure 1 B.
Detecting the distinct nature of the variants of concern on emergence
The emergence of SARS -CoV-2 and its subsequent variants of concern (VOCs) caused large
waves of infections during the pandemic. Predicting their fitness advantage just from their initial
sequences, before the viruses were in widespread circulation, has proven incredibly difficult .
PLMs could be used to meet this challenge of rapid characterisation of individual pathogen
genomes.
The PLM metrics produced for each PANGO lineage can be used to assess SARA-CoV-2 variants
since they detect differences between sequences. We observed a characteristic "jump” in both
semantic score and relative grammaticality between non-VOC, early VOCs and Omicron
sequence clusters (Figure 5D and Supplementary Figure 7). Earlier VOCs may have required
fewer changes to compete since most of the human population was still naive to infection or
vaccine-derived immunity. The Omicron lineage emerged during high levels of both vaccination
and infection. It contained many more substitutions in the spike protein than previous variants, it
changed its entry mechanism preference and managed to largely evade previous immunity which
resulted in an extended vaccination regimen of three doses being recom mended28,29. Unlike the
other VOC groups, Omicron sequences initially decreased their semantic score s. The BA.2
variant had fewer mutations than the initial BA.1 variant, however later sequences that increased
their number of mutations yet continued to decrease their semantic scores. Clearly the semantic
score is therefore not a proxy for mutation count.
Once regular sequencing and surveillance are underway, a pipeline for analysing emerging
sequences could identify outlie rs that may become future VOCs. Despite PLM scores
distinguishing between VOCs well, a larger score does not always mean a variant will be
successful. This could be due to measuring against Wuhan-Hu-1, which is now an extinct and
increasingly irrelevant SARS-CoV-2 variant. To tackle this, we created a dynamic embedding that
represents the average sequence circulating at a given time. This means we can check whether
sequences are both divergent from the SARS -CoV-2 reference or from what is currently or
.CC-BY-NC-ND 4.0 International licenseavailable under a
was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprint (whichthis version posted July 6, 2024. ; https://doi.org/10.1101/2024.07.05.602129doi: bioRxiv preprint
17
recently circulating. By looking at a UK subset of sequences, we can assess this approach thanks
to the UKs high sequencing capacity paired with well-defined lineage waves. UK sequences were
clustered to 99.9% similarity and a representative haplotype sequence for each cluster was
embedded using the model.
Waves of semantic change take place between all major VOC waves in the UK (Alpha to Delta to
BA.1 to BA.2) ( Figure 6). The non -VOC to Alpha transition resulted in sequences deviating by
semantic score of >2, although this quickly increased to almost three after the emergence of Delta.
Following BA.2, the semantic changes in Omicron have been more incremental, with a gradual
increase up to March 2023, followed by a decrease until July 2023. There have been several
smaller waves after Omicron’s emergence, which suggests repeated dominance and replacement
of consecutive distinct sub -lineages throughout this period. The haplotype embeddings also
uncover the large diversity of semantic scores contained outside just the PANGO representative
sequences from Figure 5. Alpha and Delta have a large range of semantic scores, equalling some
of the most divergent Omicron sequences despite pre -dating them by months (Figure 5C). To a
lesser extent, the non-VOC sequences also have some very high semantic scores relative to the
average semantic score during the Delta wave. What is clear is that as a new VOC began to take
over, sequences diverge very quickly from the average circulating sequence embedding.
.CC-BY-NC-ND 4.0 International licenseavailable under a
was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprint (whichthis version posted July 6, 2024. ; https://doi.org/10.1101/2024.07.05.602129doi: bioRxiv preprint
18
Figure 6. (A) UK SARS -CoV-2 spike sequences through the pandemic. Each point represents a
sequence cluster with 99.9% sequence similarity. Dynamic semantic scores were calculated for each
sequence cluster, with the black line showing the mean sliding score. (B) Rel ative sequence
grammaticalities for each of the haplotype spikes. (C) Semantic scores for each of the haplotype
spikes.
Discussion
ESM-2 models spike protein constraints, mapping low relative grammaticalities to the S2 high
relative grammaticalities to the S1 which is heavily targeted by host antibodies (NTD, RBD)9,18,30–
32 and predominantly on the protein surface (Figure 2). Hie et al 6 use the semantic score as a
proxy for antigenic change, however, this is difficult to reason with for protein sites that are not
involved in antibody binding. Since ESM -2 is trained on a wide diversity of proteins, semantic
scores likely represent shifts in structure/function rather than antigenic change. Many high
semantically ranked changes occur in the NTD ( Figure 2B) which contains a “supersite” with
several epitopes targeted by neutralising antibodies9. The NTDs low relative grammaticality and
.CC-BY-NC-ND 4.0 International licenseavailable under a
was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprint (whichthis version posted July 6, 2024. ; https://doi.org/10.1101/2024.07.05.602129doi: bioRxiv preprint
19
high semantic score suggests it allows large structural changes to alter its antigenic properties
without greatly affecting function.
PLMs can detect epistatic interactions and quantify the impact of amino acid substitutions on other
sites (Figure 3). Antigenic substitutions (e.g., E484A) have localised effects while mutations in the
spike core have much longer -range effects across the protein ( Figure 3 and Supplementary
Figure 3). Training models to detect inter-protomer interactions could allow for better identification
of epistatic positions on multi-meric proteins. The relative sequence grammaticality shows many
mutations (including many VOC substitutions including D614G) are more likely than the reference
when compared to relative grammaticality (Figure 2B and Figure 3B). N969K maintains a negative
relative grammaticality but may represent a fitness trade off as it appears to destabilise the
proteins post-fusion conformation15 while stabilising it pre -fusion context21. The identification of
constrained sites by the PLM is another useful application of these models. These sites are
predominantly amino acids with important structural features. Proline is a rigid amino acid due to
its side -chain which reduces its flexibil ity33,34 and prevents it from forming stable a -helices33.
Prolines are often found where sharp turns are necessary for structure. Cystines thiol side-chain
allows them to form covalent di-sulphide bridges between adjacent cystines35. Cystines are well
conserved throughout proteins 35, and several of the identified cysteines were involved in di -
sulphide bridges (Figure 3C and Figure 3D). This shows PLMs identify constraints positions are
under due to their functional and structural importance within sequences.
Experimental data can validate that the embeddings contain relevant biological and evolutionary
information (Figure 4). ESM-2 is a good predictor of the computational and experimental metrics
used to understand how mutations impact protein structures. Most model scores do not correlate
strongly with any one feature, suggesting that PLM scores are telling us something different about
proteins. PLMs are now being used for DMS and variant effect prediction tasks across an
impressive array of datasets, and often outcompete other methods 36. They are helping improve
the effectiveness of antibodies 37 and even generate new proteins 38. Our results here affirm that
these models are useful and can be applied to great effect to understand novel viral pathogens.
Evo-velocity captures differences between VOC sequences , recapitulates the topology of the
nucleotide sequence phylogeny and reconstructs the direction of evolution accurately (Figure 5).
The PLM scores can effectively group lineages into their VOC categories ( Figure 5D,
.CC-BY-NC-ND 4.0 International licenseavailable under a
was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprint (whichthis version posted July 6, 2024. ; https://doi.org/10.1101/2024.07.05.602129doi: bioRxiv preprint
20
Supplementary Figure 6 and Supplementary Figure 7). Some Delta sequences have higher
relative sequence grammaticalities than the reference SARS-CoV-2 sequence (Figure 3B), which
suggests Delta may more “fit” than other SARS-CoV-2 sequences when all likelihoods are
considered. The metrics also highlight outlier sequences, although these might be related to
factors beyond sequence properties such as epidemiology which make variant prediction difficult
for both experimental and computational methods. Dynamic embedding references can assist
with modelling the shifting immune landscapes SARS -CoV-2 is evolving within. Emergent
antigenically distinct variants are likely to cause repeated infection by SARS -CoV-239.
Consequences of this are quickly becoming apparent through the necessity for updated vaccines,
with recent variants evading previously neutralising antibodies 40,41. We capture major VOC
transitions but also incremental variants like JN.1. The heavily mutated BA.2.86 lineage acquired
an L455S substitution 42 to produce JN.1 , which subsequently became the dominant lineage
circulating. Its identification by the dynamic reference score highlights the approach as a
potentially viable horizon scanning method that does not simply count mutations, but accurately
models them.
In conclusion, applied to SARS-CoV-2 we show that the grammaticality and semantic score PLM
metrics reveal characteristic properties of the SARS-CoV-2 spike protein sequence. While there
are clear areas for improvement, PLMs represent a paradigm shift for interpreting variation within
protein sequences. Already, there are efforts being made to improve these models such as recent
examples from Ito et al. 43 and Thadani et al. 44 where additional in formation can be used to
supplement language models and improve their predictive performance. To summarise, using just
a single spike sequence we were able to determine the likely regions of variability in the spike
protein, and identify regions where mutations were most likely to impact the structure and function
of the protein. We describe how PLMs can map out evolutionary landscapes, identify epistatic
effects and provide a method for horizon scanning of viral sequences of interest. This simply was
not possible before. PLMs offer an exciting glimpse into the future of language modelling within
biology, and we should continue to press on with understanding the possibilities as well as the
Limitations
of what these models can do.
.CC-BY-NC-ND 4.0 International licenseavailable under a
was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprint (whichthis version posted July 6, 2024. ; https://doi.org/10.1101/2024.07.05.602129doi: bioRxiv preprint
21
Methods
ESM-2 and in silico DMS
SARS-CoV-2 spike proteins were acquired by filtering the GISAID database (https://gisaid.org)
for the earliest sequences from each PANGO lineage with a fully intact spike protein sequence.
The sequence was then embedded in the ESM -2 model to produce an embedding for the
sequence. For the DMS data and for the embedding scores, the ESM -2 three billion parameter
variant was used. The semantic score is equivalent to the L1 (Manhattan) distance between the
embedding of the reference sequence (the spike protein from the original Wuhan -Hu-1 SARS-
CoV-2 genome) and each of the PANGO spike proteins. The grammaticality of a sequence is
calculated as the product of the probabilities of each amino acid at each position in the spike
protein. The probabilities come from a softmax of the last layer o f the embedding and range
between 0 and 1. However, many of the probabilities are small and for numerical stability the
probabilities are represented in the log space. Relative grammaticality is the same as the
grammaticality, except the probability of a reference sequence is subtracted from the probability
of the variant sequence so that the score is relative , in this case, to the SARS-CoV-2 reference
sequence Wuhan-Hu-1. The sequence grammaticality represents the summed log-likelihoods of
every reference position in the sequence, rather than just the mutated posit ions. For the DMS
computations, a sequence was produced for every potential amino acid at every position in the
spike protein. Each sequence was then embedded to calculate semantic scores and relative
grammaticalities for every mutation.
Evo-velocity
We used the evo -velocity package 27 to embed the initial sequences using the ESM -2 650M
parameter variant, and then performed velocity analysis and spearman’s rank for the sample
dates against the pseudo-time. The smaller model was used primarily due to hardware limitations
of using the larger model, although in a number of cases this model has been show to equal or
even do better than the larger model.
Epistasis Experiments
The epistasis experiments used a SARS-CoV-2 BA.1 spike protein sequence and embedding this
using the ESM -2 three billion parameter variant. Due to the EPE insertion, the logits for this
position were subsequently removed to map logits to spike’s three-dimensional protein structure.
.CC-BY-NC-ND 4.0 International licenseavailable under a
was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprint (whichthis version posted July 6, 2024. ; https://doi.org/10.1101/2024.07.05.602129doi: bioRxiv preprint
22
The BA.1 sequence contains several mutations relative to the Wuhan -Hu-1 SARS -CoV-2
Reference
spike sequence. Each of these mutations was reverted one by one back to the
Reference
position, and the likelihood differences upon reversion were recorded. To eliminate
noise, changes less than two standard deviations from the mean across all reverted mutations
were removed and deemed not significant.
Dynamic embeddings and horizon scanning
Dynamic embeddings were computed by first gathering UK GISAID data and clustering
sequences into variants with 99.9% similarity. These haplotype variant clusters produced 11,272
sample date labelled haplotypes with a sequence returned for each cluster. Each haplotype
embedding was measured against a mean of the embeddings from the prior three-month period
using the L1 distance, i.e., the semantic score. For sequence grammaticalities, mean sequence
grammaticalities were calculated in a similar way and differences measured.
Assessing embeddings scores with known metrics
The language model’s metrics were first assessed using a Spearman’s rank against several
known biological scores. Next, Support Vector Regression (SVR) was used as a simple model to
fit the model scores as well as the embeddings and logits to the biologically relevant metrics.
Models were fit to the data using five-fold cross validation, and a linear kernel for the SVR. Model
Results
were reported as the average spearman’s rank between the folds, with the error bars as
+/-1 standard deviation from the mean.
Selection analysis signals
Signals of ancestral evolutionary selection in the animal (bat and pangolin) sarbecovirus most
closely related to SARS-CoV-2 referred to as the “nCoV” clade (see Lytras et al.45) were inferred
on a set of 167 sarbecovirus genomes, accounting for recombination by inferring selection
separately in each non-recombinant segment. These results are published in Martin et al. 46 and
presented in more detail in the following Observable notebook:
https://observablehq.com/@spond/ncos-evolution-nov-2021. Sites under negative selection were
inferred using the FEL and sites under positive selection using MEME 47 by testing on internal
branches of the nCoV clade. Sites denoted as conserved have the same amino -acid residue
among all sarbecovirus sequences in the analysis. In addition, the "evolvability" of the site in the
SARS-CoV-2 sequence was obtained as the entropy of the predicted distribution of credible
evolutionary states.
.CC-BY-NC-ND 4.0 International licenseavailable under a
was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprint (whichthis version posted July 6, 2024. ; https://doi.org/10.1101/2024.07.05.602129doi: bioRxiv preprint
23
Antibody accessibility, spike protein stability and Deep Mutational Scanning
Structure-based epitope score, referred to as “accessibility”, which approximates antibody
accessibility for each spike protein amino acid position, was calculated using BEpro software 48
for a Woo et al. model of the Wuhan spike. Scores relating to substitution probabilities, namely,
ESST probability, Log PSS and ΔΔG, were obtained for every possible single amino acid
substitution for the 6VXX SARS -CoV-2 spike structure (note that only values for Chain A are
included in the results as data is generally identical across all three chains). ESST probability
values were calculated using Environment Specific Substitution Table (ESST) with
CRESCENDO49 software. Log Likelihood substitution values were calculated using Position
Specific Scoring Matrices (PSSM) with the DELTA -BLAST50 algorithm in BlastX. It should be
noted that this is a sequence-based method, so residue numbering does not match the numbering
in 6VXX and values are available for residues not described by the 6VXX PDB file. ΔΔG values
were obtained by FoldX 51 software. Substitutions were performed on all three chains
simultaneously and the change in Gibbs free energy was calculated, with negative values implying
stabilising mutations.
Deep mutational scanning data
The receptor binding deep mutational scanning data was taken from experimental DMS studies
of the SARS -CoV-2 from Yisimayi et al. (2024) and Starr et al. (2022) 24,25. The Wuhan -Hu-1
scores were taken as is, while the variant average score was calculated by averaging the scores
for each position between each of the SARS-CoV-2 variant specific DMS results25. For the RBD
mutational escape values we utilised the high -throughput mutation antibody escape profiling
Results
presented in Yisimayi et al. (2024)24 This study used a panel of 1,350 monoclonal
antibodies against all possible RBD substitutions. The backbone virus used was the SARS-CoV-
2 BA.5 variant instead of Wuhan-Hu-1, however this should still provide the most comprehensive
dataset of unique substitutions’ effect on antibody escape. Our mutational escape metric is the
average of the raw antibody escape values for each substitution on each site of the RBD across
all tested monoclonal antibodies. The full spike DMS data was taken from Dadonaite et al. (2023)
and DMS scores (“Binding”, “Escape” and “Entry”) are from a BA.2 and XBB.1.5 spike backbone
and averaged together to give the presented score23.
.CC-BY-NC-ND 4.0 International licenseavailable under a
was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprint (whichthis version posted July 6, 2024. ; https://doi.org/10.1101/2024.07.05.602129doi: bioRxiv preprint
24
Data Availability
The initial sequences for each PANGO lineage in Figure 5 were extracted from GISAID , their
accession numbers can be found here: https://doi.org/10.55876/gis8.240620pm. The SARS-CoV-
2 spike sequences used for Figure 6 are from GISAID, their accession numbers can be found
here: https://doi.org/10.55876/gis8.240621ma. The code as well as data can be found on GitHub:
https://github.com/kieran12lamb/PLM_SARS-CoV-2.
Acknowledgements
We gratefully acknowledge all data contributors, i.e., the Authors and their Originating laboratories
responsible for obtaining the specimens, and their Submitting laboratories for generating the
genetic sequence and metadata and sharing via the GISAID Ini tiative, on which this research is
based. We thank James Herzig and Simon Lovell for providing the protein stability changes data.
Thanks for the neural network icon made by Becris from www.flaticon.com. The authors
acknowledge funding from the UK Medical Research Council (MRC, MC_UU_12014/12,
MC_UU_00034/5, MR/V01157X/1 and a Doctoral Training Programme in Precision Medicine
studentship for KDL, MR/N013166/1), the Wellcome Trust (220977/Z/20/Z), the UK Research and
Innovation (UKRI) to the G2P -UK consortium (MR/W005611/1) and G2P2 consortium
(MR/Y004205), and the COVID-19 Genomics UK Consortium (COG -UK), which was supported
by funding from the MRC, part of UKRI, the UK National Institute of Health and Care Research
(MC_PC_19027) and Genome Research Limited, operating as the Wellcome Sanger Institute.
For the purpose of open access, the author has applied a Creative Commons Attribution (CC BY)
licence to any Author Accepted Manuscript version arising.
References
1. Markov, P. V. et al. The evolution of SARS-CoV-2. Nat Rev Microbiol 21, 361–379 (2023).
2. Harari, S. et al. Drivers of adaptive evolution during chronic SARS-CoV-2 infections. Nat Med
28, 1501–1508 (2022).
3. Elliott, P. et al. Exponential growth, high prevalence of SARS-CoV-2, and vaccine effectiveness
associated with the Delta variant. Science 374, eabl9551 (2021).
.CC-BY-NC-ND 4.0 International licenseavailable under a
was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprint (whichthis version posted July 6, 2024. ; https://doi.org/10.1101/2024.07.05.602129doi: bioRxiv preprint
25
4. Viana, R. et al. Rapid epidemic expansion of the SARS -CoV-2 Omicron variant in southern
Africa. Nature 603, 679–686 (2022).
5. Lin, Z. et al. Evolutionary-scale prediction of atomic -level protein structure with a language
model. Science 379, 1123–1130 (2023).
6. Hie, B., Zhong, E. D., Berger, B. & Bryson, B. Learning the language of viral evolution and
escape. Science 371, 284–288 (2021).
7. Rives, A. et al. Biological structure and function emerge from scaling unsupervised learning to
250 million protein sequences. Proc. Natl. Acad. Sci. U.S.A. 118, e2016239118 (2021).
8. Tiberti, M. et al. MutateX: an automated pipeline for in silico saturation mutagenesis of protein
structures and structural ensembles. Briefings in Bioinformatics 23, bbac074 (2022).
9. Lok, S.-M. An NTD supersite of attack. Cell Host & Microbe 29, 744–746 (2021).
10. Marquet, C. et al. Embeddings from protein language models predict conservation and variant
effects. Hum Genet 141, 1629–1647 (2022).
11. Anishchenko, I., Ovchinnikov, S., Kamisetty, H. & Baker, D. Origins of coevolution between
residues distant in protein 3D structures. Proceedings of the National Academy of Sciences
114, 9122–9127 (2017).
12. Zhang, J. et al. Structural impact on SARS -CoV-2 spike protein by D614G substitution.
Science 372, 525–530 (2021).
13. Schröder, S. et al. Characterization of intrinsic and effective fitness changes caused by
temporarily fixed mutations in the SARS -CoV-2 spike E484 epitope and identification of an
epistatic precondition for the evolution of E484A in variant Omicron. Virology Journal 20, 257
(2023).
14. Peacock, T. P. et al. The altered entry pathway and antigenic distance of the SARS -CoV-2
Omicron variant map to separate domains of spike protein. 2021.12.31.474653 Preprint at
https://doi.org/10.1101/2021.12.31.474653 (2022).
15. Yang, K. et al. Structure-based design of a SARS -CoV-2 Omicron -specific inhibitor.
Proceedings of the National Academy of Sciences 120, e2300360120 (2023).
16. Harvey, W. T. et al. SARS-CoV-2 variants, spike mutations and immune escape. Nat Rev
Microbiol 19, 409–424 (2021).
17. Wang, P. et al. Antibody resistance of SARS-CoV-2 variants B.1.351 and B.1.1.7. Nature 593,
130–135 (2021).
18. Liu, Z. et al. Identification of SARS -CoV-2 spike mutations that attenuate monoclonal and
serum antibody neutralization. Cell Host & Microbe 29, 477-488.e4 (2021).
.CC-BY-NC-ND 4.0 International licenseavailable under a
was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprint (whichthis version posted July 6, 2024. ; https://doi.org/10.1101/2024.07.05.602129doi: bioRxiv preprint
26
19. Korber, B. et al. Tracking Changes in SARS -CoV-2 Spike: Evidence that D614G Increases
Infectivity of the COVID-19 Virus. Cell 182, 812-827.e19 (2020).
20. Hou, Y. J. et al. SARS-CoV-2 D614G variant exhibits efficient replication ex vivo and
transmission in vivo. Science 370, 1464–1468 (2020).
21. McCallum, M. et al. Structural basis of SARS-CoV-2 Omicron immune evasion and receptor
engagement. Science 375, 864–868 (2022).
22. Brandes, N., Goldman, G., Wang, C. H., Ye, C. J. & Ntranos, V. Genome -wide prediction of
disease variant effects with a deep protein language model. Nat Genet 55, 1512–1522 (2023).
23. Dadonaite, B. et al. Full-spike deep mutational scanning helps predict the evolutionary
success of SARS -CoV-2 clades. 2023.11.13.566961 Preprint at
https://doi.org/10.1101/2023.11.13.566961 (2023).
24. Yisimayi, A. et al. Repeated Omicron exposures override ancestral SARS -CoV-2 immune
imprinting. Nature 625, 148–156 (2024).
25. Starr, T. N. et al. Shifting mutational constraints in the SARS-CoV-2 receptor-binding domain
during viral evolution. Science 377, 420–424 (2022).
26. Walls, A. C. et al. Structure, Function, and Antigenicity of the SARS -CoV-2 Spike
Glycoprotein. Cell 181, 281-292.e6 (2020).
27. Hie, B. L., Yang, K. K. & Kim, P. S. Evolutionary velocity with protein language models predicts
evolutionary dynamics of diverse proteins. cels 13, 274-285.e6 (2022).
28. Nemet Ital et al. Third BNT162b2 Vaccination Neutralization of SARS -CoV-2 Omicron
Infection. New England Journal of Medicine 386, 492–494 (2022).
29. Lauring, A. S. et al. Clinical severity of, and effectiveness of mRNA vaccines against, covid -
19 from omicron, delta, and alpha SARS -CoV-2 variants in the United States: prospective
observational study. BMJ 376, e069761 (2022).
30. Cerutti, G. et al. Potent SARS-CoV-2 neutralizing antibodies directed against spike N-terminal
domain target a single supersite. Cell Host & Microbe 29, 819-833.e7 (2021).
31. Cui, Z. et al. Structural and functional characterizations of infectivity and immune evasion of
SARS-CoV-2 Omicron. Cell 185, 860-871.e13 (2022).
32. McCallum, M. et al. N-terminal domain antigenic mapping reveals a site of vulnerability for
SARS-CoV-2. Cell 184, 2332-2347.e16 (2021).
33. Watanabe, K. & Suzuki, Y. Protein thermostabilization by proline substitutions. Journal of
Molecular Catalysis B: Enzymatic 4, 167–180 (1998).
.CC-BY-NC-ND 4.0 International licenseavailable under a
was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprint (whichthis version posted July 6, 2024. ; https://doi.org/10.1101/2024.07.05.602129doi: bioRxiv preprint
27
34. Choi, E. J. & Mayo, S. L. Generation and analysis of proline mutants in protein G. Protein
Engineering, Design and Selection 19, 285–289 (2006).
35. Wong, J. W. H., Ho, S. Y. W. & Hogg, P. J. Disulfide Bond Acquisition through Eukaryotic
Protein Evolution. Molecular Biology and Evolution 28, 327–334 (2011).
36. Livesey, B. J. & Marsh, J. A. Updated benchmarking of variant effect predictors using deep
mutational scanning. Mol Syst Biol 19, e11474 (2023).
37. Hie, B. L. et al. Efficient evolution of human antibodies from general protein language models.
Nat Biotechnol 42, 275–283 (2024).
38. Madani, A. et al. Large language models generate functional protein sequences across
diverse families. Nat Biotechnol 41, 1099–1106 (2023).
39. Wei, J. et al. Risk of SARS-CoV-2 reinfection during multiple Omicron variant waves in the
UK general population. Nat Commun 15, 1008 (2024).
40. Andrews, N. et al. Covid-19 Vaccine Effectiveness against the Omicron (B.1.1.529) Variant.
New England Journal of Medicine 386, 1532–1546 (2022).
41. Carabelli, A. M. et al. SARS-CoV-2 variant biology: immune escape, transmission and fitness.
Nat Rev Microbiol 21, 162–177 (2023).
42. Yang, S. et al. Fast evolution of SARS-CoV-2 BA.2.86 to JN.1 under heavy immune pressure.
The Lancet Infectious Diseases 24, e70–e72 (2024).
43. Ito, J. et al. A Protein Language Model for Exploring Viral Fitness Landscapes.
2024.03.15.584819 Preprint at https://doi.org/10.1101/2024.03.15.584819 (2024).
44. Thadani, N. N. et al. Learning from prepandemic data to forecast viral escape. Nature 622,
818–825 (2023).
45. Lytras, S. et al. Exploring the Natural Origins of SARS-CoV-2 in the Light of Recombination.
Genome Biology and Evolution 14, evac018 (2022).
46. Martin, D. P. et al. Selection Analysis Identifies Clusters of Unusual Mutational Changes in
Omicron Lineage BA.1 That Likely Impact Spike Function. Molecular Biology and Evolution
39, msac061 (2022).
47. Murrell, B. et al. Modeling HIV-1 Drug Resistance as Episodic Directional Selection. PLOS
Computational Biology 8, e1002507 (2012).
48. Sweredoski, M. J. & Baldi, P. PEPITO: improved discontinuous B-cell epitope prediction using
multiple distance thresholds and half sphere exposure. Bioinformatics 24, 1459–1460 (2008).
.CC-BY-NC-ND 4.0 International licenseavailable under a
was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprint (whichthis version posted July 6, 2024. ; https://doi.org/10.1101/2024.07.05.602129doi: bioRxiv preprint
28
49. Chelliah, V., Chen, L., Blundell, T. L. & Lovell, S. C. Distinguishing Structural and Functional
Restraints in Evolution in Order to Identify Interaction Sites. Journal of Molecular Biology 342,
1487–1504 (2004).
50. Boratyn, G. M. et al. Domain enhanced lookup time accelerated BLAST. Biology Direct 7, 12
(2012).
51. Joost Schymkowitz et al. The FoldX web server: an online force field. Nucleic Acids Research
33, W382–W388 (2005).
.CC-BY-NC-ND 4.0 International licenseavailable under a
was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprint (whichthis version posted July 6, 2024. ; https://doi.org/10.1101/2024.07.05.602129doi: bioRxiv preprint
29
Supplementary figures
Supplementary Figure 1. Swarm plots for the distributions of entropy and grammaticality for each
of the SARS-CoV-2 spike protein subunits S1 and S2. The black line shows the mean while the green
shows the median. For both entropy and grammaticality, the S2 subunit has on average lower scores
compared to the S1.
.CC-BY-NC-ND 4.0 International licenseavailable under a
was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprint (whichthis version posted July 6, 2024. ; https://doi.org/10.1101/2024.07.05.602129doi: bioRxiv preprint
30
Supplementary Figure 2. DMS plot with only the substitutions observed in SARS-CoV-2 VOC
sequences highlighted. Most of the mutations occur in the spike protein S1 region which the model
predicts has a higher likelihood of mutations , i.e., a higher grammaticality .
.CC-BY-NC-ND 4.0 International licenseavailable under a
was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprint (whichthis version posted July 6, 2024. ; https://doi.org/10.1101/2024.07.05.602129doi: bioRxiv preprint
31
A
B
Supplementary Figure 3. (A) Boxplots showing the distribution of distances between mutations and
the positions affected by mutations in each subunit. (B) Boxplots showing the distribution of distances
between mutations and the positions for each mutation.
.CC-BY-NC-ND 4.0 International licenseavailable under a
was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprint (whichthis version posted July 6, 2024. ; https://doi.org/10.1101/2024.07.05.602129doi: bioRxiv preprint
32
Supplementary Figure 4. Number of sites with a significant (+/ -2 deviations from mean) change in
probability for each SARS-CoV-2 BA.1 reversion change.
Supplementary Figure 5. Predicted root nodes and endpoints identified by running Markov diffusion
process over the weighted edges of the evo -velocity network. The root nodes are correctly identified
as the Sarbecovirus spike sequences, with Omicron VOC sequences predicted as the end nodes.
.CC-BY-NC-ND 4.0 International licenseavailable under a
was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprint (whichthis version posted July 6, 2024. ; https://doi.org/10.1101/2024.07.05.602129doi: bioRxiv preprint
33
Supplementary Figure 6. SARS-CoV-2 Pango lineage representative sequences plotted by their
semantic scores and relative grammaticalities.
Supplementary Figure 7. SARS-CoV-2 Pango lineage representative sequences plotted by their
relative grammaticalities against sampling date.
.CC-BY-NC-ND 4.0 International licenseavailable under a
was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprint (whichthis version posted July 6, 2024. ; https://doi.org/10.1101/2024.07.05.602129doi: bioRxiv preprint
Text is read by the "Ask this paper" AI Q&A widget below.
Extraction quality varies by source — PMC NXML preserves structure
cleanly, OA-HTML may include some navigation residue, and OA-PDF can
have broken hyphenation. The publisher copy
(via DOI)
is the canonical version.