From a single sequence to evolutionary trajectories: protein language models capture the evolutionary potential of SARS-CoV-2 protein sequences

doi:10.1101/2024.07.05.602129

From a single sequence to evolutionary trajectories: protein language models capture the evolutionary potential of SARS-CoV-2 protein sequences

2024 · doi:10.1101/2024.07.05.602129

preprint OA: gold CC-BY-NC-ND-4.0

📄 Open PDF Full text JSON View at publisher

Full text 70,783 characters · extracted from oa-pdf · 14 sections · click to expand

Abstract

Protein language models ( PLMs) capture features of protein three-dimensional structure from amino acid sequences alone , with out requiring multiple sequence alignments (MSA). The concepts of grammar and semantics from natural language have been shown to be applicable to understanding functional properties of viral proteins like antigenicity. Here, we investigate how these representations more broadly enable assessment of both observed and previously unobserved variation due to mutation. Applied to SARS -CoV-2's spike protein we demonstrate the PLM, ESM -2, has learned the sequence context within which variation occurs, capturing evolutionary constraint. We demonstrate this recapitulates what conventionally requires MSA data predicting both where variation does and does not accumulate across the protein. We show the PLM-derived measures of amino acid change represent novel metrics as they do not correlate strongly with classical metrics of change in sequences. Applied to SARS-CoV-2 variants across the pandemic we demonstrate that ESM-2 representations encode the relatedness between variants, i.e., their evolutionary history, as well as the distinct nature of variants of concern upon their emergence, associated with shifts in receptor binding and antigenicity. This application of ESM-2 can be used for characterising the evolutionary potential and possible epidemiological .CC-BY-NC-ND 4.0 International licenseavailable under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (whichthis version posted July 6, 2024. ; https://doi.org/10.1101/2024.07.05.602129doi: bioRxiv preprint 2 impact of both existing and emerging variation , with the potential to be applied to any protein sequence, pathogen or otherwise.

Introduction

The conventional study of protein sequence variation requires the alignment of amino acid sequence data. Alignments help identify where variation accumulates within sequences, revealing the evolutionary constraints that dictate observed variation. While impactful missense mutations, non-synonymous substitutions resulting in amino acid changes, in the human genome are primarily linked to disease , in viruses amino acid changes are more often linked to adaptive evolution1,2. Alternatively, these changes (and in fact the majority of possible missense mutations) are deleterious resulting in, e.g., misfolding and a decrease in fitness or be “neutral” and have little to no consequence. Approaches for assessing the impact of mutations are premised on the availability of comparative genome sequence data to assess features like the relative proportions of non -synonymous to synonymous substitutions (dN/dS) or evolutionary conservation at individual sites. Similarly, genome sequence-based surveillance methods usually require aligned real-time sequence data to assess the relative growth rate of a pathogen lineage3,4. Experiments on novel variants can be initiated when first detected, but they take time to complete. Natural language processing (NLP) methods can be applied to biological sequences like DNA and proteins5–7. Biological sequences are physical molecules represented by arbitrary characters. This is convenient since these characters, representing nucleotides or amino acids, can be processed in the same way as words in natural language. Hie. et al6 show how, much like natural language, biological sequence context can be used to create informative embeddings that reflect complex properties of the sequence. Specifically, concepts like grammar (the structural rules of a language) and semantics (the meaning of the language) can be applied to changes in protein amino acid sequences. The semantic difference between protein sequences can be measured by computing distances between sequences within the embedding space. The grammar of a sequence is captured by its probability, indicating how well it conforms to the rules of protein formation and composition. .CC-BY-NC-ND 4.0 International licenseavailable under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (whichthis version posted July 6, 2024. ; https://doi.org/10.1101/2024.07.05.602129doi: bioRxiv preprint 3 Evolutionary scale model 2 (ESM-2)5,7 is a protein language model (PLM) trained on approximately 65 million unique protein sequences5,7. Here, we demonstrate that ESM-2 can infer fundamental properties that classically require a multiple sequence alignment. Applied to SARS- CoV-2 w e show that the grammaticality and semantic score metrics reveal characteristic properties of the SARS-CoV-2 spike protein sequence (Figure 1). We further assess how to use these metrics for variant and sequence horizon scanning . The metrics, produced from a single protein sequence using pre-trained models, bypass the need for alignment data. We make a case for their future application in characterising proteins from emerging pathogens and assessing the constraints acting on protein sequences. Figure 1. Schematic summary of our methodology . Deep mutational scanning involves taking a sequence and mutating every position to each of the possible alternative amino acids. These are then passed to the PLM to produce embeddings and logits which can be used for downstream tasks or to produce metric s like relative grammaticality or the semantic score. Identifying epistatic interactions involves reverting the mutations from a SARS-CoV-2 variant ( here BA.1) and measuring the effect this has on the other likelihoods . Embeddings and log-likelihoods (logi ts) can be used for .CC-BY-NC-ND 4.0 International licenseavailable under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (whichthis version posted July 6, 2024. ; https://doi.org/10.1101/2024.07.05.602129doi: bioRxiv preprint 4 surveillance by producing metrics over timescales for newly emergent protein sequences of concern, or by looking at evolutionary trajectories such as with evo -velocity.

Results

PLMs capture evolutionary potential We first assess how well ESM-2 reflects the consequences of protein structure as a constraint on sequence variation. Specifically , does the model assign appropriate values for relative grammaticality (the log-likelihood ratio between the mutation and the reference amino acid) and semantic score (the distance between sequence embedding s, e.g. , mutant and wild -type) at different sites in the protein sequence? To do this analysis we utilised an in silico deep mutational scanning (DMS) approach8 using embeddings for every possible single substitution in the SARS- CoV-2 spike protein. In vitro deep mutational scans involve experimentally substituting each site in a protein with every other amino acid and measuring the substitutions’ effect on phenotype. We do this computationally by embedding each of these substituted sequences using ESM-2 and then calculating grammaticality and semantic scores for each sequence. The ESM-2 scores correlate well with the structur al subunits of the spike (Figure 2). Using the DMS approach to produce relative grammaticality and semantic scores, we observe that the protein has two main regions that broadly correspond to the two subunits of spike protein S1 and S2 (Figure 2B and C). Relative grammaticalities decrease when substitutions are introduced in the S2 indicating that the region is less tolerant to change compared to S1 . We observe a statistically significant difference between the mean relative grammaticalities of the S1 and S2 (p- value = 2.168e-157, Mann-Whitney U test). Spike is a trimer with the core of the S2 subunit being formed by three closely interacting alpha helices from each monomer. As such, changes in this region could disrupt the formation of a stable and functional spike trimer by disrupting the inter- monomer interactions. On the other hand, in the S1 subunit, mutations in the N-terminal Domain (NTD) and receptor-binding domain (RBD) have higher relative grammaticalities. Despite being important to protein function, these regions need to be sufficiently flexible to facilitate receptor binding interactions and accommodate mutations to evade host immunity. This mirrors sequence variation from alignments. Specifically, using entropy, a measure of site-specific variation computed from multiple sequence alignments, there is significantly higher entropy observed at sites in S1 compared to S2 (Figure 2A), with each region found to be significantly different ( p- .CC-BY-NC-ND 4.0 International licenseavailable under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (whichthis version posted July 6, 2024. ; https://doi.org/10.1101/2024.07.05.602129doi: bioRxiv preprint 5 value = 1.121e-10, Mann-Whitney U test). This demonstrates that ESM-2 has correctly identified S1 as a more variable region than S2 (an analysis based on permutations of a single sequence) and indicates that ESM-2 has learnt this property about spike. The semantic rank also increases in the NTD region of the S1 (Figure 2A and B), indicating that changes here may produce large structural changes and are more likely to be accommodated due to higher relative grammaticalities across the region; presumably reflecting properties of the NTD, i.e., it is a region on the surface of the protein and contains many antigenic epitopes 9. These results echo prior

Results

that show PLMs can be used as effective predictors of protein conservation10. If ESM can model regional constraints, it may also be able to identify epistatic interactions that are linked with protein structural constraints. PLMs identify epistatic interaction sites Another consequence of protein structural constraint is that compensatory mutations are often required before a change can be accommodated. Such epistatic interactions can be due to direct or indirect mechanisms linked to residues in close contact in the protein structure conformation and/or protein stability 11. We present a novel approach for assessing epistasis, where we find putative epistatic interactions among sites by identifying where changes in the amino acid at one site cause significant changes in amino acid likelihoods at another. Language models compute likelihoods for each amino acid in a sequence based on the context of the rest of the sequence. This means that the likelihoods depend on the other amino acids in the sequence and will change if other positions mutate. Positions that are unaffected by the mutated site would be expected to change minimally, whereas those affected may change their probabilities to be lower or higher depending on whether an epistatic interaction is antagonistic or synergistic, respectively. The BA.1 Omicron, variant of concern (VOC) is defined by over 30 u nique substitutions in the spike protein sequence. By reverting these mutations from the BA.1 mutation to the reference sequence amino acid, we can measure the effect this reversion has on the probabilities of all other BA.1 amino acids computed by the mod el. We did this for each of the mutated sites in BA.1 and highlight the interactions of three key substitutions E484A, D614G and N969K 12–15. These mutations were selected due to their effects on antigenicity, conformational changes and presence in the S2 domain. The E484A substitution in the RBD of spike is involved in modulating binding to ACE -2 receptor and is a good example of a substitution with localised effect (Figure 3A). Prior variants Beta and Gamma contained the E484K substitution and this helped the variants escape several RBD- .CC-BY-NC-ND 4.0 International licenseavailable under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (whichthis version posted July 6, 2024. ; https://doi.org/10.1101/2024.07.05.602129doi: bioRxiv preprint 6 Figure 2. (A) Line graph of the protein sequence alignment entropy at each position in the SARS - CoV-2 spike. S1 contains the majority of the sites with high entropy, while S2 contains only a few. (B) Scatter plot of spike protein DMS. Relative grammaticality is shown on the y -axis, with the amino .CC-BY-NC-ND 4.0 International licenseavailable under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (whichthis version posted July 6, 2024. ; https://doi.org/10.1101/2024.07.05.602129doi: bioRxiv preprint 7 acid position on the x -axis. Points are coloured on the semantic rank of each change. (C) Average relative grammaticality at each position on the spike protein plotted on three structures, the full Spike protein, the spike monomer, and the spike receptor binding domain (RBD) bound to the ACE -2 receptor in purple. neutralising antibodies16,17. A, D, G and K substitutions at 484 all confer resistance to human convalescent sera16,18. E484A shows decreased immune evasiveness relative to E484K, however interactions between N501Y and Q498R mutations recover this 13. Reversion of the mutation changes amino acid likelihoods primarily within the RBD ( Figure 3A). RBD positions 488, 485, and 486 have the largest absolute changes in probability. Q498R, 488 and 486 reduce their likelihoods upon E484As reversion, indicating that the mutation is synergistic with these positions. 486 increases and thus appears to be antagonistic. S1 reversions have shorter distances to affected positions in sequence space and in three-dimensional space (Supplementary Figure 3) likely due to it being the subunit comprising most of the accessible surfaces of the protein. More ‘central’ changes have a greater chance to affect more amino acids, causing bigger changes and having knock on effects in at more distant locations. While 501Y is not identified, position 500 decreases its likelihood further suggesting synergistic interactions with this region. The D614G substitution emerged early during the pandemic 19, is now present in all circulating lineages and is an example of a substitution with broader effect (Figure 3A). D614G has two main effects: (1) to remove a hydrogen bond between 614 and 859 on the adjacent spike monomer, and (2) at 647 to contribute to the structure of the C-terminal domain. 614G increases the stability of the spike protein12 as well as the infectivity of the virus at the cost of increased susceptibility to neutralising antibodies12,20. Its reversion affected 45 other positions ( Supplementary Figure 4), nine of which are in the RBD, with a further 30 in S1 and 6 in S2. While 859 and 647 were not affected positions, 615, 617 and 649 did change their likelihoods. 649 is an internal amino acid located behind 614 while 615 is in direct contact with the site and 617 is slightly downstream. 617 and 615 both reduced their likelihoods, indicating that the D614G acquisition was synergistic with those sites. 649 increase its likelihood by 3.7% on reversion suggesting an antagonistic effect with 614G. The N969K reversion affected 46 positions on the protein ( Figure 3A, Supplementary Figure 3 and Supplementary Figure 4) and has been linked to the change in Omicron entry route14. N969K forms electrostatic contacts with Q755 on the adjacent protomer in the pre-fusion state21, as well as interacting with and displacing the HR2 backbone in the post -fusion spike structure15. It has .CC-BY-NC-ND 4.0 International licenseavailable under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (whichthis version posted July 6, 2024. ; https://doi.org/10.1101/2024.07.05.602129doi: bioRxiv preprint 8 long-range interactions both in linear sequence and protein structure context ( Supplementary Figure 3A) and has the greatest number of affected sites compared with the other BA.1 mutations. Position 76 had a likelihood drop of 21% while position 964 had a 15.9% decrease, indicating a strong synergistic effect with the N969K substitution. N969K is positioned in the middle of a loop connecting two alpha helices, with 976 and 964 positioned at the end of the first helix and the start of the second. The large likelihood shift indicates strong synergy with the N969K mutation and could be as y et undiscovered epistatic positions. Both D614G and N969K had positions of interaction on other monomers, however ESM was only given a single monomer, so the lack other monomers in the embedding may prevent the detection of these interactions. Position 330 was found to decrease its likelihood in these three reversions, but also in several other mutations in BA.1. This prompted an investigation into whether certain sites were more prone to fluctuations than others, and whether these sites were of functional importance. We identified positions that significantly changed their likelihoods in each DMS mutant, before filtering for sites that occur across many different mutations and investigating their functional relevance to the protein. Figure 3C shows several sites in the protein that consistently fluctuate their likelihoods in response to changes elsewhere. Strikingly, positions 499 , 723, 1087 appear to be affected by substitutions across the whole protein while other positions such as 330 tend to have effects at closer ranges. The amino acids in the NTD and RBD that consistently change their likelihoods are mostly prolines and cystines, while elsewhere there is a larger spectrum of amino acid types, eight of which contain aromatic side chains (F and Y) (Figure 3D). Since likelihood changes across the protein identify epistatic and structural effects, we can produce whole sequence grammaticalities that measure the effect of a mutation on the whole of the sequence ( Figure 3A and B). This relative sequence grammaticality is the product of the mutated sequence logits (a pseudo -log likelihood)22. While some mutations are deemed to be unlikely by the protein language model, selection pressure may allow these changes to propagate if they are beneficial. We see in Figure 3D a similar distribution to (a), however many more positions are now positive indicating that the epistatic effect of the mutation has made the sequence more likely than the reference. Several important substitutions that have been observed in VOCs are among these including D614G, E484K/A, K417N, P681R and more. .CC-BY-NC-ND 4.0 International licenseavailable under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (whichthis version posted July 6, 2024. ; https://doi.org/10.1101/2024.07.05.602129doi: bioRxiv preprint 9 .CC-BY-NC-ND 4.0 International licenseavailable under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (whichthis version posted July 6, 2024. ; https://doi.org/10.1101/2024.07.05.602129doi: bioRxiv preprint 10 Figure 3. (A) Monomeric structures showing the changes in probabilities for three example substitutions in SARS -CoV-2's spike protein : E484A, D614G and N969K. The mutation site is coloured yellow, blue sites increase in probability while red sites decrease. Mutation probabilities were only shown if they were outside two standard deviations of the mean change. (B) Relative sequence gramm aticalities, the product of each amino acid likelihood rather than just the mutations, against the amino acid position. Amino acids are coloured on the semantic rank, which is a ranking of the semantic scores of all positions from highest to lowest semantic score. (C) The significantly changed logits across the whole DMS were identified, with positions that were repeatedly identified (called critical sites) as being affected by the in silico mutations counted. These were then mapped onto the spike structure and coloured on their domains. (D) Amino acids of each consistently affected

Reference

residues from the DMS data. The NTD and RBD contain mostly Prolines(P) and Cystines(C), while the rest of spike has a wider distribution of amino acids. Each of these mutations make the sequence less likely compared to Wuhan -Hu-1 when measured with the relative grammaticality. D614G has a negative relative grammaticality of -3.07, yet its sequence grammaticality is positive at 0.105, indicating that the overall sequence is fitter with 614G than 614D based on the model. E484A has a positive relative sequence grammaticality of 0.151 while the mutation has a negative grammaticality of -2.53. N969K remains negative (- 11.31 relative grammaticality, -2.90 relative sequence grammaticality ), indicating that it still appears to reduce sequence fitness. PLMs provide new metrics for assessing mutational impact An important question is the relationship between the PLM metrics and conventional metrics, i.e., does grammaticality and semantic score represent a novel measure of change? Extensive high- throughput experimental investigations have revealed various aspects of spike biology. We sought to correlate these measurements with PLM metrics, to better understand what characteristics are captured by the model. We calculated spearman's rank correlations between the embedding metrics and experimental and computational metrics. We then used the full embeddings and logits as well as score combinations to fit linear and support vector regressions (SVR) to each metric and again scored using a spearman's rank correlation. Experimental scores were used from 3 in vitro DMS studies, 2 of which were performed on the RBD while the other was performed on a full spike protein. “Escape”, “Entry” and “Binding” were determined from a full spike DMS by Dadonaite et al. (2023)23 and correspond to: immune escape from human sera, cell entry, and binding affinity to soluble ACE -2 respectively. “RBD Escape” .CC-BY-NC-ND 4.0 International licenseavailable under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (whichthis version posted July 6, 2024. ; https://doi.org/10.1101/2024.07.05.602129doi: bioRxiv preprint 11 was produced by Yisimayi et al. (2024)24 using an RBD only DMS. The “ Wuhan” and “ Variant” binding and expression scores are from an RBD only DMS from Starr et al. (2022)25 where binding to ACE-2 and expression of the protein were measured across five spikes. The “Variant” score is made from an average of both measurements across all five spikes. Neither grammaticality nor the semantic score correlated with the full spike experimental DMS scores; with coefficients below 0.2 for “Binding” and “Escape”. “Entry” was better correlated with the PLM scores, with correlation coefficients greater than +0.3 and less than -0.3 for grammaticality and semantic score respectively. “Escape” correlation coefficients were also below 0.2 for all metrics both for the full spike and the RBD. The semantic score showed strong er negative correlations with RBD based DMS r esults of around -0.3 (Figure 4A). Fitted linear regressions improve the correlation coefficient for the “RBD Escape” with the mean embedding and the Wuhan-Hu-1 “RBD Binding” and “RBD Expression” when using sequence logits (Figure 4B). However, linear models failed to improve much upon metric correlations likely due to their inability to exploit non -linear relationships in the high -dimensional representations . SVR regressions improve substantially here, with a correlation coefficient between 0.58 and 0.73 for all experimental measurements aside from “Escape” and “Binding”, although both improve their coefficients over the metric and linear model correlations (Figure 4C). We also used several computational metrics for analysing protein sequences. “ Entropy” was calculated using an alignment of related Sarbecovirus spike sequences. The crystal structure of the spike protein was used to calculate “ Accessibility”, “β-Factor” and “ΔΔG”. “ Accessibility” measures how accessible to the surface of the protein a position is and was calculated with BEpro. “β-Factor” comes from the 6VXX PDB file and is related to the flexibility of the site. “ΔΔG” was calculated from FoldX and measures the free energy and relates to protein stability. “PSST” calculated from Delta Blast and “ESST” calculated from CRESENDO and are measures of mutational likelihood26. The semantic score showed strong negative correlations coefficients of < -0.4 with the computational likelihood metrics “ESST” and “PSST”. Grammaticality scores performed similarly or slightly worse for the probability metrics although the scores were inversely correlated compared to the semantic scores. “β-Factor” had the best correlation with relative grammaticality and relative sequence grammaticality, with both having correlation coefficients greater than 0.4. .CC-BY-NC-ND 4.0 International licenseavailable under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (whichthis version posted July 6, 2024. ; https://doi.org/10.1101/2024.07.05.602129doi: bioRxiv preprint 12 This indicates that grammaticalities (which align well with conservation) may also be a reasonable proxy for assessing protein flexibility. Figure 4. (A) Spearman's Rank correlations between the semantic score, grammaticality and relative sequence grammaticality and traditional metrics. Bars not present in a metric category means the correlation was not found to be significant after a Bonferroni corre ction. (B) Spearman's Rank correlations between the language model metric and the traditional metric. Each pair was fitted using a grid search and a linear regression model, with 5 -fold cross validation. Bars represent the mean .CC-BY-NC-ND 4.0 International licenseavailable under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (whichthis version posted July 6, 2024. ; https://doi.org/10.1101/2024.07.05.602129doi: bioRxiv preprint 13 of the correlations, with the error bar +/ - 1 standard deviation of the correlations. Bars not present in a metric category means the correlation was not found to be significant after a Bonferroni correction. (C) Spearman's Rank correlations with a support vector regression model using an RBF kernel and 5 -fold cross validation. Both Linear and SVR regressions drastically improve correlations for the computational metrics. The Linear model produced correlation coefficients >0.5 for all measurements except “Entropy”, although this might be expected since it’s a site-specific metric rather than mutation specific like DMS. “Accessibility” and “β-Factor” with both logits and embeddings achie ve very strong correlation coefficients of >0.7 while “ΔΔG”, also improves with a co-efficient of >0.5 for logits and embedding features. The SVR regression improves all of the correlation coefficients across the board except for “ Entropy”. “PSST” experiences the greatest increase from ~0.5 to ~0.8 using logits and SVR. We observe that the PLM scores (semantic score, relative grammaticality and relative sequence grammaticality) weakly correlate with all the tested experimentally or computationally derived traditional metrics. However, at least one of the language model me trics significantly correlated with all the other metrics ( Figure 4A). These experiments show that embedding metrics are not simply an alternative way to describe existing metrics. When a significant correlation was achieved, linear regression predictions for logits and embeddings either match or drastically outperform correlations using the model metrics, or their combinations ( Figure 4B). Fitting a SVR regression results in more consistent and higher correlation coefficients across the board, with only “Binding”, “Escape” and “Entropy” achieving correlations of less than 0.5 with logits and embeddings as features ( Figure 4C). Score combinations show little improvement over individual scores. Language models recapitulate evolutionary relationships between related sequences The earliest known SARS -CoV-2 sequence for each PANGO lineage w as extracted from the global data (retrieved from GISAID) and their spike protein sequences embedded using ESM-25,7. The evo-velocity package was then used to produce a n evo-velocity UMAP for the sequence embeddings. Evo -velocity assigns a putative directionality between the embeddings that describes the flow of evolution through the UMAP embedding space. Firstly, evo-velocity27 shows .CC-BY-NC-ND 4.0 International licenseavailable under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (whichthis version posted July 6, 2024. ; https://doi.org/10.1101/2024.07.05.602129doi: bioRxiv preprint 14 that PLM embeddings can distinguish between meaningfully different spike proteins ( Figure 5A and B. Secondly, it illustrates that embeddings can help to understand the evolutionary landscape of spike and the directionality of its evolution. The evo-velocity accurately describes the evolution of SARS-CoV-2 moving from the earlier ‘non- VOC’ clusters into VOC clusters, consistent with their evolution during the pandemic (Figure 5A and B). Omicron and Delta form distinct clusters while Gamma forms a concentrated cluster on the fringes of the non-VOC sequences. Beta and Alpha are less homogenous, and recombinants tend to fall with their parental lineages (primarily Omicrons). Sarbecoviruses (which include bat and pangolin viruses related to SARS -CoV-2) form another distinct cluster close to the early SARS-CoV-2 sequences. These are also identified as the earliest sequences by the evo-velocity pseudotime algorithm (Figure 5B and Supplementary Figure 5). Evo-velocity first uses diffusion analysis to identify root and endpoint sequences before estimating the order of evolution using a pseudotime simulation (Figure 5B and Supplementary Figure 5). Pseudotime analysis achieved a significant (p-value = 3.24e -296) spearman's rank correlation of 0.86 against sampling time, confirming that the model has inferred the evolution of the sequences in the correct order. Evo- velocity recapitulates the topology of the representative sequence phylogeny in Figure 5C, with early VOCs more closely related to non-VOC sequences than Omicrons and recombinants. This shows PLM embeddings capture meaningful representations that differentiate between distinct sequences. The congruence between the phylogenetic tree topology and the evo-velocity derived structure provides further confirmation that this method captures the evolutionary history of the spike protein. PLM metrics also differentiate between distinct SARS-CoV-2 spike proteins. Using the semantic score or the relative grammaticality , a non-VOC, early VOC (Alpha, Beta, Gamma and Delta), and an Omicron cluster can be identified (Figure 5 and Supplementary Figure 7). Unlike the UMAP (Figure 5A and Figure 5B) which is a dimensionality reduction and projection from high - dimensional space, grammaticality and the semantic score are much simpler to produce. Despite this, they still recapitulate much of the information shown in the UMAP. .CC-BY-NC-ND 4.0 International licenseavailable under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (whichthis version posted July 6, 2024. ; https://doi.org/10.1101/2024.07.05.602129doi: bioRxiv preprint 15 Figure 5. (A) UMAP of initial spike sequence embeddings for SARS -CoV-2 PANGO lineages and a selection of other known Sarbecovirus spike sequences. Each lineage is represented by one spike embedding. Points are coloured on VOC classification. Arrows represent the evo -velocity through .CC-BY-NC-ND 4.0 International licenseavailable under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (whichthis version posted July 6, 2024. ; https://doi.org/10.1101/2024.07.05.602129doi: bioRxiv preprint 16 the embedding space, which shows a “directionality” of evolution. (B) shows the sequences coloured by pseudotime inferred using sequence embedding probabilities to order sequences in time using an inferred root and an endpoint. (C) Shows an unrooted nucleo tide phylogenetic tree of the spike sequences, coloured again by VOC. (D) shows the spike protein sequences plotted using their sample date and semantic score coloured consistently to Figure 1 B. Detecting the distinct nature of the variants of concern on emergence The emergence of SARS -CoV-2 and its subsequent variants of concern (VOCs) caused large waves of infections during the pandemic. Predicting their fitness advantage just from their initial sequences, before the viruses were in widespread circulation, has proven incredibly difficult . PLMs could be used to meet this challenge of rapid characterisation of individual pathogen genomes. The PLM metrics produced for each PANGO lineage can be used to assess SARA-CoV-2 variants since they detect differences between sequences. We observed a characteristic "jump” in both semantic score and relative grammaticality between non-VOC, early VOCs and Omicron sequence clusters (Figure 5D and Supplementary Figure 7). Earlier VOCs may have required fewer changes to compete since most of the human population was still naive to infection or vaccine-derived immunity. The Omicron lineage emerged during high levels of both vaccination and infection. It contained many more substitutions in the spike protein than previous variants, it changed its entry mechanism preference and managed to largely evade previous immunity which resulted in an extended vaccination regimen of three doses being recom mended28,29. Unlike the other VOC groups, Omicron sequences initially decreased their semantic score s. The BA.2 variant had fewer mutations than the initial BA.1 variant, however later sequences that increased their number of mutations yet continued to decrease their semantic scores. Clearly the semantic score is therefore not a proxy for mutation count. Once regular sequencing and surveillance are underway, a pipeline for analysing emerging sequences could identify outlie rs that may become future VOCs. Despite PLM scores distinguishing between VOCs well, a larger score does not always mean a variant will be successful. This could be due to measuring against Wuhan-Hu-1, which is now an extinct and increasingly irrelevant SARS-CoV-2 variant. To tackle this, we created a dynamic embedding that represents the average sequence circulating at a given time. This means we can check whether sequences are both divergent from the SARS -CoV-2 reference or from what is currently or .CC-BY-NC-ND 4.0 International licenseavailable under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (whichthis version posted July 6, 2024. ; https://doi.org/10.1101/2024.07.05.602129doi: bioRxiv preprint 17 recently circulating. By looking at a UK subset of sequences, we can assess this approach thanks to the UKs high sequencing capacity paired with well-defined lineage waves. UK sequences were clustered to 99.9% similarity and a representative haplotype sequence for each cluster was embedded using the model. Waves of semantic change take place between all major VOC waves in the UK (Alpha to Delta to BA.1 to BA.2) ( Figure 6). The non -VOC to Alpha transition resulted in sequences deviating by semantic score of >2, although this quickly increased to almost three after the emergence of Delta. Following BA.2, the semantic changes in Omicron have been more incremental, with a gradual increase up to March 2023, followed by a decrease until July 2023. There have been several smaller waves after Omicron’s emergence, which suggests repeated dominance and replacement of consecutive distinct sub -lineages throughout this period. The haplotype embeddings also uncover the large diversity of semantic scores contained outside just the PANGO representative sequences from Figure 5. Alpha and Delta have a large range of semantic scores, equalling some of the most divergent Omicron sequences despite pre -dating them by months (Figure 5C). To a lesser extent, the non-VOC sequences also have some very high semantic scores relative to the average semantic score during the Delta wave. What is clear is that as a new VOC began to take over, sequences diverge very quickly from the average circulating sequence embedding. .CC-BY-NC-ND 4.0 International licenseavailable under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (whichthis version posted July 6, 2024. ; https://doi.org/10.1101/2024.07.05.602129doi: bioRxiv preprint 18 Figure 6. (A) UK SARS -CoV-2 spike sequences through the pandemic. Each point represents a sequence cluster with 99.9% sequence similarity. Dynamic semantic scores were calculated for each sequence cluster, with the black line showing the mean sliding score. (B) Rel ative sequence grammaticalities for each of the haplotype spikes. (C) Semantic scores for each of the haplotype spikes.

Discussion

ESM-2 models spike protein constraints, mapping low relative grammaticalities to the S2 high relative grammaticalities to the S1 which is heavily targeted by host antibodies (NTD, RBD)9,18,30– 32 and predominantly on the protein surface (Figure 2). Hie et al 6 use the semantic score as a proxy for antigenic change, however, this is difficult to reason with for protein sites that are not involved in antibody binding. Since ESM -2 is trained on a wide diversity of proteins, semantic scores likely represent shifts in structure/function rather than antigenic change. Many high semantically ranked changes occur in the NTD ( Figure 2B) which contains a “supersite” with several epitopes targeted by neutralising antibodies9. The NTDs low relative grammaticality and .CC-BY-NC-ND 4.0 International licenseavailable under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (whichthis version posted July 6, 2024. ; https://doi.org/10.1101/2024.07.05.602129doi: bioRxiv preprint 19 high semantic score suggests it allows large structural changes to alter its antigenic properties without greatly affecting function. PLMs can detect epistatic interactions and quantify the impact of amino acid substitutions on other sites (Figure 3). Antigenic substitutions (e.g., E484A) have localised effects while mutations in the spike core have much longer -range effects across the protein ( Figure 3 and Supplementary Figure 3). Training models to detect inter-protomer interactions could allow for better identification of epistatic positions on multi-meric proteins. The relative sequence grammaticality shows many mutations (including many VOC substitutions including D614G) are more likely than the reference when compared to relative grammaticality (Figure 2B and Figure 3B). N969K maintains a negative relative grammaticality but may represent a fitness trade off as it appears to destabilise the proteins post-fusion conformation15 while stabilising it pre -fusion context21. The identification of constrained sites by the PLM is another useful application of these models. These sites are predominantly amino acids with important structural features. Proline is a rigid amino acid due to its side -chain which reduces its flexibil ity33,34 and prevents it from forming stable a -helices33. Prolines are often found where sharp turns are necessary for structure. Cystines thiol side-chain allows them to form covalent di-sulphide bridges between adjacent cystines35. Cystines are well conserved throughout proteins 35, and several of the identified cysteines were involved in di - sulphide bridges (Figure 3C and Figure 3D). This shows PLMs identify constraints positions are under due to their functional and structural importance within sequences. Experimental data can validate that the embeddings contain relevant biological and evolutionary information (Figure 4). ESM-2 is a good predictor of the computational and experimental metrics used to understand how mutations impact protein structures. Most model scores do not correlate strongly with any one feature, suggesting that PLM scores are telling us something different about proteins. PLMs are now being used for DMS and variant effect prediction tasks across an impressive array of datasets, and often outcompete other methods 36. They are helping improve the effectiveness of antibodies 37 and even generate new proteins 38. Our results here affirm that these models are useful and can be applied to great effect to understand novel viral pathogens. Evo-velocity captures differences between VOC sequences , recapitulates the topology of the nucleotide sequence phylogeny and reconstructs the direction of evolution accurately (Figure 5). The PLM scores can effectively group lineages into their VOC categories ( Figure 5D, .CC-BY-NC-ND 4.0 International licenseavailable under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (whichthis version posted July 6, 2024. ; https://doi.org/10.1101/2024.07.05.602129doi: bioRxiv preprint 20 Supplementary Figure 6 and Supplementary Figure 7). Some Delta sequences have higher relative sequence grammaticalities than the reference SARS-CoV-2 sequence (Figure 3B), which suggests Delta may more “fit” than other SARS-CoV-2 sequences when all likelihoods are considered. The metrics also highlight outlier sequences, although these might be related to factors beyond sequence properties such as epidemiology which make variant prediction difficult for both experimental and computational methods. Dynamic embedding references can assist with modelling the shifting immune landscapes SARS -CoV-2 is evolving within. Emergent antigenically distinct variants are likely to cause repeated infection by SARS -CoV-239. Consequences of this are quickly becoming apparent through the necessity for updated vaccines, with recent variants evading previously neutralising antibodies 40,41. We capture major VOC transitions but also incremental variants like JN.1. The heavily mutated BA.2.86 lineage acquired an L455S substitution 42 to produce JN.1 , which subsequently became the dominant lineage circulating. Its identification by the dynamic reference score highlights the approach as a potentially viable horizon scanning method that does not simply count mutations, but accurately models them. In conclusion, applied to SARS-CoV-2 we show that the grammaticality and semantic score PLM metrics reveal characteristic properties of the SARS-CoV-2 spike protein sequence. While there are clear areas for improvement, PLMs represent a paradigm shift for interpreting variation within protein sequences. Already, there are efforts being made to improve these models such as recent examples from Ito et al. 43 and Thadani et al. 44 where additional in formation can be used to supplement language models and improve their predictive performance. To summarise, using just a single spike sequence we were able to determine the likely regions of variability in the spike protein, and identify regions where mutations were most likely to impact the structure and function of the protein. We describe how PLMs can map out evolutionary landscapes, identify epistatic effects and provide a method for horizon scanning of viral sequences of interest. This simply was not possible before. PLMs offer an exciting glimpse into the future of language modelling within biology, and we should continue to press on with understanding the possibilities as well as the

Limitations

of what these models can do. .CC-BY-NC-ND 4.0 International licenseavailable under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (whichthis version posted July 6, 2024. ; https://doi.org/10.1101/2024.07.05.602129doi: bioRxiv preprint 21

Methods

ESM-2 and in silico DMS SARS-CoV-2 spike proteins were acquired by filtering the GISAID database (https://gisaid.org) for the earliest sequences from each PANGO lineage with a fully intact spike protein sequence. The sequence was then embedded in the ESM -2 model to produce an embedding for the sequence. For the DMS data and for the embedding scores, the ESM -2 three billion parameter variant was used. The semantic score is equivalent to the L1 (Manhattan) distance between the embedding of the reference sequence (the spike protein from the original Wuhan -Hu-1 SARS- CoV-2 genome) and each of the PANGO spike proteins. The grammaticality of a sequence is calculated as the product of the probabilities of each amino acid at each position in the spike protein. The probabilities come from a softmax of the last layer o f the embedding and range between 0 and 1. However, many of the probabilities are small and for numerical stability the probabilities are represented in the log space. Relative grammaticality is the same as the grammaticality, except the probability of a reference sequence is subtracted from the probability of the variant sequence so that the score is relative , in this case, to the SARS-CoV-2 reference sequence Wuhan-Hu-1. The sequence grammaticality represents the summed log-likelihoods of every reference position in the sequence, rather than just the mutated posit ions. For the DMS computations, a sequence was produced for every potential amino acid at every position in the spike protein. Each sequence was then embedded to calculate semantic scores and relative grammaticalities for every mutation. Evo-velocity We used the evo -velocity package 27 to embed the initial sequences using the ESM -2 650M parameter variant, and then performed velocity analysis and spearman’s rank for the sample dates against the pseudo-time. The smaller model was used primarily due to hardware limitations of using the larger model, although in a number of cases this model has been show to equal or even do better than the larger model. Epistasis Experiments The epistasis experiments used a SARS-CoV-2 BA.1 spike protein sequence and embedding this using the ESM -2 three billion parameter variant. Due to the EPE insertion, the logits for this position were subsequently removed to map logits to spike’s three-dimensional protein structure. .CC-BY-NC-ND 4.0 International licenseavailable under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (whichthis version posted July 6, 2024. ; https://doi.org/10.1101/2024.07.05.602129doi: bioRxiv preprint 22 The BA.1 sequence contains several mutations relative to the Wuhan -Hu-1 SARS -CoV-2

Reference

spike sequence. Each of these mutations was reverted one by one back to the

Reference

position, and the likelihood differences upon reversion were recorded. To eliminate noise, changes less than two standard deviations from the mean across all reverted mutations were removed and deemed not significant. Dynamic embeddings and horizon scanning Dynamic embeddings were computed by first gathering UK GISAID data and clustering sequences into variants with 99.9% similarity. These haplotype variant clusters produced 11,272 sample date labelled haplotypes with a sequence returned for each cluster. Each haplotype embedding was measured against a mean of the embeddings from the prior three-month period using the L1 distance, i.e., the semantic score. For sequence grammaticalities, mean sequence grammaticalities were calculated in a similar way and differences measured. Assessing embeddings scores with known metrics The language model’s metrics were first assessed using a Spearman’s rank against several known biological scores. Next, Support Vector Regression (SVR) was used as a simple model to fit the model scores as well as the embeddings and logits to the biologically relevant metrics. Models were fit to the data using five-fold cross validation, and a linear kernel for the SVR. Model

Results

were reported as the average spearman’s rank between the folds, with the error bars as +/-1 standard deviation from the mean. Selection analysis signals Signals of ancestral evolutionary selection in the animal (bat and pangolin) sarbecovirus most closely related to SARS-CoV-2 referred to as the “nCoV” clade (see Lytras et al.45) were inferred on a set of 167 sarbecovirus genomes, accounting for recombination by inferring selection separately in each non-recombinant segment. These results are published in Martin et al. 46 and presented in more detail in the following Observable notebook: https://observablehq.com/@spond/ncos-evolution-nov-2021. Sites under negative selection were inferred using the FEL and sites under positive selection using MEME 47 by testing on internal branches of the nCoV clade. Sites denoted as conserved have the same amino -acid residue among all sarbecovirus sequences in the analysis. In addition, the "evolvability" of the site in the SARS-CoV-2 sequence was obtained as the entropy of the predicted distribution of credible evolutionary states. .CC-BY-NC-ND 4.0 International licenseavailable under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (whichthis version posted July 6, 2024. ; https://doi.org/10.1101/2024.07.05.602129doi: bioRxiv preprint 23 Antibody accessibility, spike protein stability and Deep Mutational Scanning Structure-based epitope score, referred to as “accessibility”, which approximates antibody accessibility for each spike protein amino acid position, was calculated using BEpro software 48 for a Woo et al. model of the Wuhan spike. Scores relating to substitution probabilities, namely, ESST probability, Log PSS and ΔΔG, were obtained for every possible single amino acid substitution for the 6VXX SARS -CoV-2 spike structure (note that only values for Chain A are included in the results as data is generally identical across all three chains). ESST probability values were calculated using Environment Specific Substitution Table (ESST) with CRESCENDO49 software. Log Likelihood substitution values were calculated using Position Specific Scoring Matrices (PSSM) with the DELTA -BLAST50 algorithm in BlastX. It should be noted that this is a sequence-based method, so residue numbering does not match the numbering in 6VXX and values are available for residues not described by the 6VXX PDB file. ΔΔG values were obtained by FoldX 51 software. Substitutions were performed on all three chains simultaneously and the change in Gibbs free energy was calculated, with negative values implying stabilising mutations. Deep mutational scanning data The receptor binding deep mutational scanning data was taken from experimental DMS studies of the SARS -CoV-2 from Yisimayi et al. (2024) and Starr et al. (2022) 24,25. The Wuhan -Hu-1 scores were taken as is, while the variant average score was calculated by averaging the scores for each position between each of the SARS-CoV-2 variant specific DMS results25. For the RBD mutational escape values we utilised the high -throughput mutation antibody escape profiling

Results

presented in Yisimayi et al. (2024)24 This study used a panel of 1,350 monoclonal antibodies against all possible RBD substitutions. The backbone virus used was the SARS-CoV- 2 BA.5 variant instead of Wuhan-Hu-1, however this should still provide the most comprehensive dataset of unique substitutions’ effect on antibody escape. Our mutational escape metric is the average of the raw antibody escape values for each substitution on each site of the RBD across all tested monoclonal antibodies. The full spike DMS data was taken from Dadonaite et al. (2023) and DMS scores (“Binding”, “Escape” and “Entry”) are from a BA.2 and XBB.1.5 spike backbone and averaged together to give the presented score23. .CC-BY-NC-ND 4.0 International licenseavailable under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (whichthis version posted July 6, 2024. ; https://doi.org/10.1101/2024.07.05.602129doi: bioRxiv preprint 24 Data Availability The initial sequences for each PANGO lineage in Figure 5 were extracted from GISAID , their accession numbers can be found here: https://doi.org/10.55876/gis8.240620pm. The SARS-CoV- 2 spike sequences used for Figure 6 are from GISAID, their accession numbers can be found here: https://doi.org/10.55876/gis8.240621ma. The code as well as data can be found on GitHub: https://github.com/kieran12lamb/PLM_SARS-CoV-2.

Acknowledgements

We gratefully acknowledge all data contributors, i.e., the Authors and their Originating laboratories responsible for obtaining the specimens, and their Submitting laboratories for generating the genetic sequence and metadata and sharing via the GISAID Ini tiative, on which this research is based. We thank James Herzig and Simon Lovell for providing the protein stability changes data. Thanks for the neural network icon made by Becris from www.flaticon.com. The authors acknowledge funding from the UK Medical Research Council (MRC, MC_UU_12014/12, MC_UU_00034/5, MR/V01157X/1 and a Doctoral Training Programme in Precision Medicine studentship for KDL, MR/N013166/1), the Wellcome Trust (220977/Z/20/Z), the UK Research and Innovation (UKRI) to the G2P -UK consortium (MR/W005611/1) and G2P2 consortium (MR/Y004205), and the COVID-19 Genomics UK Consortium (COG -UK), which was supported by funding from the MRC, part of UKRI, the UK National Institute of Health and Care Research (MC_PC_19027) and Genome Research Limited, operating as the Wellcome Sanger Institute. For the purpose of open access, the author has applied a Creative Commons Attribution (CC BY) licence to any Author Accepted Manuscript version arising.

References

1. Markov, P. V. et al. The evolution of SARS-CoV-2. Nat Rev Microbiol 21, 361–379 (2023). 2. Harari, S. et al. Drivers of adaptive evolution during chronic SARS-CoV-2 infections. Nat Med 28, 1501–1508 (2022). 3. Elliott, P. et al. Exponential growth, high prevalence of SARS-CoV-2, and vaccine effectiveness associated with the Delta variant. Science 374, eabl9551 (2021). .CC-BY-NC-ND 4.0 International licenseavailable under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (whichthis version posted July 6, 2024. ; https://doi.org/10.1101/2024.07.05.602129doi: bioRxiv preprint 25 4. Viana, R. et al. Rapid epidemic expansion of the SARS -CoV-2 Omicron variant in southern Africa. Nature 603, 679–686 (2022). 5. Lin, Z. et al. Evolutionary-scale prediction of atomic -level protein structure with a language model. Science 379, 1123–1130 (2023). 6. Hie, B., Zhong, E. D., Berger, B. & Bryson, B. Learning the language of viral evolution and escape. Science 371, 284–288 (2021). 7. Rives, A. et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc. Natl. Acad. Sci. U.S.A. 118, e2016239118 (2021). 8. Tiberti, M. et al. MutateX: an automated pipeline for in silico saturation mutagenesis of protein structures and structural ensembles. Briefings in Bioinformatics 23, bbac074 (2022). 9. Lok, S.-M. An NTD supersite of attack. Cell Host & Microbe 29, 744–746 (2021). 10. Marquet, C. et al. Embeddings from protein language models predict conservation and variant effects. Hum Genet 141, 1629–1647 (2022). 11. Anishchenko, I., Ovchinnikov, S., Kamisetty, H. & Baker, D. Origins of coevolution between residues distant in protein 3D structures. Proceedings of the National Academy of Sciences 114, 9122–9127 (2017). 12. Zhang, J. et al. Structural impact on SARS -CoV-2 spike protein by D614G substitution. Science 372, 525–530 (2021). 13. Schröder, S. et al. Characterization of intrinsic and effective fitness changes caused by temporarily fixed mutations in the SARS -CoV-2 spike E484 epitope and identification of an epistatic precondition for the evolution of E484A in variant Omicron. Virology Journal 20, 257 (2023). 14. Peacock, T. P. et al. The altered entry pathway and antigenic distance of the SARS -CoV-2 Omicron variant map to separate domains of spike protein. 2021.12.31.474653 Preprint at https://doi.org/10.1101/2021.12.31.474653 (2022). 15. Yang, K. et al. Structure-based design of a SARS -CoV-2 Omicron -specific inhibitor. Proceedings of the National Academy of Sciences 120, e2300360120 (2023). 16. Harvey, W. T. et al. SARS-CoV-2 variants, spike mutations and immune escape. Nat Rev Microbiol 19, 409–424 (2021). 17. Wang, P. et al. Antibody resistance of SARS-CoV-2 variants B.1.351 and B.1.1.7. Nature 593, 130–135 (2021). 18. Liu, Z. et al. Identification of SARS -CoV-2 spike mutations that attenuate monoclonal and serum antibody neutralization. Cell Host & Microbe 29, 477-488.e4 (2021). .CC-BY-NC-ND 4.0 International licenseavailable under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (whichthis version posted July 6, 2024. ; https://doi.org/10.1101/2024.07.05.602129doi: bioRxiv preprint 26 19. Korber, B. et al. Tracking Changes in SARS -CoV-2 Spike: Evidence that D614G Increases Infectivity of the COVID-19 Virus. Cell 182, 812-827.e19 (2020). 20. Hou, Y. J. et al. SARS-CoV-2 D614G variant exhibits efficient replication ex vivo and transmission in vivo. Science 370, 1464–1468 (2020). 21. McCallum, M. et al. Structural basis of SARS-CoV-2 Omicron immune evasion and receptor engagement. Science 375, 864–868 (2022). 22. Brandes, N., Goldman, G., Wang, C. H., Ye, C. J. & Ntranos, V. Genome -wide prediction of disease variant effects with a deep protein language model. Nat Genet 55, 1512–1522 (2023). 23. Dadonaite, B. et al. Full-spike deep mutational scanning helps predict the evolutionary success of SARS -CoV-2 clades. 2023.11.13.566961 Preprint at https://doi.org/10.1101/2023.11.13.566961 (2023). 24. Yisimayi, A. et al. Repeated Omicron exposures override ancestral SARS -CoV-2 immune imprinting. Nature 625, 148–156 (2024). 25. Starr, T. N. et al. Shifting mutational constraints in the SARS-CoV-2 receptor-binding domain during viral evolution. Science 377, 420–424 (2022). 26. Walls, A. C. et al. Structure, Function, and Antigenicity of the SARS -CoV-2 Spike Glycoprotein. Cell 181, 281-292.e6 (2020). 27. Hie, B. L., Yang, K. K. & Kim, P. S. Evolutionary velocity with protein language models predicts evolutionary dynamics of diverse proteins. cels 13, 274-285.e6 (2022). 28. Nemet Ital et al. Third BNT162b2 Vaccination Neutralization of SARS -CoV-2 Omicron Infection. New England Journal of Medicine 386, 492–494 (2022). 29. Lauring, A. S. et al. Clinical severity of, and effectiveness of mRNA vaccines against, covid - 19 from omicron, delta, and alpha SARS -CoV-2 variants in the United States: prospective observational study. BMJ 376, e069761 (2022). 30. Cerutti, G. et al. Potent SARS-CoV-2 neutralizing antibodies directed against spike N-terminal domain target a single supersite. Cell Host & Microbe 29, 819-833.e7 (2021). 31. Cui, Z. et al. Structural and functional characterizations of infectivity and immune evasion of SARS-CoV-2 Omicron. Cell 185, 860-871.e13 (2022). 32. McCallum, M. et al. N-terminal domain antigenic mapping reveals a site of vulnerability for SARS-CoV-2. Cell 184, 2332-2347.e16 (2021). 33. Watanabe, K. & Suzuki, Y. Protein thermostabilization by proline substitutions. Journal of Molecular Catalysis B: Enzymatic 4, 167–180 (1998). .CC-BY-NC-ND 4.0 International licenseavailable under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (whichthis version posted July 6, 2024. ; https://doi.org/10.1101/2024.07.05.602129doi: bioRxiv preprint 27 34. Choi, E. J. & Mayo, S. L. Generation and analysis of proline mutants in protein G. Protein Engineering, Design and Selection 19, 285–289 (2006). 35. Wong, J. W. H., Ho, S. Y. W. & Hogg, P. J. Disulfide Bond Acquisition through Eukaryotic Protein Evolution. Molecular Biology and Evolution 28, 327–334 (2011). 36. Livesey, B. J. & Marsh, J. A. Updated benchmarking of variant effect predictors using deep mutational scanning. Mol Syst Biol 19, e11474 (2023). 37. Hie, B. L. et al. Efficient evolution of human antibodies from general protein language models. Nat Biotechnol 42, 275–283 (2024). 38. Madani, A. et al. Large language models generate functional protein sequences across diverse families. Nat Biotechnol 41, 1099–1106 (2023). 39. Wei, J. et al. Risk of SARS-CoV-2 reinfection during multiple Omicron variant waves in the UK general population. Nat Commun 15, 1008 (2024). 40. Andrews, N. et al. Covid-19 Vaccine Effectiveness against the Omicron (B.1.1.529) Variant. New England Journal of Medicine 386, 1532–1546 (2022). 41. Carabelli, A. M. et al. SARS-CoV-2 variant biology: immune escape, transmission and fitness. Nat Rev Microbiol 21, 162–177 (2023). 42. Yang, S. et al. Fast evolution of SARS-CoV-2 BA.2.86 to JN.1 under heavy immune pressure. The Lancet Infectious Diseases 24, e70–e72 (2024). 43. Ito, J. et al. A Protein Language Model for Exploring Viral Fitness Landscapes. 2024.03.15.584819 Preprint at https://doi.org/10.1101/2024.03.15.584819 (2024). 44. Thadani, N. N. et al. Learning from prepandemic data to forecast viral escape. Nature 622, 818–825 (2023). 45. Lytras, S. et al. Exploring the Natural Origins of SARS-CoV-2 in the Light of Recombination. Genome Biology and Evolution 14, evac018 (2022). 46. Martin, D. P. et al. Selection Analysis Identifies Clusters of Unusual Mutational Changes in Omicron Lineage BA.1 That Likely Impact Spike Function. Molecular Biology and Evolution 39, msac061 (2022). 47. Murrell, B. et al. Modeling HIV-1 Drug Resistance as Episodic Directional Selection. PLOS Computational Biology 8, e1002507 (2012). 48. Sweredoski, M. J. & Baldi, P. PEPITO: improved discontinuous B-cell epitope prediction using multiple distance thresholds and half sphere exposure. Bioinformatics 24, 1459–1460 (2008). .CC-BY-NC-ND 4.0 International licenseavailable under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (whichthis version posted July 6, 2024. ; https://doi.org/10.1101/2024.07.05.602129doi: bioRxiv preprint 28 49. Chelliah, V., Chen, L., Blundell, T. L. & Lovell, S. C. Distinguishing Structural and Functional Restraints in Evolution in Order to Identify Interaction Sites. Journal of Molecular Biology 342, 1487–1504 (2004). 50. Boratyn, G. M. et al. Domain enhanced lookup time accelerated BLAST. Biology Direct 7, 12 (2012). 51. Joost Schymkowitz et al. The FoldX web server: an online force field. Nucleic Acids Research 33, W382–W388 (2005). .CC-BY-NC-ND 4.0 International licenseavailable under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (whichthis version posted July 6, 2024. ; https://doi.org/10.1101/2024.07.05.602129doi: bioRxiv preprint 29 Supplementary figures Supplementary Figure 1. Swarm plots for the distributions of entropy and grammaticality for each of the SARS-CoV-2 spike protein subunits S1 and S2. The black line shows the mean while the green shows the median. For both entropy and grammaticality, the S2 subunit has on average lower scores compared to the S1. .CC-BY-NC-ND 4.0 International licenseavailable under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (whichthis version posted July 6, 2024. ; https://doi.org/10.1101/2024.07.05.602129doi: bioRxiv preprint 30 Supplementary Figure 2. DMS plot with only the substitutions observed in SARS-CoV-2 VOC sequences highlighted. Most of the mutations occur in the spike protein S1 region which the model predicts has a higher likelihood of mutations , i.e., a higher grammaticality . .CC-BY-NC-ND 4.0 International licenseavailable under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (whichthis version posted July 6, 2024. ; https://doi.org/10.1101/2024.07.05.602129doi: bioRxiv preprint 31 A B Supplementary Figure 3. (A) Boxplots showing the distribution of distances between mutations and the positions affected by mutations in each subunit. (B) Boxplots showing the distribution of distances between mutations and the positions for each mutation. .CC-BY-NC-ND 4.0 International licenseavailable under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (whichthis version posted July 6, 2024. ; https://doi.org/10.1101/2024.07.05.602129doi: bioRxiv preprint 32 Supplementary Figure 4. Number of sites with a significant (+/ -2 deviations from mean) change in probability for each SARS-CoV-2 BA.1 reversion change. Supplementary Figure 5. Predicted root nodes and endpoints identified by running Markov diffusion process over the weighted edges of the evo -velocity network. The root nodes are correctly identified as the Sarbecovirus spike sequences, with Omicron VOC sequences predicted as the end nodes. .CC-BY-NC-ND 4.0 International licenseavailable under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (whichthis version posted July 6, 2024. ; https://doi.org/10.1101/2024.07.05.602129doi: bioRxiv preprint 33 Supplementary Figure 6. SARS-CoV-2 Pango lineage representative sequences plotted by their semantic scores and relative grammaticalities. Supplementary Figure 7. SARS-CoV-2 Pango lineage representative sequences plotted by their relative grammaticalities against sampling date. .CC-BY-NC-ND 4.0 International licenseavailable under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (whichthis version posted July 6, 2024. ; https://doi.org/10.1101/2024.07.05.602129doi: bioRxiv preprint

Text is read by the "Ask this paper" AI Q&A widget below. Extraction quality varies by source — PMC NXML preserves structure cleanly, OA-HTML may include some navigation residue, and OA-PDF can have broken hyphenation. The publisher copy (via DOI) is the canonical version.

My notes (saved in your browser only)

⚙ Ask this paper AI returns verbatim quotes from the full text · source: oa-pdf ⓘ

Answers must be backed by verbatim quotes from this paper's full text. Hallucinated quotes are dropped automatically; if no verbatim passage answers the question, we say so. How this works

Citation neighborhood (no data yet)

We don't have any in-corpus citations linked to this paper yet. This is a recent paper (2024) — citers typically take a year or two to land, and the OpenAlex reference graph may still be filling in.

Source provenance

europepmc: last seen: 2026-05-20T01:45:00.602351+00:00
unpaywall: last seen: 2026-05-21T05:10:58.409756+00:00

License: CC-BY-NC-ND-4.0