EvoAI enables extreme compression and reconstruction of the protein sequence space

doi:10.21203/rs.3.rs-3930833/v1

2024 · doi:10.21203/rs.3.rs-3930833/v1

Shuyi Zhang, Ziyuan Ma, Wenjie Li, Yunhao Shen, Yunxin Xu, Gengjiang Liu, and 8 more

This is a preprint; it has not been peer reviewed by a journal.

https://doi.org/10.21203/rs.3.rs-3930833/v1

This work is licensed under a CC BY 4.0 License.

Status: Published. Journal publication published 11 Nov, 2024 in Nature Methods. Version 1 posted; this is the latest preprint version.

Abstract

Designing proteins with improved functions requires a deep understanding of how sequence and function are related, a vast space that is hard to explore. The ability to efficiently compress this space by identifying functionally important features is extremely valuable.
Here, we first establish a method called EvoScan to comprehensively segment and scan the high-fitness sequence space to obtain anchor points that capture its essential features, especially in high dimensions. Our approach is compatible with any biomolecular function that can be coupled to a transcriptional output. We then develop deep learning and large language models to accurately reconstruct the space from these anchors, allowing computational prediction of novel, highly fit sequences without prior homology-derived or structural information. We apply this hybrid experimental-computational method, which we call EvoAI, to a repressor protein and find that only 82 anchors are sufficient to compress the high-fitness sequence space with a compression ratio of 10^48. The extreme compressibility of the space informs both applied biomolecular design and understanding of natural evolution.

Subjects: Biological sciences/Systems biology/Synthetic biology; Biological sciences/Biological techniques; Biological sciences/Computational biology and bioinformatics; Biological sciences/Systems biology/Molecular engineering/Synthetic biology

Main

Protein engineering and design can create proteins with optimized functions for various applications in biotechnology, medicine, and synthetic biology 1–3. The fundamental challenge of protein engineering is to understand and manipulate the protein fitness landscape, which is a high-dimensional and complex space that contains a vast number of possible sequences and functions 4,5. Although there have been considerable attempts over the past several decades to search this space for high-fitness sequences, we have only scratched the surface of understanding the rules and features of the space 6–9.
Experimental methods using directed evolution techniques, such as deep mutational scanning 10 , 11 , site-saturated mutagenesis 12 , 13 , and random library construction 14 , 15 , can provide valuable information, but they are laborious and time-consuming to scale up and typically must trade off accuracy and precision with sequence space coverage. These experimental methods are also usually restricted to low-dimensional mutations that do not take into account the natural selection pressure that shapes the protein fitness landscape in high-dimensional space. Advanced directed evolution tools that support the necessary scale, such as phage-assisted continuous evolution (PACE) 16 , 17 or OrthoRep 18 , provide information primarily about trajectories that lead to high-fitness variants, which is insufficient to model the fitness landscape in its entirety. Computational methods, such as structure or sequence-based modeling of the protein fitness landscape 19 – 25 , can evaluate larger sequence spaces but are limited by the availability and quality of training data, especially for proteins with few homologs or no structure information. These computational methods also typically do not account for other biological factors that affect the protein function, such as in vivo interactions or post-translational modifications. An ideal approach to understanding and navigating this space for design and engineering purposes would use comprehensive high-throughput experimental data to inform efficient computational models. It was shown that high-throughput short sequencing data from directed evolution experiments can enable machine learning methods to reconstruct the full-length genotype and identify high-fitness variants 26 . Furthermore, it has been demonstrated that deep learning models for protein design can benefit from even a limited number of functionally characterized variants 27 . 
A recent work demonstrated that the protein fitness landscape is rugged with many local peaks but still easily navigable 28. We view these functional variants or local peaks as key “anchor” points that capture the features of high-fitness genotype space. We hypothesize that the design space for high-fitness genotypes can be effectively compressed by identifying a sufficient number of these “anchor” points to capture all the essential features, which can then instruct deep learning models to reconstruct and explore the whole space. However, no existing method can generate these anchors in a rapid and comprehensive way, especially for anchors from the high-dimensional space. Such a method would need to capture functional information about variants evenly distributed across protein sequence space in a very high-throughput manner.

Here, we present EvoAI, a novel approach to empirically interrogate, then model, compress, and reconstruct the sequence space. Our approach combines high-throughput experimental evolution and computational methods to capture and learn from the essential features of the space. We first developed an evolutionary scanning method that adapts phage-assisted non-continuous evolution (PANCE) 17 by incorporating a segmented mutagenesis system based on EvolvR 29. Compared to traditional methods, this method enabled rapid and thorough evolutionary scanning from low to high dimensions and captured valuable fitness anchors. We then developed a deep learning and large language model to reconstruct the sequence space from these anchors and design new proteins with more than 10-fold improved activity compared to wild-type. For a repressor protein, we demonstrated that this vast design space can be compressed by a factor of 10^48 to 82 points.

Results

The evolutionary scanning method

The M13 bacteriophage has a single-stranded DNA genome, but it generates a double-stranded form after infecting the host cell 30 (Fig. 1a).
We reasoned that this should allow the targeted CRISPR-guided DNA polymerase mutagenesis system (TP) to introduce mutations into the M13 phage genome for selection and evolution 29. Here, the expression of the nCas9-PolI complex was controlled by the vanillic acid-inducible VanR-pVanA expression system, which has a large induction fold change and low background expression and is suitable for expressing large and highly toxic proteins 31,32. The evolution target was inserted into the M13 genome in place of gIII (which encodes pIII, the minor coat protein of M13 that is required for infectivity) to generate the selection phage (SP). The accessory plasmid (AP) expresses guide RNAs (gRNAs) that target different regions of the gene of interest for mutagenesis. The AP also contains gIII under the control of a genetic circuit that links the function of the gene of interest to the expression of gIII. This allows the selection of phages with improved and high-fitness protein function during phage propagation, while phages with non-functional genes are eliminated after dilution (Fig. 1a). We named this system EvoScan (Evolutionary Scanning).

EvoScan can explore specific regions of the fitness landscape to generate valuable anchors. These anchors are obtained by using different gRNAs to divide the target gene into defined segments, thus reducing the dimensionality of the fitness space. Moreover, the combination of different gRNAs through serial propagation on host cells bearing different APs enables the scanning and identification of anchors in higher dimensions, which can capture more details of the protein sequence space. To investigate and scan the protein sequence space, we validated and used this system to study three proteins with diverse functions: an EGFP-specific nanobody for protein-protein interaction; SARS-CoV-2 Mpro and its inhibitors for protein-ligand interaction; and AmeR and its DNA operator for protein-nucleic acid interaction.

Validation of EvoScan and rapid identification of anchors in nanobody

To validate EvoScan and apply this system to proteins involved in protein-protein interaction, we chose antigen-antibody interaction, in this case, EGFP and its cognate nanobody 33. We first established a reverse two-hybrid system (RTHS) that coupled the nanobody-EGFP interaction to the expression of gIII. We fused EGFP to the cI434 repressor, and its nanobody to cIp22, which can interact with cI434 but not with itself 34. The gene encoding nanobody-cIp22 was inserted into the phage genome to replace gIII. The gene encoding EGFP-cI434 was integrated into the AP and transformed into E. coli. After phage infection, interaction between EGFP and nanobody will enable the interaction between cI434 and cIp22 to form a tetramer complex and inhibit the p434 promoter (Fig. 1b, 1c). In the AP, a transcriptional repressor PhlF was placed downstream of the p434 promoter, and gIII was placed under the control of the pPhlF promoter, such that interaction between EGFP and the nanobody will eventually induce the expression of gIII and allow phage propagation (Fig. 1b). We tested several combinations of ribosome binding sites and chose P3 RBS for PhlF and B0064 for gIII (Fig. 1d). This circuit propagated phage carrying EGFP nanobody while limiting the propagation of empty phage.

To test whether EvoScan could quickly identify a fitness-increasing protein variant “anchor” site, we artificially disrupted the interaction between EGFP and nanobody by introducing the E103K mutation in the CDR3 of the nanobody, which is essential for binding to its target (Fig. 1c, Fig. 1e). We designed four different gRNAs targeting different segments of the nanobody gene, with gRNA3 designed to target the segment containing the E103K mutation site of the nanobody (Fig. 1f).
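The selection logic of the reverse two-hybrid circuit described above is a two-layer repressor cascade, which can be sketched as Boolean logic. This is an idealized on/off abstraction for illustration only: the function name is hypothetical, and real promoters respond in a graded, leaky way rather than as perfect switches.

```python
def giii_expressed(egfp_nanobody_bind: bool) -> bool:
    """Idealized Boolean model of the nanobody selection circuit.

    Binding -> cI434/cIp22 complex forms -> p434 repressed -> no PhlF
    -> pPhlF active -> gIII expressed -> phage can propagate.
    """
    # The cI434/cIp22 complex forms only when EGFP and the nanobody bind,
    # and that complex represses the p434 promoter.
    p434_active = not egfp_nanobody_bind
    # PhlF sits downstream of p434, so it is made only when p434 is active.
    phlf_present = p434_active
    # PhlF represses pPhlF, the promoter that drives gIII.
    pphlf_active = not phlf_present
    return pphlf_active


for binds in (True, False):
    print(f"binding={binds} -> gIII expressed={giii_expressed(binds)}")
```

Because the cascade inverts the signal twice, gIII expression tracks binding directly: a binding-deficient nanobody such as E103K leaves PhlF on and gIII off in this idealized picture.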
After two passages in EvoScan, we observed that only the group with gRNA3 targeting the E103K segment showed increased phage titer, while the other three groups all decreased (Fig. 1g). Sequencing results of the phage supernatant confirmed that in the gRNA3 group, the E103K mutation had reverted to glutamate. This validated that EvoScan can successfully and efficiently identify anchors that play important roles in protein function.

For comparison, we also implemented a traditional phage-assisted non-continuous evolution system (PANCE) using the same E103K phage (Extended Data Fig. 1a). The two systems differed only in the use of targeted (EvoScan, TP) or non-targeted (PANCE, MP6) mutagenesis. After 8 passages in PANCE, no consensus mutations were found in the nanobody gene. Interestingly, an N29D single mutation appeared on cIp22 (Extended Data Fig. 1b, Fig. 1g), which disrupted the selection pressure on nanobody function due to the strong self-interaction between the two cI repressors (Extended Data Fig. 1c, 1d). These results further demonstrated that EvoScan can rapidly guide the evolution for precise searching of target proteins, even in the context of a more likely background mutation that could interfere with the desired evolution process.

Thorough identification of anchors reveals novel Mpro drug resistant variants

We next applied EvoScan to investigate protein-ligand interaction. In this case, we chose Mpro, a crucial protease in the SARS-CoV-2 virus 35,36. Several Mpro inhibitors have been developed and used to treat COVID-19 patients, such as GC376 37 and PF-07321332 38, which is a key component of Paxlovid. However, the rapid mutation of SARS-CoV-2 may reduce or even eliminate the efficacy of these drugs. Previous studies have identified mutational hotspots for drug resistance but have not comprehensively profiled the Mpro drug resistance fitness landscape 39,40.
It is important to thoroughly study possible escape mechanisms of Mpro in order to inform future drug development efforts. Here, we used EvoScan to systematically identify and extract key anchors from different regions of Mpro that affect its interaction with small molecule inhibitors.

To couple the protease activity of Mpro to the expression of downstream reporter genes, we fused the two cI repressors, cI434 and cIp22, with a linker that contains the specific sequence motif recognized by Mpro, such that only functional Mpro will cleave and deactivate the fused cI repressor (Fig. 2a). We also used a previously reported inactive Mpro mutant (C145A) to validate this system 35,36. Our results demonstrated that this selection circuit can accurately and sensitively report on Mpro activity and the inhibition efficiency of small molecules (Fig. 2b, 2c). In addition, we found that this genetic circuit can be used for proteases from other viruses such as HCV (Extended Data Fig. 2a, 2b), demonstrating its robustness and broad applications. Compared to previously reported selections used for protease evolution in PACE 41–43, our circuit represents an alternative and improved strategy with better response properties (Extended Data Fig. 2c, 2d).

We then used our genetic circuit to couple the cleavage activity of Mpro, encoded on SP, to the expression of gIII, which was controlled by the p434 promoter (Fig. 2d). This circuit enables selection for Mpro variants that can escape inhibition by small molecules. Our results showed that wild-type Mpro supported robust phage propagation, while the C145A mutant behaved like empty phage (Fig. 2e, Extended Data Fig. 2e). We tested phage propagation at various concentrations of the inhibitors GC376 and PF-07321332, and selected 20 µM as the initial concentration for evolution (Fig. 2f, Extended Data Fig. 2f).
We designed 32 different gRNAs to systematically cover the Mpro gene and performed EvoScan with two inhibitors (Fig. 2g, 2h). Surprisingly, we found that escape mutations can occur across the whole Mpro gene (Fig. 2h). Some of these mutations, such as F140L, E166V, and S144A, have also been reported in previous studies on drug resistance against PF-07321332 44,45, supporting the effectiveness and reliability of our system. Most other mutations were not observed in previous works, demonstrating that EvoScan can successfully identify novel key mutations. We also identified conserved mutation sites for both inhibitors, such as S62, L75, N119, S144, T169, A191, P241, and G302 (Fig. 2h). Interestingly, we observed that the phage propagation trajectories of the 32 segments targeted for mutagenesis varied during the evolution process, and more than 10 segments showed no overall enrichment during serial passaging. This suggests that mutations within each of these segments, taken individually, cannot enable drug resistance, so these segments may serve as target regions for future drug development studies (Fig. 2h).

We further verified the ability of these mutations to confer inhibition resistance (Fig. 2i, Extended Data Fig. 2g, 2h). Nearly all of these mutations showed increased resistance against inhibitors compared to wild-type Mpro. In group I mutations, we found that A191V had a strong resistance effect against both inhibitors, while N119D had a moderate resistance effect, and other mutations had relatively weak resistance effects on their specific inhibitors. Strikingly, we found a set of group II mutations (such as E166K) whose enzyme activities were even enhanced by the inhibitors (Fig. 2i). Similar to a previously reported mechanism where GC376 increased the catalytic activity of Mpro mutants 46, E166K has a different interaction with the inhibitors compared to WT, which may then improve the dimerization of Mpro and thus the enzyme activity (Fig. 2j, Fig. 2k).
The same phenomenon was observed with other mutations such as I136V, T169P, F140L, and S144A. However, how these mutation sites increase the enzyme activities when inhibitors are added is not clear, as they are located far from the active pocket.

As a comparison, we also evolved Mpro using the PANCE system (Extended Data Fig. 3a, 3b). With only the mutagenesis method changed, Mpro SP failed to accumulate any consensus mutations after 36 passages. After 96 passages, 4 dominant variants with escaping abilities emerged in the four groups in total (Extended Data Fig. 3c-f). All these variants have the N119D or A191V mutation, which appeared after only 8 passages in EvoScan. These results further showed that EvoScan can effectively explore protein-ligand interaction and identify novel key anchor mutations related to small molecule interactions.

Systematic searching for anchors in high-dimensional space

Having demonstrated that EvoScan can rapidly and thoroughly explore the sequence space and generate more diverse functional variants than traditional methods, we next applied this approach to protein-nucleic acid interaction and systematically searched the space from low to high dimensions. We selected AmeR, a transcriptional regulator from the TetR family, which plays important roles in many biological processes and synthetic biology 47,48. AmeR has few known sequence homologs, making it challenging to use traditional methods to explore its sequence-function relationship, especially in high-dimensional space (Fig. 3a). We planned to first carry out a rapid scan of all gRNAs that cover the full sequence of AmeR, then select only those that generated enriched mutations for further use. Several different evolution routes could then be designed using the remaining APs.
Serial passaging of phage across hosts containing different APs would identify anchors in high dimensions (that is, anchors combining multiple mutations in different segments) that thoroughly and representatively sampled the AmeR sequence space (Fig. 3a).

To link AmeR interaction with its operator to gIII expression, we inserted a PhlF repressor after the pAmeR promoter, such that the repression ability of AmeR is positively correlated with gIII expression (Fig. 3b). We tested several combinations of plasmid origins, ribosome binding sites (RBS) and repressor types 49 to optimize the circuit. The optimal combination resulted in 73-fold propagation of SP carrying AmeR (Fig. 3c, Extended Data Fig. 4b).

To start the scanning process, we selected 13 gRNA sites that cover both the N-terminal and C-terminal domains of AmeR, which are involved in DNA binding and dimerization, respectively. We measured phage titers after each of the 4 passages and found that most groups were enriched ≥50-fold. Of the 13 different groups, 8 generated dominant mutations in the phage supernatant. These mutations were observed within the targeting segment corresponding to each gRNA (Fig. 4b, 4e). These results provide one-dimensional information about the protein sequence space.

We next designed 8 evolutionary routes to sample the high-dimensional space, in which SPs were passaged across all these 8 APs in different orders (Fig. 3d, 3e). For each route, we sequenced the supernatant and 2 single plaques from each round (Fig. 3e). After the full evolutionary scanning process, we obtained 82 anchor variants encompassing 52 different mutations at 39 residue sites (Fig. 3e). Among all the variants, a large portion (~83%) had more than 2 mutations, demonstrating the successful exploration and even sampling of the high-dimensional space.
We measured the fold repression of the 82 variants, and nearly all of them showed improved function compared to WT, demonstrating again the effectiveness of EvoScan in searching for high-fitness sequences (Extended Data Fig. 5a, Supplementary Table 1).

For comparison, we also applied PANCE to AmeR evolution (Extended Data Fig. 4a, 4b, 4c). After 16 passages, only R43S and S57R single mutants and the R43S S57R double mutant appeared (Fig. 3e, Extended Data Fig. 4d), all of which appeared during EvoScan within 8 passages. That only the variants from the low-dimensional space were observed in PANCE again illustrated how allowing competition between variants from all parts of sequence space can suppress and obscure many functionally informative mutation sites and high-fitness variants from the high-dimensional space, which were systematically captured by EvoScan.

Anchors capture key features of the design space

Alignment between mutations and the structure predicted by AlphaFold2 suggested that these beneficial mutations accumulated not only on the helix-turn-helix domain near the N terminus that interacts with DNA, but also on regions related to dimerization of AmeR near the C terminus (Fig. 3f). To investigate the mutational relationship between variants, we drew a relation map linking variants that differed at fewer than three residues (Fig. 3g). The evolution paths leading to different mutants were densely interconnected, indicating the complex interactive nature of protein evolution in high-dimensional space. We were able to identify four evolution paths from the complex map, leading to different mutants that shared the same intermediates or reached the same destination (Fig. 3h). This suggested the existence of shared local peaks in the landscape, consistent with a recent study demonstrating the simultaneous accessibility of multiple peaks during evolution 27.
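A relation map of this kind can be sketched as a graph whose edges join variants that differ at fewer than three residue positions. The snippet below is a minimal illustration, not the authors' pipeline: the helper names are hypothetical, and the six genotypes are only a small subset mentioned in the text, not the full set of 82 anchors.

```python
import re
from itertools import combinations


def parse_variant(name: str) -> dict:
    """Map residue position -> substituted amino acid; 'WT' gives an empty dict."""
    mutations = {}
    if name != "WT":
        for token in name.split():
            _wt, pos, new = re.fullmatch(r"([A-Z])(\d+)([A-Z])", token).groups()
            mutations[int(pos)] = new
    return mutations


def n_differing_residues(a: dict, b: dict) -> int:
    """Count positions at which two genotypes carry different residues."""
    return sum(1 for pos in set(a) | set(b) if a.get(pos) != b.get(pos))


def relation_edges(names, max_diff=3):
    """Return (variant, variant, n_diff) for every pair with n_diff < max_diff."""
    parsed = {name: parse_variant(name) for name in names}
    edges = []
    for a, b in combinations(names, 2):
        n_diff = n_differing_residues(parsed[a], parsed[b])
        if n_diff < max_diff:
            edges.append((a, b, n_diff))
    return edges


variants = ["WT", "R43S", "S57R", "R43S S57R", "S57R P94L", "I80V P94L S57R"]
for edge in relation_edges(variants):
    print(edge)
```

Walking such a graph from WT to a multi-mutant shows which intermediates are shared between routes, which is the kind of information the four evolution paths in Fig. 3h summarize.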
These mutants usually contained one or more of D33E, R43S, S57R, and P94L, and variants containing these mutations appeared to be fitter than WT AmeR (Fig. 3g, Extended Data Fig. 5a), indicating that these mutation sites provide important information about the sequence-function relationships for high-fitness genotypes.

The best-performing single mutant, S57R, outperformed wild-type AmeR in repression ability in both bacterial and mammalian cell systems (Fig. 4a, Extended Data Fig. 6a, 6b). Repressors with better properties (e.g., low leakage, high circuit score) are crucial for constructing robust gates and genetic circuits in synthetic biology 50. We next incorporated the S57R variant into several genetic circuit contexts such as IMPLY, NIMPLY, and NAND 49 (Fig. 4b, Extended Data Fig. 6c-h). The S57R variant significantly increased the circuit score of all these genetic circuits and reduced the circuit leakage at the same time (Fig. 4b). These results show that the identified mutations affected the protein-DNA interaction directly and captured essential features of the protein itself, rather than increasing fitness only in the context of our evolution selection.

Anchors capture complex epistasis interactions in the high-dimensional space

We also found that, in these anchors, mutational combinations had synergistically enhanced repression abilities in both E. coli and HEK293T, demonstrating that exploring the higher dimensions is vital for identifying proteins with improved functions (Fig. 4c, Extended Data Fig. 6a, 6b). However, we found that the order of introducing mutations significantly affected the evolvability, even if the start point and end point were the same (Fig. 4d). For example, S57R P94L double mutants had lower fitness than S57R, which suggested that it was more difficult for natural evolution to reach the final genotype (I80V P94L S57R) if S57R was introduced first (Fig. 4d).
We further built a phylogenetic tree to investigate the evolvability among these variants (Fig. 4e). The results revealed that, by designing different routes (Fig. 3e), EvoScan likely bypassed these evolvability limitations and searched across long genetic distances to obtain these anchors by “jumping” between domains in different orders (APs) in the high-dimensional space. These results further highlighted the need for high-throughput targeting methods to effectively explore the sequence space.

These non-additive interactions between two or more mutations are known as epistasis, which has profound impacts on the landscape in the high-dimensional space. We next systematically investigated the epistasis effect in these anchors and calculated the epistasis value (ε) using fold repression as the fitness value of different genotypes (Supplementary Table 2). We identified both negative epistasis (such as D33E and S57R, R43S and [D33E S57R A75T C93R]) and positive epistasis (such as [S57R P94L] and V188F, [P94L S57R] and [G83V V188F A199S G212S]) for different mutation combinations in both low and high dimensions (Supplementary Table 2). We also studied the magnitude and sign epistasis of different mutations (Fig. 4f), which can create rugged fitness landscapes. Interestingly, we identified reciprocal sign epistasis in the high-dimensional space, such as P94L and [G83V V188F A199S G212S] in the S57R genetic background (Fig. 4f). We also found that, even for the same mutation, such as D33E, P94L, and D119N, ε can be either positive or negative when combined with different mutations, indicating the complex and idiosyncratic epistasis relationship between different mutations (Fig. 4g).
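The text does not spell out the formula used for ε, so the following is only a sketch under a common convention: fitness is taken as log10 fold repression, and ε is the double-mutant effect minus the sum of the two single-mutant effects relative to a chosen background. The function names and all numeric inputs are hypothetical, chosen only to exercise each category; they are not values from Supplementary Table 2.

```python
import math


def epistasis(f_bg: float, f_a: float, f_b: float, f_ab: float) -> float:
    """epsilon = log10(f_ab) + log10(f_bg) - log10(f_a) - log10(f_b).

    f_bg is the fold repression of the background genotype (e.g. WT or S57R),
    f_a and f_b of the background plus one mutation (or mutation block) each,
    and f_ab of the background plus both.
    """
    return math.log10(f_ab) + math.log10(f_bg) - math.log10(f_a) - math.log10(f_b)


def classify(f_bg: float, f_a: float, f_b: float, f_ab: float, tol: float = 1e-9) -> str:
    """Label the interaction as none, magnitude, sign, or reciprocal sign epistasis."""
    if abs(epistasis(f_bg, f_a, f_b, f_ab)) <= tol:
        return "none"
    # A mutation shows sign epistasis when its effect changes direction
    # depending on whether the partner mutation is already present.
    a_flips = (f_a - f_bg) * (f_ab - f_b) < 0
    b_flips = (f_b - f_bg) * (f_ab - f_a) < 0
    if a_flips and b_flips:
        return "reciprocal sign"
    if a_flips or b_flips:
        return "sign"
    return "magnitude"


# Hypothetical fold-repression values (background, +A, +B, +A+B).
print(classify(10, 20, 20, 80))  # both beneficial, more so together
print(classify(10, 20, 5, 30))   # B is harmful alone but helpful on top of A
print(classify(10, 5, 5, 20))    # each harmful alone, beneficial together
```

Passing a non-WT genotype as `f_bg` corresponds to measuring epistasis in a specific genetic background, as in the S57R example above.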

EvoAI enables sequence space reconstruction and prediction of new proteins

Given the complex interaction of mutations in the high-dimensional space, we next aimed to use deep learning to extract the latent features of these anchors obtained from EvoScan to represent and reconstruct the design space of AmeR for high-fitness genotypes with high accuracy, enabling design of new proteins with multiple mutations not represented in the experimental outcomes. We name this hybrid experimental-computational method EvoAI.

We combined a pre-trained GeoFitness model and the Protein Language Model (ESM-2), followed by a Multi-Layer Perceptron (MLP) to enhance the accuracy of predicting protein mutation effects (Fig. 5a). The pre-trained GeoFitness model was trained on a large dataset of ~300,000 protein fitness values from various experimental cases and indicators to enable prediction of protein fitness of single mutations (Extended Data Fig. 7). We used the 82 anchor points for both training and validation with a 10-fold cross-validation approach to obtain the final model (Extended Data Fig. 8). Spearman correlation coefficients were 0.91 and 0.84 for the training set and the test set, respectively, demonstrating a high level of consistency in training effectiveness (Fig. 5b). These results demonstrated that our deep learning model accurately predicted the multi-interaction of mutations and complex epistasis in higher dimensional space.

We further validated the accuracy of the reconstructed space by designing, predicting, and testing new variants different from the 82 anchors. To reduce the computational load, we chose 13 mutations from the top 11 mutation sites with high prediction certainties for novel protein design (Extended Data Fig. 9b). We then computationally traversed all possible combinations of 6 total mutations and calculated the predicted fold repression by our model (1093 predictions in total).
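The traversal step can be illustrated with a short enumeration. The text gives 13 candidate mutations at 11 sites but not how they are distributed, so the split used here (9 sites with one candidate substitution, 2 sites with two) is an assumption, and the site and mutation labels are placeholders. Under that assumption, and allowing at most one substitution per site, there are 1,092 six-mutation designs, one fewer than the 1093 predictions reported; the difference might reflect an extra reference sequence such as wild type, or a different breakdown than the one assumed here.

```python
from itertools import combinations, product

# Hypothetical candidate table: 11 sites carrying 13 substitutions in total.
candidates = {f"site{i}": ["m1"] for i in range(1, 10)}          # 9 sites x 1
candidates.update({"site10": ["m1", "m2"], "site11": ["m1", "m2"]})  # 2 sites x 2


def enumerate_designs(candidates: dict, n_mut: int) -> list:
    """List every design with exactly n_mut mutated sites, one substitution per site."""
    designs = []
    for sites in combinations(sorted(candidates), n_mut):
        # For the chosen sites, take every combination of allowed substitutions.
        for choice in product(*(candidates[site] for site in sites)):
            designs.append(tuple(zip(sites, choice)))
    return designs


designs = enumerate_designs(candidates, n_mut=6)
print(len(candidates), sum(len(v) for v in candidates.values()), len(designs))
```

Each enumerated design would then be scored by the trained model, and the highest-scoring ones selected for cloning.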
The 10 top-scoring protein sequences were cloned and experimentally tested for their fold repression. All 10 sequences showed significantly improved activities compared to WT, with 10- to 38-fold repression abilities (Supplementary Table 3). Furthermore, although we chose only the top predictions and all of them have very close prediction scores, these variants still showed a high Spearman correlation coefficient between prediction scores and experimental results (Extended Data Fig. 9c, 9d). For comparison, we predicted the sequence space without using the anchor information, using only low-dimensional deep mutational scanning (DMS) information, and also generated 10 variants with 6 mutations each (Extended Data Fig. 9e). In striking contrast to the high-performing EvoAI-predicted variants, all 10 variants generated by DMS had worse activity relative to wild-type AmeR (Fig. 5c, Supplementary Table 3). These results validated that, with these compressed anchors, our deep learning model can accurately reconstruct the design space for high-fitness genotypes in high-dimensional space, and design new protein sequences with improved functions.

We identified 39 mutation sites in AmeR (Fig. 3e) that could potentially generate high-fitness genotypes, with a theoretical design space of ~10^50 (20^39). Our EvoAI approach therefore effectively demonstrated that this vast design space of AmeR for high-fitness genotypes can be compressed by ~10^48 times to 82 anchor points.

Discussion

Navigating the complexity and scope of a protein fitness landscape is a long-standing challenge for protein design. We developed EvoScan, a novel system that combines EvolvR mutagenesis and phage selection to explore the protein sequence space in different dimensions. EvoScan can identify valuable anchors, which are variants with critical mutations that represent the sequence space.
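The compression figure reported for AmeR follows from simple arithmetic on the numbers given in the Results: 39 mutable sites, 20 amino acids per site, and 82 anchors. The sketch below works in log10 to keep the values readable; the variable names are illustrative.

```python
import math

n_sites = 39        # mutation sites identified in AmeR
n_amino_acids = 20  # possible residues per site
n_anchors = 82      # anchor variants obtained from EvoScan

# Size of the theoretical design space, 20^39, expressed as a power of ten.
log10_space = n_sites * math.log10(n_amino_acids)

# Compression ratio: design-space size divided by the number of anchors.
log10_ratio = log10_space - math.log10(n_anchors)

print(f"design space ~ 10^{log10_space:.1f}")
print(f"compression  ~ 10^{log10_ratio:.1f}")
```

Both exponents agree with the orders of magnitude quoted in the text, with 20^39 sitting somewhat above 10^50 and the ratio somewhat above 10^48.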
We showed that these anchor points can accurately reconstruct the space and design new proteins when coupled to deep learning methods (EvoAI), demonstrating the extreme compressibility of this space. Previous methods did not capture this insight likely because they only explored either the low-dimensional space by measuring single or double mutations, or a small region of the sequence space by saturating mutations. These methods thus might not capture the whole picture, especially the high-dimensional space (Fig. 5 d). Our approach has several important advantages over existing methods. First, it balances realistic fitness optimization and even sampling of sequence space, which can rapidly explore high dimensions and generate more diverse and functional variants, and provide richer information about sequence-function relationships. Second, by integrating empirical evolutionary scanning and deep learning models in EvoAI, we can leverage the strengths of both approaches. We could use the properties learned by deep learning to dynamically guide the scanning process. Future advances of explainable deep learning could uncover the underlying rules or patterns, and provide insights into how proteins adapt and overcome evolutionary constraints or trade-offs. Third, our method can evolve and investigate proteins that lack structural information, or that involve challenging interactions. We showed that EvoScan can capture anchors for proteins with diverse functions, such as protein-protein, protein-ligand, and protein-nucleic acid interactions. Our approach should be compatible with any biomolecular function that can be coupled to a transcriptional output (e.g., enzymes through small molecule sensors), and thus could be applied to study the sequence spaces of diverse biomolecules. Our approach could be further improved in the future. We could use Cas9 variants with more PAM options to increase the guide RNA tiling and mutation-targeted segment selection. 
We could also modify the editing system to introduce mutations at multiple sites at once, avoiding host switching and speeding up the exploration process. Furthermore, incorporating the targeted mutagenesis approach of EvoScan into PACE could potentially lead to deeper sampling of sequence space segments. In addition, integrating EvoScan with genotype reconstruction methods, such as Evoracle, could enable more systematic and intelligent exploration of the sequence space 26. Moreover, the modularity of our system makes it highly suitable for automation, such as with the recently reported PRANCE method 51, and it could be scaled up to provide more comprehensive fitness landscape profiling data for different protein targets. Such data would show whether the extreme compressibility of the design space for high-fitness genotypes is universal or unusual, and whether the whole protein fitness landscape is compressible. We also hope that our method will inspire new insights into the relationship between genotype and phenotype and the evolution of biological systems. The compressibility of the design space may suggest that nature finds a way to search through the seemingly infinite space in the relatively short history of life on Earth by Darwinian evolution, possibly by “jumping” between these anchors instead of searching every possibility (Fig. 5d). Genetic recombination in large sexual populations could possibly enable this “jumping” and boost evolution rates 53, 54. Our approach would enable the investigation of such path dependence of evolutionary outcomes of biological systems in high-throughput experiments 51, 52 and provide valuable insights for evolution and protein design in biotechnology and biomedical applications.

Declarations

Acknowledgements

This study was supported by Ministry of Science and Technology of China grant 2021YFA0911000 (S.Z.), National Natural Science Foundation of China grant 32171416 (S.Z.)
and U22A20552 (S.Z.), Tsinghua University Dushi Plan Foundation (S.Z.), U.S. NIH R01 EB022376/EB031172 (D.R.L.), and R35 GM118062 (D.R.L.). We thank J. Zheng (Westlake University) for helpful discussions. We thank C. Zhang (Tsinghua University) for the kind gift of the EvolvR gene. We apologize to authors whose work cannot be cited owing to referencing restrictions. Contributions S.Z. conceptualized and supervised the project. S.Z., Z.M. and W.L. designed the experiments. Z.M., W.L., and H.Q. performed the evolution experiments in EvoScan. Z.M., W.L., Y.S., and Z.L. performed the flow cytometry assays and phage propagation assays of obtained variants. G.L. conducted the mammalian cell experiments. H.G., B.T., Y.X., and J.C. designed and developed the deep learning models. Z.M., and Y.S. wrote the first draft. B.W.T., D.R.L., C.A.V., and S.Z. wrote the final manuscript. All authors contributed to the drafting and revision of the manuscript. Competing interests S.Z. and Z.M. have filed a patent application based on this work. References Lovelock, S. L. et al. The road to fully programmable protein catalysis. Nature 606 , 49-58 (2022). Labanieh, L. & Mackall, C. L. CAR immune cells: design principles, resistance and the next generation. Nature 614 , 635-648 (2023). Dumontet, C., Reichert, J. M., Senter, P. D., Lambert, J. M. & Beck, A. Antibody–drug conjugates come of age in oncology. Nat. Rev. Drug Discov. 22 , 641-661 (2023). Macken, C. A. & Perelson, A. S. Protein evolution on rugged landscapes. Proc. Natl Acad. Sci. USA 86 , 6191-6195 (1989). Lutz, S. Beyond directed evolution—semi-rational protein engineering and design. Curr. Opin. Biotechnol. 21 , 734-743 (2010). Ding, X., Zou, Z. & Brooks III, C. L. Deciphering protein evolution and fitness landscapes with latent space models. Nat. Commun. 10 , 5644 (2019). Tian, P. & Best, R. B. Exploring the sequence fitness landscape of a bridge between protein folds. PLoS Comput. Biol. 16 , e1008285 (2020). 
Fernandez-de-Cossio-Diaz, J., Uguzzoni, G. & Pagnani, A. Unsupervised inference of protein fitness landscape from deep mutational scan. Mol. Biol. Evol. 38 , 318-328 (2021). D’Costa, S., Hinds, E. C., Freschlin, C. R., Song, H. & Romero, P. A. Inferring protein fitness landscapes from laboratory evolution experiments. PLoS Comput. Biol. 19 , e1010956 (2023). Fowler, D. M. & Fields, S. Deep mutational scanning: a new style of protein science. Nat. Methods 11 , 801-807 (2014). Stiffler, M. A., Hekstra, D. R. & Ranganathan, R. Evolvability as a function of purifying selection in TEM-1 β-lactamase. Cell 160 , 882-892 (2015). Zheng, L., Baumann, U. & Reymond, J.-L. An efficient one-step site-directed and site-saturation mutagenesis protocol. Nucleic Acids Res. 32 , e115 (2004). McLaughlin Jr, R. N., Poelwijk, F. J., Raman, A., Gosal, W. S. & Ranganathan, R. The spatial architecture of protein function and adaptation. Nature 491 , 138-142 (2012). Cadwell, R. C. & Joyce, G. F. Randomization of genes by PCR mutagenesis. Genome Res. 2 , 28-33 (1992). Vanhercke, T., Ampe, C., Tirry, L. & Denolf, P. Reducing mutational bias in random protein libraries. Anal. Biochem. 339 , 9-14 (2005). Esvelt, K. M., Carlson, J. C. & Liu, D. R. A system for the continuous directed evolution of biomolecules. Nature 472 , 499-503 (2011). Miller, S. M., Wang, T. & Liu, D. R. Phage-assisted continuous and non-continuous evolution. Nat. Protoc. 15 , 4101-4127 (2020). Ravikumar, A., Arzumanyan, G. A., Obadi, M. K. A., Javanpour, A. A. & Liu, C. C. Scalable, Continuous Evolution of Genes at Mutation Rates above Genomic Error Thresholds. Cell 175 , 1946-1957.e1913 (2018). Sarkisyan, K. S. et al. Local fitness landscape of the green fluorescent protein. Nature 533 , 397-401 (2016). Hopf, T. A. et al. Mutation effects predicted from sequence co-variation. Nat. Biotechnol. 35 , 128-135 (2017). Riesselman, A. J., Ingraham, J. B. & Marks, D. S. 
Deep generative models of genetic variation capture the effects of mutations. Nat. Methods 15 , 816-822 (2018). Yang, K. K., Wu, Z. & Arnold, F. H. Machine-learning-guided directed evolution for protein engineering. Nat. Methods 16 , 687-694 (2019). Luo, Y. et al. ECNet is an evolutionary context-integrated deep learning framework for protein engineering. Nat. Commun. 12 , 5743 (2021). Wu, Z., Johnston, K. E., Arnold, F. H. & Yang, K. K. Protein sequence design with deep generative models. Curr. Opin. Chem. Biol. 65 , 18-27 (2021). Somermeyer, L. G. et al. Heterogeneity of the GFP fitness landscape and data-driven protein design. Elife 11 , e75842 (2022). Shen, M. W., Zhao, K. T. & Liu, D. R. Reconstruction of evolving gene variants and fitness from short sequencing reads. Nat. Chem. Biol. 17 , 1188-1198 (2021). Biswas, S., Khimulya, G., Alley, E. C., Esvelt, K. M. & Church, G. M. Low-N protein engineering with data-efficient deep learning. Nat. Methods 18 , 389-396 (2021). Papkou, A., Garcia-Pastor, L., Escudero, J. A. & Wagner, A. A rugged yet easily navigable fitness landscape. Science 382 , eadh3860 (2023). Halperin, S. O. et al. CRISPR-guided DNA polymerases enable diversification of all nucleotides in a tunable window. Nature 560 , 248-252 (2018). Baas, P. DNA replication of single-stranded Escherichia coli DNA phages. Biochim. Biophys. Acta, Gene Struct. Expression 825 , 111-139 (1985). Jinek, M. et al. A programmable dual-RNA–guided DNA endonuclease in adaptive bacterial immunity. Science 337 , 816-821 (2012). Ran, F. A. et al. Genome engineering using the CRISPR-Cas9 system. Nat. Protoc. 8 , 2281-2308 (2013). Dietsch, F. et al. Small p53 derived peptide suitable for robust nanobodies dimerization. J. Immunol. Methods 498 , 113144 (2021). Di Lallo, G., Castagnoli, L., Ghelardini, P. & Paolozzi, L. A two-hybrid system based on chimeric operator recognition for studying protein homo/heterodimerization in Escherichia coli. Microbiology 147 , 1651-1656 (2001). 
Gao, K. et al. Perspectives on SARS-CoV-2 main protease inhibitors. J. Med. Chem. 64 , 16922-16955 (2021). Li, J. et al. Structural basis of the main proteases of coronavirus bound to drug candidate PF-07321332. J. Virol. 96 , e02013-02021 (2022). Fu, L. et al. Both Boceprevir and GC376 efficaciously inhibit SARS-CoV-2 by targeting its main protease. Nat. Commun. 11 , 4417 (2020). Owen, D. R. et al. An oral SARS-CoV-2 M pro inhibitor clinical candidate for the treatment of COVID-19. Science 374 , 1586-1593 (2021). Iketani, S. et al. Functional map of SARS-CoV-2 3CL protease reveals tolerant and immutable sites. Cell Host Microbe 30 ,1354-1362 (2022). Iketani, S. et al. Multiple pathways for SARS-CoV-2 resistance to nirmatrelvir. Nature 14 , 1716-1726 (2022). Dickinson, B. C., Packer, M. S., Badran, A. H. & Liu, D. R. A system for the continuous directed evolution of proteases rapidly reveals drug-resistance mutations. Nat. Commun. 5 , 5352 (2014). Packer, M. S., Rees, H. A. & Liu, D. R. Phage-assisted continuous evolution of proteases with altered substrate specificity. Nat. Commun. 8 , 956 (2017). Blum, T. R. et al. Phage-assisted evolution of botulinum neurotoxin proteases with reprogrammed specificity. Science 371 , 803-810 (2021). Iketani, S. et al. Functional map of SARS-CoV-2 3CL protease reveals tolerant and immutable sites. Cell Host Microbe 30 , 1354-1362. e1356 (2022). Iketani, S. et al. Multiple pathways for SARS-CoV-2 resistance to nirmatrelvir. Nature 613 , 558-564 (2023). Nashed, N. T., Aniana, A., Ghirlando, R., Chiliveri, S. C. & Louis, J. M. Modulation of the monomer-dimer equilibrium and catalytic activity of SARS-CoV-2 main protease by a transition-state analog inhibitor. Commun. Biol. 5 , 160 (2022). Stanton, B. C. et al. Genomic mining of prokaryotic repressors for orthogonal logic gates. Nat. Chem. Biol. 10 , 99-105 (2014). Ramos, J. L. et al. The TetR family of transcriptional repressors. Microbiol. Mol. Biol. Rev. 69 , 326-356 (2005). 
Nielsen, A. A. et al. Genetic circuit design automation. Science 352 , aac7341 (2016). Brophy, J. A. N. & Voigt, C. A. Principles of genetic circuit design. Nat. Methods 11 , 508-520 (2014). DeBenedictis, E. A. et al. Systematic molecular evolution enables robust biomolecule discovery. Nat. Methods 19 , 55-64 (2021). Dickinson, B. C., Leconte, A. M., Allen, B., Esvelt, K. M. & Liu, D. R. Experimental interrogation of the path dependence and stochasticity of protein evolution using phage-assisted continuous evolution. Proc. Natl Acad. Sci. USA 110 , 9007-9012 (2013). Weinreich, D. M. & Chao, L. Rapid evolutionary escape by large populations from local fitness peaks is likely in nature. Evolution 59 , 1175-1182 (2005). Weissman, D. B., Feldman, M. W. & Fisher, D. S. The Rate of Fitness-Valley Crossing in Sexual Populations. Genetics 186 , 1389-1410 (2010). Materials and Methods General methods. The following working concentrations of antibiotics were used: carbenicillin (Solarbio, 50 μg/ml), kanamycin (Solarbio, 50 μg/ml), spectinomycin (Macklin, 50 μg/ml), chloramphenicol (Macklin, 25 μg/ml). PHANTA 2x mix (Vazyme) was used for cloning PCR, and Flash 2x mix (Vazyme) was used for verification PCR and Sanger sequencing (Tsingke Bioscience). All cloning fragments were assembled by Golden Gate assembly (New England Biolabs) or ClonExpress assembly (Vazyme) methods. Plasmids were cloned in DH5α competent cells (HT Health). Synthetic genes were ordered from Tsingke Bioscience. Cloned plasmids were extracted by Tiangen DNA extraction kit. E. coli strain S2060 55 was used in all aspects of the EvoScan process, including system construction, evolution, and plaque assays. The DH5α strain was used for flow cytometry experiments. Detailed information on the plasmids and selection phage (SP) used in this work is given in Supplementary Table 6. Phage propagation assay. Competent S2060 cells were transformed with corresponding accessory plasmid (AP) in each experiment. 
Overnight cultures of single colonies inoculated in LB medium with appropriate antibiotics were diluted 50- or 100-fold and grown at 37 ℃ in a 220 rpm shaker (ZQZY-B8; shake tubes, 5 ml culture) or a 1000 rpm shaker (HUXI HW-400TG; 96-deep-well plate, 500 μl culture) to log phase (OD600 ~0.4–0.6). These cells were then infected with SP at an initial titer of 5 × 10^6 plaque-forming units (p.f.u.) per ml. The mixture was cultured overnight (16–20 h) at 37 ℃ in the shakers described above and then centrifuged at 4000 rpm for 10 min. The phage-containing supernatant was passed through a 0.22 μm bacterial filter and stored at 4 ℃ for further use.

Plaque assay. A single colony of chemically competent S2208 cells 55 (S2060 cells transformed with plasmid pJC175e) was cultured overnight in LB medium supplemented with appropriate antibiotics. The saturated culture was diluted 50- or 100-fold into LB medium with appropriate antibiotics and grown at 37 ℃ in a 220 rpm shaker to log phase (OD ~0.4–0.8) before use. Phages were serially diluted in LB medium in 6 to 8 successive 10-fold steps. Then, 10 μl of each phage dilution was mixed with 45 μl of S2208 cells, and 180 μl of molten (50–65 ℃) soft agar (LB medium with 0.5% agar) supplemented with 2% Bluo-gal (Inalco S.p.A.) was added and mixed by pipetting. The whole mixture was immediately added onto 500 μl of bottom agar (LB medium with 1.5% agar) previously prepared in a 24-well plate. The plates were then incubated at 37 ℃ overnight (14–18 h).

Calculation of fold propagation. To measure the fold propagation of the selection phage, the initial and final phage titers were measured by plaque assays. We defined fold propagation as the ratio of the final phage titer to the initial phage titer.

Basic process of evolutionary scanning (EvoScan).
Target mutagenesis plasmid (TP) was first transformed into chemically competent S2060 cells, and the resulting S2060-TP cells were used to prepare super chemically competent cells by the Inoue method 56. Chemically competent S2060-TP cells were transformed with the corresponding APs. The resulting S2060-TP-AP bacteria were cultured overnight, diluted 50- to 100-fold into 500 μl LB medium with antibiotics and inducers, and grown at 37 ℃ in a 1000 rpm shaker to OD ~0.5. The phage titer for the first infection was around 5 × 10^6–5 × 10^8 p.f.u./ml, and for the following passages the phages were subjected to a 1:50 or 1:100 dilution. Vanillic acid (Sigma-Aldrich, dissolved in ethanol) was added at a final concentration of 50 μM to induce expression of the nCas9-PolIM5 complex. The mixture was then cultured overnight at 37 ℃ in a 1000 rpm shaker. The next day, the mixture was centrifuged at 4000 rpm for 10 min, and the phage content of the collected supernatant was verified by PCR (Flash 2x mix) and Sanger sequencing. The supernatant was then used for plaque assays as described above. Single plaques were picked and further verified by PCR (Flash 2x mix), and the PCR products were sent for Sanger sequencing.

Searching steps in each route. For each step of EvoScan in a route, 10 μl of supernatant containing evolved phages was added to 1 ml of log-phase S2208 culture (OD ~0.4–0.8) and propagated overnight in a 96-deep-well plate. The mixture was centrifuged at 4000 rpm for 10 min and passed through a 0.22 μm bacterial filter. The resulting phages were then diluted and used to infect another host cell containing a different AP at an infection titer of 5 × 10^6 p.f.u./ml.

Basic process of phage-assisted non-continuous evolution (PANCE). An accessory plasmid with the designed genetic circuit and the mutagenesis plasmid MP6 were co-transformed into super chemically competent S2060 cells.
The S2060-MP6-AP bacteria were cultured overnight, diluted 50- to 100-fold into 500 μl LB medium with antibiotics and inducers in a 96-deep-well plate, and grown at 37 ℃ in a 1000 rpm shaker to OD ~0.5. The initial phage titer was around 5 × 10^6–5 × 10^8 p.f.u./ml, and the phages were subjected to a 1:10–1:100 dilution in the following passages in a 500 μl culture. Arabinose (1% m/v, dissolved in ddH2O) was added as the inducer of MP6. Phages were then collected and their mutations identified following the same procedures as in EvoScan.

Induced expression assay. Single colonies of the strains to be tested were cultured in LB medium overnight. The saturated culture was diluted 100-fold in LB medium with appropriate antibiotics and inducers and cultured at 37 ℃ in a 1000 rpm shaker for 2 h (OD ~0.4). Then 2 μl of this log-phase culture was added to LB medium with appropriate antibiotics and inducers to a total volume of 500 μl. The mixture was cultured for 5 h in a 96-deep-well plate.

Flow cytometry assay. To stop cell growth, 10 μl of the culture was added to 190 μl PBS containing 2 g/L kanamycin in a 96-well U-bottom plate. The plate was stored at 4 ℃ until use. A flow cytometer (Beckman Coulter Cytoflex S) was used to quantify the expression levels of fluorescent proteins. FlowJo v10 was used to gate the events (at least 10,000 events) and calculate the median of each sample.

Mpro drug resistance index. In the RTHS protease activity assay, the fluorescence of the experimental group carrying eYFP was measured with and without the Mpro inhibitor GC376 or PF-07321332. The ratio of the FITC-A median with inhibitor to the FITC-A median without inhibitor was defined as the resistance index (RI), which was used to evaluate the drug resistance of different Mpro variants.

Structure display and interaction prediction. Schrödinger 2017 was used for structural display.
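The resistance index defined above is a simple ratio of medians. A minimal sketch, assuming per-event FITC-A values are available as plain lists (the function name and the example numbers are hypothetical, not taken from the original work):

```python
from statistics import median


def resistance_index(fitc_with_inhibitor, fitc_without_inhibitor):
    """Resistance index (RI) as defined in the text:
    FITC-A median with inhibitor / FITC-A median without inhibitor.

    Both arguments are sequences of per-event FITC-A values
    (hypothetical inputs, e.g. exported after gating).
    """
    return median(fitc_with_inhibitor) / median(fitc_without_inhibitor)


# Made-up event values for illustration only.
with_drug = [100.0, 200.0, 300.0]
without_drug = [200.0, 400.0, 600.0]
print(resistance_index(with_drug, without_drug))
```

In practice the medians would come from the gated flow cytometry events described in the flow cytometry assay.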
ZDOCK 57 was used to predict the interaction structure between EGFP and its nanobody. Interactions between Mpro and the inhibitors within 3 Å are shown in the figure.

Fold repression calculation. The background fluorescence of cells, defined as the median fluorescence of bacteria carrying an empty plasmid with only the backbone, was measured and subtracted from all experimental groups. The background-subtracted fluorescence of the uninduced group (no repressor expression) was divided by that of the induced group (repressor expression) to obtain the fold repression.

Relative expression level calculation. Using the flow cytometry assay, we measured the FITC-A median of the strain carrying the empty plasmid and set this value as the background value. The FITC-A median of the strain carrying the standard plasmid, which expresses eYFP from the J23101-B0064-YFP construct, was measured in the same way and set as the standard value. The FITC-A median of the strain containing a specific variant was measured in the same way, and the relative expression level was defined as (variant value − background value)/standard value.

Circuit score calculation. The strain carrying the plasmid with a specific genetic circuit was prepared for the flow cytometry assay. IPTG (1 mM) and vanillic acid (100 μM) were used as the input signals. YFP was used as the output reporter of the circuit, and the FITC-A median of each state was measured. The lowest ON signal (the lowest FITC-A median among the “ON” states of the circuit) was divided by the highest OFF signal (the highest FITC-A median among the “OFF” states of the circuit) to obtain the circuit score.

AmeR phylogenetic tree construction. Protein sequences of the 82 variants and the WT were collected in a FASTA file, which was input into MEGA11 for multiple sequence alignment (MSA) 58. After MSA and phylogenetic analysis, the neighbor-joining method was selected for tree construction.
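The three fluorescence-derived quantities defined above (fold repression, relative expression level, and circuit score) can be sketched as small functions. This is only an illustration of the stated formulas; the function names, argument names, and example values are assumptions, and each input is taken to be a FITC-A median.

```python
def fold_repression(uninduced, induced, background):
    """Background-subtracted uninduced signal divided by
    background-subtracted induced signal."""
    return (uninduced - background) / (induced - background)


def relative_expression(variant, background, standard):
    """(variant value - background value) / standard value,
    following the definition in the text."""
    return (variant - background) / standard


def circuit_score(on_states, off_states):
    """Lowest ON-state median divided by highest OFF-state median."""
    return min(on_states) / max(off_states)


# Made-up medians for illustration only.
print(fold_repression(uninduced=1010.0, induced=110.0, background=10.0))
print(relative_expression(variant=510.0, background=10.0, standard=1000.0))
print(circuit_score(on_states=[800.0, 1200.0], off_states=[40.0, 100.0]))
```

Taking the worst-case ON and OFF states means the circuit score reflects the narrowest separation between the two output classes.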
The output tree was annotated with iTOL 59, and all parameters were left at their default values.

Epistasis calculation. Epistasis between two mutations, A and B, was calculated as ε = f_ab + f_AB − f_Ab − f_aB, where f_ab is the fitness of the wild-type genotype, f_AB that of the double mutant, and f_Ab and f_aB those of the two single mutants. ε > 0 indicates positive epistasis, while ε < 0 indicates negative epistasis.

Mammalian cell culture and transfections. HEK293T cells (CRL-3216, ATCC) were cultured in Dulbecco’s modified Eagle’s medium (DMEM, Gibco) supplemented with 10% (v/v) fetal bovine serum (FBS, Biological Industries) and 1% (v/v) penicillin/streptomycin solution (Beyotime) at 37 °C, 100% humidity and 5% CO2. In transfection experiments, 60,000–80,000 HEK293T cells in 0.2 ml of DMEM complete medium were seeded into each well of 48-well plastic plates (NEST) and grown for ~24 h. M5 HiPer Lipo2000 Transfection Reagent (Lipo2000, Mei5bio) was used in all transfection experiments following the manufacturer’s protocol. Briefly, a sample mixture was prepared by mixing 150 ng of repressor plasmid or 150 ng of control plasmid (lacking the repressor) with 150 ng of reporter plasmid in 0.7 μl Lipo2000. The mixture was incubated at room temperature for 20 min before being added to the cells. Transfections were supplemented with 0.2 ml DMEM complete medium 24 h post-transfection. Cells were cultured for 2 days post-transfection before flow cytometry analysis.

Mammalian cell flow cytometry assay. Cells were trypsinized 48 h after transfection and then centrifuged at 250 × g for 10 min at room temperature. The supernatant was removed, and the cells were resuspended in 1× PBS. Fluorescence values were measured with a Cytoflex flow cytometer (Beckman Coulter, Inc.). The PB450-A and ECD-A channels were used for BFP and mCherry measurement, respectively.
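The epistasis definition given in the methods (wild-type plus double-mutant fitness, minus the two single-mutant fitness values) can be illustrated with a short sketch. The function names and fitness values below are hypothetical; the text does not specify the fitness scale, so none is assumed here.

```python
def epistasis(f_wt, f_single_a, f_single_b, f_double):
    """epsilon = f_ab + f_AB - f_Ab - f_aB, with
    f_wt = f_ab (wild type), f_double = f_AB (double mutant),
    f_single_a = f_Ab and f_single_b = f_aB (single mutants)."""
    return f_wt + f_double - f_single_a - f_single_b


def classify(eps, tol=1e-12):
    """Label the sign of epsilon; tol guards against float noise."""
    if eps > tol:
        return "positive"
    if eps < -tol:
        return "negative"
    return "none"


# Made-up fitness values for illustration only.
eps = epistasis(f_wt=1.0, f_single_a=2.0, f_single_b=3.0, f_double=5.0)
print(eps, classify(eps))
```

When the double mutant equals the sum of the single-mutant effects on top of wild type (for example 1, 2, 3, 4), ε is zero and the mutations are purely additive on the chosen scale.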
Data were processed using FlowJo (TreeStar). Events were gated by forward- and side-scatter area (FSC-A/SSC-A), and cell populations were then selected by gating out the background BFP signal of untransfected cells to obtain the median fluorescence. The median fluorescence was calculated from >20,000 transfected cells for each sample. To reduce expression noise between samples, the mCherry : BFP fluorescence ratio was used to report the repressor activity 60. The mCherry : BFP fluorescence ratio was calculated as (mCherry − mCherry_0)/(BFP − BFP_0), where mCherry_0 and BFP_0 are the fluorescence values of untransfected HEK293T cells. The fold repression was calculated as (mCherry : BFP)_unrepressed/(mCherry : BFP)_repressed, where (mCherry : BFP)_unrepressed and (mCherry : BFP)_repressed are the ratios measured in cells co-transfected with the control plasmid or the repressor plasmid, respectively.

Feature generation. We first queried the UniRef30_2021_03 and BFD multiple sequence alignment (MSA) databases. We then used AlphaFold2 to construct the structure of the wild-type protein. For this work, we used the GeoFitness-Seq variant of the pre-training model. For mutated proteins, structures were generated using FoldX 5. Sequence features were extracted from the large-scale protein language model ESM-2 to capture global context information. Each node in the Geometric Encoder was therefore initialized with the embedding of the corresponding residue derived from ESM-2. Unlike conventional methods that rely on inter-residue distances and contacts to establish edges, each edge in the Geometric Encoder was initialized with the relative geometric relationship between a pair of residues derived from the protein 3D structure 61.

Cross-validation. We employed a 10-fold cross-validation approach to select the hyperparameters of the model.
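The mammalian fold-repression calculation described above (mCherry : BFP ratio with untransfected-cell correction, then unrepressed over repressed) can be sketched as follows. The function names and the numbers are hypothetical; each input is assumed to be a median fluorescence value.

```python
def reporter_ratio(mcherry, bfp, mcherry_0, bfp_0):
    """mCherry : BFP ratio after subtracting the values measured in
    untransfected cells (mcherry_0, bfp_0)."""
    return (mcherry - mcherry_0) / (bfp - bfp_0)


def mammalian_fold_repression(ratio_unrepressed, ratio_repressed):
    """Ratio with the control plasmid divided by the ratio with the
    repressor plasmid."""
    return ratio_unrepressed / ratio_repressed


# Made-up medians for illustration only.
unrepressed = reporter_ratio(mcherry=1050.0, bfp=520.0, mcherry_0=50.0, bfp_0=20.0)
repressed = reporter_ratio(mcherry=150.0, bfp=520.0, mcherry_0=50.0, bfp_0=20.0)
print(mammalian_fold_repression(unrepressed, repressed))
```

Normalizing mCherry by BFP before taking the fold change is what lets samples with different transfection efficiencies be compared.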
The dataset, comprising 82 mutational data points, was divided into three parts: a training set (59 samples), a validation set (7 samples), and a test set (16 samples). Model evaluation used the Spearman correlation coefficient (ρ) as the primary metric.

Model training details. The model was trained for 50 epochs with the Soft Rank Loss as its loss function, using the Adam optimizer with a learning rate of 10^-3 and learning-rate decay. Subsequently, for further fine-tuning, the learning rate of the upstream GeoFitness model was set to 10^-4 and that of the downstream model to 5 × 10^-4.

55 Carlson, J. C., Badran, A. H., Guggiana-Nilo, D. A. & Liu, D. R. Negative selection and stringency modulation in phage-assisted continuous evolution. Nat. Chem. Biol. 10, 216-222 (2014).
56 Green, M. R. & Sambrook, J. The Inoue Method for Preparation and Transformation of Competent Escherichia coli: "Ultracompetent" Cells. Cold Spring Harb Protoc. 2020, 101196 (2020).
57 Chen, R., Li, L. & Weng, Z. ZDOCK: an initial‐stage protein‐docking algorithm. Proteins: Struct., Funct., Bioinf. 52, 80-87 (2003).
58 Tamura, K., Stecher, G. & Kumar, S. MEGA11: molecular evolutionary genetics analysis version 11. Mol. Biol. Evol. 38, 3022-3027 (2021).
59 Letunic, I. & Bork, P. Interactive Tree Of Life (iTOL) v5: an online tool for phylogenetic tree display and annotation. Nucleic Acids Res. 49, W293-W296 (2021).
60 Liang, J. C., Chang, A. L., Kennedy, A. B. & Smolke, C. D. A high-throughput, quantitative cell-based screen for efficient tailoring of RNA device activity. Nucleic Acids Res. 40, e154 (2012).
61 Xu, Y., Liu, D. & Gong, H. Improving the prediction of protein stability changes upon mutations by geometric learning and a pre-training strategy. bioRxiv, 2023.05.28.542668 (2023).

Additional Declarations

There is a potential competing interest. S.Z. and Z.M.
have filed a patent application based on this work. Supplementary Files Supplementarytables20240201.docx Supplementary Data Set 1 ExtendedDataFigures.docx
WT AmeR phages were passaged through these routes sequentially to scan and collect anchors. (b) Genetic circuit design for AmeR evolution. 13 different APs carrying 13 different gRNAs were designed. (c) Phage propagation assays of SP bearing the AmeR gene and the empty phage (ΔgIII). (d) Schematic diagram of one step in each route during evolution. (e) EvoScan of AmeR and properties of the collected variants. For each step in each route, the dominant mutations observed from supernatant were shown. Mutation number distribution and the top 10 mutation types of the 82 variants were shown. A comparison of the evolution results between EvoScan and PANCE was shown, including variant numbers, mutation diversities, and mutated sites of AmeR. (f) Distribution of mutations on AlphaFold2 predicted structure of AmeR. Red regions are mutation sites and the blue region is the typical Helix-Turn-Helix (HTH) Domain of TetR family proteins. (g) Mutation relation map among the 82 variants and WT AmeR. Each circle is a variant and its size is the mutation number. Variants are colored based on their log(fold repression). Variants with less than 3 amino acid difference were linked together. (h) Schematic and mutant information of four evolution paths from the mutation relation map. Sequences of gRNAs and genetic parts are provided in Supplementary Table 4 and Table 5. 
Data are mean ± SD of three experiments, except for phage titers.\u003c/p\u003e","description":"","filename":"3.png","url":"https://assets-eu.researchsquare.com/files/rs-3930833/v1/505deeb507e5ff885adb7f88.png"},{"id":51565304,"identity":"889ce1bc-b5df-4aae-8f11-4e098cb69fc3","added_by":"auto","created_at":"2024-02-23 19:00:34","extension":"png","order_by":4,"title":"Figure 4","display":"","copyAsset":false,"role":"figure","size":1274323,"visible":true,"origin":"","legend":"\u003cp\u003eGenetic relationships and features of the 82 anchors generated by EvoScan.\u003c/p\u003e\n\u003cp\u003e(a) Fold repression of WT AmeR and the S57R variant in different systems, \u003cem\u003eE. coli\u003c/em\u003e and HEK293T mammalian cells. (b) Leakages and circuit scores of different genetic circuits (IMPLY, NIMPLY, and NAND) using WT AmeR or the S57R variant. (c) Fold repression of AmeR variants with different mutation numbers. (d) Evolution paths from WT to the S57R I80V P94L variant. (e) Phylogenetic tree of the 82 variants. The tree is divided into sub-regions by branch distances. Variants with less than 2 amino acid difference are linked by curves. Curves across sub-regions are in bold black. (f) Magnitude epistasis for D33E and [P94L V188F] in the S57R genetic background (upper plane), and reciprocal sign epistasis for P94L and [G83V V188F A199S G212S] in the S57R genetic background (lower plane). (g) Epistasis values of D33E, S57R, P94L, R43S, I80V, and D119N with combinations of different mutations. 
Data are mean ± SD of three experiments, except for phage titers.\u003c/p\u003e","description":"","filename":"4.png","url":"https://assets-eu.researchsquare.com/files/rs-3930833/v1/38bb87ce958056337e48761d.png"},{"id":51565303,"identity":"76441a0d-bd3d-4b2b-a376-13e0890a6c77","added_by":"auto","created_at":"2024-02-23 19:00:34","extension":"png","order_by":5,"title":"Figure 5","display":"","copyAsset":false,"role":"figure","size":200011,"visible":true,"origin":"","legend":"\u003cp\u003eAnchors and deep learning reconstruct the design space for high-fitness genotypes.\u003c/p\u003e\n\u003cp\u003e(a) Schematic of the deep learning model. WT AmeR and 82 anchors from EvoScan were the data set for model training. (b) Training and test results of the 82 variants. Training data are in blue and test data are in red. (c) Experimental fold repression of the designed variants using model trained by EvoScan anchors or deep mutational scanning (DMS) information. The dashed line is WT AmeR fold repression. (d) Comparison between DMS and EvoAI. 
DMS can only search the low-dimensional space, while EvoScan comprehensively segments and scans the high-dimensional space to enable accurate design space reconstruction by deep learning.\u003c/p\u003e","description":"","filename":"5.png","url":"https://assets-eu.researchsquare.com/files/rs-3930833/v1/2ae629a1c2839d09357c0292.png"},{"id":68805418,"identity":"12de34d1-5d7c-4092-ab86-609eb4b1835a","added_by":"auto","created_at":"2024-11-12 08:07:42","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":5014794,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-3930833/v1/3b3cde38-7bf4-4cb1-825a-5db85d68347c.pdf"},{"id":51565299,"identity":"792b8d44-f22d-4f97-b916-47d7e0b64d29","added_by":"auto","created_at":"2024-02-23 19:00:33","extension":"docx","order_by":1,"title":"","display":"","copyAsset":false,"role":"supplement","size":36338,"visible":true,"origin":"","legend":"Supplementary Data Set 1","description":"","filename":"Supplementarytables20240201.docx","url":"https://assets-eu.researchsquare.com/files/rs-3930833/v1/4e3ad0062639a1cdba5b3a9f.docx"},{"id":51565305,"identity":"645b6f1c-4e54-46db-8376-a0f55cd38f73","added_by":"auto","created_at":"2024-02-23 19:00:34","extension":"docx","order_by":2,"title":"","display":"","copyAsset":false,"role":"supplement","size":2776190,"visible":true,"origin":"","legend":"","description":"","filename":"ExtendedDataFigures.docx","url":"https://assets-eu.researchsquare.com/files/rs-3930833/v1/b1452e3db4fa701ce23967eb.docx"}],"financialInterests":"\u003cb\u003eYes\u003c/b\u003e there is potential Competing Interest.\nS.Z. and Z.M. 
have filed a patent application based on this work.","formattedTitle":"EvoAI enables extreme compression and reconstruction of the protein sequence space","fulltext":[{"header":"Main","content":"\u003cp\u003eProtein engineering and design can create proteins with optimized functions for various applications in biotechnology, medicine, and synthetic biology\u003csup\u003e\u003cspan additionalcitationids=\"CR2\" citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e\u003c/sup\u003e. The fundamental challenge of protein engineering is to understand and manipulate the protein fitness landscape, which is a high-dimensional and complex space that contains a vast number of possible sequences and functions\u003csup\u003e\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e,\u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e\u003c/sup\u003e.\u003c/p\u003e \u003cp\u003eAlthough there have been considerable attempts over the past several decades to search this space for high-fitness sequences, we have only scratched the surface of understanding the rules and features of the space\u003csup\u003e\u003cspan additionalcitationids=\"CR7 CR8\" citationid=\"CR6\" class=\"CitationRef\"\u003e6\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e\u003c/sup\u003e.\u003c/p\u003e \u003cp\u003eExperimental methods using directed evolution techniques, such as deep mutational scanning\u003csup\u003e\u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e10\u003c/span\u003e,\u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e\u003c/sup\u003e, site-saturated mutagenesis\u003csup\u003e\u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e12\u003c/span\u003e,\u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e\u003c/sup\u003e, and random library 
construction\u003csup\u003e\u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e,\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e\u003c/sup\u003e, can provide valuable information, but they are laborious and time-consuming to scale up and typically must trade off accuracy and precision with sequence space coverage. These experimental methods are also usually restricted to low-dimensional mutations that do not take into account the natural selection pressure that shapes the protein fitness landscape in high-dimensional space. Advanced directed evolution tools that support the necessary scale, such as phage-assisted continuous evolution (PACE)\u003csup\u003e\u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e,\u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e\u003c/sup\u003e or OrthoRep\u003csup\u003e\u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e\u003c/sup\u003e, provide information primarily about trajectories that lead to high-fitness variants, which is insufficient to model the fitness landscape in its entirety. Computational methods, such as structure or sequence-based modeling of the protein fitness landscape\u003csup\u003e\u003cspan additionalcitationids=\"CR20 CR21 CR22 CR23 CR24\" citationid=\"CR19\" class=\"CitationRef\"\u003e19\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR25\" class=\"CitationRef\"\u003e25\u003c/span\u003e\u003c/sup\u003e, can evaluate larger sequence spaces but are limited by the availability and quality of training data, especially for proteins with few homologs or no structure information. 
These computational methods also typically do not account for other biological factors that affect the protein function, such as \u003cem\u003ein vivo\u003c/em\u003e interactions or post-translational modifications.\u003c/p\u003e \u003cp\u003eAn ideal approach to understanding and navigating this space for design and engineering purposes would use comprehensive high-throughput experimental data to inform efficient computational models. It was shown that high-throughput short sequencing data from directed evolution experiments can enable machine learning methods to reconstruct the full-length genotype and identify high-fitness variants\u003csup\u003e\u003cspan citationid=\"CR26\" class=\"CitationRef\"\u003e26\u003c/span\u003e\u003c/sup\u003e. Furthermore, it has been demonstrated that deep learning models for protein design can benefit from even a limited number of functionally characterized variants\u003csup\u003e\u003cspan citationid=\"CR27\" class=\"CitationRef\"\u003e27\u003c/span\u003e\u003c/sup\u003e. A recent work demonstrated that the protein fitness landscape is rugged with many local peaks but still easily navigable\u003csup\u003e\u003cspan citationid=\"CR28\" class=\"CitationRef\"\u003e28\u003c/span\u003e\u003c/sup\u003e. We view these functional variants or local peaks as key \u0026ldquo;anchor\u0026rdquo; points that capture the features of high-fitness genotype space. We hypothesize that the design space for high-fitness genotypes can be effectively compressed by identifying a sufficient number of these \u0026ldquo;anchor\u0026rdquo; points to capture all the essential features, which can then instruct deep learning models to reconstruct and explore the whole space. However, no existing method can generate these anchors in a rapid and comprehensive way, especially for anchors from the high-dimensional space. 
Such a method would need to capture functional information about variants evenly distributed across protein sequence space in a very high throughput manner.\u003c/p\u003e \u003cp\u003eHere, we present EvoAI, a novel approach to empirically interrogate, then model, compress, and reconstruct, the sequence space. Our approach combines high-throughput experimental evolution and computational methods to capture and learn from the essential features of the space. We first developed an evolutionary scanning method that adapts phage-assisted non-continuous evolution (PANCE)\u003csup\u003e\u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e\u003c/sup\u003e by incorporating a segmented mutagenesis system based on EvolvR\u003csup\u003e\u003cspan citationid=\"CR29\" class=\"CitationRef\"\u003e29\u003c/span\u003e\u003c/sup\u003e. Compared to traditional methods, this method enabled rapid and thorough evolutionary scanning from low to high dimensions and captured valuable fitness anchors. We then developed a deep learning and large language model to reconstruct the sequence space from these anchors and design new proteins with more than 10-fold improved activity compared to wild-type. For a repressor protein, we demonstrated that this vast design space can be extremely compressed by a factor of 10\u003csup\u003e48\u003c/sup\u003e to 82 points.\u003c/p\u003e"},{"header":"Results","content":"\u003cdiv id=\"Sec3\" class=\"Section2\"\u003e \u003ch2\u003eThe evolutionary scanning method\u003c/h2\u003e \u003cp\u003eThe M13 bacteriophage has a single-stranded DNA genome, but it generates a double-stranded form after infecting the host cell\u003csup\u003e\u003cspan citationid=\"CR30\" class=\"CitationRef\"\u003e30\u003c/span\u003e\u003c/sup\u003e (Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003ea). 
We reasoned that this should allow the targeted CRISPR-guided DNA polymerase mutagenesis system (TP) to introduce mutations into the M13 phage genome for selection and evolution\u003csup\u003e\u003cspan citationid=\"CR29\" class=\"CitationRef\"\u003e29\u003c/span\u003e\u003c/sup\u003e. Here, the expression of the nCas9-PolI complex was controlled by the vanillic acid-induced VanR-pVanA expression system that has a large induction fold change and low background expression, and is suitable for expressing large and highly toxic proteins\u003csup\u003e\u003cspan citationid=\"CR31\" class=\"CitationRef\"\u003e31\u003c/span\u003e,\u003cspan citationid=\"CR32\" class=\"CitationRef\"\u003e32\u003c/span\u003e\u003c/sup\u003e. The evolution target was inserted into the M13 genome in place of gIII (which encodes pIII, the minor coat protein of M13 required for infection) to generate the selection phage (SP). The accessory plasmid (AP) expresses guide RNAs (gRNAs) that target different regions of the gene of interest for mutagenesis. The AP also contains gIII under the control of a genetic circuit that links the function of the gene of interest to the expression of gIII. This allows the selection of phages with improved and high-fitness protein function during phage propagation, while phages with non-functional genes are eliminated after dilution (Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003ea). We named this system EvoScan (Evolutionary Scanning).\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eEvoScan can explore specific regions of the fitness landscape to generate valuable anchors. These anchors are obtained by using different gRNAs to divide the target gene into defined segments, thus reducing the dimensionality of the fitness space. 
Moreover, the combination of different gRNAs through serial propagation on host cells bearing different APs enables the scanning and identification of anchors in higher dimensions, which can capture more details of the protein sequence space. To investigate and scan the protein sequence space, we validated and used this system to study three proteins with diverse functions: an EGFP-specific nanobody for protein-protein interaction; SARS-CoV-2 M\u003csup\u003epro\u003c/sup\u003e and its inhibitors for protein-ligand interaction; and AmeR and its DNA operator for protein-nucleic acid interaction.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec4\" class=\"Section2\"\u003e \u003ch2\u003eValidation of EvoScan and rapid identification of anchors in nanobody\u003c/h2\u003e \u003cp\u003eTo validate EvoScan and apply this system to proteins involved in protein-protein interaction, we chose antigen-antibody interaction, in this case, EGFP and its cognate nanobody\u003csup\u003e\u003cspan citationid=\"CR33\" class=\"CitationRef\"\u003e33\u003c/span\u003e\u003c/sup\u003e. We first established a reverse two-hybrid system (RTHS) that coupled the nanobody-EGFP interaction to the expression of gIII. We fused EGFP to the cI434 repressor, and its nanobody to cIp22, which can interact with cI434 but not with itself\u003csup\u003e\u003cspan citationid=\"CR34\" class=\"CitationRef\"\u003e34\u003c/span\u003e\u003c/sup\u003e. The gene encoding nanobody-cIp22 was inserted on phage to replace gIII. The gene encoding EGFP-cI434 was integrated on the AP and transformed into \u003cem\u003eE. coli\u003c/em\u003e. After phage infection, interaction between EGFP and nanobody will enable the interaction between cI434 and cIp22 to form a tetramer complex and inhibit the p434 promoter (Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003eb, \u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003ec). 
In the AP, a transcriptional repressor PhlF was placed downstream of the p434 promoter, and gIII was placed under the control of the pPhlF promoter, such that interaction between EGFP and the nanobody will eventually induce the expression of gIII and allow phage propagation (Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003eb). We tested several combinations of ribosome binding sites and chose P3 RBS for PhlF and B0064 for gIII (Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003ed). This circuit propagated phage carrying EGFP nanobody while limiting the propagation of empty phage.\u003c/p\u003e \u003cp\u003eTo test whether EvoScan could quickly identify a fitness-increasing protein variant \u0026ldquo;anchor\u0026rdquo; site, we artificially disrupted the interaction between EGFP and nanobody by introducing the E103K mutation in the CDR3 of the nanobody, which is essential for binding to its target (Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003ec, Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003ee). We designed four different gRNAs targeting different segments of the nanobody gene, with gRNA3 designed to target the segment containing the E103K mutation site of the nanobody (Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003ef). After two passages in EvoScan, we observed that only the group with gRNA3 targeting the E103K segment showed increased phage titer, while the other three groups all decreased (Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003eg). Sequencing results of the phage supernatant confirmed that in the gRNA3 group, the E103K mutation had reverted back to glutamate. 
This validated that EvoScan can successfully and efficiently identify anchors that play important roles in protein function.\u003c/p\u003e \u003cp\u003eFor comparison, we also implemented a traditional phage-assisted non-continuous evolution system (PANCE) using the same E103K phage (Extended Data Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003ea). The two systems differed only in the use of targeted (EvoScan, TP) or non-targeted (PANCE, MP6) mutagenesis. After 8 passages in PANCE, no consensus mutations were found in the nanobody gene. Interestingly, an N29D single mutation appeared in cIp22 (Extended Data Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003eb, Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003eg), which disrupted the selection pressure on nanobody function due to the strong self-interaction between the two cI repressors (Extended Data Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003ec, \u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003ed). These results further demonstrated that EvoScan can rapidly direct evolution to the target protein, even when a more readily accessible background mutation could otherwise interfere with the desired evolution process.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec5\" class=\"Section2\"\u003e \u003ch2\u003e\u003cb\u003eThorough identification of anchors reveals novel M\u003c/b\u003e\u003csup\u003e\u003cb\u003epro\u003c/b\u003e\u003c/sup\u003e \u003cb\u003edrug resistant variants\u003c/b\u003e\u003c/h2\u003e \u003cp\u003eWe next applied EvoScan to investigate protein-ligand interaction. 
In this case, we chose M\u003csup\u003epro\u003c/sup\u003e, a crucial protease in the SARS-CoV-2 virus\u003csup\u003e\u003cspan citationid=\"CR35\" class=\"CitationRef\"\u003e35\u003c/span\u003e,\u003cspan citationid=\"CR36\" class=\"CitationRef\"\u003e36\u003c/span\u003e\u003c/sup\u003e. Several M\u003csup\u003epro\u003c/sup\u003e inhibitors have been developed and used to treat COVID-19 patients, such as GC376\u003csup\u003e37\u003c/sup\u003e and PF-07321332\u003csup\u003e38\u003c/sup\u003e, which is a key component of Paxlovid. However, the rapid mutation of SARS-CoV-2 may reduce or even eliminate the efficacy of these drugs. Previous studies have identified mutational hotspots for drug resistance but have not comprehensively profiled the M\u003csup\u003epro\u003c/sup\u003e drug resistance fitness landscape\u003csup\u003e\u003cspan citationid=\"CR39\" class=\"CitationRef\"\u003e39\u003c/span\u003e,\u003cspan citationid=\"CR40\" class=\"CitationRef\"\u003e40\u003c/span\u003e\u003c/sup\u003e. It is important to thoroughly study possible escape mechanisms of M\u003csup\u003epro\u003c/sup\u003e in order to inform future drug development efforts. Here, we used EvoScan to systematically identify and extract key anchors from different regions of M\u003csup\u003epro\u003c/sup\u003e that affect its interaction with small molecule inhibitors.\u003c/p\u003e \u003cp\u003eTo couple the protease activity of M\u003csup\u003epro\u003c/sup\u003e to the expression of downstream reporter genes, we fused the two cI repressors, cI434 and cIp22, with a linker that contains the specific sequence motif recognized by M\u003csup\u003epro\u003c/sup\u003e, such that only functional M\u003csup\u003epro\u003c/sup\u003e will cleave and deactivate the fused cI repressor (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003ea). 
We also used a previously reported inactive M\u003csup\u003epro\u003c/sup\u003e mutant (C145A) to validate this system\u003csup\u003e\u003cspan citationid=\"CR35\" class=\"CitationRef\"\u003e35\u003c/span\u003e,\u003cspan citationid=\"CR36\" class=\"CitationRef\"\u003e36\u003c/span\u003e\u003c/sup\u003e. Our results demonstrated that this selection circuit can accurately and sensitively report on M\u003csup\u003epro\u003c/sup\u003e activity and the inhibition efficiency of small molecules (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003eb, \u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003ec). In addition, we found that this genetic circuit can be used for proteases from other viruses such as HCV (Extended Data Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003ea, \u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003eb), demonstrating its robustness and broad applications. Compared to previously reported selections used for protease evolution in PACE\u003csup\u003e\u003cspan additionalcitationids=\"CR42\" citationid=\"CR41\" class=\"CitationRef\"\u003e41\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR43\" class=\"CitationRef\"\u003e43\u003c/span\u003e\u003c/sup\u003e, our circuit represents an alternative and improved strategy with better response properties (Extended Data Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003ec, \u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003ed).\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eWe then used our genetic circuit to couple the cleavage activity of M\u003csup\u003epro\u003c/sup\u003e, encoded on SP, to the expression of gIII, which was controlled by the p434 promoter (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003ed). 
This circuit enables selection for M\u003csup\u003epro\u003c/sup\u003e variants that can escape inhibition by small molecules. Our results showed that wild-type M\u003csup\u003epro\u003c/sup\u003e supported robust phage propagation, while the C145A mutant behaved like empty phage (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003ee, Extended Data Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003ee). We tested phage propagation at various concentrations of the inhibitors GC376 and PF-07321332, and selected 20 \u0026micro;M as the initial concentration for evolution (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003ef, Extended Data Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003ef). We designed 32 different gRNAs to systematically cover the M\u003csup\u003epro\u003c/sup\u003e gene and performed EvoScan with two inhibitors (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003eg, \u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003eh). Surprisingly, we found that escaping mutations can occur across the whole M\u003csup\u003epro\u003c/sup\u003e gene (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003eh). Some of these mutations, such as F140L, E166V, and S144A, have also been reported in previous studies on drug resistance against PF-07321332\u003csup\u003e44,45\u003c/sup\u003e, proving the effectiveness and reliability of our system. Most other mutations were not observed in previous works, demonstrating that EvoScan can successfully identify novel key mutations. We also identified conserved mutation sites for both inhibitors, such as S62, L75, N119, S144, T169, A191, P241, and G302 (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003eh). 
Interestingly, we observed that the phage propagation trajectories of the 32 segments targeted for mutagenesis varied during the evolution process, and more than 10 segments showed no overall enrichment during serial passaging, suggesting that mutations within each of these segments taken individually cannot enable drug resistance, which may serve as regions for future drug development studies (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003eh).\u003c/p\u003e \u003cp\u003eWe further verified the ability of these mutations to confer inhibition resistance (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003ei, Extended Data Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003eg, \u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003eh). Nearly all of these mutations showed increased resistance against inhibitors compared to wild-type M\u003csup\u003epro\u003c/sup\u003e. In group I mutations, we found that A191V had a strong resistance effect against both inhibitors, while N119D had a moderate resistance effect, and other mutations had relatively weak resistance effects on their specific inhibitors. Strikingly, we found a set of group II mutations (such as E166K), of which the enzyme activities were even improved by inhibitors (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003ei). 
Similar to a previously reported mechanism where GC376 increased the catalytic activity of M\u003csup\u003epro\u003c/sup\u003e mutants\u003csup\u003e\u003cspan citationid=\"CR46\" class=\"CitationRef\"\u003e46\u003c/span\u003e\u003c/sup\u003e, E166K has a different interaction with the inhibitors compared to WT, which may then improve the dimerization of M\u003csup\u003epro\u003c/sup\u003e and thus the enzyme activity (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003ej, Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003ek). The same phenomenon was observed with other mutations such as I136V, T169P, F140L, and S144A. However, how these mutation sites increase the enzyme activities when inhibitors were added is not clear, as they are located far from the active pocket.\u003c/p\u003e \u003cp\u003eAs a comparison, we also evolved M\u003csup\u003epro\u003c/sup\u003e using the PANCE system (Extended Data Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003ea, \u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003eb). With only the mutagenesis method changed, M\u003csup\u003epro\u003c/sup\u003e SP failed to accumulate any consensus mutations after 36 passages. After 96 passages, 4 dominant variants with escaping abilities emerged in the four groups in total (Extended Data Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003ec-f). All these variants have the N119D or A191V mutation, which appeared after only 8 passages in EvoScan. 
These results further showed that EvoScan can effectively explore protein-ligand interaction and identify novel key anchor mutations related to small molecule interactions.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec6\" class=\"Section2\"\u003e \u003ch2\u003eSystematic searching for anchors in high-dimensional space\u003c/h2\u003e \u003cp\u003eHaving demonstrated that EvoScan can rapidly and thoroughly explore the sequence space and generate more diverse functional variants than traditional methods, we next applied this approach to protein-nucleic acid interaction and systematically searched the space from low to high dimensions. We selected AmeR, a transcriptional regulator from the TetR family which plays important roles in many biological processes and synthetic biology\u003csup\u003e\u003cspan citationid=\"CR47\" class=\"CitationRef\"\u003e47\u003c/span\u003e,\u003cspan citationid=\"CR48\" class=\"CitationRef\"\u003e48\u003c/span\u003e\u003c/sup\u003e. AmeR has few known sequence homologs, making it challenging to use traditional methods to explore its sequence-function relationship, especially in high-dimensional space (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003ea). We planned to first carry out a rapid scan of all gRNAs that cover the full sequence of AmeR, then select only those that generated enriched mutations for further use. Several different evolution routes could then be designed using the remaining APs. 
Serial passaging of phage across hosts containing different APs would identify anchors in high dimensions \u0026ndash; that is, combining multiple mutations in different segments \u0026ndash; that thoroughly and representatively sampled the AmeR sequence space (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003ea).\u003c/p\u003e \u003cp\u003eTo link AmeR interaction with its operator to gIII expression, we inserted a PhlF repressor after the pAmeR promoter, such that the repression ability of AmeR is positively correlated with gIII expression (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003eb). We tested several combinations of plasmid origins, ribosome binding sites (RBS) and repressor types\u003csup\u003e\u003cspan citationid=\"CR49\" class=\"CitationRef\"\u003e49\u003c/span\u003e\u003c/sup\u003e to optimize the circuit. The optimal combination resulted in 73-fold propagation of SP carrying AmeR (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003ec, Extended Data Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003eb).\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eTo start the scanning process, we selected 13 gRNA sites that cover both the N-terminal and C-terminal domains of AmeR, which are involved in DNA binding and dimerization, respectively. We measured phage titers after each of the 4 passages and found that most groups enriched\u0026thinsp;\u0026ge;\u0026thinsp;50-fold. Of the 13 different groups, 8 generated dominant mutations in the phage supernatant. These mutations were observed within the targeting segment corresponding to each gRNA (Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003eb, \u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003ee). These results provide one-dimensional information about the protein sequence space. 
We next designed 8 evolutionary routes to sample the high-dimensional space, in which SPs were passaged across all these 8 APs in different orders (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003ed, \u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003ee). For each route, we sequenced the supernatant and 2 single plaques from each round (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003ee). After the full evolutionary scanning process, we obtained 82 anchor variants encompassing 52 different mutations at 39 residue sites (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003ee). Among all the variants, a large portion (~\u0026thinsp;83%) of variants had more than 2 mutations, demonstrating the successful exploration and even sampling of the high-dimensional space. We measured the fold repression of the 82 variants, and nearly all of them showed improved function compared to WT, demonstrating again the effectiveness of EvoScan in searching for high-fitness sequences (Extended Data Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003ea, Supplementary Table\u0026nbsp;1).\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eFor comparison, we also applied PANCE to AmeR evolution (Extended Data Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003ea, \u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003eb, \u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003ec). After 16 passages, only R43S and S57R single mutants and the R43S S57R double mutant appeared (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003ee, Extended Data Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003ed), all of which appeared during EvoScan within 8 passages. 
That only the variants from the low-dimensional space were observed in PANCE again illustrated how allowing competition between variants from all parts of sequence space can suppress and obscure many functionally informative mutation sites and high-fitness variants from the high-dimensional space, which were systematically captured by EvoScan.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec7\" class=\"Section2\"\u003e \u003ch2\u003eAnchors capture key features of the design space\u003c/h2\u003e \u003cp\u003eMapping the mutations onto the structure predicted by AlphaFold2 suggested that these beneficial mutations accumulated not only on the helix-turn-helix domain near the N terminus that interacts with DNA, but also on regions related to dimerization of AmeR near the C terminus (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003ef). To investigate the mutational relationships between variants, we drew a relation map linking variants that differed at fewer than three residues (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003eg). The evolution paths leading to different mutants were intricately interconnected, indicating the complex interactive nature of protein evolution in high-dimensional space.\u003c/p\u003e \u003cp\u003eWe were able to identify four evolution paths from the complex map, leading to different mutants that shared the same intermediates or reached the same destination (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003eh). This suggested the existence of shared local peaks in the landscape, consistent with a recent study demonstrating the simultaneous accessibility of multiple peaks during evolution\u003csup\u003e\u003cspan citationid=\"CR27\" class=\"CitationRef\"\u003e27\u003c/span\u003e\u003c/sup\u003e. 
These mutants usually contained one or more of D33E, R43S, S57R, and P94L, and variants containing these mutations appeared to be fitter than WT AmeR (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003eg, Extended Data Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003ea), indicating that these mutation sites provide important information about the sequence-function relationships for high-fitness genotypes.\u003c/p\u003e \u003cp\u003eThe best-performing single mutant, S57R, outperformed wild-type AmeR in repression ability in both bacterial and mammalian cell systems (Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003ea, Extended Data Fig.\u0026nbsp;6a, 6b). Repressors with better properties (e.g., low leakage, high circuit score) are crucial for constructing robust gates and genetic circuits in synthetic biology\u003csup\u003e\u003cspan citationid=\"CR50\" class=\"CitationRef\"\u003e50\u003c/span\u003e\u003c/sup\u003e. We next incorporated S57R into several genetic circuit contexts such as IMPLY, NIMPLY, and NAND\u003csup\u003e\u003cspan citationid=\"CR49\" class=\"CitationRef\"\u003e49\u003c/span\u003e\u003c/sup\u003e (Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003eb, Extended Data Fig.\u0026nbsp;6c-h). The S57R variant significantly increased the circuit score of all these genetic circuits and reduced the circuit leakage at the same time (Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003eb). 
These results show that the identified mutations affected the protein-DNA interaction directly and captured essential features of the protein itself, rather than increasing fitness only in the context of our evolution selection.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec8\" class=\"Section2\"\u003e \u003ch2\u003eAnchors capture complex epistasis interactions in the high-dimensional space\u003c/h2\u003e \u003cp\u003eWe also found that, in these anchors, mutational combinations had synergistically enhanced repression abilities in both \u003cem\u003eE. coli\u003c/em\u003e and HEK293T, demonstrating that exploring the higher dimensions is vital for identifying proteins with improved functions (Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003ec, Extended Data Fig.\u0026nbsp;6a, 6b). However, we found that the order of introducing mutations significantly affected the evolvability, even if the start point and end point were the same (Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003ed). For example, S57R P94L double-mutants had lower fitness than S57R, which suggested that it was more difficult for natural evolution to reach the final genotype (I80V P94L S57R) if S57R was introduced first (Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003ed). We further built a phylogenetic tree to investigate the evolvability among these variants (Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003ee). The results revealed that, by designing different routes (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003ee), EvoScan likely bypassed these evolvability limitations and achieved long genetic distance searching to obtain these anchors by \u0026ldquo;jumping\u0026rdquo; between domains in different orders (APs) in the high-dimensional space. 
These results further highlighted the need for high-throughput targeting methods to effectively explore the sequence space.\u003c/p\u003e \u003cp\u003eThese non-additive interactions between two or more mutations are known as epistasis, which has profound impacts on the landscape in the high-dimensional space. We next systematically investigated the epistasis effect in these anchors and calculated the epistasis value (ε) using fold repression as the fitness value of different genotypes (Supplementary Table\u0026nbsp;2). We identified both negative (such as D33E and S57R, R43S and [D33E S57R A75T C93R]) and positive epistasis (such as [S57R P94L] and V188F, [P94L S57R] and [G83V V188F A199S G212S]) for different mutation combinations in both low dimensions and high dimensions (Supplementary Table\u0026nbsp;2). We also studied the magnitude and sign epistasis of different mutations (Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003ef), which can create rugged fitness landscapes. Interestingly, we identified reciprocal sign epistasis in the high-dimensional space, such as P94L and [G83V V188F A199S G212S] in the S57R genetic background (Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003ef). 
We also found that, even for the same mutation, such as D33E, P94L, and D119N, ε can be either positive or negative when combined with different mutations, indicating the complex and idiosyncratic epistasis relationship between different mutations (Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003eg).\u003c/p\u003e \u003c/div\u003e\n\u003ch3\u003eEvoAI enables sequence space reconstruction and prediction of new proteins\u003c/h3\u003e\n\u003cp\u003eGiven the complex interaction of mutations in the high-dimensional space, we next aimed to use deep learning to extract the latent features of these anchors obtained from EvoScan to represent and reconstruct the design space of AmeR for high-fitness genotypes with high accuracy, enabling design of new proteins with multiple mutations not represented in the experimental outcomes. We name this hybrid experimental-computational method EvoAI.\u003c/p\u003e \u003cp\u003eWe combined a pre-trained GeoFitness model and the Protein Language Model (ESM-2), followed by a Multi-Layer Perceptron (MLP) to enhance the accuracy of predicting protein mutation effects (Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003ea). The pre-trained GeoFitness model was trained on a large dataset of ~\u0026thinsp;300,000 protein fitness values from various experimental cases and indicators to enable prediction of protein fitness of single mutations (Extended Data Fig.\u0026nbsp;7). We used the 82 anchor points for both training and validation with a 10-fold cross-validation approach to obtain the final model (Extended Data Fig.\u0026nbsp;8). Spearman correlation coefficients were 0.91 and 0.84 for the training set and the test set, respectively, demonstrating a high level of consistency in training effectiveness (Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003eb). 
These results demonstrated that our deep learning model accurately predicted the multi-way interactions between mutations and complex epistasis in higher-dimensional space.\u003c/p\u003e \u003cp\u003eWe further validated the accuracy of the reconstructed space by designing, predicting, and testing new variants different from the 82 anchors. To reduce the computational load, we chose 13 mutations from the top 11 mutation sites with high prediction certainties for novel protein design (Extended Data Fig.\u0026nbsp;9b). We then computationally traversed all possible combinations of 6 total mutations and calculated the predicted fold repression by our model (1093 predictions in total). The 10 top-scoring protein sequences were cloned and experimentally tested for their fold repression. All 10 sequences showed significantly improved activities compared to WT with 10- to 38-fold repression abilities (Supplementary Table\u0026nbsp;3). Furthermore, although we chose only the top predictions and all of them had very close prediction scores, these variants still showed a high Spearman correlation coefficient between prediction scores and experimental results (Extended Data Fig.\u0026nbsp;9c, 9d). For comparison, we tested the predicted sequence space without the anchor information, using only low-dimensional deep mutational scanning (DMS) information, and also generated 10 variants with 6 mutations each (Extended Data Fig.\u0026nbsp;9e). In striking contrast to the high-performing EvoAI-predicted variants, all 10 variants generated by DMS had worse activity relative to wild-type AmeR (Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003ec, Supplementary Table\u0026nbsp;3).\u003c/p\u003e \u003cp\u003eThese results validated that, with these compressed anchors, our deep learning model can accurately reconstruct the design space for high-fitness genotypes in high-dimensional space, and design new protein sequences with improved functions. 
We identified 39 mutation sites in AmeR (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003ee) that could potentially generate high-fitness genotypes, with a theoretical design space of ~\u0026thinsp;10\u003csup\u003e50\u003c/sup\u003e (20\u003csup\u003e39\u003c/sup\u003e). Our EvoAI approach therefore effectively demonstrated that this vast design space of AmeR for high-fitness genotypes can be compressed by ~\u0026thinsp;10\u003csup\u003e48\u003c/sup\u003e times to 82 anchor points.\u003c/p\u003e"},{"header":"Discussion","content":"\u003cp\u003eNavigating the complexity and scope of a protein fitness landscape is a long-standing challenge for protein design. We developed EvoScan, a novel system that combines EvolvR mutagenesis and phage selection to explore the protein sequence space in different dimensions. EvoScan can identify valuable anchors, which are variants with critical mutations that represent the sequence space. We showed that these anchor points can accurately reconstruct the space and design new proteins when coupled to deep learning methods (EvoAI), demonstrating the extreme compressibility of this space. Previous methods did not capture this insight likely because they only explored either the low-dimensional space by measuring single or double mutations, or a small region of the sequence space by saturating mutations. These methods thus might not capture the whole picture, especially the high-dimensional space (Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003ed).\u003c/p\u003e \u003cp\u003eOur approach has several important advantages over existing methods. First, it balances realistic fitness optimization and even sampling of sequence space, which can rapidly explore high dimensions and generate more diverse and functional variants, and provide richer information about sequence-function relationships. 
Second, by integrating empirical evolutionary scanning and deep learning models in EvoAI, we can leverage the strengths of both approaches. We could use the properties learned by deep learning to dynamically guide the scanning process. Future advances of explainable deep learning could uncover the underlying rules or patterns, and provide insights into how proteins adapt and overcome evolutionary constraints or trade-offs. Third, our method can evolve and investigate proteins that lack structural information, or that involve challenging interactions. We showed that EvoScan can capture anchors for proteins with diverse functions, such as protein-protein, protein-ligand, and protein-nucleic acid interactions. Our approach should be compatible with any biomolecular function that can be coupled to a transcriptional output (e.g., enzymes through small molecule sensors), and thus could be applied to study the sequence spaces of diverse biomolecules.\u003c/p\u003e \u003cp\u003eOur approach could be further improved in the future. We could use Cas9 variants with more PAM options to increase the guide RNA tiling and mutation-targeted segment selection. We could also modify the editing system to introduce mutations at multiple sites at once, avoiding host switching and speeding up the exploration process. Furthermore, incorporation of the target mutagenesis approach of EvoScan into PACE could potentially lead to deeper sampling of sequence space segments. In addition, integration of EvoScan with genotype reconstruction methods, such as Evoracle, could enable more systematic and intelligent exploration of the sequence space\u003csup\u003e\u003cspan citationid=\"CR26\" class=\"CitationRef\"\u003e26\u003c/span\u003e\u003c/sup\u003e. 
Moreover, the modularity of our system makes it highly suitable for automation, such as with the recently reported PRANCE method\u003csup\u003e\u003cspan citationid=\"CR51\" class=\"CitationRef\"\u003e51\u003c/span\u003e\u003c/sup\u003e, and the system could be scaled up to provide more comprehensive fitness landscape profiling data for different protein targets, which would reveal whether the extreme compressibility of the design space for high-fitness genotypes is universal or unusual, or whether the whole protein fitness landscape is compressible.\u003c/p\u003e \u003cp\u003eWe also hope that our method will inspire new insights into the relationship between genotype and phenotype and the evolution of biological systems. The compressibility of the design space may suggest that nature somehow finds a way to search through the seemingly infinite space by Darwinian evolution within the relatively short time that life has existed on Earth, possibly by \u0026ldquo;jumping\u0026rdquo; between these anchors instead of searching every possibility (Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003ed). Genetic recombination in large sexual populations could possibly enable this \u0026ldquo;jumping\u0026rdquo; and boost evolution rates\u003csup\u003e\u003cspan citationid=\"CR53\" class=\"CitationRef\"\u003e53\u003c/span\u003e,\u003cspan citationid=\"CR54\" class=\"CitationRef\"\u003e54\u003c/span\u003e\u003c/sup\u003e. 
Our approach would enable the investigation of such path dependence of evolutionary outcomes of biological systems in high-throughput experiments\u003csup\u003e\u003cspan citationid=\"CR51\" class=\"CitationRef\"\u003e51\u003c/span\u003e,\u003cspan citationid=\"CR52\" class=\"CitationRef\"\u003e52\u003c/span\u003e\u003c/sup\u003e and provide valuable insights for evolution and protein design in biotechnology and biomedical applications.\u003c/p\u003e"},{"header":"Declarations","content":"\u003cp\u003eAcknowledgements\u003c/p\u003e\n\u003cp\u003eThis study was supported by Ministry of Science and Technology of China grant 2021YFA0911000 (S.Z.), National Natural Science Foundation of China grant 32171416 (S.Z.) and U22A20552 (S.Z.), Tsinghua University Dushi Plan Foundation (S.Z.), U.S. NIH R01 EB022376/EB031172 (D.R.L.), and R35 GM118062 (D.R.L.). We thank J. Zheng (Westlake University) for helpful discussions. We thank C. Zhang (Tsinghua University) for the kind gift of the EvolvR gene. We apologize to authors whose work cannot be cited owing to referencing restrictions.\u003c/p\u003e\n\u003cp\u003eContributions\u003c/p\u003e\n\u003cp\u003eS.Z. conceptualized and supervised the project. S.Z., Z.M. and W.L. designed the experiments. Z.M., W.L., and H.Q. performed the evolution experiments in EvoScan. Z.M., W.L., Y.S., and Z.L. performed the flow cytometry assays and phage propagation assays of obtained variants. G.L. conducted the mammalian cell experiments. H.G., B.T., Y.X., and J.C. designed and developed the deep learning models. Z.M., and Y.S. wrote the first draft. B.W.T., D.R.L., C.A.V., and S.Z. wrote the final manuscript. All authors contributed to the drafting and revision of the manuscript.\u003c/p\u003e\n\u003cp\u003eCompeting interests\u003c/p\u003e\n\u003cp\u003eS.Z. and Z.M. have filed a patent application based on this work.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\n \u003cli\u003eLovelock, S. 
L.\u003cem\u003e\u0026nbsp;et al.\u003c/em\u003e The road to fully programmable protein catalysis. \u003cem\u003eNature\u003c/em\u003e \u003cstrong\u003e606\u003c/strong\u003e, 49-58 (2022).\u003c/li\u003e\n \u003cli\u003eLabanieh, L. \u0026amp; Mackall, C. L. CAR immune cells: design principles, resistance and the next generation. \u003cem\u003eNature\u003c/em\u003e \u003cstrong\u003e614\u003c/strong\u003e, 635-648 (2023).\u003c/li\u003e\n \u003cli\u003eDumontet, C., Reichert, J. M., Senter, P. D., Lambert, J. M. \u0026amp; Beck, A. Antibody\u0026ndash;drug conjugates come of age in oncology. \u003cem\u003eNat. Rev. Drug Discov.\u003c/em\u003e \u003cstrong\u003e22\u003c/strong\u003e, 641-661\u003cem\u003e\u0026nbsp;\u003c/em\u003e(2023).\u003c/li\u003e\n \u003cli\u003eMacken, C. A. \u0026amp; Perelson, A. S. Protein evolution on rugged landscapes. \u003cem\u003eProc. Natl Acad. Sci. USA\u003c/em\u003e \u003cstrong\u003e86\u003c/strong\u003e, 6191-6195 (1989).\u003c/li\u003e\n \u003cli\u003eLutz, S. Beyond directed evolution\u0026mdash;semi-rational protein engineering and design. \u003cem\u003eCurr. Opin. Biotechnol.\u003c/em\u003e \u003cstrong\u003e21\u003c/strong\u003e, 734-743 (2010).\u003c/li\u003e\n \u003cli\u003eDing, X., Zou, Z. \u0026amp; Brooks III, C. L. Deciphering protein evolution and fitness landscapes with latent space models. \u003cem\u003eNat. Commun.\u0026nbsp;\u003c/em\u003e\u003cstrong\u003e10\u003c/strong\u003e, 5644 (2019).\u003c/li\u003e\n \u003cli\u003eTian, P. \u0026amp; Best, R. B. Exploring the sequence fitness landscape of a bridge between protein folds. \u003cem\u003ePLoS Comput. Biol.\u003c/em\u003e \u003cstrong\u003e16\u003c/strong\u003e, e1008285 (2020).\u003c/li\u003e\n \u003cli\u003eFernandez-de-Cossio-Diaz, J., Uguzzoni, G. \u0026amp; Pagnani, A. Unsupervised inference of protein fitness landscape from deep mutational scan. \u003cem\u003eMol. Biol. 
Evol.\u003c/em\u003e \u003cstrong\u003e38\u003c/strong\u003e, 318-328 (2021).\u003c/li\u003e\n \u003cli\u003eD\u0026rsquo;Costa, S., Hinds, E. C., Freschlin, C. R., Song, H. \u0026amp; Romero, P. A. Inferring protein fitness landscapes from laboratory evolution experiments. \u003cem\u003ePLoS Comput. Biol.\u003c/em\u003e \u003cstrong\u003e19\u003c/strong\u003e, e1010956 (2023).\u003c/li\u003e\n \u003cli\u003eFowler, D. M. \u0026amp; Fields, S. Deep mutational scanning: a new style of protein science. \u003cem\u003eNat. Methods\u003c/em\u003e \u003cstrong\u003e11\u003c/strong\u003e, 801-807 (2014).\u003c/li\u003e\n \u003cli\u003eStiffler, M. A., Hekstra, D. R. \u0026amp; Ranganathan, R. Evolvability as a function of purifying selection in TEM-1 \u0026beta;-lactamase. \u003cem\u003eCell\u003c/em\u003e \u003cstrong\u003e160\u003c/strong\u003e, 882-892 (2015).\u003c/li\u003e\n \u003cli\u003eZheng, L., Baumann, U. \u0026amp; Reymond, J.-L. An efficient one-step site-directed and site-saturation mutagenesis protocol. \u003cem\u003eNucleic Acids Res.\u003c/em\u003e \u003cstrong\u003e32\u003c/strong\u003e, e115 (2004).\u003c/li\u003e\n \u003cli\u003eMcLaughlin Jr, R. N., Poelwijk, F. J., Raman, A., Gosal, W. S. \u0026amp; Ranganathan, R. The spatial architecture of protein function and adaptation. \u003cem\u003eNature\u003c/em\u003e \u003cstrong\u003e491\u003c/strong\u003e, 138-142 (2012).\u003c/li\u003e\n \u003cli\u003eCadwell, R. C. \u0026amp; Joyce, G. F. Randomization of genes by PCR mutagenesis. \u003cem\u003eGenome Res.\u0026nbsp;\u003c/em\u003e\u003cstrong\u003e2\u003c/strong\u003e, 28-33 (1992).\u003c/li\u003e\n \u003cli\u003eVanhercke, T., Ampe, C., Tirry, L. \u0026amp; Denolf, P. Reducing mutational bias in random protein libraries. \u003cem\u003eAnal. Biochem.\u003c/em\u003e \u003cstrong\u003e339\u003c/strong\u003e, 9-14 (2005).\u003c/li\u003e\n \u003cli\u003eEsvelt, K. M., Carlson, J. C. \u0026amp; Liu, D. R. 
A system for the continuous directed evolution of biomolecules. \u003cem\u003eNature\u003c/em\u003e \u003cstrong\u003e472\u003c/strong\u003e, 499-503 (2011).\u003c/li\u003e\n \u003cli\u003eMiller, S. M., Wang, T. \u0026amp; Liu, D. R. Phage-assisted continuous and non-continuous evolution. \u003cem\u003eNat. Protoc.\u003c/em\u003e \u003cstrong\u003e15\u003c/strong\u003e, 4101-4127 (2020).\u003c/li\u003e\n \u003cli\u003eRavikumar, A., Arzumanyan, G. A., Obadi, M. K. A., Javanpour, A. A. \u0026amp; Liu, C. C. Scalable, Continuous Evolution of Genes at Mutation Rates above Genomic Error Thresholds. \u003cem\u003eCell\u003c/em\u003e \u003cstrong\u003e175\u003c/strong\u003e, 1946-1957.e1913 (2018).\u003c/li\u003e\n \u003cli\u003eSarkisyan, K. S.\u003cem\u003e\u0026nbsp;et al.\u003c/em\u003e Local fitness landscape of the green fluorescent protein. \u003cem\u003eNature\u003c/em\u003e \u003cstrong\u003e533\u003c/strong\u003e, 397-401 (2016).\u003c/li\u003e\n \u003cli\u003eHopf, T. A.\u003cem\u003e\u0026nbsp;et al.\u003c/em\u003e Mutation effects predicted from sequence co-variation. \u003cem\u003eNat. Biotechnol.\u0026nbsp;\u003c/em\u003e\u003cstrong\u003e35\u003c/strong\u003e, 128-135 (2017).\u003c/li\u003e\n \u003cli\u003eRiesselman, A. J., Ingraham, J. B. \u0026amp; Marks, D. S. Deep generative models of genetic variation capture the effects of mutations. \u003cem\u003eNat. Methods\u003c/em\u003e \u003cstrong\u003e15\u003c/strong\u003e, 816-822 (2018).\u003c/li\u003e\n \u003cli\u003eYang, K. K., Wu, Z. \u0026amp; Arnold, F. H. Machine-learning-guided directed evolution for protein engineering. \u003cem\u003eNat. Methods\u003c/em\u003e \u003cstrong\u003e16\u003c/strong\u003e, 687-694 (2019).\u003c/li\u003e\n \u003cli\u003eLuo, Y.\u003cem\u003e\u0026nbsp;et al.\u003c/em\u003e ECNet is an evolutionary context-integrated deep learning framework for protein engineering. \u003cem\u003eNat. 
Commun.\u003c/em\u003e \u003cstrong\u003e12\u003c/strong\u003e, 5743 (2021).\u003c/li\u003e\n \u003cli\u003eWu, Z., Johnston, K. E., Arnold, F. H. \u0026amp; Yang, K. K. Protein sequence design with deep generative models. \u003cem\u003eCurr. Opin. Chem. Biol.\u003c/em\u003e \u003cstrong\u003e65\u003c/strong\u003e, 18-27 (2021).\u003c/li\u003e\n \u003cli\u003eSomermeyer, L. G.\u003cem\u003e\u0026nbsp;et al.\u003c/em\u003e Heterogeneity of the GFP fitness landscape and data-driven protein design. \u003cem\u003eElife\u003c/em\u003e \u003cstrong\u003e11\u003c/strong\u003e, e75842 (2022).\u003c/li\u003e\n \u003cli\u003eShen, M. W., Zhao, K. T. \u0026amp; Liu, D. R. Reconstruction of evolving gene variants and fitness from short sequencing reads. \u003cem\u003eNat. Chem. Biol.\u003c/em\u003e \u003cstrong\u003e17\u003c/strong\u003e, 1188-1198 (2021).\u003c/li\u003e\n \u003cli\u003eBiswas, S., Khimulya, G., Alley, E. C., Esvelt, K. M. \u0026amp; Church, G. M. Low-N protein engineering with data-efficient deep learning. \u003cem\u003eNat. Methods\u003c/em\u003e \u003cstrong\u003e18\u003c/strong\u003e, 389-396 (2021).\u003c/li\u003e\n \u003cli\u003ePapkou, A., Garcia-Pastor, L., Escudero, J. A. \u0026amp; Wagner, A. A rugged yet easily navigable fitness landscape. \u003cem\u003eScience\u003c/em\u003e \u003cstrong\u003e382\u003c/strong\u003e, eadh3860 (2023).\u003c/li\u003e\n \u003cli\u003eHalperin, S. O.\u003cem\u003e\u0026nbsp;et al.\u003c/em\u003e CRISPR-guided DNA polymerases enable diversification of all nucleotides in a tunable window. \u003cem\u003eNature\u003c/em\u003e \u003cstrong\u003e560\u003c/strong\u003e, 248-252 (2018).\u003c/li\u003e\n \u003cli\u003eBaas, P. DNA replication of single-stranded Escherichia coli DNA phages. \u003cem\u003eBiochim. Biophys. Acta, Gene Struct. 
Expression\u003c/em\u003e \u003cstrong\u003e825\u003c/strong\u003e, 111-139 (1985).\u003c/li\u003e\n \u003cli\u003eJinek, M.\u003cem\u003e\u0026nbsp;et al.\u003c/em\u003e A programmable dual-RNA\u0026ndash;guided DNA endonuclease in adaptive bacterial immunity. \u003cem\u003eScience\u003c/em\u003e \u003cstrong\u003e337\u003c/strong\u003e, 816-821 (2012).\u003c/li\u003e\n \u003cli\u003eRan, F. A.\u003cem\u003e\u0026nbsp;et al.\u003c/em\u003e Genome engineering using the CRISPR-Cas9 system. \u003cem\u003eNat. Protoc.\u003c/em\u003e \u003cstrong\u003e8\u003c/strong\u003e, 2281-2308 (2013).\u003c/li\u003e\n \u003cli\u003eDietsch, F.\u003cem\u003e\u0026nbsp;et al.\u003c/em\u003e Small p53 derived peptide suitable for robust nanobodies dimerization. \u003cem\u003eJ. Immunol. Methods\u003c/em\u003e \u003cstrong\u003e498\u003c/strong\u003e, 113144 (2021).\u003c/li\u003e\n \u003cli\u003eDi Lallo, G., Castagnoli, L., Ghelardini, P. \u0026amp; Paolozzi, L. A two-hybrid system based on chimeric operator recognition for studying protein homo/heterodimerization in Escherichia coli. \u003cem\u003eMicrobiology\u003c/em\u003e \u003cstrong\u003e147\u003c/strong\u003e, 1651-1656 (2001).\u003c/li\u003e\n \u003cli\u003eGao, K.\u003cem\u003e\u0026nbsp;et al.\u003c/em\u003e Perspectives on SARS-CoV-2 main protease inhibitors. \u003cem\u003eJ. Med. Chem.\u003c/em\u003e \u003cstrong\u003e64\u003c/strong\u003e, 16922-16955 (2021).\u003c/li\u003e\n \u003cli\u003eLi, J.\u003cem\u003e\u0026nbsp;et al.\u003c/em\u003e Structural basis of the main proteases of coronavirus bound to drug candidate PF-07321332. \u003cem\u003eJ. Virol.\u003c/em\u003e \u003cstrong\u003e96\u003c/strong\u003e, e02013-02021 (2022).\u003c/li\u003e\n \u003cli\u003eFu, L.\u003cem\u003e\u0026nbsp;et al.\u003c/em\u003e Both Boceprevir and GC376 efficaciously inhibit SARS-CoV-2 by targeting its main protease. \u003cem\u003eNat. 
Commun.\u003c/em\u003e \u003cstrong\u003e11\u003c/strong\u003e, 4417 (2020).\u003c/li\u003e\n \u003cli\u003eOwen, D. R.\u003cem\u003e\u0026nbsp;et al.\u003c/em\u003e An oral SARS-CoV-2 M\u003csup\u003epro\u003c/sup\u003e inhibitor clinical candidate for the treatment of COVID-19. \u003cem\u003eScience\u003c/em\u003e \u003cstrong\u003e374\u003c/strong\u003e, 1586-1593 (2021).\u003c/li\u003e\n \u003cli\u003eIketani, S.\u003cem\u003e\u0026nbsp;et al.\u003c/em\u003e Functional map of SARS-CoV-2 3CL protease reveals tolerant and immutable sites. \u003cem\u003eCell Host Microbe\u003c/em\u003e \u003cstrong\u003e30\u003c/strong\u003e,1354-1362 (2022).\u003c/li\u003e\n \u003cli\u003eIketani, S.\u003cem\u003e\u0026nbsp;et al.\u003c/em\u003e Multiple pathways for SARS-CoV-2 resistance to nirmatrelvir. \u003cem\u003eNature\u0026nbsp;\u003c/em\u003e\u003cstrong\u003e14\u003c/strong\u003e, 1716-1726 (2022).\u003c/li\u003e\n \u003cli\u003eDickinson, B. C., Packer, M. S., Badran, A. H. \u0026amp; Liu, D. R. A system for the continuous directed evolution of proteases rapidly reveals drug-resistance mutations. \u003cem\u003eNat. Commun.\u003c/em\u003e \u003cstrong\u003e5\u003c/strong\u003e, 5352 (2014).\u003c/li\u003e\n \u003cli\u003ePacker, M. S., Rees, H. A. \u0026amp; Liu, D. R. Phage-assisted continuous evolution of proteases with altered substrate specificity. \u003cem\u003eNat. Commun.\u003c/em\u003e \u003cstrong\u003e8\u003c/strong\u003e, 956 (2017).\u003c/li\u003e\n \u003cli\u003eBlum, T. R.\u003cem\u003e\u0026nbsp;et al.\u003c/em\u003e Phage-assisted evolution of botulinum neurotoxin proteases with reprogrammed specificity. \u003cem\u003eScience\u003c/em\u003e \u003cstrong\u003e371\u003c/strong\u003e, 803-810 (2021).\u003c/li\u003e\n \u003cli\u003eIketani, S.\u003cem\u003e\u0026nbsp;et al.\u003c/em\u003e Functional map of SARS-CoV-2 3CL protease reveals tolerant and immutable sites. 
\u003cem\u003eCell Host Microbe\u003c/em\u003e \u003cstrong\u003e30\u003c/strong\u003e, 1354-1362. e1356 (2022).\u003c/li\u003e\n \u003cli\u003eIketani, S.\u003cem\u003e\u0026nbsp;et al.\u003c/em\u003e Multiple pathways for SARS-CoV-2 resistance to nirmatrelvir. \u003cem\u003eNature\u003c/em\u003e \u003cstrong\u003e613\u003c/strong\u003e, 558-564 (2023).\u003c/li\u003e\n \u003cli\u003eNashed, N. T., Aniana, A., Ghirlando, R., Chiliveri, S. C. \u0026amp; Louis, J. M. Modulation of the monomer-dimer equilibrium and catalytic activity of SARS-CoV-2 main protease by a transition-state analog inhibitor. \u003cem\u003eCommun. Biol.\u003c/em\u003e \u003cstrong\u003e5\u003c/strong\u003e, 160 (2022).\u003c/li\u003e\n \u003cli\u003eStanton, B. C.\u003cem\u003e\u0026nbsp;et al.\u003c/em\u003e Genomic mining of prokaryotic repressors for orthogonal logic gates. \u003cem\u003eNat. Chem. Biol.\u003c/em\u003e \u003cstrong\u003e10\u003c/strong\u003e, 99-105 (2014).\u003c/li\u003e\n \u003cli\u003eRamos, J. L. \u003cem\u003eet al.\u003c/em\u003e The TetR family of transcriptional repressors. \u003cem\u003eMicrobiol. Mol. Biol. Rev.\u003c/em\u003e \u003cstrong\u003e69\u003c/strong\u003e, 326-356 (2005).\u003c/li\u003e\n \u003cli\u003eNielsen, A. A.\u003cem\u003e\u0026nbsp;et al.\u003c/em\u003e Genetic circuit design automation. \u003cem\u003eScience\u003c/em\u003e \u003cstrong\u003e352\u003c/strong\u003e, aac7341 (2016).\u003c/li\u003e\n \u003cli\u003eBrophy, J. A. N. \u0026amp; Voigt, C. A. Principles of genetic circuit design. \u003cem\u003eNat. Methods\u003c/em\u003e \u003cstrong\u003e11\u003c/strong\u003e, 508-520 (2014).\u003c/li\u003e\n \u003cli\u003eDeBenedictis, E. A.\u003cem\u003e\u0026nbsp;et al.\u003c/em\u003e Systematic molecular evolution enables robust biomolecule discovery. \u003cem\u003eNat. Methods\u003c/em\u003e \u003cstrong\u003e19\u003c/strong\u003e, 55-64 (2021).\u003c/li\u003e\n \u003cli\u003eDickinson, B. C., Leconte, A. M., Allen, B., Esvelt, K. M. 
\u0026amp; Liu, D. R. Experimental interrogation of the path dependence and stochasticity of protein evolution using phage-assisted continuous evolution. \u003cem\u003eProc. Natl Acad. Sci. USA\u003c/em\u003e \u003cstrong\u003e110\u003c/strong\u003e, 9007-9012 (2013).\u003c/li\u003e\n \u003cli\u003eWeinreich, D. M. \u0026amp; Chao, L. Rapid evolutionary escape by large populations from local fitness peaks is likely in nature. \u003cem\u003eEvolution\u003c/em\u003e \u003cstrong\u003e59\u003c/strong\u003e, 1175-1182 (2005).\u003c/li\u003e\n \u003cli\u003eWeissman, D. B., Feldman, M. W. \u0026amp; Fisher, D. S. The Rate of Fitness-Valley Crossing in Sexual Populations. \u003cem\u003eGenetics\u003c/em\u003e \u003cstrong\u003e186\u003c/strong\u003e, 1389-1410 (2010).\u003c/li\u003e\n\u003c/ol\u003e"},{"header":"Materials and Methods","content":"\u003cp\u003e\u003cstrong\u003eGeneral methods.\u003c/strong\u003e The following working concentrations of antibiotics were used: carbenicillin (Solarbio, 50 μg/ml), kanamycin (Solarbio, 50 μg/ml), spectinomycin (Macklin, 50 μg/ml), chloramphenicol (Macklin, 25 μg/ml). PHANTA 2x mix (Vazyme) was used for cloning PCR, and Flash 2x mix (Vazyme) was used for verification PCR and Sanger sequencing (Tsingke Bioscience). All cloning fragments were assembled by Golden Gate assembly (New England Biolabs) or ClonExpress assembly (Vazyme) methods. Plasmids were cloned in DH5α competent cells (HT Health). Synthetic genes were ordered from Tsingke Bioscience. Cloned plasmids were extracted by Tiangen DNA extraction kit. \u003cem\u003eE. coli\u003c/em\u003e strain S2060\u003csup\u003e55\u003c/sup\u003e was used in all aspects of the EvoScan process, including system construction, evolution, and plaque assays. The DH5α strain was used for flow cytometry experiments. 
Detailed information on the plasmids and selection phage (SP) used in this work is given in Supplementary Table 6.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003ePhage propagation assay.\u0026nbsp;\u003c/strong\u003eCompetent S2060 cells were transformed with the corresponding accessory plasmid (AP) in each experiment. Overnight cultures of single colonies inoculated in LB medium with proper antibiotics were diluted 50- or 100-fold and grown at 37 ℃ in a 220 rpm shaker (ZQZY-B8, cultured in shake tubes, 5 ml system) or a 1000 rpm shaker (HUXI HW-400TG, cultured in 96-deep well plate, 500 μl system) to log phase (OD\u003csub\u003e600\u003c/sub\u003e ~ 0.4–0.6). These cells were then infected with SP at an initial titer of 5 × 10\u003csup\u003e6\u003c/sup\u003e plaque-forming units (p.f.u.) per ml. The mixture was further cultured overnight (16–20 h) at 37 ℃ in the shakers as described above, and was then centrifuged at 4000 rpm for 10 min. The phage-containing supernatant was filtered through a 0.22 μm bacterial filter and stored at 4 ℃ for further use.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003ePlaque assay.\u003c/strong\u003e A single colony of chemically competent S2208 cells\u003csup\u003e55\u003c/sup\u003e (S2060 cells transformed with plasmid pJC175e) was cultured overnight in LB medium supplemented with proper antibiotics. The saturated bacterial culture was diluted 50- or 100-fold into LB medium with proper antibiotics and grown at 37 ℃ in a 220 rpm shaker to log phase (OD ~0.4–0.8) before use. Phages were serially diluted in LB medium in 6 to 8 steps of 10-fold each. Then, 10 μl of each phage dilution was mixed with 45 μl S2208 cells, and 180 μl of liquid (50–65℃) soft agar (LB medium and 0.5% agar) supplemented with 2% Bluo-gal (Inalco S.p.A.) was added and mixed by pipetting. The whole mixture was immediately added onto 500 μl of bottom agar (LB medium and 1.5% agar) previously prepared in a 24-well plate. 
The plates were then incubated at 37 ℃ overnight (14–18 h).\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eCalculation of fold propagation.\u0026nbsp;\u003c/strong\u003eTo measure the fold propagation of the selection phage, the initial and final phage titers were measured by plaque assays. We defined the ratio of the final phage titer to the initial phage titer as the fold propagation of the phage.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eBasic process of evolutionary scanning (EvoScan).\u003c/strong\u003e The target mutagenesis plasmid (TP) was first transformed into chemically competent S2060 cells, and the resulting S2060-TP cells were used to prepare super chemically competent cells by the Inoue method\u003csup\u003e56\u003c/sup\u003e. Chemically competent S2060-TP cells were transformed with the corresponding APs. The resulting S2060-TP-AP bacteria were cultured overnight, diluted 50–100 times into 500 μl LB medium with antibiotics and inducers, and grown at 37 ℃ in a 1000 rpm shaker to OD ~0.5. The phage titer for the first infection was around 5×10\u003csup\u003e6\u003c/sup\u003e–5×10\u003csup\u003e8\u003c/sup\u003e p.f.u./ml, and for the following passages the phages were subjected to a 1:50 or 1:100 dilution. Vanillic acid (Sigma-Aldrich, dissolved in ethanol) at a final concentration of 50 μM was added to induce the expression of the nCas9-PolIM5 complex. The mixture was then cultured at 37 ℃ in a 1000 rpm shaker overnight. The next day, the mixture was centrifuged at 4000 rpm for 10 min, and the phage content of the collected supernatant was verified by PCR (Flash 2x mix) and Sanger sequencing. The supernatant was then used for the plaque assay as described above. Single plaques from the plaque assay were picked and further verified by PCR (Flash 2x mix). 
The PCR product was sent for Sanger sequencing.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eSearching steps in each route.\u003c/strong\u003e For each step of EvoScan in a route, 10 μl of supernatant containing evolved phages was added to 1 ml of log-phase S2208 bacterial culture (OD ~0.4–0.8) and propagated overnight in a 96-deep well plate. The mixture was centrifuged at 4000 rpm for 10 min and filtered through a 0.22 μm bacterial filter. The resulting phages were then diluted and used to infect host cells containing a different AP at an infection titer of 5×10\u003csup\u003e6\u003c/sup\u003e p.f.u./ml.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eBasic process of phage-assisted non-continuous evolution (PANCE)\u003c/strong\u003e. The accessory plasmid with the designed genetic circuit and the mutagenesis plasmid MP6 were co-transformed into super chemically competent S2060 cells. The S2060-MP6-AP bacteria were cultured overnight, diluted 50–100 times into 500 μl LB medium with antibiotics and inducers in a 96-deep well plate, and grown at 37 ℃ in a 1000 rpm shaker to OD ~0.5. The initial phage titer was around 5×10\u003csup\u003e6\u003c/sup\u003e–5×10\u003csup\u003e8\u003c/sup\u003e p.f.u./ml, and the phages were subjected to a 1:10–1:100 dilution in the following passages in a 500 μl system. Arabinose (1% m/v, dissolved in ddH\u003csub\u003e2\u003c/sub\u003eO) was added as the inducer of MP6. Phages were then collected to obtain mutations following the same procedures as in EvoScan.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eInduced expression assay.\u0026nbsp;\u003c/strong\u003eSingle colonies of the strains to be tested were cultured in LB medium overnight. The saturated bacterial culture was diluted 100 times in LB medium with proper antibiotics and inducers, and cultured at 37 ℃ in a 1000 rpm shaker for 2 h (OD ~ 0.4). Then, 2 μl of the log-phase bacterial culture was added to LB with proper antibiotics and inducers to a total volume of 500 μl. 
The mixture was cultured for 5 hours in the 96-deep well plate.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eFlow cytometry assay.\u0026nbsp;\u003c/strong\u003e10 μl of the culture was added to 190 μl PBS with 2 g/L kanamycin in a 96-well U-bottom plate to stop cell growth. The plate was stored at 4 ℃ until use. A flow cytometer (Beckman Coulter Cytoflex S) was used to quantify the expression levels of fluorescent protein. FlowJo v10 was used to gate the events (at least 10,000 events) and calculate the median fluorescence of each sample.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eM\u003csup\u003epro\u003c/sup\u003e drug resistance index.\u0026nbsp;\u003c/strong\u003eIn the RTHS protease activity assay, the fluorescence of the experimental group carrying eYFP was measured with or without addition of the M\u003csup\u003epro\u003c/sup\u003e inhibitor GC376 or PF-07321332. The ratio of the FITC-A median with inhibitor to the FITC-A median without inhibitor was defined as the resistance index (RI) and used to evaluate the drug resistance of different M\u003csup\u003epro\u003c/sup\u003e variants.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eStructure display and interaction prediction.\u0026nbsp;\u003c/strong\u003eSchrödinger 2017 was used for structural display. ZDOCK\u003csup\u003e57\u003c/sup\u003e was used to predict the interaction structure between EGFP and its nanobody. Interactions between M\u003csup\u003epro\u003c/sup\u003e and inhibitors within 3 Å are shown in the figure.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eFold repression calculation.\u003c/strong\u003e The background fluorescence of cells, defined as the median fluorescence of bacteria carrying an empty plasmid with only the backbone, was measured and subtracted from all the experimental groups. 
The background-subtracted fluorescence of the uninduced group (no repressor expression) was divided by that of the induced group (repressor expression) to obtain the fold repression.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eRelative expression level calculation.\u003c/strong\u003e Using the flow cytometry assay, we measured the FITC-A median of the strain carrying the empty plasmid and set this value as the background value. The FITC-A median of the strain carrying the standard plasmid expressing eYFP through the open reading frame J23101-B0064-YFP was measured the same way and set as the standard value. The FITC-A median of the strain containing a specific variant was measured the same way, and the relative expression level was defined as: (variant value – background value)/standard value.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eCircuit score calculation.\u0026nbsp;\u003c/strong\u003eThe strain carrying the plasmid with a specific genetic circuit was prepared for the flow cytometry assay. IPTG (1 mM) and vanillic acid (100 μM) were used as the input signals. YFP was used as the output reporter of the circuit, and the FITC-A median of each state was measured. The lowest ON signal (lowest FITC-A median in “ON” states of the circuit) was divided by the highest OFF signal (highest FITC-A median in “OFF” states of the circuit) to obtain the circuit score.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAmeR phylogenetic tree construction.\u003c/strong\u003e Protein sequences of the 82 variants and the WT were collected in a FASTA file, which was input into MEGA11 for multiple sequence alignment (MSA)\u003csup\u003e58\u003c/sup\u003e. After MSA and phylogenetic analysis, the neighbor-joining method was selected for tree construction. 
The output tree was annotated with iTOL\u003csup\u003e59\u003c/sup\u003e, with all parameters set to default values.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eEpistasis calculation.\u003c/strong\u003e Epistasis between two different mutations, A and B, was calculated as ε = f\u003csub\u003eab\u003c/sub\u003e + f\u003csub\u003eAB\u003c/sub\u003e – f\u003csub\u003eAb\u003c/sub\u003e – f\u003csub\u003eaB\u003c/sub\u003e, where f\u003csub\u003eab\u003c/sub\u003e is the fitness of the wild-type genotype, f\u003csub\u003eAB\u003c/sub\u003e that of the double mutant, and f\u003csub\u003eAb\u003c/sub\u003e and f\u003csub\u003eaB\u003c/sub\u003e those of the two single mutants. ε \u0026gt; 0 means positive epistasis, while ε \u0026lt; 0 means negative epistasis.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eMammalian cell culture and transfections.\u0026nbsp;\u003c/strong\u003eHEK293T cells (CRL-3216, ATCC) were cultured in Dulbecco’s modified Eagle’s medium (DMEM, Gibco) supplemented with 10% (v/v) fetal bovine serum (FBS, Biological Industries) and 1% (v/v) penicillin/streptomycin solution (Beyotime) at 37 °C, 100% humidity and 5% CO\u003csub\u003e2\u003c/sub\u003e. In transfection experiments, 60,000–80,000 HEK293T cells in 0.2 ml of DMEM complete medium were seeded into each well of 48-well plastic plates (NEST) and grown for ~24 h. M5 HiPer Lipo2000 Transfection Reagent (Lipo2000, Mei5bio) was used in all transfection experiments following the manufacturer’s protocol. Briefly, a sample mixture was prepared by mixing 150 ng repressor plasmid or 150 ng control plasmid (repressor-deficient) with 150 ng reporter plasmid in 0.7 μl Lipo2000. The mixture was incubated at room temperature for 20 min before being added to the cells. Transfections were supplemented with 0.2 ml DMEM complete medium 24 h post-transfection. Cells were cultured for 2 days post-transfection before flow cytometry analysis.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eMammalian cell flow cytometry assay.\u003c/strong\u003e Cells were trypsinized 48 h after transfection and were then centrifuged at 250 × g for 10 min at room temperature. The supernatant was removed, and the cells were resuspended in 1 × PBS. 
Fluorescence values were measured with a Cytoflex flow cytometer (Beckman Coulter, Inc.). PB450-A and ECD-A channels were chosen for BFP and mCherry measurement, respectively. Data were processed using FlowJo (TreeStar) and gated by the area of the forward scatter and the side scatter (FSC-A/SSC-A); cell populations were then selected by gating out the background BFP signal of untransfected cells to obtain the median fluorescence. The median fluorescence was calculated for \u0026gt;20,000 transfected cells for each sample. To reduce expression noise between samples, the mCherry : BFP fluorescence ratio was used to report the repressor activity\u003csup\u003e60\u003c/sup\u003e. The mCherry : BFP fluorescence ratio was calculated as (mCherry - mCherry\u003csub\u003e0\u003c/sub\u003e)/(BFP - BFP\u003csub\u003e0\u003c/sub\u003e), where mCherry\u003csub\u003e0\u003c/sub\u003e and BFP\u003csub\u003e0\u003c/sub\u003e are the fluorescence values from untransfected HEK293T cells. The fold repression was calculated as (mCherry : BFP)\u003csub\u003eunrepressed\u003c/sub\u003e/(mCherry : BFP)\u003csub\u003erepressed\u003c/sub\u003e, where (mCherry : BFP)\u003csub\u003eunrepressed\u003c/sub\u003e and (mCherry : BFP)\u003csub\u003erepressed\u003c/sub\u003e are the ratios for cells co-transfected with the control plasmid or the repressor plasmid, respectively.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eFeature generation.\u0026nbsp;\u003c/strong\u003eWe first query the UniRef30_2021_03 and BFD multiple sequence alignment (MSA) databases. We then use AlphaFold2 to construct the structural representation of the wild-type protein. For this task, we use the GeoFitness-Seq variant of the pre-training model. For mutated proteins, structures are generated using FoldX 5. 
The sequence features are extracted from the large-scale protein language model ESM-2, for the purpose of capturing global context information. Consequently, each node in the Geometric Encoder is initialized by the embedding of the corresponding residue derived from the ESM-2. Unlike conventional methodologies that rely upon inter-residue distances and contacts to establish edges, each edge in the Geometric Encoder is initialized by the relative geometric relationship between a pair of residues derived from the protein 3D structure\u003csup\u003e61\u003c/sup\u003e.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eCross-validation.\u003c/strong\u003e We employed a 10-fold cross-validation approach to find the hyperparameters of the model. The dataset, comprising 82 mutational data points, was divided into three parts: a training set (59 samples), a validation set (7 samples), and a test set (16 samples). Model evaluation was performed using the Spearman correlation coefficient (ρ) as the primary assessment metric.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eModel training details.\u0026nbsp;\u003c/strong\u003eThe model employs the Soft Rank Loss as its loss function, with a learning rate of 10\u003csup\u003e-3\u003c/sup\u003e, Adam optimizer, and a decay rate for the learning rate. The training spans across 50 epochs. Subsequently, the learning rate of the upstream GeoFitness model is set to 10\u003csup\u003e-4\u003c/sup\u003e, while the learning rate of the downstream model is adjusted to 5×10\u003csup\u003e-4\u003c/sup\u003e for further fine-tuning.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003e55\u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp;Carlson, J. C., Badran, A. H., Guggiana-Nilo, D. A. \u0026amp; Liu, D. R. Negative selection and stringency modulation in phage-assisted continuous evolution. \u003cem\u003eNat. Chem. 
Biol.\u003c/em\u003e \u003cstrong\u003e10\u003c/strong\u003e, 216-222 (2014).\u0026nbsp;\u003c/p\u003e\n\u003cp\u003e56\u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp;Green, M. R. \u0026amp; Sambrook, J. The Inoue Method for Preparation and Transformation of Competent Escherichia coli:\" Ultracompetent\" Cells. \u003cem\u003eCold Spring Harb Protoc.\u003c/em\u003e \u003cstrong\u003e2020\u003c/strong\u003e, 101196 (2020).\u0026nbsp;\u003c/p\u003e\n\u003cp\u003e57\u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp;Chen, R., Li, L. \u0026amp; Weng, Z. ZDOCK: an initial‐stage protein‐docking algorithm. \u003cem\u003eProteins: Struct., Funct., Bioinf.\u003c/em\u003e \u003cstrong\u003e52\u003c/strong\u003e, 80-87 (2003).\u0026nbsp;\u003c/p\u003e\n\u003cp\u003e58\u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp;Tamura, K., Stecher, G. \u0026amp; Kumar, S. MEGA11: molecular evolutionary genetics analysis version 11. \u003cem\u003eMol. Biol. Evol.\u003c/em\u003e \u003cstrong\u003e38\u003c/strong\u003e, 3022-3027 (2021).\u0026nbsp;\u003c/p\u003e\n\u003cp\u003e59\u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp;Letunic, I. \u0026amp; Bork, P. Interactive Tree Of Life (iTOL) v5: an online tool for phylogenetic tree display and annotation. \u003cem\u003eNucleic Acids Res.\u003c/em\u003e \u003cstrong\u003e49\u003c/strong\u003e, W293-W296 (2021).\u0026nbsp;\u003c/p\u003e\n\u003cp\u003e60\u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp;Liang, J. C., Chang, A. L., Kennedy, A. B. \u0026amp; Smolke, C. D. A high-throughput, quantitative cell-based screen for efficient tailoring of RNA device activity. \u003cem\u003eNucleic Acids Res.\u003c/em\u003e \u003cstrong\u003e40\u003c/strong\u003e, e154 (2012).\u0026nbsp;\u003c/p\u003e\n\u003cp\u003e61\u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp;Xu, Y., Liu, D. \u0026amp; Gong, H. 
Improving the prediction of protein stability changes upon mutations by geometric learning and a pre-training strategy. \u003cem\u003ebioRxiv\u003c/em\u003e, 2023.2005. 2028.542668 (2023).\u0026nbsp;\u003c/p\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":true,"hideJournal":false,"highlight":"","institution":"","isAcceptedByJournal":true,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"[email protected]","identity":"nature-portfolio","isNatureJournal":true,"hasQc":false,"allowDirectSubmit":false,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"","title":"Nature Portfolio","twitterHandle":"","acdcEnabled":false,"dfaEnabled":false,"editorialSystem":"ejp","reportingPortfolio":"","inReviewEnabled":true,"inReviewRevisionsEnabled":false},"keywords":"","lastPublishedDoi":"10.21203/rs.3.rs-3930833/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-3930833/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003eDesigning proteins with improved functions requires a deep understanding of how sequence and function are related, a vast space that is hard to explore. The ability to efficiently compress this space by identifying functionally important features is extremely valuable. Here, we first establish a method called EvoScan to comprehensively segment and scan the high-fitness sequence space to obtain anchor points that capture its essential features, especially in high dimensions. Our approach is compatible with any biomolecular function that can be coupled to a transcriptional output. 
We then develop deep learning and large language models to accurately reconstruct the space from these anchors, allowing computational prediction of novel, highly fit sequences without prior homology-derived or structural information. We apply this hybrid experimental-computational method, which we call EvoAI, to a repressor protein and find that only 82 anchors are sufficient to compress the high-fitness sequence space with a compression ratio of 10\u003csup\u003e48\u003c/sup\u003e. The extreme compressibility of the space informs both applied biomolecular design and understanding of natural evolution.\u003c/p\u003e","manuscriptTitle":"EvoAI enables extreme compression and reconstruction of the protein sequence space","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2024-02-23 19:00:28","doi":"10.21203/rs.3.rs-3930833/v1","editorialEvents":[],"status":"published","journal":{"display":true,"email":"[email protected]","identity":"nature-methods","isNatureJournal":true,"hasQc":false,"allowDirectSubmit":false,"externalIdentity":"nmeth","sideBox":"Learn more about [Nature Methods](http://www.nature.com/nmeth)","snPcode":"","submissionUrl":"","title":"Nature Methods","twitterHandle":"","acdcEnabled":true,"dfaEnabled":true,"editorialSystem":"ejp","reportingPortfolio":"Nature Research","inReviewEnabled":true,"inReviewRevisionsEnabled":false}}],"origin":"","ownerIdentity":"3b8d528f-0e55-40f5-a592-f97d7746bbd0","owner":[],"postedDate":"February 23rd, 2024","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"published-in-journal","subjectAreas":[{"id":28925117,"name":"Biological sciences/Systems biology/Synthetic biology"},{"id":28925118,"name":"Biological sciences/Biological techniques"},{"id":28925119,"name":"Biological sciences/Computational biology and bioinformatics"},{"id":28925120,"name":"Biological sciences/Systems biology/Molecular engineering/Synthetic 
biology"}],"tags":[],"updatedAt":"2024-11-12T08:07:35+00:00","versionOfRecord":{"articleIdentity":"rs-3930833","link":"https://doi.org/10.1038/s41592-024-02504-2","journal":{"identity":"nature-methods","isVorOnly":false,"title":"Nature Methods"},"publishedOn":"2024-11-11 05:00:00","publishedOnDateReadable":"November 11th, 2024"},"versionCreatedAt":"2024-02-23 19:00:28","video":"","vorDoi":"10.1038/s41592-024-02504-2","vorDoiUrl":"https://doi.org/10.1038/s41592-024-02504-2","workflowStages":[]},"version":"v1","identity":"rs-3930833","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-3930833","identity":"rs-3930833","version":["v1"]},"buildId":"qtupq5eGEP_6zYnWcrvyt","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}
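
The ratio metrics defined in Materials and Methods (fold propagation, resistance index, fold repression, relative expression level, circuit score, and epistasis) can be sketched as small functions. This is only an illustrative sketch, not the authors' analysis code: every function and argument name below is hypothetical, the inputs are assumed to be plaque-assay titers or FITC-A medians already exported from the cytometry software, and the epistasis formula is applied to whatever fitness scale the caller supplies.

```python
"""Illustrative helpers for the ratio metrics described in Materials and Methods.

All names are hypothetical; only the formulas are taken from the text.
"""


def fold_propagation(initial_titer, final_titer):
    """Final phage titer divided by initial phage titer (same units, e.g. p.f.u./ml)."""
    return final_titer / initial_titer


def resistance_index(median_with_inhibitor, median_without_inhibitor):
    """FITC-A median with inhibitor divided by FITC-A median without inhibitor."""
    return median_with_inhibitor / median_without_inhibitor


def fold_repression(uninduced, induced, background):
    """Background-subtracted uninduced signal divided by background-subtracted induced signal."""
    return (uninduced - background) / (induced - background)


def relative_expression(variant, background, standard):
    """(variant value - background value) / standard value, as stated in the text."""
    return (variant - background) / standard


def circuit_score(on_medians, off_medians):
    """Lowest ON-state median divided by highest OFF-state median."""
    return min(on_medians) / max(off_medians)


def epistasis(f_wt, f_single_a, f_single_b, f_double):
    """epsilon = f_ab + f_AB - f_Ab - f_aB (wild type + double mutant - both single mutants)."""
    return f_wt + f_double - f_single_a - f_single_b


if __name__ == "__main__":
    # Made-up numbers, for demonstration only.
    print(fold_propagation(5e6, 5e9))
    print(circuit_score([800.0, 1200.0], [40.0, 20.0]))
    print(epistasis(1.0, 2.0, 3.0, 6.0))
```

A positive return value from `epistasis` corresponds to positive epistasis and a negative value to negative epistasis, matching the sign convention in the Methods.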
