Self-similar Sequences Yield Higher Protein Expression in a Squid Ring Teeth Protein Library

2025 · doi:10.21203/rs.3.rs-7724325/v1

Melik Demirel, Khushank Singhal, Benjamin Allen, Thomas Baer

This is a preprint; it has not been peer reviewed by a journal. This work is licensed under a CC BY 4.0 License. Status: Posted (Version 1).

Abstract

Protein materials give biological systems remarkable abilities, including tunable control over structural, optical, electrical, self-healing, and thermal properties. High-throughput screening of structural proteins is essential for understanding and enhancing proteins by exploring full sequence spaces, as shown in directed evolution. A significant challenge is the poor yield of recombinant structural proteins, often due to toxicity issues (e.g., aggregation, cell stress, formation of inclusion bodies).
There is a need to address these issues and enhance the expression yield of structural proteins. Based on naturally observed squid ring teeth proteins, we introduce a structural protein library that allows us to explore a broad sequence space using a new high-throughput platform. We selected 33 amino acid sequence fragments from six different native squid species and constructed a protein library containing every possible fusion of four such fragments (33^4, approximately 1.2 million variants). We analyzed subsets of this library using a multi-step screening method that combines fluorescence-activated cell sorting (FACS) and fluorescent microcapillary-array-based screening to establish correlations between structural protein sequences in single cells and in clonal populations. Our workflow considers both protein expression and cell growth, supporting systematic genetic-design studies focused on protein expression yield. We observed that protein sequences with higher self-similarity tend to have greater expression levels. This suggests that self-similarity is a crucial design parameter for the heterologous expression of material-forming proteins with repetitive sequences, a factor that was not previously addressed. The ability to screen large libraries of structural proteins for expression and cell growth enables high-yield protein production, crucial for synthetic biology and biomanufacturing.

Subject areas: Biological sciences/Chemical biology/Protein design; Biological sciences/Biochemistry/Protein folding/Protein aggregation

Keywords: SRT protein, combinatorial library, sequence self-similarity, microcapillary arrays, FACS, recombinant proteins, high-throughput screening

INTRODUCTION

Structural proteins, such as silk, 1 collagen, 2 elastin, 3 keratin, 4 and squid ring teeth (SRT), 5 have been utilized by materials engineers due to their exceptional mechanical properties, chemical functionalities, and biodegradability.
Sequence-structure-property relationships are fundamental to the study of protein-based materials, particularly those derived from structural proteins. 6 Amino acid sequence influences protein folding, which in turn affects how the protein responds to environmental factors. 7 Many structural features, such as hydrophobicity caused by exposed amino acids, crystalline or amorphous regions, protein entanglements, and charged functional groups, contribute to the unique properties of protein materials. The resulting properties of these proteins and their composites with inorganic materials 8 are used in various fields, from materials science to biomedical engineering. 9 SRT has gained attention recently due to its unique composite structure and remarkable toughness, which enable quick and agile movements. 10 Squid use SRT proteins to build tough, resilient ring teeth that support effective defense and hunting strategies. 11 As a materials design platform, SRT proteins are multifunctional, exhibiting unique properties such as thermal conductivity switching, 12 rapid self-healing, 13 tunable hydrophobicity, 14 enhanced optical properties, 15 and triboelectric properties. 16 This variability presents a valuable opportunity to optimize key factors affecting biomanufacturability, enabling the identification of sequence designs that enhance protein yield, solubility, and stability for large-scale production. While SRT proteins and, in general, structural proteins 17 – 19 have been recombinantly engineered and produced at a laboratory scale, many challenges still exist in their biomanufacturing. 5 For example, the diversity of SRT protein sequences and their impact on expression and solubility are challenging to predict. 11 Efficient methods are needed to screen and select structural protein variants with high expression and solubility, as traditional protein engineering techniques often have low throughput and rely on trial-and-error approaches. Arakawa et al. 
emphasized the importance of global sampling by cataloging silk gene sequences from 1098 spider species and analyzing how amino acid motifs influence properties such as mechanical strength, thermal stability, and hydration in over 446 dragline silks. 20 The aforementioned study took several years to complete. Efficient protocols for protein library synthesis, like combinatorial methods, could supplement such studies by providing faster sequence-space coverage than previously thought. Rational design of structural proteins, inspired by natural protein sequences, enables further optimization, such as increasing the molecular weight of engineered proteins through intein-mediated polymerization. 21 , 22 However, these methods have been limited to a few protein sequences found in nature or their modifications, leaving room for significant improvements in their properties through modern synthetic biology techniques, especially by exploring large protein libraries. Several studies have concentrated on high-throughput screening of proteins. 23 – 25 Analyzing the entire sequence space for protein libraries is essential for understanding and improving proteins, as shown by directed evolution studies. 26 – 29 Mutagenesis is a powerful technique in protein engineering that enables the modification of proteins to create libraries of variants. 30 Site-directed mutagenesis enables precise changes at specific nucleotide positions to modify amino acids, a technique commonly used in rational protein design based on structural insights. Random mutagenesis, achieved through error-prone PCR or chemical mutagenesis, introduces a diverse range of mutations for screening. DNA shuffling is a directed evolution method that creates genetic diversity by recombining related DNA sequences. 
31, 32 This mimics natural recombination, creating a diverse library of variants that can be screened for improved traits such as higher enzymatic activity, increased stability, or altered substrate specificity. A key challenge limiting the large-scale use of recombinant structural proteins is their poor yield. Due to toxicity issues in cellular expression, the yield of structural proteins is not high enough to be cost-effective compared to polymers, a limitation that high-throughput screening approaches could address. In this study, we present the synthesis and high-throughput screening of a structural protein library inspired by SRT proteins. To efficiently manage the complexity of design and randomization, we use a tandem-repeat library construction method based on combinatorial DNA assembly, enabling us to generate and test multiple sequence variants systematically. 33 Our goal is to demonstrate the efficient synthesis of large libraries of structural proteins, especially those with repetitive DNA sequences. Along with our high-throughput screening platform, 34 this can lead to the discovery of sequences that enhance protein yield. The library contains over one million assembled variants from various protein fragments found in squid species. Although there are several library synthesis methods, such as random mutagenesis, 35 directed evolution, 36 and DNA shuffling, 37 we chose combinatorial library synthesis to preserve native protein sequences. 24

RESULTS

SRT combinatorial library

About 350 squid species are found worldwide, and we studied six of these species (Figs. 1a, b), each showing a unique SRT protein composition. These proteins have a segmented, copolymer-like structure, with fragments repeated in tandem. SRT proteins are classified as proteins with intrinsically disordered regions, where each repeat contributes to both crystalline (β-sheet) and amorphous domains, as verified by solid-state NMR studies 38 (Fig. 1c).
In this study, we created a library of SRT proteins with four fragments that have different amorphous regions while maintaining a constant crystalline sequence. Figure 1d shows the design strategy and amino acid composition. A set of 33 amorphous sequences found in nature, with similar chain lengths (16–21 amino acids), was assembled in a combinatorial manner, resulting in 33^4 = 1,185,921 variants. SRT proteins generally tend to form β-sheets, 38 and to enhance this native property, the crystal-forming sequence was kept constant. The predicted tertiary structures of several variants within the SRT protein library are displayed in Figure S1. Our library offers a comprehensive set of protein features for holistic screening and sequence-to-expression analysis. The amorphous fragments consist of amino acids from all classifications, including polar, electrically charged, hydrophobic, and special-case side chains (Figure S2). While the overall positive hydrophobicity score of all fragments makes them hydrophobic, there is considerable variation in their degree of hydrophobicity (Figure S3). Additionally, the differences in amino acid composition cause the fragments to exhibit a range of electric charges, characterized by the isoelectric point (Figure S3). It is important to note that these characteristics are amplified when four fragments are randomly assembled to form a protein chain. We hypothesize that this diversity in protein building blocks affects their recombinant expression levels because of the different folding and conformational tendencies of these chains inside host cells.

Single-cell and ensemble screening strategy

In our previous study on high-throughput protein expression optimization, 34 we examined libraries of plasmid components, namely promoters and 5’ untranslated regions, which are known to influence single-cell protein levels.
Additionally, the characteristics of these features (e.g., transcription rate, translation initiation rate, mRNA decay rate) and their subsequent effects on single-cell protein levels can now be predicted using computational models. 39 In high-throughput screening, it is crucial to first separate single-cell protein expression from ensemble-level expression. Since ensemble expression is affected by other factors, including growth dynamics and metabolic burden, it is more effective to establish genotype-to-phenotype correlations at the single-cell level first. Later, cellular growth characteristics can be analyzed in the context of expression in a clonal population to connect protein yield with DNA or protein sequence. Since single-cell expression cannot yet be predicted a priori , we used fluorescence-activated cell sorting (FACS) to divide the SRT protein library into three logarithmically spaced bins based on mCherry fluorescence, which acts as a proxy for protein expression levels (as detailed in later sections). Cells from each bin were then sequenced to link genotype with protein expression phenotypes at the single-cell level. Our workflow is shown in Fig. 2 a. We then independently screened each bin's sub-libraries using our high-throughput microcapillary array platform, shown in Fig. 2 b. This system enables spatially isolated clonal growth within individual capillaries (20 µm diameter) and allows for the induction of the expression system through IPTG diffusion via an agarose gel (see Fig. 2 c). It captures both fluorescence and brightfield signals to measure protein levels and cell density, respectively. The fluorescence intensity summed across each capillary reflects protein quantity, while the transmitted brightfield intensity is used to calculate absorbance, indicating cell density relative to a background reference. 
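As a hedged illustration of how these two per-capillary readouts might be computed, the sketch below sums fluorescence pixel values for one capillary and converts brightfield intensity into absorbance using the standard definition A = -log10(I/I0). The text does not give the exact formula or the image-processing steps, so the function names, the use of mean intensities, and the example pixel values are all assumptions made for illustration.

```python
import math


def fluorescence_signal(pixels):
    """Sum of fluorescence pixel values inside one capillary (proxy for protein quantity)."""
    return sum(pixels)


def absorbance(transmitted, background):
    """Absorbance relative to a background reference: A = -log10(I / I0).

    Assumption: I and I0 are the mean brightfield intensities of the capillary
    and of a cell-free reference region, respectively.
    """
    i = sum(transmitted) / len(transmitted)
    i0 = sum(background) / len(background)
    if i <= 0 or i0 <= 0:
        raise ValueError("intensities must be positive")
    return -math.log10(i / i0)


# Hypothetical pixel values, for illustration only.
capillary_fluor = [120, 135, 128, 140]
capillary_bf = [50.0, 50.0, 50.0, 50.0]
reference_bf = [500.0, 500.0, 500.0, 500.0]

signal = fluorescence_signal(capillary_fluor)
a = absorbance(capillary_bf, reference_bf)
print(signal, a)
```

A capillary that transmits less light than the reference gives a positive absorbance, which is the sense in which higher absorbance indicates higher cell density.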
Unlike flow cytometry or plate readers, this platform supports high-throughput, time-series experiments (over 10^5 clones per hour), facilitating the generation of production and growth rate curves over extended periods. In our workflow, the microcapillary array platform and FACS complement each other: the first technique isolates one of the two variables critical to protein yield (protein per cell or cell frequency), and the second captures the other variable at high resolution within the enriched search space. This approach enables a systematic and comprehensive study of how factors such as protein composition, structure, expression system design, cell viability, and environmental conditions influence protein yield.

Expression system and cloning

The plasmid design for the library is shown in Fig. 3a. The expression system includes a T5 promoter with two lac operators, a ribosome binding site with a consistent translation initiation rate (1664 a.u.), a protein coding sequence, a HiBiT tag, and an mCherry CDS that overlaps by four nucleotides with the stop codon of the protein CDS. SRT proteins are naturally non-fluorescent, making their direct optical screening impossible. Therefore, we use mCherry as a biomarker whose expression depends on expression of the structural protein in a translationally coupled bicistronic construct (Fig. 3b). 40 It has been demonstrated that the expression levels of translationally coupled fluorescent proteins increase proportionally with the rate of translation initiation. 41 A ribosome first translates the structural-protein CDS starting from the initial upstream open reading frame (ORF). Instead of fully disassembling at the ‘TGA’ stop codon, the ribosome re-initiates translation at the ‘ATG’ start codon of the downstream mCherry ORF. This mechanism produces a positive correlation between the translation of both ORFs due to the short intergenic region (4 nt) and the absence of a strong transcriptional terminator.
The T5 promoter system is recognized by native E. coli RNAP and does not require the expression of exogenous polymerases. When combined with an expression control system like the lac operator, which is inducible by IPTG, T5 provides better transcriptional regulation than T7 (as shown in our previous study 34). This helps prevent basal (‘leaky’) expression, which is especially important when expressing toxic proteins. In our vector, we included two lac operator sites near the promoter to enhance the regulation of basal expression. We confirmed with microcapillary arrays that the IPTG-inducible T5 system prevents basal expression, as shown in Fig. 3c. No fluorescence signal was detected in the non-induced capillaries, despite using a long exposure time (4 s) for imaging. In contrast, the induced array showed significant mCherry fluorescence at about 40 ms exposure.

To build the protein library, we used a streamlined cloning method with a library of pre-assembled full-length protein coding sequences (synthesized by Genscript, Fig. 3d) delivered in a BsaI-flanked entry plasmid, allowing efficient Golden Gate assembly into our screening vector. This approach significantly increased accuracy by reducing the number of assembly parts by a factor of 2.5. The destination plasmid pBA500C (which contains the base vector and protein insertion site) and protein-coding inserts were combined in a single reaction, simplifying construction without reducing efficiency (Fig. 3e). To ensure even clone representation and reduce biases from different growth rates, we pooled isogenic colonies directly from agar plates instead of growing individual colonies in liquid culture. This method ensured that the final library maintained an unbiased composition reflecting the initial assembly.

FACS of SRT library

Single-cell protein expression levels for the negative control and the SRT protein library are shown in Figs. 4a and b.
The clear separation between the two populations indicates strong mCherry expression in the library, which is crucial for accurately identifying SRT protein-expressing clones. In addition to generally high mCherry expression, the SRT protein library covers roughly a two-order-of-magnitude range in fluorescence intensity, indicating considerable diversity in expression levels. This variation suggests that the encoded protein significantly influences expression modulation. To explore this diversity further, the library was sorted into three bins, as shown in Fig. 4 b, with each bin representing a specific expression range and consisting of a similar, minimal fraction (about 15%) of the total population. Figure 4 c illustrates our cost-effective strategy for sequencing the constructs in each bin, which includes DNA template amplification through rolling circle amplification (RCA) and sequencing via Next-Generation Sequencing (NGS, Illumina). We chose RCA over polymerase chain reaction (PCR) to limit the insertions and deletions that form during PCR amplification of repetitive DNA templates. NGS is ideal for short inserts like the SRT protein library, and its high accuracy (> 99.9%) allows for easy and rapid alignment with the reference. 42 Sequencing cells from each bin revealed significant diversity within each group, with minimal sequence overlap between bins (Fig. 4 d), further supporting the connection between sequence composition and expression behavior. The mCherry-positive density scatter plot shows that most clones express medium protein levels, with fewer clones showing low or high expression. The cell population in the sorting region resembles a unimodal symmetric distribution (Figure S4), suggesting an unbiased composition of the SRT protein library—resulting from randomized fragment assembly. This balanced distribution is ideal for establishing a strong genotype-to-phenotype relationship. 
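The three-bin sort can be illustrated with a small sketch that places gate edges at logarithmically spaced fluorescence values and assigns each event to a bin. The gate limits and event values are hypothetical. In the experiment the gates are drawn on the cytometer (Fig. 4 b), each covers about 15% of the population, and they need not be contiguous as they are in this simplification.

```python
import bisect
import math


def log_spaced_edges(lo, hi, n_bins):
    """Return n_bins + 1 edges spaced evenly in log10 between lo and hi."""
    step = (math.log10(hi) - math.log10(lo)) / n_bins
    return [10 ** (math.log10(lo) + k * step) for k in range(n_bins + 1)]


def assign_bin(value, edges, labels):
    """Label of the bin containing value, or None if it falls outside the gated range."""
    if value < edges[0] or value >= edges[-1]:
        return None
    return labels[bisect.bisect_right(edges, value) - 1]


# Hypothetical gate limits spanning two orders of magnitude of mCherry fluorescence.
edges = log_spaced_edges(1e2, 1e4, 3)
labels = ["low", "medium", "high"]

# Hypothetical single-cell fluorescence values.
events = [50.0, 300.0, 1000.0, 5000.0]
sorted_events = [assign_bin(v, edges, labels) for v in events]
print(sorted_events)
```

Spacing the edges in log10 rather than linearly gives each bin the same fold-change width, which suits fluorescence distributions that span orders of magnitude.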
Genotype-to-phenotype correlations

The SRT protein library includes sequence variations specifically within the amorphous regions, resulting from differences in amino acid composition and fragment length. To explore potential relationships between genotype and phenotype, we analyzed correlations between various protein characteristics and single-cell expression levels. Specifically, we compared hydrophobicity (Wimley–White scale), isoelectric point, molecular weight difference, and nitrogen content of sequences in the low, medium, and high bins (Figures S2 and S3). No statistically significant correlations were found for any of these properties. Importantly, features of the expression-regulating plasmid, such as the promoter and ribosome binding site (translation initiation rate of 1664 a.u., calculated using the RBS Calculator, denovodna.com), were kept constant across all variants in the library, ensuring that differences in expression were solely due to the encoded protein sequences. Since single-cell expression seems unaffected by the intrinsic physicochemical properties of the encoded proteins, we examined sequence features that could influence folding efficiency, specifically intra-sequence self-similarity. For this purpose, we conducted pairwise alignments of the constituent fragments in each protein sequence (see Fig. 5a for workflow). The distribution of fragment-pair identities was unimodal and symmetric, indicating that the combinatorial library is unbiased (Fig. 5b). FACS analysis revealed clear enrichment patterns: compared to the medium- and high-expression bins, the low-expression bin was enriched in fragment pairs sharing 10–30% identity and depleted of those with 40–60% identity (Fig. 5b). A stronger correlation is observed when analyzing the coincidence of multiple high-identity fragment pairs within individual sequences.
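A minimal sketch of this per-sequence analysis follows: every pair among the four fragments is globally aligned, percent identity is computed, and the pairs above a threshold are counted. The text does not name the alignment tool or scoring scheme, so the Needleman-Wunsch scores (match +1, mismatch -1, gap -1), the identity definition (identical columns divided by alignment columns), and the example fragments are all assumptions, not the authors' method.

```python
from itertools import combinations


def global_identity(a, b, match=1, mismatch=-1, gap=-1):
    """Percent identity of a and b under a Needleman-Wunsch global alignment."""
    n, m = len(a), len(b)
    # score[i][j] is the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back one optimal path, counting identical and total alignment columns.
    i, j, identical, columns = n, m, 0, 0
    while i > 0 or j > 0:
        diagonal_ok = False
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            diagonal_ok = score[i][j] == score[i - 1][j - 1] + sub
        if diagonal_ok:
            identical += a[i - 1] == b[j - 1]
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
        columns += 1
    return 100.0 * identical / columns if columns else 0.0


def count_high_identity_pairs(fragments, threshold=50.0):
    """Number of fragment pairs whose identity exceeds the threshold (percent)."""
    return sum(global_identity(x, y) > threshold
               for x, y in combinations(fragments, 2))


# Hypothetical fragments for illustration; these are not sequences from the library.
similar = ["GGLYGGYGGLG", "GGLYGGYGGLG", "GGLYGGYGALG", "GGLYGGYGGLG"]
dissimilar = ["AAAA", "CCCC", "DDDD", "EEEE"]

print(count_high_identity_pairs(similar), count_high_identity_pairs(dissimilar))
```

A protein built from four fragments has six fragment pairs, so the count returned by `count_high_identity_pairs` ranges from 0 to 6.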
The high-expression bin is enriched in sequences containing three or more fragment pairs with > 50% sequence identity, whereas the low-expression bin is enriched in sequences with one or fewer such fragment pairs (Fig. 5c). The medium-expression bin exhibited intermediate behavior, containing similar proportions of sequences with one, two, or three fragment pairs with > 50% sequence identity. These observations suggest higher inter-fragment sequence similarity is correlated with, and may promote, increased protein expression in our system.

High-throughput screening using microcapillary arrays

We expanded the screening of the SRT protein library by utilizing our high-throughput microcapillary array platform, which produces protein and growth data that traditional methods cannot provide. We analyzed the bins over a 35-hour period, measuring mCherry fluorescence levels (Figs. 6a, b, c) and cell density (Figs. 6d, e, f) in more than 15,000 capillaries at a series of time points. Not all capillaries exhibit the same mCherry fluorescence, highlighting significant heterogeneity in protein expression. The fluorescence intensity distribution is broad, a common trait of stochastic gene expression. This variability is particularly evident in structural proteins, which are generally more challenging to express compared to soluble fluorescent proteins (as also observed previously with cement and reflectin versus mRFP1 34). Interestingly, the distribution is asymmetric: it appears skewed, with a small number of capillaries exhibiting high fluorescence intensities and a gradual decline toward lower intensities. This "pinning" at the lower end likely reflects a combination of expression inefficiencies and biological factors such as cell viability. Indeed, in each expression bin, a subset of clones produced minimal levels of protein, which could be due to compromised cell health or other cellular burdens.
Moreover, the rate at which intensity values increase in each bin provides additional insights (Figure S5). The rise in intensity from t = 15 hr to 25 hr is greatest in the high bins, followed by medium and low. Additionally, the low bin showed minimal additional expression during the final 10-hour incubation period. The medium and high-expression bins show a broader, more gradual increase in fluorescence values. This indicates a broader range of expression levels and potentially more resilient or variable cellular states within those populations. In addition to fluorescence imaging to measure protein titers, we utilized brightfield imaging for estimating cellular growth characteristics. Capillaries containing a greater number of cells at a given time point attenuate white light more compared to those containing fewer cells. For each brightfield image, we plotted the same number of the highest-absorbance capillaries as we plotted for the corresponding fluorescence image. Consistent with this, the number of capillaries exceeding the low bin's fluorescence threshold in both the medium and high bins was significantly higher (as shown in Fig. 6 a and 6 d). This trend highlights the underlying diversity in expression potential across the library and supports the existence of a dynamic, non-uniform distribution of expression phenotypes. Representative fluorescence and brightfield images are shown in Figs. 7 a and 7 b, respectively. The resulting protein production and cell growth profiles at the final time point are summarized in Figs. 7 c and 7 d, respectively. A clear upward trend in protein titer is observed across the expression bins—from low to medium to high—with each bin exhibiting a higher titer than the previous one. Notably, the medium-expression bin displayed approximately 54% higher protein titers compared to the low-expression bin. 
In contrast, the increase from the medium to high bin was modest, with the high bin showing only about a 5% improvement over the medium bin (Fig. 7e). The main insight from the platform is that growth rates are similar across all bins. We expected the bins to show different cellular growth, owing either to the metabolic burden of accumulating more protein per cell or to toxicity arising from protein aggregation. However, all three bins showed nearly identical absorbance (cell density) distributions at t = 35 hr (Fig. 7f). This distribution also matches that at t = 15 hr for each bin, indicating that the cell population in the capillaries stabilized early on (probably during the pre-induction phase at 37°C) and then remained stable during the later incubation at 25°C after induction. This stability is likely due to the T5 expression system's ability to suppress basal expression and the lower temperatures used after induction, which help cells grow largely independent of the expressed protein. 43 These results correlate with the single-cell protein levels; however, the degree of variation observed in the single-cell protein levels was not seen at the ensemble level of protein expression. Each clone is expected to show a range of single-cell protein levels, and selecting cells with the average protein level is not guaranteed in FACS, especially for the high bin, making a 10-fold variation in protein titers unlikely. Additionally, non-viable cells are present in each capillary, as confirmed by cytometry of individual cultures from several clones in the library (Figure S6) and the dense left lobe in Fig. 4b. Therefore, it is likely that only a portion of the clonal population expresses mCherry and SRT proteins (as previously noted with cement and reflectin 34), and this fraction varies across the bins (Figure S7).
Our results suggest that clones in the medium bin express a total amount of SRT protein similar to that of clones in the high bin, possibly because their higher fraction of expression-positive cells compensates for their lower single-cell SRT protein levels (illustrated in Figure S4). The mCherry-positive cell populations appear to be centered around the medium bin; the sharp decrease in cell frequency toward the high bin also supports this (Figure S4).

DISCUSSION

SRT proteins are recognized for their multifunctional properties, making the optimization of their protein yield essential for practical applications. While previous studies have explored how protein sequence influences expression yield in different systems, our research broadens this scope to include SRT proteins by means of an extensive combinatorial library. This library was designed to cover a wide sequence space, including variations in features such as self-similarity, hydrophobicity, charge, and nitrogen content. Significantly, sequences responsible for crystalline region formation were kept constant to ensure structural consistency across variants. Our high-throughput screening utilized a microcapillary array platform capable of capturing the entire distribution of production levels within each bin, stratified by single-cell expression intensity. This enabled a detailed mapping of genotype to phenotype across thousands of clones. The platform, based on T5 promoter-driven expression, effectively minimized basal expression, resulting in uniform growth profiles across bins. A strong correlation was found between single-cell protein expression and final protein titer levels. However, within the high-expression bin, a wide range of titer values was observed, emphasizing the influence of stochastic expression and cell viability constraints even among high-expressing clones. Notably, the platform features a precision extraction mechanism that allows for iterative enrichment of the high bin.
This makes it possible to progressively isolate top-performing clones through multiple rounds of screening and refinement, providing a powerful tool for engineering variants with higher expression yield. While most physicochemical properties of the protein sequences—such as hydrophobicity, charge, or nitrogen content—were not found to strongly correlate with protein titer levels, sequence self-similarity emerged as a key factor influencing expression. SRT proteins consist of tandem repeats that form β-sheet structures. This assembly behavior is reflected in their inherent disorder profiles. For example, the engineered variant TR-n4, made up of four tandem repeat units, is known to form β-sheets and displays a consistent, periodic disorder profile with four distinct peaks (Figure S8). These peaks indicate flexible regions in the sequence that undergo structural rearrangement during folding or assembly. To explore the connection between self-similarity and β-sheet assembly, we analyzed the disorder profiles of the most self-similar protein sequences in our library (cumulative self-similarity score: 560–600) and compared them to the least self-similar sequences (score: 266–273). The results, shown in Figure S9, reveal that highly self-similar clones display periodic disorder patterns with four evenly spaced peaks, closely resembling the TR-n4 profile. In contrast, the disorder profiles of low-similarity sequences are irregular and lack clear periodicity (Figure S10). In our SRT protein library, greater sequence similarity between assembly-prone protein segments was correlated with higher expression levels. This finding is initially surprising: repetitive, material-forming protein sequences are challenging to express in bacteria, in part due to protein-aggregate toxicity. 
44 However, a growing body of literature on heterotypic aggregates suggests that the co-assembly of variant sequences can modulate aggregate structure and aggregation kinetics; lower sequence similarity between co-assembling variants often corresponds to less efficient aggregation. 45 Critically, the SRT library sequences in this study are genetic fusions of heterotypic assembly-prone sequences; we expect that library members with lower self-similarity will aggregate less efficiently and will be more likely to populate unfolded, soluble, and oligomeric (pre-aggregated) states compared to those with higher self-similarity. Such pre-aggregated states are understood to mediate much of the host-cell toxicity caused by assembly-prone proteins: these states recruit and overload the proteostatic machinery, whereas large-scale aggregates like fibrils and inclusion bodies sequester assembly-prone proteins and protect the cell. 46 Hence, our library sequences with greater self-similarity may reach higher expression levels because they assemble into insoluble aggregates more efficiently and spend less time populating toxic, pre-aggregated states. This hypothesis suggests self-similarity as an important design parameter for the heterologous expression of material-forming proteins with repetitive sequences. In contrast to issues like coding-sequence instability, mRNA stability, and amino acid and tRNA depletion, all of which can be managed with increased coding-sequence diversity, 47 protein toxicity concerns may favor sequences with sufficient amino-acid self-similarity to promote rapid sequestration. In summary, we have demonstrated an orthogonal screening method that combines FACS and fluorescent microcapillary arrays to establish genotype-to-single-cell-to-ensemble correlations. We observed that brighter clones tend to contain sequences with more self-similar fragments, whereas dimmer clones in the library exhibit less self-similarity. 
The method employs two high-throughput techniques in a complementary way, covering screening from single cells to the ensemble level, while enabling enrichment of desired clones. Our workflow separates single-cell expression and growth traits for a systematic study of the effect of genetic design on protein yield. It is especially suited for genetic systems that do not respond well to computational predictions, such as those involving structural proteins with complex structures.

DATA AVAILABILITY

The authors declare that data supporting the findings of this study are available within the paper and its supplementary information files.

Declarations

COMPETING INTERESTS

Benjamin Allen and Melik Demirel are the co-founders of Tandem Repeat Technologies, Inc., and they hold equity in the company. All authors declare that they have issued and pending patents.

ACKNOWLEDGEMENT OF SPONSORSHIP STATEMENT

This effort was sponsored in whole or in part by the Central Intelligence Agency (CIA), through CIA Federal Labs. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon.

DISCLAIMER

The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the Central Intelligence Agency.

AUTHOR CONTRIBUTIONS

M.C.D. conceived the project. T.M.B. and K.S. developed the microcapillary array platform. B.D.A. designed plasmids and protein libraries and conducted self-similarity analysis. K.S. designed and performed all experiments and analyzed the data. K.S. wrote the manuscript together with M.C.D. All authors participated in manuscript revisions, discussions, and data interpretation.

ACKNOWLEDGMENTS

We thank Dr. Howard Salis and Harry Adamson for their scientific discussions. Additionally, we appreciate Dr. William Humphries for guiding us in setting up the optical system.
METHODS

Plasmid library design

The amino acid sequences of the protein are based on native Squid Ring Teeth proteins. 48 We designed a segmented copolymer protein with four fragments. Each fragment has a crystalline, beta-sheet-forming region and a flexible, GGXY amorphous region. The amino acid sequence of the crystalline region was held constant across all fragments at ‘AAASVSTVHHP’. We selected 33 amorphous amino acid fragments from six different squid species, based on our previous analysis. The selection was made according to the similarity in length to the original recombinant n4 sequence, as described in our earlier publication. 48

Vector cloning and library synthesis

Plasmids pBA407 and pBA500B were used to construct the pBA500C vector. Plasmid pBA500B, containing the protein insertion site, GS linker, HiBiT tag, and mCherry CDS, was synthesized by Genscript. pBA407 was digested at the BsaI and NcoI restriction sites, and pBA500B was digested at the KpnI and BamHI restriction sites, using standard protocols (New England Biolabs, NEB). The digested products were then assembled to form pBA500C through standard HiFi DNA assembly (NEB) and transformed into NEB 10-beta at 37°C. Isogenic colonies were propagated in Luria Broth (LB) at 37°C and 250 RPM, and the cultures were miniprepped to obtain the purified pBA500C plasmid. The pBA500C sequence was confirmed through long-read whole-plasmid sequencing (Oxford Nanopore, Quintara Biosciences). The pBA500C plasmid contains the SRT protein CDS insertion site compatible with Golden Gate assembly. Plasmids containing the protein CDS (flanked by BsaI restriction sites) of all variants were ordered as a pool in the commercial vector pUC57 (Genscript). Golden Gate assembly of pBA500C and the protein-insert plasmid pool was carried out using standard procedures (NEB, BsaI-HF v2). The assembled product was transformed into NEB 10-beta chemically competent cells.
Multiple agar plates were prepared using different loading volumes of the recovery culture (2.5 µl, 25 µl, 100 µl, and 250 µl). The three plates with the greatest density of isogenic colonies were selected for library pooling, and all colonies (an estimated 20,000–25,000) were collected using LB. A portion of the recovered colony suspension was miniprepped. The miniprepped plasmids were then transformed into E. coli BL21-DE3 (NEB) chemically competent cells. All colonies from four agar plates with 100 µl and 250 µl loading volumes of the recovery culture were collected using LB and pooled. This pool was stored as glycerol cryostocks and was used for further screening and tests. All nucleotide sequences used in this study are provided in Supplementary Information.

FACS

The library cryostock was inoculated in carbenicillin-supplemented LB and incubated at 37°C and 250 RPM for 5 hours. The culture was then induced with IPTG at a final concentration of 400 µM and incubated overnight at 25°C. The culture was diluted 500× in phosphate-buffered saline (PBS, 1×) and used for FACS (ThermoScientific, Bigfoot Cell Sorter). The parameters for the gates (low, medium, and high) and the respective populations are described in the main text and figures. The sorted cells (10,000 in each bin) were collected in carbenicillin-supplemented LB and incubated overnight. Finally, glycerol stocks of each bin were prepared and used for further analyses.

Instrumentation and sample preparation

The microcapillary-array screening system was described in detail earlier. 34 Briefly, an inverted fluorescence microscope (Olympus IX73) was equipped with a monochrome camera (DP23M, Olympus), a motorized XY stage (Marzhauser Wetzlar Tango 3), a motorized focus drive (Marzhauser Wetzlar), and a 10× objective lens (Olympus). The platform was operated using the cellSens (Olympus) software suite. Microcapillary arrays with 20 µm capillary diameter (INCOM, Inc.)
were used throughout all experiments, each containing approximately 8×10^5 capillaries. The arrays were sterilized prior to use and treated with a plasma wand to increase hydrophilicity. The cryostock of each bin was incubated in carbenicillin-supplemented Luria broth (LB) at 37°C until reaching an optical density of about 0.1. Cells in LB were loaded onto an array, with the cell concentration adjusted so that each capillary typically contained a single cell. After loading, the array was overlaid with a 2% (w/v) agarose gel layer (1–2 mm thick) and incubated at 37°C in a sealed petri dish lined with moist wipes for 5 hours. The cells were then induced with IPTG by adding the stock solution to achieve a 400 µM concentration in the capillaries. Following induction, the arrays were incubated at 25°C and screened at 5, 15, 25, and 35 hours. After screening, the arrays were sterilized in 70% ethanol for several hours. They were then thoroughly cleaned with a stream of deionized (DI) water, with intermediate ultrasonication for 2 minutes.

Amplification and sequencing

To obtain protein CDS inserts for sequencing, the BL21-DE3 sub-libraries were propagated in LB (37°C, 250 RPM) and stored in aliquots. Rolling circle amplification (RCA) was performed on these aliquots using standard protocols (phi29-XT RCA kit, NEB) for 24 hours. Amplification was first verified using random primers; after confirmation, sequence-specific primers (5’-CGTGAAACATC*C*T*G-3’, 5’-GTCCTGAAGAG*A*G*G-3’) were used to minimize background. The RCA products were diluted twofold and digested with MscI and NcoI. The digested products were run on DNA purification gels (1% w/v agarose in TAE buffer) for about an hour. Desired bands were excised and purified using a DNA gel purification kit (NEB). The purified products were quantified with Qubit and sent for next-generation sequencing (NGS, Illumina; Quintara Biosciences). A custom MATLAB script was used to analyze the merged raw reads.
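As a rough illustration of how sequenced variants could be scored for the self-similarity discussed in the main text, the sketch below makes several assumptions that are not confirmed by the source: each of the four fragments is taken to begin with the constant crystalline motif ‘AAASVSTVHHP’, percent identity is taken from a simple Needleman-Wunsch global alignment, and the cumulative score is taken to be the sum of percent identities over the six fragment pairs. The function names, scoring parameters, and the example amorphous segment are hypothetical, and the code is in Python rather than the authors' MATLAB.

```python
from itertools import combinations
from typing import List

CRYSTALLINE = "AAASVSTVHHP"  # constant crystalline region given in Methods


def split_fragments(protein: str, motif: str = CRYSTALLINE) -> List[str]:
    """Split a chain into fragments, ASSUMING each fragment starts with the motif."""
    starts = []
    pos = protein.find(motif)
    while pos != -1:
        starts.append(pos)
        pos = protein.find(motif, pos + len(motif))
    ends = starts[1:] + [len(protein)]
    return [protein[s:e] for s, e in zip(starts, ends)]


def percent_identity(a: str, b: str, match: int = 1,
                     mismatch: int = -1, gap: int = -1) -> float:
    """Percent identity over the columns of one optimal global alignment."""
    n, m = len(a), len(b)
    # score[i][j] is the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back, counting identical columns and total alignment columns.
    i, j, identical, columns = n, m, 0, 0
    while i > 0 or j > 0:
        columns += 1
        sub = match if (i > 0 and j > 0 and a[i - 1] == b[j - 1]) else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            identical += int(a[i - 1] == b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return 100.0 * identical / columns if columns else 0.0


def cumulative_self_similarity(protein: str) -> float:
    """Sum of percent identities over all fragment pairs (six pairs for n = 4)."""
    fragments = split_fragments(protein)
    return sum(percent_identity(x, y) for x, y in combinations(fragments, 2))


# Hypothetical chain: four identical fragments (motif + made-up amorphous segment).
fragment = CRYSTALLINE + "GGYGGLGGY"
chain = fragment * 4
print(cumulative_self_similarity(chain))  # 600.0, since all six pairs are identical
```

Under these assumptions the score is capped at 600 for a four-fragment chain, which would be consistent with the 560–600 range reported for the most self-similar sequences; the authors' exact alignment settings are not specified in this section, so absolute values from this sketch should not be compared directly with theirs.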
Microcapillary array platform screening

Brightfield and fluorescence images were captured at t = 5 hr, 15 hr, 25 hr, and 35 hr at multiple locations across the array for each bin. These images were taken with consistent exposure times (450 ms for brightfield and 100 ms for fluorescence). Fluorescence images were processed using a custom MATLAB script. The contrast was first normalized, and the images were binarized to highlight white capillaries indicating fluorescence. Capillary regions were identified and used as masks to measure the total pixel intensity within each capillary in the original images. No fluorescence was detected at t = 5 hr. The threshold level used for the low bin images at t = 15 hr was applied to all other images, including medium and high bins. Similarly, brightfield images were binarized using an optimal threshold, capillaries were identified, and their intensities were recorded. The intensities from each image were sorted in ascending order, and the lowest set of capillary intensities (matching the number of capillaries identified in the corresponding fluorescence images) was used to generate growth curves. Remaining capillaries were classified as background. Absorbance levels were also calculated.

References

1. Leal-Egaña, A. & Scheibel, T. Silk-based materials for biomedical applications. Biotech and App Biochem 55, 155–167 (2010). https://doi.org/10.1042/BA20090229
2. Meyer, M. Processing of collagen based biomaterials and the resulting materials properties. BioMed Eng OnLine 18, 24 (2019). https://doi.org/10.1186/s12938-019-0647-0
3. Bracalello, A. et al. Design and Production of a Chimeric Resilin-, Elastin-, and Collagen-Like Engineered Polypeptide. Biomacromolecules 12, 2957–2965 (2011). https://doi.org/10.1021/bm2005388
4. Feroz, S., Muhammad, N., Ratnayake, J. & Dias, G. Keratin - Based materials for biomedical applications. Bioactive Materials 5, 496–509 (2020). https://doi.org/10.1016/j.bioactmat.2020.04.007
5. Pena-Francesch, A. & Demirel, M. C. Squid-inspired tandem repeat proteins: Functional fibers and films. Frontiers in chemistry 7, 69 (2019).
6. Carter, N. A. & Grove, T. Z. Functional protein materials: Beyond elastomeric and structural proteins. Polymer Chemistry 10, 2952–2959 (2019).
7. Leisola, M. & Turunen, O. Protein engineering: opportunities and challenges. Appl Microbiol Biotechnol 75, 1225–1232 (2007). https://doi.org/10.1007/s00253-007-0964-2
8. Vural, M. & Demirel, M. C. Biocomposites of 2D layered materials. Nanoscale Horizons 10, 664–680 (2025).
9. Capezza, A. J. & Mezzenga, R. Proteins for Applied and Functional Materials. Biomacromolecules 25, 4615–4618 (2024). https://doi.org/10.1021/acs.biomac.4c00884
10. Demirel, M. C., Cetinkaya, M., Pena-Francesch, A. & Jung, H. Recent advances in nanoscale bioinspired materials. Macromolecular bioscience 15, 300–311 (2015).
11. Pena-Francesch, A. et al. Research update: programmable tandem repeat proteins inspired by squid ring teeth. APL Materials 6 (2018).
12. Tomko, J. A. et al. Tunable thermal transport and reversible thermal conductivity switching in topologically networked bio-inspired materials. Nature Nanotech 13, 959–964 (2018). https://doi.org/10.1038/s41565-018-0227-7
13. Pena-Francesch, A., Jung, H., Demirel, M. C. & Sitti, M. Biosynthetic self-healing materials for soft machines. Nat. Mater. 19, 1230–1235 (2020).
14. Singhal, K., Mazeed, T. & Demirel, M. C. Cephalopod inspired self-healing protein foams for oil-water separation. iScience 26, 108300 (2023). https://doi.org/10.1016/j.isci.2023.108300
15. Yılmaz, H. et al. Structural Protein-Based Whispering Gallery Mode Resonators. ACS Photonics 4, 2179–2186 (2017). https://doi.org/10.1021/acsphotonics.7b00310
16. Singhal, K., Boy, R., Abdullah, A. M., Mazeed, T. & Demirel, M. C. Engineering advanced cellulosics for enhanced triboelectric performance using biomanufactured proteins. npj Mater. Sustain. 2, 29 (2024). https://doi.org/10.1038/s44296-024-00035-7
17. Cappello, J. et al. Genetic Engineering of Structural Protein Polymers. Biotechnology Progress 6, 198–202 (1990). https://doi.org/10.1021/bp00003a006
18. Poddar, H., Breitling, R. & Takano, E. Towards engineering and production of artificial spider silk using tools of synthetic biology. Eng. biol. 4, 1–6 (2020). https://doi.org/10.1049/enb.2019.0017
19. Shire, E., Coimbra, A. A. B., Barba Ostria, C., Rios-Solis, L. & López Barreiro, D. Molecular design of protein-based materials – state of the art, opportunities and challenges at the interface between materials engineering and synthetic biology. Mol. Syst. Des. Eng. 9, 1187–1209 (2024). https://doi.org/10.1039/D4ME00122B
20. Arakawa, K. et al. 1000 spider silkomes: Linking sequences to silk physical properties. Sci. Adv. 8, eabo6043 (2022). https://doi.org/10.1126/sciadv.abo6043
21. Bowen, C. H. et al. Recombinant Spidroins Fully Replicate Primary Mechanical Properties of Natural Spider Silk. Biomacromolecules 19, 3853–3860 (2018). https://doi.org/10.1021/acs.biomac.8b00980
22. Lin, S., Chen, G., Liu, X. & Meng, Q. Chimeric spider silk proteins mediated by intein result in artificial hybrid silks. Biopolymers 105, 385–392 (2016). https://doi.org/10.1002/bip.22828
23. Hosse, R. J., Rothe, A. & Power, B. E. A new generation of protein display scaffolds for molecular recognition. Protein Science 15, 14–27 (2006). https://doi.org/10.1110/ps.051817606
24. Moffet, D. A. & Hecht, M. H. De Novo Proteins from Combinatorial Libraries. Chem. Rev. 101, 3191–3204 (2001). https://doi.org/10.1021/cr000051e
25. Zhao, H. & Arnold, F. H. Combinatorial protein design: strategies for screening protein libraries. Current Opinion in Structural Biology 7, 480–485 (1997). https://doi.org/10.1016/S0959-440X(97)80110-8
26. Jäckel, C., Kast, P. & Hilvert, D. Protein Design by Directed Evolution. Annu. Rev. Biophys. 37, 153–173 (2008). https://doi.org/10.1146/annurev.biophys.37.032807.125832
27. Kan, A. & Joshi, N. S. Towards the directed evolution of protein materials. MRS Communications 9, 441–455 (2019). https://doi.org/10.1557/mrc.2019.28
28. Wang, Y. et al. Directed Evolution: Methodologies and Applications. Chem. Rev. 121, 12384–12444 (2021). https://doi.org/10.1021/acs.chemrev.1c00260
29. Yang, K. K., Wu, Z. & Arnold, F. H. Machine-learning-guided directed evolution for protein engineering. Nat Methods 16, 687–694 (2019). https://doi.org/10.1038/s41592-019-0496-6
30. Zimmermann, A., Prieto-Vivas, J. E., Voordeckers, K., Bi, C. & Verstrepen, K. J. Mutagenesis techniques for evolutionary engineering of microbes – exploiting CRISPR-Cas, oligonucleotides, recombinases, and polymerases. Trends in Microbiology 32, 884–901 (2024). https://doi.org/10.1016/j.tim.2024.02.006
31. Stemmer, W. P. C. Rapid evolution of a protein in vitro by DNA shuffling. Nature 370, 389–391 (1994). https://doi.org/10.1038/370389a0
32. Zhao, H. Optimization of DNA shuffling for high fidelity recombination. Nucleic Acids Research 25, 1307–1308 (1997). https://doi.org/10.1093/nar/25.6.1307
33. Engler, C. & Marillonnet, S. in Synthetic Biology Vol. 1073 (eds Karen M. Polizzi & Cleo Kontoravdi) 141–156 (Humana Press, 2013).
34. Singhal, K., Adamson, H. E., Baer, T. M., Salis, H. M. & Demirel, M. C. Microcapillary Array-Based High Throughput Screening for Protein Biomanufacturability. ACS Synth. Biol. 14, 2328–2340 (2025).
35. Reetz, M. T. & Carballeira, J. D. Iterative saturation mutagenesis (ISM) for rapid directed evolution of functional enzymes. Nat Protoc 2, 891–903 (2007). https://doi.org/10.1038/nprot.2007.72
36. Romero, P. A. & Arnold, F. H. Exploring protein fitness landscapes by directed evolution. Nat Rev Mol Cell Biol 10, 866–876 (2009). https://doi.org/10.1038/nrm2805
37. Biot-Pelletier, D. & Martin, V. J. J. Evolutionary engineering by genome shuffling. Appl Microbiol Biotechnol 98, 3877–3887 (2014). https://doi.org/10.1007/s00253-014-5616-8
38. Dubini, R. C., Jung, H., Skidmore, C. H., Demirel, M. C. & Rovó, P. Hydration-induced structural transitions in biomimetic tandem repeat proteins. The Journal of Physical Chemistry B 125, 2134–2145 (2021).
39. LaFleur, T. L., Hossain, A. & Salis, H. M. Automated model-predictive design of synthetic promoters to control transcriptional profiles in bacteria. Nature communications 13, 5159 (2022).
40. Levin-Karp, A. et al. Quantifying translational coupling in E. coli synthetic operons using RBS modulation and fluorescent reporters. ACS Synth. Biol. 2, 327–336 (2013).
41. Tian, T. & Salis, H. M. A predictive biophysical model of translational coupling to coordinate and control protein expression in bacterial operons. Nucleic Acids Research 43, 7137–7151 (2015). https://doi.org/10.1093/nar/gkv635
42. Metzker, M. L. Sequencing technologies—the next generation. Nat Rev Genet 11, 31–46 (2010).
43. Makrides, S. C. Strategies for achieving high-level expression of genes in Escherichia coli. Microbiological reviews 60, 512–538 (1996).
44. Sabate, R., De Groot, N. S. & Ventura, S. Protein folding and aggregation in bacteria. Cellular and Molecular Life Sciences 67, 2695–2715 (2010).
45. Konstantoulea, K., Louros, N., Rousseau, F. & Schymkowitz, J. Heterotypic interactions in amyloid function and disease. The FEBS Journal 289, 2025–2046 (2022).
46. Bednarska, N. G., Schymkowitz, J., Rousseau, F. & Van Eldere, J. Protein aggregation in bacteria: the thin boundary between functionality and toxicity. Microbiology 159, 1795–1806 (2013).
47. Jeon, J., Subramani, S. V., Lee, K. Z., Jiang, B. & Zhang, F. Microbial synthesis of high-molecular-weight, highly repetitive protein polymers. International journal of molecular sciences 24, 6416 (2023).
48. Jung, H. et al. Molecular tandem repeat strategy for elucidating mechanical properties of high-strength proteins. Proc. Natl. Acad. Sci. U.S.A. 113, 6478–6483 (2016).

Additional Declarations

Yes there is potential Competing Interest.
Benjamin Allen and Melik Demirel are the co-founders of Tandem Repeat Technologies, Inc., and they hold equity in the company. All authors declare that they have issued and pending patents.

Supplementary Files: srtlibraryrevisionsuppinfov2BDA.docx (Supplementary Information)

{"props":{"pageProps":{"initialData":{"identity":"rs-7724325","acceptedTermsAndConditions":true,"allowDirectSubmit":true,"archivedVersions":[],"articleType":"Article","associatedPublications":[],"authors":[{"id":526506605,"identity":"66bd7017-b0fa-47e0-bf3f-9d49888e40d4","order_by":0,"name":"Melik Demirel","email":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAZAAAAAyAQMAAABI0h/eAAAABlBMVEX///8AAABVwtN+AAAACXBIWXMAAA7EAAAOxAGVKw4bAAAAqklEQVRIiWNgGAWjYFAC5oaDDRUQpgSRWhiBWs6QqoWxsY0ULbrtjY0HZ867Y29wgPngbR5itJidOdhwcOO2Z4kbDrAlWxOn5UZiw8GH2w4nGBzgMZMmTsv9h0Atcw4DHcb/jUgtN4AhtrHhMOOGAzxsRGo5A3TYjGPPEmceZjO2nEOUluOHD3/sqbljz3e8+eGNN8RogYIDwFRAgnKollEwCkbBKBgFuAAA5E49H4DribgAAAAASUVORK5CYII=","orcid":"https://orcid.org/0000-0003-0466-7649","institution":"Pennsylvania
State University","correspondingAuthor":true,"prefix":"","firstName":"Melik","middleName":"","lastName":"Demirel","suffix":""},{"id":526506606,"identity":"4233ef2f-bd8e-4154-89b1-6fc6f8e1801b","order_by":1,"name":"Khushank Singhal","email":"","orcid":"","institution":"Pennsylvania State University","correspondingAuthor":false,"prefix":"","firstName":"Khushank","middleName":"","lastName":"Singhal","suffix":""},{"id":526506607,"identity":"fc9a3399-a448-4715-9afd-fc919a16c54a","order_by":2,"name":"Benjamin Allen","email":"","orcid":"","institution":"Pennsylvania State University","correspondingAuthor":false,"prefix":"","firstName":"Benjamin","middleName":"","lastName":"Allen","suffix":""},{"id":526506608,"identity":"47baf13c-12aa-4c3f-a3cd-2238ccb13b3a","order_by":3,"name":"Thomas Baer","email":"","orcid":"","institution":"Pennsylvania State University","correspondingAuthor":false,"prefix":"","firstName":"Thomas","middleName":"","lastName":"Baer","suffix":""}],"badges":[],"createdAt":"2025-09-26 19:00:16","currentVersionCode":1,"declarations":"","doi":"10.21203/rs.3.rs-7724325/v1","doiUrl":"https://doi.org/10.21203/rs.3.rs-7724325/v1","draftVersion":[],"editorialEvents":[],"editorialNote":"","failedWorkflow":false,"files":[{"id":94141564,"identity":"2d2cb480-1fee-4ed9-a533-7e50cedac5bc","added_by":"auto","created_at":"2025-10-22 20:04:41","extension":"docx","order_by":0,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":3189269,"visible":true,"origin":"","legend":"","description":"","filename":"srtlibraryrevisionv2BDA.docx","url":"https://assets-eu.researchsquare.com/files/rs-7724325/v1/cdedea4ba3a53c82662a15c8.docx"},{"id":94141547,"identity":"f9d08ffd-7f41-41c7-9663-44e287c92e29","added_by":"auto","created_at":"2025-10-22 
20:04:40","extension":"json","order_by":1,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":5614,"visible":true,"origin":"","legend":"","description":"","filename":"COMMSCHEM250909.json","url":"https://assets-eu.researchsquare.com/files/rs-7724325/v1/1b56adb1d72f7fa94b3af20a.json"},{"id":94141956,"identity":"92d8d4f8-556d-4c5c-91d9-2344e2300650","added_by":"auto","created_at":"2025-10-22 20:12:40","extension":"docx","order_by":2,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":10542201,"visible":true,"origin":"","legend":"","description":"","filename":"srtlibraryrevisionsuppinfov2BDA.docx","url":"https://assets-eu.researchsquare.com/files/rs-7724325/v1/4a3ddf20fef354287d39d8cb.docx"},{"id":94142183,"identity":"ada380c3-e2a2-47a3-ba85-c14094e40f18","added_by":"auto","created_at":"2025-10-22 20:20:40","extension":"xml","order_by":3,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":126050,"visible":true,"origin":"","legend":"","description":"","filename":"COMMSCHEM2509090enriched.xml","url":"https://assets-eu.researchsquare.com/files/rs-7724325/v1/a04b17bd6e3193be45d7f8cb.xml"},{"id":94141555,"identity":"e8fee326-06c0-44eb-a86b-187d465dd183","added_by":"auto","created_at":"2025-10-22 20:04:40","extension":"png","order_by":11,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":78246,"visible":true,"origin":"","legend":"","description":"","filename":"Onlinefloatimage1.png","url":"https://assets-eu.researchsquare.com/files/rs-7724325/v1/1c2ee8c3f21f8c8aa1790daf.png"},{"id":94141551,"identity":"13c196b9-c3c4-4dbb-aca1-1b84f7d27df5","added_by":"auto","created_at":"2025-10-22 
20:04:40","extension":"png","order_by":12,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":82675,"visible":true,"origin":"","legend":"","description":"","filename":"Onlinefloatimage2.png","url":"https://assets-eu.researchsquare.com/files/rs-7724325/v1/d1263c953996f681d1a7e1fa.png"},{"id":94141954,"identity":"52b7e08d-a916-4176-8ff3-25f3ee17c8e1","added_by":"auto","created_at":"2025-10-22 20:12:40","extension":"png","order_by":13,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":148825,"visible":true,"origin":"","legend":"","description":"","filename":"Onlinefloatimage3.png","url":"https://assets-eu.researchsquare.com/files/rs-7724325/v1/9ee33f2188803003099dd3ba.png"},{"id":94141952,"identity":"f058cc62-cf1c-4291-9e40-d65c5a7e0f08","added_by":"auto","created_at":"2025-10-22 20:12:40","extension":"png","order_by":14,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":126559,"visible":true,"origin":"","legend":"","description":"","filename":"Onlinefloatimage4.png","url":"https://assets-eu.researchsquare.com/files/rs-7724325/v1/8b8db4cdd9c2de3a39eba610.png"},{"id":94142182,"identity":"4938afe6-644e-4792-acbb-7c8677d21353","added_by":"auto","created_at":"2025-10-22 20:20:40","extension":"png","order_by":15,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":94871,"visible":true,"origin":"","legend":"","description":"","filename":"Onlinefloatimage5.png","url":"https://assets-eu.researchsquare.com/files/rs-7724325/v1/6b2da47114b0a5840b825027.png"},{"id":94141562,"identity":"dcc284f1-a743-43b3-8867-5d940415c61e","added_by":"auto","created_at":"2025-10-22 
20:04:41","extension":"png","order_by":16,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":89083,"visible":true,"origin":"","legend":"","description":"","filename":"Onlinefloatimage6.png","url":"https://assets-eu.researchsquare.com/files/rs-7724325/v1/a3915dd9d8d73576379da2ef.png"},{"id":94141957,"identity":"d330960f-a7fb-44dc-8887-b65e22c2ee27","added_by":"auto","created_at":"2025-10-22 20:12:41","extension":"png","order_by":17,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":218093,"visible":true,"origin":"","legend":"","description":"","filename":"Onlinefloatimage7.png","url":"https://assets-eu.researchsquare.com/files/rs-7724325/v1/173fe2b50ffd1bf16e0497c3.png"},{"id":94141563,"identity":"0143a15d-b0c6-434f-a614-84e2b26c61ee","added_by":"auto","created_at":"2025-10-22 20:04:41","extension":"xml","order_by":18,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":122720,"visible":true,"origin":"","legend":"","description":"","filename":"COMMSCHEM2509090structuring.xml","url":"https://assets-eu.researchsquare.com/files/rs-7724325/v1/dab9a7f06d1ef47d052da9a1.xml"},{"id":94142184,"identity":"f3023aac-d1bd-4476-9fd8-1e86bce303d9","added_by":"auto","created_at":"2025-10-22 20:20:41","extension":"html","order_by":19,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":135086,"visible":true,"origin":"","legend":"","description":"","filename":"earlyproof.html","url":"https://assets-eu.researchsquare.com/files/rs-7724325/v1/799757058cfcac1eed6a9351.html"},{"id":94141546,"identity":"19d4b00e-2ccf-4bf4-b805-784d19cf3f00","added_by":"auto","created_at":"2025-10-22 20:04:40","extension":"png","order_by":1,"title":"Figure 1","display":"","copyAsset":false,"role":"figure","size":492500,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eSRT protein library design. \u003c/strong\u003ea. 
Sequences of SRT proteins in four squid species from around the world were studied and served as inspiration for the library design. b. A representative squid ring teeth is shown (scale bar = 5 mm). c. SRT protein repeat fragments consist of crystal-forming and amorphous regions, which assemble into a network of β-sheets. A protein chain is formed by ‘n’ tandem repeats. d. The design strategy of the n = 4 SRT protein library is shown. The crystalline sequence was kept constant, while a set of 33 amorphous sequences was selected from naturally occurring sequences that assemble in a combinatorial fashion.\u003c/p\u003e","description":"","filename":"floatimage1.png","url":"https://assets-eu.researchsquare.com/files/rs-7724325/v1/afb5c7b5b8fbfbd127f6824b.png"},{"id":94141951,"identity":"0eef99e3-6898-435b-8598-2199d55d4d13","added_by":"auto","created_at":"2025-10-22 20:12:40","extension":"jpeg","order_by":2,"title":"Figure 2","display":"","copyAsset":false,"role":"figure","size":334844,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eScreening methodology for single cells and ensembles. \u003c/strong\u003ea. Single-cell protein levels are measured using cytometry, and the library is sorted into three sub-libraries—low, medium, and high—with protein levels differing by a single order of magnitude. These sub-libraries are then screened using the micro-capillary array platform to analyze growth characteristics through miniature cultures. b. A representative microcapillary array is shown (dimensions: 20 mm × 20 mm). c. For induction, stock IPTG aqueous solution is applied to the agarose gel overlaid on the array. Two optical signals—fluorescence (d) and brightfield (e)—are collected from each capillary in the array. 
These signals are used to estimate protein levels (f) and cellular growth (g) rates.\u003c/p\u003e","description":"","filename":"floatimage2.jpeg","url":"https://assets-eu.researchsquare.com/files/rs-7724325/v1/3589c4134171cdc92bf68894.jpeg"},{"id":94141549,"identity":"3b1ed650-bd12-4e1b-a332-f37ebc853c78","added_by":"auto","created_at":"2025-10-22 20:04:40","extension":"jpeg","order_by":3,"title":"Figure 3","display":"","copyAsset":false,"role":"figure","size":619026,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eSRT protein library plasmid design and cloning. \u003c/strong\u003ea. The SRT protein library plasmid design features key components such as the T5 promoter, ribosome binding site (shown in blue), SRT protein CDS, mCherry CDS, and T7 terminator. b. The mechanism of ribosome re-initiation is illustrated wherein ribosome partially disassembles at SRT protein CDS stop codon and binds again to the mRNA at the nearby mCherry start codon. The mCherry ORF overlaps with the SRT protein ORF by four nucleotides. c. Images display an array of cells expressing the library plasmids. The brightfield image shows miniature cultures. No fluorescence was detected in arrays that were not induced with IPTG. d. The design of the SRT protein DNA insert, sourced commercially in the pUC57 vector, is shown. The insert is flanked by BsaI restriction sites, enabling compatibility with Golden Gate assembly. e. 
The methodology for library cloning using the pBA500C vector and pUC57-SRT protein insert pool is illustrated.\u003c/p\u003e","description":"","filename":"floatimage3.jpeg","url":"https://assets-eu.researchsquare.com/files/rs-7724325/v1/d232075acc308e778831b372.jpeg"},{"id":94141553,"identity":"967cb636-05bd-4c7c-ad19-239f87558c9e","added_by":"auto","created_at":"2025-10-22 20:04:40","extension":"jpeg","order_by":4,"title":"Figure 4","display":"","copyAsset":false,"role":"figure","size":499929,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eFlow cytometry distribution and FACS bins. \u003c/strong\u003eSingle-cell mCherry expression levels of the negative control \u0026nbsp;\u0026nbsp;(pBA500C-containing cells) (a) and SRT protein library clones (b) are shown. \u0026nbsp;\u0026nbsp;The FACS bin gates are illustrated with their respective share of the \u0026nbsp;\u0026nbsp;population. c. Protein insert amplification and sequencing strategy is \u0026nbsp;\u0026nbsp;depicted. The protocol uses RCA to amplify plasmid DNA followed by digestion \u0026nbsp;\u0026nbsp;and gel separation of SRT-coding regions for sequencing. d. Venn diagram \u0026nbsp;\u0026nbsp;illustrating frequency distribution of SRT-library sequences in each FACS \u0026nbsp;\u0026nbsp;bin.\u003c/p\u003e","description":"","filename":"floatimage4.jpeg","url":"https://assets-eu.researchsquare.com/files/rs-7724325/v1/b893d9191989680cbcbffaa4.jpeg"},{"id":94141557,"identity":"82a74fab-a95b-458e-95af-588f86f09d2f","added_by":"auto","created_at":"2025-10-22 20:04:40","extension":"jpeg","order_by":5,"title":"Figure 5","display":"","copyAsset":false,"role":"figure","size":459036,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eAnalysis of sequence self-similarity in SRT library. \u003c/strong\u003ea. The strategy for checking sequence self-similarity among fragments in protein chains is illustrated. 
For each protein chain, pairwise alignment of six pairs of fragments is performed, and percent identity is recorded. b. Distribution of percent identity in fragment pairs within each bin is plotted. c. Distribution of sequence population versus the number of pairs with over 50% identity is plotted.\u003c/p\u003e","description":"","filename":"floatimage5.jpeg","url":"https://assets-eu.researchsquare.com/files/rs-7724325/v1/d7a85f56d685cfe2e0cd8505.jpeg"},{"id":94141550,"identity":"5bacc43c-d115-43c9-88dd-8067175fe1bd","added_by":"auto","created_at":"2025-10-22 20:04:40","extension":"jpeg","order_by":6,"title":"Figure 6","display":"","copyAsset":false,"role":"figure","size":468272,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eProtein production and cellular growth rates.\u003c/strong\u003e Time series plots of protein production (\u003cstrong\u003ea, b, c\u003c/strong\u003e) and cell growth (\u003cstrong\u003ed, e, f\u003c/strong\u003e) for the low (\u003cstrong\u003ea, d\u003c/strong\u003e), medium (\u003cstrong\u003eb, e\u003c/strong\u003e) and high (\u003cstrong\u003ec, f\u003c/strong\u003e) bins at different time points after induction.\u003c/p\u003e","description":"","filename":"floatimage6.jpeg","url":"https://assets-eu.researchsquare.com/files/rs-7724325/v1/b8b0ebc48910c7f966d56c4b.jpeg"},{"id":94141559,"identity":"9148b23b-4e68-44ce-8850-b08c28759a70","added_by":"auto","created_at":"2025-10-22 20:04:41","extension":"jpeg","order_by":7,"title":"Figure 7","display":"","copyAsset":false,"role":"figure","size":992838,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eComparative analysis of protein production and cell growth.\u003c/strong\u003e Fluorescence (a) and brightfield (b) images showing capillaries for the low (blue), medium (green), and high (red) bins at the t = 35-hour time point. Protein production (c) and cell growth (d) curves at t = 35 hours for the bins are compared. e. 
Cumulative fluorescence intensities of the bins at t = 35 hours are shown. f. Distribution of capillary frequency versus absorbance at t = 35 hours is depicted.\u003c/p\u003e","description":"","filename":"floatimage7.jpeg","url":"https://assets-eu.researchsquare.com/files/rs-7724325/v1/52cea513ea14561e0f02eef0.jpeg"},{"id":105035155,"identity":"48b7f4af-41b6-419d-ba2c-0a8b0223d66d","added_by":"auto","created_at":"2026-03-20 07:25:35","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":4643901,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-7724325/v1/9c4d25bf-7a94-4219-a016-0960e2047ec0.pdf"},{"id":94141958,"identity":"88dc68ea-c3e2-46f9-833d-2cf4c19fedde","added_by":"auto","created_at":"2025-10-22 20:12:41","extension":"docx","order_by":1,"title":"","display":"","copyAsset":false,"role":"supplement","size":10542201,"visible":true,"origin":"","legend":"Supplementary Information","description":"","filename":"srtlibraryrevisionsuppinfov2BDA.docx","url":"https://assets-eu.researchsquare.com/files/rs-7724325/v1/72ed69e2cdaa3743e269255d.docx"}],"financialInterests":"\u003cb\u003eYes\u003c/b\u003e there is potential Competing Interest.\nBenjamin Allen and Melik Demirel are the co-founders of Tandem Repeat Technologies, Inc., and they hold equity in the company. 
All authors declare that they have issued and pending patents.","formattedTitle":"Self-similar Sequences Yield Higher Protein Expression in a Squid Ring Teeth Protein Library","fulltext":[{"header":"INTRODUCTION","content":"\u003cp\u003eStructural proteins, such as silk,\u003csup\u003e1\u003c/sup\u003e collagen,\u003csup\u003e2\u003c/sup\u003e elastin,\u003csup\u003e3\u003c/sup\u003e keratin,\u003csup\u003e4\u003c/sup\u003e and squid ring teeth (SRT)\u003csup\u003e\u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e\u003c/sup\u003e, have been utilized by materials engineers due to their exceptional mechanical properties, chemical functionalities, and biodegradability. Sequence-structure-property relationships are fundamental to the study of protein-based materials, particularly those derived from structural proteins.\u003csup\u003e\u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e6\u003c/span\u003e\u003c/sup\u003e Amino acid sequence influences protein folding, which in turn affects how the protein responds to environmental factors.\u003csup\u003e\u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e\u003c/sup\u003e Many structural features, such as hydrophobicity caused by exposed amino acids, crystalline or amorphous regions, protein entanglements, and charged functional groups, contribute to the unique properties of protein materials. 
The resulting properties of these proteins and their composites with inorganic materials\u003csup\u003e\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e\u003c/sup\u003e are used in various fields, from materials science to biomedical engineering.\u003csup\u003e\u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e\u003c/sup\u003e\u003c/p\u003e\u003cp\u003eSRT has gained attention recently due to its unique composite structure and remarkable toughness, which enable quick and agile movements.\u003csup\u003e\u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e10\u003c/span\u003e\u003c/sup\u003e Squid use SRT proteins to build tough, resilient ring teeth that support effective defense and hunting strategies.\u003csup\u003e\u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e\u003c/sup\u003e As a materials design platform, SRT proteins are multifunctional, exhibiting unique properties such as thermal conductivity switching,\u003csup\u003e12\u003c/sup\u003e rapid self-healing,\u003csup\u003e13\u003c/sup\u003e tunable hydrophobicity,\u003csup\u003e14\u003c/sup\u003e enhanced optical properties,\u003csup\u003e15\u003c/sup\u003e and triboelectric properties.\u003csup\u003e\u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e\u003c/sup\u003e This variability presents a valuable opportunity to optimize key factors affecting biomanufacturability, enabling the identification of sequence designs that enhance protein yield, solubility, and stability for large-scale production.\u003c/p\u003e\u003cp\u003eWhile SRT proteins and, in general, structural proteins\u003csup\u003e\u003cspan additionalcitationids=\"CR18\" citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e19\u003c/span\u003e\u003c/sup\u003e have been recombinantly engineered and produced at a laboratory scale, many challenges still exist in 
their biomanufacturing.\u003csup\u003e\u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e\u003c/sup\u003e For example, the diversity of SRT protein sequences and their impact on expression and solubility are challenging to predict.\u003csup\u003e\u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e\u003c/sup\u003e Efficient methods are needed to screen and select structural protein variants with high expression and solubility, as traditional protein engineering techniques often have low throughput and rely on trial-and-error approaches. Arakawa et al. emphasized the importance of global sampling by cataloging silk gene sequences from 1098 spider species and analyzing how amino acid motifs influence properties such as mechanical strength, thermal stability, and hydration in over 446 dragline silks.\u003csup\u003e\u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e20\u003c/span\u003e\u003c/sup\u003e The aforementioned study took several years to complete. Efficient protocols for protein library synthesis, like combinatorial methods, could supplement such studies by providing faster sequence-space coverage than previously thought. 
Rational design of structural proteins, inspired by natural protein sequences, enables further optimization, such as increasing the molecular weight of engineered proteins through intein-mediated polymerization.\u003csup\u003e\u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e,\u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e22\u003c/span\u003e\u003c/sup\u003e However, these methods have been limited to a few protein sequences found in nature or their modifications, leaving room for significant improvements in their properties through modern synthetic biology techniques, especially by exploring large protein libraries.\u003c/p\u003e\u003cp\u003eSeveral studies have concentrated on high-throughput screening of proteins.\u003csup\u003e\u003cspan additionalcitationids=\"CR24\" citationid=\"CR23\" class=\"CitationRef\"\u003e23\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR25\" class=\"CitationRef\"\u003e25\u003c/span\u003e\u003c/sup\u003e Analyzing the entire sequence space for protein libraries is essential for understanding and improving proteins, as shown by directed evolution studies.\u003csup\u003e\u003cspan additionalcitationids=\"CR27 CR28\" citationid=\"CR26\" class=\"CitationRef\"\u003e26\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR29\" class=\"CitationRef\"\u003e29\u003c/span\u003e\u003c/sup\u003e Mutagenesis is a powerful technique in protein engineering that enables the modification of proteins to create libraries of variants.\u003csup\u003e\u003cspan citationid=\"CR30\" class=\"CitationRef\"\u003e30\u003c/span\u003e\u003c/sup\u003e Site-directed mutagenesis enables precise changes at specific nucleotide positions to modify amino acids, a technique commonly used in rational protein design based on structural insights. Random mutagenesis, achieved through error-prone PCR or chemical mutagenesis, introduces a diverse range of mutations for screening. 
DNA shuffling is a directed evolution method that creates genetic diversity by recombining related DNA sequences.\u003csup\u003e\u003cspan citationid=\"CR31\" class=\"CitationRef\"\u003e31\u003c/span\u003e,\u003cspan citationid=\"CR32\" class=\"CitationRef\"\u003e32\u003c/span\u003e\u003c/sup\u003e This mimics natural recombination, creating a diverse library of variants that can be screened for improved traits such as higher enzymatic activity, increased stability, or altered substrate specificity. A key challenge limiting the large-scale use of recombinant structural proteins is their poor protein yield. Due to toxicity issues in cellular expression, the yield of structural proteins is not high enough to be cost-effective compared to polymers, which high-throughput screening approaches could address.\u003c/p\u003e\u003cp\u003eIn this study, we present the synthesis and high-throughput screening of a structural protein library inspired by SRT proteins. To efficiently manage the complexity of design and randomization, we use a tandem-repeat library construction method based on combinatorial DNA assembly, enabling us to generate and test multiple sequence variants systematically.\u003csup\u003e\u003cspan citationid=\"CR33\" class=\"CitationRef\"\u003e33\u003c/span\u003e\u003c/sup\u003e Our goal is to demonstrate the efficient synthesis of large libraries of structural proteins, especially those with repetitive DNA sequences. Along with our high-throughput screening platform,\u003csup\u003e34\u003c/sup\u003e this can lead to the discovery of sequences that enhance protein yield. The library contains over one million assembled variants from various protein fragments found in squid species. 
Although there are several library synthesis methods, such as random mutagenesis,\u003csup\u003e35\u003c/sup\u003e directed evolution,\u003csup\u003e36\u003c/sup\u003e and DNA shuffling,\u003csup\u003e37\u003c/sup\u003e we chose combinatorial library synthesis to preserve native protein sequences.\u003csup\u003e\u003cspan citationid=\"CR24\" class=\"CitationRef\"\u003e24\u003c/span\u003e\u003c/sup\u003e\u003c/p\u003e"},{"header":"RESULTS","content":"\u003cp\u003e\u003cstrong\u003eSRT combinatorial library\u003c/strong\u003e\u003cp\u003eAbout 350 squid species are found worldwide, and we studied six of these species (Figs.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003ea, b), each showing a unique SRT protein composition. These proteins have a segmented, copolymer-like structure, with fragments repeated in tandem. SRT proteins are classified as proteins with intrinsically disordered regions, where each repeat contributes to both crystalline (β-sheets) and amorphous domains, as verified by solid-state NMR studies\u003csup\u003e\u003cspan citationid=\"CR38\" class=\"CitationRef\"\u003e38\u003c/span\u003e\u003c/sup\u003e (Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003ec). In this study, we created a library of SRT proteins with four fragments that have different amorphous regions while maintaining a constant crystalline sequence. Figure\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003ed shows the design strategy and amino acid composition. A set of 33 amorphous sequences found in nature, with similar chain lengths (16\u0026ndash;21 amino acids), was assembled in a combinatorial manner, resulting in 1,185,921 (33\u003csup\u003e4\u003c/sup\u003e) variants. SRT proteins generally tend to form β-sheets,\u003csup\u003e38\u003c/sup\u003e and to enhance this native property, the crystal-forming sequence was kept constant. 
The predicted tertiary structures of several variants within the SRT protein library are displayed in Figure \u003cspan refid=\"MOESM1\" class=\"InternalRef\"\u003eS1\u003c/span\u003e.\u003c/p\u003e\u003c/p\u003e\u003cp\u003eOur library offers a comprehensive set of protein features for holistic screening and sequence-to-expression analysis. The amorphous fragments consist of different amino acids from all classifications, including polar, electrically charged side chains, hydrophobic, and special cases (Figure S2). While the overall positive hydrophobicity score of all fragments makes them hydrophobic, there is considerable variation in their degree of hydrophobicity (Figure S3). Additionally, the differences in amino acid composition cause the fragments to exhibit a range of electric charges, measured by the isoelectric point (Figure S3). It is important to note that these characteristics are amplified when four fragments randomly assemble to form a protein chain. We hypothesize that this diversity in protein building blocks affects their recombinant expression levels because of the different folding and conformational tendencies of these chains inside host cells.\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003cp\u003e\u003cstrong\u003eSingle-cell and ensemble screening strategy\u003c/strong\u003e\u003cp\u003eIn our previous study on high-throughput protein expression optimization,\u003csup\u003e34\u003c/sup\u003e we examined libraries of plasmid components, namely promoters and 5\u0026rsquo; untranslated regions, which are known to influence single-cell protein levels. 
Additionally, the characteristics of these features (e.g., transcription rate, translation initiation rate, mRNA decay rate) and their subsequent effects on single-cell protein levels can now be predicted using computational models.\u003csup\u003e\u003cspan citationid=\"CR39\" class=\"CitationRef\"\u003e39\u003c/span\u003e\u003c/sup\u003e In high-throughput screening, it is crucial to first separate single-cell protein expression from ensemble-level expression. Since ensemble expression is affected by other factors, including growth dynamics and metabolic burden, it is more effective to establish genotype-to-phenotype correlations at the single-cell level first. Later, cellular growth characteristics can be analyzed in the context of expression in a clonal population to connect protein yield with DNA or protein sequence. Since single-cell expression cannot yet be predicted \u003cem\u003ea priori\u003c/em\u003e, we used fluorescence-activated cell sorting (FACS) to divide the SRT protein library into three logarithmically spaced bins based on mCherry fluorescence, which acts as a proxy for protein expression levels (as detailed in later sections). Cells from each bin were then sequenced to link genotype with protein expression phenotypes at the single-cell level. Our workflow is shown in Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003ea.\u003c/p\u003e\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003cp\u003eWe then independently screened each bin's sub-libraries using our high-throughput microcapillary array platform, shown in Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003eb. This system enables spatially isolated clonal growth within individual capillaries (20 \u0026micro;m diameter) and allows for the induction of the expression system through IPTG diffusion via an agarose gel (see Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003ec). 
It captures both fluorescence and brightfield signals to measure protein levels and cell density, respectively. The fluorescence intensity summed across each capillary reflects protein quantity, while the transmitted brightfield intensity is used to calculate absorbance, indicating cell density relative to a background reference. Unlike flow cytometry or plate readers, this platform supports high-throughput, time-series experiments (over 10\u003csup\u003e5\u003c/sup\u003e clones per hour), facilitating the generation of production and growth rate curves over extended periods. In our workflow, the microcapillary array platform and FACS complement each other: first, by distinguishing one of the two variables critical to protein yield (protein per cell or cell frequency), and second, by capturing the other variable at high resolution within the enriched search space. This approach enables a systematic and comprehensive study of how factors such as protein composition, structure, expression system design, cell viability, and environmental conditions influence protein yield.\u003c/p\u003e\u003cp\u003e\u003cstrong\u003eExpression system and cloning\u003c/strong\u003e\u003cp\u003eThe plasmid design for the library is shown in Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003ea. The expression system includes a T5 promoter with two lac operators, a ribosome binding site with a consistent translation initiation rate (1664 au), a protein coding sequence, a HiBiT tag, and an mCherry CDS that overlaps by four nucleotides with the stop codon of the protein CDS. SRT proteins are naturally non-fluorescent, making their direct optical screening impossible. 
Therefore, we use mCherry as a biomarker whose expression depends on that of the structural protein in a translationally coupled bicistronic construct (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003eb).\u003csup\u003e\u003cspan citationid=\"CR40\" class=\"CitationRef\"\u003e40\u003c/span\u003e\u003c/sup\u003e It has been demonstrated that the expression levels of translationally coupled fluorescent proteins increase proportionally with the rate of translation initiation.\u003csup\u003e41\u003c/sup\u003e A ribosome first translates the structural-protein CDS, starting from the upstream open reading frame (ORF). Instead of fully disassembling at the \u0026lsquo;TGA\u0026rsquo; stop codon, the ribosome re-initiates translation at the downstream ORF\u0026rsquo;s \u0026lsquo;ATG\u0026rsquo; start codon of mCherry. This mechanism produces a positive correlation between the translation of both ORFs due to the short intergenic region (4 nt) and the absence of a strong transcriptional terminator.\u003c/p\u003e\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003cp\u003eThe T5 promoter system is recognized by native E. coli RNAP and does not require the expression of exogenous polymerases. When combined with an expression control system like the lac operator, which is inducible by IPTG, T5 provides better transcriptional regulation than T7 (as shown in our previous study\u003csup\u003e\u003cspan citationid=\"CR34\" class=\"CitationRef\"\u003e34\u003c/span\u003e\u003c/sup\u003e). This helps prevent basal (\u0026lsquo;leaky\u0026rsquo;) expression, which is especially important when expressing toxic proteins. In our vector, we included two lac operator sites near the promoter to enhance the regulation of basal expression. 
The effectiveness of T5's prevention of basal expression, using the IPTG-inducible system, was confirmed with microcapillary arrays, as shown in Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003ec. No fluorescence signal was detected in the non-induced capillaries, despite using a high exposure time (4s) for imaging. In contrast, the induced array showed significant mCherry fluorescence at about 40 ms exposure.\u003c/p\u003e\u003cp\u003eTo build the protein library, we used a streamlined cloning method with a library of pre-assembled full-length protein coding sequences (synthesized by Genscript, Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003ed) delivered in a BsaI-flanked entry plasmid, allowing efficient Golden Gate assembly into our screening vector. This approach significantly increased accuracy by reducing the number of assembly parts by 2.5 times. The destination plasmid pBA500C (which contains the base vector and protein insertion site) and protein-coding inserts were combined in a single reaction, simplifying construction without reducing efficiency (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003ee). To ensure even clone representation and reduce biases from different growth rates, we pooled isogenic colonies directly from agar plates instead of growing individual colonies in liquid culture. This method ensured that the final library maintained an unbiased composition reflecting the initial assembly.\u003c/p\u003e\u003cp\u003e\u003cstrong\u003eFACS of SRT library\u003c/strong\u003e\u003cp\u003eSingle-cell protein expression levels for the negative control and the SRT protein library are shown in Figs.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003ea and b. 
The clear separation between the two populations indicates strong mCherry expression in the library, which is crucial for accurately identifying SRT protein-expressing clones. In addition to generally high mCherry expression, the SRT protein library covers roughly a two-order-of-magnitude range in fluorescence intensity, indicating considerable diversity in expression levels. This variation suggests that the encoded protein significantly influences expression modulation. To explore this diversity further, the library was sorted into three bins, as shown in Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003eb, with each bin representing a specific expression range and consisting of a similar, minimal fraction (about 15%) of the total population.\u003c/p\u003e\u003c/p\u003e\u003cp\u003e\u003cdiv class=\"BlockQuote\"\u003e\u003cp\u003eFigure \u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003ec illustrates our cost-effective strategy for sequencing the constructs in each bin, which includes DNA template amplification through rolling circle amplification (RCA) and sequencing via Next-Generation Sequencing (NGS, Illumina). We chose RCA over polymerase chain reaction (PCR) to limit the insertions and deletions that form during PCR amplification of repetitive DNA templates. NGS is ideal for short inserts like the SRT protein library, and its high accuracy (\u0026gt;\u0026thinsp;99.9%) allows for easy and rapid alignment with the reference.\u003csup\u003e\u003cspan citationid=\"CR42\" class=\"CitationRef\"\u003e42\u003c/span\u003e\u003c/sup\u003e Sequencing cells from each bin revealed significant diversity within each group, with minimal sequence overlap between bins (Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003ed), further supporting the connection between sequence composition and expression behavior. 
The mCherry-positive density scatter plot shows that most clones express medium protein levels, with fewer clones showing low or high expression. The cell population in the sorting region resembles a unimodal symmetric distribution (Figure S4), suggesting an unbiased composition of the SRT protein library\u0026mdash;resulting from randomized fragment assembly. This balanced distribution is ideal for establishing a strong genotype-to-phenotype relationship.\u003c/p\u003e\u003c/div\u003e\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003cp\u003e\u003cstrong\u003eGenotype-to-phenotype correlations\u003c/strong\u003e\u003cp\u003eThe SRT protein library includes sequence variations specifically within the amorphous regions, resulting from differences in amino acid composition and fragment length. To explore potential relationships between genotype and phenotype, we analyzed correlations between various protein characteristics and single-cell expression levels. Specifically, we compared hydrophobicity (Wimley\u0026ndash;White scale), isoelectric point, molecular weight difference, and nitrogen content of sequences in the low, medium, and high bins (Figure S2 and S3). No statistically significant correlations were found for any of these properties. Importantly, features of the expression-regulating plasmid, such as the promoter and ribosome binding site (translation initiation rate of 1664 a.u. calculated using RBS calculator, denovodna.com), were kept constant across all variants in the library, ensuring that differences in expression were solely due to the encoded protein sequences.\u003c/p\u003e\u003c/p\u003e\u003cp\u003eSince single-cell expression seems unaffected by the intrinsic physicochemical properties of the proteins encoded, we examined sequence features that could influence folding efficiency\u0026mdash;specifically, intra-sequence self-similarity. 
We conducted pairwise alignments of the constituent fragments in each protein sequence for this purpose (see Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003ea for workflow). The distribution of fragment-pair identities was unimodal and symmetric, indicating that the combinatorial library is unbiased (Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003eb). FACS analysis revealed clear enrichment patterns: compared to the medium- and high-expression bins, the low-expression bin was enriched in fragment pairs sharing 10\u0026ndash;30% identity and depleted of those with 40\u0026ndash;60% identity (Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003eb). A stronger correlation is observed when analyzing the coincidence of multiple high-identity fragment pairs within individual sequences. The high-expression bin is enriched in sequences containing three or more fragment pairs with \u0026gt;\u0026thinsp;50% sequence identity, whereas the low-expression bin is enriched in sequences with one or fewer such fragment pairs (Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003ec). The medium-expression bin exhibited intermediate behavior, containing similar proportions of sequences with one, two, or three fragment pairs with \u0026gt;\u0026thinsp;50% sequence identity. These observations suggest that higher inter-fragment sequence similarity is correlated with, and may promote, increased protein expression in our system.\u003c/p\u003e\u003cp\u003e\u003cstrong\u003eHigh-throughput screening using microcapillary arrays\u003c/strong\u003e\u003cp\u003eWe expanded the screening of the SRT protein library by utilizing our high-throughput microcapillary array platform, which produces protein and growth data that traditional methods cannot provide. 
We analyzed the bins over a 35-hour period, measuring mCherry fluorescence levels (Figs.\u0026nbsp;\u003cspan refid=\"Fig6\" class=\"InternalRef\"\u003e6\u003c/span\u003ea, b, c) and cell density (Figs.\u0026nbsp;\u003cspan refid=\"Fig6\" class=\"InternalRef\"\u003e6\u003c/span\u003ed, e, f) in more than 15,000 capillaries at a series of time points. Not all capillaries exhibit the same mCherry fluorescence, highlighting significant heterogeneity in protein expression. The fluorescence intensity distribution is broad, a common trait of stochastic gene expression. This variability is particularly evident in structural proteins, which are generally more challenging to express than soluble fluorescent proteins (as also observed previously with cement and reflectin versus mRFP1\u003csup\u003e34\u003c/sup\u003e).\u003c/p\u003e\u003c/p\u003e\u003cp\u003eInterestingly, the distribution is asymmetric: it appears skewed, with a small number of capillaries exhibiting high fluorescence intensities and a gradual decline toward lower intensities. This \"pinning\" at the lower end likely reflects a combination of expression inefficiencies and biological factors such as cell viability. Indeed, in each expression bin, a subset of clones produced minimal levels of protein, which could be due to compromised cell health or other cellular burdens. Moreover, the rate at which intensity values increase in each bin provides additional insights (Figure S5). The rise in intensity from t\u0026thinsp;=\u0026thinsp;15 hr to 25 hr is greatest in the high bin, followed by the medium and low bins. Additionally, the low bin showed minimal additional expression during the final 10-hour incubation period. The medium- and high-expression bins show a broader, more gradual increase in fluorescence values. 
This indicates a broader range of expression levels and potentially more resilient or variable cellular states within those populations.\u003c/p\u003e\u003cp\u003eIn addition to fluorescence imaging to measure protein titers, we utilized brightfield imaging for estimating cellular growth characteristics. Capillaries containing a greater number of cells at a given time point attenuate white light more compared to those containing fewer cells. For each brightfield image, we plotted the same number of the highest-absorbance capillaries as we plotted for the corresponding fluorescence image. Consistent with this, the number of capillaries exceeding the low bin's fluorescence threshold in both the medium and high bins was significantly higher (as shown in Fig.\u0026nbsp;\u003cspan refid=\"Fig6\" class=\"InternalRef\"\u003e6\u003c/span\u003ea and \u003cspan refid=\"Fig6\" class=\"InternalRef\"\u003e6\u003c/span\u003ed). This trend highlights the underlying diversity in expression potential across the library and supports the existence of a dynamic, non-uniform distribution of expression phenotypes.\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003cp\u003eRepresentative fluorescence and brightfield images are shown in Figs.\u0026nbsp;\u003cspan refid=\"Fig7\" class=\"InternalRef\"\u003e7\u003c/span\u003ea and \u003cspan refid=\"Fig7\" class=\"InternalRef\"\u003e7\u003c/span\u003eb, respectively. The resulting protein production and cell growth profiles at the final time point are summarized in Figs.\u0026nbsp;\u003cspan refid=\"Fig7\" class=\"InternalRef\"\u003e7\u003c/span\u003ec and \u003cspan refid=\"Fig7\" class=\"InternalRef\"\u003e7\u003c/span\u003ed, respectively. A clear upward trend in protein titer is observed across the expression bins\u0026mdash;from low to medium to high\u0026mdash;with each bin exhibiting a higher titer than the previous one. Notably, the medium-expression bin displayed approximately 54% higher protein titers compared to the low-expression bin. 
In contrast, the increase from the medium to high bin was modest, with the high bin showing only about a 5% improvement over the medium bin (Fig.\u0026nbsp;\u003cspan refid=\"Fig7\" class=\"InternalRef\"\u003e7\u003c/span\u003ee).\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003cp\u003eThe main insight from the platform is that growth rates are similar across all bins. Although the bins might have been expected to show variable cellular growth, owing either to the metabolic burden of accumulating more protein per cell or to toxicity arising from protein aggregation, all three bins showed nearly identical absorbance (or cell density) distributions at t\u0026thinsp;=\u0026thinsp;35 hr (Fig.\u0026nbsp;\u003cspan refid=\"Fig7\" class=\"InternalRef\"\u003e7\u003c/span\u003ef). This distribution also matches that at t\u0026thinsp;=\u0026thinsp;15 hr for each bin, indicating that the cell population in the capillaries stabilized early on\u0026mdash;probably during the pre-induction phase at 37\u0026deg;C\u0026mdash;and then remained stable during the later incubation at 25\u0026deg;C after induction. This stability is likely due to the T5 expression system's ability to suppress basal expression and the lower temperature used after induction, which help cells grow largely independently of the expressed protein.\u003csup\u003e\u003cspan citationid=\"CR43\" class=\"CitationRef\"\u003e43\u003c/span\u003e\u003c/sup\u003e\u003c/p\u003e\u003cp\u003eThese results directly correlate with the single-cell protein levels; however, the same degree of variation observed in the single-cell protein levels was not seen at the ensemble level of protein expression. Each clone is expected to show a range of single-cell protein levels, and selecting cells with the average protein level is not guaranteed in FACS, especially for the high bin, making a 10-fold variation in protein titers unlikely. 
Additionally, unviable cells are present in each capillary, as confirmed by cytometry of individual cultures from several clones in the library (Figure S6) and the dense left lobe in Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003eb. Therefore, it is likely that only a portion of the clonal population expresses mCherry and SRT proteins (as previously noted with cement and reflectin\u003csup\u003e\u003cspan citationid=\"CR34\" class=\"CitationRef\"\u003e34\u003c/span\u003e\u003c/sup\u003e), and this fraction varies across the bins (Figure S7). Our results suggest that clones in the medium bin express a total amount of SRT protein similar to that of clones in the high bin, possibly because their higher fraction of expression-positive cells compensates for their lower single-cell SRT protein levels (illustrated in Figure S4). The mCherry-positive cell populations appear to be centered around the medium bin; the sharp decrease in cell frequency toward the high bin also supports this (Figure S4).\u003c/p\u003e"},{"header":"DISCUSSION","content":"\u003cp\u003eSRT proteins are recognized for their multifunctional properties, making the optimization of their protein yield essential for practical applications. While previous studies have explored how protein sequence influences expression yield in different systems, our research broadens this scope to include SRT proteins through an extensive combinatorial library. This library was designed to cover a wide sequence space, including variations in features such as self-similarity, hydrophobicity, charge, and nitrogen content. Significantly, sequences responsible for crystalline region formation were kept constant to ensure structural consistency across variants. Our high-throughput screening utilized a microcapillary array platform that was capable of capturing the entire distribution of production levels within each bin, stratified by single-cell expression intensity. 
This enabled a detailed mapping of genotype to phenotype across thousands of clones. The platform, based on T5 promoter-driven expression, effectively minimized basal expression, resulting in uniform growth profiles across bins. A strong correlation was found between single-cell protein expression and final protein titer levels. However, within the high-expression bin, a wide range of titer values was observed, emphasizing the influence of stochastic expression and cell viability constraints even among high-expressing clones. Notably, the platform features a precision extraction mechanism that allows for iterative enrichment of the high bin. This makes it possible to progressively isolate top-performing clones through multiple rounds of screening and refinement, providing a powerful tool for engineering variants with higher expression yield.\u003c/p\u003e\u003cp\u003eWhile most physicochemical properties of the protein sequences\u0026mdash;such as hydrophobicity, charge, or nitrogen content\u0026mdash;were not found to strongly correlate with protein titer levels, sequence self-similarity emerged as a key factor influencing expression. SRT proteins consist of tandem repeats that form β-sheet structures. This assembly behavior is reflected in their inherent disorder profiles. For example, the engineered variant TR-n4, made up of four tandem repeat units, is known to form β-sheets and displays a consistent, periodic disorder profile with four distinct peaks (Figure S8). These peaks indicate flexible regions in the sequence that undergo structural rearrangement during folding or assembly. To explore the connection between self-similarity and β-sheet assembly, we analyzed the disorder profiles of the most self-similar protein sequences in our library (cumulative self-similarity score: 560\u0026ndash;600) and compared them to the least self-similar sequences (score: 266\u0026ndash;273). 
The results, shown in Figure S9, reveal that highly self-similar clones display periodic disorder patterns with four evenly spaced peaks, closely resembling the TR-n4 profile. In contrast, the disorder profiles of low-similarity sequences are irregular and lack clear periodicity (Figure S10).\u003c/p\u003e\u003cp\u003eIn our SRT protein library, greater sequence similarity between assembly-prone protein segments was correlated with higher expression levels. This finding is initially surprising: repetitive, material-forming protein sequences are challenging to express in bacteria, in part due to protein-aggregate toxicity.\u003csup\u003e\u003cspan citationid=\"CR44\" class=\"CitationRef\"\u003e44\u003c/span\u003e\u003c/sup\u003e However, a growing body of literature on heterotypic aggregates suggests that the co-assembly of variant sequences can modulate aggregate structure and aggregation kinetics; lower sequence similarity between co-assembling variants often corresponds to less efficient aggregation.\u003csup\u003e\u003cspan citationid=\"CR45\" class=\"CitationRef\"\u003e45\u003c/span\u003e\u003c/sup\u003e Critically, the SRT library sequences in this study are genetic fusions of heterotypic assembly-prone sequences; we expect that library members with lower self-similarity will aggregate less efficiently and will be more likely to populate unfolded, soluble, and oligomeric (pre-aggregated) states compared to those with higher self-similarity. 
Such pre-aggregated states are understood to mediate much of the host-cell toxicity caused by assembly-prone proteins: these states recruit and overload the proteostatic machinery, whereas large-scale aggregates like fibrils and inclusion bodies sequester assembly-prone proteins and protect the cell.\u003csup\u003e\u003cspan citationid=\"CR46\" class=\"CitationRef\"\u003e46\u003c/span\u003e\u003c/sup\u003e Hence, our library sequences with greater self-similarity may reach higher expression levels because they assemble into insoluble aggregates more efficiently and spend less time populating toxic, pre-aggregated states. This hypothesis suggests self-similarity as an important design parameter for the heterologous expression of material-forming proteins with repetitive sequences. In contrast to issues like coding-sequence instability, mRNA stability, and amino acid and tRNA depletion, all of which can be managed with increased coding-sequence diversity,\u003csup\u003e47\u003c/sup\u003e protein toxicity concerns may favor sequences with sufficient amino-acid self-similarity to promote rapid sequestration.\u003c/p\u003e\u003cp\u003eIn summary, we have demonstrated an orthogonal screening method that combines FACS and fluorescent microcapillary arrays to establish genotype-to-single-cell-to-ensemble correlations. We observed that brighter clones tend to contain sequences with more self-similar fragments, whereas dimmer clones in the library exhibit less self-similarity. The method employs two high-throughput techniques in a complementary way, covering screening from single cells to the ensemble level, while enabling enrichment of desired clones. Our workflow separates single-cell expression and growth traits for a systematic study of genetic design on protein yield. 
It is especially suited for genetic systems that do not respond well to computational predictions, such as those involving structural proteins with complex structures.\u003c/p\u003e\n\u003ch3\u003eDATA AVAILABILITY\u003c/h3\u003e\n\u003cp\u003eThe authors declare that data supporting the findings of this study are available within the paper and its supplementary information files.\u003c/p\u003e"},{"header":"Declarations","content":"\u003cp\u003e\u003ch2\u003eCOMPETING INTERESTS\u003c/h2\u003e\u003cp\u003eBenjamin Allen and Melik Demirel are the co-founders of Tandem Repeat Technologies, Inc., and they hold equity in the company. All authors declare that they have issued and pending patents.\u003c/p\u003e\u003c/p\u003e\u003cp\u003e\u003ch2\u003eACKNOWLEDGEMENT OF SPONSORSHIP STATEMENT\u003c/h2\u003e\u003cp\u003eThis effort was sponsored in whole or in part by the Central Intelligence Agency (CIA), through CIA Federal Labs. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon.\u003c/p\u003e\u003c/p\u003e\u003cp\u003e\u003cstrong\u003eDISCLAIMER\u003c/strong\u003e\u003cp\u003eThe views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the Central Intelligence Agency.\u003c/p\u003e\u003c/p\u003e\u003ch2\u003eAUTHOR CONTRIBUTIONS\u003c/h2\u003e\u003cp\u003eM.C.D. conceived the project. T.M.B. and K.S. developed the microcapillary array platform. B.D.A. designed plasmids and protein libraries and conducted self-similarity analysis. K.S. designed and performed all experiments, and analyzed the data. K.S. wrote the manuscript together with M.C.D. All authors participated in manuscript revisions, discussions, and data interpretation.\u003c/p\u003e\u003ch2\u003eACKNOWLEDGMENTS\u003c/h2\u003e\u003cp\u003eWe thank Dr. 
Howard Salis and Harry Adamson for their scientific discussions. Additionally, we appreciate Dr. William Humphries for guiding us in setting up the optical system.\u003c/p\u003e"},{"header":"METHODS","content":"\u003cp\u003e\u003cstrong\u003ePlasmid library design\u003c/strong\u003e\u003cp\u003eThe amino acid sequences of the protein are based on native Squid Ring Teeth proteins.\u003csup\u003e\u003cspan citationid=\"CR48\" class=\"CitationRef\"\u003e48\u003c/span\u003e\u003c/sup\u003e We designed a segmented copolymer protein with four fragments. Each fragment has a crystalline, beta-sheet-forming region and a flexible, GGXY amorphous region. The amino acid sequence of the crystalline region across all fragments remained constant at \u0026lsquo;AAASVSTVHHP\u0026rsquo;. We selected 33 amorphous amino acid fragments from six different squid species, based on our previous analysis. The selection was made according to the similarity in length to the original recombinant n4 sequence, as described in our earlier publication.\u003csup\u003e\u003cspan citationid=\"CR48\" class=\"CitationRef\"\u003e48\u003c/span\u003e\u003c/sup\u003e\u003c/p\u003e\u003c/p\u003e\u003cp\u003e\u003cstrong\u003eVector cloning and library synthesis\u003c/strong\u003e\u003cp\u003ePlasmids pBA407 and pBA500B were used to clone the pBA500C vector. Plasmid pBA500B, containing the protein insertion site, GS linker, HiBiT tag and mCherry CDS, was synthesized by Genscript. pBA407 was digested at BsaI and NcoI restriction sites and pBA500B was digested at KpnI and BamHI restriction sites using the standard protocols (New England Biolabs, NEB). The digested products were then assembled to form pBA500C through standard HiFi DNA assembly (NEB) and transformed into NEB 10-beta at 37\u0026deg;C. Isogenic colonies were propagated in Luria Broth (LB) at 37\u0026deg;C and 250 RPM, and the cultures were miniprepped to obtain the purified pBA500C plasmid. 
The pBA500C sequence was confirmed through long-read whole-plasmid sequencing (Oxford Nanopore, Quintara Biosciences). The pBA500C plasmid contains the SRT protein CDS insertion site compatible with Golden Gate assembly. Plasmids containing the protein CDS (flanked with BsaI restriction sites) of all variants were ordered as a pool in a commercial vector pUC57 (Genscript). Golden Gate assembly of pBA500C and the protein-insert plasmid pool was carried out using standard procedures (NEB, BsaI-HF v2). The assembled product was transformed into NEB 10-beta chemically competent cells. Multiple agar plates were prepared using different loading volumes of the recovery (2.5 \u0026micro;l, 25 \u0026micro;l, 100 \u0026micro;l, and 250 \u0026micro;l). Three plates with the greatest density of isogenic colonies were selected for library pooling, and all colonies (estimated at about 20,000\u0026ndash;25,000) were collected using LB. A portion of the recovered colony suspension was miniprepped. The miniprepped plasmids were then transformed into \u003cem\u003eE. coli\u003c/em\u003e BL21-DE3 (NEB) chemically competent cells. All colonies from four agar plates with 100 \u0026micro;l and 250 \u0026micro;l loading volume of the recovery were collected using LB and pooled. This pool was stored in glycerol cryostocks and was used for further screening and tests. All nucleotide sequences used in this study are provided in Supplementary Information.\u003c/p\u003e\u003c/p\u003e\u003cp\u003e\u003cstrong\u003eFACS\u003c/strong\u003e\u003cp\u003eThe library cryostock was inoculated in carbenicillin-supplemented LB and incubated at 37\u0026deg;C and 250 RPM for 5 hours. The culture was then induced with IPTG at a final concentration of 400 \u0026micro;M and incubated overnight at 25\u0026deg;C. The culture was diluted 500\u0026times; in phosphate-buffered saline (PBS, 1\u0026times;) and used for FACS (ThermoScientific, Bigfoot Cell Sorter). 
The parameters for the gates (low, medium, and high) and the respective populations are described in the main text and figures. The sorted cells (10,000 in each bin) were collected in carbenicillin-supplemented LB and incubated overnight. Finally, glycerol stocks of each bin were prepared and used for further analyses.\u003c/p\u003e\u003c/p\u003e\u003cp\u003e\u003cstrong\u003eInstrumentation and sample preparation\u003c/strong\u003e\u003cp\u003eThe microcapillary-array screening system was described in detail earlier.\u003csup\u003e\u003cspan citationid=\"CR34\" class=\"CitationRef\"\u003e34\u003c/span\u003e\u003c/sup\u003e Briefly, an inverted fluorescence microscope (Olympus IX73) was equipped with a monochrome camera (DP23M, Olympus), a motorized XY stage (Marzhauser Wetzlar Tango 3), a motorized focus drive (Marzhauser Wetzlar), and a 10x objective lens (Olympus). The platform was operated using the cellSens (Olympus) software suite. Microcapillary arrays with 20 \u0026micro;m capillary diameter (INCOM, Inc.) were used throughout all experiments, each containing approximately 8\u0026times;10^5 capillaries. The arrays were sterilized prior to use and treated with a plasma wand to increase hydrophilicity. The cryostock of each bin was incubated in carbenicillin-supplemented Luria broth (LB) at 37\u0026deg;C until reaching an optical density of about 0.1. Cells in LB were loaded onto an array, with cell concentration adjusted so that each capillary typically contained a single cell. After loading, the array was overlaid with a 2% (w/v) agarose gel layer (1\u0026ndash;2 mm thick) and incubated at 37\u0026deg;C in a sealed petri dish lined with moist wipes for 5 hours. The cells were then induced with IPTG by adding the stock solution to achieve a 400 \u0026micro;M concentration in the capillaries. Following induction, the arrays were incubated at 25\u0026deg;C and screened at 5, 15, 25, and 35 hours. 
After screening, the arrays were sterilized in 70% ethanol for several hours. They were then thoroughly cleaned with a stream of DI water, with intermediate ultrasonication for 2 minutes.\u003c/p\u003e\u003c/p\u003e\u003cp\u003e\u003cstrong\u003eAmplification and sequencing\u003c/strong\u003e\u003cp\u003eTo obtain protein CDS inserts for sequencing, the BL21-DE3 sub-libraries were propagated in LB (37\u0026deg;C, 250 RPM) and stored in aliquots. Rolling circle amplification (RCA) was performed with these aliquots using standard protocols (phi29-XT RCA kit, NEB) for 24 hours. Initial verification of amplification used random primers; after confirmation, sequence-specific primers (5\u0026rsquo;-CGTGAAACATC*C*T*G-3\u0026rsquo;, 5\u0026rsquo;-GTCCTGAAGAG*A*G*G-3\u0026rsquo;) were used to minimize background. The RCA products were diluted twofold and digested with MscI and NcoI. The digested products were run on DNA purification gels (1% w/v agarose in TAE buffer) for about an hour. Desired bands were excised and purified using a DNA gel purification kit (NEB). Plasmid concentrations were quantified with Qubit, and the samples were sent for next-generation sequencing (NGS, Illumina; Quintara Biosciences). A custom MATLAB script was used to analyze the merged raw reads.\u003c/p\u003e\u003c/p\u003e\u003cp\u003e\u003cstrong\u003eMicrocapillary array platform screening\u003c/strong\u003e\u003cp\u003eBrightfield and fluorescence images were captured at t\u0026thinsp;=\u0026thinsp;5 hr, 15 hr, 25 hr, and 35 hr at multiple locations across each bin array. These images were taken with consistent exposure times (450 ms for brightfield and 100 ms for fluorescence). Fluorescence images were processed using a custom MATLAB script. The contrast was first normalized, and the images were binarized to highlight white capillaries indicating fluorescence. Capillary regions were identified and used as masks to measure the total pixel intensity within each capillary in the original images. 
No fluorescence was detected at t\u0026thinsp;=\u0026thinsp;5 hr. The threshold level used for the low bin images at t\u0026thinsp;=\u0026thinsp;15 hr was applied to all other images, including medium and high bins. Similarly, brightfield images were binarized using an optimal threshold, capillaries identified, and their intensities recorded. The intensities from each image were sorted in ascending order, and the lowest set of capillary intensities\u0026mdash;matching the number of capillaries identified in the corresponding fluorescence images\u0026mdash;were used to generate growth curves. Remaining capillaries were classified as background. Absorbance levels were also calculated.\u003c/p\u003e\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\u003cli\u003e\u003cspan\u003eLeal-Ega\u0026ntilde;a, A. \u0026amp; Scheibel, T. Silk‐based materials for biomedical applications. \u003cem\u003eBiotech and App Biochem\u003c/em\u003e 55, 155\u0026ndash;167 (2010). \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1042/BA20090229\u003c/span\u003e\u003cspan address=\"10.1042/BA20090229\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eMeyer, M. Processing of collagen based biomaterials and the resulting materials properties. \u003cem\u003eBioMed Eng OnLine\u003c/em\u003e 18, 24 (2019). \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1186/s12938-019-0647-0\u003c/span\u003e\u003cspan address=\"10.1186/s12938-019-0647-0\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eBracalello, A. \u003cem\u003eet al.\u003c/em\u003e Design and Production of a Chimeric Resilin-, Elastin-, and Collagen-Like Engineered Polypeptide. \u003cem\u003eBiomacromolecules\u003c/em\u003e 12, 2957\u0026ndash;2965 (2011). 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1021/bm2005388\u003c/span\u003e\u003cspan address=\"10.1021/bm2005388\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eFeroz, S., Muhammad, N., Ratnayake, J. \u0026amp; Dias, G. Keratin - Based materials for biomedical applications. \u003cem\u003eBioactive Materials\u003c/em\u003e 5, 496\u0026ndash;509 (2020). \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1016/j.bioactmat.2020.04.007\u003c/span\u003e\u003cspan address=\"10.1016/j.bioactmat.2020.04.007\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003ePena-Francesch, A. \u0026amp; Demirel, M. C. Squid-inspired tandem repeat proteins: Functional fibers and films. \u003cem\u003eFrontiers in chemistry\u003c/em\u003e 7, 69 (2019).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eCarter, N. A. \u0026amp; Grove, T. Z. Functional protein materials: Beyond elastomeric and structural proteins. \u003cem\u003ePolymer Chemistry\u003c/em\u003e 10, 2952\u0026ndash;2959 (2019).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eLeisola, M. \u0026amp; Turunen, O. Protein engineering: opportunities and challenges. \u003cem\u003eAppl Microbiol Biotechnol\u003c/em\u003e 75, 1225\u0026ndash;1232 (2007). \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1007/s00253-007-0964-2\u003c/span\u003e\u003cspan address=\"10.1007/s00253-007-0964-2\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eVural, M. \u0026amp; Demirel, M. C. Biocomposites of 2D layered materials. 
\u003cem\u003eNanoscale Horizons\u003c/em\u003e 10, 664\u0026ndash;680 (2025).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eCapezza, A. J. \u0026amp; Mezzenga, R. Proteins for Applied and Functional Materials. \u003cem\u003eBiomacromolecules\u003c/em\u003e 25, 4615\u0026ndash;4618 (2024). \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1021/acs.biomac.4c00884\u003c/span\u003e\u003cspan address=\"10.1021/acs.biomac.4c00884\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eDemirel, M. C., Cetinkaya, M., Pena-Francesch, A. \u0026amp; Jung, H. Recent advances in nanoscale bioinspired materials. \u003cem\u003eMacromolecular bioscience\u003c/em\u003e 15, 300\u0026ndash;311 (2015).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003ePena-Francesch, A. \u003cem\u003eet al.\u003c/em\u003e Research update: programmable tandem repeat proteins inspired by squid ring teeth. \u003cem\u003eAPL Materials\u003c/em\u003e 6 (2018).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eTomko, J. A. \u003cem\u003eet al.\u003c/em\u003e Tunable thermal transport and reversible thermal conductivity switching in topologically networked bio-inspired materials. \u003cem\u003eNature Nanotech\u003c/em\u003e 13, 959\u0026ndash;964 (2018). \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1038/s41565-018-0227-7\u003c/span\u003e\u003cspan address=\"10.1038/s41565-018-0227-7\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003ePena-Francesch, A., Jung, H., Demirel, M. C. \u0026amp; Sitti, M. Biosynthetic self-healing materials for soft machines. \u003cem\u003eNat. Mater.\u003c/em\u003e 19, 1230\u0026ndash;1235 (2020).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eSinghal, K., Mazeed, T. 
\u0026amp; Demirel, M. C. Cephalopod inspired self-healing protein foams for oil-water separation. \u003cem\u003eiScience\u003c/em\u003e 26, 108300 (2023). \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1016/j.isci.2023.108300\u003c/span\u003e\u003cspan address=\"10.1016/j.isci.2023.108300\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eYılmaz, H. \u003cem\u003eet al.\u003c/em\u003e Structural Protein-Based Whispering Gallery Mode Resonators. \u003cem\u003eACS Photonics\u003c/em\u003e 4, 2179\u0026ndash;2186 (2017). \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1021/acsphotonics.7b00310\u003c/span\u003e\u003cspan address=\"10.1021/acsphotonics.7b00310\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eSinghal, K., Boy, R., Abdullah, A. M., Mazeed, T. \u0026amp; Demirel, M. C. Engineering advanced cellulosics for enhanced triboelectric performance using biomanufactured proteins. \u003cem\u003enpj Mater. Sustain.\u003c/em\u003e 2, 29 (2024). \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1038/s44296-024-00035-7\u003c/span\u003e\u003cspan address=\"10.1038/s44296-024-00035-7\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eCappello, J. \u003cem\u003eet al.\u003c/em\u003e Genetic Engineering of Structural Protein Polymers. \u003cem\u003eBiotechnology Progress\u003c/em\u003e 6, 198\u0026ndash;202 (1990). 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1021/bp00003a006\u003c/span\u003e\u003cspan address=\"10.1021/bp00003a006\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003ePoddar, H., Breitling, R. \u0026amp; Takano, E. Towards engineering and production of artificial spider silk using tools of synthetic biology. \u003cem\u003eEng. biol.\u003c/em\u003e 4, 1\u0026ndash;6 (2020). \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1049/enb.2019.0017\u003c/span\u003e\u003cspan address=\"10.1049/enb.2019.0017\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eShire, E., Coimbra, A. A. B., Barba Ostria, C., Rios-Solis, L. \u0026amp; L\u0026oacute;pez Barreiro, D. Molecular design of protein-based materials \u0026ndash; state of the art, opportunities and challenges at the interface between materials engineering and synthetic biology. \u003cem\u003eMol. Syst. Des. Eng.\u003c/em\u003e 9, 1187\u0026ndash;1209 (2024). \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1039/D4ME00122B\u003c/span\u003e\u003cspan address=\"10.1039/D4ME00122B\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eArakawa, K. \u003cem\u003eet al.\u003c/em\u003e 1000 spider silkomes: Linking sequences to silk physical properties. \u003cem\u003eSci. Adv.\u003c/em\u003e 8, eabo6043 (2022). \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1126/sciadv.abo6043\u003c/span\u003e\u003cspan address=\"10.1126/sciadv.abo6043\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eBowen, C. 
H. \u003cem\u003eet al.\u003c/em\u003e Recombinant Spidroins Fully Replicate Primary Mechanical Properties of Natural Spider Silk. \u003cem\u003eBiomacromolecules\u003c/em\u003e 19, 3853\u0026ndash;3860 (2018). \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1021/acs.biomac.8b00980\u003c/span\u003e\u003cspan address=\"10.1021/acs.biomac.8b00980\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eLin, S., Chen, G., Liu, X. \u0026amp; Meng, Q. Chimeric spider silk proteins mediated by intein result in artificial hybrid silks. \u003cem\u003eBiopolymers\u003c/em\u003e 105, 385\u0026ndash;392 (2016). \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1002/bip.22828\u003c/span\u003e\u003cspan address=\"10.1002/bip.22828\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eHosse, R. J., Rothe, A. \u0026amp; Power, B. E. A new generation of protein display scaffolds for molecular recognition. \u003cem\u003eProtein Science\u003c/em\u003e 15, 14\u0026ndash;27 (2006). \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1110/ps.051817606\u003c/span\u003e\u003cspan address=\"10.1110/ps.051817606\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eMoffet, D. A. \u0026amp; Hecht, M. H. De Novo Proteins from Combinatorial Libraries. \u003cem\u003eChem. Rev.\u003c/em\u003e 101, 3191\u0026ndash;3204 (2001). 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1021/cr000051e\u003c/span\u003e\u003cspan address=\"10.1021/cr000051e\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eZhao, H. \u0026amp; Arnold, F. H. Combinatorial protein design: strategies for screening protein libraries. \u003cem\u003eCurrent Opinion in Structural Biology\u003c/em\u003e 7, 480\u0026ndash;485 (1997). \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1016/S0959-440X(97)80110-8\u003c/span\u003e\u003cspan address=\"10.1016/S0959-440X(97)80110-8\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eJ\u0026auml;ckel, C., Kast, P. \u0026amp; Hilvert, D. Protein Design by Directed Evolution. \u003cem\u003eAnnu. Rev. Biophys.\u003c/em\u003e 37, 153\u0026ndash;173 (2008). \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1146/annurev.biophys.37.032807.125832\u003c/span\u003e\u003cspan address=\"10.1146/annurev.biophys.37.032807.125832\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eKan, A. \u0026amp; Joshi, N. S. Towards the directed evolution of protein materials. \u003cem\u003eMRS Communications\u003c/em\u003e 9, 441\u0026ndash;455 (2019). \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1557/mrc.2019.28\u003c/span\u003e\u003cspan address=\"10.1557/mrc.2019.28\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eWang, Y. \u003cem\u003eet al.\u003c/em\u003e Directed Evolution: Methodologies and Applications. \u003cem\u003eChem. 
Rev.\u003c/em\u003e 121, 12384\u0026ndash;12444 (2021). \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1021/acs.chemrev.1c00260\u003c/span\u003e\u003cspan address=\"10.1021/acs.chemrev.1c00260\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eYang, K. K., Wu, Z. \u0026amp; Arnold, F. H. Machine-learning-guided directed evolution for protein engineering. \u003cem\u003eNat Methods\u003c/em\u003e 16, 687\u0026ndash;694 (2019). \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1038/s41592-019-0496-6\u003c/span\u003e\u003cspan address=\"10.1038/s41592-019-0496-6\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eZimmermann, A., Prieto-Vivas, J. E., Voordeckers, K., Bi, C. \u0026amp; Verstrepen, K. J. Mutagenesis techniques for evolutionary engineering of microbes \u0026ndash; exploiting CRISPR-Cas, oligonucleotides, recombinases, and polymerases. \u003cem\u003eTrends in Microbiology\u003c/em\u003e 32, 884\u0026ndash;901 (2024). \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1016/j.tim.2024.02.006\u003c/span\u003e\u003cspan address=\"10.1016/j.tim.2024.02.006\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eStemmer, W. P. C. Rapid evolution of a protein in vitro by DNA shuffling. \u003cem\u003eNature\u003c/em\u003e 370, 389\u0026ndash;391 (1994). \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1038/370389a0\u003c/span\u003e\u003cspan address=\"10.1038/370389a0\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eZhao, H. 
Optimization of DNA shuffling for high fidelity recombination. Nucleic Acids Research 25, 1307–1308 (1997). https://doi.org/10.1093/nar/25.6.1307
Engler, C. & Marillonnet, S. in Synthetic Biology Vol. 1073 (eds Karen M. Polizzi & Cleo Kontoravdi) 141–156 (Humana Press, 2013).
Singhal, K., Adamson, H. E., Baer, T. M., Salis, H. M. & Demirel, M. C. Microcapillary Array-Based High Throughput Screening for Protein Biomanufacturability. ACS Synth. Biol. 14, 2328–2340 (2025).
Reetz, M. T. & Carballeira, J. D. Iterative saturation mutagenesis (ISM) for rapid directed evolution of functional enzymes. Nat Protoc 2, 891–903 (2007). https://doi.org/10.1038/nprot.2007.72
Romero, P. A. & Arnold, F. H. Exploring protein fitness landscapes by directed evolution. Nat Rev Mol Cell Biol 10, 866–876 (2009). https://doi.org/10.1038/nrm2805
Biot-Pelletier, D. & Martin, V. J. J. Evolutionary engineering by genome shuffling. Appl Microbiol Biotechnol 98, 3877–3887 (2014). https://doi.org/10.1007/s00253-014-5616-8
Dubini, R. C., Jung, H., Skidmore, C. H., Demirel, M. C. & Rovó, P. Hydration-induced structural transitions in biomimetic tandem repeat proteins. The Journal of Physical Chemistry B 125, 2134–2145 (2021).
LaFleur, T. L., Hossain, A. & Salis, H. M. Automated model-predictive design of synthetic promoters to control transcriptional profiles in bacteria. Nature Communications 13, 5159 (2022).
Levin-Karp, A. et al. Quantifying translational coupling in E. coli synthetic operons using RBS modulation and fluorescent reporters. ACS Synth. Biol. 2, 327–336 (2013).
Tian, T. & Salis, H. M. A predictive biophysical model of translational coupling to coordinate and control protein expression in bacterial operons. Nucleic Acids Research 43, 7137–7151 (2015). https://doi.org/10.1093/nar/gkv635
Metzker, M. L. Sequencing technologies – the next generation. Nat Rev Genet 11, 31–46 (2010).
Makrides, S. C. Strategies for achieving high-level expression of genes in Escherichia coli. Microbiological Reviews 60, 512–538 (1996).
Sabate, R., De Groot, N. S. & Ventura, S. Protein folding and aggregation in bacteria. Cellular and Molecular Life Sciences 67, 2695–2715 (2010).
Konstantoulea, K., Louros, N., Rousseau, F. & Schymkowitz, J. Heterotypic interactions in amyloid function and disease. The FEBS Journal 289, 2025–2046 (2022).
Bednarska, N. G., Schymkowitz, J., Rousseau, F. & Van Eldere, J. Protein aggregation in bacteria: the thin boundary between functionality and toxicity. Microbiology 159, 1795–1806 (2013).
Jeon, J., Subramani, S. V., Lee, K. Z., Jiang, B. & Zhang, F. Microbial synthesis of high-molecular-weight, highly repetitive protein polymers. International Journal of Molecular Sciences 24, 6416 (2023).
Jung, H. et al. Molecular tandem repeat strategy for elucidating mechanical properties of high-strength proteins. Proc. Natl. Acad. Sci. U.S.A. 113, 6478–6483 (2016).

Keywords: SRT protein, combinatorial library, sequence self-similarity, microcapillary arrays, FACS, recombinant proteins, high-throughput screening

Subject areas: Biological sciences/Chemical biology/Protein design; Biological sciences/Biochemistry/Protein folding/Protein aggregation

Posted: October 22nd, 2025
