{"paper_id":"2093bfdf-b81d-48a7-bf9e-b17064dce47b","body_text":"From Atoms to Fragments: A Coarse Representation\nfor Efficient and Functional Protein Design\nLeonardo V. Castorina 1, Christopher W. Wood 2 and Kartic Subr 1∗\n1 School of Informatics, The University of Edinburgh, 10 Crichton Street, Newington, Edinburgh, EH8 9AB, UK and 2 School of Biological\nSciences, The University of Edinburgh, Roger Land Building, Edinburgh, EH9 3FF, UK\n∗Corresponding author.\nAbstract\nDeep learning has made remarkable progress in protein design, yet current protein representations remain largely black-\nbox and scale poorly with protein length, leading to high computational costs. We propose a fragment-based protein\nrepresentation that balances interpretability and efficiency. Using a curated set of 40 evolutionarily conserved fragments,\nwe represent proteins as fragment sets or fragment graphs, significantly reducing dimensionality while preserving\nfunctional information. Here, we show that fragment-based representations capture significantly more information at\nmuch lower dimensions compared to traditional methods. On a dataset of 215 functionally diverse proteins, our approach\noutperforms traditional sequence- and structure-based methods in clustering by protein function at ≤ 30% sequence\nidentity. Additionally, fragment-based search achieves comparable accuracy while using 90% fewer tokens. It also runs\n∼68.7× faster than RMSD-based methods and ∼1.64× faster than sequence-based methods, even when including fragment\npre-processing overhead. Finally, we show that fragments can guide RFDiffusion backbone generation, with recovery rates\nhigher than 40%. 
We propose fragment-based representations as a scalable and interpretable alternative for the next\ngeneration of protein design tools, spanning backbone and sequence design to functional searches in protein structure\ndatabases.\nKey words: Protein Representation, Functional Protein Design, Functional Protein Search, Fragments\nIntroduction\nDesigning functional proteins could transform medicine,\nbiotechnology, and sustainability. From enzymes that catalyze\nreactions, to vaccines against target diseases, proteins serve as\nprecise molecular tools to our most pressing problems. However,\ndesigning proteins remains a computationally intractable\nproblem due to the combinatorial complexity of the search\nspace. With 20 possible amino acids at each position, the\nsearch space grows exponentially with protein length, making\nexhaustive explorations impossible.\nTo navigate this search space, Artificial Intelligence\n(AI) methods have enabled de novo design of protein\nbinders [14], neutralizing antibodies against diseases [28],\nand enzymes [33, 23]. These models rely on different\nprotein representations. Large Language Models (LLMs) (e.g.,\nProtGPT [11], ESM [26]), treat proteins as sequences, while\ndiffusion models (e.g., RFDiffusion[31], EvoDiff[1]) represent\nprotein structures as vector frames encoding the atomic\ncoordinates. Other approaches represent structures as voxel\ngrids (e.g., TIMED [8], DenseCPD[24]) or graphs (e.g.,\nProteinMPNN[10]).\nWhile structure- and sequence-based representations have\nenabled breakthroughs, they impose significant computational\nburdens that scale non-linearly with protein size. 
This makes\nlarge-scale protein design prohibitively expensive and leads to\nincreasingly complex models, highlighting the need for more\nefficient and interpretable representations.\nTo address these challenges, we propose fragment-based\nrepresentations – an approach that represents proteins as\ncombinations of evolutionarily conserved structural fragments\ninstead of full sequences or atomic structures (see Figure 1).\nThis idea is rooted in protein evolution, where structures and\nfunctions evolved from recombination, repetition, and accretion\nof small, functional peptides [2]. We show that this intermediate\nabstraction level significantly reduces dimensionality while\npreserving protein functional signatures, enabling faster and\nmore interpretable protein search, analysis, and design.\nProteins inherently lend themselves to abstraction at\nmultiple scales. Secondary structures such as α-helices and\nβ-sheets provide simplified views of local folding patterns in\nribbon diagrams [25]. At a higher level, tertiary structural\nmotifs, such as β hairpins or helix-turn-helix domains, are\nstrongly associated with molecular functions and are widely\nused for design and analysis [21]. Our fragment-based\nrepresentation follows this principle, focusing on functional\nbuilding blocks rather than full atomic resolution.\n© The Author 2022. Published by Oxford University Press. All rights reserved. For permissions, please e-mail:\njournals.permissions@oup.com\nbioRxiv preprint doi: https://doi.org/10.1101/2025.03.19.644162; this version posted March 20, 2025. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.\nPrevious studies have identified and leveraged recurring\nstructural motifs for protein analysis and design. Alva et\nal. 
[3] identified a conserved set of ancient fragments associated\nwith functions such as DNA and metal ion binding. Frappier\net al. [13] introduced recurring Tertiary Structural Motifs\n(TERMs) to design protein binders. Kolodny et al. [20]\ncompiled several sequence-based “THEMES” for functional\nprotein analysis. Fragment-based representations have also\nbeen used beyond proteins, such as fragSMILES [22] for\nchemical representations and component-based approaches in\ncomputer vision [6].\nBuilding on these ideas, we introduce fragment-based\nrepresentations as an alternative to traditional sequence- and\nstructure-based representations. Our method explicitly encodes\nthe backbone geometry rather than amino acid sequence,\ndecomposing proteins into functional fragments. Each fragment\nrepresents a recurring structural motif associated with specific\nfunctions. Using a library of 40 conserved fragments, we\nshow that fragment-based representations effectively capture\nfunctional information at a much lower dimensionality than\ntraditional methods.\nAdditionally, we provide a fully vectorized Python package\n(MIT License) for fragment detection and representation,\nadaptable to any fragment library.\nWe demonstrate three key applications of our fragment-\nbased approach: (1) functional clustering to evaluate how well\nfragments capture protein function; (2) database searching\nto demonstrate effectiveness in retrieving functional proteins\nand computational efficiency; and (3) protein design using\nfragments as blueprints to guide RFdiffusion to generate\nbackbones with functional signatures.\nBecause fragments encode functional units, fragment-\nbased representations offer a computationally efficient and\ninterpretable approach to protein representation, search, and\ndesign. 
By balancing efficiency and biological relevance,\nfragment-based representations provide a scalable foundation\nfor the next generation of protein design tools.\nMethods\nWe propose to represent proteins abstractly as being composed\nof evolutionarily conserved fragments. We use the 40 fragments\nidentified by Alva et al. [3], representing ancient structural\nmotifs associated with proteins that bind DNA, RNA,\nmetal ions, GTP, and ATP.\nWe first explain how to construct fragment-based representations\nof protein structures and then evaluate them via three\napplications:\n• functional clustering to assess how well fragments capture\nprotein function;\n• database searching to demonstrate effectiveness in retrieval\nand computational efficiency; and\n• protein design to show how fragments can condition the\nbackbone generation process.\nIn each case, we compare fragment-based representations\nagainst traditional sequence- and structure-based approaches.\nFragments as a Coarse Representation of Proteins\nGiven a protein structure from a Protein Data Bank (PDB)\nfile, our representation decomposes the structure using\nevolutionarily conserved fragment motifs. We then represent\nthe protein structure, without sequence\ninformation, either as a Fragment Graph or as a Fragment Set.\nIn the former, nodes in the graph represent fragments (labeled\nby fragment type) and edges denote either peptide bonds between\nfragments or spatial proximity. Fragment Sets, on the other\nhand, only contain lists of unique fragments present in a\nstructure, regardless of their arrangement (see Figure 1).\nOur fragment library is based on the 40 fragments from Alva\net al. [3]. To create a curated reference set, we extracted all\ninstances of these fragments from their reported PDB structures\nusing the AMPAL framework [32]. 
We then filtered these\ninstances to ensure sequence consistency and correct residue\nlengths, resulting in a verified reference set of 219 instances\nacross the 40 fragment types (see Supplementary Table 1).\nBuilding Fragment Representations\nRepresenting proteins as fragments involves three main steps:\n(1) detecting the fragments in the given structure, (2)\nclassifying unmatched regions, and (3) converting the classified\nstructure into a graph or a set representation (see Figure 1).\nFragment detection identifies segments of the target protein\nwhose distance to a fragment in our library falls below a\nthreshold. We implemented a sliding window algorithm (see\nSupplementary Algorithm 1) that computes distances between\nsegments of the target protein and each fragment in the library.\nFor this distance calculation, we evaluated several distance\nmetrics both individually and in combination:\n• Sequence-based metrics (sequence identity, BLOSUM\ndistance) measure distance in amino acid sequence [17].\n• Angle-based metrics (RMS, RamRMSD, LogPr) are\nsequence-independent and measure distance in backbone\ntorsion angles (ϕ and ψ) [19].\nThis produces a Fragment Distance Matrix D\nquantifying the distance between each library fragment Fk and\nthe segments of the target structure T. To classify regions\nas fragments, we normalize the Fragment Distance Matrix D\nto the [0, 1] range. Regions with distances below the optimal\nthreshold of 3.65%, i.e. 0.0365 on the normalized scale (determined through ROC analysis to\nmaximize accuracy), are classified as matching fragments. If\nfragment matches overlap, we prioritize matches with lower\ndistances. 
We allow an overlap of up to two amino acids between\nneighboring fragments.\nAfter fragment detection, regions not matching any known\nfragments are classified according to their length: regions of\n9–24 amino acids are classified as unknown fragments, while\nregions shorter than 9 amino acids are classified as unknown\nconnectors.\nFinally, we represent classified regions using two types of\nfragment-based representations:\n• Fragment Sets record only the presence or absence of fragment\nclasses without considering connectivity information.\n• Fragment Graphs represent structures as graphs, where\nnodes correspond to fragments (known or unknown) and\nconnectors, and edges indicate peptide bonds or spatial\nproximity (<10 Å). Edge features are one-hot encoded to\ndistinguish between connection types.\nFragment Sets are suitable for applications where speed\nis required, such as database searches. Fragment Graphs are\nbetter suited when the structural context is important, for\nexample in protein design or functional clustering.\n[Figure 1: protein structure alongside its Fragment Graph and Fragment Set]\nFig. 1. Fragment-based protein representation. Conversion from protein structure to fragment representations using a ZIF268 Zinc Finger (PDB: 1AAY;\nDNA in yellow and zinc ions in purple). The detection algorithm identifies regions matching known fragments, here Fragment 14 (blue), which corresponds\nto DNA- and metal-binding functions essential for zinc fingers. Unclassified regions are labeled as “unknown” (white). 
The identified fragments are\nrepresented either as a Fragment Set, which counts unique fragment types or as a Fragment Graph, which preserves connectivity through peptide bonds\n(dark edges) and spatial proximity (dotted edges).\nNext, we used the PDBench, a fold-balanced protein\ndataset, to analyze whether structural and chemical properties\nare preserved in fragment versus non-fragment regions [7].\nDatasets for Validation\nWe used two datasets to validate our representation: PDBench\nto test the preservation of structural and chemical properties\nand Protein Function Dataset (PFD) to assess whether\nfragments capture functional properties in proteins.\nTo validate whether fragments preserve structural and\nchemical properties, we used PDBench [7], a fold-balanced\ndataset of protein structures. We tested whether properties\nsuch as hydrogen bonding and solvent accessibility are\npreserved in similar proportions between fragment and non-\nfragment regions. We hypothesized that important structural\nand chemical properties would be enriched in fragment regions\nrelative to non-fragment regions. Specifically, we quantified the\ncorrelation between the percentage of each property in fragment\nregions and the fraction of the protein covered by fragments.\nA deviation from perfect correlation would indicate over- or\nunder-represented properties.\nTo validate whether fragments capture functional relationships,\nwe created PFD, a structurally diverse dataset of functional\nproteins. The dataset includes 215 protein monomers spanning\n12 functional categories, covering binding of DNA, RNA,\nmetal ions, GTP, ATP, and combinations thereof. We ensured\nstructural diversity by filtering using Gene Ontology (GO)\ncodes [4] and enforcing a sequence identity cutoff of ≤ 30%\nthrough the PDB Advanced Search interface [5]. This structural\ndiversity is essential to assess whether functional relationships\nare captured, independent of high homology. 
Where possible,\nwe selected 10 representative structures per functional category\n(detailed in Supplementary Table 1). We use this dataset to\nevaluate fragment-based functional clustering, then, to assess\ndatabase search performance, and finally, for fragment-guided\nbackbone generation.\nFragments for Functional Clustering\nWe tested the quality of our fragment-based representations\nby evaluating how well they capture functional relationships\nbetween proteins. Since fragments represent evolutionarily\nconserved functional motifs, we hypothesized that fragment-\nbased representations should accurately capture functional\nsimilarities between proteins. Specifically, we evaluated\nwhether proteins with similar functions cluster better when\nrepresented using fragments compared to traditional structure-\nand sequence-based representations.\nTo test this, we first computed pairwise distances for all\nprotein pairs in the PFD using fragment-based, sequence-\nbased, and structure-based metrics. Then, we projected each\nprotein into a distance-preserving latent space of increasing\ndimensionality using Principal Coordinate Analysis (PCoA)\nand t-SNE. This allowed us to evaluate how effectively each\nrepresentation captures functional relationships at various\ndimensions.\nWe selected these metrics to calculate the distances for\nclustering:\n1. RMSD (Root Mean Square Deviation) : Measures\ntraditional structural similarity using BioPython’s CE-\nAligner [29, 9].\n2. BLOSUM62: Measures traditional sequence similarity\nusing pairwise alignment scores based on amino acid\nsubstitution frequencies from BLOSUM[17].\n3. BagOfNodes: Measures similarity based purely on the\npresence or absence of fragments (Fragment Sets), ignoring\ntheir spatial arrangement. This representation tests\ntopology-independent functional information.\n4. 
GraphEditDistance: Measures functional similarity\nby accounting for both fragment identity and spatial\narrangement (Fragment Graphs), providing a more\ncomprehensive fragment-based metric [12].\nNext, we clustered the resulting embeddings using Gaussian\nMixture Models (GMM) and K-Means. We set the number of\nclusters to 12, corresponding to the known functional\ncategories within the Protein Function Dataset (PFD). To\nmeasure robustness and comprehensively evaluate clustering\nquality, we used the Adjusted Rand Index (ARI), Normalized Mutual\nInformation (NMI), Silhouette and Trustworthiness scores, F1\nscore, and the correlation between embedding distances and the\noriginal pairwise distances.\nFragments for Functional-based Searches\nWhile the previous experiment evaluated the quality of the\nrepresentation, here we assessed its feasibility for protein\ndatabase searches. We tested how quickly fragment-based\ndistances retrieved results and whether the retrieved proteins\nhad functions similar to that of the query protein.\nFirst, we benchmarked the initialization time, search speed,\nand memory requirements. Then, we tested whether the most relevant\nresults matched the function of the query protein.\nFor benchmarking, we measured initialization and query\ntimes for our fragment-based methods (GraphEditDistance\nand BagOfNodes) against traditional approaches (RMSD and\nBLOSUM). We used 1, 10, and 100 queries on a database of 100\nproteins to measure the scalability of the search, using 35 cores\n(see footnote 1). 
We also measured the memory requirement as the average\nnumber of data points required for each representation.\nTo assess the quality of the retrieved results, we queried each\nfunctional protein in the PFD against all other proteins and\nsorted by the lowest distance. We then evaluated whether the\nretrieved proteins shared functional similarity with the query\nusing two complementary metrics: Normalized Discounted\nCumulative Gain (NDCG) and Area Under the Receiver\nOperating Characteristic (AUROC), which measure whether\nfunctionally similar proteins rank higher in the results (see\ndetails in Supplementary Section 13).\nFrom Fragments to Functional Proteins\nFinally, we explored whether fragments could be used as\nblueprints to guide the generation of functional proteins by\nproviding structural constraints to a generative model. We\nhypothesized that if fragments capture functional information,\nthen proteins generated using fragment-derived templates\nshould be structurally similar to known proteins with the same\nfunction.\nTo test this hypothesis, we used RFDiffusion[31], a state-\nof-the-art protein backbone generation model. For all the 215\nPFD proteins, we generated partial backbone templates by\nmasking non-fragment regions. After filling the missing regions,\nwe used them as queries to assess whether the most similar\nresults matched the function of the queries.\nFor each protein in the dataset, we first applied our fragment\ndetection algorithm to identify functionally important regions.\nThen, we created partial backbone templates with only these\nregions, and non-fragment regions removed. 
RFDiffusion then\nfilled in the missing regions, generating five candidate backbone\nstructures per template.\nTo evaluate the functional recovery of the designs, we used\nFoldSeek to align each generated structure to known proteins\nin the PDB using sequence-independent shape matching [30].\nWe calculated the percentage recovery rate as the fraction of\ngenerated designs whose top 10 structural matches shared the\nexact Gene Ontology (GO) code(s) of the original protein. As\na control, we performed the same evaluation with the original\nprotein backbones to create a baseline for comparison.\n1 Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz\nResults\nWe evaluate our fragment-based representation across four key\nareas: fragment detection accuracy, physicochemical properties\nof fragments, effectiveness in capturing functional patterns, and\napplications in protein search and design.\nAccurate Fragment Detection Using Combined\nMetrics\nWe validated the fragment detection algorithm using distance\nmetrics based on sequence (BLOSUM and Sequence Identity)\nand angle-based distance metrics (LogPr, RamRMSD, and\nRMS). We also tried RMSD (via both PyMOL and BioPython);\nhowever, it was much slower, and we observed several silent\nfailures during alignments. Supplementary Figure 1 shows the\nperformance (F1 score) of individual metrics and ensembles.\nWhile individual metrics achieved modest F1 scores around\n0.40, combining two complementary metrics significantly\nimproved performance to approximately 0.85. The LogPr\nand RamRMSD combination consistently demonstrated the\nhighest accuracy. 
Adding a third metric provided no significant\nimprovement.\nUsing Receiver Operating Characteristic (ROC) analysis, we\nidentified an optimal threshold of 3.65% for fragment\nclassification, achieving an Area Under ROC (AUROC) of 87%\n(see Supplementary Figure 5).\nFragment Regions Show Distinct Structural and\nChemical Properties\nWe evaluated physico-chemical properties of fragment and non-\nfragment regions using the PDBench benchmark. As shown\nin Supplementary Figure 6, the percentage of the protein\ncovered by fragments was roughly 40% with consistent standard\ndeviations. There were some outliers, such as Alpha Solenoid and\nAlpha-Beta Horseshoe, at around 20%. The special folds had\nthe highest standard deviation, larger than the coverage value\nitself. The fragment coverage was consistent across resolutions\n(Supplementary Figure 7).\nFragment regions showed a higher proportion of intra-\nfragment hydrogen bonds, especially in mainly β folds, where we\nobserved a ∼15% increase compared to non-fragment regions.\nConversely, they showed fewer inter-fragment hydrogen bonds,\nparticularly in the mainly α folds, with a ∼47% reduction.\nSurface accessibility was slightly reduced in fragment regions,\nshowing a ∼5% decrease across most folds except special folds.\nDespite these structural differences, fragment and non-fragment\nregions maintained similar distributions of charge, polarity, and\nsecondary structure elements (see Supplementary Section 10).\nFragment-Based Embeddings Efficiently Capture\nFunctional Similarities\nWe evaluated how well fragment-based representations and\ntraditional sequence- and shape-based methods capture\nfunctional similarities in reduced-dimensional embeddings.\nWe computed a distance matrix for the dataset of 215\nfunctional proteins with each of RMSD (shape), BLOSUM (sequence),\nBagOfNodes (fragment sets), and GraphEditDistance (fragment\ngraphs).\nWe projected the data into lower dimensions using Principal\nCoordinate Analysis 
(PCoA) and calculated the cumulative\nexplained variance across dimensions (Figure 2). Fragment-\nbased representations significantly outperformed traditional\nmetrics, with BagOfNodes and GraphEditDistance preserving\nover 95% and 80% of cumulative variance within 20 dimensions,\nrespectively. In contrast, traditional methods preserved\nsignificantly less information, with BLOSUM capturing less\nthan 60% and RMSD less than 40% of the variance.\nInterestingly, fragment-based distances showed strong\ncorrelation with sequence-based distances despite not directly\nusing sequence information. GraphEditDistance achieved a\nSpearman correlation of 0.91 with BLOSUM distances, while\nBagOfNodes showed a moderate correlation of 0.57. RMSD,\nhowever, showed minimal or slightly negative correlation with\nother metrics.\nDistance | Based On | ARI | NMI | Silhouette | Trustworthiness | Distance Corr. | F1 Score\nRMSD | Shape | 0.0357 | 0.3863 | -0.0334 | 0.9598 | 0.8647 | 0.1957\nBLOSUM | Sequence | 0.0027 | 0.2933 | 0.0248 | 0.9923 | 0.9966 | 0.1640\nGraphEditDistance (ours) | Fragment Graph | 0.0458 | 0.3832 | 0.0766 | 0.9915 | 0.9829 | 0.1985\nBagOfNodes (ours) | Fragment Set | 0.0050 | 0.3455 | 0.8227 | 0.9991 | 0.9998 | 0.1660\nTable 1. Clustering performance comparison of different distance metrics using Gaussian Mixture Models (GMM) on Principal Coordinate\nAnalysis (PCoA) embeddings of the functional protein dataset (215 proteins across 12 functional categories).\n[Figure 2: x-axis, Number of Dimensions (0 to 150); y-axis, Cumulative Variance Explained (0.0 to 1.0); series: BagOfNodes, GraphEditDistance, BLOSUM, RMSD]\nFig. 2. 
Cumulative variance explained by different distance metrics after\nPrincipal Coordinate Analysis (PCoA) projection of a functional protein\ndataset containing 215 proteins across 12 functional categories.\nWe evaluated the clustering performance of Gaussian\nMixture Models (GMMs) and K-Means using PCoA, t-SNE,\nand UMAP embeddings. Table 1 summarizes the results\nfor GMMs on PCoA embeddings. Overall, fragment-based\nrepresentations demonstrated better clustering performance\nacross most metrics.\nNotably, BagOfNodes achieved the highest Silhouette score\n(0.8227), indicating well-separated clusters, along with the\nbest Trustworthiness score (0.9991) and Distance\nCorrelation (0.9998). GraphEditDistance performed best for ARI\n(0.0458), indicating the highest agreement with the true\nfunctional clusters after adjusting for chance, and also achieved the\nhighest F1 score (0.1985). RMSD ranked highest in NMI score\n(0.3863), reflecting better mutual information between cluster\nassignments and true functions, and was second best for ARI\nand F1 Score. BLOSUM ranked second in Silhouette scores and\nTrustworthiness scores (see Supplementary Table 3).\nFragment-Based Search Combines Speed and\nAccuracy\nWe tested the fragment representation for functional protein\nsearches. We assessed both the quality of search retrieval\nand the computational efficiency of fragment distance methods\n(GraphEditDistance and BagOfNodes) against traditional\nsequence (BLOSUM) and shape (RMSD) distance methods.\nUsing the dataset of 215 functionally annotated proteins,\nwe selected individual proteins for each function as queries.\nThen, we calculated the pairwise distances to rank all other\nproteins. 
We assessed the quality of the retrieval using\nNormalized Discounted Cumulative Gain (NDCG) and Area\nUnder the Receiver Operating Characteristic Curve (AUROC),\nto quantify how well the ranking of retrieved proteins matches\nthe expected order based on shared functions.\nAs shown in Supplementary Figures 3 and 4, the AUROC\nand NDCG scores across all methods are generally within\n1 standard deviation of one another. In terms of retrieval\naccuracy, fragment-based methods matched traditional approaches for\nmost functional categories, with particularly strong AUROC\nperformance in identifying DNA+ATP+GTP-binding proteins\n(values >0.8 against 0.75 and 0.56 for RMSD and BLOSUM, respectively).\nRMSD showed an advantage in the NDCG for specific functions,\nespecially in DNA+GTP, RNA+GTP, and RNA+GTP+Metal\nbinding searches.\nThen, we benchmarked the computational efficiency of each\nmethod, measuring query times for 1, 10, and 100 queries\nagainst a database of 100 proteins using 35 cores (Table 2).\nFragment-based representations use far fewer data points\nper protein than traditional methods. Relative to the\nbackbone atom representation (RMSD, 1744.36 data points on average), fragment graphs (16.39)\nand fragment sets (4.87) reduce dimensionality by approximately 99.1%\nand 99.7%, respectively. Even compared to sequence\nrepresentations (290.73 residues on average), we observe significant compression: 94.4%\nfor fragment graphs and 98.3% for fragment sets.\nFor small workloads (1 and 10 queries), sequence search with BLOSUM distance is the\nfastest method once initialization time is added to search\ntime. 
BagOfNodes has the fastest query time,\ncompleting 100 queries in under 0.07 s, while the other methods\nrequired substantially longer: RMSD took about 1717 s,\nGraphEditDistance about 573 s, and BLOSUM about 36.57 s.\nAlthough fragment-based methods have a higher initial cost,\nwhich involves converting protein structures to fragment graphs\n(around 25 s compared to around 5 s for RMSD and BLOSUM), for BagOfNodes this cost is offset by the\nfaster search times: including initialization, 100 queries take about 25.06 s with BagOfNodes,\nagainst 41.10 s with BLOSUM (∼1.64× slower) and 1722.71 s with RMSD (∼68.7× slower).\nMethod | Init. Time (s) | 1 Query Time (s) | 10 Queries (s) | 100 Queries (s) | Memory Req. (data points)\nRMSD | 5.6796 | 22.2474 | 155.3840 | 1717.0282 | 1744.36 ± 1306.90\nBLOSUM | 4.5215 | 0.4238 | 3.4366 | 36.5742 | 290.73 ± 217.82\nGraphEditDistance (ours) | 24.9708 | 5.6529 | 57.1725 | 573.0536 | 16.39 ± 13.57\nBagOfNodes (ours) | 24.9931 | 0.0012 | 0.0075 | 0.0686 | 4.87 ± 2.48\nTable 2. Performance comparison of query methods across different protein representations. Initialization and query times (1, 10, and 100\nqueries) are measured on a database of 100 proteins using 35 cores. Memory requirement is reported as the average number of data points\nrequired to represent a protein: backbone atoms (RMSD), residues (BLOSUM), nodes (Fragment Graphs), or elements (Fragment Sets).\nFunctional Design Recovery with Fragment-Based\nDiffusion\nWe evaluated the ability of fragment-based templates to guide\nthe generation of functional proteins using RFDiffusion. For\neach of the 215 proteins, we generated a template backbone\nof fragments. We used RFDiffusion to fill in the gaps between\nthe fragments and generate 5 different structures. Then, we\nused FoldSeek to search for the closest 10 backbones using\nsequence-independent TMAlign. 
For each design and for the\noriginal backbone, we define the recovery rate as the fraction of\nretrieved backbones annotated with the function of the original backbone\n(see Supplementary Figure 20). We also calculate the relative\nrecovery rate as the recovery rate of the design over the recovery\nrate of the original backbone, so values above 100% indicate designs whose matches share the original function more often than\nthose of the original backbone itself (see Figure 3 and boxplot in\nSupplementary Figure 21).\nIn general, recovery rates vary across\ndifferent functional categories. Metal-binding proteins achieved\nperfect recovery rates, while ATP- and GTP-binding\nproteins also showed consistently high recovery rates. Multi-\nfunctional proteins demonstrated more variable outcomes, with\nDNA+ATP+GTP-binding showing the widest range, varying\nfrom 0% to 300% relative recovery rate and also the lowest\nrecovery rate for the control.\nSingle-function designs generally demonstrated higher\nrecovery rates compared to their multi-functional counterparts.\nAmong dual-function proteins, metal-binding combinations\nproved most successful, with DNA+Metal, GTP+Metal,\nand RNA+Metal showing particularly high recovery rates.\nInterestingly, some triple-function combinations achieved\nsurprisingly high recovery rates, particularly for DNA+ATP+GTP,\nDNA+RNA+Metal, and RNA+GTP+Metal binding.\nFragments and Functional Similarity\nWe identified two DNA-binding proteins with high sequence\nand shape distances but low fragment distance (Figure 4).\nThese proteins are the UvrABC system protein C, involved in\nDNA repair (PDB: 2NRR), and a viral DNA-dependent RNA\npolymerase (PDB: 6RIE). Despite their overall differences, they\nshare fragments 17 (metal-binding), 23 (nucleotide-binding),\nand 35 (structural).\nFragment 17 is a small helix involved in metal binding, while\nfragment 23 is a helix-loop-sheet associated with nucleotide\nbinding. Fragment 35, a sheet-loop-sheet motif, contributes\nto structural integrity. 
Notably, none of these fragments are\nexplicitly classified as DNA-binding, yet their presence captures\nsimilarities in overall fold architecture. This is reflected in\ntheir low fragment distance scores compared to sequence (90%\ndivergence) and shape (RMSD: 6.71 Å) distances.\nAdditionally, Graph Edit Distance (GED: 4%) accounts for\nthe fragment neighborhood, considering factors such as the\nnumber of adjacent unknown fragments and peptide bonds.\n[Figure 3: plot of Relative Recovery Rate (%) against functional Category, covering single-function (METAL, ATP, GTP, DNA, RNA), dual-function, and triple-function binding categories.]\nFig. 3. Relative recovery rates for fragment-constrained backbone\ngeneration. The fragments from each protein in the Protein Function\nDataset (PFD) were used to generate a template for RFDiffusion. The\nrecovery rate is defined as the fraction of generated backbones whose top\n10 structural matches in FoldSeek share the exact Gene Ontology (GO)\nfunction of the original protein. The relative recovery rate compares this\nto the recovery rate of the original protein backbone.\nDiscussion\nIn this study, we demonstrate that fragment-based representations\neffectively coarsen protein structures while preserving essential\nfunctional information. Using just 40 evolutionarily conserved\nfragments, our approach captures important structural\nproperties while reducing the dimensionality by up to\n99% compared to traditional methods. Our\nvectorised fragment-detection algorithm allows fast conversion\nto fragments and achieves an F1 score of 0.85. Furthermore,\nwe successfully use fragments to guide backbone generation towards\npreserving protein functional signatures, with recovery rates\nbetween 40% and 100%. 
These results make fragment-based representations\na promising alternative to traditional sequence- and structure-\nbased approaches for protein analysis, search, and design.\n[Figure 4: structures of 2NRR and 6RIE with labelled fragments (labels shown: 17, 23, 25, and 37). Sequence-based: Seq. Dist. 90%. Shape-based: RMSD 6.71 Å. Fragment-based: GED 4%, BON 22%.]\nFig. 4. Comparison of two structurally distinct DNA-binding proteins using sequence-, shape-, and fragment-based methods. The UvrABC system\nprotein C, involved in DNA repair (PDB: 2NRR), is shown on the left, while a viral DNA-dependent RNA polymerase (PDB: 6RIE) is on the right.\nDespite significant differences in sequence and overall structure, both proteins share common functional fragments (17, 23, and 25, highlighted in\ncolor). Traditional sequence- and structure-based distances indicate high divergence between these proteins. In contrast, fragment-based metrics show\na relatively low distance, suggesting a potentially shared functional role.\nFrom Libraries to Detection: Building Fast and\nRobust Fragment Algorithms\nWe evaluated several metrics for fragment detection, focusing\non shape and sequence. While individual metrics performed\nsimilarly, combining LogPr and RamRMSD more than doubled the\nF1 score from 0.40 to 0.85 [19]. This highlights that torsion-\nangle metrics alone can outperform sequence-based methods for robust\nfragment detection.\nWhile both metrics measure differences in backbone\ntorsion angles ϕ and ψ, they process information differently.\nRamRMSD uses root-mean-square deviation, where squared\ndifferences are averaged, giving more weight to larger\ndeviations. 
In contrast, LogPr applies a logarithmic\ntransformation to normalized angle differences, emphasizing\nsmall deviations and converting them to a probability-like scale.\nThis complementarity allows our algorithm to be sensitive to\nboth large and small deviations in torsion angles.\nThe fragment detection algorithm is written in Python\nand uses AMPAL [32] for parsing protein structures\nand NumPy’s [16] vectorised convolutional operations. We\ndeliberately avoided structural alignment methods based on\nincremental combinatorial extension (CE), which, despite\npotentially improving detection accuracy, proved computationally\nexpensive and occasionally unstable during testing [29, 9].\nA key strength of our software is its flexibility. Users can\neasily swap our library with custom fragments by providing\nfolders of PDB structures with the fragments of interest. This\nextends the software applications beyond the binding functions\npresented here, including enzyme\ndesign, antibody engineering, and de novo structural design.\nThe software is also highly modular,\nso users can extend it to integrate their own\ndistance algorithms. Vectorized operations\nthrough NumPy deliver fast performance while retaining\nthe intuitive syntax that Python offers.\nFragment-based Representations Capture Functional\nInformation\nUsing the fold-balanced PDBench dataset, we found that\nfragment regions capture distinct structural and chemical\nproperties. These regions contained higher proportions of intra-\nfragment hydrogen bonds, particularly in mainly β structures\n(+15%). This is consistent with β folds forming hydrogen\nbonds between adjacent strands [27]. On the other hand, inter-\nfragment hydrogen bonds were significantly lower in fragment\nregions, with a 47% reduction in mainly α structures. 
This\nobservation is consistent with the characteristic hydrogen\nbonding pattern of α-helices, where hydrogen bonds stabilise\nthe helical structure internally (i, i + 4 pattern), reducing\nthe potential for hydrogen bonds with adjacent fragments [27].\nThese results suggest that fragments may capture “self-\ncontained” structural units. This is also supported by the\nreduced surface accessibility in fragment regions, such as the\ncore of the protein, which is more likely to have folded regions,\ncompared to surface-exposed areas like loops [27].\nAdditionally, fragment-based representations outperform\nor match traditional methods in capturing functional\nsimilarities in embedding spaces. Both Fragment Graph\n(GraphEditDistance) and Set (BagOfNodes) metrics consistently\nachieved strong clustering scores, with BagOfNodes reaching\na Silhouette score of 0.82 and GraphEditDistance showing\nthe best overall performance for ARI (0.046) and F1 score\n(0.20). Fragment-based methods also preserved substantially\nmore information at lower dimensions, achieving 95% and\n80% cumulative variance compared to 60% and 40% for\ntraditional sequence- and shape-based methods, respectively, at\n20 dimensions.\nNotably, fragment-based representations capture these\nfunctional patterns without relying on the amino acid\nsequences. Instead, they rely solely on backbone geometry. This\neffectiveness arises because fragments capture functional motifs\nregardless of their sequential arrangement, which is typical\nof other alignment-free analysis tools [34]. 
Sequence-alignment\ntools assume colinearity, meaning that they expect homologous\nresidues to occur in the same order in both sequences [34].\nStructural-alignment tools, such as combinatorial extension\n(CE), mitigate this by breaking the structure into smaller\nregions and reassembling them to complete the alignment [29].\nHowever, these tools may struggle when there is little structural\nhomology between proteins with the same function [15]. In\ncontrast, fragment-based representations maintain performance\nby focusing on the presence of specific functional units.\nFor example, Fragment Sets simply track the presence or\nabsence of functional fragments, without their precise spatial\narrangement.\nPractical Implications of Fragments for Searching\nProtein Databases\nProtein database searches are essential tools for finding\nstructurally or functionally similar proteins. Traditional\nsequence- and shape-based methods can miss important\nrelationships when functional motifs are arranged differently,\nfor example when divided by other structural elements.\nFragment-based representations overcome this limitation while\ndelivering performance equal to or better than traditional\nmethods. Fragments require an initial processing cost to\nconvert structures to graphs or sets. However, this one-time\ncomputation can be done for the entire dataset in advance\nand it is quickly offset by the search speedups. In our\nbenchmarks, Fragment Sets searches using BagOfNodes execute\nin fractions of a second (0.07 s for 100 queries), over 500× faster\nthan sequence-based BLOSUM searches (36.57 s). 
Similarly,\nFragment Graphs searches with GraphEditDistance are ∼3×\nfaster than structure-based searches with RMSD (573 s vs. 1717 s\nfor 100 queries), but are slower than sequence searches.\nThe major advantage, however, is memory efficiency.\nFragment representations reduce the memory requirements by\n90-99% compared to traditional representations. For large-\nscale applications involving millions of proteins, this reduction\ncould enable searches on hardware that would otherwise be\ninsufficient for atom- or residue-level comparisons. These\nefficiency gains, coupled with comparable functional retrieval\naccuracy (as measured by AUROC and NDCG scores), make\nfragments an attractive alternative for the next generation of\nprotein search tools.\nFragment Constraints as Design Guides\nThe current generative tools for protein design are difficult\nto prompt for functional generation. For example, Ingraham\net al. [18] highlight that there is currently no protein design\nsystem that can: (1) sample conditionally under diverse design\nconstraints without retraining for new target functions, (2) scale\nsub-quadratically in computational cost, and (3) integrate\nboth sequence and structure modeling. For instance,\nRFDiffusion, a state-of-the-art diffusion model, lacks explicit\nmechanisms to enforce specific functional constraints in the\ngenerated structures.\nFragment-based constraints address this limitation by\nusing evolutionarily conserved functional units to guide the\ngeneration process. Instead of retraining models with additional\nfunctional labels, our approach leverages evolutionarily\nconserved “building blocks” to steer generative models toward\nfunctionally relevant backbones.\nWe successfully generated functional-looking protein structures\nusing fragments as RFDiffusion templates. 
Using our fragment-\ndetection algorithm, we detected fragments in existing proteins\nand created partial backbones containing only these regions.\nWe then used RFdiffusion to fill the connecting segments\nand used FoldSeek to retrieve the closest proteins available.\nOn a dataset of various functional categories, we successfully\ngenerated structures that maintained the functional signatures\nof the original proteins. Recovery rates varied by functional\ncategory, with metal-binding and ATP-binding proteins\nachieving nearly perfect recovery (∼100%). Our approach\nwas particularly effective for certain multi-functional proteins,\nwith DNA+ATP+GTP and DNA+RNA+Metal combinations\nshowing surprisingly high recovery rates despite their\ncomplexity.\nThese results suggest that diffusion models have implicitly\nlearned about evolutionarily conserved fragments and are able\nto use them for design. Explicitly incorporating fragment\nrepresentations in these models could help reduce the\ncomputational complexity while also providing more direct\nfunctional control to generate specific functional proteins.\nLimitations and Future Work\nOur current implementation uses a curated library of 40\nfragments spanning functions of DNA, RNA, GTP, ATP,\nand Metal binding. Further studies could explore data-driven\napproaches to discover novel fragments with unsupervised\nlearning, potentially expanding the representation capacity\nbeyond the functions described here.\nA major advantage of our approach is its inherent\ninterpretability. Unlike traditional black-box methods, fragment-\nbased representations provide clear functional insights as they\nare associated with specific structural motifs and known\nbiological roles. This interpretability could improve generative\nmodels by making their outputs more functionally interpretable\nand allow more control during the design process. 
Additionally,\nfor protein sequence design, classifying fragment sequences\ninstead of individual amino acids in the backbone could be faster\nand take into account the sequence bias defined by the fragment\nfunction.\nWhile the BagOfNodes approach is very fast, it is less\neffective when multiple instances of a fragment contribute to\ndistinct functional roles. For example, zinc finger proteins\nusually contain three instances of fragment 14, each binding a\npositively-charged zinc ion, and all binding negatively-charged\nDNA (see Figure 1). More instances of fragment 14 may\nindicate binding to multiple DNA strands or different regions\nof the same strand. In these cases, Fragment Graphs with\nGraphEditDistance provide a more nuanced representation by\ncapturing the connectivity and fragment context, despite their\nhigher computational cost.\nOur fragment detection algorithm achieves a good F1\nscore of 0.85, but is potentially sensitive to subtle variations\nin torsion angles, which could lead to misclassification. For\nexample, a large change in the torsion angles of the middle amino\nacid of a fragment would change the backbone angles for one\namino acid only, so the fragment might still be classified similarly. Future\nwork could integrate probabilistic models to quantify detection\nconfidence and provide adjustable sensitivity, allowing users\nto choose the settings based on their design scenario.\nBeyond protein design, fragment-based representations\ncould improve biosecurity applications by identifying potentially\nhazardous structural motifs that might escape detection in\nsequence- or structure-based screening systems. By recognizing\nfunctional fragments, regardless of their arrangement in the\nproteins, our approach could provide an additional layer of\nsafety for protein synthesis services.\nConclusion\nWe introduced a fragment-based protein representation that\nencodes structures using a curated library of 40 evolutionarily\nconserved functional fragments. 
This approach reduces\ndimensionality by up to 99% while preserving functional and\nstructural information.\nOur evaluations demonstrate that fragment-based representations\ncapture functional relationships more effectively than traditional\nmethods in clustering, enable significantly faster database\nsearches with comparable accuracy, and successfully guide\nRFDiffusion to generate backbones with functional signatures.\nUnlike black-box representations, our method provides\ninterpretability by linking fragments to biological functions.\nFragment-based representations offer a scalable and\nbiologically relevant framework for protein design. By\nbalancing efficiency and interpretability, this approach lays the\nfoundation for the next generation of protein design tools.\nCompeting interests\nNo competing interest is declared.\nAuthor contributions statement\nAll authors were involved in the conception of the project\nand the writing of the manuscript. CWW and KS supervised\nthe work. LVC developed the code, ran the experiments, and\nproduced the figures.\nAcknowledgments\nLVC thanks Lorenzo Pisani for his valuable guidance in\ndeveloping the protein fragment library software.\nReferences\n1. Sarah Alamdari, Nitya Thakkar, Rianne Van Den Berg, Neil\nTenenholtz, Robert Strome, Alan M. Moses, Alex X. Lu,\nNicolò Fusi, Ava P. Amini, and Kevin K. Yang. Protein\ngeneration with evolutionary diffusion: Sequence is all you\nneed, September 2023.\n2. Vikram Alva and Andrei N Lupas. From ancestral peptides\nto designed proteins. Current Opinion in Structural\nBiology, 48:103–109, February 2018.\n3. 
Vikram Alva, Johannes Söding, and Andrei N Lupas.\nA vocabulary of ancient peptides at the origin of folded\nproteins. eLife, 4:e09410, December 2015.\n4. Michael Ashburner, Catherine A. Ball, Judith A. Blake,\nDavid Botstein, Heather Butler, J. Michael Cherry, Allan P.\nDavis, Kara Dolinski, Selina S. Dwight, Janan T. Eppig,\nMidori A. Harris, David P. Hill, Laurie Issel-Tarver,\nAndrew Kasarskis, Suzanna Lewis, John C. Matese, Joel E.\nRichardson, Martin Ringwald, Gerald M. Rubin, and Gavin\nSherlock. Gene Ontology: Tool for the unification of biology.\nNature Genetics, 25(1):25–29, May 2000.\n5. H. M. Berman. The Protein Data Bank. Nucleic Acids\nResearch, 28(1):235–242, January 2000.\n6. Alice Bizeul, Thomas Sutter, Alain Ryser, Bernhard\nSchölkopf, Julius von Kügelgen, and Julia E. Vogt. From\nPixels to Components: Eigenvector Masking for Visual\nRepresentation Learning, 2025.\n7. Leonardo V Castorina, Rokas Petrenas, Kartic Subr, and\nChristopher W Wood. PDBench: Evaluating computational\nmethods for protein-sequence design. Bioinformatics,\n39(1):btad027, January 2023.\n8. Leonardo V Castorina, Suleyman Mert Ünal, Kartic Subr,\nand Christopher W Wood. TIMED-Design: Flexible and\naccessible protein sequence design with convolutional neural\nnetworks. Protein Engineering, Design and Selection,\n37:gzae002, January 2024.\n9. Peter J. A. Cock, Tiago Antao, Jeffrey T. Chang, Brad A.\nChapman, Cymon J. Cox, Andrew Dalke, Iddo Friedberg,\nThomas Hamelryck, Frank Kauff, Bartek Wilczynski, and\nMichiel J. L. de Hoon. Biopython: Freely available\nPython tools for computational molecular biology and\nbioinformatics. Bioinformatics, 25(11):1422–1423, June\n2009.\n10. J. Dauparas, I. Anishchenko, N. Bennett, H. Bai, R. J.\nRagotte, L. F. Milles, B. I. M. Wicky, A. Courbet, R. J.\nde Haas, N. Bethel, P. J. Y. Leung, T. F. Huddy, S. Pellock,\nD. Tischer, F. Chan, B. Koepnick, H. Nguyen, A. Kang,\nB. Sankaran, A. K. Bera, N. P. King, and D. 
Baker.\nRobust deep learning based protein sequence design using\nProteinMPNN, June 2022.\n11. Noelia Ferruz, Steffen Schmidt, and Birte Höcker.\nProtGPT2 is a deep unsupervised language model for\nprotein design. Nature Communications, 13(1):4348, July\n2022.\n12. Andreas Fischer, Kaspar Riesen, and Horst Bunke.\nImproved quadratic time approximation of graph edit\ndistance by combining Hausdorff matching and greedy\nassignment. Pattern Recognition Letters, 87:55–62,\nFebruary 2017.\n13. Vincent Frappier, Justin M. Jenson, Jianfu Zhou, Gevorg\nGrigoryan, and Amy E. Keating. Tertiary Structural Motif\nSequence Statistics Enable Facile Prediction and Design\nof Peptides that Bind Anti-apoptotic Bfl-1 and Mcl-1.\nStructure, 27(4):606–617.e5, April 2019.\n14. P. Gainza, F. Sverrisson, F. Monti, E. Rodolà, D. Boscaini,\nM. M. Bronstein, and B. E. Correia. Deciphering\ninteraction fingerprints from protein molecular surfaces\nusing geometric deep learning. Nature Methods, 17(2):184–\n192, February 2020.\n15. Tymor Hamamsy, James T. Morton, Robert Blackwell,\nDaniel Berenberg, Nicholas Carriero, Vladimir Gligorijevic,\nCharlie E. M. Strauss, Julia Koehler Leman, Kyunghyun\nCho, and Richard Bonneau. Protein remote homology\ndetection and structural alignment using deep learning.\nNature Biotechnology, 42(6):975–985, June 2024.\n16. Charles R. Harris, K. Jarrod Millman, Stéfan J.\nVan Der Walt, Ralf Gommers, Pauli Virtanen, David\nCournapeau, Eric Wieser, Julian Taylor, Sebastian Berg,\nNathaniel J. Smith, Robert Kern, Matti Picus, Stephan\nHoyer, Marten H. Van Kerkwijk, Matthew Brett, Allan\nHaldane, Jaime Fernández Del Río, Mark Wiebe, Pearu\nPeterson, Pierre Gérard-Marchant, Kevin Sheppard, Tyler\nReddy, Warren Weckesser, Hameer Abbasi, Christoph\nGohlke, and Travis E. Oliphant. Array programming with\nNumPy. Nature, 585(7825):357–362, September 2020.\n17. S Henikoff and J G Henikoff. Amino acid substitution\nmatrices from protein blocks. 
Proceedings of the National\nAcademy of Sciences, 89(22):10915–10919, November 1992.\n18. John B. Ingraham, Max Baranov, Zak Costello, Karl W.\nBarber, Wujie Wang, Ahmed Ismail, Vincent Frappier,\nDana M. Lord, Christopher Ng-Thow-Hing, Erik R.\nVan Vlack, Shan Tie, Vincent Xue, Sarah C. Cowles,\nAlan Leung, João V. Rodrigues, Claudio L. Morales-Perez,\nAlex M. Ayoub, Robin Green, Katherine Puentes, Frank\nOplinger, Nishant V. Panwar, Fritz Obermeyer, Adam R.\nRoot, Andrew L. Beam, Frank J. Poelwijk, and Gevorg\nGrigoryan. Illuminating protein space with a programmable\ngenerative model. Nature, 623(7989):1070–1078, November\n2023.\n19. Sunghoon Jung, Se-Eun Bae, Insung Ahn, and Hyeon S.\nSon. Protein Backbone Torsion Angle-Based Structure\nComparison and Secondary Structure Database Web Server.\nGenomics & Informatics, 11(3):155, 2013.\n20. Rachel Kolodny, Sergey Nepomnyachiy, Dan S Tawfik, and\nNir Ben-Tal. Bridging Themes: Short Protein Segments\nFound in Different Architectures. Molecular Biology and\nEvolution, 38(6):2191–2208, May 2021.\n21. Craig O Mackenzie and Gevorg Grigoryan. Protein\nstructural motifs in prediction and design. Current Opinion\nin Structural Biology, 44:161–167, June 2017.\n22. Fabrizio Mastrolorito, Fulvio Ciriaco, Maria Vittoria Togo,\nNicola Gambacorta, Daniela Trisciuzzi, Cosimo Damiano\nAltomare, Nicola Amoroso, Francesca Grisoni, and Orazio\nNicolotti. fragSMILES as a chemical string notation\nfor advanced fragment and chirality representation.\nCommunications Chemistry, 8(1):26, January 2025.\n23. Geraldene Munsamy, Ramiro Illanes-Vicioso, Silvia\nFuncillo, Ioanna T. 
Nakou, Sebastian Lindner, Gavin Ayres,\nLesley S. Sheehan, Steven Moss, Ulrich Eckhard, Philipp\nLorenz, and Noelia Ferruz. Conditional language models\nenable the efficient design of proficient enzymes, May 2024.\n24. Yifei Qi and John Z. H. Zhang. DenseCPD: Improving the\nAccuracy of Neural-Network-Based Computational Protein\nSequence Design with DenseNet. Journal of Chemical\nInformation and Modeling, 60(3):1245–1252, March 2020.\n25. Jane S. Richardson. Early ribbon drawings of proteins.\nNature Structural Biology, 7(8):624–625, August 2000.\n26. Alexander Rives, Joshua Meier, Tom Sercu, Siddharth\nGoyal, Zeming Lin, Jason Liu, Demi Guo, Myle Ott,\nC. Lawrence Zitnick, Jerry Ma, and Rob Fergus. Biological\nstructure and function emerge from scaling unsupervised\nlearning to 250 million protein sequences. Proceedings of\nthe National Academy of Sciences of the United States of\nAmerica, 2019.\n27. Georg E. Schulz and R. Heiner Schirmer. Principles of\nProtein Structure. Springer Advanced Texts in Chemistry.\nSpringer New York, New York, NY, 1979.\n28. Fabian Sesterhenn, Che Yang, Jaume Bonet, Johannes T.\nCramer, Xiaolin Wen, Yimeng Wang, Chi-I Chiang,\nLuciano A. Abriata, Iga Kucharska, Giacomo Castoro,\nSabrina S. Vollers, Marie Galloux, Elie Dheilly, Stéphane\nRosset, Patricia Corthésy, Sandrine Georgeon, Mélanie\nVillard, Charles-Adrien Richard, Delphyne Descamps,\nTeresa Delgado, Elisa Oricchio, Marie-Anne Rameix-Welti,\nVicente Más, Sean Ervin, Jean-François Eléouët, Sabine\nRiffault, John T. Bates, Jean-Philippe Julien, Yuxing Li,\nTheodore Jardetzky, Thomas Krey, and Bruno E. Correia.\nDe novo protein design enables the precise induction of\nRSV-neutralizing antibodies. Science, 368(6492):eaay5051,\nMay 2020.\n29. Ilya N Shindyalov and Philip E Bourne. Protein structure\nalignment by incremental combinatorial extension (CE) of\nthe optimal path. Protein Engineering, 11(9):739–747,\n1998.\n30. Michel Van Kempen, Stephanie S. 
Kim, Charlotte\nTumescheit, Milot Mirdita, Jeongjae Lee, Cameron L. M.\nGilchrist, Johannes Söding, and Martin Steinegger. Fast\nand accurate protein structure search with Foldseek. Nature\nBiotechnology, 42(2):243–246, February 2024.\n31. Joseph L. Watson, David Juergens, Nathaniel R. Bennett,\nBrian L. Trippe, Jason Yim, Helen E. Eisenach, Woody\nAhern, Andrew J. Borst, Robert J. Ragotte, Lukas F.\nMilles, Basile I. M. Wicky, Nikita Hanikel, Samuel J.\nPellock, Alexis Courbet, William Sheffler, Jue Wang,\nPreetham Venkatesh, Isaac Sappington, Susana Vázquez\nTorres, Anna Lauko, Valentin De Bortoli, Emile Mathieu,\nSergey Ovchinnikov, Regina Barzilay, Tommi S. Jaakkola,\nFrank DiMaio, Minkyung Baek, and David Baker. De novo\ndesign of protein structure and function with RFdiffusion.\nNature, 620(7976):1089–1100, August 2023.\n32. Christopher W Wood, Jack W Heal, Andrew R Thomson,\nGail J Bartlett, Amaurys Á Ibarra, R Leo Brady, Richard B\nSessions, and Derek N Woolfson. ISAMBARD: An\nopen-source computational environment for biomolecular\nanalysis, modelling and design. Bioinformatics,\n33(19):3043–3050, October 2017.\n33. Andy Hsien-Wei Yeh, Christoffer Norn, Yakov Kipnis, Doug\nTischer, Samuel J. Pellock, Declan Evans, Pengchen Ma,\nGyu Rie Lee, Jason Z. Zhang, Ivan Anishchenko, Brian\nCoventry, Longxing Cao, Justas Dauparas, Samer Halabiya,\nMichelle DeWitt, Lauren Carter, K. N. Houk, and David\nBaker. De novo design of luciferases using deep learning.\nNature, 614(7949):774–780, February 2023.\n34. Andrzej Zielezinski, Susana Vinga, Jonas Almeida,\nand Wojciech M. Karlowski. Alignment-free sequence\ncomparison: Benefits, applications, and tools. Genome\nBiology, 18(1):186, December 2017.","source_license":"CC-BY-4.0","license_restricted":false}