Benchmarking protein sequence and structure search methods for remote homology detection

doi:10.21203/rs.3.rs-8796067/v1

2026 · doi:10.21203/rs.3.rs-8796067/v1

preprint OA: closed

AI-generated deep summary by claude@2026-06, 2026-06-24 · read from full text

This preprint presents a unified benchmark to evaluate 13 protein sequence and structure search methods for remote homology detection, using consistent datasets, metrics, and search protocols across five biologically relevant scenarios (fold-level similarity, functional consistency, multi-domain architecture, intrinsic disorder, and predicted-structure confidence). The authors find pronounced, context-dependent differences: structure alignment methods perform best for fold/geometric similarity, while representation-based approaches show advantages for functional similarity under low sequence identity and robustness to predicted structure variability, but all methods have limited effectiveness for intrinsically disordered proteins. They also analyze how predicted structure confidence (pLDDT) affects performance and benchmark computational efficiency under typical use conditions, with an explicit caveat that the methods are evaluated within the benchmark’s chosen scenarios and protocols rather than across all possible biological settings. The paper does not explicitly discuss endometriosis or adenomyosis; it was included in the corpus via a keyword match in the upstream search index.

Read from the paper's body, not the abstract. Not a substitute for reading the paper. No clinical advice.

Full text 198,175 characters · extracted from preprint-html

Benchmarking protein sequence and structure search methods for remote homology detection | Research Square

Research Article

Yuan Liu, Yingquan Zhou, Yan Huang, Hongyi Xin, Xiaoyong Pan, and 1 more

This is a preprint; it has not been peer reviewed by a journal. https://doi.org/10.21203/rs.3.rs-8796067/v1 This work is licensed under a CC BY 4.0 License. Status: Under Review. Version 1 posted 9

Abstract

Background

Protein sequence and structure similarity-based search is an important task, which underpins protein annotation, evolutionary analysis, large-scale functional inference, and the exploration of the protein “dark space”. The rapid growth of sequence and predicted structure databases has spurred diverse search methods, yet their evaluation remains limited to fold-level similarity and inconsistent benchmarking protocols.

Results

We present a unified benchmark for protein sequence and structure search.
Using this framework, we evaluate 13 representative methods spanning sequence alignment, structure alignment, and representation-based approaches across multiple biologically relevant scenarios. Our results show pronounced and context-dependent differences among methods. Structure alignment methods excel at detecting fold-level and geometric similarity, while representation-based searching approaches show advantages in capturing functional similarity under low sequence identity and robustness to predicted structures. Notably, all evaluated methods show limited effectiveness on intrinsically disordered proteins.

Conclusions

This benchmark establishes a standardized framework for evaluating protein similarity search methods, providing a practical resource for method selection and a foundation for the development of next-generation approaches capable of addressing diverse homology search challenges.

Keywords: Protein similarity search; Benchmark; Structure alignment; Representation-based searching

Introduction

Quantifying similarity between proteins underlies a wide range of problems in computational biology[1, 2]. Protein sequence and structure similarity search is a core operation in biological research, supporting protein annotation[3], evolutionary analysis[4], and large-scale functional inference[5, 6]. Over the past decade, this task has undergone rapid expansion driven primarily by data growth. Protein sequence databases have continued to increase exponentially[7–9], and recent advances in structure prediction[10–14] have made high-quality protein structures available at the proteome scale[15–18]. Predicted structures now complement experimental data for a vast number of proteins, substantially enlarging both the size and the diversity of searchable protein databases[19].
The rapid growth of protein sequence and structure databases has placed increasing demands on both the efficiency and accuracy of protein similarity search, motivating the development of a broad range of computational approaches. Currently, dozens of methods are available, spanning both classical alignment and more recent deep learning techniques, and their properties can be described from several complementary perspectives. First, protein search methods differ in their search modality, operating either on protein sequences or on three-dimensional structures. Sequence-based homology search tools, such as BLAST[20] and related approaches[21], remain widely used due to their scalability and effectiveness in identifying members of protein families[22]. Structure-based methods, including classical alignment approaches such as TMalign[23] and DALI[24], leverage spatial information to detect structural similarity and remote homology that may not be apparent from sequences alone[25]. Second, the methods vary in their matching granularity[26]. Some approaches emphasize global comparison[27], aiming to assess overall similarity between entire proteins, while others support local or fragment-level matching, enabling the identification of shared substructures or partial similarities[28]. This distinction is particularly relevant for proteins with complex or multi-domain architectures, where biologically meaningful similarity may be confined to specific regions[29]. Third, protein search methods can be distinguished by the space in which similarity is computed. Classical approaches typically perform direct comparison in the original observed sequence or structure space, relying on alignments to quantify similarity[30, 31].
In contrast, more recent representation-based methods encode proteins into latent embeddings and perform similarity search in a learned feature space, enabling fast retrieval over large databases and improved robustness to structural variability[32–37].

Beyond these aspects, many modern protein search methods adopt hybrid designs that combine multiple paradigms rather than fitting cleanly into a single category. For example, Foldseek[38] integrates representation learning with alignment by converting protein structures into discrete 3Di sequences and applying fast sequence alignment. Other approaches, such as PLMSearch[39] and TMvec[40], operate on sequence inputs but are trained with structure-based similarity supervision, thereby implicitly encoding structural information into sequence-derived representations. Representation-based methods further differ in the type of information they encode, ranging from structure representations (e.g., GraSR[41]), to sequence representations (e.g., DHR[42]), to joint sequence-structure representations as used in FoldExplorer[43]. Across these characteristics, different combinations of design choices have produced a diverse methodological landscape, with different classes of methods tailored to the trade-offs between sensitivity, robustness, and computational efficiency when searching large protein databases.

Despite this methodological diversity, the systematic evaluation of protein search methods has lagged behind their development[44]. Existing benchmarks are often limited in scope, focusing on curated datasets dominated by single-domain, well-structured proteins and evaluating methods under heterogeneous experimental settings[45, 46]. As a result, reported performance is difficult to compare across settings, and it remains unclear how different classes of methods perform relative to one another when assessed under consistent conditions.
The lack of unified benchmarking complicates both method development and practical tool selection. Moreover, most benchmarks emphasize homology or fold-level similarity, while other biologically relevant aspects of protein search remain underexplored[47, 48]. In particular, large-scale assessments of functional consistency in search results are scarce, despite the fact that functional relatedness does not always align with sequence or structural similarity. Similarly, proteins with complex architectures, such as multi-domain proteins, or those enriched in intrinsically disordered regions (IDRs)[49] are rarely examined, even though such proteins are abundant in real proteomes[50]. In addition, predicted structures are inherently heterogeneous in quality, yet the impact of structural confidence on different search strategies has not been systematically analyzed[51].

To address these gaps, we present a unified benchmark for protein sequence and structure search approaches. We systematically compare representative sequence-based, structure-based, and representation-based methods on the CATH[52] dataset using a unified framework with consistent benchmark datasets, evaluation metrics, and search protocols. To assess biological relevance, we further evaluate performance on a dataset emphasizing the functional consistency of retrieved results. We explicitly examine the behavior of different methods on multi-domain proteins and contrast it with the corresponding single-domain proteins. In addition, we evaluate protein search on intrinsically disordered proteins (IDPs), highlighting a challenging scenario in which current methods consistently underperform. We also analyze the impact of predicted structure confidence, measured by pLDDT[53], on search performance. Finally, we benchmark the computational efficiency of individual methods under typical usage conditions.
This benchmark offers a comprehensive and realistic evaluation of protein search methods against rapidly expanding, prediction-driven protein databases. By facilitating fair comparisons across paradigms and highlighting method-specific strengths and limitations in diverse biological contexts, it provides practical guidance for tool selection and lays a foundation for the development of protein search approaches. To the best of our knowledge, this work presents the first unified benchmark that systematically evaluates protein search methods across a diverse set of biologically challenging scenarios, including intrinsically disordered proteins, multi-domain architectures, and varying levels of predicted structure confidence. By explicitly examining how search performance depends on disorder content and pLDDT-derived structure reliability, our benchmark extends beyond traditional structure- or sequence-centric evaluations and reflects the practical conditions encountered in modern, prediction-driven protein databases.

Results

Overview of benchmark design and evaluated methods

To systematically assess protein sequence and structure search methods under realistic and comparable conditions, we designed a unified benchmark encompassing five complementary biological scenarios. These scenarios capture key challenges encountered in contemporary protein search, including fold-level structural similarity, functional consistency, architectural complexity, intrinsic disorder, and the quality of predicted structural data. We evaluated representative methods spanning different search paradigms, including sequence-based, structure-based alignment, and representation-based approaches (Table 1). All methods were assessed using consistent datasets, evaluation metrics, and experimental protocols within each scenario, enabling fair comparison across paradigms.
Some of the scenarios are challenging, such as searching for similar multi-domain proteins and for similar intrinsically disordered proteins, which do not have stable structures. An overview of the benchmark design and evaluated scenarios is shown in Fig. 1.

Table 1. Protein search tools used in the benchmark (a)

Category             | Tool             | Source Repository
-------------------- | ---------------- | -------------------------------------------------
Structure alignment  | GTalign[54]      | https://github.com/minmarg/gtalign_alpha
Structure alignment  | TMalign[23]      | https://www.aideepmed.com/TM-align/
Structure alignment  | Dali[24]         | http://ekhidna2.biocenter.helsinki.fi/dali
Structure alignment  | Foldseek[38]     | https://github.com/steineggerlab/foldseek
Representation-based | GraSR[41]        | https://github.com/chunqiux/GraSR
Representation-based | TMvec[40]        | https://github.com/tymor22/tm-vec
Representation-based | PLMSearch[39]    | https://dmiip.sjtu.edu.cn/PLMSearch
Representation-based | DHR[42]          | https://github.com/ml4bio/Dense-Homolog-Retrieval
Representation-based | FoldExplorer[43] | https://github.com/YuanLiu-SJTU/FoldExplorer
Sequence alignment   | BLAST[20]        | https://blast.ncbi.nlm.nih.gov/Blast.cgi
Sequence alignment   | Diamond[55]      | https://github.com/bbuchfink/diamond
Sequence alignment   | jackhmmer[56]    | http://hmmer.org/download.html
Sequence alignment   | MMseqs[57]       | https://github.com/soedinglab/MMseqs2

(a) Detailed information about the tools’ version and parameter values used in this study is provided in the “Methods” section.

Fold classification performance on the CATH dataset

Fold classification performance was evaluated on the CATH-S20 dataset, which organizes protein structures into a four-level hierarchy of increasing specificity: Class, Architecture, Topology, and Homologous superfamily. Performance was quantified using the sensitivity up to the first false positive (detailed information is provided in the “Methods” section), a stringent metric reflecting the ability of a method to retrieve true structural neighbors before any incorrect match is introduced. Results are shown in Fig. 2, and the corresponding area under the curve (AUC) values are reported in Supplementary Table S1.
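The metric named above can be illustrated with a minimal sketch. The exact definition used in the paper is in its "Methods" section, which is not reproduced here, so the following rests on assumptions: each query has a ranked hit list labelled true or false at a given CATH level, sensitivity is the fraction of that query's true matches ranked before the first false positive, and the AUC is taken over per-query sensitivities plotted against the fraction of queries (which reduces to their mean). The function names are hypothetical.

```python
from typing import Sequence


def sensitivity_to_first_fp(ranked_is_tp: Sequence[bool], n_true: int) -> float:
    """Fraction of a query's true matches that rank before the first false positive.

    ranked_is_tp: hit labels in ranked order (True = same CATH group as the query).
    n_true: total number of true matches for this query in the target set.
    """
    if n_true <= 0:
        return 0.0
    found = 0
    for is_tp in ranked_is_tp:
        if not is_tp:
            break  # stop counting at the first false positive
        found += 1
    return found / n_true


def auc_over_queries(sensitivities: Sequence[float]) -> float:
    """Area under the curve of per-query sensitivity versus fraction of queries.

    Assumption: queries are sorted by sensitivity and each occupies an equal
    step of width 1/n on the x-axis, so the area equals the mean sensitivity.
    """
    if not sensitivities:
        return 0.0
    return sum(sensitivities) / len(sensitivities)
```

Under these assumptions, a method's relative performance at one CATH level would be the ratio of its AUC to that of the best structure alignment method.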
Across the four CATH levels, structure-based alignment methods achieved the highest sensitivity, demonstrating strong capability in capturing structural similarity even at fine-grained classification levels. Representation-based methods generally ranked second, outperforming sequence-based approaches while remaining less sensitive than explicit structural alignment. Specifically, we quantify the sensitivity of each method using the area under the sensitivity up to the first false positive curve. Under this metric, among the five representation-based approaches, GraSR achieved the highest sensitivity at the Class and Architecture levels, whereas FoldExplorer performed best at the Topology and Homologous Superfamily levels. The best-performing representation-based methods achieved sensitivities of 51.8%, 71.7%, 79.2%, and 88.1% of the performance of structure-based alignment methods across the four hierarchical CATH levels, respectively. Sensitivity consistently increased at deeper hierarchical levels and gradually approached the performance of structure-based alignment methods. Foldseek, which combines representation learning with alignment-based matching, also demonstrated consistently high sensitivity across all hierarchical levels. In contrast, the best sequence-based method, BLAST, achieved relative sensitivities of only 1.8%, 7.8%, 18.4%, and 27.5% compared with the best-performing structure alignment methods, underscoring the limited capability of sequence alignment methods in detecting remote homology relationships. Although these trends were consistent at the level of method categories, differences among individual methods were observed. For example, the structure-based representation method GraSR exhibited slightly lower sensitivity than large language model-based sequence representations such as PLMSearch at the Topology and Homologous superfamily levels, while outperforming other representation-based methods at the Class and Architecture levels. 
This pattern reflects a general distinction between structure-based and sequence-based representations: structural representations tend to enhance discrimination at coarse structural levels, while sequence-based representations capture homology-driven similarity at finer hierarchical resolutions. Overall, our results show that structure-based alignment methods remain the most effective approach for fold classification in high-quality structure databases such as CATH. Representation-based approaches nevertheless offer a competitive and practical alternative in terms of accuracy and efficiency, especially for large-scale structure databases, where alignment-based comparison incurs a high time cost.

Assessing the functional consistency in protein search

Structural or sequence similarity does not necessarily imply functional relatedness[58]. To assess the biological relevance of protein search results, we evaluated functional consistency using a SwissProt dataset with sequence identity filtered to less than 20%, thereby reducing redundancy and emphasizing remote functional relationships (Fig. 3). Functional consistency was quantified by measuring the overlap of Gene Ontology[59] Molecular Function (GO-MF) annotations between query proteins and their top 10 retrieved hits. In contrast to the fold-level results observed on CATH-S20, methods based on global structural alignment, such as TMalign and GTalign, did not achieve high functional consistency in this evaluation. Representation-based methods such as FoldExplorer, which also leverage global structural representations but additionally incorporate large protein language models, consistently demonstrated improved functional coherence among top-ranked results, achieving a 4.7% increase over TMalign. This improvement was observed regardless of whether ESM[12, 60] or ProtTrans[61] embeddings were used.
By contrast, using sequence-based alignment alone did not produce such gains, highlighting the key role of protein language models in capturing functional signals[62]. Methods based on local structural alignment, including Foldseek, achieved an even greater improvement of 7.0% over TMalign, suggesting that capturing local structural motifs is particularly effective for identifying functionally related proteins under conditions of low sequence identity.

The comparison between sequence-based and structure-based approaches under this evaluation highlights a fundamental difference in how similarity relates to functional consistency. Because all protein pairs in this dataset share less than 20% sequence identity, sequence alignment operates in the classical “twilight zone”[63], where low sequence similarity no longer reliably reflects functional relatedness. In contrast, protein structures are more conserved than sequences during evolution, particularly around functional regions such as active sites and interaction interfaces. As a result, structure-based search methods, especially those capturing local structural similarity, are better suited for identifying functionally related proteins under remote homology conditions, directly addressing the challenge of functional consistency in protein search.

Functional consistency represents an independent and biologically meaningful dimension of protein relatedness that is not fully captured by sequence or structural similarity alone, and therefore requires explicit evaluation in protein search benchmarks. Nevertheless, our evaluation indicates that under remote homology conditions, functional consistency is more closely associated with structural similarity than with sequence similarity.
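As an illustration of the top-10 GO-MF overlap described in this section, here is a minimal sketch. The text does not state which overlap measure was used, so the Jaccard index below is an assumption, as are the function names and the choice to average over the retrieved hits.

```python
from typing import List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard overlap between two sets of GO term identifiers."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def functional_consistency(query_terms: Set[str],
                           hit_terms: List[Set[str]],
                           k: int = 10) -> float:
    """Mean GO-MF overlap between a query and its top-k ranked hits.

    hit_terms must be ordered by search rank; only the first k are scored.
    Jaccard is an illustrative stand-in for the paper's overlap measure.
    """
    top = hit_terms[:k]
    if not top:
        return 0.0
    return sum(jaccard(query_terms, terms) for terms in top) / len(top)
```

A usage example: a query annotated with `{"GO:0003824", "GO:0005515"}` whose three hits carry both terms, one term, and no terms scores (1 + 0.5 + 0) / 3 = 0.5.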
Search performance on multi-domain proteins with diverse architectures

Multi-domain proteins are widespread in proteomes, accounting for over 50% of known proteins and more than 70% in eukaryotic organisms, and typically exhibit complex architectures composed of multiple structural units[64]. In such cases, biologically meaningful similarity is frequently confined to individual domains or conserved substructures, while global folds, domain arrangements, and inter-domain orientations may differ substantially. These characteristics make multi-domain protein search fundamentally more challenging than single-domain fold recognition and limit the effectiveness of purely global similarity measures.

To systematically assess how different search approaches handle this architectural complexity, we compared search performance on full-length multi-domain proteins with that obtained after decomposing the same proteins into their constituent single-domain units based on annotations from The Encyclopedia of Domains (TED)[65]. Search performance was evaluated using the local Distance Difference Test (lDDT)[66], a metric that quantifies local structural similarity without requiring explicit structural superposition[38]. The lDDT metric used in this study follows a slightly modified definition relative to the original formulation (details are provided in the “Methods” section).

The calculation of lDDT relies on residue-level correspondences defined by an alignment. For methods without an inherent alignment output, alignments were obtained using structure-based alignment tools prior to lDDT computation. Specifically, residue-level alignments were generated using four established structure alignment methods: Dali, TMalign, GTalign, and Foldseek.
Across all evaluated query-hit pairs, Dali consistently produced the highest lDDT scores, followed by TMalign and GTalign with comparable performance, while Foldseek yielded lower lDDT scores, reflecting its design focus on fast structural retrieval rather than high-precision residue-level alignment (Supplementary Figure S1). To ensure a method-agnostic evaluation of structural consistency, independent of the alignment strategy used, we therefore adopted the maximum lDDT score obtained across the four alignment methods as the final lDDT score for each query-hit pair.

For full-length multi-domain proteins, structure-based alignment methods generally achieved higher lDDT scores than representation-based approaches, with local structural alignment methods, such as Dali and Foldseek, performing best. Global structural alignment methods, while slightly weaker than local alignment, still consistently outperformed representation-based methods, indicating that representation-based approaches capture local structural features less precisely when evaluating multi-domain architectures.

Interestingly, as shown in Fig. 4, after decomposing multi-domain proteins into their constituent single-domain units, the performance of representation-based methods improved substantially. Among the five methods we tested, the three sequence-based representation approaches exhibited modest gains of 3–10%. The largest improvements were observed for the structure-informed representation methods GraSR and FoldExplorer, which achieved increases of 32.1% and 16.1%, respectively. Notably, the hybrid sequence-structure method FoldExplorer surpassed global alignment methods like TMalign and achieved lDDT scores comparable to those of local structural alignment methods. These results suggest that domain decomposition can reveal strengths of representation-based methods that are otherwise masked in full-length multi-domain comparisons.
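The alignment-dependent lDDT scoring and the maximum-over-aligners rule can be sketched as follows. This is not the paper's modified lDDT (that definition is in its "Methods" section). It is a simplified Cα-only variant that assumes the commonly used 15 Å inclusion radius and 0.5, 1, 2, and 4 Å tolerance thresholds, and it normalizes over aligned residue pairs only. All names are hypothetical.

```python
import math
from typing import Dict, Sequence, Tuple

Coord = Tuple[float, float, float]


def lddt_ca(query: Sequence[Coord], target: Sequence[Coord],
            alignment: Sequence[Tuple[int, int]],
            radius: float = 15.0,
            thresholds: Sequence[float] = (0.5, 1.0, 2.0, 4.0)) -> float:
    """Simplified C-alpha lDDT over aligned residue pairs.

    alignment holds (query_index, target_index) pairs. For every two aligned
    residues closer than `radius` in the query, count how many tolerance
    thresholds the corresponding target distance satisfies.
    """
    preserved = 0
    total = 0
    for a in range(len(alignment)):
        qi, ti = alignment[a]
        for b in range(a + 1, len(alignment)):
            qj, tj = alignment[b]
            d_query = math.dist(query[qi], query[qj])
            if d_query >= radius:
                continue  # outside the local neighbourhood
            d_target = math.dist(target[ti], target[tj])
            diff = abs(d_query - d_target)
            preserved += sum(diff < t for t in thresholds)
            total += len(thresholds)
    return preserved / total if total else 0.0


def final_lddt(scores_by_aligner: Dict[str, float]) -> float:
    """Take the maximum lDDT across aligners for one query-hit pair."""
    return max(scores_by_aligner.values())
```

Because the score depends only on intra-structure distances, no superposition is needed. The alignment only decides which residues are compared, which is why the choice of aligner affects the score.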
The performance shifts highlight that the effectiveness of different search paradigms depends strongly on the evaluation context. Local alignment-based methods demonstrate strong robustness to architectural complexity, maintaining high functional consistency even for multi-domain proteins, whereas representation-based methods benefit substantially from domain-level resolution, particularly when incorporating structural information.

Search performance on intrinsically disordered proteins

Intrinsically disordered proteins (IDPs) and regions (IDRs) play critical roles in cellular regulation, signaling, and molecular recognition, and constitute a substantial fraction of proteomes. Computational analyses suggest that approximately one-third of eukaryotic proteins are predicted to contain long intrinsically disordered regions, and that in the human proteome an estimated 37–50% of all amino acid residues are inferred to be intrinsically disordered[10, 67]. Unlike well-folded proteins, IDPs lack a stable tertiary structure under physiological conditions and often exert their functions through transient, context-dependent interactions. As a result, identifying functionally related IDPs remains an important and challenging task for protein search methods.

To systematically evaluate protein search performance in this scenario, we constructed a benchmark set based on experimentally validated IDPs from DisProt[68]. A total of 534 IDP sequences were used as queries, with pairwise sequence identity below 20%. These queries were searched against a SwissProt human dataset consisting of 6,955 proteins, also filtered to 20% sequence identity, ensuring reduced redundancy within and between the query and target sets. Functional relevance was assessed using GO semantic similarity computed for the top 10 query-target pairs.
Across all evaluated methods, the functional consistency on the IDP dataset was uniformly low, with a median GO semantic similarity below 0.4, only marginally exceeding random expectations. This behavior was observed consistently across alignment-based, representation-based, and hybrid search paradigms, with substantially diminished performance differences compared to evaluations on structured proteins. Notably, no method demonstrated a clear advantage in distinguishing functionally related IDPs from unrelated targets.

This low functional consistency cannot be attributed to dataset quality or the absence of meaningful biological relationships. Across all proteins containing disordered regions, ranging from entirely disordered proteins to proteins with only partial disorder, existing methods consistently underestimate functional similarity, with predicted values roughly 50% lower than the ground-truth functional similarity (around 0.8). This indicates that relevant functional signals are present but are not effectively captured by current protein search approaches. The lack of stable tertiary structure limits the applicability of structure-based alignment (Supplementary Figure S2), while the high sequence variability and weak evolutionary constraints of IDPs reduce the effectiveness of sequence-based and representation-based similarity measurements. Moreover, IDP function is often mediated by short linear motifs and context-specific interactions[69], which are difficult to capture using global similarity metrics. These results indicate that accurate remote homology search for IDPs remains a significant challenge for current protein search approaches.

Our benchmark results demonstrate that IDPs represent a particularly challenging scenario for protein similarity search.
The uniformly low performance across paradigms highlights a shared limitation of current methods and underscores the need for evaluation frameworks and search strategies specifically tailored to the unique biological properties of disordered proteins.

Impact of predicted structure quality on the search performance

Recent advances in protein structure prediction, exemplified by AlphaFold and related methods, have enabled the generation of large-scale, proteome-wide structural databases. As a result, a typical and increasingly important application of protein structure search is to query databases composed predominantly of predicted rather than experimentally determined structures. In this context, understanding how prediction uncertainty influences search performance becomes critical.

Predicted protein structures exhibit substantial variation in quality, with local confidence commonly quantified by pLDDT scores. To examine how the uncertainty of predicted models affects protein search, we stratified predicted structures into confidence intervals based on pLDDT and systematically evaluated search performance across these groups (Fig. 6). Decreasing structural confidence was consistently associated with degraded search performance across all evaluated structure-based methods. Relative to accurately predicted structures (pLDDT ≥ 90), structure alignment-based approaches, including GTalign, TMalign, and Dali, showed substantial reductions in the mean GO semantic similarity of their top-10 hits when applied to low-confidence predicted structures (pLDDT < 70), with decreases of 36.4%, 36.1%, and 22.9%, respectively (p < 0.001). Methods that fully or partially incorporate structural representations, namely GraSR, Foldseek, and FoldExplorer, experienced more moderate declines of 22.7%, 6.9%, and 8.7%, respectively.
In contrast, sequence-derived representation methods, including TMvec, PLMSearch, and DHR, exhibited only minor decreases of 6.6%, 7.0%, and 5.2%. Serving as a control, traditional sequence alignment methods displayed no measurable degradation across pLDDT intervals, indicating that the observed performance losses are specifically attributable to structural prediction uncertainty rather than dataset effects. As illustrated in Fig. 6, methods leveraging large language model-based sequence representations, such as TMvec, PLMSearch, DHR, and FoldExplorer, consistently outperformed both structure alignment-based and structure representation-based approaches in the moderate-confidence (70 ≤ pLDDT < 90) and low-confidence (pLDDT < 70) regimes. These observations reveal that the sensitivity to prediction uncertainty differed markedly between search paradigms. Structure alignment-based methods showed pronounced performance drops in the presence of low-confidence regions, reflecting their reliance on accurate local geometry. In contrast, representation-based approaches were more tolerant to moderate levels of structural uncertainty, and although their performance also deteriorated when large fractions of the structure exhibited low confidence, they generally remained superior to traditional sequence alignment methods. These results suggest that predicted structures do not provide a uniformly reliable substrate for protein search, and their utility depends critically on both structural confidence and the underlying search paradigm. While moderate prediction uncertainty can be tolerated by representation-based methods, extensive low-confidence regions fundamentally compromise the informativeness of structural features for retrieval. 
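The stratified comparison in this section can be sketched in a few lines. The three bins follow the intervals quoted in the text (pLDDT ≥ 90, 70 ≤ pLDDT < 90, pLDDT < 70). It is an assumption here that each structure is binned by a single per-structure pLDDT value such as its mean, and the function names and record layout are hypothetical.

```python
from statistics import mean
from typing import Dict, Iterable, List, Tuple


def plddt_bin(plddt: float) -> str:
    """Assign a structure to a confidence bin using the thresholds in the text."""
    if plddt >= 90:
        return "high"
    if plddt >= 70:
        return "moderate"
    return "low"


def mean_score_by_bin(records: Iterable[Tuple[float, float]]) -> Dict[str, float]:
    """Average a per-query score within each pLDDT bin.

    records: (per-structure pLDDT, mean GO semantic similarity of top-10 hits).
    """
    groups: Dict[str, List[float]] = {}
    for plddt, score in records:
        groups.setdefault(plddt_bin(plddt), []).append(score)
    return {name: mean(values) for name, values in groups.items()}


def relative_decline(high: float, low: float) -> float:
    """Percentage drop of the low-confidence mean relative to the high-confidence mean."""
    return (high - low) / high * 100.0
```

With per-bin means in hand, `relative_decline(by_bin["high"], by_bin["low"])` gives a figure comparable in form to the percentage decreases reported above.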
Consequently, the effectiveness of structure-based search over large predicted structure databases is inherently constrained by the model reliability, underscoring the importance of integrating confidence awareness into search design and result interpretation.

Computational efficiency

Computational efficiency is a critical practical consideration for large-scale protein search, particularly for applications involving comprehensive databases such as the AlphaFold Database (AFDB, ~241 million structures). Using the same dataset as the functional search benchmark (7,735 targets), we compared per-query runtime across different protein search paradigms under commonly used execution modes (Fig. 7). For traditional sequence and structure alignment methods, we measured runtimes on a single CPU core (CPU-1) and on 16 CPU cores (CPU-16) of an AMD EPYC 9654 processor. Representation-based methods, which are deep learning-based, support GPU acceleration and allow pre-building the target database. Therefore, we evaluated them in two modes: using a GPU alone and using a GPU with a pre-built target database (GPU + DB). The GPU used was an NVIDIA GeForce RTX 4090 D.

Sequence alignment methods are consistently the fastest, enabling high-throughput searches even with modest computational resources. In fact, sequence-based methods can be 5–6 orders of magnitude faster than traditional structure alignment methods, which are substantially more expensive due to the intrinsic cost of explicit structural superposition and remain computationally demanding even with parallelization. Representation-based approaches fall between sequence- and structure-based methods, achieving runtimes roughly 3 orders of magnitude lower than structure alignment while retaining sensitivity beyond the sequence level. These runtime differences are explained by fundamentally different scalability characteristics.
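A minimal sketch of retrieval against a pre-built target database (the GPU + DB mode above), written in plain Python for clarity. It assumes cosine similarity over fixed-length embeddings and a brute-force scan. The evaluated tools may use different similarity functions and index structures, and `build_index` and `search` are hypothetical names.

```python
import math
from typing import Dict, List, Sequence, Tuple


def normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def build_index(targets: Dict[str, Sequence[float]]) -> Dict[str, List[float]]:
    """One-time step: normalize and store every target embedding for reuse."""
    return {name: normalize(vec) for name, vec in targets.items()}


def search(query: Sequence[float], index: Dict[str, List[float]],
           k: int = 10) -> List[Tuple[str, float]]:
    """Per-query step: one dot product per target, then keep the top-k scores."""
    q = normalize(query)
    scored = [(name, sum(a * b for a, b in zip(q, vec)))
              for name, vec in index.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]
```

The point of the sketch is the split between `build_index`, which runs once per target set, and `search`, whose per-target work is a single vector comparison rather than a residue-level alignment.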
Representation-based methods primarily incur cost during embedding computation, which can however be precomputed, so that in practice the overall search cost scales approximately linearly with the number of queries and targets. Importantly, target embeddings can also be reused, allowing the one-time indexing cost to be amortized across repeated searches and making such methods well suited for large or growing databases. In contrast, alignment-based methods are dominated by pairwise comparisons, resulting in a computational complexity that scales with the product of query and target set sizes, which fundamentally limits scalability despite hardware acceleration. Overall, our benchmark results suggest that representation-based methods provide a favorable balance between computational efficiency and expressive power, making them particularly well suited for large-scale protein search.

Summary

Overall rankings of each protein search method across all datasets and metrics (Fig. 8) highlight clear differences in their strengths across application scenarios. Global structural alignment methods (GTalign and TMalign) consistently perform well in CATH classification, making them particularly suitable for high-accuracy fold recognition and homologous superfamily assignment when reliable structures are available. In contrast, local structural alignment methods (Foldseek and Dali) show superior performance in functional consistency and local structural similarity (lDDT metric), indicating their advantage in capturing functionally relevant substructures and conserved local motifs, especially in cases where global folds diverge or are only partially conserved. Embedding-based representation methods (DHR and FoldExplorer) primarily excel in their robustness across diverse and noisy structural contexts, maintaining stable performance in functional consistency even when global pLDDT is low or when proteins contain partially disordered regions. 
By operating in a continuous representation space rather than relying on strict residue-level alignment, these methods are less sensitive to local structural inaccuracies and fold divergence. In particular, approaches that integrate structural cues with learned representations achieve a more reliable balance across evaluation categories, making them especially suitable for exploratory searches, functional inference, and large-scale protein space analysis, where robustness and generalization are more critical than exact structural alignment. Sequence alignment methods (BLAST, Diamond, jackhmmer, MMseqs) are effective for close homolog detection but show clear limitations in structural and functional benchmarks, particularly for remote homology and accurate predicted structures. Beyond accuracy, computational efficiency and scalability also play a decisive role in practical large-scale applications. Embedding-based approaches offer substantial advantages in speed and scalability, enabling rapid similarity search over millions of proteins. These properties make representation-based methods particularly attractive for high-throughput annotation and proteome-wide exploration. These results suggest that no single method is optimal for all tasks. The choice of method should therefore be guided by the specific application scenario, data quality, and scale of analysis. Discussions Applicability boundaries of protein search paradigms Rather than revealing a best-performing tool, this benchmark highlights the existence of clear applicability boundaries among different protein search paradigms. Sequence-based, structure-based alignment, and representation-based approaches rely on fundamentally different biological assumptions, and therefore no single paradigm provides a universally optimal solution. 
Sequence-based search relies on the conservation of primary structure and is inherently well suited for identifying close homologs and annotating proteins within well-characterized families. Structure-based alignment, by contrast, operates in a higher-level geometric space and is capable of detecting remote relationships that are invisible at the sequence level, particularly at the fold or topology scale. Representation-based methods occupy an intermediate and more flexible regime, compressing sequence or structural information into learned embeddings that can generalize across large databases but may blur fine-grained biological distinctions. Crucially, these paradigms diverge most strongly when evaluated across heterogeneous biological contexts. Differences in sequence identity, domain composition, structural modularity, and disorder content systematically shift the relative strengths of each approach. As a result, performance rankings are not stable across tasks but instead depend on the biological scenario under consideration. This instability is not a limitation of individual methods, but a reflection of the intrinsic complexity of protein relatedness. These observations emphasize that protein search performance is inherently context dependent. Evaluations spanning multiple biological scenarios can help to provide practical guidance for users in selecting appropriate tools for specific tasks, while also offering insights for future method development by clarifying how different design choices perform under varying conditions.

Metric space perspectives on protein search

Across the benchmarks evaluated in this study, we observed that the relative performance of alignment-based and representation-based methods depends strongly on protein complexity and the definition of similarity. 
In particular, representation-based approaches improved substantially after decomposing multi-domain proteins into single-domain units, reaching performance comparable to structural alignment methods. Similar trends were observed in functional search, where embedding-based methods exhibited stronger functional coherence among retrieved hits. These observations highlight that different protein representations induce distinct metric spaces, each encoding specific aspects of protein relatedness. Sequence-based, structure-based, and hybrid embedding-based methods emphasize complementary similarity signals, which helps explain why their behavior varies across benchmarks and why no single paradigm dominates. Alignment-based methods rely primarily on independent pairwise comparisons and do not naturally generate a unified or continuous representation of protein space. Consequently, they are less suited for analyzing the global organization, coverage, or unexplored regions of large protein databases. In contrast, embedding-based methods construct explicit, high-dimensional metric spaces that capture protein relationships globally. To illustrate this, we performed dimensionality reduction on all proteins in the CATH-S20 dataset (Fig. 9). In the resulting embedding space, proteins from the same class form distinct clusters, reflecting meaningful structural relationships, and the distances between these clusters can be quantitatively measured to assess similarity and divergence between different structural clusters. This continuous organization enables systematic characterization of protein space, providing insights into sparsely populated or poorly annotated regions where meaningful relationships may not be evident from pairwise alignments alone, and facilitates the identification of underexplored areas that may merit further investigation. 
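As a rough illustration of how distances between clusters in an embedding space could be quantified, the sketch below averages the embeddings of each class into a centroid and reports pairwise cosine distances between centroids. The choice of centroids and cosine distance is an assumption made for this example; the text does not state which distance was used. The function names and the toy vectors are hypothetical.

```python
import math
from collections import defaultdict


def class_centroids(embeddings, labels):
    """Average the embedding vectors that share a class label."""
    sums = {}
    counts = defaultdict(int)
    for vector, label in zip(embeddings, labels):
        if label not in sums:
            sums[label] = [0.0] * len(vector)
        sums[label] = [s + v for s, v in zip(sums[label], vector)]
        counts[label] += 1
    return {
        label: [s / counts[label] for s in total]
        for label, total in sums.items()
    }


def cosine_distance(a, b):
    """Return 1 - cosine similarity (assumed metric, not from the paper)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def centroid_distances(embeddings, labels):
    """Pairwise cosine distance between class centroids, keyed by label pair."""
    centroids = class_centroids(embeddings, labels)
    names = sorted(centroids)
    return {
        (a, b): cosine_distance(centroids[a], centroids[b])
        for index, a in enumerate(names)
        for b in names[index + 1:]
    }


if __name__ == "__main__":
    # Toy 2-D embeddings standing in for real protein embeddings.
    toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    toy_labels = ["mainly-alpha", "mainly-alpha", "mainly-beta", "mainly-beta"]
    for pair, distance in centroid_distances(toy_embeddings, toy_labels).items():
        print(pair, round(distance, 3))
```

Centroid distances summarize each cluster by a single point, so they ignore cluster spread; a distribution-aware measure may be more appropriate when clusters overlap.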
Future directions

Previous benchmarks for protein sequence and structure search have provided valuable insights into method performance, typically focusing on well-curated, folded domains and evaluating structural or alignment accuracy under relatively idealized conditions. Such studies have played a critical role in establishing reference standards and driving methodological progress in the field. Our benchmark extends these efforts by explicitly considering several biologically and practically relevant scenarios that are increasingly prevalent in modern protein databases but have been less systematically examined. In particular, we evaluate search performance on intrinsically disordered proteins, contrast multi-domain proteins with their single-domain counterparts, and stratify analyses by predicted structure confidence. These design choices enable a more realistic assessment of how different classes of methods behave when confronted with incomplete folding, domain complexity, or uncertainty in predicted structures. By systematically benchmarking protein similarity search methods across structure- and function-aware scenarios, our results highlight both the strengths and the current limitations of existing approaches under realistic database conditions. In particular, analyses of multi-domain architectures and predicted structures with varying confidence reveal how structural complexity and prediction uncertainty modulate search performance, whereas the explicit evaluation of intrinsically disordered proteins identifies regimes that remain challenging even for state-of-the-art methods. These findings underscore the need for benchmarks that go beyond canonical folded proteins and static structures, and motivate future efforts toward more robust, context-aware, and confidence-sensitive protein search methodologies. Despite the unified design of the benchmark, several limitations should be noted. 
We rely on curated resources such as CATH and DisProt. However, because the field currently lacks widely accepted public benchmarks for protein sequence and structure search, we had to construct several evaluation scenarios ourselves and collected some datasets from SwissProt, which introduces potential data reuse biases. In particular, several deep learning-based methods evaluated here are trained on large-scale resources that may partially overlap, directly or indirectly, with SwissProt sequences or annotations. Examples include TMvec, whose training relies on protein structures derived from SWISS-MODEL[70]; PLMSearch, which leverages Pfam-based homology information during retrieval (with the PfamClan module disabled in our evaluation); and DHR, whose training dataset is constructed using UniRef90[71] as a reference. As a result, it is difficult to fully exclude prior exposure of benchmark proteins during model training. This challenge is increasingly unavoidable in the era of large pretrained protein models, where training data provenance is often broad and heterogeneous. While we applied sequence identity filtering and redundancy reduction to mitigate trivial memorization, subtle forms of information leakage cannot be completely ruled out. Beyond these data-related considerations, additional limitations reflect the scope of the evaluation. Functional consistency is assessed using GO-based annotations, which are incomplete and unevenly distributed across proteins. Structural analyses rely on static protein conformations and therefore do not capture conformational dynamics or interaction-dependent folding. Moreover, our evaluation focuses on pairwise search performance and does not explicitly address downstream workflows such as clustering or iterative annotation. 
Future benchmarks would benefit from stricter control of training-evaluation separation, for example through time-split datasets, explicit tracking of training data sources, or the use of newly deposited or de novo sequences and structures. Extending evaluation to confidence-aware search, domain-aware representations, and disordered or context-dependent proteins will also be essential for capturing the full complexity of protein relatedness in large-scale databases. Methods Benchmark datasets CATH-S20 dataset The CATH-S20 (v4.4.0, 15043 domains) dataset was used to evaluate protein search performance under a hierarchical fold classification setting. Protein domains were obtained from the CATH database and filtered to a maximum pairwise sequence identity of 20% to reduce the redundancy. Each protein domain is annotated with a four-level hierarchical classification, including Class, Architecture, Topology (fold), and Homologous superfamily. We evaluated protein search performance on CATH-S20 using an all-versus-all search protocol, in which each domain was used in turn as a query against all remaining domains in the dataset. Functional search dataset The functional search dataset was constructed based on the AlphaFold Protein Structure Database (v6)[ 72 ], which provides a pre-assembled set of SwissProt proteins with predicted structures, comprising a total of 550122 entries. To ensure reliable structural information, we retained only proteins with an average predicted pLDDT score of at least 90, resulting in 307578 high-confidence predicted structures. These proteins were then clustered using CD-HIT[ 73 ] at a 20% sequence identity threshold to reduce the redundancy, yielding 11721 representative proteins. Next, we filtered the dataset to retain only proteins with curated Gene Ontology (GO) annotations, considering only the Molecular Function category. This resulted in 8767 proteins. 
We further excluded proteins annotated with GO terms that had been removed in the current GO release, producing a final dataset of 8735 proteins. From this final set, 1000 proteins were randomly selected as the query set, and the remaining proteins were used as the target set for functional search evaluation. Multi-domain dataset The multi-domain dataset was constructed to evaluate the impact of protein architectural complexity on the search performance. Starting from the same non-redundant, high-confidence protein set described above, comprising 11721 structures, we identified proteins annotated as multi-domain based on TED domain annotations. This resulted in a total of 5573 multi-domain proteins. To construct the search benchmark, 100 multi-domain proteins were randomly selected as the query set, and the remaining 4573 proteins were used as the target set. In addition, to enable a direct comparison between full-length multi-domain proteins and their constituent domains, we further derived a domain-resolved dataset by decomposing the multi-domain proteins into their single-domain units according to TED annotations. This design allows paired evaluations on identical underlying proteins, facilitating an assessment of how domain composition influences search performance across different methods. Only TED-defined continuous structural segments were retained as single-domain units, resulting in 204 queries and 11041 targets. Disorder protein search dataset The disorder protein search dataset was constructed to evaluate protein search performance on intrinsically disordered proteins (IDPs). Query proteins were obtained from DisProt (release 2024.12), which provides experimentally validated annotations of intrinsically disordered regions. Starting from the DisProt set with 3113 proteins, we retained proteins that have AlphaFold structures and curated Gene Ontology (GO) annotations. 
This resulted in an initial set of 1012 IDP query candidates, which was further processed to remove the redundancy. The target set was constructed from the AlphaFold SwissProt human proteome, comprising 23586 proteins. Human proteins were selected as targets because the majority of DisProt entries are derived from human proteins, enabling a biologically consistent evaluation setting. To ensure low sequence redundancy between query and target proteins, the two sets were pooled and clustered using CD-HIT at a 20% sequence identity threshold. After clustering, the final dataset consisted of 534 query proteins and 6955 target proteins. Predicted structure quality dataset This dataset was constructed to assess the robustness of protein search methods to variations in structural prediction accuracy. All proteins were obtained from the AFDB SwissProt set. Based on the average predicted pLDDT score, proteins were grouped into three subsets representing different levels of structure quality: high confidence (pLDDT ≥ 90), medium confidence (70 ≤ pLDDT < 90), and low confidence (pLDDT < 70). To control potential confounding effects introduced by protein length, we restricted the analysis to proteins with sequence lengths up to 300 residues. This filtering step was applied uniformly across all three subsets, as proteins with lower pLDDT scores tend to be substantially longer and could otherwise bias the evaluation. Each subset was independently clustered using CD-HIT at a 20% sequence identity threshold to reduce the redundancy. For each structure quality group, 100 proteins were randomly selected as the query set, while the remaining proteins were used as the target set. This resulted in target sets of 4053 proteins for the high-confidence group, 4913 proteins for the medium-confidence group, and 1907 proteins for the low-confidence group. 
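The grouping rule described above can be sketched as follows. This is a minimal illustration, assuming each protein is summarized by its length and mean pLDDT; the `Protein` record, its field names, and the example accessions are hypothetical, and the CD-HIT clustering and random query selection steps are not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Protein:
    accession: str      # hypothetical identifier
    length: int         # number of residues
    mean_plddt: float   # average per-residue pLDDT of the predicted model


MAX_LENGTH = 300  # length cap stated in the text


def confidence_group(mean_plddt: float) -> str:
    """Map a mean pLDDT to the three confidence levels used in the text."""
    if mean_plddt >= 90:
        return "high"
    if mean_plddt >= 70:
        return "medium"
    return "low"


def stratify(proteins):
    """Apply the length filter, then bin the remaining proteins by confidence."""
    groups = {"high": [], "medium": [], "low": []}
    for protein in proteins:
        if protein.length > MAX_LENGTH:
            continue  # same length cap for all three groups
        groups[confidence_group(protein.mean_plddt)].append(protein)
    return groups


if __name__ == "__main__":
    example = [
        Protein("P1", 120, 95.0),
        Protein("P2", 250, 82.5),
        Protein("P3", 300, 55.0),
        Protein("P4", 450, 99.0),  # dropped by the length filter
    ]
    for name, members in stratify(example).items():
        print(name, [p.accession for p in members])
```

Applying the length cap before binning keeps the three groups comparable in length, which is the stated purpose of the filter.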
Evaluation tasks and metrics

Sensitivity up to the 1st False Positive

For fold classification tasks on the CATH-S20 dataset, search performance was evaluated using sensitivity up to the first false positive. For each query, retrieved targets were ranked by similarity score, and true positives were defined according to the corresponding CATH classification level (Class, Architecture, Topology, or Homologous superfamily). Sensitivity up to the first false positive is defined as the fraction of true positives retrieved before the first incorrect hit appears in the ranked list:

$$\text{Sensitivity}_{\text{FP1}}=\frac{N_{\text{TP before FP}}}{N_{\text{total TP}}} \tag{1}$$

where $N_{\text{TP before FP}}$ denotes the number of true positive hits retrieved before the first false positive, and $N_{\text{total TP}}$ is the total number of true positives for the given query at the specified CATH level.

GO semantic similarity

For functional search and disorder protein search benchmarks, functional consistency was evaluated using Gene Ontology (GO) semantic similarity based on Molecular Function annotations. Functional similarity between a query protein and its retrieved targets was computed using the Wang semantic similarity measure[74], as implemented in the Python “goatools” package[75]. Given two GO terms $g_{1}$ and $g_{2}$, the Wang method quantifies their semantic similarity by considering the graph structure of the GO directed acyclic graph and the contribution of shared ancestor terms. For proteins annotated with multiple GO terms, we aggregated term-level similarities using the best-match average (BMA) strategy. 
$$\text{Sim}_{\text{GO}}(q,t)=\frac{1}{\left|G_{q}\right|}\sum_{g\in G_{q}}\max_{h\in G_{t}}\text{Sim}_{\text{Wang}}(g,h) \tag{2}$$

where $G_{q}$ and $G_{t}$ denote the sets of GO terms annotating the query $q$ and the target $t$, respectively. This formulation computes similarity in a query-to-target direction, reflecting the extent to which the functional annotations of the query are recovered among the retrieved targets.

Local Distance Difference Test (lDDT)

Local structural similarity was assessed using the Local Distance Difference Test (lDDT), a reference-free metric that quantifies the agreement of local inter-residue distances without requiring global structural superposition. lDDT measures the fraction of residue-residue distance pairs that are preserved within predefined distance tolerances between two aligned structures. Given an alignment between a query structure and a target structure, lDDT is computed as:

$$\text{lDDT}=\frac{1}{N}\sum_{i=1}^{N}\frac{1}{\left|\mathcal{N}(i)\right|}\sum_{j\in\mathcal{N}(i)}\mathbf{I}\left(\left|d_{ij}^{(q)}-d_{ij}^{(t)}\right|<\delta\right) \tag{3}$$

where $d_{ij}^{(q)}$ and $d_{ij}^{(t)}$ denote inter-residue distances in the query and target structures, respectively, $\mathcal{N}(i)$ is the set of neighboring residues of residue $i$ within a fixed spatial cutoff (15 Å), $\delta$ represents distance tolerance thresholds (0.5, 1, 2, and 4 Å), and $\mathbf{I}(\cdot)$ is the indicator function. The final residue-wise lDDT score is obtained by averaging over the four distance thresholds. 
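The lDDT computation described above can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: it assumes the inputs are two equally long lists of coordinates (for example Cα positions) for residues that have already been paired by an alignment, it takes neighbors only from those aligned residues, and the function name `lddt` is hypothetical.

```python
import math


def lddt(query_xyz, target_xyz, cutoff=15.0, thresholds=(0.5, 1.0, 2.0, 4.0)):
    """Average fraction of preserved local distances over aligned residues.

    query_xyz[k] and target_xyz[k] are the coordinates of the k-th aligned
    residue pair (assumption: one representative atom per residue).
    """
    if len(query_xyz) != len(target_xyz):
        raise ValueError("aligned coordinate lists must have the same length")

    residue_scores = []
    for i in range(len(query_xyz)):
        neighbor_scores = []
        for j in range(len(query_xyz)):
            if i == j:
                continue
            d_query = math.dist(query_xyz[i], query_xyz[j])
            if d_query >= cutoff:
                continue  # j is not a neighbor of i in the query structure
            d_target = math.dist(target_xyz[i], target_xyz[j])
            deviation = abs(d_query - d_target)
            # Fraction of the tolerance thresholds under which the distance is preserved.
            preserved = sum(deviation < delta for delta in thresholds)
            neighbor_scores.append(preserved / len(thresholds))
        if neighbor_scores:
            residue_scores.append(sum(neighbor_scores) / len(neighbor_scores))

    # Residues without any neighbor are skipped; return 0.0 if none remain.
    return sum(residue_scores) / len(residue_scores) if residue_scores else 0.0


if __name__ == "__main__":
    chain = [(3.8 * k, 0.0, 0.0) for k in range(5)]   # toy straight chain
    moved = [(x + 10.0, y - 2.0, z + 5.0) for x, y, z in chain]
    print(lddt(chain, moved))  # translation leaves all distances unchanged
```

Because only internal distances are compared, the score is unchanged by rigid-body motion of either structure, which is why no superposition is needed.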
Following previous work on reference-free multi-domain evaluation, we adopted a modified definition of lDDT in which the denominator $\:\left|\mathcal{N}\left(i\right)\right|$ is defined as the total number of neighboring residues within 15 Å of residue $\:\text{i}$ in the query structure, rather than only those neighbors that are aligned to the target. This modification penalizes non-compact or fragmented alignments, where few neighboring residues are aligned despite limited local structural consistency, and thus provides a more stringent assessment of local structural agreement[ 38 ]. For each query-target pair, lDDT was computed based on structural alignments obtained from multiple alignment tools, and the maximum lDDT score was used as the final local similarity measure. Search performance was evaluated by averaging lDDT scores over the top 10 retrieved hits. Evaluated protein search tools We evaluated a diverse set of protein search tools spanning sequence-based, structure-based, and representation-based approaches. All methods were applied using publicly available implementations with recommended or default parameters unless otherwise specified. Structure alignment tools Dali Dali is a classical structure alignment tool that performs local structural alignment based on distance matrix comparisons. We installed the standalone DaliLite.v5 (available at http://ekhidna2.biocenter.helsinki.fi/dali/README.v5.html ). Input PDB files were converted to DAT format using Dali’s “ import.pl ”. Protein alignments were computed using Dali’s structural alignment algorithm, and results were sorted by Dali z-score in descending order. TMalign We used TMalign (Version 20220412) for structural alignment. Alignments between query and target protein structures were computed with default parameters. All-vs-all dense comparisons were performed between queries and targets. 
For benchmarks focusing on global structural similarity, such as CATH-S20 and single-domain datasets, the average TM-score[76] over all query-target pairs was used. For evaluations emphasizing local structural similarity, such as functional search, multi-domain, or GO semantic similarity datasets, the maximum TM-score across all alignments was used as the final similarity measurement.

GTalign

GTalign is an accelerated implementation of TMalign that preserves the underlying alignment strategy while substantially reducing the computational cost. We used the GPU Version 0.18.00 (https://github.com/minmarg/gtalign_alpha). All-versus-all dense retrieval was performed between query and target structures. Alignments were computed for all query-target pairs. TM-scores produced by GTalign were used in the same manner as TMalign.

Foldseek

Foldseek is a fast structure-based protein search tool that enables large-scale structural comparisons by discretizing protein structures into sequences over a structural alphabet (3Di) and performing alignment using a sequence search framework. We used Foldseek (Version dd579d9e6682519937e5c27d1ccb9eb4c9aeb87f) with default parameters. The software is available at https://github.com/steineggerlab/foldseek. Searches were performed using the “easy-search” command. As Foldseek typically identifies enough target hits, we employed the default E-value threshold of 1.0, and results were ranked by bit score.

Representation-based tools

GraSR

GraSR is a graph-based structural representation method that encodes protein structures into embeddings using a graph neural network (GNN)[77] to enable efficient similarity search without explicit structural alignment. We used the publicly available GraSR implementation with pretrained model weights (available at https://github.com/chunqiux/GraSR).

TMvec

TMvec is a representation-based protein structural similarity search method. 
It builds protein embeddings using protein large language model representations from ProtTrans, which are further refined by a four-layer Transformer[ 78 ] network trained to approximate TM-score. It provides pretrained weights trained on four different types of datasets, and we used the “ tmvec_swiss_model ” weights. TMvec is available at https://github.com/tymor22/tm-vec . PLMSearch PLMSearch is a protein search framework based on protein language model embeddings and includes two optional modules, SS-predictor and PfamClan. The PfamClan module relies on PfamScan[ 79 ], a widely used third-party tool for Pfam domain annotation, and is independent of the core embedding model. As Pfam-based annotations can in some cases reflect curated functional knowledge in Swiss-Prot, including GO annotations, we restricted the use of PfamClan to the CATH evaluation only. For all other benchmark settings, we used the SS-predictor module to ensure a consistent and annotation-independent comparison across methods. PLMSearch is available at https://dmiip.sjtu.edu.cn/PLMSearch . DHR DHR (Dense-Homolog-Retrieval) is a representation-based protein sequence search method based on a bi-encoder architecture with contrastive learning, initialized from the pretrained ESM-1b model. The model is trained on homologous sequence pairs derived from jackhmmer-generated MSAs built from large-scale sequence databases, with training queries primarily drawn from UniRef90 and UniClust30[ 80 ], which include substantial coverage of SwissProt and may introduce partial overlap with our evaluation datasets. In our benchmark, we used the released v1 pretrained weights ( https://github.com/ml4bio/Dense-Homolog-Retrieval/tree/v1 ) following the authors’ recommendation. FoldExplorer FoldExplorer is a protein search method based on joint sequence-structure representations. 
Structural information is encoded using a graph neural network, while sequence representations are obtained from an ESM-2 model with fine-tuning. Query and target proteins are mapped to fixed-length embeddings, and retrieval is performed by ranking targets according to cosine similarity. FoldExplorer is available at https://github.com/YuanLiu-SJTU/FoldExplorer.

Sequence alignment tools

BLAST

BLAST is a widely used sequence-based alignment tool that performs local sequence similarity searches using heuristic alignment strategies. We used protein-protein BLAST 2.16.0+ with default parameters unless otherwise specified. Protein sequences were searched against the pre-built target database, and hits were ranked by bit score.

Diamond

Diamond is a fast sequence alignment tool designed for large-scale protein searches, providing a substantial speed-up over BLAST while maintaining comparable sensitivity. We used Diamond (v2.1.10.164) in “blastp” mode with default parameters. Search results were ranked by bit score.

Jackhmmer

Jackhmmer is an iterative sequence search tool based on profile hidden Markov models (HMMs), implemented in the HMMER suite[81]. We used Jackhmmer (HMMER Version 3.4) with default parameters. Iterative searches were performed against the target sequence database for five iterations with “-N 5”. Final hits were ranked according to the native scores reported by the software.

MMseqs

MMseqs2 (Many against Many sequence searching) is an open-source software suite for very fast, parallelized protein sequence searches and clustering of huge protein sequence data sets. We used MMseqs2 (Release 16.747c6) in search mode with default parameters. Query sequences were searched against the target database, and results were ranked by bit score. For all sequence alignment tools, we set the E-value threshold to 1e16 to ensure a sufficient number of hits. 
This is necessary because the sequence similarity in our datasets is generally low, and default thresholds often result in fewer than 10 hits per query.

Evaluation protocol

To ensure a fair and consistent comparison across protein search methods with heterogeneous output formats and scoring schemes, we adopted a unified top-k retrieval protocol with k = 10. For each query protein, we first configured each method to return as many target hits as possible by using permissive or unbounded output settings (e.g., disabling hit number limits or E-value thresholds where applicable). Retrieved targets were ranked according to the native similarity score of each method, and the top-10 ranked targets were selected for evaluation. In cases where a method returned fewer than 10 valid hits for a given query (e.g., due to limited sensitivity or strict internal filtering), the remaining slots were filled by randomly sampling proteins from the target set that were not retrieved by the method. These randomly sampled targets were appended to the ranked list to ensure that exactly 10 targets were associated with each query for all methods. Random sampling was performed using a fixed random seed to ensure reproducibility.

Declarations

Author Contribution

Yuan Liu and Hong-Bin Shen designed the benchmark framework and conducted the experiments. Yuan Liu implemented the evaluation pipeline, wrote the code, performed data analysis, and drafted the manuscript. Yingquan Zhou assisted with dataset preparation and validation experiments. Yan Huang and Hongyi Xin contributed to method selection, experimental design, and result interpretation. Xiaoyong Pan provided critical insights into benchmark design and revised the manuscript. Hong-Bin Shen conceived and supervised the study, provided overall guidance, and revised the manuscript. All authors reviewed and approved the final manuscript. 
Acknowledgements

This work was supported by the National Key Research and Development Program of China (No. 2025YFA1805600), the National Natural Science Foundation of China (No. 62573293, 62473257), and the Science and Technology Commission of Shanghai Municipality (No. 24ZR1435300, 24510714300).

Data Availability

All data sets used in this paper and the scripts to reproduce the analyses are freely available at https://github.com/YuanLiu-SJTU/protein-search-benchmark. The CATH structural domains were obtained from https://www.cathdb.info/, predicted structures from AlphaFold are available at https://alphafold.com/, and domain annotations were sourced from https://ted.cathdb.info/.

References

1. Wilson CA, Kreychman J, Gerstein M. Assessing annotation transfer for genomics: quantifying the relations between protein sequence, structure and function through traditional and probabilistic scores. J Mol Biol. 2000;297:233–49.
2. Joshi T, Xu D. Quantitative assessment of relationship between sequence similarity and function similarity. BMC Genomics. 2007;8:222.
3. Xie H, Wasserman A, Levine Z, Novik A, Grebinskiy V, Shoshan A, Mintz L. Large-scale protein annotation through gene ontology. Genome Res. 2002;12:785–94.
4. Roberts E, Eargle J, Wright D, Luthey-Schulten Z. MultiSeq: unifying sequence and structure data for evolutionary analysis. BMC Bioinformatics. 2006;7:382.
5. Lisewski AM, Lichtarge O. Rapid detection of similarity in protein structure and function through contact metric distances. Nucleic Acids Res. 2006;34:e152.
6. Loewenstein Y, Raimondo D, Redfern OC, Watson J, Frishman D, Linial M, Orengo C, Thornton J, Tramontano A. Protein function annotation by homology-based inference. Genome Biol. 2009;10:207.
7. Blum M, Andreeva A, Florentino LC, Chuguransky SR, Grego T, Hobbs E, Pinto BL, Orr A, Paysan-Lafosse T, Ponamareva I. InterPro: the protein sequence classification resource in 2025. Nucleic Acids Res. 2025;53:D444–56.
8. Paysan-Lafosse T, Andreeva A, Blum M, Chuguransky SR, Grego T, Pinto BL, Salazar GA, Bileschi ML, Llinares-López F, Meng-Papaxanthos L. The Pfam protein families database: embracing AI/ML. Nucleic Acids Res. 2025;53:D523–34.
9. UniProt Consortium. UniProt: the universal protein knowledgebase in 2025. Nucleic Acids Res. 2025;53:D609–17.
10. Jumper J, Evans R, Pritzel A, Green T, Figurnov M, Ronneberger O, Tunyasuvunakool K, Bates R, Žídek A, Potapenko A. Highly accurate protein structure prediction with AlphaFold. Nature. 2021;596:583–9.
11. Abramson J, Adler J, Dunger J, Evans R, Green T, Pritzel A, Ronneberger O, Willmore L, Ballard AJ, Bambrick J. Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature. 2024;630:493–500.
12. Lin Z, Akin H, Rao R, Hie B, Zhu Z, Lu W, Smetanin N, Verkuil R, Kabeli O, Shmueli Y. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science. 2023;379:1123–30.
13. Zheng W, Wuyun Q, Li Y, Liu Q, Zhou X, Peng C, Zhu Y, Freddolino L, Zhang Y. Deep-learning-based single-domain and multidomain protein structure prediction with DI-TASSER. Nat Biotechnol. 2025:1–13.
14. Mirdita M, Schütze K, Moriwaki Y, Heo L, Ovchinnikov S, Steinegger M. ColabFold: making protein folding accessible to all. Nat Methods. 2022;19:679–82.
15. Tunyasuvunakool K, Adler J, Wu Z, Green T, Zielinski M, Žídek A, Bridgland A, Cowie A, Meyer C, Laydon A. Highly accurate protein structure prediction for the human proteome. Nature. 2021;596:590–6.
16. Varadi M, Anyango S, Deshpande M, Nair S, Natassia C, Yordanova G, Yuan D, Stroe O, Wood G, Laydon A. AlphaFold Protein Structure Database: massively expanding the structural coverage of protein-sequence space with high-accuracy models. Nucleic Acids Res. 2022;50:D439–44.
17. Yeo J, Han Y, Bordin N, Lau AM, Kandathil SM, Kim H, Karin EL, Mirdita M, Jones DT, Orengo C. Metagenomic-scale analysis of the predicted protein structure universe. bioRxiv. 2025:2025.04.23.650224.
18. Kim RS, Levy Karin E, Mirdita M, Chikhi R, Steinegger M. BFVD—a large repository of predicted viral protein structures. Nucleic Acids Res. 2025;53:D340–7.
19. Porta-Pardo E, Ruiz-Serra V, Valentini S, Valencia A. The structural coverage of the human proteome before and after AlphaFold. PLoS Comput Biol. 2022;18:e1009818.
20. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol. 1990;215:403–10.
21. Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25:3389–402.
22. Kilinc M, Jia K, Jernigan RL. Improved global protein homolog detection with major gains in function identification. Proc Natl Acad Sci U S A. 2023;120:e2211823120.
23. Zhang Y, Skolnick J. TM-align: a protein structure alignment algorithm based on the TM-score. Nucleic Acids Res. 2005;33:2302–9.
24. Holm L. Using Dali for protein structure comparison. Structural bioinformatics: methods and protocols. Springer; 2020. pp. 29–42.
25. Holm L, Laiho A, Törönen P, Salgado M. DALI shines a light on remote homologs: One hundred discoveries. Protein Sci. 2023;32:e4519.
26. Koehl P. Protein structure similarities. Curr Opin Struct Biol. 2001;11:348–53.
27. Sael L, Li B, La D, Fang Y, Ramani K, Rustamov R, Kihara D. Fast protein tertiary structure retrieval based on global surface shape similarity. Proteins Struct Funct Bioinform. 2008;72:1259–73.
28. Choi I-G, Kwon J, Kim S-H. Local feature frequency profile: a method to measure structural similarity in proteins. Proc Natl Acad Sci U S A. 2004;101:3797–802.
29. Bhaskara RM, Srinivasan N. Stability of domain structures in multi-domain proteins. Sci Rep. 2011;1:40.
30. Raghava G, Searle SM, Audley PC, Barber JD, Barton GJ. OXBench: a benchmark for evaluation of protein multiple sequence alignment accuracy. BMC Bioinformatics. 2003;4:47.
31. Zhang C, Shine M, Pyle AM, Zhang Y. US-align: universal structure alignments of proteins, nucleic acids, and macromolecular complexes. Nat Methods. 2022;19:1109–15.
32. Budowski-Tal I, Nov Y, Kolodny R. FragBag, an accurate representation of protein structure, retrieves structural neighbors from the entire PDB quickly and accurately. Proc Natl Acad Sci U S A. 2010;107:3481–6.
33. Liu Y, Ye Q, Wang L, Peng J. Learning structural motif representations for efficient protein structure search. Bioinformatics. 2018;34:i773–80.
34. Durairaj J, Akdel M, de Ridder D, van Dijk AD. Geometricus represents protein structures as shape-mers derived from moment invariants. Bioinformatics. 2020;36:i718–25.
35. Llinares-López F, Berthet Q, Blondel M, Teboul O, Vert J-P. Deep embedding and alignment of protein sequences. Nat Methods. 2023;20:104–11.
36. Kaminski K, Ludwiczak J, Pawlicki K, Alva V, Dunin-Horkawicz S. pLM-BLAST: distant homology detection based on direct comparison of sequence representations from protein language models. Bioinformatics. 2023;39:btad579.
37. Kandathil SM, Lau AM, Buchan DW, Jones DT. Foldclass and Merizo-search: scalable structural similarity search for single- and multi-domain proteins using geometric learning. Bioinformatics. 2025;41:btaf277.
38. Van Kempen M, Kim SS, Tumescheit C, Mirdita M, Lee J, Gilchrist CL, Söding J, Steinegger M. Fast and accurate protein structure search with Foldseek. Nat Biotechnol. 2024;42:243–6.
39. Liu W, Wang Z, You R, Xie C, Wei H, Xiong Y, Yang J, Zhu S. PLMSearch: Protein language model powers accurate and fast sequence search for remote homology. Nat Commun. 2024;15:2775.
40. Hamamsy T, Morton JT, Blackwell R, Berenberg D, Carriero N, Gligorijevic V, Strauss CE, Leman JK, Cho K, Bonneau R. Protein remote homology detection and structural alignment using deep learning. Nat Biotechnol. 2024;42:975–85.
41. Xia C, Feng S-H, Xia Y, Pan X, Shen H-B. Fast protein structure comparison through effective representation learning with contrastive graph neural networks. PLoS Comput Biol. 2022;18:e1009986.
42. Hong L, Hu Z, Sun S, Tang X, Wang J, Tan Q, Zheng L, Wang S, Xu S, King I. Fast, sensitive detection of protein homologs using deep dense retrieval. Nat Biotechnol. 2025;43:983–95.
43. Liu Y, Zhang Y, Zhou Z, Shen H-B. FoldExplorer: fast and accurate protein structure search with sequence-enhanced graph embedding. J Mol Biol. 2025:169412.
44. Söding J, Remmert M. Protein sequence comparison and fold recognition: progress and good-practice benchmarking. Curr Opin Struct Biol. 2011;21:404–11.
45. Sauder JM, Arthur JW, Dunbrack RL Jr. Large-scale comparison of protein sequence alignment algorithms with structure alignments. Proteins Struct Funct Bioinform. 2000;40:6–22.
46. Csaba G, Birzele F, Zimmer R. Systematic comparison of SCOP and CATH: a new gold standard for protein structure analysis. BMC Struct Biol. 2009;9:23.
47. Wang Y, Wu H, Cai Y. A benchmark study of sequence alignment methods for protein clustering. BMC Bioinformatics. 2018;19:529.
48. Sykes J, Holland BR, Charleston MA. Benchmarking methods of protein structure alignment. J Mol Evol. 2020;88:575–97.
49. Oldfield CJ, Dunker AK. Intrinsically disordered proteins and intrinsically disordered protein regions. Annu Rev Biochem. 2014;83:553–84.
50. Dobson L, Tusnády GE, Tompa P. Regularly updated benchmark sets for statistically correct evaluations of AlphaFold applications. Brief Bioinform. 2025;26:bbaf104.
51. Aderinwale T, Bharadwaj V, Christoffer C, Terashi G, Zhang Z, Jahandideh R, Kagaya Y, Kihara D. Real-time structure search and structure classification for AlphaFold protein models. Commun Biol. 2022;5:316.
52. Waman VP, Bordin N, Lau A, Kandathil S, Wells J, Miller D, Velankar S, Jones DT, Sillitoe I, Orengo C. CATH v4.4: major expansion of CATH by experimental and predicted structural data. Nucleic Acids Res. 2025;53:D348–55.
53. Vander Meersche Y, Diharce J, Gelly J-C, Galochkina T. Flexibility or uncertainty? A critical assessment of AlphaFold 2 pLDDT. Structure. 2025;33:2157–63. e2152.
54. Margelevičius M. GTalign: Spatial index-driven protein structure alignment, superposition, and search. Nat Commun. 2024;15:7305.
55. Buchfink B, Xie C, Huson DH. Fast and sensitive protein alignment using DIAMOND. Nat Methods. 2015;12:59–60.
56. Johnson LS, Eddy SR, Portugaly E. Hidden Markov model speed heuristic and iterative HMM search procedure. BMC Bioinformatics. 2010;11:431.
57. Steinegger M, Söding J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat Biotechnol. 2017;35:1026–8.
58. Yu T, Cui H, Li JC, Luo Y, Jiang G, Zhao H. Enzyme function prediction using contrastive learning. Science. 2023;379:1358–63.
59. Aleksander SA, Balhoff J, Carbon S, Cherry JM, Drabkin HJ, Ebert D, Feuermann M, Gaudet P, Harris NL. The gene ontology knowledgebase in 2023. Genetics. 2023;224:iyad031.
60. Rives A, Meier J, Sercu T, Goyal S, Lin Z, Liu J, Guo D, Ott M, Zitnick CL, Ma J. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc Natl Acad Sci U S A. 2021;118:e2016239118.
61. Elnaggar A, Heinzinger M, Dallago C, Rehawi G, Wang Y, Jones L, Gibbs T, Feher T, Angerer C, Steinegger M. ProtTrans: Toward understanding the language of life through self-supervised learning. IEEE Trans Pattern Anal Mach Intell. 2021;44:7112–27.
62. Zhang Z, Wayment-Steele HK, Brixi G, Wang H, Kern D, Ovchinnikov S. Protein language models learn evolutionary statistics of interacting sequence motifs. Proc Natl Acad Sci U S A. 2024;121:e2406285121.
63. Rost B. Twilight zone of protein sequence alignments. Protein Eng. 1999;12:85–94.
64. Han J-H, Batey S, Nickson AA, Teichmann SA, Clarke J. The folding and evolution of multidomain proteins. Nat Rev Mol Cell Biol. 2007;8:319–30.
65. Lau AM, Bordin N, Kandathil SM, Sillitoe I, Waman VP, Wells J, Orengo CA, Jones DT. Exploring structural diversity across the protein universe with The Encyclopedia of Domains. Science. 2024;386:eadq4946.
66. Mariani V, Biasini M, Barbato A, Schwede T. lDDT: a local superposition-free score for comparing protein structures and models using distance difference tests. Bioinformatics. 2013;29:2722–8.
67. Trivedi R, Nagarajaram HA. Intrinsically disordered proteins: an overview. Int J Mol Sci. 2022;23:14050.
68. Sickmeier M, Hamilton JA, LeGall T, Vacic V, Cortese MS, Tantos A, Szabo B, Tompa P, Chen J, Uversky VN. DisProt: the database of disordered proteins. Nucleic Acids Res. 2007;35:D786–93.
69. Kulkarni P, Leite VB, Roy S, Bhattacharyya S, Mohanty A, Achuthan S, Singh D, Appadurai R, Rangarajan G, Weninger K. Intrinsically disordered proteins: Ensembles at the limits of Anfinsen's dogma. Biophys Rev. 2022;3.
70. Waterhouse A, Bertoni M, Bienert S, Studer G, Tauriello G, Gumienny R, Heer FT, de Beer TAP, Rempfer C, Bordoli L. SWISS-MODEL: homology modelling of protein structures and complexes. Nucleic Acids Res. 2018;46:W296–303.
71. Suzek BE, Huang H, McGarvey P, Mazumder R, Wu CH. UniRef: comprehensive and non-redundant UniProt reference clusters. Bioinformatics. 2007;23:1282–8.
72. Bertoni D, Tsenkov M, Magana P, Nair S, Pidruchna I, Querino Lima Afonso M, Midlik A, Paramval U, Lawal D, Tanweer A. AlphaFold Protein Structure Database 2025: a redesigned interface and updated structural coverage. Nucleic Acids Res. 2026;54:D358–62.
73. Fu L, Niu B, Zhu Z, Wu S, Li W. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics. 2012;28:3150–2.
74. Yu G, Li F, Qin Y, Bo X, Wu Y, Wang S. GOSemSim: an R package for measuring semantic similarity among GO terms and gene products. Bioinformatics. 2010;26:976–8.
75. Klopfenstein DV, Zhang L, Pedersen BS, Ramírez F, Warwick Vesztrocy A, Naldi A, Mungall CJ, Yunes JM, Botvinnik O, Weigel M. GOATOOLS: A Python library for Gene Ontology analyses. Sci Rep. 2018;8:10872.
76. Zhang Y, Skolnick J. Scoring function for automated assessment of protein structure template quality. Proteins Struct Funct Bioinform. 2004;57:702–10.
77. Scarselli F, Gori M, Tsoi AC, Hagenbuchner M, Monfardini G. The graph neural network model. IEEE Trans Neural Networks. 2008;20:61–80.
78. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I. Attention is all you need. Adv Neural Inf Process Syst. 2017;30.
79. Mistry J, Bateman A, Finn RD. Predicting active site residue annotations in the Pfam database. BMC Bioinformatics. 2007;8:298.
80. Mirdita M, Von Den Driesch L, Galiez C, Martin MJ, Söding J, Steinegger M. Uniclust databases of clustered and deeply annotated protein sequences and alignments. Nucleic Acids Res. 2017;45:D170–6.
81. Finn RD, Clements J, Eddy SR. HMMER web server: interactive sequence similarity searching. Nucleic Acids Res. 2011;39:W29–37.

Additional Declarations

No competing interests reported.

Supplementary Files

supportinginformation.docx

Status: Under Review. Version 1 posted.

Editorial decision: Revision requested, 25 Mar, 2026
Reviews received at journal, 24 Mar, 2026
Reviews received at journal, 04 Mar, 2026
Reviewers agreed at journal, 25 Feb, 2026
Reviewers agreed at journal, 20 Feb, 2026
Reviewers invited by journal, 19 Feb, 2026
Editor assigned by journal, 12 Feb, 2026
Submission checks completed at journal, 05 Feb, 2026
First submitted to journal, 05 Feb, 2026
Also discoverable on Platform About Our Team In Review Editorial Policies Advisory Board Help Center Resources Author Services Accessibility API Access RSS feed Manage Cookie Preferences © Research Square 2026 | ISSN 2693-5015 (online) Privacy Policy Terms of Service Do Not Sell My Personal Information {"props":{"pageProps":{"initialData":{"identity":"rs-8796067","acceptedTermsAndConditions":true,"allowDirectSubmit":false,"archivedVersions":[],"articleType":"Research Article","associatedPublications":[],"authors":[{"id":595013581,"identity":"4002c4f7-452f-4d16-bdc8-02537f5681e4","order_by":0,"name":"Yuan Liu","email":"","orcid":"","institution":"Shanghai Jiao Tong University","correspondingAuthor":false,"prefix":"","firstName":"Yuan","middleName":"","lastName":"Liu","suffix":""},{"id":595013582,"identity":"337c7aef-1373-4e13-bfc2-ad4cb51981dc","order_by":1,"name":"Yingquan Zhou","email":"","orcid":"","institution":"Shanghai Jiao Tong University","correspondingAuthor":false,"prefix":"","firstName":"Yingquan","middleName":"","lastName":"Zhou","suffix":""},{"id":595013583,"identity":"ba2e8be0-e951-41b4-8fb7-7fbe1fc7048f","order_by":2,"name":"Yan Huang","email":"","orcid":"","institution":"Shanghai Institute of Technical Physics","correspondingAuthor":false,"prefix":"","firstName":"Yan","middleName":"","lastName":"Huang","suffix":""},{"id":595013584,"identity":"8d5934c6-4367-456e-aafb-a2d247baff69","order_by":3,"name":"Hongyi Xin","email":"","orcid":"","institution":"Shanghai Jiao Tong Universit","correspondingAuthor":false,"prefix":"","firstName":"Hongyi","middleName":"","lastName":"Xin","suffix":""},{"id":595013585,"identity":"5af67189-5b99-4eab-96e2-e2f66716b973","order_by":4,"name":"Xiaoyong Pan","email":"","orcid":"","institution":"Shanghai Jiao Tong University","correspondingAuthor":false,"prefix":"","firstName":"Xiaoyong","middleName":"","lastName":"Pan","suffix":""},{"id":595013586,"identity":"9659828e-a0a7-45fe-af36-0a3ea7a0cf77","order_by":5,"name":"Hong-Bin 
Shen","email":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAZAAAAAyAQMAAABI0h/eAAAABlBMVEX///8AAABVwtN+AAAACXBIWXMAAA7EAAAOxAGVKw4bAAAAz0lEQVRIiWNgGAWjYBACAwkGNiBlw8DYAOKyEa8ljYGxjUQth6GqidFiLt3+7MHHHeftmef3GDB8KDvMwD+7Ab8WyzlnzA1nnrmd2NjGY8A449xhBok7Bwg47EYOmzRv2+0ERqAWZt62w0CnJhDSkv5M+m/bOXuwlr/EaUkwk2ZsO8AIchgzI1Fa7pwxk+xtSwb6Ja3gYM+5dB6JG4S03G5/JvGzzc7esPnwxgc/yqzl+GcQ0AIHhg0MDAeANA+R6oFAnnilo2AUjIJRMNIAAAOvQg3eRRCMAAAAAElFTkSuQmCC","orcid":"","institution":"Shanghai Jiao Tong University","correspondingAuthor":true,"prefix":"","firstName":"Hong-Bin","middleName":"","lastName":"Shen","suffix":""}],"badges":[],"createdAt":"2026-02-05 11:08:33","currentVersionCode":1,"declarations":"","doi":"10.21203/rs.3.rs-8796067/v1","doiUrl":"https://doi.org/10.21203/rs.3.rs-8796067/v1","draftVersion":[],"editorialEvents":[],"editorialNote":"","failedWorkflow":false,"files":[{"id":103326537,"identity":"7548cd06-ac65-4641-a874-ac18bc284009","added_by":"auto","created_at":"2026-02-24 12:57:03","extension":"png","order_by":1,"title":"Figure 1","display":"","copyAsset":false,"role":"figure","size":636734,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eOverview of the unified benchmark for protein search. \u003c/strong\u003eThe benchmark evaluates representative protein search methods across five complementary biological scenarios designed to reflect challenges in protein similarity search. These include (A) fold-level structural similarity assessed on the CATH dataset, (B) functional consistency evaluated using Gene Ontology semantic similarity on redundancy-reduced SwissProt proteins, (C) local structural similarity in multi-domain proteins, (D) search performance on intrinsically disordered proteins, and (E) robustness to the quality of predicted structures based on AlphaFold models with varying pLDDT scores. 
Methods from different search paradigms, including sequence-based, structure-based alignment, and representation-based approaches, are evaluated under consistent datasets, metrics, and experimental protocols.\u003c/p\u003e","description":"","filename":"floatimage1.png","url":"https://assets-eu.researchsquare.com/files/rs-8796067/v1/35abda1f961dad571b4db3d9.png"},{"id":103326507,"identity":"0f9cfd14-bb93-4f70-899a-75993acd66b2","added_by":"auto","created_at":"2026-02-24 12:56:56","extension":"png","order_by":2,"title":"Figure 2","display":"","copyAsset":false,"role":"figure","size":681488,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eFold classification performance on the CATH-S20 dataset. \u003c/strong\u003eCumulative distributions of the\u003cstrong\u003e \u003c/strong\u003esensitivity up to the first false positive curves are reported across the four CATH hierarchy levels: Class, Architecture, Topology, and Homologous superfamily. Methods are grouped by search paradigm and colored accordingly: structure-based alignment methods (blue), representation-based methods (orange), and sequence-based alignment methods (green). Foldseek combines representation-based and alignment-based strategies and is shown separately.\u003c/p\u003e","description":"","filename":"floatimage2.png","url":"https://assets-eu.researchsquare.com/files/rs-8796067/v1/510c0fe5ab62b7a0f005bbb2.png"},{"id":103326609,"identity":"d747fe0a-a6b6-4219-8a19-136bc0079dfc","added_by":"auto","created_at":"2026-02-24 12:57:24","extension":"png","order_by":3,"title":"Figure 3","display":"","copyAsset":false,"role":"figure","size":189859,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eFunctional consistency measured by GO semantic similarity. 
\u003c/strong\u003eGO semantic similarity is computed between query proteins and their top 10 retrieved hits on a SwissProt dataset filtered to 20% sequence identity.\u003c/p\u003e","description":"","filename":"floatimage3.png","url":"https://assets-eu.researchsquare.com/files/rs-8796067/v1/ecfb704225cc3140f73a248a.png"},{"id":103326546,"identity":"dda8096a-ba52-482e-baf7-8f3779e34f7b","added_by":"auto","created_at":"2026-02-24 12:57:11","extension":"png","order_by":4,"title":"Figure 4","display":"","copyAsset":false,"role":"figure","size":236621,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eTop-10 lDDT score comparison on multi-domain and single-domain proteins. \u003c/strong\u003eSearch performance is evaluated using the average lDDT over the top 10 retrieved hits. Single-domain proteins are obtained by decomposing the same multi-domain proteins based on TED annotations, enabling a paired comparison on identical underlying data. Hollow markers (MultiDomain) denote full-length multi-domain proteins, while filled markers (Multi2Single) denote the corresponding domain-resolved single-domain proteins.\u003c/p\u003e","description":"","filename":"floatimage4.png","url":"https://assets-eu.researchsquare.com/files/rs-8796067/v1/1d9b1e4bc6fa2e13ca12a17c.png"},{"id":103326539,"identity":"d165cd7d-8bf8-4606-8b07-b05e3190111b","added_by":"auto","created_at":"2026-02-24 12:57:03","extension":"png","order_by":5,"title":"Figure 5","display":"","copyAsset":false,"role":"figure","size":241069,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eFunctional relevance of protein search results for intrinsically disordered proteins. \u003c/strong\u003eGO semantic similarity distributions of the top-10 retrieved targets are shown for intrinsically disordered protein (IDP) queries grouped by disorder content. 
Experimentally validated IDPs from DisProt were binned into five intervals according to the fraction (ranging from 0.0 to 1.0) of disordered residues. For each bin, GO semantic similarity was computed between each query and its retrieved targets. The number of query proteins in each disorder bin is indicated as “n” in parentheses (e.g., “0.0-0.2 (n = 202)” indicates 202 proteins with 0-20% disordered sequence content). The “random” baseline corresponds to randomly sampled targets from the search database, while “best” denotes an upper-bound reference obtained by selecting targets that maximize GO semantic similarity for each query.\u003c/p\u003e","description":"","filename":"floatimage5.png","url":"https://assets-eu.researchsquare.com/files/rs-8796067/v1/21dc44a927f744ef81c1ca59.png"},{"id":103326564,"identity":"66ce8bc6-7f2d-43de-9a10-72d5bc2afe3b","added_by":"auto","created_at":"2026-02-24 12:57:14","extension":"png","order_by":6,"title":"Figure 6","display":"","copyAsset":false,"role":"figure","size":256914,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eSearch performance across predicted structure quality. \u003c/strong\u003eSearch performance was evaluated on AlphaFold predicted protein structures stratified by local confidence, as measured by pLDDT scores (pLDDT ≥ 90, 70 ≤ pLDDT \u0026lt; 90, and pLDDT \u0026lt; 70). For each method, performance distributions are shown as boxplots of GO semantic similarity scores averaged over the top 10 retrieved hits. 
Methods are grouped by search paradigm and color-coded accordingly.\u003c/p\u003e","description":"","filename":"floatimage6.png","url":"https://assets-eu.researchsquare.com/files/rs-8796067/v1/6b7bdc02baf79de000ef4468.png"},{"id":103326567,"identity":"61f9895a-e9b7-4fad-9aa5-61747755f2a1","added_by":"auto","created_at":"2026-02-24 12:57:16","extension":"png","order_by":7,"title":"Figure 7","display":"","copyAsset":false,"role":"figure","size":573043,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eRuntime comparison of protein search methods. \u003c/strong\u003ePer-query runtime is evaluated for searches against a target database comprising 7735 proteins under different execution modes, including single-core CPU, 16-core CPU, GPU acceleration, and GPU execution with pre-built target databases (GPU+DB). Execution modes are selected to reflect the typical and recommended usage patterns of each method.\u003c/p\u003e","description":"","filename":"floatimage7.png","url":"https://assets-eu.researchsquare.com/files/rs-8796067/v1/e403386e58df7799cf764050.png"},{"id":103326543,"identity":"613de4e6-0eed-4169-9fb9-1a44d11fdfa4","added_by":"auto","created_at":"2026-02-24 12:57:04","extension":"png","order_by":8,"title":"Figure 8","display":"","copyAsset":false,"role":"figure","size":218918,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eBenchmark comparison across all datasets. \u003c/strong\u003eEach row represents a protein search method, and columns correspond to different evaluation metrics grouped by category, including CATH classification, functional consistency, local structural similarity (lDDT), disorder, and structure confidence (pLDDT). Colored dots indicate performance on individual metrics, with dot size proportional to the metric rank. 
Group labels and horizontal bars above the x-axis denote the scope of each evaluation category.\u003c/p\u003e","description":"","filename":"floatimage8.png","url":"https://assets-eu.researchsquare.com/files/rs-8796067/v1/a71b4ae4609d60c2e333c9a5.png"},{"id":103326548,"identity":"ce70c38e-ea59-4825-84b7-7b4fd8303236","added_by":"auto","created_at":"2026-02-24 12:57:11","extension":"png","order_by":9,"title":"Figure 9","display":"","copyAsset":false,"role":"figure","size":2554159,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eVisualization of the protein domain embedding space using t-SNE. \u003c/strong\u003eEach subplot shows a 2D t-SNE projection of CATH-S20 protein domain embeddings generated by a representation-based method (GraSR, TMvec, PLMSearch, DHR, or FoldExplorer). Domains are colored according to their CATH class. This visualization illustrates the global structure of the protein embedding space captured by each method, highlighting how domains with similar secondary structure classes cluster together.\u003c/p\u003e","description":"","filename":"floatimage9.png","url":"https://assets-eu.researchsquare.com/files/rs-8796067/v1/d5f3e9a6e9eb555586e99057.png"},{"id":103326673,"identity":"6806e952-cb85-4628-8821-62bed9710f61","added_by":"auto","created_at":"2026-02-24 12:57:40","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":6057874,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-8796067/v1/539c91d1-dda0-4f2d-b013-084513de0e70.pdf"},{"id":103326538,"identity":"d5a886d0-b2a9-4ae9-a07c-632361d0aaae","added_by":"auto","created_at":"2026-02-24 
12:57:03","extension":"docx","order_by":0,"title":"","display":"","copyAsset":false,"role":"supplement","size":907993,"visible":true,"origin":"","legend":"","description":"","filename":"supportinginformation.docx","url":"https://assets-eu.researchsquare.com/files/rs-8796067/v1/d34b461021c58ddeab4d6b27.docx"}],"financialInterests":"No competing interests reported.","formattedTitle":"Benchmarking protein sequence and structure search methods for remote homology detection","fulltext":[{"header":"Introduction","content":"\u003cp\u003eQuantifying similarity between proteins underlies a wide range of problems in computational biology[\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e, \u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e]. Protein sequence and structure similarity search is a core operation in biological research, supporting protein annotation[\u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e], evolutionary analysis[\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e], and large-scale functional inference[\u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e, \u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e6\u003c/span\u003e]. Over the past decade, this task has undergone rapid expansion driven primarily by data growth. 
Protein sequence databases have continued to increase exponentially[\u003cspan additionalcitationids=\"CR8\" citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e], and recent advances in structure prediction[\u003cspan additionalcitationids=\"CR11 CR12 CR13\" citationid=\"CR10\" class=\"CitationRef\"\u003e10\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e] have made high-quality protein structures available at the proteome scale[\u003cspan additionalcitationids=\"CR16 CR17\" citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e]. Predicted structures now complement experimental data for a vast number of proteins, substantially enlarging both the size and the diversity of searchable protein databases[\u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e19\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eThe rapid growth of protein sequence and structure databases has placed increasing demands on both the efficiency and accuracy of similar protein search, motivating the development of a broad range of computational approaches. Currently, dozens of methods are available, spanning both classical alignment and more recent deep learning techniques, with their properties describable along several complementary perspectives.\u003c/p\u003e \u003cp\u003eFirst, protein search methods differ in their search modality, operating either on protein sequences or on three-dimensional structures. 
Sequence-based homology search tools, such as BLAST[\u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e20\u003c/span\u003e] and related approaches[\u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e], remain widely used due to their scalability and effectiveness in identifying members of protein families[\u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e22\u003c/span\u003e]. Structure-based methods, including classical alignment approaches such as TMalign[\u003cspan citationid=\"CR23\" class=\"CitationRef\"\u003e23\u003c/span\u003e] and DALI[\u003cspan citationid=\"CR24\" class=\"CitationRef\"\u003e24\u003c/span\u003e], leverage spatial information to detect structural similarity and remote homology that may not be apparent from sequences alone[\u003cspan citationid=\"CR25\" class=\"CitationRef\"\u003e25\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eSecond, the methods vary in their matching granularity[\u003cspan citationid=\"CR26\" class=\"CitationRef\"\u003e26\u003c/span\u003e]. Some approaches emphasize global comparison[\u003cspan citationid=\"CR27\" class=\"CitationRef\"\u003e27\u003c/span\u003e], aiming to assess overall similarity between entire proteins, while others support local or fragment level matching, enabling the identification of shared substructures or partial similarities[\u003cspan citationid=\"CR28\" class=\"CitationRef\"\u003e28\u003c/span\u003e]. This distinction is particularly relevant for proteins with complex or multi-domain architectures, where biologically meaningful similarity may be confined to specific regions[\u003cspan citationid=\"CR29\" class=\"CitationRef\"\u003e29\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eThird, protein search methods can be distinguished by the space in which the similarity is computed. 
Classical approaches typically perform direct comparison in the original observed sequence or structure space, relying on alignments to quantify the similarity[\u003cspan citationid=\"CR30\" class=\"CitationRef\"\u003e30\u003c/span\u003e, \u003cspan citationid=\"CR31\" class=\"CitationRef\"\u003e31\u003c/span\u003e]. In contrast, more recent representation-based methods encode proteins into latent embeddings and perform similarity search in a learned feature space, enabling fast retrieval over large databases and improved robustness to structural variability[\u003cspan additionalcitationids=\"CR33 CR34 CR35 CR36\" citationid=\"CR32\" class=\"CitationRef\"\u003e32\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR37\" class=\"CitationRef\"\u003e37\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eBeyond these aspects, many modern protein search methods adopt hybrid designs that combine multiple paradigms rather than fitting cleanly into a single category. For example, Foldseek[\u003cspan citationid=\"CR38\" class=\"CitationRef\"\u003e38\u003c/span\u003e] integrates representation learning with the alignment by converting protein structures into discrete 3Di sequences and applying fast sequence alignment. Other approaches, such as PLMSearch[\u003cspan citationid=\"CR39\" class=\"CitationRef\"\u003e39\u003c/span\u003e] and TMvec[\u003cspan citationid=\"CR40\" class=\"CitationRef\"\u003e40\u003c/span\u003e], operate on sequence inputs but are trained with structure-based similarity supervision, thereby implicitly encoding structural information into sequence-derived representations. 
Representation-based methods further differ in the type of information they encode, ranging from structure representations (e.g., GraSR[\u003cspan citationid=\"CR41\" class=\"CitationRef\"\u003e41\u003c/span\u003e]) to sequence representations (e.g., DHR[\u003cspan citationid=\"CR42\" class=\"CitationRef\"\u003e42\u003c/span\u003e]) and joint sequence-structure representations, as used in FoldExplorer[\u003cspan citationid=\"CR43\" class=\"CitationRef\"\u003e43\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eAcross these characteristics, different combinations of design choices have produced a diverse methodological landscape, with different classes of methods tailored to the trade-offs between sensitivity, robustness, and computational efficiency when searching large protein databases.\u003c/p\u003e \u003cp\u003eDespite this methodological diversity, the systematic evaluation of protein search methods has lagged behind their development[\u003cspan citationid=\"CR44\" class=\"CitationRef\"\u003e44\u003c/span\u003e]. Existing benchmarks are often limited in scope, focusing on curated datasets dominated by single-domain, well-structured proteins and evaluating methods under heterogeneous experimental settings[\u003cspan citationid=\"CR45\" class=\"CitationRef\"\u003e45\u003c/span\u003e, \u003cspan citationid=\"CR46\" class=\"CitationRef\"\u003e46\u003c/span\u003e]. As a result, reported performance is difficult to compare across settings, and it remains unclear how different classes of methods perform relative to one another when assessed under consistent conditions. 
The lack of unified benchmarking complicates both the method development and practical tool selection.\u003c/p\u003e \u003cp\u003eMoreover, most benchmarks emphasize the homology or fold-level similarity, while other biologically relevant aspects of protein search remain underexplored[\u003cspan citationid=\"CR47\" class=\"CitationRef\"\u003e47\u003c/span\u003e, \u003cspan citationid=\"CR48\" class=\"CitationRef\"\u003e48\u003c/span\u003e]. In particular, large-scale assessments of the functional consistency in search results are scarce, despite the fact that functional relatedness does not always align with the sequence or structural similarity. Similarly, proteins with complex architectures, such as multi-domain proteins, or those enriched in intrinsically disordered regions (IDRs)[\u003cspan citationid=\"CR49\" class=\"CitationRef\"\u003e49\u003c/span\u003e] are rarely examined, even though such proteins are abundant in real proteomes[\u003cspan citationid=\"CR50\" class=\"CitationRef\"\u003e50\u003c/span\u003e]. In addition, predicted structures are inherently heterogeneous in the quality, yet the impact of structural confidence on different search strategies has not been systematically analyzed[\u003cspan citationid=\"CR51\" class=\"CitationRef\"\u003e51\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eTo address these gaps, we present a unified benchmark for protein sequence and structure search approaches. We systematically compare representative sequence-based, structure-based, and representation-based methods on the CATH[\u003cspan citationid=\"CR52\" class=\"CitationRef\"\u003e52\u003c/span\u003e] dataset using a unified framework with consistent benchmark datasets, evaluation metrics, and search protocols. To assess the biological relevance, we further evaluate the performance on a dataset emphasizing the functional consistency of retrieved results. 
We explicitly examine the behavior of different methods on multi-domain proteins and contrast it with corresponding single-domain proteins. In addition, we evaluate protein search on intrinsically disordered proteins (IDPs), highlighting a challenging scenario in which current methods consistently underperform. We also analyze the impact of predicted structure confidence, measured by pLDDT[\u003cspan citationid=\"CR53\" class=\"CitationRef\"\u003e53\u003c/span\u003e], on the search performance. Finally, we benchmark the computational efficiency of individual methods under typical usage conditions.\u003c/p\u003e \u003cp\u003eThis benchmark offers a comprehensive and realistic evaluation of protein search methods against rapidly expanding, prediction-driven protein databases. By facilitating fair comparisons across paradigms and highlighting method-specific strengths and limitations in diverse biological contexts, it provides practical guidance for tool selection and lays a foundation for the development of protein search approaches.\u003c/p\u003e \u003cp\u003eTo the best of our knowledge, this work presents the first unified benchmark that systematically evaluates protein search methods across a diverse set of biologically challenging scenarios, including intrinsically disordered proteins, multi-domain architectures, and varying levels of predicted structure confidence. 
By explicitly examining how search performance depends on disorder content and pLDDT-derived structure reliability, our benchmark extends beyond traditional structure- or sequence-centric evaluations and reflects the practical conditions encountered in modern, prediction-driven protein databases.\u003c/p\u003e"},{"header":"Results","content":"\u003cdiv id=\"Sec3\" class=\"Section2\"\u003e \u003ch2\u003eOverview of benchmark design and evaluated methods\u003c/h2\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eTo systematically assess protein sequence and structure search methods under realistic and comparable conditions, we designed a unified benchmark encompassing five complementary biological scenarios. These scenarios capture key challenges encountered in contemporary protein search, including fold-level structural similarity, functional consistency, architectural complexity, intrinsic disorder, and the quality of predicted structural data.\u003c/p\u003e \u003cp\u003eWe evaluated representative methods spanning different search paradigms, including sequence-based, structure-based alignment, and representation-based approaches (Table\u0026nbsp;1). All methods were assessed using consistent datasets, evaluation metrics, and experimental protocols within each scenario, enabling fair comparison across paradigms. Some of the scenarios are particularly challenging, such as searching for similar multi-domain proteins or for similar intrinsically disordered proteins, which lack stable structures. 
An overview of the benchmark design and evaluated scenarios is shown in Fig.\u0026nbsp;1.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab1\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 1\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003e\u003cb\u003eProtein search tools used in the benchmark\u003c/b\u003e \u003csup\u003ea\u003c/sup\u003e\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"3\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eCategory\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eTool\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eSource Repository\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\" morerows=\"3\" rowspan=\"4\"\u003e \u003cp\u003eStructure alignment\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eGTalign[\u003cspan citationid=\"CR54\" class=\"CitationRef\"\u003e54\u003c/span\u003e]\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://github.com/minmarg/gtalign_alpha\u003c/span\u003e\u003cspan address=\"https://github.com/minmarg/gtalign_alpha\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e 
\u003cp\u003eTMalign[\u003cspan citationid=\"CR23\" class=\"CitationRef\"\u003e23\u003c/span\u003e]\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://www.aideepmed.com/TM-align/\u003c/span\u003e\u003cspan address=\"https://www.aideepmed.com/TM-align/\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eDali[\u003cspan citationid=\"CR24\" class=\"CitationRef\"\u003e24\u003c/span\u003e]\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttp://ekhidna2.biocenter.helsinki.fi/dali\u003c/span\u003e\u003cspan address=\"http://ekhidna2.biocenter.helsinki.fi/dali\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eFoldseek[\u003cspan citationid=\"CR38\" class=\"CitationRef\"\u003e38\u003c/span\u003e]\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://github.com/steineggerlab/foldseek\u003c/span\u003e\u003cspan address=\"https://github.com/steineggerlab/foldseek\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\" morerows=\"4\" rowspan=\"5\"\u003e \u003cp\u003eRepresentation-based\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eGraSR[\u003cspan citationid=\"CR41\" class=\"CitationRef\"\u003e41\u003c/span\u003e]\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" 
colname=\"c3\"\u003e \u003cp\u003e\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://github.com/chunqiux/GraSR\u003c/span\u003e\u003cspan address=\"https://github.com/chunqiux/GraSR\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eTMvec[\u003cspan citationid=\"CR40\" class=\"CitationRef\"\u003e40\u003c/span\u003e]\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://github.com/tymor22/tm-vec\u003c/span\u003e\u003cspan address=\"https://github.com/tymor22/tm-vec\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003ePLMSearch[\u003cspan citationid=\"CR39\" class=\"CitationRef\"\u003e39\u003c/span\u003e]\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://dmiip.sjtu.edu.cn/PLMSearch\u003c/span\u003e\u003cspan address=\"https://dmiip.sjtu.edu.cn/PLMSearch\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eDHR[\u003cspan citationid=\"CR42\" class=\"CitationRef\"\u003e42\u003c/span\u003e]\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://github.com/ml4bio/Dense-Homolog-Retrieval\u003c/span\u003e\u003cspan address=\"https://github.com/ml4bio/Dense-Homolog-Retrieval\" targettype=\"URL\" 
class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eFoldExplorer[\u003cspan citationid=\"CR43\" class=\"CitationRef\"\u003e43\u003c/span\u003e]\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://github.com/YuanLiu-SJTU/FoldExplorer\u003c/span\u003e\u003cspan address=\"https://github.com/YuanLiu-SJTU/FoldExplorer\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\" morerows=\"3\" rowspan=\"4\"\u003e \u003cp\u003eSequence alignment\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eBLAST[\u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e20\u003c/span\u003e]\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://blast.ncbi.nlm.nih.gov/Blast.cgi\u003c/span\u003e\u003cspan address=\"https://blast.ncbi.nlm.nih.gov/Blast.cgi\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eDiamond[\u003cspan citationid=\"CR55\" class=\"CitationRef\"\u003e55\u003c/span\u003e]\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://github.com/bbuchfink/diamond\u003c/span\u003e\u003cspan address=\"https://github.com/bbuchfink/diamond\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" 
colname=\"c2\"\u003e \u003cp\u003ejackhmmer[\u003cspan citationid=\"CR56\" class=\"CitationRef\"\u003e56\u003c/span\u003e]\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttp://hmmer.org/download.html\u003c/span\u003e\u003cspan address=\"http://hmmer.org/download.html\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eMMseqs[\u003cspan citationid=\"CR57\" class=\"CitationRef\"\u003e57\u003c/span\u003e]\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://github.com/soedinglab/MMseqs2\u003c/span\u003e\u003cspan address=\"https://github.com/soedinglab/MMseqs2\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003e \u003csup\u003ea\u003c/sup\u003e Detailed information about the tools\u0026rsquo; version and parameter values used in this study is provided in the \u0026ldquo;Methods\u0026rdquo; section.\u003c/p\u003e \u003c/div\u003e\n\u003ch3\u003eFold classification performance on the CATH dataset\u003c/h3\u003e\n\u003cp\u003eFold classification performance was evaluated on the CATH-S20 dataset, which organizes protein structures into a four-level hierarchy with the increasing specificity: Class, Architecture, Topology, and Homologous superfamily. 
Performance was quantified using the sensitivity up to the first false positive (detailed information is provided in the \u0026ldquo;Methods\u0026rdquo; section), a stringent metric reflecting the ability of a method to retrieve true structural neighbors before any incorrect match is introduced. Results are shown in Fig.\u0026nbsp;2, and the corresponding area under the curve (AUC) values are reported in Supplementary Table \u003cspan refid=\"MOESM1\" class=\"InternalRef\"\u003eS1\u003c/span\u003e.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eAcross the four CATH levels, structure-based alignment methods achieved the highest sensitivity, demonstrating strong capability in capturing structural similarity even at fine-grained classification levels. Representation-based methods generally ranked second, outperforming sequence-based approaches while remaining less sensitive than explicit structural alignment.\u003c/p\u003e \u003cp\u003eSpecifically, we quantify the sensitivity of each method using the area under the sensitivity up to the first false positive curve. Under this metric, among the five representation-based approaches, GraSR achieved the highest sensitivity at the Class and Architecture levels, whereas FoldExplorer performed best at the Topology and Homologous Superfamily levels. The best-performing representation-based methods achieved sensitivities of 51.8%, 71.7%, 79.2%, and 88.1% of the performance of structure-based alignment methods across the four hierarchical CATH levels, respectively. Sensitivity consistently increased at deeper hierarchical levels and gradually approached the performance of structure-based alignment methods. Foldseek, which combines representation learning with alignment-based matching, also demonstrated consistently high sensitivity across all hierarchical levels. 
In contrast, the best sequence-based method, BLAST, achieved relative sensitivities of only 1.8%, 7.8%, 18.4%, and 27.5% compared with the best-performing structure alignment methods, underscoring the limited capability of sequence alignment methods in detecting remote homology relationships.\u003c/p\u003e \u003cp\u003eAlthough these trends were consistent at the level of method categories, differences among individual methods were observed. For example, the structure-based representation method GraSR exhibited slightly lower sensitivity than large language model-based sequence representations such as PLMSearch at the Topology and Homologous superfamily levels, while outperforming other representation-based methods at the Class and Architecture levels. This pattern reflects a general distinction between structure-based and sequence-based representations: structural representations tend to enhance discrimination at coarse structural levels, while sequence-based representations capture homology-driven similarity at finer hierarchical resolutions.\u003c/p\u003e \u003cp\u003eOverall, our results show that structure-based alignment methods remain the most effective approach for fold classification in high-quality structure databases such as CATH. Representation-based approaches nonetheless offer a competitive and practical alternative in terms of accuracy and efficiency, especially for large-scale structure databases, where alignment-based comparison incurs a high time cost.\u003c/p\u003e\n\u003ch3\u003eAssessing the functional consistency in protein search\u003c/h3\u003e\n\u003cp\u003eStructural or sequence similarity does not necessarily imply functional relatedness[\u003cspan citationid=\"CR58\" class=\"CitationRef\"\u003e58\u003c/span\u003e]. 
To assess the biological relevance of protein search results, we evaluated the functional consistency using a SwissProt dataset with sequence identity filtered to less than 20%, thereby reducing redundancy and emphasizing remote functional relationships (Fig.\u0026nbsp;3). Functional consistency was quantified by measuring the overlap of Gene Ontology[\u003cspan citationid=\"CR59\" class=\"CitationRef\"\u003e59\u003c/span\u003e] Molecular Function (GO-MF) annotations between query proteins and their top 10 retrieved hits.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eIn contrast to the fold-level results observed on CATH-S20, methods based on global structural alignment, such as TMalign and GTalign, did not achieve high functional consistency in this evaluation. Representation-based methods such as FoldExplorer, which also use global structural representations but additionally leverage large protein language models, consistently demonstrated improved functional coherence among top-ranked results, achieving a 4.7% increase over TMalign. This improvement was observed regardless of whether ESM[\u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e12\u003c/span\u003e, \u003cspan citationid=\"CR60\" class=\"CitationRef\"\u003e60\u003c/span\u003e] or ProtTrans[\u003cspan citationid=\"CR61\" class=\"CitationRef\"\u003e61\u003c/span\u003e] embeddings were used. 
By contrast, sequence-based alignment alone did not produce such gains, highlighting the key role of protein language models in capturing functional signals[\u003cspan citationid=\"CR62\" class=\"CitationRef\"\u003e62\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eOn the other hand, methods based on local structural alignment, including Foldseek, achieved an even greater improvement of 7.0% over TMalign, suggesting that capturing local structural motifs is particularly effective for identifying functionally related proteins under conditions of low sequence identity.\u003c/p\u003e \u003cp\u003eThe comparison between sequence-based and structure-based approaches under this evaluation highlights a fundamental difference in how similarity relates to functional consistency. Because all protein pairs in this dataset share less than 20% sequence identity, sequence alignment operates in the classical \u0026ldquo;twilight zone\u0026rdquo;[\u003cspan citationid=\"CR63\" class=\"CitationRef\"\u003e63\u003c/span\u003e], where low sequence similarity no longer reliably reflects functional relatedness. In contrast, protein structures are more conserved than sequences during evolution, particularly around functional regions such as active sites and interaction interfaces. As a result, structure-based search methods, especially those capturing local structural similarity, are better suited for identifying functionally related proteins under remote homology conditions, directly addressing the challenge of functional consistency in protein search.\u003c/p\u003e \u003cp\u003eFunctional consistency represents an independent and biologically meaningful dimension of protein relatedness that is not fully captured by sequence or structural similarity alone, and therefore requires explicit evaluation in protein search benchmarks. 
Nevertheless, our evaluation indicates that under remote homology conditions, functional consistency is more closely associated with structural similarity than with sequence similarity.\u003c/p\u003e\n\u003ch3\u003eSearch performance on multi-domain proteins with diverse architectures\u003c/h3\u003e\n\u003cp\u003eMulti-domain proteins are widespread in proteomes, accounting for over 50% of known proteins and more than 70% in eukaryotic organisms, and typically exhibit complex architectures composed of multiple structural units[\u003cspan citationid=\"CR64\" class=\"CitationRef\"\u003e64\u003c/span\u003e]. In such cases, biologically meaningful similarity is frequently confined to individual domains or conserved substructures, while global folds, domain arrangements, and inter-domain orientations may differ substantially. These characteristics make multi-domain protein search fundamentally more challenging than single-domain fold recognition and limit the effectiveness of purely global similarity measures.\u003c/p\u003e \u003cp\u003eTo systematically assess how different search approaches handle this architectural complexity, we compared search performance on full-length multi-domain proteins with that obtained after decomposing the same proteins into their constituent single-domain units based on annotations from The Encyclopedia of Domains (TED)[\u003cspan citationid=\"CR65\" class=\"CitationRef\"\u003e65\u003c/span\u003e]. Search performance was evaluated using the local Distance Difference Test (lDDT)[\u003cspan citationid=\"CR66\" class=\"CitationRef\"\u003e66\u003c/span\u003e], a reference-free metric that quantifies local structural similarity without requiring explicit structural superposition[\u003cspan citationid=\"CR38\" class=\"CitationRef\"\u003e38\u003c/span\u003e]. The lDDT metric used in this study follows a slightly modified definition relative to the original formulation (details are provided in the \u0026ldquo;Methods\u0026rdquo; section). 
The calculation of lDDT relies on residue-level correspondences defined by an alignment. For methods without an inherent alignment output, alignments were obtained using structure-based alignment tools prior to lDDT computation.\u003c/p\u003e \u003cp\u003eSpecifically, residue-level alignments were generated using four established structure alignment methods: Dali, TMalign, GTalign, and Foldseek. Across all evaluated query-hit pairs, Dali consistently produced the highest lDDT scores, followed by TMalign and GTalign with comparable performance, while Foldseek yielded lower lDDT scores, reflecting a design focused on fast structural retrieval rather than high-precision residue-level alignment (Supplementary Figure \u003cspan refid=\"MOESM1\" class=\"InternalRef\"\u003eS1\u003c/span\u003e). To ensure a method-agnostic evaluation of structural consistency, independent of the alignment strategy used, we therefore adopted the maximum lDDT score obtained across the four alignment methods as the final lDDT score for each query-hit pair.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eFor full-length multi-domain proteins, structure-based alignment methods generally achieved higher lDDT scores than representation-based approaches, with local structural alignment methods, such as Dali and Foldseek, performing best. Global structural alignment methods, while slightly weaker than local alignment, still consistently outperformed representation-based methods, indicating that representation-based approaches capture local structural features less precisely when evaluating multi-domain architectures.\u003c/p\u003e \u003cp\u003eInterestingly, as shown in Fig.\u0026nbsp;4, after decomposing multi-domain proteins into their constituent single-domain units, the performance of representation-based methods improved substantially. Among the five methods we tested, the three sequence-based representation approaches exhibited modest gains of 3\u0026ndash;10%. 
The largest improvements were observed for structure-informed representation methods, GraSR and FoldExplorer, which achieved increases of 32.1% and 16.1%, respectively. Notably, the hybrid sequence-structure method FoldExplorer surpassed global alignment methods like TMalign and achieved lDDT scores comparable to those of local structural alignment methods. These results suggest that domain decomposition can reveal strengths of representation-based methods that are otherwise masked in full-length multi-domain comparisons.\u003c/p\u003e \u003cp\u003eThe performance shifts highlight that the effectiveness of different search paradigms depends strongly on the evaluation context. Local alignment-based methods demonstrate strong robustness to architectural complexity, maintaining high functional consistency even for multi-domain proteins, whereas representation-based methods benefit substantially from domain-level resolution, particularly when incorporating structural information.\u003c/p\u003e\n\u003ch3\u003eSearch performance on intrinsically disordered proteins\u003c/h3\u003e\n\u003cp\u003eIntrinsically disordered proteins (IDPs) and regions (IDRs) play critical roles in cellular regulation, signaling, and molecular recognition, and constitute a substantial fraction of proteomes. Computational analyses suggest that approximately one-third of eukaryotic proteins contain long intrinsically disordered regions and that, in the human proteome, an estimated 37\u0026ndash;50% of all amino acid residues are intrinsically disordered[\u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e10\u003c/span\u003e, \u003cspan citationid=\"CR67\" class=\"CitationRef\"\u003e67\u003c/span\u003e]. Unlike well-folded proteins, IDPs lack a stable tertiary structure under physiological conditions and often exert their functions through transient, context-dependent interactions. 
As a result, identifying functionally related IDPs remains an important and challenging task for protein search methods.\u003c/p\u003e \u003cp\u003eTo systematically evaluate the protein search performance in this scenario, we constructed a benchmark set based on experimentally validated IDPs from DisProt[\u003cspan citationid=\"CR68\" class=\"CitationRef\"\u003e68\u003c/span\u003e]. A total of 534 IDP sequences were used as queries, with the pairwise sequence identity below 20%. These queries were searched against a SwissProt human dataset consisting of 6955 proteins, also filtered to 20% sequence identity, ensuring the reduced redundancy within and between the query and target sets. Functional relevance was assessed using GO semantic similarity computed for the top 10 query-target pairs.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eAcross all the evaluated methods, the functional consistency on the IDP dataset was uniformly low, with a median GO semantic similarity below 0.4, only marginally exceeding random expectations. This behavior was observed consistently across alignment-based, representation-based, and hybrid search paradigms, with substantially diminished performance differences compared to evaluations on structured proteins. Notably, no method demonstrated a clear advantage in distinguishing functionally related IDPs from unrelated targets.\u003c/p\u003e \u003cp\u003eThis low functional consistency cannot be attributed to dataset quality or the absence of meaningful biological relationships. Across all proteins containing disordered regions\u0026mdash;ranging from entirely disordered proteins to proteins with only partial disorder\u0026mdash;existing methods consistently underestimate functional similarity, with predicted values roughly 50% lower than the ground-truth functional similarity (around 0.8). 
This indicates that relevant functional signals are present but are not effectively captured by current protein search approaches.\u003c/p\u003e \u003cp\u003eThe lack of stable tertiary structure limits the applicability of structure-based alignment (Supplementary Figure S2), while the high sequence variability and weak evolutionary constraints of IDPs reduce the effectiveness of sequence-based and representation-based similarity measurements. Moreover, IDP function is often mediated by short linear motifs and context-specific interactions[\u003cspan citationid=\"CR69\" class=\"CitationRef\"\u003e69\u003c/span\u003e], which are difficult to capture using global similarity metrics. These results indicate that accurate remote homology search for IDPs remains a significant challenge for current protein search approaches.\u003c/p\u003e \u003cp\u003eOur benchmark results demonstrate that IDPs represent a particularly challenging scenario for protein similarity search. The uniformly low performance across paradigms highlights a shared limitation of current methods and underscores the need for evaluation frameworks and search strategies specifically tailored to the unique biological properties of disordered proteins.\u003c/p\u003e \u003cdiv id=\"Sec8\" class=\"Section2\"\u003e \u003ch2\u003eImpact of predicted structure quality on the search performance\u003c/h2\u003e \u003cp\u003eRecent advances in protein structure prediction, exemplified by AlphaFold and related methods, have enabled the generation of large-scale, proteome-wide structural databases. As a result, a typical and increasingly important application of protein structure search is to query against the databases composed predominantly of predicted rather than experimentally determined structures. 
In this context, understanding how prediction uncertainty influences search performance becomes critical.\u003c/p\u003e \u003cp\u003ePredicted protein structures vary substantially in quality, with local confidence commonly quantified by pLDDT scores. To examine how the uncertainty of predicted models affects protein search, we stratified predicted structures into confidence intervals based on pLDDT and systematically evaluated the search performance across these groups (Fig.\u0026nbsp;6).\u003c/p\u003e \u003cp\u003eDecreasing structural confidence was consistently associated with degraded search performance across all evaluated structure-based methods. Relative to accurately predicted structures (pLDDT\u0026thinsp;\u0026ge;\u0026thinsp;90), structure alignment-based approaches, including GTalign, TMalign, and Dali, showed substantial reductions in the mean GO semantic similarity of their top-10 hits when applied to low-confidence predicted structures (pLDDT\u0026thinsp;\u0026lt;\u0026thinsp;70), with decreases of 36.4%, 36.1%, and 22.9%, respectively (\u003cem\u003ep-value\u003c/em\u003e\u0026thinsp;\u0026lt;\u0026thinsp;0.001). Methods that fully or partially incorporate structural representations, namely GraSR, Foldseek, and FoldExplorer, experienced more moderate declines of 22.7%, 6.9%, and 8.7%, respectively. In contrast, sequence-derived representation methods, including TMvec, PLMSearch, and DHR, exhibited only minor decreases of 6.6%, 7.0%, and 5.2%, respectively. Serving as a control, traditional sequence alignment methods displayed no measurable degradation across pLDDT intervals, indicating that the observed performance losses are specifically attributable to structural prediction uncertainty rather than dataset effects. 
As illustrated in Fig.\u0026nbsp;6, methods leveraging large language model-based sequence representations, such as TMvec, PLMSearch, DHR, and FoldExplorer, consistently outperformed both structure alignment-based and structure representation-based approaches in the moderate-confidence (70\u0026thinsp;\u0026le;\u0026thinsp;pLDDT\u0026thinsp;\u0026lt;\u0026thinsp;90) and low-confidence (pLDDT\u0026thinsp;\u0026lt;\u0026thinsp;70) regimes.\u003c/p\u003e \u003cp\u003eThese observations reveal that the sensitivity to prediction uncertainty differed markedly between search paradigms. Structure alignment-based methods showed pronounced performance drops in the presence of low-confidence regions, reflecting their reliance on accurate local geometry. In contrast, representation-based approaches were more tolerant to moderate levels of structural uncertainty, and although their performance also deteriorated when large fractions of the structure exhibited low confidence, they generally remained superior to traditional sequence alignment methods.\u003c/p\u003e \u003cp\u003eThese results suggest that predicted structures do not provide a uniformly reliable substrate for protein search, and their utility depends critically on both structural confidence and the underlying search paradigm. While moderate prediction uncertainty can be tolerated by representation-based methods, extensive low-confidence regions fundamentally compromise the informativeness of structural features for retrieval. 
Consequently, the effectiveness of structure-based search over large predicted structure databases is inherently constrained by the model reliability, underscoring the importance of integrating confidence awareness into search design and result interpretation.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003c/div\u003e\n\u003ch3\u003eComputational efficiency\u003c/h3\u003e\n\u003cp\u003eComputational efficiency is a critical practical consideration for large-scale protein search, particularly for applications involving comprehensive databases such as the AlphaFold Database (AFDB, ~\u0026thinsp;241\u0026nbsp;million structures). Using the same dataset as the functional search benchmark (7,735 targets), we compared per-query runtime across different protein search paradigms under commonly used execution modes (Fig.\u0026nbsp;7). For traditional sequence and structure alignment methods, we measured runtimes on a single CPU core (CPU-1) and on 16 CPU cores (CPU-16) of an AMD EPYC 9654 processor. Representation-based methods, which are deep learning-based, support GPU acceleration and allow pre-building the target database. Therefore, we evaluated them in two modes: using a GPU alone and using a GPU with a pre-built target database (GPU\u0026thinsp;+\u0026thinsp;DB). The GPU used was an NVIDIA GeForce RTX 4090 D.\u003c/p\u003e \u003cp\u003eSequence alignment methods are consistently the fastest, enabling high-throughput searches even with modest computational resources. In fact, sequence-based methods can be 5\u0026ndash;6 orders of magnitude faster than traditional structure alignment methods, which are substantially more expensive due to the intrinsic cost of explicit structural superposition and remain computationally demanding even with parallelization. 
Representation-based approaches fall between sequence- and structure-based methods, achieving runtimes roughly 3 orders of magnitude lower than structure alignment while retaining sensitivity beyond the sequence level.\u003c/p\u003e \u003cp\u003eThese runtime differences are explained by fundamentally different scalability characteristics. Representation-based methods incur most of their cost during embedding computation, which can be precomputed, leading to an overall complexity that scales approximately linearly with the number of queries and targets in practice. Importantly, target embeddings can also be reused, allowing the one-time indexing cost to be amortized across repeated searches and making such methods well suited for large or growing databases. In contrast, alignment-based methods are dominated by pairwise comparisons, resulting in a computational complexity that scales with the product of query and target set sizes, which fundamentally limits scalability despite hardware acceleration.\u003c/p\u003e \u003cp\u003eOverall, our benchmark results suggest that representation-based methods provide a favorable balance between computational efficiency and expressive power, making them particularly well suited for large-scale protein search.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e"},{"header":"Summary","content":"\u003cp\u003eThe overall rankings of each protein search method across all datasets and metrics (Fig.\u0026nbsp;8) highlight clear differences in their strengths across application scenarios. Global structural alignment methods (GTalign and TMalign) consistently perform well in CATH classification, making them particularly suitable for high-accuracy fold recognition and homologous superfamily assignment when reliable structures are available. 
In contrast, local structural alignment methods (Foldseek and Dali) show superior performance in functional consistency and local structural similarity (lDDT metric), indicating their advantage in capturing functionally relevant substructures and conserved local motifs, especially in cases where global folds diverge or are only partially conserved.\u003c/p\u003e \u003cp\u003eEmbedding-based representation methods (DHR and FoldExplorer) primarily excel in their robustness across diverse and noisy structural contexts, maintaining stable performance in functional consistency even when global pLDDT is low or when proteins contain partially disordered regions. By operating in a continuous representation space rather than relying on strict residue-level alignment, these methods are less sensitive to local structural inaccuracies and fold divergence. In particular, approaches that integrate structural cues with learned representations achieve a more reliable balance across evaluation categories, making them especially suitable for exploratory searches, functional inference, and large-scale protein space analysis, where robustness and generalization are more critical than exact structural alignment.\u003c/p\u003e \u003cp\u003eSequence alignment methods (BLAST, Diamond, jackhmmer, MMseqs) are effective for close homolog detection but show clear limitations in structural and functional benchmarks, particularly for remote homology and accurate predicted structures.\u003c/p\u003e \u003cp\u003eBeyond accuracy, computational efficiency and scalability also play a decisive role in practical large-scale applications. Embedding-based approaches offer substantial advantages in speed and scalability, enabling rapid similarity search over millions of proteins. 
These properties make representation-based methods particularly attractive for high-throughput annotation and proteome-wide exploration.\u003c/p\u003e \u003cp\u003eThese results suggest that no single method is optimal for all tasks. The choice of method should therefore be guided by the specific application scenario, data quality, and scale of analysis.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e"},{"header":"Discussions","content":"\u003cdiv id=\"Sec12\" class=\"Section2\"\u003e \u003ch2\u003eApplicability boundaries of protein search paradigms\u003c/h2\u003e \u003cp\u003eRather than revealing a best-performing tool, this benchmark highlights the existence of clear applicability boundaries among different protein search paradigms. Sequence-based, structure-based alignment, and representation-based approaches rely on fundamentally different biological assumptions, and therefore no single paradigm provides a universally optimal solution.\u003c/p\u003e \u003cp\u003eSequence-based search relies on the conservation of primary structure and is inherently well suited for identifying close homologs and annotating proteins within well-characterized families. Structure-based alignment, by contrast, operates in a higher-level geometric space and is capable of detecting remote relationships that are invisible at the sequence level, particularly at the fold or topology scale. Representation-based methods occupy an intermediate and more flexible regime, compressing sequence or structural information into learned embeddings that can generalize across large databases but may blur fine-grained biological distinctions.\u003c/p\u003e \u003cp\u003eCrucially, these paradigms diverge most strongly when evaluated across heterogeneous biological contexts. Differences in the sequence identity, domain composition, structural modularity, and disorder content systematically shift the relative strengths of each approach. 
As a result, performance rankings are not stable across tasks but instead depend on the biological scenario under consideration. This instability is not a limitation of individual methods, but a reflection of the intrinsic complexity of protein relatedness.\u003c/p\u003e \u003cp\u003eThese observations emphasize that protein search performance is inherently context dependent. Evaluations spanning multiple biological scenarios can help to provide practical guidance for users in selecting appropriate tools for specific tasks, while also offering insights for future method development by clarifying how different design choices perform under varying conditions.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec13\" class=\"Section2\"\u003e \u003ch2\u003eMetric space perspectives on protein search\u003c/h2\u003e \u003cp\u003eAcross the benchmarks evaluated in this study, we observed that the relative performance of alignment-based and representation-based methods depends strongly on protein complexity and the definition of similarity. In particular, representation-based approaches improved substantially after decomposing multi-domain proteins into single-domain units, reaching performance comparable to structural alignment methods. Similar trends were observed in functional search, where embedding-based methods exhibited stronger functional coherence among retrieved hits.\u003c/p\u003e \u003cp\u003eThese observations highlight that different protein representations induce distinct metric spaces, each encoding specific aspects of protein relatedness. Sequence-based, structure-based, and hybrid embedding-based methods emphasize complementary similarity signals, which helps explain why their behavior varies across benchmarks and why no single paradigm dominates.\u003c/p\u003e \u003cp\u003eAlignment-based methods rely primarily on independent pairwise comparisons and do not naturally generate a unified or continuous representation of protein space. 
Consequently, they are less suited for analyzing the global organization, coverage, or unexplored regions of large protein databases.\u003c/p\u003e \u003cp\u003eIn contrast, embedding-based methods construct explicit, high-dimensional metric spaces that capture protein relationships globally. To illustrate this, we performed dimensionality reduction on all proteins in the CATH-S20 dataset (Fig.\u0026nbsp;9). In the resulting embedding space, proteins from the same class form distinct clusters, reflecting meaningful structural relationships, and the distances between these clusters can be quantitatively measured to assess their similarity and divergence. This continuous organization enables systematic characterization of protein space, providing insights into sparsely populated or poorly annotated regions where meaningful relationships may not be evident from pairwise alignments alone, and facilitates the identification of underexplored areas that may merit further investigation.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec14\" class=\"Section2\"\u003e \u003ch2\u003eFuture directions\u003c/h2\u003e \u003cp\u003ePrevious benchmarks for protein sequence and structure search have provided valuable insights into method performance, typically focusing on well-curated, folded domains and evaluating structural or alignment accuracy under relatively idealized conditions. Such studies have played a critical role in establishing reference standards and driving methodological progress in the field.\u003c/p\u003e \u003cp\u003eOur benchmark extends these efforts by explicitly considering several biologically and practically relevant scenarios that are increasingly prevalent in modern protein databases but have been less systematically examined. 
In particular, we evaluate search performance on intrinsically disordered proteins, contrast multi-domain proteins with their single-domain counterparts, and stratify analyses by predicted structure confidence. These design choices enable a more realistic assessment of how different classes of methods behave when confronted with incomplete folding, domain complexity, or uncertainty in predicted structures.\u003c/p\u003e \u003cp\u003eBy systematically benchmarking protein similarity search methods across structural and functional-aware scenarios, our results highlight both the strengths and the current limitations of existing approaches under realistic database conditions. In particular, analyses of multi-domain architectures and predicted structures with varying confidence reveal how structural complexity and prediction uncertainty modulate search performance, whereas the explicit evaluation of intrinsically disordered proteins identifies regimes that remain challenging even for state-of-the-art methods. These findings underscore the need for benchmarks that go beyond canonical folded proteins and static structures, and motivate future efforts toward more robust, context-aware, and confidence-sensitive protein search methodologies.\u003c/p\u003e \u003cp\u003eDespite the unified design of the benchmark, several limitations should be noted. While we rely on curated resources such as CATH and DisProt, the construction of several evaluation scenarios was necessary because the field currently lacks widely accepted public benchmarks for protein sequence and structure search, leading us to collect some datasets from SwissProt and thereby introducing potential data reuse biases.\u003c/p\u003e \u003cp\u003eIn particular, several deep learning-based methods evaluated here are trained on large-scale resources that may partially overlap, directly or indirectly, with SwissProt sequences or annotations. 
Examples include TMvec, whose training relies on protein structures derived from SWISS-MODEL[\u003cspan class=\"CitationRef\"\u003e70\u003c/span\u003e]; PLMSearch, which leverages Pfam-based homology information during the retrieval (with the PfamClan module disabled in our evaluation); and DHR, whose training dataset is constructed using UniRef90[\u003cspan class=\"CitationRef\"\u003e71\u003c/span\u003e] as a reference. As a result, it is difficult to fully exclude prior exposure of benchmark proteins during model training. This challenge is increasingly unavoidable in the era of large pretrained protein models, where training data provenance is often broad and heterogeneous. While we applied sequence identity filtering and redundancy reduction to mitigate the trivial memorization, subtle forms of information leakage cannot be completely ruled out.\u003c/p\u003e \u003cp\u003eBeyond these data-related considerations, additional limitations reflect the scope of the evaluation. Functional consistency is assessed using GO-based annotations, which are incomplete and unevenly distributed across proteins. Structural analyses rely on static protein conformations and therefore do not capture conformational dynamics or interaction-dependent folding. Moreover, our evaluation focuses on pairwise search performance and does not explicitly address downstream workflows such as clustering or iterative annotation.\u003c/p\u003e \u003cp\u003eFuture benchmarks would benefit from stricter control of training-evaluation separation, for example through time-split datasets, explicit tracking of training data sources, or the use of newly deposited or de novo sequences and structures. 
Extending evaluation to confidence-aware search, domain-aware representations, and disordered or context-dependent proteins will also be essential for capturing the full complexity of protein relatedness in large-scale databases.\u003c/p\u003e \u003c/div\u003e "},{"header":"Methods","content":"\u003ch2\u003eBenchmark datasets\u003c/h2\u003e\u003ch2\u003eCATH-S20 dataset\u003c/h2\u003e\u003cp\u003eThe CATH-S20 (v4.4.0, 15043 domains) dataset was used to evaluate protein search performance under a hierarchical fold classification setting. Protein domains were obtained from the CATH database and filtered to a maximum pairwise sequence identity of 20% to reduce the redundancy. Each protein domain is annotated with a four-level hierarchical classification, including Class, Architecture, Topology (fold), and Homologous superfamily. We evaluated protein search performance on CATH-S20 using an all-versus-all search protocol, in which each domain was used in turn as a query against all remaining domains in the dataset.\u003c/p\u003e\u003ch2\u003eFunctional search dataset\u003c/h2\u003e\u003cp\u003eThe functional search dataset was constructed based on the AlphaFold Protein Structure Database (v6)[\u003cspan class=\"CitationRef\"\u003e72\u003c/span\u003e], which provides a pre-assembled set of SwissProt proteins with predicted structures, comprising a total of 550122 entries. To ensure reliable structural information, we retained only proteins with an average predicted pLDDT score of at least 90, resulting in 307578 high-confidence predicted structures. These proteins were then clustered using CD-HIT[\u003cspan class=\"CitationRef\"\u003e73\u003c/span\u003e] at a 20% sequence identity threshold to reduce the redundancy, yielding 11721 representative proteins. Next, we filtered the dataset to retain only proteins with curated Gene Ontology (GO) annotations, considering only the Molecular Function category. This resulted in 8767 proteins. 
We further excluded proteins annotated with GO terms that had been removed in the current GO release, producing a final dataset of 8735 proteins. From this final set, 1000 proteins were randomly selected as the query set, and the remaining proteins were used as the target set for functional search evaluation.\u003c/p\u003e\u003ch2\u003eMulti-domain dataset\u003c/h2\u003e\u003cp\u003eThe multi-domain dataset was constructed to evaluate the impact of protein architectural complexity on the search performance. Starting from the same non-redundant, high-confidence protein set described above, comprising 11721 structures, we identified proteins annotated as multi-domain based on TED domain annotations. This resulted in a total of 5573 multi-domain proteins. To construct the search benchmark, 100 multi-domain proteins were randomly selected as the query set, and the remaining 4573 proteins were used as the target set. In addition, to enable a direct comparison between full-length multi-domain proteins and their constituent domains, we further derived a domain-resolved dataset by decomposing the multi-domain proteins into their single-domain units according to TED annotations. This design allows paired evaluations on identical underlying proteins, facilitating an assessment of how domain composition influences search performance across different methods. Only TED-defined continuous structural segments were retained as single-domain units, resulting in 204 queries and 11041 targets.\u003c/p\u003e\u003ch2\u003eDisorder protein search dataset\u003c/h2\u003e\u003cp\u003eThe disorder protein search dataset was constructed to evaluate protein search performance on intrinsically disordered proteins (IDPs). Query proteins were obtained from DisProt (release 2024.12), which provides experimentally validated annotations of intrinsically disordered regions. 
Starting from the DisProt set with 3113 proteins, we retained proteins that have AlphaFold structures and curated Gene Ontology (GO) annotations. This resulted in an initial set of 1012 IDP query candidates, which was further processed to remove the redundancy. The target set was constructed from the AlphaFold SwissProt human proteome, comprising 23586 proteins. Human proteins were selected as targets because the majority of DisProt entries are derived from human proteins, enabling a biologically consistent evaluation setting. To ensure low sequence redundancy between query and target proteins, the two sets were pooled and clustered using CD-HIT at a 20% sequence identity threshold. After clustering, the final dataset consisted of 534 query proteins and 6955 target proteins.\u003c/p\u003e\u003ch2\u003ePredicted structure quality dataset\u003c/h2\u003e\u003cp\u003eThis dataset was constructed to assess the robustness of protein search methods to variations in structural prediction accuracy. All proteins were obtained from the AFDB SwissProt set. Based on the average predicted pLDDT score, proteins were grouped into three subsets representing different levels of structure quality: high confidence (pLDDT ≥ 90), medium confidence (70 ≤ pLDDT \u0026lt; 90), and low confidence (pLDDT \u0026lt; 70). To control potential confounding effects introduced by protein length, we restricted the analysis to proteins with sequence lengths up to 300 residues. This filtering step was applied uniformly across all three subsets, as proteins with lower pLDDT scores tend to be substantially longer and could otherwise bias the evaluation. Each subset was independently clustered using CD-HIT at a 20% sequence identity threshold to reduce the redundancy. For each structure quality group, 100 proteins were randomly selected as the query set, while the remaining proteins were used as the target set. 
This resulted in target sets of 4053 proteins for the high-confidence group, 4913 proteins for the medium-confidence group, and 1907 proteins for the low-confidence group.\u003c/p\u003e\u003cb\u003eEvaluation tasks and metrics\u003c/b\u003e\u003ch2\u003eSensitivity up to the 1st False Positive\u003c/h2\u003e\u003cp\u003eFor fold classification tasks on the CATH-S20 dataset, search performance was evaluated using sensitivity up to the first false positive. For each query, retrieved targets were ranked by similarity score, and true positives were defined according to the corresponding CATH classification level (Class, Architecture, Topology, or Homologous superfamily).\u003c/p\u003e\u003cp\u003eSensitivity up to the first false positive is defined as the fraction of true positives retrieved before the first incorrect hit appears in the ranked list:\u003c/p\u003e\u003cdiv id=\"Equ1\" class=\"Equation\"\u003e\u003cdiv class=\"mathdisplay\" id=\"FileID_Equ1\" name=\"EquationSource\"\u003e\n$$\\text{Sensitivity}_{\\text{FP1}}=\\frac{N_{\\text{TP before FP}}}{N_{\\text{total TP}}}$$\u003c/div\u003e\u003cdiv class=\"EquationNumber\"\u003e1\u003c/div\u003e\u003c/div\u003e\u003cp\u003ewhere \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\$N_{\\text{TP before FP}}\$\u003c/span\u003e\u003c/span\u003e denotes the number of true positive hits retrieved before the first false positive, and \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\$N_{\\text{total TP}}\$\u003c/span\u003e\u003c/span\u003e is the total number of true positives for the given query at the specified 
CATH level.\u003c/p\u003e\u003ch2\u003eGO semantic similarity\u003c/h2\u003e\u003cp\u003eFor functional search and disorder protein search benchmarks, functional consistency was evaluated using Gene Ontology (GO) semantic similarity based on Molecular Function annotations. Functional similarity between a query protein and its retrieved targets was computed using the Wang semantic similarity measure[\u003cspan class=\"CitationRef\"\u003e74\u003c/span\u003e], as implemented in the python “goatools” package[\u003cspan class=\"CitationRef\"\u003e75\u003c/span\u003e].\u003c/p\u003e\u003cp\u003eGiven two GO terms \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\$\\:{\\text{g}}_{1}\$\u003c/span\u003e\u003c/span\u003e and \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\$\\:{\\text{g}}_{2}\$\u003c/span\u003e\u003c/span\u003e, the Wang method quantifies their semantic similarity by considering the graph structure of the GO directed acyclic graph and the contribution of shared ancestor terms. 
For proteins annotated with multiple GO terms, we aggregated term-level similarities using the best-match average (BMA) strategy.\u003c/p\u003e\u003cdiv id=\"Equ2\" class=\"Equation\"\u003e\u003cdiv class=\"mathdisplay\" id=\"FileID_Equ2\" name=\"EquationSource\"\u003e\n$$\\text{Sim}_{\\text{GO}}\\left(q,t\\right)=\\frac{1}{\\left|G_{q}\\right|}\\sum_{g\\in G_{q}}\\max_{h\\in G_{t}}\\text{Sim}_{\\text{Wang}}\\left(g,h\\right)$$\u003c/div\u003e\u003cdiv class=\"EquationNumber\"\u003e2\u003c/div\u003e\u003c/div\u003e\u003cp\u003eHere \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\$G_{q}\$\u003c/span\u003e\u003c/span\u003e and \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\$G_{t}\$\u003c/span\u003e\u003c/span\u003e denote the sets of GO terms annotated to the query \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\$q\$\u003c/span\u003e\u003c/span\u003e and the target \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\$t\$\u003c/span\u003e\u003c/span\u003e, respectively. This formulation computes similarity in a query-to-target direction, reflecting the extent to which the functional annotations of the query are recovered among the retrieved targets.\u003c/p\u003e\u003ch2\u003eLocal Distance Difference Test (lDDT)\u003c/h2\u003e\u003cp\u003eLocal structural similarity was assessed using the Local Distance Difference Test (lDDT), a reference-free metric that quantifies the agreement of local inter-residue distances without requiring global structural superposition. 
lDDT measures the fraction of residue-residue distance pairs that are preserved within predefined distance tolerances between two aligned structures.\u003c/p\u003e\u003cp\u003eGiven an alignment between a query structure and a target structure, lDDT is computed as:\u003c/p\u003e\u003cdiv id=\"Equ3\" class=\"Equation\"\u003e\u003cdiv class=\"mathdisplay\" id=\"FileID_Equ3\" name=\"EquationSource\"\u003e\n$$\\text{lDDT}=\\frac{1}{N}\\sum_{i=1}^{N}\\frac{1}{\\left|\\mathcal{N}\\left(i\\right)\\right|}\\sum_{j\\in\\mathcal{N}\\left(i\\right)}\\mathbf{I}\\left(\\left|d_{ij}^{\\left(q\\right)}-d_{ij}^{\\left(t\\right)}\\right|\u0026lt;\\delta\\right)$$\u003c/div\u003e\u003cdiv class=\"EquationNumber\"\u003e3\u003c/div\u003e\u003c/div\u003e\u003cp\u003ewhere \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\$d_{ij}^{\\left(q\\right)}\$\u003c/span\u003e\u003c/span\u003e and \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\$d_{ij}^{\\left(t\\right)}\$\u003c/span\u003e\u003c/span\u003e denote inter-residue distances in the query and target structures, respectively, \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\$\\mathcal{N}\\left(i\\right)\$\u003c/span\u003e\u003c/span\u003e is the set of neighboring residues of residue \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\$i\$\u003c/span\u003e\u003c/span\u003e within a fixed spatial cutoff (15 Å), \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\$\\delta\$\u003c/span\u003e\u003c/span\u003e is the distance tolerance threshold (0.5, 1, 2, or 4 Å), and \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\$\\mathbf{I}\\left(\\cdot\\right)\$\u003c/span\u003e\u003c/span\u003e is the indicator function. 
The final residue-wise lDDT score is obtained by averaging over the four distance thresholds.\u003c/p\u003e\u003cp\u003eFollowing previous work on reference-free multi-domain evaluation, we adopted a modified definition of lDDT in which the denominator \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\$\\:\\left|\\mathcal{N}\\left(i\\right)\\right|\$\u003c/span\u003e\u003c/span\u003e is defined as the total number of neighboring residues within 15 Å of residue \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\$\\:\\text{i}\$\u003c/span\u003e\u003c/span\u003e in the query structure, rather than only those neighbors that are aligned to the target. This modification penalizes non-compact or fragmented alignments, where few neighboring residues are aligned despite limited local structural consistency, and thus provides a more stringent assessment of local structural agreement[\u003cspan class=\"CitationRef\"\u003e38\u003c/span\u003e].\u003c/p\u003e\u003cp\u003eFor each query-target pair, lDDT was computed based on structural alignments obtained from multiple alignment tools, and the maximum lDDT score was used as the final local similarity measure. Search performance was evaluated by averaging lDDT scores over the top 10 retrieved hits.\u003c/p\u003e\u003ch2\u003eEvaluated protein search tools\u003c/h2\u003e\u003cp\u003eWe evaluated a diverse set of protein search tools spanning sequence-based, structure-based, and representation-based approaches. All methods were applied using publicly available implementations with recommended or default parameters unless otherwise specified.\u003c/p\u003e\u003ch2\u003eStructure alignment tools\u003c/h2\u003e\u003ch2\u003eDali\u003c/h2\u003e\u003cp\u003eDali is a classical structure alignment tool that performs local structural alignment based on distance matrix comparisons. 
We installed the standalone DaliLite.v5 (available at \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttp://ekhidna2.biocenter.helsinki.fi/dali/README.v5.html\u003c/span\u003e\u003cspan class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e). Input PDB files were converted to DAT format using Dali’s “\u003cem\u003eimport.pl\u003c/em\u003e”. Protein alignments were computed using Dali’s structural alignment algorithm, and results were sorted by Dali z-score in descending order.\u003c/p\u003e\u003ch2\u003eTMalign\u003c/h2\u003e\u003cp\u003eWe used TMalign (Version 20220412) for structural alignment. Alignments between query and target protein structures were computed with default parameters. All-vs-all dense comparisons were performed between queries and targets. For benchmarks focusing on global structural similarity, such as CATH-S20 and single-domain datasets, the average TM-score[\u003cspan class=\"CitationRef\"\u003e76\u003c/span\u003e] over all query-target pairs was used. For evaluations emphasizing local structural similarity, such as functional search, multi-domain, or GO semantic similarity datasets, the maximum TM-score across all alignments was used as the final similarity measurement.\u003c/p\u003e\u003ch2\u003eGTalign\u003c/h2\u003e\u003cp\u003eGTalign is an accelerated implementation of TMalign that preserves the underlying alignment strategy while substantially reducing the computational cost. We used the GPU Version 0.18.00 (\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://github.com/minmarg/gtalign_alpha\u003c/span\u003e\u003cspan class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e). All-versus-all dense retrieval was performed between query and target structures. Alignments were computed for all query-target pairs. 
TM-scores produced by GTalign were used in the same manner as those from TMalign.\u003c/p\u003e\u003ch2\u003eFoldseek\u003c/h2\u003e\u003cp\u003eFoldseek is a fast structure-based protein search tool that enables large-scale structural comparisons by discretizing protein structures into sequences over a structural alphabet (3Di) and aligning them within a sequence search framework. We used Foldseek (Version dd579d9e6682519937e5c27d1ccb9eb4c9aeb87f) with default parameters. The software is available at \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://github.com/steineggerlab/foldseek\u003c/span\u003e\u003cspan class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e. Searches were performed using the “\u003cem\u003eeasy-search\u003c/em\u003e” command. As Foldseek typically identifies enough target hits, we employed the default E-value threshold of 1.0, and results were ranked by bit score.\u003c/p\u003e\u003ch2\u003eRepresentation-based tools\u003c/h2\u003e\u003ch2\u003eGraSR\u003c/h2\u003e\u003cp\u003eGraSR is a graph-based structural representation method that encodes protein structures into embeddings using a graph neural network (GNN)[\u003cspan class=\"CitationRef\"\u003e77\u003c/span\u003e] to enable efficient similarity search without explicit structural alignment. We used the publicly available GraSR implementation with pretrained model weights (available at \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://github.com/chunqiux/GraSR\u003c/span\u003e\u003cspan class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e).\u003c/p\u003e\u003ch2\u003eTMvec\u003c/h2\u003e\u003cp\u003eTMvec is a representation-based protein structural similarity search method. It builds protein embeddings from ProtTrans protein language model representations, which are further refined by a four-layer Transformer[\u003cspan class=\"CitationRef\"\u003e78\u003c/span\u003e] network trained to approximate TM-score. 
It provides weights pretrained on four different types of datasets, and we used the “\u003cem\u003etmvec_swiss_model\u003c/em\u003e” weights. TMvec is available at \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://github.com/tymor22/tm-vec\u003c/span\u003e\u003cspan class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/p\u003e\u003ch2\u003ePLMSearch\u003c/h2\u003e\u003cp\u003ePLMSearch is a protein search framework based on protein language model embeddings and includes two optional modules, SS-predictor and PfamClan. The PfamClan module relies on PfamScan[\u003cspan class=\"CitationRef\"\u003e79\u003c/span\u003e], a widely used third-party tool for Pfam domain annotation, and is independent of the core embedding model. Because Pfam-based annotations can in some cases reflect curated functional knowledge in Swiss-Prot, including GO annotations, we restricted the use of PfamClan to the CATH evaluation. For all other benchmark settings, we used the SS-predictor module to ensure a consistent and annotation-independent comparison across methods. PLMSearch is available at \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://dmiip.sjtu.edu.cn/PLMSearch\u003c/span\u003e\u003cspan class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/p\u003e\u003ch2\u003eDHR\u003c/h2\u003e\u003cp\u003eDHR (Dense-Homolog-Retrieval) is a representation-based protein sequence search method based on a bi-encoder architecture with contrastive learning, initialized from the pretrained ESM-1b model. The model was trained on homologous sequence pairs derived from jackhmmer-generated MSAs built from large-scale sequence databases, with training queries primarily drawn from UniRef90 and UniClust30[\u003cspan class=\"CitationRef\"\u003e80\u003c/span\u003e], which include substantial coverage of Swiss-Prot and may introduce partial overlap with our evaluation datasets. 
In our benchmark, we used the released v1 pretrained weights (\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://github.com/ml4bio/Dense-Homolog-Retrieval/tree/v1\u003c/span\u003e\u003cspan class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e) following the authors’ recommendation.\u003c/p\u003e\u003ch2\u003eFoldExplorer\u003c/h2\u003e\u003cp\u003eFoldExplorer is a protein search method based on joint sequence-structure representations. Structural information is encoded using a graph neural network, while sequence representations are obtained from a fine-tuned ESM-2 model. Query and target proteins are mapped to fixed-length embeddings, and retrieval is performed by ranking targets according to cosine similarity. FoldExplorer is available at \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://github.com/YuanLiu-SJTU/FoldExplorer\u003c/span\u003e\u003cspan class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/p\u003e\u003ch2\u003eSequence alignment tools\u003c/h2\u003e\u003ch2\u003eBLAST\u003c/h2\u003e\u003cp\u003eBLAST is a widely used sequence-based alignment tool that performs local sequence similarity searches using heuristic alignment strategies. We used protein-protein BLAST 2.16.0+ with default parameters unless otherwise specified. Protein sequences were searched against the pre-built target database, and hits were ranked by bit score.\u003c/p\u003e\u003ch2\u003eDiamond\u003c/h2\u003e\u003cp\u003eDiamond is a fast sequence alignment tool designed for large-scale protein searches, providing a substantial speed-up over BLAST while maintaining comparable sensitivity. We used Diamond (v2.1.10.164) in “\u003cem\u003eblastp\u003c/em\u003e” mode with default parameters. 
Search results were ranked by bit score.\u003c/p\u003e\u003ch2\u003eJackhmmer\u003c/h2\u003e\u003cp\u003eJackhmmer is an iterative sequence search tool based on profile hidden Markov models (HMMs), implemented in the HMMER suite[\u003cspan class=\"CitationRef\"\u003e81\u003c/span\u003e]. We used Jackhmmer (HMMER Version 3.4) with default parameters. Iterative searches were performed against the target sequence database for five iterations with “\u003cem\u003e-N 5\u003c/em\u003e”. Final hits were ranked according to the native scores reported by the software.\u003c/p\u003e\u003ch2\u003eMMseqs2\u003c/h2\u003e\u003cp\u003eMMseqs2 (Many against Many sequence searching) is an open-source software suite for very fast, parallelized protein sequence searches and clustering of huge protein sequence data sets. We used MMseqs2 (Release 16-747c6) in search mode with default parameters. Query sequences were searched against the target database, and results were ranked by bit score.\u003c/p\u003e\u003cp\u003eFor all sequence alignment tools, we set the E-value threshold to 1e16 to ensure a sufficient number of hits. This was necessary because sequence similarity in our datasets is generally low, and default thresholds often yield fewer than 10 hits per query.\u003c/p\u003e\u003ch2\u003eEvaluation protocol\u003c/h2\u003e\u003cp\u003eTo ensure a fair and consistent comparison across protein search methods with heterogeneous output formats and scoring schemes, we adopted a unified top-k retrieval protocol with k = 10.\u003c/p\u003e\u003cp\u003eFor each query protein, we first configured each method to return as many target hits as possible by using permissive or unbounded output settings (e.g., disabling hit number limits or E-value thresholds where applicable). 
Retrieved targets were ranked according to the native similarity score of each method, and the 10 top-ranked targets were selected for evaluation.\u003c/p\u003e\u003cp\u003eIn cases where a method returned fewer than 10 valid hits for a given query (e.g., due to limited sensitivity or strict internal filtering), the remaining slots were filled by randomly sampling proteins from the target set that were not retrieved by the method. These randomly sampled targets were appended to the ranked list to ensure that exactly 10 targets were associated with each query for all methods. Random sampling was performed using a fixed random seed to ensure reproducibility.\u003c/p\u003e"},{"header":"Declarations","content":"\u003ch2\u003eAuthor Contributions\u003c/h2\u003e\u003cp\u003eYuan Liu and Hong-Bin Shen designed the benchmark framework and conducted the experiments. Yuan Liu implemented the evaluation pipeline, wrote the code, performed data analysis, and drafted the manuscript. Yingquan Zhou assisted with dataset preparation and validation experiments. Yan Huang and Hongyi Xin contributed to method selection, experimental design, and result interpretation. Xiaoyong Pan provided critical insights into benchmark design and revised the manuscript. Hong-Bin Shen conceived and supervised the study, provided overall guidance, and revised the manuscript. All authors reviewed and approved the final manuscript.\u003c/p\u003e\u003ch2\u003eAcknowledgements\u003c/h2\u003e \u003cp\u003eThis work was supported by the National Key Research and Development Program of China (No. 2025YFA1805600), National Natural Science Foundation of China (No. 62573293, 62473257), and the Science and Technology Commission of Shanghai Municipality (No. 24ZR1435300, 24510714300).\u003c/p\u003e\u003ch2\u003eData Availability\u003c/h2\u003e\u003cp\u003eAll data sets used in this paper and the scripts to reproduce the analyses are freely available at https://github.com/YuanLiu-SJTU/protein-search-benchmark. 
The CATH structural domains were obtained from https://www.cathdb.info/, predicted structures from AlphaFold are available at https://alphafold.com/, and domain annotations were sourced from https://ted.cathdb.info/.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\u003cli\u003e\u003cspan\u003eWilson CA, Kreychman J, Gerstein M. Assessing annotation transfer for genomics: quantifying the relations between protein sequence, structure and function through traditional and probabilistic scores. J Mol Biol. 2000;297:233\u0026ndash;49.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eJoshi T, Xu D. Quantitative assessment of relationship between sequence similarity and function similarity. BMC Genomics. 2007;8:222.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eXie H, Wasserman A, Levine Z, Novik A, Grebinskiy V, Shoshan A, Mintz L. Large-scale protein annotation through gene ontology. Genome Res. 2002;12:785\u0026ndash;94.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eRoberts E, Eargle J, Wright D, Luthey-Schulten Z. MultiSeq: unifying sequence and structure data for evolutionary analysis. BMC Bioinformatics. 2006;7:382.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLisewski AM, Lichtarge O. Rapid detection of similarity in protein structure and function through contact metric distances. Nucleic Acids Res. 2006;34:e152\u0026ndash;152.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLoewenstein Y, Raimondo D, Redfern OC, Watson J, Frishman D, Linial M, Orengo C, Thornton J, Tramontano A. Protein function annotation by homology-based inference. Genome Biol. 2009;10:207.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBlum M, Andreeva A, Florentino LC, Chuguransky SR, Grego T, Hobbs E, Pinto BL, Orr A, Paysan-Lafosse T, Ponamareva I. InterPro: the protein sequence classification resource in 2025. Nucleic Acids Res. 
2025;53:D444\u0026ndash;56.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003ePaysan-Lafosse T, Andreeva A, Blum M, Chuguransky SR, Grego T, Pinto BL, Salazar GA, Bileschi ML, Llinares-L\u0026oacute;pez F, Meng-Papaxanthos L. The Pfam protein families database: embracing AI/ML. Nucleic Acids Res. 2025;53:D523\u0026ndash;34.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eUniProt Consortium. UniProt: the universal protein knowledgebase in 2025. Nucleic Acids Res. 2025;53:D609\u0026ndash;17.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eJumper J, Evans R, Pritzel A, Green T, Figurnov M, Ronneberger O, Tunyasuvunakool K, Bates R, Ž\u0026iacute;dek A, Potapenko A. Highly accurate protein structure prediction with AlphaFold. Nature. 2021;596:583\u0026ndash;9.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eAbramson J, Adler J, Dunger J, Evans R, Green T, Pritzel A, Ronneberger O, Willmore L, Ballard AJ, Bambrick J. Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature. 2024;630:493\u0026ndash;500.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLin Z, Akin H, Rao R, Hie B, Zhu Z, Lu W, Smetanin N, Verkuil R, Kabeli O, Shmueli Y. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science. 2023;379:1123\u0026ndash;30.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eZheng W, Wuyun Q, Li Y, Liu Q, Zhou X, Peng C, Zhu Y, Freddolino L, Zhang Y. Deep-learning-based single-domain and multidomain protein structure prediction with DI-TASSER. Nat Biotechnol. 2025:1\u0026ndash;13.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMirdita M, Sch\u0026uuml;tze K, Moriwaki Y, Heo L, Ovchinnikov S, Steinegger M. ColabFold: making protein folding accessible to all. Nat Methods. 
2022;19:679\u0026ndash;82.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eTunyasuvunakool K, Adler J, Wu Z, Green T, Zielinski M, Ž\u0026iacute;dek A, Bridgland A, Cowie A, Meyer C, Laydon A. Highly accurate protein structure prediction for the human proteome. Nature. 2021;596:590\u0026ndash;6.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eVaradi M, Anyango S, Deshpande M, Nair S, Natassia C, Yordanova G, Yuan D, Stroe O, Wood G, Laydon A. AlphaFold Protein Structure Database: massively expanding the structural coverage of protein-sequence space with high-accuracy models. Nucleic Acids Res. 2022;50:D439\u0026ndash;44.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eYeo J, Han Y, Bordin N, Lau AM, Kandathil SM, Kim H, Karin EL, Mirdita M, Jones DT, Orengo C. Metagenomic-scale analysis of the predicted protein structure universe. \u003cem\u003ebioRxiv\u003c/em\u003e 2025:2025.2004. 2023.650224.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKim RS, Levy Karin E, Mirdita M, Chikhi R, Steinegger M. BFVD\u0026mdash;a large repository of predicted viral protein structures. Nucleic Acids Res. 2025;53:D340\u0026ndash;7.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003ePorta-Pardo E, Ruiz-Serra V, Valentini S, Valencia A. The structural coverage of the human proteome before and after AlphaFold. PLoS Comput Biol. 2022;18:e1009818.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eAltschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol. 1990;215:403\u0026ndash;10.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eAltschul SF, Madden TL, Sch\u0026auml;ffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25:3389\u0026ndash;402.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKilinc M, Jia K, Jernigan RL. 
Improved global protein homolog detection with major gains in function identification. \u003cem\u003eProceedings of the National Academy of Sciences\u003c/em\u003e 2023, 120:e2211823120.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eZhang Y, Skolnick J. TM-align: a protein structure alignment algorithm based on the TM-score. Nucleic Acids Res. 2005;33:2302\u0026ndash;9.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHolm L. Using Dali for protein structure comparison. Structural bioinformatics: methods and protocols. Springer; 2020. pp. 29\u0026ndash;42.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHolm L, Laiho A, T\u0026ouml;r\u0026ouml;nen P, Salgado M. DALI shines a light on remote homologs: One hundred discoveries. Protein Sci. 2023;32:e4519.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKoehl P. Protein structure similarities. Curr Opin Struct Biol. 2001;11:348\u0026ndash;53.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSael L, Li B, La D, Fang Y, Ramani K, Rustamov R, Kihara D. Fast protein tertiary structure retrieval based on global surface shape similarity. Proteins Struct Funct Bioinform. 2008;72:1259\u0026ndash;73.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eChoi I-G, Kwon J, Kim S-H. Local feature frequency profile: a method to measure structural similarity in proteins. \u003cem\u003eProceedings of the National Academy of Sciences\u003c/em\u003e 2004, 101:3797\u0026ndash;3802.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBhaskara RM, Srinivasan N. Stability of domain structures in multi-domain proteins. Sci Rep. 2011;1:40.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eRaghava G, Searle SM, Audley PC, Barber JD, Barton GJ. OXBench: a benchmark for evaluation of protein multiple sequence alignment accuracy. BMC Bioinformatics. 
2003;4:47.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eZhang C, Shine M, Pyle AM, Zhang Y. US-align: universal structure alignments of proteins, nucleic acids, and macromolecular complexes. Nat Methods. 2022;19:1109\u0026ndash;15.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBudowski-Tal I, Nov Y, Kolodny R. FragBag, an accurate representation of protein structure, retrieves structural neighbors from the entire PDB quickly and accurately. \u003cem\u003eProceedings of the National Academy of Sciences\u003c/em\u003e 2010, 107:3481\u0026ndash;3486.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLiu Y, Ye Q, Wang L, Peng J. Learning structural motif representations for efficient protein structure search. Bioinformatics. 2018;34:i773\u0026ndash;80.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eDurairaj J, Akdel M, de Ridder D, van Dijk AD. Geometricus represents protein structures as shape-mers derived from moment invariants. Bioinformatics. 2020;36:i718\u0026ndash;25.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLlinares-L\u0026oacute;pez F, Berthet Q, Blondel M, Teboul O, Vert J-P. Deep embedding and alignment of protein sequences. Nat Methods. 2023;20:104\u0026ndash;11.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKaminski K, Ludwiczak J, Pawlicki K, Alva V, Dunin-Horkawicz S. pLM-BLAST: distant homology detection based on direct comparison of sequence representations from protein language models. Bioinformatics. 2023;39:btad579.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKandathil SM, Lau AM, Buchan DW, Jones DT. Foldclass and Merizo-search: scalable structural similarity search for single-and multi-domain proteins using geometric learning. Bioinformatics. 2025;41:btaf277.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eVan Kempen M, Kim SS, Tumescheit C, Mirdita M, Lee J, Gilchrist CL, S\u0026ouml;ding J, Steinegger M. 
Fast and accurate protein structure search with Foldseek. Nat Biotechnol. 2024;42:243\u0026ndash;6.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLiu W, Wang Z, You R, Xie C, Wei H, Xiong Y, Yang J, Zhu S. PLMSearch: Protein language model powers accurate and fast sequence search for remote homology. Nat Commun. 2024;15:2775.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHamamsy T, Morton JT, Blackwell R, Berenberg D, Carriero N, Gligorijevic V, Strauss CE, Leman JK, Cho K, Bonneau R. Protein remote homology detection and structural alignment using deep learning. Nat Biotechnol. 2024;42:975\u0026ndash;85.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eXia C, Feng S-H, Xia Y, Pan X, Shen H-B. Fast protein structure comparison through effective representation learning with contrastive graph neural networks. PLoS Comput Biol. 2022;18:e1009986.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHong L, Hu Z, Sun S, Tang X, Wang J, Tan Q, Zheng L, Wang S, Xu S, King I. Fast, sensitive detection of protein homologs using deep dense retrieval. Nat Biotechnol. 2025;43:983\u0026ndash;95.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLiu Y, Zhang Y, Zhou Z, Shen H-B. FoldExplorer: fast and accurate protein structure search with sequence-enhanced graph embedding. J Mol Biol 2025:169412.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eS\u0026ouml;ding J, Remmert M. Protein sequence comparison and fold recognition: progress and good-practice benchmarking. Curr Opin Struct Biol. 2011;21:404\u0026ndash;11.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSauder JM, Arthur JW, Dunbrack RL Jr. Large-scale comparison of protein sequence alignment algorithms with structure alignments. Proteins Struct Funct Bioinform. 2000;40:6\u0026ndash;22.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eCsaba G, Birzele F, Zimmer R. 
Systematic comparison of SCOP and CATH: a new gold standard for protein structure analysis. BMC Struct Biol. 2009;9:23.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWang Y, Wu H, Cai Y. A benchmark study of sequence alignment methods for protein clustering. BMC Bioinformatics. 2018;19:529.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSykes J, Holland BR, Charleston MA. Benchmarking methods of protein structure alignment. J Mol Evol. 2020;88:575\u0026ndash;97.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eOldfield CJ, Dunker AK. Intrinsically disordered proteins and intrinsically disordered protein regions. Annu Rev Biochem. 2014;83:553\u0026ndash;84.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eDobson L, Tusn\u0026aacute;dy GE, Tompa P. Regularly updated benchmark sets for statistically correct evaluations of AlphaFold applications. Brief Bioinform. 2025;26:bbaf104.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eAderinwale T, Bharadwaj V, Christoffer C, Terashi G, Zhang Z, Jahandideh R, Kagaya Y, Kihara D. Real-time structure search and structure classification for AlphaFold protein models. Commun Biol. 2022;5:316.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWaman VP, Bordin N, Lau A, Kandathil S, Wells J, Miller D, Velankar S, Jones DT, Sillitoe I, Orengo C. CATH v4.4: major expansion of CATH by experimental and predicted structural data. Nucleic Acids Res. 2025;53:D348\u0026ndash;55.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eVander Meersche Y, Diharce J, Gelly J-C, Galochkina T. Flexibility or uncertainty? A critical assessment of AlphaFold 2 pLDDT. Structure. 2025;33:2157\u0026ndash;63. e2152.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMargelevičius M. GTalign: Spatial index-driven protein structure alignment, superposition, and search. Nat Commun. 
2024;15:7305.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBuchfink B, Xie C, Huson DH. Fast and sensitive protein alignment using DIAMOND. Nat Methods. 2015;12:59\u0026ndash;60.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eJohnson LS, Eddy SR, Portugaly E. Hidden Markov model speed heuristic and iterative HMM search procedure. BMC Bioinformatics. 2010;11:431.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSteinegger M, S\u0026ouml;ding J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat Biotechnol. 2017;35:1026\u0026ndash;8.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eYu T, Cui H, Li JC, Luo Y, Jiang G, Zhao H. Enzyme function prediction using contrastive learning. Science. 2023;379:1358\u0026ndash;63.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eAleksander SA, Balhoff J, Carbon S, Cherry JM, Drabkin HJ, Ebert D, Feuermann M, Gaudet P, Harris NL. The gene ontology knowledgebase in 2023. Genetics. 2023;224:iyad031.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eRives A, Meier J, Sercu T, Goyal S, Lin Z, Liu J, Guo D, Ott M, Zitnick CL, Ma J. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. \u003cem\u003eProceedings of the National Academy of Sciences\u003c/em\u003e 2021, 118:e2016239118.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eElnaggar A, Heinzinger M, Dallago C, Rehawi G, Wang Y, Jones L, Gibbs T, Feher T, Angerer C, Steinegger M. Prottrans: Toward understanding the language of life through self-supervised learning. IEEE Trans Pattern Anal Mach Intell. 2021;44:7112\u0026ndash;27.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eZhang Z, Wayment-Steele HK, Brixi G, Wang H, Kern D, Ovchinnikov S. Protein language models learn evolutionary statistics of interacting sequence motifs. 
\u003cem\u003eProceedings of the National Academy of Sciences\u003c/em\u003e 2024, 121:e2406285121.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eRost B. Twilight zone of protein sequence alignments. Protein Eng. 1999;12:85\u0026ndash;94.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHan J-H, Batey S, Nickson AA, Teichmann SA, Clarke J. The folding and evolution of multidomain proteins. Nat Rev Mol Cell Biol. 2007;8:319\u0026ndash;30.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLau AM, Bordin N, Kandathil SM, Sillitoe I, Waman VP, Wells J, Orengo CA, Jones DT. Exploring structural diversity across the protein universe with The Encyclopedia of Domains. Science. 2024;386:eadq4946.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMariani V, Biasini M, Barbato A, Schwede T. lDDT: a local superposition-free score for comparing protein structures and models using distance difference tests. Bioinformatics. 2013;29:2722\u0026ndash;8.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eTrivedi R, Nagarajaram HA. Intrinsically disordered proteins: an overview. Int J Mol Sci. 2022;23:14050.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSickmeier M, Hamilton JA, LeGall T, Vacic V, Cortese MS, Tantos A, Szabo B, Tompa P, Chen J, Uversky VN. DisProt: the database of disordered proteins. Nucleic Acids Res. 2007;35:D786\u0026ndash;93.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKulkarni P, Leite VB, Roy S, Bhattacharyya S, Mohanty A, Achuthan S, Singh D, Appadurai R, Rangarajan G, Weninger K. Intrinsically disordered proteins: Ensembles at the limits of Anfinsen's dogma. Biophys Reviews 2022, 3.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWaterhouse A, Bertoni M, Bienert S, Studer G, Tauriello G, Gumienny R, Heer FT, de Beer TAP, Rempfer C, Bordoli L. SWISS-MODEL: homology modelling of protein structures and complexes. Nucleic Acids Res. 
2018;46:W296\u0026ndash;303.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSuzek BE, Huang H, McGarvey P, Mazumder R, Wu CH. UniRef: comprehensive and non-redundant UniProt reference clusters. Bioinformatics. 2007;23:1282\u0026ndash;8.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBertoni D, Tsenkov M, Magana P, Nair S, Pidruchna I, Querino Lima Afonso M, Midlik A, Paramval U, Lawal D, Tanweer A. AlphaFold Protein Structure Database 2025: a redesigned interface and updated structural coverage. Nucleic Acids Res. 2026;54:D358\u0026ndash;62.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eFu L, Niu B, Zhu Z, Wu S, Li W. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics. 2012;28:3150\u0026ndash;2.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eYu G, Li F, Qin Y, Bo X, Wu Y, Wang S. GOSemSim: an R package for measuring semantic similarity among GO terms and gene products. Bioinformatics. 2010;26:976\u0026ndash;8.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKlopfenstein DV, Zhang L, Pedersen BS, Ram\u0026iacute;rez F, Warwick Vesztrocy A, Naldi A, Mungall CJ, Yunes JM, Botvinnik O, Weigel M. GOATOOLS: A Python library for Gene Ontology analyses. Sci Rep. 2018;8:10872.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eZhang Y, Skolnick J. Scoring function for automated assessment of protein structure template quality. Proteins Struct Funct Bioinform. 2004;57:702\u0026ndash;10.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eScarselli F, Gori M, Tsoi AC, Hagenbuchner M, Monfardini G. The graph neural network model. IEEE Trans Neural Networks. 2008;20:61\u0026ndash;80.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eVaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I. Attention is all you need. 
Adv Neural Inf Process Syst 2017, 30.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMistry J, Bateman A, Finn RD. Predicting active site residue annotations in the Pfam database. BMC Bioinformatics. 2007;8:298.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMirdita M, Von Den Driesch L, Galiez C, Martin MJ, S\u0026ouml;ding J, Steinegger M. Uniclust databases of clustered and deeply annotated protein sequences and alignments. Nucleic Acids Res. 2017;45:D170\u0026ndash;6.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eFinn RD, Clements J, Eddy SR. HMMER web server: interactive sequence similarity searching. Nucleic Acids Res. 2011;39:W29\u0026ndash;37.\u003c/span\u003e\u003c/li\u003e\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":false,"hideJournal":false,"highlight":"","institution":"","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"[email protected]","identity":"genome-biology","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"gbio","sideBox":"Learn more about [Genome Biology](https://genomebiology.biomedcentral.com/)","snPcode":"13059","submissionUrl":"https://submission.springernature.com/new-submission/13059/3","title":"Genome Biology","twitterHandle":"","acdcEnabled":true,"dfaEnabled":true,"editorialSystem":"stoa","reportingPortfolio":"BMC/SO AJ","inReviewEnabled":true,"inReviewRevisionsEnabled":true},"keywords":"Protein similarity search, Benchmark, Structure alignment, Representation-based searching","lastPublishedDoi":"10.21203/rs.3.rs-8796067/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-8796067/v1","license":{"name":"CC BY 
4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"","manuscriptTitle":"Benchmarking protein sequence and structure search methods for remote homology detection","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2026-02-24 12:54:49","doi":"10.21203/rs.3.rs-8796067/v1","editorialEvents":[{"type":"communityComments","content":0},{"type":"decision","content":"Revision requested","date":"2026-03-25T19:51:12+00:00","index":"","fulltext":""},{"type":"editorInvitedReview","content":"","date":"2026-03-24T04:15:12+00:00","index":"hide","fulltext":""},{"type":"editorInvitedReview","content":"","date":"2026-03-05T04:30:58+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"317510098886957357107731353111051504120","date":"2026-02-25T07:21:42+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"140656651509172178364885798366413484305","date":"2026-02-20T08:12:00+00:00","index":"hide","fulltext":""},{"type":"reviewersInvited","content":"","date":"2026-02-19T19:43:12+00:00","index":"","fulltext":""},{"type":"editorAssigned","content":"","date":"2026-02-12T20:07:01+00:00","index":"","fulltext":""},{"type":"checksComplete","content":"","date":"2026-02-06T04:23:11+00:00","index":"","fulltext":""},{"type":"submitted","content":"Genome Biology","date":"2026-02-05T10:25:35+00:00","index":"","fulltext":""}],"status":"published","journal":{"display":true,"email":"[email protected]","identity":"genome-biology","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"gbio","sideBox":"Learn more about [Genome 
Biology](https://genomebiology.biomedcentral.com/)","snPcode":"13059","submissionUrl":"https://submission.springernature.com/new-submission/13059/3","title":"Genome Biology","twitterHandle":"","acdcEnabled":true,"dfaEnabled":true,"editorialSystem":"stoa","reportingPortfolio":"BMC/SO AJ","inReviewEnabled":true,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"5c5a5da9-40b1-478c-9dfb-9a40fe1b4652","owner":[],"postedDate":"February 24th, 2026","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"under-review","subjectAreas":[],"tags":[],"updatedAt":"2026-04-27T12:38:47+00:00","versionOfRecord":[],"versionCreatedAt":"2026-02-24 12:54:49","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-8796067","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-8796067","identity":"rs-8796067","version":["v1"]},"buildId":"XKTyCvWXoU3ODBz1xrDgd","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}

Source provenance

europepmc: last seen: 2026-05-20T01:45:00.602351+00:00