Accurately predicting enzyme functions through geometric graph learning on ESMFold-predicted structures

Yuedong Yang, Yidong Song, Qianmu Yuan, Sheng Chen, Yuansong Zeng, and 1 more

This is a preprint; it has not been peer reviewed by a journal. https://doi.org/10.21203/rs.3.rs-4344209/v1 This work is licensed under a CC BY 4.0 License. Status: Published. Journal publication published 18 Sep, 2024 in Nature Communications. Version 1 posted.

Abstract

Enzymes are crucial in numerous biological processes, with the Enzyme Commission (EC) number being a commonly used method for defining enzyme function. However, current EC number prediction technologies have not fully recognized the importance of enzyme active sites and structural characteristics.
Here, we propose GraphEC, a geometric graph learning-based EC number predictor using the ESMFold-predicted structures and a pre-trained protein language model. Specifically, we first construct a model to predict the enzyme active sites, which is utilized to predict the EC number. The prediction is further improved through a label diffusion algorithm by incorporating homology information. In parallel, the optimum pH of enzymes is predicted to reflect the enzyme-catalyzed reactions. Experiments demonstrate the superior performance of our model in predicting active sites, EC numbers, and optimum pH compared to other state-of-the-art methods. Additional analysis reveals that GraphEC is capable of extracting functional information from protein structures, emphasizing the effectiveness of geometric graph learning. This technology can be used to identify unannotated enzyme functions, as well as to predict their active sites and optimum pH, with the potential to advance research in synthetic biology, genomics, and other fields.

Keywords: Enzyme Commission number, enzyme active sites, enzyme optimum pH, geometric graph learning, pre-trained language model

Introduction

Enzymes play an essential role in various biological processes by catalyzing numerous reactions [1, 2]. Identifying enzyme functions is crucial for the study of metabolism [3] and diseases [4]. The Enzyme Commission (EC) number [5] is commonly used to describe enzyme function with a four-digit code, which provides a unified scheme and expedites advancements in the field of enzyme engineering.
However, the experimental determination [6] of EC numbers is time-consuming and costly. The development of computational approaches for identifying EC numbers has become imperative. The computational approaches can be categorized into homology-based [7, 8], structure-based [9, 10], and machine learning-based [11–13] approaches. Homology-based approaches, assuming that highly similar enzymes have similar functions, were proposed to annotate the enzyme function with alignment tools [14, 15]. These methods rely heavily on sequence similarity, which limits their coverage while lacking similar sequences. To improve the coverage, structure-based approaches [9, 16] scanned structurally similar protein templates to identify consensus functions. For instance, COFACTOR [10] compared the query structure to proteins with known structures and functions in the BioLiP library [17] for function annotation. Despite the improvement of these methods, difficulties remain due to a lack of high-quality templates. To alleviate the constraints of similar sequences and templates, machine learning-based approaches have been developed. The initial machine learning-based approaches [18, 19] first extracted vital features before utilizing machine learning algorithms to identify the corresponding EC numbers. The performance of these machine learning algorithms is greatly influenced by the manually crafted features, which are not adapted to rapidly expanding enzyme sequences. Recently, deep learning methods [11, 20] have achieved success in enzyme function annotation. To avoid manual feature extraction, DEEPre [21] employed CNN and RNN components to capture convolutional and sequential features. ProteInfer [12] utilized a dilated convolutional network to establish a mapping between protein space and enzyme function space. Utilizing the InterPro signatures as domain information, GrAPFI [22] performed label propagation on a weighted undirected graph. 
For ECPICK [23], the protein sequence was encoded using one-hot embedding, which was subsequently employed to compute the posterior probabilities of around 5000 EC numbers through convolutional and hierarchical layers. CLEAN [11], another deep learning method that learned abundant embeddings through contrastive learning [24], achieved better accuracy and EC coverage for EC number identification. Nevertheless, these methods still suffer from two limitations. Firstly, they only used protein sequences without incorporating protein structures, thus losing the crucial features implied by the structures. Secondly, the crucial information of enzyme active sites was not employed in the analysis of enzyme function. Due to the lack of native structures, present methods don’t fully exploit the information from protein structures. AlphaFold2 [25] has made a breakthrough in protein structure prediction, with the predicted structures confirmed to be useful in DNA-binding site prediction [26, 27], antibiotic discovery [28], and the study of intrinsically disordered proteins [29]. Regrettably, the high computational demand of AlphaFold2 limits its applicability for genome-wide use. To address this issue, Lin et al. [30] proposed a pre-trained language model ESMFold for precise and quick structure prediction, attaining comparable accuracy to AlphaFold2 while significantly reducing inference time by up to 60 times. The high efficiency of ESMFold enables the analysis of protein structures in metagenomics [31], which has shown remarkable achievements in nucleic-acid-binding site prediction [32] and drug discovery [33]. With the aid of predicted structures, geometric graph learning [34], a technique that has proven beneficial in protein design [35, 36] and docking [37], can extract structural information efficiently. 
To augment the geometric graph learning, some studies [32, 38] have attempted to incorporate informative sequence embeddings using unsupervised language models (ProtTrans [39] and ESM-1b [40]). On the other side, enzyme active sites are typically located on the surface of enzymes and play an important role in catalyzing reactions or binding substrates [41]. They exhibit a high level of conservation in the process of evolution and significantly determine the function of enzymes [42, 43]. So obviously, it would be highly beneficial to consider the active sites of enzymes when assigning the EC numbers. Meanwhile, current methods for predicting enzyme active sites mainly rely on templates or hand-crafted features, which are unable to keep up with the rapidly growing data. This highlights the need for a fast and accurate enzyme active site predictor. Besides active sites, a label diffusion algorithm [44] has been developed for protein function prediction, which can transfer functionally relevant data and aid in identifying EC numbers. In this work, we proposed GraphEC (geometric Graph learning-based EC number annotation), an accurate network for enzyme function prediction based on predicted protein structures and enzyme active sites. Specifically, the enzyme active sites were identified first, as they play a critical role in predicting enzyme function. With the guidance of active sites, GraphEC was trained through geometric graph learning with the protein structures predicted by ESMFold. To improve the model performance, informative sequence embeddings were generated via a pre-trained language model (ProtTrans) to augment the node features. In addition, a label diffusion algorithm was employed to further enhance the prediction using homology information. Considering that enzyme-catalyzed reactions require specific environmental conditions, we further extended the model to enzyme optimum pH prediction, which can assist in experimental procedures. 
Through comprehensive comparisons on several independent tests, our model outperformed all the state-of-the-art methods in the predictions of active sites, EC number, and optimum pH. Additional analysis demonstrated that GraphEC is able to learn functional information from enzyme structures, further emphasizing the effectiveness of geometric graph learning.

Results

The overview of the model

GraphEC, an accurate EC number predictor based on geometric graph learning, incorporates the enzyme active sites and predicted protein structures into enzyme function prediction (Fig. 1). Given a protein sequence, its structure is predicted by ESMFold and used to construct the protein graph. Geometric features are extracted from the predicted structures and enhanced by sequence embeddings calculated through a pre-trained language model (ProtTrans). These features are fed into a geometric graph learning network for learning geometric embeddings, which are utilized in the prediction of active sites, EC number, and optimum pH. Here, enzyme active sites are first predicted by GraphEC-AS, which assigns a weight score to each residue. Guided by the weight scores, the initial prediction of the EC number is computed with the attention and pooling layers, and is further improved through a label diffusion algorithm that extracts homology information. Finally, the model is extended to optimum pH prediction through attention pooling to better represent the reaction conditions (GraphEC-pH).

Enzyme active site prediction (GraphEC-AS)

We first evaluated GraphEC-AS for residue-level enzyme active site prediction using the independent test set TS124 (details shown in Methods). Figure 2A displays an AUC (area under the receiver operating characteristic curve) of 0.9635 for GraphEC-AS on five-fold cross-validation and 0.9583 on TS124, demonstrating the robustness of the model.
The ROC curves of six competing methods (PREvaIL_RF [45], PREvaIL_LR, CRpred (residues with coordinates) [46], CRpred (all residues), HA (residue identity filter) [47], and HA (combination filter)) are located between those of GraphEC-AS and BiLSTM (the method excluding structural information), indicating the importance of geometric information. In terms of MCC (Matthews correlation coefficient), recall, and precision (Fig. 2B), our method consistently performed the best. The second-best method (PREvaIL_RF) achieved 0.2939, 0.6223, and 0.1487, lower than GraphEC-AS by 40.9%, 14.5%, and 57.1%, respectively. Source data are provided as a Source Data file. Additionally, the F1 score for GraphEC-AS on TS124 is 0.4698 (Table S1), while the second-best method, PREvaIL_RF, achieves a score of 0.240, a decrease of 48.9% relative to GraphEC-AS. PREvaIL requires time-consuming evolutionary profiles computed with PSI-BLAST [48], whereas GraphEC-AS can identify the enzyme active sites rapidly and accurately. The superiority of GraphEC-AS was further illustrated by its learned embeddings on TS124. The ProtTrans embeddings (Fig. 2C) are scattered, while the geometric embeddings learned by GraphEC-AS (Fig. 2D) clearly distinguish active sites from non-active sites. This demonstrates the capability of geometric graph learning to identify the crucial distinctions between them. We further evaluated the impact of the quality of ESMFold-predicted structures using TM-align [49] on TS124. More than 85% of proteins had TM-scores greater than 0.8 (Supplementary Fig. S1), which reflects the high quality of the ESMFold-predicted structures. The AUC values increased with TM-scores (Supplementary Fig. S2), which indicates the necessity of high predicted structure quality and emphasizes the importance of employing ProtTrans to enhance the feature embeddings.
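As a consistency check on the F1 scores quoted above for TS124, the following minimal sketch recomputes the F1 of PREvaIL_RF from its reported precision and recall, and the relative decrease with respect to GraphEC-AS. The function name `f1_score` is illustrative and not taken from the GraphEC code; only the numbers quoted in the text are used.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall of PREvaIL_RF on TS124, as quoted in the text.
prevail_rf_f1 = f1_score(precision=0.1487, recall=0.6223)

# F1 of GraphEC-AS on TS124, as quoted in the text (Table S1).
graphec_as_f1 = 0.4698

# Relative decrease of PREvaIL_RF with respect to GraphEC-AS.
relative_decrease = (graphec_as_f1 - prevail_rf_f1) / graphec_as_f1

print(f"PREvaIL_RF F1: {prevail_rf_f1:.3f}")
print(f"Relative decrease vs. GraphEC-AS: {relative_decrease:.1%}")
```

The recomputed values agree with the reported F1 of 0.240 and the 48.9% relative decrease.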
Figure 2E, F compare the active sites predicted by BiLSTM and GraphEC-AS on the three-dimensional structure of an example enzyme (cis-muconate cyclase). GraphEC-AS identified all four active sites, whereas BiLSTM only detected H149 due to the absence of local structure characteristics. Relative to H149, the remaining active sites were located far away in sequence (more than 20 residues apart) but close in structure (less than 16 Å). These results indicate the capability of GraphEC-AS to learn the local structure information. Additional cases can be seen in Supplementary Fig. S3.

Enzyme EC number identification (GraphEC)

With the guidance of predicted active sites, GraphEC was proposed to identify enzyme EC numbers. GraphEC was evaluated on two independent tests: NEW-392 and Price-149, where NEW-392 comprises 392 enzyme sequences covering 177 different EC numbers, and Price-149 is an experimental dataset validated by Price et al. [50]. In comparison to six state-of-the-art EC number predictors (i.e., CLEAN, ProteInfer, DeepEC [20], ECPred [51], GrAPFI, and ECPICK), GraphEC exhibited superior performance in various metrics. Figure 3A illustrates that GraphEC achieved an AUC, recall, precision, and F1 of 0.8404, 0.6908, 0.6132, and 0.6131 on Price-149, surpassing the second-best method (CLEAN) by 14.6%, 47.9%, 4.9%, and 23.9%, respectively. On NEW-392, GraphEC achieved the best values in AUC (0.8910), recall (0.7988), and F1 (0.5910) (Supplementary Fig. S4). As shown in Table S3, GraphEC is able to achieve high EC number coverage (5106 EC numbers) while maintaining high performance. Benefiting from the contrastive learning-based representation, CLEAN achieved high precision, but its recall and F1 were 39.8% and 15.6% lower than those of GraphEC, respectively. Relying on label propagation on a protein domain similarity graph, GrAPFI [22] achieved acceptable performance, with AUC values of 0.5095 and 0.5407 on Price-149 and NEW-392, respectively (Table S2).
ECPICK [23] attained the third-best performance through the implementation of a convolutional neural network and hierarchical module, achieving the AUC values of 0.5888 and 0.6502 on Price-149 and NEW-392 (Table S2 ), respectively. Source data are provided as a Source Data file. GraphEC was further evaluated on different levels of EC numbers and the frequency of each EC number in the training set. Considering the potential impact of EC number frequency in the training set on model performance, precision on NEW-392 was evaluated based on the number of times that the EC number appeared in the training set. (Fig. 3 B). More than 66.0% of enzymes have less than ten occurrences and only 8.9% of enzymes have more than 100 occurrences, demonstrating the challenge of the dataset. As expected, predicting EC numbers with low frequency proved to be difficult. However, GraphEC consistently exhibited higher precision at different occurrences of EC numbers compared to other methods, highlighting the superior performance of our model. The four digits of the EC number correspond to different levels of enzyme functional classification, with the first to fourth digits indicating a hierarchical breakdown. The recall of GraphEC on NEW-392, compared to CLEAN, improved by 1.1%, 1.7%, 3.4%, and 66.0% from the first level to the fourth level, with values of 0.9468, 0.9116, 0.8945, and 0.7988 (Fig. 3 C). The superiority of GraphEC becomes more apparent as the level increases, indicating the effectiveness of our model. Source data are provided as a Source Data file. Considering the utilization of active sites in EC number prediction, we have evaluated the impact of mutations in the active sites. We first identified the active sites of enzymes on NEW-392 and Price-149 based on the predicted results (score > 0.5). Subsequently, these active sites were mutated to Alanine (A), and the predicted scores for true EC numbers were compared before and after the mutation. 
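The in-silico mutation procedure described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `mutate_active_sites`, the toy sequence, and the toy scores are hypothetical, and only the 0.5 score threshold and the substitution to alanine come from the text.

```python
def mutate_active_sites(sequence, site_scores, threshold=0.5, replacement="A"):
    """Replace every residue whose predicted active-site score exceeds
    `threshold` with `replacement` (alanine by default).

    Returns the mutated sequence and the 1-based positions that were mutated.
    """
    if len(sequence) != len(site_scores):
        raise ValueError("sequence and site_scores must have the same length")
    mutated = []
    positions = []
    for index, (residue, score) in enumerate(zip(sequence, site_scores), start=1):
        if score > threshold:
            mutated.append(replacement)
            positions.append(index)
        else:
            mutated.append(residue)
    return "".join(mutated), positions


# Toy input (made up for illustration): a 5-residue sequence with per-residue scores.
toy_sequence = "MSHDK"
toy_scores = [0.1, 0.9, 0.7, 0.2, 0.5]

mutated_sequence, mutated_positions = mutate_active_sites(toy_sequence, toy_scores)
print(mutated_sequence, mutated_positions)
```

In this toy case the residue with a score of exactly 0.5 is left unchanged, because the criterion in the text is a strict inequality (score > 0.5). The original and mutated sequences would then each be scored by the EC number predictor to compare the predicted scores for the true EC number.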
After mutation, the predicted scores for the true EC numbers decreased (Fig. S6), demonstrating the influence of active-site mutations on EC number prediction. Among the mutated enzymes, 59.1% were identified as non-enzymes after mutation, such as L-2-hydroxyglutarate dehydrogenase (UniProt ID: A0A011QK89) and farnesyl pyrophosphate synthase (UniProt ID: B4YA15) (more cases can be seen in Table S4). Furthermore, the predicted scores for active sites before and after the mutation were compared, revealing a reduction in the predicted active-site scores after mutation (Fig. S7). This indicates a reduced focus of the model on the mutated active sites. Additionally, we compared the average computational time per protein of different methods on Price-149. The average inference time for GraphEC is 0.26 seconds (s), while CLEAN, ProteInfer, and DeepEC have inference times of 1.28 s, 0.21 s, and 0.14 s, respectively (Fig. S8). Because CLEAN needs considerable time to compute the pairwise distances between the query sequence and each EC number cluster center, its inference takes about 4.9 times as long as that of GraphEC (1.28 s versus 0.26 s, i.e., GraphEC is 392.3% faster). Combining the time required for ESMFold to compute a protein structure (11.44 s) with the inference time of GraphEC (0.26 s), a total of 11.7 s is necessary for each enzyme. At this rate, computing the functions of 1,000 enzymes requires just 3.25 hours, thereby meeting the needs of high-throughput analysis.

The ablation studies of GraphEC

The ablation studies of GraphEC were conducted to investigate the contribution of each module. When removing label diffusion, the AUC values slightly decreased (Fig. 3D), likely because GraphEC itself is able to learn homology information. The removal of active site guidance resulted in a decrease of 2.8% and 3.5% in AUC on NEW-392 and Price-149, demonstrating its great importance.
To evaluate the impact of ESMFold-predicted structures, a geometrically agnostic baseline (BiLSTM) was constructed. Without structural information, the AUC decreased by 4.8% and 2.1% on NEW-392 and Price-149, indicating the crucial role of predicted structures. The ProtTrans embeddings were used to enhance the node features, and their removal led to a decrease in AUC of 6.6% and 2.8%. The ProtTrans embeddings used here are residue-level representations, which differ from the protein-level ESM-1b representations (mean representations) used in CLEAN (Supplementary Fig. S9). Additionally, we evaluated the effects of physicochemical properties with reference to previous studies [52, 53]. The incorporation of these physicochemical properties failed to further improve the performance of GraphEC (Table S5), suggesting that the geometric features and language model embeddings used in this study may already capture the physicochemical properties. As shown in Fig. 3E, the learned geometric embeddings (GraphEC embeddings) were compared with ProtTrans embeddings and one-hot embeddings on NEW-392. Among the ten most frequent EC numbers, the one-hot embeddings exhibited limited discriminative capacity. The ProtTrans embeddings can roughly distinguish these EC numbers, yet they cannot cluster the categories to which 3.1.2.22 and 4.2.1.113 belong. In contrast, GraphEC embeddings can clearly separate these EC numbers, demonstrating their strong expressive ability for different EC numbers. Similarly, on Price-149, the one-hot embeddings lacked the ability to distinguish EC numbers, while the ProtTrans embeddings provided a basic distinction, and the GraphEC embeddings were able to further differentiate them (Supplementary Fig. S10).
To evaluate the importance of predicted structures, we replaced the ESMFold-predicted structures with those predicted by AlphaFold2. Utilizing the AlphaFold2-predicted structures, the AUC, recall, precision, and F1 on NEW-392 are 0.9004, 0.8267, 0.5745, and 0.6044, respectively (Table S6), slightly higher than those obtained with ESMFold-predicted structures. On Price-149, comparable performance was obtained when utilizing AlphaFold2-predicted and ESMFold-predicted structures. These results indicate that ESMFold can generate structures with comparable accuracy in much less time than AlphaFold2. In addition, we evaluated the impact of various cut-off distances (8 Å, 12 Å, and 14 Å) relative to 10 Å on model performance. When the distance is 8 Å, the AUC, recall, precision, and F1 of the model are 0.8761, 0.7729, 0.5577, and 0.5459 on NEW-392 (Table S7), lower by 1.7%, 3.2%, 2.3%, and 7.6% than when the distance is 10 Å. This may be because the smaller distance reduces the number of neighbor nodes associated with each node, ultimately causing some information loss. When the distance is 12 Å and 14 Å, the AUCs of the model are 0.8876 and 0.8753 on NEW-392, respectively, 0.4% and 1.8% lower than when the distance is 10 Å (0.8910). This might be because a larger distance allows each node to have more edges, resulting in excessive aggregation of information from neighbor nodes during the iterative process, which eventually reduces the node specificity. Similar results on Price-149 are presented in Table S7.

GraphEC captures the functional regions of enzymes

To verify whether GraphEC can identify functional regions, we studied the connections between predicted enzyme active sites, multi-head attention scores, and true active sites. As shown in Fig. 3F, the true active sites of Acyl-protein thioesterase 2 are S122, D176, and H210, which were correctly predicted by GraphEC-AS and used to guide the EC number prediction.
The multi-head attention scores tended to be higher near the true active sites, suggesting that the model can focus on the functional regions. Similarly, the enzyme active sites of Proline racemase were accurately identified and the multi-head attention scores were prominent near the true active sites (Fig. 3G). Additional cases can be seen in Supplementary Fig. S11. These results indicate that GraphEC can capture the functional regions of enzymes.

The prediction of enzyme optimum pH

Since pH is important for enzyme function, we also included enzyme optimum pH prediction. To train the model, we curated a new dataset from the BRENDA database (released in January 2023) [54] (Supplementary Fig. S12), including 4110 proteins with sequence identity of < 25%. The dataset was divided into a training set (Brenda-train, 3297 enzymes) and an independent test set (Brenda-test, 813 enzymes) with a ratio of 4:1 according to the deposit time. As shown in Fig. 4A, GraphEC-pH achieved an AUPR (area under the precision-recall curve) of 0.9321 for five-fold cross-validation and 0.9170 on the test set, indicating the model’s robustness. When the structural information was removed, the AUPR of GraphEC-pH (w/o structures) decreased by 1.4%. In comparison, the two latest methods, EpHod [55] and EpHod_SVR, achieved lower performance, with points located below the precision-recall curve of GraphEC-pH. Correspondingly, the F1, recall, and precision of GraphEC-pH were 0.8487, 0.8672, and 0.8461, surpassing the second-best method (EpHod) by 9.2%, 16.5%, and 0.09%, respectively (Fig. 4B). These results demonstrate the superior performance of our model. We then evaluated the model’s ability to discern differences among 289 homologous enzyme pairs found by DIAMOND in Brenda-test.
About 87.9% (254 pairs) of the homologous enzyme pairs have the same type of optimum pH (i.e., “acidic” - “acidic” and “non-acidic” - “non-acidic”), and GraphEC-pH can correctly identify 95.7% of them (243 pairs). Only 35 pairs of enzymes exhibit different optimum pH types (i.e., “acidic” - “non-acidic”), with GraphEC-pH correctly distinguishing 14 pairs (Table S8), which is 75% more than EpHod (8 pairs). These results indicate that GraphEC-pH can discern the differences among homologous enzymes to some extent.

GraphEC learns functional information from enzyme structures

To discover new enzyme functions, a total of 570,830 protein sequences were collected from Swiss-Prot (January 2024 release). After removing the proteins with sequence identity greater than 25% to each other and those with identity above 25% to the training dataset, 52,037 proteins without EC number annotations remained. These proteins were annotated by GraphEC and CLEAN, with over 21% of them receiving the same EC number annotation from both methods. For each protein, the predicted EC number was obtained and TM-scores were calculated against proteins sharing the same EC number in the training set. The maximum TM-score of each protein was then used for further analysis. GraphEC generally has a higher score, with over 82% of the proteins found by Foldseek [56] showing a higher TM-score compared to CLEAN. When comparing the number of enzymes whose maximum TM-scores exceeded various thresholds (Fig. 5A), GraphEC surpassed CLEAN by 158%, 136%, 128%, and 128% at thresholds of 0.5, 0.7, 0.8, and 0.9, respectively. The newly discovered enzyme functions identified by GraphEC, but not by CLEAN, with maximum TM-scores surpassing 0.8 are partially listed in Supplementary Dataset 1. Despite low sequence similarity, GraphEC can learn functional information from enzymes with high structural similarity (Fig. 5B).
Even when the TM-score is low, the enzyme pocket (details shown in Methods) around the enzyme active sites can still be aligned (Fig. 5C), demonstrating the capacity of GraphEC to learn critical functional information from enzyme structures. Additionally, an example (Q9NWA0) with disordered regions was found to align to the enzyme pocket of Q980B8 in the training set (Supplementary Fig. S13F), which indicates the potential of our method for identifying the functions of disordered proteins. More cases are available in Supplementary Fig. S13.

Discussion

GraphEC is a geometric graph learning-based EC number predictor based on the enzyme active sites and predicted structures. The predicted active sites can guide the learning because of their crucial role in enzyme function. Based on the ESMFold-predicted structures, geometric graph learning can efficiently extract structural information, which is especially necessary when homology information is lacking. Additionally, a label diffusion algorithm and ProtTrans embeddings improve the model performance. For an enzyme, the EC number, active sites, and optimum pH can be analyzed comprehensively. Despite the essential role of EC numbers, current EC number prediction technologies have not fully recognized the importance of enzyme active sites and structural characteristics. The enzyme active sites represent the chemical reaction regions, which we first predict and use to guide subsequent learning. Due to the limited availability of native structures, current methods for EC number prediction do not fully exploit the information from protein structures. Benefiting from the rapid and precise structure prediction of ESMFold, GraphEC utilizes geometric graph learning to extract important structural information and surpass state-of-the-art methods. Experiments demonstrate the efficacy of our model in predicting active sites, EC numbers, and optimum pH.
Furthermore, GraphEC is shown to extract functional information from enzyme structures even in the absence of homology information, emphasizing the effectiveness of geometric graph learning. Although GraphEC has shown strong performance, there is still room for improvement in several aspects. Considering the impact of predicted structure quality, we can explore enhancing the stability of the model by either improving the structural quality or incorporating additional sequence features. Additionally, as large language models continue to advance, we can utilize them to extract essential information from textual descriptions and enhance our model's predictions. In summary, we have developed an accurate and fast EC number predictor, GraphEC. Researchers can use it to accurately predict enzyme function solely from the enzyme sequences. For specific enzymes, we can further analyze their functional regions (active sites) and determine their reaction conditions (pH), which will be helpful for experimental investigations.

Methods

Dataset construction

To predict the enzyme active sites, we collected eight enzyme datasets and constructed new training and test sets from them. The eight datasets, namely NN [57], PC [58], HA superfamily [47], EF family [59], EF superfamily, EF fold, T-37, and T-124 [46], collectively contain a total of 987 proteins. T-124, containing 124 proteins, was used as the test set (TS124), while the remaining 863 proteins were utilized as a training set. To exclude sequences with high identity, the chains in the training set that share > 25% identity with TS124 were removed using MMseqs2 [60], resulting in 588 sequences in the training set (Train588). For EC number prediction, following CLEAN [11], more than 220,000 enzyme sequences were extracted from UniProt [61], and a training set of 74,487 sequences for enzyme EC number identification was constructed through 70% clustering.
Two independent test sets were used to evaluate the model performance. The first is NEW-392, which collected data from Swiss-Prot released after April 2022. In NEW-392, 392 enzyme sequences were included, encompassing a total of 177 EC numbers. The second is Price-149, an experimental dataset of 149 enzyme sequences described by Price et al. [50]. For predicting the enzyme optimum pH, 11383 enzymes were collected from BRENDA (released in January 2023) [54], which provides the experimental optimum pH for enzyme-catalyzed reactions. After removing the similar sequences with > 25% identity, 4110 enzymes remained and were ranked by release time. The latest 813 sequences (about 20%) were utilized as the test set (Brenda-test), while the remaining sequences were used as the training set (Brenda-train).

The architecture of the model

As shown in Fig. 1, protein structures are predicted using ESMFold to construct the protein graph and sequence embeddings are extracted via ProtTrans, which are then fed into a featurizer layer to obtain node and edge features. These features are employed to obtain geometric embeddings through geometric graph learning. Based on the embeddings, enzyme active sites are predicted and a weight score is assigned to every residue. Using these weight scores, enzyme EC numbers are identified with an attention layer and label diffusion. In addition, to better determine the reaction conditions, the model is subsequently expanded to optimum pH prediction by incorporating attention pooling.

Featurizer layer

A protein is represented as a radius graph constructed from the \(C_{\alpha}\) atoms of residues, where the radius defaults to 10 Å. The protein graph comprises the adjacency matrix, as well as node and edge features, which are derived from a local coordinate system. The \(C_{\alpha}\), \(C\), and \(N\) atoms of residue \(i\) are employed to build the coordinate system \(Q_i = [b_i,\ n_i,\ b_i \times n_i]\).
Formally, we define:

$$u_i = C_{\alpha_i} - N_i,\quad v_i = C_i - C_{\alpha_i},\quad b_i = \frac{u_i - v_i}{\lVert u_i - v_i \rVert},\quad n_i = \frac{u_i \times v_i}{\lVert u_i \times v_i \rVert} \tag{1}$$

Based on the local coordinate system, the node and edge features are defined as follows.

(i) Node features. Consider two atoms \(A, B \in \{C_i,\ C_{\alpha_i},\ N_i,\ O_i,\ R_i\}\), where \(C_i\), \(C_{\alpha_i}\), \(N_i\), and \(O_i\) are four atoms of residue \(i\) and \(R_i\) denotes the centroid of its sidechain atoms. From the relationship between \(A\) and \(B\), distance, direction, and angle features are computed for each residue. The distance features are \(RBF(\lVert A - B \rVert)\), where \(A \ne B\) and \(RBF\) is a radial basis function. The direction features are defined as \(Q_i^{T}\frac{A - C_{\alpha_i}}{\lVert A - C_{\alpha_i} \rVert}\), indicating the direction of the other atoms relative to \(C_{\alpha_i}\). To adequately reflect the geometry of the backbone, the torsion angles (\(\varphi_i, \psi_i, \omega_i\)) and bond angles (\(\alpha_i, \beta_i, \gamma_i\)) are used, and their sine and cosine values serve as angle features.

To enhance the node features, a pre-trained language model (ProtTrans) was used to extract informative protein embeddings from sequences. ProtTrans is a transformer-based pre-trained language model with 3B parameters, trained on BFD and fine-tuned on UniRef50 using BERT's denoising objective. Besides the sequence, we also extracted additional information from the structures.
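As an illustration of the geometric node features described above, the following NumPy sketch builds the local frame of Eq. (1) for one residue and computes one distance feature and one direction feature. It is a minimal sketch under stated assumptions, not the authors' implementation: the Gaussian form of the radial basis function, its 16 centers between 0 and 20 Å, the function names, and the toy coordinates are all hypothetical choices that the text does not specify.

```python
import numpy as np

def local_frame(n_xyz, ca_xyz, c_xyz):
    """Frame Q_i = [b_i, n_i, b_i x n_i] of Eq. (1), stored as columns."""
    u = ca_xyz - n_xyz                      # u_i = C_alpha_i - N_i
    v = c_xyz - ca_xyz                      # v_i = C_i - C_alpha_i
    b = (u - v) / np.linalg.norm(u - v)     # normalized u_i - v_i
    n = np.cross(u, v)
    n = n / np.linalg.norm(n)               # unit normal of the backbone plane
    return np.stack([b, n, np.cross(b, n)], axis=1)

def rbf(distance, d_min=0.0, d_max=20.0, n_centers=16):
    """Gaussian radial basis expansion (assumed form; not given in the text)."""
    centers = np.linspace(d_min, d_max, n_centers)
    width = (d_max - d_min) / n_centers
    return np.exp(-((distance - centers) / width) ** 2)

def direction_feature(frame, atom_xyz, ca_xyz):
    """Q_i^T (A - C_alpha_i) / ||A - C_alpha_i||: direction of atom A in the local frame."""
    offset = atom_xyz - ca_xyz
    return frame.T @ (offset / np.linalg.norm(offset))

# Hypothetical backbone coordinates (in Angstrom) for a single residue
N = np.array([-0.5, 1.4, 0.0])
CA = np.array([0.0, 0.0, 0.0])
C = np.array([1.5, 0.0, 0.0])

Q = local_frame(N, CA, C)
dist_feat = rbf(np.linalg.norm(C - CA))     # distance feature for the pair (C_alpha, C)
dir_feat = direction_feature(Q, C, CA)      # direction of C relative to C_alpha
```

Because \(n_i\) is perpendicular to both \(u_i\) and \(v_i\), the three columns of \(Q_i\) are orthonormal, so expressing a direction in this frame makes the feature independent of the global orientation of the structure.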
DSSP was used to compute structural properties, including the one-hot secondary structure profile and the relative solvent accessibility, which were used to further enhance the node features.

(ii) Edge features. For atom pairs \(A \in \{C_i,\ C_{\alpha_i},\ N_i,\ O_i,\ R_i\}\) and \(D \in \{C_j,\ C_{\alpha_j},\ N_j,\ O_j,\ R_j\}\), belonging to residues \(i\) and \(j\) respectively, the edge features are defined similarly and include distance, direction, and orientation features. The distance features between residues \(i\) and \(j\) are \(RBF(\lVert A - D \rVert)\), describing the distances within a given residue pair. The direction features are defined as \(Q_i^{T}\frac{D - C_{\alpha_i}}{\lVert D - C_{\alpha_i} \rVert}\), denoting the direction of the atoms in residue \(j\) relative to \(C_{\alpha_i}\). To represent the relative rotation between the local coordinate systems, \(q(Q_i^{T} Q_j)\) is computed as the orientation feature, where \(q\) is a quaternion encoding function [62].

Geometric graph learning

The node and edge features obtained from the featurizer layer were fed into several GNN layers for geometric graph learning. To learn multi-scale residue interactions, node update, edge update, and global context attention modules were employed at the node, edge, and global context levels, respectively.

(i) Node update. Because the transformer is a powerful model for both sequence and graph data [63, 64], we employed its multi-head attention mechanism for efficient message passing. The feature vectors of node \(i\) and edge \(j \to i\) in layer \(l\) are denoted \(h_i^{l}\) and \(e_{ji}^{l}\), and were transformed into a \(d\)-dimensional space before the GNN operation.
To update node \(i\) in layer \(l\), message passing is executed as follows:

$$\hat{h}_i^{l+1} = h_i^{l} + \sum_{j \in NB_i \cup \{i\}} \alpha_{ji}^{l}\left(W_V^{l} h_j^{l} + W_E^{l} e_{ji}^{l}\right) \tag{2}$$

where the attention weight \(\alpha_{ji}^{l}\) is computed as:

$$\begin{cases} w_{ji}^{l} = \dfrac{\left(W_Q^{l} h_i^{l}\right)^{T}\left(W_K^{l} h_j^{l} + W_E^{l} e_{ji}^{l}\right)}{\sqrt{d}} \\[2ex] \alpha_{ji}^{l} = \dfrac{\exp\left(w_{ji}^{l}\right)}{\sum_{k \in NB_i \cup \{i\}} \exp\left(w_{ki}^{l}\right)} \end{cases} \tag{3}$$

Here \(W_Q^{l}\), \(W_K^{l}\), and \(W_V^{l}\) are three weight matrices that convert the node vectors to query, key, and value representations, respectively. The key and value representations are further supplemented by the edge vectors through the weight matrix \(W_E^{l}\), and \(NB_i\) denotes the neighbors of node \(i\). The queries, keys, and values are projected multiple times, the attention functions are performed in parallel, and their outputs are concatenated.

(ii) Edge update. The edge features are updated from the adjacent nodes to enhance model performance:

$$e_{ji}^{l+1} = e_{ji}^{l} + \mathrm{EdgeMLP}\left(\hat{h}_j^{l+1} \,\Vert\, e_{ji}^{l} \,\Vert\, \hat{h}_i^{l+1}\right) \tag{4}$$

where \(\mathrm{EdgeMLP}\) denotes the MLP operation for edge updates and \(\Vert\) represents the concatenation operation.

(iii) Global context attention. Although local interactions are crucial for learning residue representations, global information has also been shown to improve performance. However, the increased computational overhead of calculating global attention poses a major challenge.
To reduce the complexity, a global context vector is computed first and then applied to the node representations through gated attention [36]:

$$\begin{cases} c^{l} = \dfrac{1}{n}\sum_{k=0}^{n-1} \hat{h}_k^{l+1} \\[2ex] h_i^{l+1} = \hat{h}_i^{l+1} \odot \sigma\left(\mathrm{GateMLP}\left(c^{l}\right)\right) \end{cases} \tag{5}$$

where \(n\) is the number of residues in a protein, \(\sigma\) is the sigmoid function, \(\odot\) is the element-wise product, and \(\mathrm{GateMLP}\) denotes the MLP for gated attention.

Enzyme active site prediction (GraphEC-AS)

Because enzyme active sites play an important role in enzyme function, we first predict the active sites before identifying the EC numbers. The geometric embeddings obtained from geometric graph learning were fed into an MLP layer to assign a score to each residue, indicating its likelihood of belonging to an active site. Using these scores, each residue was assigned a weight representing its level of importance.

The identification of EC numbers (GraphEC)

Guided by the weight scores generated by GraphEC-AS, an EC number predictor was built. The previously generated geometric embeddings were further input to an attention layer, where the attention functions were performed in parallel with the multi-head attention mechanism. By integrating the multi-head attention and the weight scores, the residue-level information was aggregated to the protein level through a pooling layer. After pooling, the initial prediction was obtained, and a label diffusion algorithm was applied to enhance the prediction using DIAMOND. The label diffusion algorithm, which follows S2F [44], was used to incorporate homology information. Following label diffusion, the final prediction was generated, with EC number identification treated as a multilabel classification task.
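The three updates of one GNN layer, Eqs. (2) to (5), can be sketched in NumPy as follows. This is a hedged, single-head illustration on a toy dense graph rather than the authors' PyTorch implementation: the two-layer ReLU form of EdgeMLP, the single linear layer used for GateMLP, the omission of biases and of the multi-head split, and all names and sizes are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def node_update(h, e, mask, W_q, W_k, W_v, W_e):
    """Eqs. (2)-(3) with one attention head.

    h:    (n, d) node vectors h_i
    e:    (n, n, d) edge vectors, e[j, i] belongs to edge j -> i
    mask: (n, n) bool, mask[j, i] is True when j is in NB_i or j == i
    """
    d = h.shape[1]
    q, k, v = h @ W_q.T, h @ W_k.T, h @ W_v.T
    e_proj = e @ W_e.T                                 # W_E e_ji for every pair
    keys = k[:, None, :] + e_proj                      # W_K h_j + W_E e_ji
    values = v[:, None, :] + e_proj                    # W_V h_j + W_E e_ji
    w = np.einsum("id,jid->ji", q, keys) / np.sqrt(d)  # w[j, i] of Eq. (3)
    w = np.where(mask, w, -np.inf)                     # keep only NB_i and i itself
    w = w - w.max(axis=0, keepdims=True)               # stabilize the softmax
    alpha = np.exp(w)
    alpha /= alpha.sum(axis=0, keepdims=True)          # softmax over sources j
    h_hat = h + np.einsum("ji,jid->id", alpha, values) # residual update, Eq. (2)
    return h_hat, alpha

def edge_update(h_hat, e, W1, W2):
    """Eq. (4); EdgeMLP is assumed to be a two-layer ReLU MLP without biases."""
    n, d = h_hat.shape
    h_src = np.broadcast_to(h_hat[:, None, :], (n, n, d))  # h_hat_j
    h_dst = np.broadcast_to(h_hat[None, :, :], (n, n, d))  # h_hat_i
    x = np.concatenate([h_src, e, h_dst], axis=-1)         # h_hat_j || e_ji || h_hat_i
    return e + np.maximum(x @ W1, 0.0) @ W2

def global_context_gate(h_hat, W_g):
    """Eq. (5); GateMLP is assumed to be a single linear layer."""
    c = h_hat.mean(axis=0)                             # global context vector c
    return h_hat * sigmoid(W_g @ c)                    # element-wise gating of every node

# Toy graph: a chain 0-1-2-3-4 with self-loops (hypothetical sizes)
n, d = 5, 8
h = rng.normal(size=(n, d))
e = rng.normal(size=(n, n, d))
mask = np.eye(n, dtype=bool)
for i in range(n - 1):
    mask[i, i + 1] = mask[i + 1, i] = True

W_q, W_k, W_v, W_e, W_g, W2 = (rng.normal(scale=0.1, size=(d, d)) for _ in range(6))
W1 = rng.normal(scale=0.1, size=(3 * d, d))

h_hat, alpha = node_update(h, e, mask, W_q, W_k, W_v, W_e)
e_new = edge_update(h_hat, e, W1, W2)
h_new = global_context_gate(h_hat, W_g)
```

The gate costs only one mean over residues plus one small matrix product per layer, which is what makes it a cheap substitute for full all-pairs global attention. In this dense toy layout the edge update is computed for every residue pair, but `node_update` only reads the pairs allowed by `mask`.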
Enzyme optimum pH prediction (GraphEC-pH)

Since enzymes require suitable environmental conditions to exert their catalytic activity, we further predicted the optimum pH of each enzyme. The pH values were grouped into three categories: acidic (less than 5), neutral (between 5 and 9), and alkaline (greater than 9). To obtain a representation for predicting the optimum pH, multi-head attention was applied to the geometric embeddings derived from geometric graph learning, and an MLP layer was then used to predict the optimum pH. Combining the identification of enzyme function with the prediction of optimum pH provides a more effective guide for actual experiments.

Hierarchy of catalytic functions

The Enzyme Commission (EC) number is a numerical system used to classify enzymes according to the reactions they catalyze. Each EC number comprises four digits, which hierarchically categorize enzymes based on their catalytic reaction types and specific substrates [65] (e.g., EC 1.3.1.32 represents maleylacetate reductase). In this study, we collected 5,106 EC numbers from the training set and defined a label vector of length 5,106, where each position corresponds to a specific EC number.

The protein language model (ProtTrans)

The informative sequence embeddings were generated with the pre-trained language model ProtT5-XL-U50 (ProtTrans [39]). ProtTrans is a transformer-based autoencoder known as T5 [66], which has been pre-trained on UniRef50 [67] to predict masked amino acids. The features from the final layer of the ProtTrans encoder were used to enhance the node representations.

Protein structure prediction using a language model (ESMFold)

ESMFold [30] is a large language model with up to 15B parameters, developed on the premise that language models can capture evolutionary patterns across millions of sequences.
ESMFold achieves accurate and fast structure prediction, reducing inference time by as much as 60 times compared with the state-of-the-art method. This efficiency has enabled the first evolutionary-scale structural characterization of a metagenomic resource. In this study, we employed ESMFold to predict the protein structures, which were then used in the subsequent geometric graph learning.

Label diffusion algorithm

To enhance the initial predictions of EC numbers, a label diffusion algorithm [44, 68] was applied during the testing phase. First, the sequences in the training set that are similar to the test sequences were found using DIAMOND [15]. Second, based on the sequence identity of protein pairs, a homology network \(M \in \mathbb{R}^{T \times T}\) was constructed, where \(T\) is the number of proteins in the test set plus the number of hits in the training set. Then, to measure the degree to which a pair of proteins belongs to the same community within the homology network, a Jaccard similarity matrix was defined as follows:

$$J_{ij} = \frac{\sum_{z} M_{iz} M_{jz}}{\sum_{z} M_{iz} + \sum_{z} M_{jz} - \sum_{z} M_{iz} M_{jz}} \tag{6}$$

For a target EC number \(x\), the \(x\)-th column of the final annotation matrix \(S\), denoted \(S_x\), was learned by minimizing the cost function \(P(S_x)\):

$$P(S_x) = \sum_{i=1}^{T}\left(S_{ix} - Y_{ix}\right)^{2} + \frac{\epsilon}{2}\sum_{i=1}^{T}\frac{1}{d_i}\sum_{j=1}^{T} J_{ij} M_{ij}\left(S_{ix} - S_{jx}\right)^{2} \tag{7}$$

where \(\epsilon\) is the regularization parameter. The first term preserves the initial labels (\(Y_{ix}\)), and the second term enforces consistency between the labels of adjacent nodes.
Here, \(\frac{1}{d_i}\) is defined as:

$$\frac{1}{d_i} = \frac{1}{\sum_{j} J_{ij} M_{ij}} \tag{8}$$

Furthermore, we define \(M^{1}\) as:

$$M^{1}_{ij} = \frac{1}{2}\left(\frac{1}{d_i} + \frac{1}{d_j}\right) J_{ij} M_{ij} \tag{9}$$

Its Laplacian matrix \(L\) is:

$$L = DM - M^{1} \tag{10}$$

where \(DM\) is the diagonal degree matrix of \(M^{1}\). The minimizer of \(P(S_x)\) then has the closed-form solution:

$$S = \left(I + \epsilon L\right)^{-1} Y \tag{11}$$

where \(S\) is the updated annotation matrix, \(I \in \mathbb{R}^{T \times T}\) is the identity matrix, and \(Y\) combines the training set labels with the initial predictions for the test set.

Constructing the enzyme pocket from predicted enzyme active sites

The construction of the enzyme pocket involved two steps. First, the predicted enzyme active sites were clustered with k-means, where k was set to 2 empirically. To eliminate false positives, isolated points that formed a cluster on their own were removed. Second, using the \(C_{\alpha}\) coordinates, the enzyme pocket was defined as the region within 10 Å of the cluster center.

Implementation and evaluation

Five-fold cross-validation was performed on the training data: each time, the model was trained on four folds and validated on the remaining fold. This operation was repeated five times, with the best model saved at each iteration. After training, several independent test sets were used to assess model performance on the different tasks.

For enzyme active site prediction, TS124 was used to compare GraphEC-AS with other methods. The performance of GraphEC in predicting EC numbers was evaluated on NEW-392 and Price-149.
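Returning to the label diffusion algorithm, Eqs. (6) and (8) to (11) can be followed step by step in a few lines of NumPy. The sketch below is illustrative only and rests on several assumptions that the text does not state: the homology network is taken to be symmetric with sequence identities in [0, 1] and a unit diagonal (each protein is identical to itself), proteins with \(d_i = 0\) are given \(1/d_i = 0\), and the value \(\epsilon = 1\), the function name, and the toy data are hypothetical.

```python
import numpy as np

def label_diffusion(M, Y, eps=1.0):
    """Diffuse labels Y over the homology network M, following Eqs. (6), (8)-(11).

    M: (T, T) symmetric homology network (assumed: identities in [0, 1], unit diagonal)
    Y: (T, n_labels) training labels stacked with initial test predictions
    """
    M = np.asarray(M, dtype=float)
    Y = np.asarray(Y, dtype=float)

    shared = M @ M.T                                   # sum_z M_iz M_jz
    row = M.sum(axis=1)                                # sum_z M_iz
    denom = row[:, None] + row[None, :] - shared
    J = np.divide(shared, denom, out=np.zeros_like(shared), where=denom > 0)  # Eq. (6)

    JM = J * M
    d = JM.sum(axis=1)                                 # d_i = sum_j J_ij M_ij
    inv_d = np.divide(1.0, d, out=np.zeros_like(d), where=d > 0)              # Eq. (8)
    M1 = 0.5 * (inv_d[:, None] + inv_d[None, :]) * JM  # Eq. (9)
    L = np.diag(M1.sum(axis=1)) - M1                   # Eq. (10): degree matrix minus M1
    # Eq. (11): solve (I + eps * L) S = Y instead of forming the inverse explicitly
    return np.linalg.solve(np.eye(len(M)) + eps * L, Y)

# Toy example: proteins 0 and 1 are training hits, proteins 2 and 3 are test proteins
M = np.array([
    [1.0, 0.0, 0.9, 0.0],
    [0.0, 1.0, 0.0, 0.8],
    [0.9, 0.0, 1.0, 0.0],
    [0.0, 0.8, 0.0, 1.0],
])
Y = np.array([
    [1.0, 0.0],   # training protein annotated with EC label 0
    [0.0, 1.0],   # training protein annotated with EC label 1
    [0.2, 0.1],   # initial network predictions for test protein 2
    [0.1, 0.3],   # initial network predictions for test protein 3
])
S = label_diffusion(M, Y)
```

In this toy network, test protein 2 is homologous to training protein 0, so its score for label 0 is pulled upward from its initial value of 0.2. Because \(L\) is a symmetric Laplacian with zero row sums, the diffusion redistributes scores among connected proteins without changing the column totals of \(Y\).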
To test the accuracy of GraphEC-pH in predicting the enzyme optimum pH, a new independent test set (Brenda-test) was built, and two of the latest methods were evaluated on it. During testing, the predictions of the five models from the cross-validation were averaged to give the final predictions. Specifically, PyTorch 1.13.1 was used to construct the geometric graph network, which consists of a 3-layer GNN with 256 hidden units. The attention layer of GraphEC used multi-head attention with 8 attention heads. The model was optimized with the Adam optimizer using the binary cross-entropy loss. Training was limited to a maximum of 35 epochs with early stopping (patience of 4), and a dropout rate of 0.1 was used to prevent overfitting. To comprehensively evaluate model performance, AUC, AUPR, recall, precision, F1-score (F1), and Matthews correlation coefficient (MCC) were used, as defined in detail in Supplementary Evaluation metrics.

Declarations

Data availability

The enzyme function data were obtained from a previous study (CLEAN) and are available on GitHub (https://github.com/tttianhao/CLEAN/tree/main/app/data). The data on enzyme active sites were derived from a preceding work (CRpred) and are available at http://biomine.cs.vcu.edu/datasets/CRpred/CRpred.html. The data on enzyme optimum pH were newly curated from the BRENDA database (https://www.brenda-enzymes.org/) and are available at https://github.com/biomed-AI/GraphEC/tree/main/Optimum_pH/data/datasets. A figshare version is also available at https://doi.org/10.6084/m9.figshare.25714305. Source data are provided with this paper.

Code availability

The source code of GraphEC is available at https://github.com/biomed-AI/GraphEC. A Zenodo version is also available at https://doi.org/10.5281/zenodo.13375275 [69].

Author Contribution Statement

Y.S. and Y.Y. conceived and supervised the project. Y.S. and Q.Y.
made contributions to the implementation of the GraphEC algorithm. Y.S. and Y.Y. wrote the manuscript. S.C., Y.Z., H.Z. and Y.Y. participated in the discussion and proofreading.

Competing Interest Statement

The authors declare that no competing interests exist.

Acknowledgements

This study has been supported by the National Natural Science Foundation of China (T2394502) and the National Key R&D Program of China (2022YFF1203100).

References

1. Kohli, R.M. and Y. Zhang, TET enzymes, TDG and the dynamics of DNA demethylation. Nature, 2013. 502(7472): p. 472–479.
2. Makrydaki, E., et al., Immobilized enzyme cascade for targeted glycosylation. Nat Chem Biol, 2024: p. 1–10.
3. Finley, S.D., L.J. Broadbelt, and V. Hatzimanikatis, Computational framework for predictive biodegradation. Biotechnol Bioeng, 2009. 104(6): p. 1086–1097.
4. Hoffmann, B., et al., Nature and prevalence of pain in Fabry disease and its response to enzyme replacement therapy—a retrospective analysis from the Fabry Outcome Survey. Clin J Pain, 2007. 23(6): p. 535–542.
5. Nomenclature, E., Recommendations of the nomenclature committee of the international union of biochemistry and molecular biology on the nomenclature and classification of enzymes. 1992, Academic, New York.
6. Goddard, J.-P. and J.-L. Reymond, Enzyme assays for high-throughput screening. Current opinion in biotechnology, 2004. 15(4): p. 314–322.
7. Desai, D.K., et al., ModEnzA: accurate identification of metabolic enzymes using function specific profile HMMs with optimised discrimination threshold and modified emission probabilities. Advances in bioinformatics, 2011. 2011.
8. Kumar, N. and J. Skolnick, EFICAz2.5: application of a high-precision enzyme function predictor to 396 proteomes. Bioinformatics, 2012. 28(20): p. 2687–2688.
9. Roy, A., J. Yang, and Y. Zhang, COFACTOR: an accurate comparative algorithm for structure-based protein function annotation. Nucleic acids research, 2012. 40(W1): p. W471-W477.
10. Zhang, C., P.L. Freddolino, and Y. Zhang, COFACTOR: improved protein function prediction by combining structure, sequence and protein–protein interaction information. Nucleic acids research, 2017. 45(W1): p. W291-W299.
11. Yu, T., et al., Enzyme function prediction using contrastive learning. Science, 2023. 379(6639): p. 1358–1363.
12. Sanderson, T., et al., ProteInfer, deep neural networks for protein functional inference. Elife, 2023. 12: p. e80942.
13. Zou, H.-L. and X. Xiao, Classifying multifunctional enzymes by incorporating three different models into Chou’s general pseudo amino acid composition. The Journal of membrane biology, 2016. 249(4): p. 551–557.
14. Altschul, S.F., et al., Basic local alignment search tool. Journal of molecular biology, 1990. 215(3): p. 403–410.
15. Buchfink, B., C. Xie, and D.H. Huson, Fast and sensitive protein alignment using DIAMOND. Nature methods, 2015. 12(1): p. 59–60.
16. Yang, J., et al., The I-TASSER Suite: protein structure and function prediction. Nature methods, 2015. 12(1): p. 7–8.
17. Yang, J., A. Roy, and Y. Zhang, BioLiP: a semi-manually curated database for biologically relevant ligand–protein interactions. Nucleic acids research, 2012. 41(D1): p. D1096-D1103.
18. Volpato, V., A. Adelfio, and G. Pollastri, Accurate prediction of protein enzymatic class by N-to-1 Neural Networks. BMC bioinformatics, 2013. 14(1): p. 1–7.
19. Wang, Y.-C., et al., Prediction of enzyme subfamily class via pseudo amino acid composition by incorporating the conjoint triad feature. Protein and Peptide Letters, 2010. 17(11): p. 1441–1449.
20. Ryu, J.Y., H.U. Kim, and S.Y. Lee, Deep learning enables high-quality and high-throughput prediction of enzyme commission numbers. Proceedings of the National Academy of Sciences, 2019. 116(28): p. 13996–14001.
21. Li, Y., et al., DEEPre: sequence-based enzyme EC number prediction by deep learning. Bioinformatics, 2018. 34(5): p. 760–769.
22. Sarker, B., D.W. Ritchie, and S. Aridhi, GrAPFI: predicting enzymatic function of proteins from domain similarity graphs. BMC bioinformatics, 2020. 21: p. 1–15.
23. Han, S.-R., et al., Evidential deep learning for trustworthy prediction of enzyme commission number. Briefings in Bioinformatics, 2024. 25(1): p. bbad401.
24. Heinzinger, M., et al., Contrastive learning on protein embeddings enlightens midnight zone. NAR genomics and bioinformatics, 2022. 4(2): p. lqac043.
25. Jumper, J., et al., Highly accurate protein structure prediction with AlphaFold. Nature, 2021. 596(7873): p. 583–589.
26. Yuan, Q., et al., AlphaFold2-aware protein–DNA binding site prediction using graph transformer. Briefings in Bioinformatics, 2022. 23(2): p. bbab564.
27. Yidong, S., Y. Qianmu, and Y. Yuedong, Application of deep learning in protein function prediction. Synthetic Biology Journal, 2023. 4(3): p. 488.
28. Wong, F., et al., Benchmarking AlphaFold-enabled molecular docking predictions for antibiotic discovery. Molecular Systems Biology, 2022. 18(9): p. e11081.
29. Ruff, K.M. and R.V. Pappu, AlphaFold and implications for intrinsically disordered proteins. Journal of Molecular Biology, 2021. 433(20): p. 167208.
30. Lin, Z., et al., Evolutionary-scale prediction of atomic-level protein structure with a language model. Science, 2023. 379(6637): p. 1123–1130.
31. Handelsman, J., et al., Molecular biological access to the chemistry of unknown soil microbes: a new frontier for natural products. Chemistry & biology, 1998. 5(10): p. R245-R249.
32. Song, Y., et al., Accurately identifying nucleic-acid-binding sites through geometric graph learning on language model predicted structures. Briefings in Bioinformatics, 2023. 24(6): p. bbad360.
33. Bal, R., Y. Xiao, and W. Wang, PGraphDTA: Improving Drug Target Interaction Prediction using Protein Language Models and Contact Maps. arXiv preprint arXiv:2310.04017, 2023.
34. Jing, B., et al., Learning from protein structure with geometric vector perceptrons. arXiv preprint arXiv:2009.01411, 2020.
35. Dauparas, J., et al., Robust deep learning–based protein sequence design using ProteinMPNN. Science, 2022. 378(6615): p. 49–56.
36. Gao, Z., et al., PiFold: Toward effective and efficient protein inverse folding. arXiv preprint arXiv:2209.12643, 2022.
37. Stärk, H., et al., Equibind: Geometric deep learning for drug binding structure prediction. In International conference on machine learning. 2022. PMLR.
38. Yuan, Q., C. Tian, and Y. Yang, Genome-scale annotation of protein binding sites via language model and geometric deep learning. eLife, 2024. 13: p. RP93695.
39. Elnaggar, A., et al., Prottrans: Toward understanding the language of life through self-supervised learning. IEEE transactions on pattern analysis and machine intelligence, 2021. 44(10): p. 7112–7127.
40. Rives, A., et al., Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proceedings of the National Academy of Sciences, 2021. 118(15): p. e2016239118.
41. Kahraman, A. and J.M. Thornton, Methods to characterize the structure of enzyme binding sites. Computational Structural Biology-Methods and Applications, 2008. 1: p. 189–221.
42. Torrance, J.W. and J.M. Thornton, Structure-Based Prediction of Enzymes and Their Active Sites. Prediction of Protein Structures, Functions, and Interactions, 2008: p. 187–209.
43. Roche, D.B., D.A. Brackenridge, and L.J. McGuffin, Proteins and their interacting partners: An introduction to protein–ligand binding site prediction methods. International journal of molecular sciences, 2015. 16(12): p. 29829–29842.
44. Torres, M., et al., Protein function prediction for newly sequenced organisms. Nature Machine Intelligence, 2021. 3(12): p. 1050–1060.
45. Song, J., et al., PREvaIL, an integrative approach for inferring catalytic residues using sequence, structural, and network features in a machine-learning framework. 2018. 443: p. 125–137.
46. Zhang, T., et al., Accurate sequence-based prediction of catalytic residues. Bioinformatics, 2008. 24(20): p. 2329–2338.
47. Chea, E. and D.R. Livesay, How accurate and statistically robust are catalytic site predictions based on closeness centrality? Bmc Bioinformatics, 2007. 8(1): p. 1–14.
48. Altschul, S.F., et al., Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic acids research, 1997. 25(17): p. 3389–3402.
49. Zhang, Y. and J. Skolnick, TM-align: a protein structure alignment algorithm based on the TM-score. Nucleic acids research, 2005. 33(7): p. 2302–2309.
50. Price, M.N., et al., Mutant phenotypes for thousands of bacterial genes of unknown function. Nature, 2018. 557(7706): p. 503–509.
51. Dalkiran, A., et al., ECPred: a tool for the prediction of the enzymatic functions of protein sequences based on the EC nomenclature. BMC bioinformatics, 2018. 19(1): p. 1–13.
52. Meiler, J., et al., Generation and evaluation of dimension-reduced amino acid parameter representations by artificial neural networks. Molecular modeling annual, 2001. 7(9): p. 360–369.
53. Chen, J., et al., Structure-aware protein solubility prediction from sequence through graph convolutional network and predicted contact map. Journal of cheminformatics, 2021. 13: p. 1–10.
54. Schomburg, I., et al., The BRENDA enzyme information system–From a database to an expert system. Journal of biotechnology, 2017. 261: p. 194–206.
55. Gado, J.E., et al., Deep learning prediction of enzyme optimum pH. bioRxiv, 2023: p. 2023.06.22.544776.
56. Van Kempen, M., et al., Fast and accurate protein structure search with Foldseek. Nature Biotechnology, 2024. 42(2): p. 243–246.
57. Gutteridge, A., G.J. Bartlett, and J.M. Thornton, Using a neural network and spatial clustering to predict the location of active sites in enzymes. Journal of molecular biology, 2003. 330(4): p. 719–734.
58. Petrova, N.V. and C.H. Wu, Prediction of catalytic residues using Support Vector Machine with selected protein sequence and structural properties. BMC bioinformatics, 2006. 7: p. 1–12.
59. Youn, E., et al., Evaluation of features for catalytic residue prediction in novel folds. Protein Science, 2007. 16(2): p. 216–226.
60. Steinegger, M. and J. Söding, MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature biotechnology, 2017. 35(11): p. 1026–1028.
61. UniProt: the universal protein knowledgebase in 2021. Nucleic acids research, 2021. 49(D1): p. D480-D489.
62. Huynh, D.Q., Metrics for 3D rotations: Comparison and analysis. Journal of Mathematical Imaging and Vision, 2009. 35: p. 155–164.
63. Ingraham, J., et al., Generative models for graph-based protein design. Advances in neural information processing systems, 2019. 32.
64. Song, Y., et al., Fast and accurate protein intrinsic disorder prediction by using a pretrained language model. Briefings in Bioinformatics, 2023: p. bbad173.
65. Cornish-Bowden, A., Current IUBMB recommendations on enzyme nomenclature and kinetics. Perspectives in Science, 2014. 1(1–6): p. 74–87.
66. Raffel, C., et al., Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 2020. 21(1): p. 5485–5551.
67. Suzek, B.E., et al., UniRef: comprehensive and non-redundant UniProt reference clusters. Bioinformatics, 2007. 23(10): p. 1282–1288.
68. Yuan, Q., et al., Fast and accurate protein function prediction from sequence through pretrained language model and homology-based label diffusion. Briefings in bioinformatics, 2023. 24(3): p. bbad117.
69. Song, Y., et al., Accurately predicting enzyme functions through geometric graph learning on ESMFold-predicted structures. Zenodo, 2024. https://doi.org/10.5281/zenodo.13375275.

Additional Declarations

There is NO Competing Interest.
Supplementary Files

AuthorChecklist5053841attach1315395.docx
NCOMMS2425923BRS.pdf
Supplementarymaterials.pdf

Published version: https://doi.org/10.1038/s41467-024-52533-w (published 18 Sep 2024).

Figure 1. The overview of GraphEC. Given protein sequences, ESMFold was employed to predict the protein structures, which were then utilized to construct the protein graph and extract geometric features. To augment the features, informative sequence embeddings were calculated using a pre-trained language model (ProtTrans). The prepared features were then input into a geometric graph learning network to learn geometric embeddings. These embeddings were then used to predict enzyme active sites (GraphEC-AS), with each residue being assigned a weight score. Guided by the weight scores of GraphEC-AS, the initial prediction of the EC number was obtained with the attention and pooling layers. To improve the prediction, a label diffusion algorithm is employed to account for the overlapping communities of enzymes with correlative functions. Additionally, the model is further extended to optimum pH prediction through attention pooling to better represent the practical situation (GraphEC-pH).

Figure 2. The enzyme active site prediction. (A) The receiver operating characteristic curves of GraphEC-AS and the geometrically agnostic baseline BiLSTM, as well as their comparison with other state-of-the-art methods. The error band of 5-fold cross-validation represents the standard deviation. (B) Evaluation of GraphEC-AS’s performance using three metrics (MCC, recall, and precision). Six methods were compared, where PREvaIL_RF and PREvaIL_LR represent the PREvaIL model using random forest and logistic regression algorithms; CRpred(a) and CRpred(b) represent the CRpred model using residues with coordinates and all residues; and HA(c) and HA(d) represent the HA model using the residue identity filter and the combination filter. (C to D) Visualization of the raw ProtTrans embeddings and the geometric embeddings learned by GraphEC-AS. (E) The three-dimensional structure of one example (cis-muconate cyclase, P38677) annotated by BiLSTM and (F) GraphEC-AS. Source data are provided as a Source Data file.

Figure 3. The enzyme EC number prediction. (A) The comparison between GraphEC and several state-of-the-art methods using AUC, recall, precision, and F1 on Price-149. (B) The model's precision varies depending on the frequency of the EC number in the training set. (C) The analysis of GraphEC and three methods (CLEAN, ProteInfer, and DeepEC) at four different levels. (D) The method ablation focused on the label diffusion algorithm, active site guidance, predicted protein structures, and ProtTrans embeddings. (E) Three embeddings were visualized on NEW-392, including the GraphEC embeddings, which represent the geometric embeddings learned by GraphEC, as well as the One-hot embeddings and ProtTrans embeddings, which represent the one-hot vector and ProtTrans vector, respectively. (F to G) The three-dimensional structures of Acyl-protein thioesterase 2 (O95372) and Proline racemase (E3PTZ4) were visualized, with the highlighted portion indicating higher attention scores.
Source data are provided as a Source Data file.\u003c/p\u003e","description":"","filename":"Fig.3.png","url":"https://assets-eu.researchsquare.com/files/rs-4344209/v1/3fd88bdbc724f479ed19624d.png"},{"id":64450897,"identity":"76b78dc1-badf-445d-b033-d90eb2b5057e","added_by":"auto","created_at":"2024-09-13 10:21:47","extension":"png","order_by":4,"title":"Figure 4","display":"","copyAsset":false,"role":"figure","size":337151,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eThe prediction of enzyme optimum pH.\u003c/strong\u003e(\u003cstrong\u003eA\u003c/strong\u003e) The precision-recall curves of GraphEC-pH on Brenda-test, compared with 5-fold cross-validation, geometrically agnostic baseline (GraphEC-pH w/o structures), and two of the latest methods (EpHod and EpHod_SVR). The error band of 5-fold cross-validation represents the standard deviation. (\u003cstrong\u003eB\u003c/strong\u003e) F1, recall, and precision were compared for GraphEC-pH, EpHod, and EpHod_SVR. Source data are provided as a Source Data file.\u003c/p\u003e","description":"","filename":"Fig.4.png","url":"https://assets-eu.researchsquare.com/files/rs-4344209/v1/69db2f714878baadad00ee4d.png"},{"id":64450896,"identity":"82f03d32-faf5-4ea8-b471-225f04cbf3da","added_by":"auto","created_at":"2024-09-13 10:21:47","extension":"png","order_by":5,"title":"Figure 5","display":"","copyAsset":false,"role":"figure","size":1021383,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eGraphEC can extract functional information from protein structures.\u003c/strong\u003e (A) Comparison of the number of enzymes whose maximum TM-scores exceeded various thresholds. For each protein, the predicted EC number was obtained and the TM-scores were calculated with proteins sharing the same EC number in the training set. Subsequently, the maximum TM-score was further used to compare. 
The “w/o structures” represents the baseline model (MLP) that only uses ProtTrans embeddings without structures. GraphEC has a higher TM-score compared to CLEAN in over 82% of the proteins found by Foldseek. (\u003cstrong\u003eB\u003c/strong\u003e) The alignment of ESMFold-predicted structures with low sequence similarity, where Q6GIA3 represents the enzyme in the training set and P96284 represents the protein from Swiss-Prot with less than 25% identity to the training set. Despite low sequence similarity, GraphEC has the ability to learn the functional information from enzymes with high structural similarity. (\u003cstrong\u003eC\u003c/strong\u003e) Despite a low TM-score, the enzyme pocket around the enzyme active sites can still be aligned (the highlighted area represents the enzyme pocket), demonstrating that GraphEC is able to learn functional information from structures even with low structural similarity. Q9GZX3 and O29655 represent the proteins in the training set and Swiss-Prot, respectively. 
Source data are provided as a Source Data file.\u003c/p\u003e","description":"","filename":"Fig.5.png","url":"https://assets-eu.researchsquare.com/files/rs-4344209/v1/6c4dd9259f7cbc74381515e5.png"},{"id":65431614,"identity":"17c277d4-f88b-4476-9f30-2170abe4c4f4","added_by":"auto","created_at":"2024-09-27 11:57:26","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":5462783,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-4344209/v1/9c89ed3e-77e4-4e20-9ae5-ec290fea15af.pdf"},{"id":64450197,"identity":"020353f6-3958-499c-879a-bd6a3a3a1548","added_by":"auto","created_at":"2024-09-13 10:13:47","extension":"docx","order_by":7,"title":"","display":"","copyAsset":false,"role":"supplement","size":75415,"visible":true,"origin":"","legend":"","description":"","filename":"AuthorChecklist5053841attach1315395.docx","url":"https://assets-eu.researchsquare.com/files/rs-4344209/v1/59958d82cbc64c9b6e0be7fa.docx"},{"id":64450198,"identity":"30657f45-db59-434d-baab-b60279091c35","added_by":"auto","created_at":"2024-09-13 10:13:47","extension":"pdf","order_by":8,"title":"","display":"","copyAsset":false,"role":"supplement","size":236848,"visible":true,"origin":"","legend":"","description":"","filename":"NCOMMS2425923BRS.pdf","url":"https://assets-eu.researchsquare.com/files/rs-4344209/v1/6863b4133eca4c7d3211e577.pdf"},{"id":64450199,"identity":"cfad23ec-9efe-40fc-8ae5-e1b73d360ce6","added_by":"auto","created_at":"2024-09-13 10:13:47","extension":"pdf","order_by":9,"title":"","display":"","copyAsset":false,"role":"supplement","size":15894814,"visible":true,"origin":"","legend":"","description":"","filename":"Supplementarymaterials.pdf","url":"https://assets-eu.researchsquare.com/files/rs-4344209/v1/e587a4d97bda30f716b553ac.pdf"}],"financialInterests":"There is \u003cb\u003eNO\u003c/b\u003e Competing Interest.","formattedTitle":"Accurately 
predicting enzyme functions through geometric graph learning on ESMFold-predicted structures","fulltext":[{"header":"Introduction","content":"\u003cp\u003eEnzymes play an essential role in various biological processes by catalyzing numerous reactions [\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e, \u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e]. Identifying enzyme functions is crucial for the study of metabolism [\u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e] and diseases [\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e]. Enzyme Commission (EC) number [5] is commonly utilized to formulate the enzyme function as a four-digit structure, which provides a unified scheme and expedites advancements in the field of enzyme engineering. However, the experimental determination [6] of EC numbers is time-consuming and costly. The development of computational approaches for identifying EC numbers has become imperative.\u003c/p\u003e \u003cp\u003eThe computational approaches can be categorized into homology-based [7, 8], structure-based [9, 10], and machine learning-based [11\u0026ndash;13] approaches. Homology-based approaches, assuming that highly similar enzymes have similar functions, were proposed to annotate the enzyme function with alignment tools [14, 15]. These methods rely heavily on sequence similarity, which limits their coverage while lacking similar sequences. To improve the coverage, structure-based approaches [9, 16] scanned structurally similar protein templates to identify consensus functions. For instance, COFACTOR [10] compared the query structure to proteins with known structures and functions in the BioLiP library [17] for function annotation. Despite the improvement of these methods, difficulties remain due to a lack of high-quality templates. 
To alleviate the constraints of similar sequences and templates, machine learning-based approaches have been developed. The initial machine learning-based approaches [18, 19] first extracted vital features before utilizing machine learning algorithms to identify the corresponding EC numbers. The performance of these machine learning algorithms is greatly influenced by the manually crafted features, which are not adapted to rapidly expanding enzyme sequences.\u003c/p\u003e \u003cp\u003eRecently, deep learning methods [11, 20] have achieved success in enzyme function annotation. To avoid manual feature extraction, DEEPre [21] employed CNN and RNN components to capture convolutional and sequential features. ProteInfer [12] utilized a dilated convolutional network to establish a mapping between protein space and enzyme function space. Utilizing the InterPro signatures as domain information, GrAPFI [22] performed label propagation on a weighted undirected graph. For ECPICK [23], the protein sequence was encoded using one-hot embedding, which was subsequently employed to compute the posterior probabilities of around 5000 EC numbers through convolutional and hierarchical layers. CLEAN [11], another deep learning method that learned abundant embeddings through contrastive learning [24], achieved better accuracy and EC coverage for EC number identification. Nevertheless, these methods still suffer from two limitations. Firstly, they only used protein sequences without incorporating protein structures, thus losing the crucial features implied by the structures. Secondly, the crucial information of enzyme active sites was not employed in the analysis of enzyme function.\u003c/p\u003e \u003cp\u003eDue to the lack of native structures, present methods don\u0026rsquo;t fully exploit the information from protein structures. 
AlphaFold2 [25] has made a breakthrough in protein structure prediction, with the predicted structures confirmed to be useful in DNA-binding site prediction [26, 27], antibiotic discovery [28], and the study of intrinsically disordered proteins [29]. Regrettably, the high computational demand of AlphaFold2 limits its applicability for genome-wide use. To address this issue, Lin et al. [30] proposed a pre-trained language model ESMFold for precise and quick structure prediction, attaining comparable accuracy to AlphaFold2 while significantly reducing inference time by up to 60 times. The high efficiency of ESMFold enables the analysis of protein structures in metagenomics [31], which has shown remarkable achievements in nucleic-acid-binding site prediction [32] and drug discovery [33]. With the aid of predicted structures, geometric graph learning [34], a technique that has proven beneficial in protein design [35, 36] and docking [37], can extract structural information efficiently. To augment the geometric graph learning, some studies [32, 38] have attempted to incorporate informative sequence embeddings using unsupervised language models (ProtTrans [39] and ESM-1b [40]).\u003c/p\u003e \u003cp\u003eOn the other side, enzyme active sites are typically located on the surface of enzymes and play an important role in catalyzing reactions or binding substrates [41]. They exhibit a high level of conservation in the process of evolution and significantly determine the function of enzymes [42, 43]. So obviously, it would be highly beneficial to consider the active sites of enzymes when assigning the EC numbers. Meanwhile, current methods for predicting enzyme active sites mainly rely on templates or hand-crafted features, which are unable to keep up with the rapidly growing data. This highlights the need for a fast and accurate enzyme active site predictor. 
Besides active sites, a label diffusion algorithm [44] has been developed for protein function prediction, which can transfer functionally relevant data and aid in identifying EC numbers.\u003c/p\u003e \u003cp\u003eIn this work, we proposed GraphEC (geometric Graph learning-based EC number annotation), an accurate network for enzyme function prediction based on predicted protein structures and enzyme active sites. Specifically, the enzyme active sites were identified first, as they play a critical role in predicting enzyme function. With the guidance of active sites, GraphEC was trained through geometric graph learning with the protein structures predicted by ESMFold. To improve the model performance, informative sequence embeddings were generated via a pre-trained language model (ProtTrans) to augment the node features. In addition, a label diffusion algorithm was employed to further enhance the prediction using homology information. Considering that enzyme-catalyzed reactions require specific environmental conditions, we further extended the model to enzyme optimum pH prediction, which can assist in experimental procedures. Through comprehensive comparisons on several independent tests, our model outperformed all the state-of-the-art methods in the predictions of active sites, EC number, and optimum pH. Additional analysis demonstrated that GraphEC is able to learn functional information from enzyme structures, further emphasizing the effectiveness of geometric graph learning.\u003c/p\u003e"},{"header":"Results","content":"\u003cdiv id=\"Sec3\" class=\"Section2\"\u003e \u003ch2\u003eThe overview of the model\u003c/h2\u003e \u003cp\u003eGraphEC, an accurate EC number predictor based on geometric graph learning, incorporates the enzyme active sites and predicted protein structures into enzyme function prediction (Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e). 
Given a protein sequence, its structure is predicted by ESMFold and used to construct the protein graph. Geometric features are extracted from the predicted structures and enhanced by sequence embeddings calculated with a pre-trained language model (ProtTrans). These features are fed into a geometric graph learning network for learning geometric embeddings, which are utilized in the prediction of active sites, EC number, and optimum pH. Here, enzyme active sites are first predicted by GraphEC-AS, assigning weight scores to each residue. Guided by the weight scores, the initial prediction of EC number is computed with the attention and pooling layers, which is further improved through a label diffusion algorithm by extracting homologous information. Finally, the model is extended to optimum pH prediction through attention pooling for better representing the reaction conditions (GraphEC-pH).\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec4\" class=\"Section2\"\u003e \u003ch2\u003eEnzyme active site prediction (GraphEC-AS)\u003c/h2\u003e \u003cp\u003eWe first evaluated GraphEC-AS for residue-level enzyme active site prediction using the independent test TS124 (details shown in Methods). Figure\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003eA displays an AUC (area under the receiver operating characteristic curve) of 0.9635 for GraphEC-AS on five-fold cross-validation and 0.9583 for TS124, demonstrating the robustness of the model. Six competing methods (PREvaIL_RF [45], PREvaIL_LR, CRpred (residues with coordinates) [46], CRpred (all residues), HA (residue identity filter) [47], and HA (combination filter)) are located between the ROC curves of GraphEC-AS and BiLSTM (the method excluding structural information), indicating the importance of geometric information. 
In terms of MCC (Matthews correlation coefficient), recall, and precision (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003eB), our method consistently performed the best. The second-best method (PREvaIL_RF) achieved 0.2939, 0.6223, and 0.1487, lower than GraphEC-AS by 40.9%, 14.5%, and 57.1%, respectively. Source data are provided as a Source Data file. Additionally, the F1 score for GraphEC-AS on TS124 is 0.4698 (Table \u003cspan refid=\"MOESM1\" class=\"InternalRef\"\u003eS1\u003c/span\u003e), while the second-best method, PREvaIL_RF, achieves a score of 0.240, reflecting a decrease of 48.9% relative to GraphEC-AS. The PREvaIL needs the calculation of time-consuming evolutionary profiles using PSI-BLAST [48], whereas GraphEC-AS can identify the enzyme active sites rapidly and accurately. Source data are provided as a Source Data file.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eThe superiority of GraphEC-AS was further illustrated by its learned embeddings on TS124. The ProtTrans embeddings (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003eC) are scattered while the geometric embeddings learned by GraphEC-AS (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003eD) distinguished active sites from non-active sites clearly. This demonstrates the capability of geometric graph learning to identify the crucial distinctions between them. We further evaluated the impact of the quality of ESMFold-predicted structures using TM-align [49] on TS124. More than 85% of proteins had TM-scores greater than 0.8 (Supplementary Fig. \u003cspan refid=\"MOESM1\" class=\"InternalRef\"\u003eS1\u003c/span\u003e), which reflects the high quality of the ESMFold-predicted structures. The AUC values increased with TM-scores (Supplementary Fig. 
\u003cspan refid=\"MOESM2\" class=\"InternalRef\"\u003eS2\u003c/span\u003e), which indicates the necessity of high predicted structure quality and emphasizes the importance of employing ProtTrans to enhance the feature embeddings. Figure\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003eE, F compare the three-dimensional structures of an example (cis-muconate cyclase) predicted by BiLSTM and GraphEC-AS. GraphEC-AS identified all four active sites, whereas BiLSTM only detected H149 due to the absence of local structure characteristics. Compared to H149, the remaining active sites were located far in sequence (more than 20 residues apart) but close in structure (less than 16 \u0026Aring;). These results indicate the capability of GraphEC-AS to learn the local structure information. Additional cases can be seen in Supplementary Fig. \u003cspan refid=\"MOESM3\" class=\"InternalRef\"\u003eS3\u003c/span\u003e.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec5\" class=\"Section2\"\u003e \u003ch2\u003eEnzyme EC number identification (GraphEC)\u003c/h2\u003e \u003cp\u003eWith the guidance of predicted active sites, GraphEC was proposed to identify enzyme EC numbers. GraphEC was evaluated on two independent tests: NEW-392 and Price-149, where NEW-392 comprises 392 enzyme sequences covering 177 different EC numbers, and Price-149 is an experimental dataset validated by Price et al. [50]. In comparison to six state-of-the-art EC number predictors (i.e., CLEAN, ProteInfer, DeepEC [20], ECPred [51], GrAPFI, and ECPICK), GraphEC exhibited superior performance in various metrics. Figure\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003eA illustrates that GraphEC achieved an AUC, recall, precision, and F1 of 0.8404, 0.6908, 0.6132, and 0.6131 on Price-149, surpassing the second-best method (CLEAN) by 14.6%, 47.9%, 4.9%, and 23.9%, respectively. 
On NEW-392, GraphEC achieved optimal values in AUC (0.8910), recall (0.7988), and F1 (0.5910) (Supplementary Fig. S4). Source data are provided as a Source Data file. As shown in Table \u003cspan refid=\"MOESM3\" class=\"InternalRef\"\u003eS3\u003c/span\u003e, GraphEC is able to achieve high EC number coverage (5106 EC numbers) while maintaining high performance. Benefiting from the contrastive learning-based representation, CLEAN achieved high precision, but its recall and F1 were 39.8% and 15.6% lower than those of GraphEC, respectively. Relying on the label propagation on a protein domain similarity graph, GrAPFI [22] achieved acceptable performance, with AUC values of 0.5095 and 0.5407 on Price-149 and NEW-392 (Table \u003cspan refid=\"MOESM2\" class=\"InternalRef\"\u003eS2\u003c/span\u003e). ECPICK [23] attained the third-best performance through the implementation of a convolutional neural network and hierarchical module, achieving the AUC values of 0.5888 and 0.6502 on Price-149 and NEW-392 (Table \u003cspan refid=\"MOESM2\" class=\"InternalRef\"\u003eS2\u003c/span\u003e), respectively. Source data are provided as a Source Data file.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eGraphEC was further evaluated on different levels of EC numbers and the frequency of each EC number in the training set. Considering the potential impact of EC number frequency in the training set on model performance, precision on NEW-392 was evaluated based on the number of times that the EC number appeared in the training set (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003eB). More than 66.0% of enzymes have fewer than ten occurrences and only 8.9% of enzymes have more than 100 occurrences, demonstrating the challenge of the dataset. As expected, predicting EC numbers with low frequency proved to be difficult. 
However, GraphEC consistently exhibited higher precision at different occurrences of EC numbers compared to other methods, highlighting the superior performance of our model. The four digits of the EC number correspond to different levels of enzyme functional classification, with the first to fourth digits indicating a hierarchical breakdown. The recall of GraphEC on NEW-392, compared to CLEAN, improved by 1.1%, 1.7%, 3.4%, and 66.0% from the first level to the fourth level, with values of 0.9468, 0.9116, 0.8945, and 0.7988 (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003eC). The superiority of GraphEC becomes more apparent as the level increases, indicating the effectiveness of our model. Source data are provided as a Source Data file.\u003c/p\u003e \u003cp\u003eConsidering the utilization of active sites in EC number prediction, we have evaluated the impact of mutations in the active sites. We first identified the active sites of enzymes on NEW-392 and Price-149 based on the predicted results (score\u0026thinsp;\u0026gt;\u0026thinsp;0.5). Subsequently, these active sites were mutated to Alanine (A), and the predicted scores for true EC numbers were compared before and after the mutation. After mutation, the predicted scores for true EC numbers have decreased (Fig. S6), demonstrating the influence of mutations in the active sites on the prediction of EC numbers. Among the mutated enzymes, 59.1% can be identified as non-enzymes, such as L-2-hydroxyglutarate dehydrogenase (Uniprot ID: A0A011QK89) and Farnesyl pyrophosphate synthase (Uniprot ID: B4YA15) (more cases can be seen in Table S4). Source data are provided as a Source Data file. Furthermore, the predicted scores for active sites before and after the mutation were compared, discovering a reduction in predicted scores for active sites after mutation (Fig. S7). This indicates a reduced focus of the model on the mutated active sites. 
Additionally, we have compared the average computational time per protein of different methods on Price-149. The average inference time for GraphEC is 0.26 seconds (s), while CLEAN, ProteInfer, and DeepEC have inference times of 1.28s, 0.21s, and 0.14s, respectively (Fig. S8). Source data are provided as a Source Data file. Due to the considerable time needed to compute the pairwise distances between the query sequence and each EC number cluster center in CLEAN, GraphEC's inference speed is 392.3% faster than that of CLEAN. By combining the time required for ESMFold to compute protein structures (11.44s) with the inference time of GraphEC (0.26s), a total of 11.7s is necessary for each enzyme. In this case, computing the functions of 1,000 enzymes requires just 3.25 hours, thereby meeting the needs for high-throughput analysis.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec6\" class=\"Section2\"\u003e \u003ch2\u003eThe ablation studies of GraphEC\u003c/h2\u003e \u003cp\u003eThe ablation studies of GraphEC were conducted to investigate the contribution of each module. When removing label diffusion, the AUC values slightly decreased (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003eD) likely because of the ability of GraphEC to learn homology information. The removal of active site guidance resulted in a decrease of 2.8% and 3.5% in AUC on NEW-392 and Price-149, demonstrating its great importance. For evaluating the impact of ESMFold-predicted structures, a geometrically agnostic baseline (BiLSTM) was constructed. Without structural information, the AUC decreased by 4.8% and 2.1% on NEW-392 and Price-149, indicating the crucial role of predicted structures. The ProtTrans embeddings were used to enhance the node features, and the removal of them led to a decrease in AUC by 6.6% and 2.8%. 
The ProtTrans embeddings used here are residue-level representations, which are different from the protein-level ESM-1b representations (mean representations) used in CLEAN (Supplementary Fig. S9). Source data are provided as a Source Data file. Additionally, we have evaluated the effects of physicochemical properties in reference to previous studies [52, 53]. The incorporation of these physicochemical properties failed to further improve the performance of GraphEC (Table S5), suggesting that the geometric features and language model embeddings used in this study may have already inherently captured the physicochemical properties. Source data are provided as a Source Data file.\u003c/p\u003e \u003cp\u003eAs shown in Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003eE, the learned geometric embeddings (GraphEC embeddings) were compared with ProtTrans embeddings and one-hot embeddings on NEW-392. Among the ten most frequent EC numbers, the one-hot embeddings exhibited limited discriminative capacity. The ProtTrans embeddings can roughly distinguish these EC numbers, yet they cannot cluster the categories to which 3.1.2.22 and 4.2.1.113 belong. In contrast, GraphEC embeddings can clearly separate these EC numbers, demonstrating their strong expressive ability for different EC numbers. Similarly, on Price-149, the one-hot embeddings lacked the ability to distinguish, while the ProtTrans embeddings can provide a basic distinction, and the GraphEC embeddings were able to further differentiate them (Supplementary Fig. S10). Source data are provided as a Source Data file.\u003c/p\u003e \u003cp\u003eTo evaluate the importance of predicted structures, we replaced the ESMFold-predicted structures with those predicted by AlphaFold2. 
Utilizing the AlphaFold2-predicted structures, the AUC, recall, precision, and F1 on NEW-392 are 0.9004, 0.8267, 0.5745, and 0.6044, respectively (Table S6), slightly higher than those obtained with ESMFold-predicted structures. On Price-149, comparable performance was obtained when utilizing AlphaFold2-predicted and ESMFold-predicted structures. These results indicate that ESMFold can generate structures with comparable accuracy in much less time than AlphaFold2. In addition, we evaluated the impact of various cut-off distances (8 \u0026Aring;, 12 \u0026Aring;, and 14 \u0026Aring;) relative to 10 \u0026Aring; on model performance. When the distance is 8 \u0026Aring;, the AUC, recall, precision, and F1 of the model are 0.8761, 0.7729, 0.5577, and 0.5459 on NEW-392 (Table S7), lower by 1.7%, 3.2%, 2.3%, and 7.6% than when the distance is 10 \u0026Aring;. This may be due to the decreased distance, which reduces the number of neighbor nodes associated with each node, ultimately causing some information loss. When the distance is 12 \u0026Aring; or 14 \u0026Aring;, the AUC values of the model are 0.8876 and 0.8753 on NEW-392, respectively, 0.4% and 1.8% lower than when the distance is 10 \u0026Aring; (0.8910). This might be because a larger distance allows each node to have more edges, resulting in excessive aggregation of information from neighbor nodes during the iterative process, which eventually reduces the node specificity. Similar results on Price-149 are presented in Table S7. Source data are provided as a Source Data file.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec7\" class=\"Section2\"\u003e \u003ch2\u003eGraphEC captures the functional regions of enzymes\u003c/h2\u003e \u003cp\u003eTo verify whether GraphEC can identify functional regions, we studied the connections between predicted enzyme active sites, multi-head attention scores, and true active sites. 
As shown in Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003eF, the true active sites of Acyl-protein thioesterase 2 are S122, D176, and H210, which were correctly predicted through GraphEC-AS and used to guide the EC number prediction. The multi-head attention scores tended to be higher near the true active sites, suggesting that the model can focus on the functional regions. Similarly, the enzyme active sites of Proline racemase were accurately identified and the multi-head attention scores were prominent when approaching the true active sites (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003eG). Additional cases can be seen in Supplementary Fig. S11. These results indicate that GraphEC could capture the functional regions of enzymes.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec8\" class=\"Section2\"\u003e \u003ch2\u003eThe prediction of enzyme optimum pH\u003c/h2\u003e \u003cp\u003eSince enzyme pH values are important for enzyme functions, we have also included enzyme optimum pH predictions. To train the model, we have curated a new dataset constructed from the Brenda database (released in January 2023) [54] (Supplementary Fig. S12), including 4110 proteins with sequence identity of \u0026lt;\u0026thinsp;25%. The dataset was divided into a training set (Brenda-train, 3297 enzymes) and an independent test set (Brenda-test, 813 enzymes) with a ratio of 4:1 according to the deposit time. As shown in Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003eA, GraphEC-pH achieved an AUPR (area under the precision-recall curve) of 0.9321 for five-fold cross-validation and 0.9170 on the test, indicating the model\u0026rsquo;s robustness. By removing the structural information, the AUPR of GraphEC-pH w/o structures decreased by 1.4%. 
In comparison, the two latest methods, EpHod [55] and EpHod_SVR, achieved lower performance, with their points located below the precision-recall curve of GraphEC-pH. Correspondingly, the F1, recall, and precision of GraphEC-pH were 0.8487, 0.8672, and 0.8461, surpassing the second-best method (EpHod) by 9.2%, 16.5%, and 0.09%, respectively (Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003eB). Source data are provided as a Source Data file. These results demonstrate the superior performance of our model. We then evaluated the model\u0026rsquo;s ability to discern differences among 289 homologous enzyme pairs identified by DIAMOND in Brenda-test. About 87.9% (254 pairs) of the homologous enzyme pairs have the same type of optimum pH (i.e., \u0026ldquo;acidic\u0026rdquo; - \u0026ldquo;acidic\u0026rdquo; and \u0026ldquo;non-acidic\u0026rdquo; - \u0026ldquo;non-acidic\u0026rdquo;), and GraphEC-pH correctly identifies 95.7% of them (243 pairs). Only 35 pairs of enzymes exhibit different optimum pH types (i.e., \u0026ldquo;acidic\u0026rdquo; - \u0026ldquo;non-acidic\u0026rdquo;), of which GraphEC-pH correctly distinguishes 14 pairs (Table S8), 75% more than EpHod (8 pairs). These results indicate that GraphEC-pH can discern the differences among homologous enzymes to some extent. Source data are provided as a Source Data file.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec9\" class=\"Section2\"\u003e \u003ch2\u003eGraphEC learns functional information from enzyme structures\u003c/h2\u003e \u003cp\u003eTo discover new enzyme functions, a total of 570,830 protein sequences were collected from Swiss-Prot (January 2024 release). After removing redundant proteins with sequence identity greater than 25% and those with identity above 25% to the training dataset, 52,037 proteins without EC number annotations remained. 
These proteins were annotated by GraphEC and CLEAN, with over 21% of them receiving the same EC number annotations from both methods. For each protein, the predicted EC number was obtained and the TM-scores were calculated against proteins sharing the same EC number in the training set. The maximum TM-score of each protein was then used for further analysis. GraphEC generally has a higher score, with over 82% of the proteins found by Foldseek [56] showing a higher TM-score compared to CLEAN. When comparing the number of enzymes whose maximum TM-scores exceeded various thresholds (Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003eA), GraphEC surpassed CLEAN by 158%, 136%, 128%, and 128% at thresholds of 0.5, 0.7, 0.8, and 0.9, respectively. Source data are provided as a Source Data file. A subset of the enzyme functions newly identified by GraphEC relative to CLEAN, with maximum TM-scores surpassing 0.8, is listed in Supplementary Dataset 1. Despite low sequence similarity, GraphEC can learn functional information from enzymes with high structural similarity (Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003eB). Even when the TM-score is low, the enzyme pocket (details shown in Methods) around the enzyme active sites can still be aligned (Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003eC), demonstrating the capacity of GraphEC to learn critical functional information from enzyme structures. Additionally, an example (Q9NWA0) with disordered regions was found to align to the enzyme pocket of Q980B8 in the training set (Supplementary Fig. S13F), which indicates the potential of our method for identifying the functions of disordered proteins. More cases are available for reference in Supplementary Fig. S13. 
Source data are provided as a Source Data file.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003c/div\u003e"},{"header":"Discussion","content":"\u003cp\u003eGraphEC is a geometric graph learning-based EC number predictor that builds on the enzyme active sites and predicted structures. The predicted active sites can guide the learning because of their crucial role in enzyme function. Based on the ESMFold-predicted structures, geometric graph learning can efficiently extract structural information, which is especially necessary when homology information is lacking. Additionally, a label diffusion algorithm and ProtTrans embeddings improve the model performance. For an enzyme, the EC number, active sites, and optimum pH can be analyzed comprehensively.\u003c/p\u003e \u003cp\u003eDespite the essential role of EC numbers, current EC number prediction technologies have not fully recognized the importance of enzyme active sites and structural characteristics. The enzyme active sites represent the chemical reaction regions, which we first predict and use to guide subsequent learning. Due to the limitations of native structures, current methods for EC number prediction do not fully exploit the information from protein structures. Benefiting from the rapid and precise structure prediction of ESMFold, GraphEC utilizes geometric graph learning to extract important structural information and surpass state-of-the-art methods. Experiments demonstrate the efficacy of our model in predicting active sites, EC numbers, and optimum pH. Furthermore, GraphEC is shown to extract functional information from enzyme structures even in the absence of homology information, emphasizing the effectiveness of geometric graph learning.\u003c/p\u003e \u003cp\u003eAlthough GraphEC has shown strong performance, there is still room for improvement in several aspects. 
Considering the impact of predicted structure quality, we can explore enhancing the stability of the model by either improving the structural quality or incorporating additional sequence features. Additionally, as large language models continue to advance, we can utilize them to extract essential information from textual descriptions and enhance our model's predictions.\u003c/p\u003e \u003cp\u003eIn summary, we have developed an accurate and fast EC number predictor, GraphEC. Researchers can use it to accurately predict enzyme function solely from the enzyme sequences. For specific enzymes, we can further analyze their functional regions (active sites) and determine their reaction conditions (pH), which will be helpful for experimental investigations.\u003c/p\u003e "},{"header":"Methods","content":"\u003cdiv id=\"Sec12\" class=\"Section3\"\u003e \u003ch2\u003eDataset construction\u003c/h2\u003e \u003cp\u003eTo predict the enzyme active sites, we collected eight enzyme datasets and constructed new training and test sets from them. The eight datasets, namely NN [57], PC [58], HA superfamily [47], EF family [59], EF superfamily, EF fold, T-37, and T-124 [46], collectively contain a total of 987 proteins. T-124, containing 124 proteins, was used as the test set (TS124), while the remaining 863 proteins were utilized as a training set. To exclude sequences with high identity, the chains in the training set that share\u0026thinsp;\u0026gt;\u0026thinsp;25% identity with TS124 were removed using MMseqs2 [60], resulting in 588 sequences in the training set (Train588). For EC number prediction, following CLEAN [11], more than 220,000 enzyme sequences were extracted from UniProt [61], and a training set of 74,487 sequences for EC number identification was constructed through 70% clustering. Two independent test sets were used to evaluate the model performance. The first is NEW-392, which comprises data from Swiss-Prot released after April 2022. 
NEW-392 includes 392 enzyme sequences encompassing a total of 177 EC numbers. The second is Price-149, an experimental dataset of 149 enzyme sequences described by Price et al. [50]. To predict the enzyme optimum pH, 11,383 enzymes were collected from BRENDA (released in January 2023) [54], which provides the experimental optimum pH for enzyme-catalyzed reactions. After removing similar sequences with \u0026gt;\u0026thinsp;25% identity, 4110 enzymes remained and were ranked by release time. The latest 813 sequences (about 20%) were utilized as the test set (Brenda-test), while the remaining sequences were used as the training set (Brenda-train).\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv id=\"Sec13\" class=\"Section2\"\u003e \u003ch2\u003eThe architecture of the model\u003c/h2\u003e \u003cp\u003eAs shown in Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e, protein structures are predicted using ESMFold to construct the protein graph and sequence embeddings are extracted via ProtTrans; both are then fed into a featurizer layer to obtain node and edge features. These features are employed to obtain geometric embeddings through geometric graph learning. Based on the embeddings, enzyme active sites are predicted and a weight score is assigned to every residue. Using these weight scores, enzyme EC numbers are identified with an attention layer and label diffusion. 
In addition, for better determining the reaction conditions, the model is subsequently expanded to optimum pH prediction by incorporating attention pooling.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec14\" class=\"Section2\"\u003e \u003ch2\u003eFeaturizer layer\u003c/h2\u003e \u003cp\u003eA protein is represented as a radius graph constructed by the \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{c}_{\\alpha\\:}\\)\u003c/span\u003e\u003c/span\u003e atoms of residues, where the radius defaults to 10 \u0026Aring;. The protein graph comprises the adjacency matrix, as well as node and edge features, which are derived from a local coordinate system. The \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{\\text{C}}_{{\\alpha\\:}},\\:\\text{C},\\)\u003c/span\u003e\u003c/span\u003e and N atoms of residue \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:i\\)\u003c/span\u003e\u003c/span\u003e are employed to build the coordinate system \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{Q}_{i}={[b}_{i},\\:{n}_{i},\\:{b}_{i}\\times\\:{n}_{i}]\\)\u003c/span\u003e\u003c/span\u003e. Formally, we define:\u003cdiv id=\"Equa\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equa\" name=\"EquationSource\"\u003e\n$$\\:\\begin{array}{c}{u}_{i}={C}_{{\\alpha\\:}_{i}}-{N}_{i},\\:{\\:v}_{i}={C}_{i}-{C}_{{\\alpha\\:}_{i}},\\:{b}_{i}=\\frac{{u}_{i}-{v}_{i}}{\\parallel\\:{u}_{i}-{v}_{i}\\parallel\\:},\\:{n}_{i}=\\frac{{u}_{i}\\times\\:{v}_{i}}{\\parallel\\:{u}_{i}\\times\\:{v}_{i}\\parallel\\:}\\:\\#\\left(1\\right)\\end{array}$$\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e \u003cp\u003eBased on the local coordinate system, the node and edge features are defined as follows:\u003c/p\u003e \u003cp\u003e(i) Node features. 
Consider two atoms \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:A\\in\\:\\left\\{{C}_{i},\\:\\:{C}_{{\\alpha\\:}_{i}},\\:\\:{N}_{i},\\:\\:{O}_{i},\\:{\\:R}_{i}\\right\\}\\)\u003c/span\u003e\u003c/span\u003e and \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:B\\in\\:\\left\\{{C}_{i},\\:\\:{C}_{{\\alpha\\:}_{i}},\\:\\:{N}_{i},\\:\\:{O}_{i},{\\:R}_{i}\\right\\}\\)\u003c/span\u003e\u003c/span\u003e, where \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{C}_{i}\\)\u003c/span\u003e\u003c/span\u003e, \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{C}_{{\\alpha\\:}_{i}}\\)\u003c/span\u003e\u003c/span\u003e, \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{N}_{i}\\)\u003c/span\u003e\u003c/span\u003e, and \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{O}_{i}\\)\u003c/span\u003e\u003c/span\u003e represent four atoms of residue \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:i\\)\u003c/span\u003e\u003c/span\u003e and \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{\\:R}_{i}\\)\u003c/span\u003e\u003c/span\u003e denotes the centroid of the sidechain atoms. From the relationships between A and B, the distance, direction, and angle features are computed for each residue. The distance features are \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:RBF(\\parallel\\:A-B\\parallel\\:)\\)\u003c/span\u003e\u003c/span\u003e, where \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:A\\ne\\:B\\)\u003c/span\u003e\u003c/span\u003e and \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:RBF\\)\u003c/span\u003e\u003c/span\u003e is a radial basis function. 
The direction features are defined as \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{Q}_{i}^{T}\\frac{A-{C}_{{\\alpha\\:}_{i}}}{\\parallel\\:A-{C}_{{\\alpha\\:}_{i}}\\parallel\\:}\\)\u003c/span\u003e\u003c/span\u003e, indicating the direction of other atoms relative to \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{C}_{{\\alpha\\:}_{i}}\\)\u003c/span\u003e\u003c/span\u003e. To adequately reflect the geometry of the backbone, the torsion angles (\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{\\varphi\\:}_{i},\\:{\\psi\\:}_{i},\\:{\\omega\\:}_{i}\\)\u003c/span\u003e\u003c/span\u003e) and bond angles (\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{\\alpha\\:}_{i},\\:{\\beta\\:}_{i},{\\gamma\\:}_{i}\\)\u003c/span\u003e\u003c/span\u003e) are computed and their sine and cosine values are used as angle features.\u003c/p\u003e \u003cp\u003eTo enhance the node features, a pre-trained language model (ProtTrans) was utilized to extract informative protein embeddings from sequences. ProtTrans is a transformer-based pre-trained language model with 3B parameters, trained on BFD and fine-tuned on UniRef50 using BERT\u0026rsquo;s denoising objective. Besides the sequence, we also extracted additional information from structures. DSSP was used to compute structural properties, including a one-hot secondary structure profile and relative solvent accessibility, which were used to further enhance the node features.\u003c/p\u003e \u003cp\u003e(ii) Edge features. 
For atom pairs \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:A\\in\\:\\left\\{{C}_{i},\\:\\:{C}_{{\\alpha\\:}_{i}},\\:\\:{N}_{i},\\:\\:{O}_{i},\\:{\\:R}_{i}\\right\\}\\)\u003c/span\u003e\u003c/span\u003e and \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:D\\in\\:\\left\\{{C}_{j},\\:\\:{C}_{{\\alpha\\:}_{j}},\\:\\:{N}_{j},\\:\\:{O}_{j},{\\:R}_{j}\\right\\}\\)\u003c/span\u003e\u003c/span\u003e representing residues \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:i\\)\u003c/span\u003e\u003c/span\u003e and \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:j\\)\u003c/span\u003e\u003c/span\u003e, respectively, the edge features are defined similarly, including distance, direction, and orientation features. The distance features between residues \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:i\\)\u003c/span\u003e\u003c/span\u003e and \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:j\\)\u003c/span\u003e\u003c/span\u003e are \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:RBF(\\parallel\\:A-D\\parallel\\:)\\)\u003c/span\u003e\u003c/span\u003e, indicating the distance characteristics of given residue pairs. The direction features are defined as \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{Q}_{i}^{T}\\frac{D-{C}_{{\\alpha\\:}_{i}}}{\\parallel\\:D-{C}_{{\\alpha\\:}_{i}}\\parallel\\:}\\)\u003c/span\u003e\u003c/span\u003e, denoting the direction of atoms in residue \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:j\\)\u003c/span\u003e\u003c/span\u003e relative to \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{C}_{{\\alpha\\:}_{i}}\\)\u003c/span\u003e\u003c/span\u003e. 
To represent the relative rotation between the local coordinate systems, \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:q({Q}_{i}^{T}{Q}_{j})\\)\u003c/span\u003e\u003c/span\u003e is computed as orientation features, where \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:q\\)\u003c/span\u003e\u003c/span\u003e represents a quaternion encoding function [62].\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec15\" class=\"Section2\"\u003e \u003ch2\u003eGeometric graph learning\u003c/h2\u003e \u003cp\u003eThe node and edge features obtained from the featurizer layer were fed into several GNN layers for geometric graph learning. To learn the multi-scale residue interactions, node update, edge update, and global context attention modules were employed at the node, edge, and global context levels, respectively.\u003c/p\u003e \u003cp\u003e(i) Node update. Because the transformer is a powerful model for both sequence and graph data [63, 64], we employed its multi-head attention mechanism for efficient message passing. 
The feature vectors of node \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:i\\)\u003c/span\u003e\u003c/span\u003e and edge \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:j\\to\\:i\\)\u003c/span\u003e\u003c/span\u003e in layer \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:l\\)\u003c/span\u003e\u003c/span\u003e were represented as \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{h}_{i}^{l}\\)\u003c/span\u003e\u003c/span\u003e and \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{e}_{ji}^{l}\\)\u003c/span\u003e\u003c/span\u003e, which were transformed into a \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:d\\)\u003c/span\u003e\u003c/span\u003e-dimensional space before the GNN operation. To update node \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:i\\)\u003c/span\u003e\u003c/span\u003e in layer \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:l\\)\u003c/span\u003e\u003c/span\u003e, we execute the message passing in the following manner:\u003cdiv id=\"Equb\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equb\" name=\"EquationSource\"\u003e\n$$\\:\\begin{array}{c}{\\widehat{h}}_{i}^{l+1}={h}_{i}^{l}+\\sum\\:_{j\\in\\:{NB}_{i}\\cup\\:i}{\\alpha\\:}_{ji}^{l}\\left({W}_{V}^{l}{h}_{j}^{l}+{W}_{E}^{l}{e}_{ji}^{l}\\right)\\#\\left(2\\right)\\end{array}$$\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e \u003cp\u003ethe attention weight \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{\\alpha\\:}_{ji}^{l}\\)\u003c/span\u003e\u003c/span\u003e is computed as follows:\u003cdiv id=\"Equc\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equc\" 
name=\"EquationSource\"\u003e\n$$\\:\\begin{array}{c}\\left\\{\\begin{array}{c}{w}_{ji}^{l}=\\frac{{\\left({W}_{Q}^{l}{h}_{i}^{l}\\right)}^{T}\\left({W}_{K}^{l}{h}_{j}^{l}+{W}_{E}^{l}{e}_{ji}^{l}\\right)}{\\sqrt{d}}\\\\\\:{\\alpha\\:}_{ji}^{l}=\\frac{{e}^{{w}_{ji}^{l}}}{{\\sum\\:}_{k\\in\\:{NB}_{i}\\cup\\:i}{e}^{{w}_{ki}^{l}}}\\end{array}\\right.\\#\\left(3\\right)\\end{array}$$\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e \u003cp\u003ewhere \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{W}_{Q}^{l}\\)\u003c/span\u003e\u003c/span\u003e, \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{W}_{K}^{l}\\)\u003c/span\u003e\u003c/span\u003e, and \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{W}_{V}^{l}\\)\u003c/span\u003e\u003c/span\u003e are three weight matrices that convert the node vectors to query, key, and value representations, respectively. The key and value representations are further supplemented by edge vectors using the weight matrix \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{W}_{E}^{l}\\)\u003c/span\u003e\u003c/span\u003e. \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{NB}_{i}\\:\\)\u003c/span\u003e\u003c/span\u003e represents the neighbors of node \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:i\\)\u003c/span\u003e\u003c/span\u003e. The queries, keys, and values are projected multiple times, and the attention functions are performed in parallel before their outputs are concatenated.\u003c/p\u003e \u003cp\u003e(ii) Edge update. 
The edge features are updated through the neighbor nodes to enhance the model performance.\u003cdiv id=\"Equd\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equd\" name=\"EquationSource\"\u003e\n$$\\:\\begin{array}{c}{e}_{ji}^{l+1}={e}_{ji}^{l}+EdgeMLP\\left({\\widehat{h}}_{j}^{l+1}\\parallel\\:{e}_{ji}^{l}\\parallel\\:{\\widehat{h}}_{i}^{l+1}\\right)\\#\\left(4\\right)\\end{array}$$\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e \u003cp\u003ewhere \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:EdgeMLP\\)\u003c/span\u003e\u003c/span\u003e denotes the MLP operation for edge updates and \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:\\parallel\\:\\)\u003c/span\u003e\u003c/span\u003e represents the concatenation operation.\u003c/p\u003e \u003cp\u003e(iii) Global context attention. Although local interactions are crucial for learning residue representations, global information has also been shown to be beneficial in enhancing method performance. However, the increased computational overhead in calculating global attention poses a major challenge. 
To reduce the complexity, an alternative is proposed to calculate a global context vector before employing it for node representations with gate attention [36].\u003cdiv id=\"Eque\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Eque\" name=\"EquationSource\"\u003e\n$$\\:\\begin{array}{c}\\left\\{\\begin{array}{c}{c}^{l}=\\frac{{\\sum\\:}_{k=0}^{n-1}{\\widehat{h}}_{k}^{l+1}}{n}\\\\\\:{h}_{i}^{l+1}={\\widehat{h}}_{i}^{l+1}⨀\\sigma\\:\\left(GateMLP\\left({c}^{l}\\right)\\right)\\end{array}\\right.\\#\\left(5\\right)\\end{array}$$\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e \u003cp\u003ewhere \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:n\\)\u003c/span\u003e\u003c/span\u003e represents the quantity of residues in a protein, \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:\\sigma\\:\\)\u003c/span\u003e\u003c/span\u003e is the sigmoid function, \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:⨀\\)\u003c/span\u003e\u003c/span\u003e is the element-wise product operation and \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:GateMLP\\)\u003c/span\u003e\u003c/span\u003e denotes the MLP for gated attention.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec16\" class=\"Section2\"\u003e \u003ch2\u003eEnzyme active site prediction (GraphEC-AS)\u003c/h2\u003e \u003cp\u003eDue to the important role of enzyme active sites in enzyme function, we first predict the active sites before identifying the EC numbers. The geometric embeddings obtained from the geometric graph learning were fed into an MLP layer to assign a score to each residue, indicating its likelihood of belonging to an active site. 
Using these scores, each residue was assigned a weight to represent its level of importance.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec17\" class=\"Section2\"\u003e \u003ch2\u003eThe identification of EC numbers (GraphEC)\u003c/h2\u003e \u003cp\u003eUnder the guidance of the weight scores generated by GraphEC-AS, an EC number predictor was proposed. The previously generated geometric embeddings were further input to an attention layer, where the attention functions were performed in parallel with the multi-head attention mechanism. By integrating the multi-head attention and weight scores, the residue-level information was aggregated to the protein level through a pooling layer. After pooling, the initial prediction was obtained, and a label diffusion algorithm was employed to enhance the prediction using DIAMOND. The label diffusion algorithm, following S2F [44], was used to incorporate homology information. Following the label diffusion, the final prediction was generated to identify the EC numbers as a multilabel classification task.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec18\" class=\"Section2\"\u003e \u003ch2\u003eEnzyme optimum pH prediction (GraphEC-pH)\u003c/h2\u003e \u003cp\u003eSince enzymes require certain environmental conditions to exert their catalytic activity, we further predicted the optimum pH of the enzyme. The pH values were categorized into three groups: acidic (less than 5), neutral (between 5 and 9), and alkaline (greater than 9). To obtain the representation for predicting the enzyme optimum pH, multi-head attention was utilized to process the geometric embeddings derived from the geometric graph learning. Then an MLP layer was used to predict the optimum pH. 
By combining the previous identification of enzyme function with the current prediction of pH, a more effective method can be provided to guide actual experiments.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec19\" class=\"Section2\"\u003e \u003ch2\u003eHierarchy of catalytic functions\u003c/h2\u003e \u003cp\u003eThe Enzyme Commission (EC) number is a numerical system used to classify enzymes according to the reactions they catalyze. Each EC number comprises four digits, which hierarchically categorize enzymes based on their catalytic reaction types and specific substrates [65] (e.g., EC: 1.3.1.32 represents the maleylacetate reductase). In this study, we collected 5,106 EC numbers from the training set and defined a label of length 5,106, where each position corresponds to a specific EC number.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec20\" class=\"Section2\"\u003e \u003ch2\u003eThe protein language model (ProtTrans)\u003c/h2\u003e \u003cp\u003eThe informative sequence embeddings were generated through a pre-trained language model ProtT5-XL-U50 (ProtTrans [39]). ProtTrans is a transformer-based autoencoder known as T5 [66], which has been pre-trained on UniRef50 [67] to facilitate the prediction of masked amino acids. The features derived from the final layer of the ProtTrans encoder were employed to enhance the node representations.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec21\" class=\"Section2\"\u003e \u003ch2\u003eProtein structure prediction using a language model (ESMFold)\u003c/h2\u003e \u003cp\u003eESMFold [30] is a large language model with up to 15B parameters, developed on the premise that language models can capture evolutionary patterns across millions of sequences. Achieving accurate and fast structure prediction, ESMFold reduces inference time by as much as 60 times in comparison to the state-of-the-art method. 
Benefiting from its high efficiency, the first evolutionary scale structural characterization of a metagenomic resource has been presented. In this study, we employed ESMFold to predict the protein structures, which were then applied in subsequent geometric graph learning.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec22\" class=\"Section2\"\u003e \u003ch2\u003eLabel diffusion algorithm\u003c/h2\u003e \u003cp\u003eTo enhance the initial predictions of EC numbers, a label diffusion algorithm [44, 68] was applied during the testing phase. First, the sequences in the training set similar to the test sequences were found using DIAMOND [15]. Second, based on the sequence identity of protein pairs, a homology network \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:M\\in\\:{R}^{T\\times\\:T}\\)\u003c/span\u003e\u003c/span\u003e was constructed (\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:T\\)\u003c/span\u003e\u003c/span\u003e represents the sum of the number of proteins in the test set and the number of hits in the training set). 
Then, to measure the degree to which a pair of proteins belongs to the same community within the homology network, a Jaccard similarity matrix was defined as follows:\u003cdiv id=\"Equf\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equf\" name=\"EquationSource\"\u003e\n$$\\:\\begin{array}{c}{J}_{ij}=\\frac{{\\sum\\:}_{z}{M}_{iz}{M}_{jz}}{\\sum\\:_{z}{M}_{iz}+\\sum\\:_{z}{M}_{jz}-{\\sum\\:}_{z}{M}_{iz}{M}_{jz}}\\#\\left(6\\right)\\end{array}$$\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e \u003cp\u003eFor a target EC number \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:x\\)\u003c/span\u003e\u003c/span\u003e, the \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{x}^{th}\\)\u003c/span\u003e\u003c/span\u003e column of the final annotation matrix \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:S\\)\u003c/span\u003e\u003c/span\u003e (\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{S}_{x}\\)\u003c/span\u003e\u003c/span\u003e) was learned by minimizing the cost function \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:P\\left({S}_{x}\\right)\\)\u003c/span\u003e\u003c/span\u003e:\u003cdiv id=\"Equg\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equg\" name=\"EquationSource\"\u003e\n$$\\:\\begin{array}{c}P\\left({S}_{x}\\right)=\\:\\sum\\:_{i=1}^{T}{\\left({S}_{ix}-{Y}_{ix}\\right)}^{2}+\\frac{\\epsilon\\:}{2}\\sum\\:_{i=1}^{T}\\frac{1}{{d}_{i}}\\sum\\:_{j=1}^{T}{J}_{ij}{M}_{ij}{\\left({S}_{ix}-{S}_{jx}\\right)}^{2}\\#\\left(7\\right)\\end{array}$$\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e \u003cp\u003eWhere \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:\\epsilon\\:\\)\u003c/span\u003e\u003c/span\u003e represents the regularization parameter. 
The first term serves to preserve the initial labels (\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{Y}_{ix}\\)\u003c/span\u003e\u003c/span\u003e), and the consistency of the labels of adjacent nodes is accounted for through the second term. And \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:\\frac{1}{{d}_{i}}\\)\u003c/span\u003e\u003c/span\u003e is defined as:\u003cdiv id=\"Equh\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equh\" name=\"EquationSource\"\u003e\n$$\\:\\begin{array}{c}\\frac{1}{{d}_{i}}=\\frac{1}{{\\sum\\:}_{j}{J}_{ij}{M}_{ij}}\\#\\left(8\\right)\\end{array}$$\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e \u003cp\u003eFurthermore, we define \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{M}^{1}\\)\u003c/span\u003e\u003c/span\u003e as:\u003cdiv id=\"Equi\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equi\" name=\"EquationSource\"\u003e\n$$\\:\\begin{array}{c}{{M}^{1}}_{ij}=\\frac{1}{2}\\left(\\frac{1}{{d}_{i}}+\\frac{1}{{d}_{j}}\\right){J}_{ij}{M}_{ij}\\#\\left(9\\right)\\end{array}$$\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e \u003cp\u003eits Laplacian matrix \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:L\\)\u003c/span\u003e\u003c/span\u003e is:\u003cdiv id=\"Equj\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equj\" name=\"EquationSource\"\u003e\n$$\\:\\begin{array}{c}L=DM-{M}^{1}\\#\\left(10\\right)\\end{array}$$\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e \u003cp\u003ewhere \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:DM\\)\u003c/span\u003e\u003c/span\u003e is the diagonal degree matrix of \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{M}^{1}\\)\u003c/span\u003e\u003c/span\u003e. 
The minimizer of \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(P\\left(S_{x}\\right)\\)\u003c/span\u003e\u003c/span\u003e then has the closed-form solution:\u003cdiv id=\"Equk\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equk\" name=\"EquationSource\"\u003e\n$$S=\\left(I+\\epsilon L\\right)^{-1}Y\\tag{11}$$\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e \u003cp\u003ewhere \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(S\\)\u003c/span\u003e\u003c/span\u003e is the updated annotation matrix, \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(I\\in{R}^{T\\times T}\\)\u003c/span\u003e\u003c/span\u003e is the identity matrix, and \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(Y\\)\u003c/span\u003e\u003c/span\u003e combines the training set labels with the initial predictions for the test set.\u003c/p\u003e \u003cdiv id=\"Sec23\" class=\"Section3\"\u003e \u003ch2\u003eConstructing the enzyme pocket from predicted enzyme active sites\u003c/h2\u003e \u003cp\u003eThe enzyme pocket was constructed in two steps. First, the predicted enzyme active sites were clustered with k-means, with k set to 2 empirically. To eliminate false positives, isolated points that formed a cluster on their own were removed. 
Second, using the \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(C_{\\alpha}\\)\u003c/span\u003e\u003c/span\u003e coordinates, the enzyme pocket was defined as the region within 10 \u0026Aring; of the cluster center.\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv id=\"Sec24\" class=\"Section2\"\u003e \u003ch2\u003eImplementation and evaluation\u003c/h2\u003e \u003cp\u003eFive-fold cross-validation was performed on the training data: in each round, the model was trained on four folds and validated on the remaining fold. This was repeated five times, and the best model from each round was saved. After training, several independent test sets were used to assess model performance on the different tasks. For enzyme active site prediction, TS124 was used to compare GraphEC-AS with other methods. The performance of GraphEC in predicting EC numbers was evaluated on NEW-392 and Price-149. To test the accuracy of GraphEC-pH in predicting the enzyme optimum pH, a new independent test set (Brenda-test) was built, and two of the latest methods were evaluated on it. During testing, the predictions of the five cross-validation models were averaged to give the final predictions. The geometric graph network was implemented in PyTorch 1.13.1 and consists of a 3-layer GNN with 256 hidden units. The attention layer of GraphEC used multi-head attention with 8 attention heads. The model was optimized with the Adam optimizer on a binary cross-entropy loss. Training was limited to a maximum of 35 epochs with early stopping (patience of 4), and a dropout rate of 0.1 was used to prevent overfitting. 
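The training schedule and test-time ensembling described in this paragraph can be sketched as follows. This is a dependency-free illustration rather than the GraphEC code: the callables `train_epoch` and `validate`, the function names, and the choice of a higher-is-better validation score are assumptions, since the text does not state which metric is monitored.

```python
def train_with_early_stopping(train_epoch, validate, max_epochs=35, patience=4):
    """Train for at most max_epochs; stop after `patience` epochs without improvement."""
    best_score, best_epoch, stale = float("-inf"), -1, 0
    for epoch in range(max_epochs):
        train_epoch(epoch)
        score = validate(epoch)  # assumed: higher is better
        if score > best_score:
            # A real run would save the model checkpoint at this point.
            best_score, best_epoch, stale = score, epoch, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best_epoch, best_score


def ensemble_average(fold_predictions):
    """Average predictions over fold models.

    fold_predictions[m][s][k] is the probability that model m assigns
    to label k for sample s.
    """
    n_models = len(fold_predictions)
    return [
        [sum(values) / n_models for values in zip(*rows_for_sample)]
        for rows_for_sample in zip(*fold_predictions)
    ]


# Hypothetical validation curve: the best epoch is 1, training stops at epoch 5.
scores = [0.50, 0.60, 0.55, 0.58, 0.59, 0.57, 0.70]
best_epoch, best_score = train_with_early_stopping(
    train_epoch=lambda epoch: None,
    validate=lambda epoch: scores[epoch],
    max_epochs=len(scores),
)
print(best_epoch, best_score)

# Two fold models, one sample, two labels.
print(ensemble_average([[[0.25, 1.0]], [[0.75, 0.0]]]))
```

With five folds, `fold_predictions` would hold the outputs of the five saved models on the same test proteins, and the averaged probabilities would serve as the final predictions.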
To comprehensively evaluate model performance, AUC, AUPR, recall, precision, F1-score (F1), and Matthews correlation coefficient (MCC) were used, as defined in Supplementary Evaluation metrics.\u003c/p\u003e "},{"header":"Declarations","content":"\u003cdiv id=\"Sec25\" class=\"Section3\"\u003e \u003ch2\u003eData availability\u003c/h2\u003e \u003cp\u003eThe enzyme function data were obtained from a previous study (CLEAN) and are available on GitHub (\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://github.com/tttianhao/CLEAN/tree/main/app/data\u003c/span\u003e\u003cspan address=\"https://github.com/tttianhao/CLEAN/tree/main/app/data\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e). The enzyme active site data were derived from a previous work (CRpred) and are available at \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttp://biomine.cs.vcu.edu/datasets/CRpred/CRpred.html\u003c/span\u003e\u003cspan address=\"http://biomine.cs.vcu.edu/datasets/CRpred/CRpred.html\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e. The enzyme optimum pH data were newly curated from the BRENDA database (\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://www.brenda-enzymes.org/\u003c/span\u003e\u003cspan address=\"https://www.brenda-enzymes.org/\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e) and are available at \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://github.com/biomed-AI/GraphEC/tree/main/Optimum_pH/data/datasets\u003c/span\u003e\u003cspan address=\"https://github.com/biomed-AI/GraphEC/tree/main/Optimum_pH/data/datasets\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e. 
A figshare version is also available at \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.6084/m9.figshare.25714305\u003c/span\u003e\u003cspan address=\"10.6084/m9.figshare.25714305\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e. Source data are provided with this paper.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec26\" class=\"Section3\"\u003e \u003ch2\u003eCode availability\u003c/h2\u003e \u003cp\u003eThe source code of GraphEC is available at \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://github.com/biomed-AI/GraphEC\u003c/span\u003e\u003cspan address=\"https://github.com/biomed-AI/GraphEC\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e. A Zenodo version is also available at \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.5281/zenodo.13375275\u003c/span\u003e\u003cspan address=\"10.5281/zenodo.13375275\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e. [69]\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e\u003cp\u003e \u003ch2\u003eAuthor Contribution Statement\u003c/h2\u003e \u003cp\u003eY.S. and Y.Y. conceived and supervised the project. Y.S. and Q.Y. contributed to the implementation of the GraphEC algorithm. Y.S. and Y.Y. wrote the manuscript. S.C., Y.Z., H.Z. and Y.Y. 
participated in the discussion and proofreading.\u003c/p\u003e \u003c/p\u003e\u003cp\u003e \u003ch2\u003eCompeting Interest Statement\u003c/h2\u003e \u003cp\u003eThe authors declare that no competing interests exist.\u003c/p\u003e \u003c/p\u003e\u003ch2\u003eAcknowledgements\u003c/h2\u003e \u003cp\u003eThis study has been supported by the National Natural Science Foundation of China (T2394502) and the National Key R\u0026amp;D Program of China (2022YFF1203100).\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\n\u003cli\u003eKohli, R.M. and Y. Zhang, \u003cem\u003eTET enzymes, TDG and the dynamics of DNA demethylation.\u003c/em\u003e Nature, 2013. \u003cstrong\u003e502\u003c/strong\u003e(7472): p. 472\u0026ndash;479.\u003c/li\u003e\n\u003cli\u003eMakrydaki, E., et al., \u003cem\u003eImmobilized enzyme cascade for targeted glycosylation.\u003c/em\u003e Nature Chemical Biology, 2024: p. 1\u0026ndash;10.\u003c/li\u003e\n\u003cli\u003eFinley, S.D., L.J. Broadbelt, and V. Hatzimanikatis, \u003cem\u003eComputational framework for predictive biodegradation.\u003c/em\u003e Biotechnology and Bioengineering, 2009. \u003cstrong\u003e104\u003c/strong\u003e(6): p. 1086\u0026ndash;1097.\u003c/li\u003e\n\u003cli\u003eHoffmann, B., et al., \u003cem\u003eNature and prevalence of pain in Fabry disease and its response to enzyme replacement therapy\u0026mdash;a retrospective analysis from the Fabry Outcome Survey.\u003c/em\u003e The Clinical Journal of Pain, 2007. \u003cstrong\u003e23\u003c/strong\u003e(6): p. 535\u0026ndash;542.\u003c/li\u003e\n\u003cli\u003eNomenclature, E., \u003cem\u003eRecommendations of the nomenclature committee of the international union of biochemistry and molecular biology on the nomenclature and classification of enzymes\u003c/em\u003e. 1992, Academic, New York.\u003c/li\u003e\n\u003cli\u003eGoddard, J.-P. and J.-L. Reymond, \u003cem\u003eEnzyme assays for high-throughput screening.\u003c/em\u003e Current opinion in biotechnology, 2004. \u003cstrong\u003e15\u003c/strong\u003e(4): p. 
314\u0026ndash;322.\u003c/li\u003e\n\u003cli\u003eDesai, D.K., et al., \u003cem\u003eModEnzA: accurate identification of metabolic enzymes using function specific profile HMMs with optimised discrimination threshold and modified emission probabilities.\u003c/em\u003e Advances in bioinformatics, 2011. \u003cstrong\u003e2011\u003c/strong\u003e.\u003c/li\u003e\n\u003cli\u003eKumar, N. and J. Skolnick, \u003cem\u003eEFICAz2.5: application of a high-precision enzyme function predictor to 396 proteomes.\u003c/em\u003e Bioinformatics, 2012. \u003cstrong\u003e28\u003c/strong\u003e(20): p. 2687\u0026ndash;2688.\u003c/li\u003e\n\u003cli\u003eRoy, A., J. Yang, and Y. Zhang, \u003cem\u003eCOFACTOR: an accurate comparative algorithm for structure-based protein function annotation.\u003c/em\u003e Nucleic acids research, 2012. \u003cstrong\u003e40\u003c/strong\u003e(W1): p. W471-W477.\u003c/li\u003e\n\u003cli\u003eZhang, C., P.L. Freddolino, and Y. Zhang, \u003cem\u003eCOFACTOR: improved protein function prediction by combining structure, sequence and protein\u0026ndash;protein interaction information.\u003c/em\u003e Nucleic acids research, 2017. \u003cstrong\u003e45\u003c/strong\u003e(W1): p. W291-W299.\u003c/li\u003e\n\u003cli\u003eYu, T., et al., \u003cem\u003eEnzyme function prediction using contrastive learning.\u003c/em\u003e Science, 2023. \u003cstrong\u003e379\u003c/strong\u003e(6639): p. 1358\u0026ndash;1363.\u003c/li\u003e\n\u003cli\u003eSanderson, T., et al., \u003cem\u003eProteInfer, deep neural networks for protein functional inference.\u003c/em\u003e Elife, 2023. \u003cstrong\u003e12\u003c/strong\u003e: p. e80942.\u003c/li\u003e\n\u003cli\u003eZou, H.-L. and X. Xiao, \u003cem\u003eClassifying multifunctional enzymes by incorporating three different models into Chou\u0026rsquo;s general pseudo amino acid composition.\u003c/em\u003e The Journal of membrane biology, 2016. \u003cstrong\u003e249\u003c/strong\u003e(4): p. 
551\u0026ndash;557.\u003c/li\u003e\n\u003cli\u003eAltschul, S.F., et al., \u003cem\u003eBasic local alignment search tool.\u003c/em\u003e Journal of molecular biology, 1990. \u003cstrong\u003e215\u003c/strong\u003e(3): p. 403\u0026ndash;410.\u003c/li\u003e\n\u003cli\u003eBuchfink, B., C. Xie, and D.H. Huson, \u003cem\u003eFast and sensitive protein alignment using DIAMOND.\u003c/em\u003e Nature methods, 2015. \u003cstrong\u003e12\u003c/strong\u003e(1): p. 59\u0026ndash;60.\u003c/li\u003e\n\u003cli\u003eYang, J., et al., \u003cem\u003eThe I-TASSER Suite: protein structure and function prediction.\u003c/em\u003e Nature methods, 2015. \u003cstrong\u003e12\u003c/strong\u003e(1): p. 7\u0026ndash;8.\u003c/li\u003e\n\u003cli\u003eYang, J., A. Roy, and Y. Zhang, \u003cem\u003eBioLiP: a semi-manually curated database for biologically relevant ligand\u0026ndash;protein interactions.\u003c/em\u003e Nucleic acids research, 2012. \u003cstrong\u003e41\u003c/strong\u003e(D1): p. D1096-D1103.\u003c/li\u003e\n\u003cli\u003eVolpato, V., A. Adelfio, and G. Pollastri, \u003cem\u003eAccurate prediction of protein enzymatic class by N-to-1 Neural Networks.\u003c/em\u003e BMC bioinformatics, 2013. \u003cstrong\u003e14\u003c/strong\u003e(1): p. 1\u0026ndash;7.\u003c/li\u003e\n\u003cli\u003eWang, Y.-C., et al., \u003cem\u003ePrediction of enzyme subfamily class via pseudo amino acid composition by incorporating the conjoint triad feature.\u003c/em\u003e Protein and Peptide Letters, 2010. \u003cstrong\u003e17\u003c/strong\u003e(11): p. 1441\u0026ndash;1449.\u003c/li\u003e\n\u003cli\u003eRyu, J.Y., H.U. Kim, and S.Y. Lee, \u003cem\u003eDeep learning enables high-quality and high-throughput prediction of enzyme commission numbers.\u003c/em\u003e Proceedings of the National Academy of Sciences, 2019. \u003cstrong\u003e116\u003c/strong\u003e(28): p. 
13996\u0026ndash;14001.\u003c/li\u003e\n\u003cli\u003eLi, Y., et al., \u003cem\u003eDEEPre: sequence-based enzyme EC number prediction by deep learning.\u003c/em\u003e Bioinformatics, 2018. \u003cstrong\u003e34\u003c/strong\u003e(5): p. 760\u0026ndash;769.\u003c/li\u003e\n\u003cli\u003eSarker, B., D.W. Ritchie, and S. Aridhi, \u003cem\u003eGrAPFI: predicting enzymatic function of proteins from domain similarity graphs.\u003c/em\u003e BMC bioinformatics, 2020. \u003cstrong\u003e21\u003c/strong\u003e: p. 1\u0026ndash;15.\u003c/li\u003e\n\u003cli\u003eHan, S.-R., et al., \u003cem\u003eEvidential deep learning for trustworthy prediction of enzyme commission number.\u003c/em\u003e Briefings in Bioinformatics, 2024. \u003cstrong\u003e25\u003c/strong\u003e(1): p. bbad401.\u003c/li\u003e\n\u003cli\u003eHeinzinger, M., et al., \u003cem\u003eContrastive learning on protein embeddings enlightens midnight zone.\u003c/em\u003e NAR genomics and bioinformatics, 2022. \u003cstrong\u003e4\u003c/strong\u003e(2): p. lqac043.\u003c/li\u003e\n\u003cli\u003eJumper, J., et al., \u003cem\u003eHighly accurate protein structure prediction with AlphaFold.\u003c/em\u003e Nature, 2021. \u003cstrong\u003e596\u003c/strong\u003e(7873): p. 583\u0026ndash;589.\u003c/li\u003e\n\u003cli\u003eYuan, Q., et al., \u003cem\u003eAlphaFold2-aware protein\u0026ndash;DNA binding site prediction using graph transformer.\u003c/em\u003e Briefings in Bioinformatics, 2022. \u003cstrong\u003e23\u003c/strong\u003e(2): p. bbab564.\u003c/li\u003e\n\u003cli\u003eYidong, S., Y. Qianmu, and Y. Yuedong, \u003cem\u003eApplication of deep learning in protein function prediction.\u003c/em\u003e Synthetic Biology Journal, 2023. \u003cstrong\u003e4\u003c/strong\u003e(3): p. 488.\u003c/li\u003e\n\u003cli\u003eWong, F., et al., \u003cem\u003eBenchmarking AlphaFold-enabled molecular docking predictions for antibiotic discovery.\u003c/em\u003e Molecular Systems Biology, 2022. \u003cstrong\u003e18\u003c/strong\u003e(9): p. 
e11081.\u003c/li\u003e\n\u003cli\u003eRuff, K.M. and R.V. Pappu, \u003cem\u003eAlphaFold and implications for intrinsically disordered proteins.\u003c/em\u003e Journal of Molecular Biology, 2021. \u003cstrong\u003e433\u003c/strong\u003e(20): p. 167208.\u003c/li\u003e\n\u003cli\u003eLin, Z., et al., \u003cem\u003eEvolutionary-scale prediction of atomic-level protein structure with a language model.\u003c/em\u003e Science, 2023. \u003cstrong\u003e379\u003c/strong\u003e(6637): p. 1123\u0026ndash;1130.\u003c/li\u003e\n\u003cli\u003eHandelsman, J., et al., \u003cem\u003eMolecular biological access to the chemistry of unknown soil microbes: a new frontier for natural products.\u003c/em\u003e Chemistry \u0026amp; biology, 1998. \u003cstrong\u003e5\u003c/strong\u003e(10): p. R245-R249.\u003c/li\u003e\n\u003cli\u003eSong, Y., et al., \u003cem\u003eAccurately identifying nucleic-acid-binding sites through geometric graph learning on language model predicted structures.\u003c/em\u003e Briefings in Bioinformatics, 2023. \u003cstrong\u003e24\u003c/strong\u003e(6): p. bbad360.\u003c/li\u003e\n\u003cli\u003eBal, R., Y. Xiao, and W. Wang, \u003cem\u003ePGraphDTA: Improving Drug Target Interaction Prediction using Protein Language Models and Contact Maps.\u003c/em\u003e arXiv preprint arXiv:2310.04017, 2023.\u003c/li\u003e\n\u003cli\u003eJing, B., et al., \u003cem\u003eLearning from protein structure with geometric vector perceptrons.\u003c/em\u003e arXiv preprint arXiv:2009.01411, 2020.\u003c/li\u003e\n\u003cli\u003eDauparas, J., et al., \u003cem\u003eRobust deep learning\u0026ndash;based protein sequence design using ProteinMPNN.\u003c/em\u003e Science, 2022. \u003cstrong\u003e378\u003c/strong\u003e(6615): p. 49\u0026ndash;56.\u003c/li\u003e\n\u003cli\u003eGao, Z., et al., \u003cem\u003ePiFold: Toward effective and efficient protein inverse folding.\u003c/em\u003e arXiv preprint arXiv:2209.12643, 2022.\u003c/li\u003e\n\u003cli\u003eSt\u0026auml;rk, H., et al. 
\u003cem\u003eEquibind: Geometric deep learning for drug binding structure prediction\u003c/em\u003e. in \u003cem\u003eInternational conference on machine learning\u003c/em\u003e. 2022. PMLR.\u003c/li\u003e\n\u003cli\u003eYuan, Q., C. Tian, and Y. Yang, \u003cem\u003eGenome-scale annotation of protein binding sites via language model and geometric deep learning.\u003c/em\u003e eLife, 2024. \u003cstrong\u003e13\u003c/strong\u003e: p. RP93695.\u003c/li\u003e\n\u003cli\u003eElnaggar, A., et al., \u003cem\u003eProttrans: Toward understanding the language of life through self-supervised learning.\u003c/em\u003e IEEE transactions on pattern analysis and machine intelligence, 2021. \u003cstrong\u003e44\u003c/strong\u003e(10): p. 7112\u0026ndash;7127.\u003c/li\u003e\n\u003cli\u003eRives, A., et al., \u003cem\u003eBiological structure and function emerge from scaling unsupervised learning to 250\u0026nbsp;million protein sequences.\u003c/em\u003e Proceedings of the National Academy of Sciences, 2021. \u003cstrong\u003e118\u003c/strong\u003e(15): p. e2016239118.\u003c/li\u003e\n\u003cli\u003eKahraman, A. and J.M. Thornton, \u003cem\u003eMethods to characterize the structure of enzyme binding sites.\u003c/em\u003e Computational Structural Biology-Methods and Applications, 2008. \u003cstrong\u003e1\u003c/strong\u003e: p. 189\u0026ndash;221.\u003c/li\u003e\n\u003cli\u003eTorrance, J.W. and J.M. Thornton, \u003cem\u003eStructure-Based Prediction of Enzymes and Their Active Sites.\u003c/em\u003e Prediction of Protein Structures, Functions, and Interactions, 2008: p. 187\u0026ndash;209.\u003c/li\u003e\n\u003cli\u003eRoche, D.B., D.A. Brackenridge, and L.J. McGuffin, \u003cem\u003eProteins and their interacting partners: An introduction to protein\u0026ndash;ligand binding site prediction methods.\u003c/em\u003e International journal of molecular sciences, 2015. \u003cstrong\u003e16\u003c/strong\u003e(12): p. 
29829\u0026ndash;29842.\u003c/li\u003e\n\u003cli\u003eTorres, M., et al., \u003cem\u003eProtein function prediction for newly sequenced organisms.\u003c/em\u003e Nature Machine Intelligence, 2021. \u003cstrong\u003e3\u003c/strong\u003e(12): p. 1050\u0026ndash;1060.\u003c/li\u003e\n\u003cli\u003eSong, J., et al., \u003cem\u003ePREvaIL, an integrative approach for inferring catalytic residues using sequence, structural, and network features in a machine-learning framework.\u003c/em\u003e 2018. \u003cstrong\u003e443\u003c/strong\u003e: p. 125\u0026ndash;137.\u003c/li\u003e\n\u003cli\u003eZhang, T., et al., \u003cem\u003eAccurate sequence-based prediction of catalytic residues.\u003c/em\u003e Bioinformatics, 2008. \u003cstrong\u003e24\u003c/strong\u003e(20): p. 2329\u0026ndash;2338.\u003c/li\u003e\n\u003cli\u003eChea, E. and D.R. Livesay, \u003cem\u003eHow accurate and statistically robust are catalytic site predictions based on closeness centrality?\u003c/em\u003e Bmc Bioinformatics, 2007. \u003cstrong\u003e8\u003c/strong\u003e(1): p. 1\u0026ndash;14.\u003c/li\u003e\n\u003cli\u003eAltschul, S.F., et al., \u003cem\u003eGapped BLAST and PSI-BLAST: a new generation of protein database search programs.\u003c/em\u003e Nucleic acids research, 1997. \u003cstrong\u003e25\u003c/strong\u003e(17): p. 3389\u0026ndash;3402.\u003c/li\u003e\n\u003cli\u003eZhang, Y. and J. Skolnick, \u003cem\u003eTM-align: a protein structure alignment algorithm based on the TM-score.\u003c/em\u003e Nucleic acids research, 2005. \u003cstrong\u003e33\u003c/strong\u003e(7): p. 2302\u0026ndash;2309.\u003c/li\u003e\n\u003cli\u003ePrice, M.N., et al., \u003cem\u003eMutant phenotypes for thousands of bacterial genes of unknown function.\u003c/em\u003e Nature, 2018. \u003cstrong\u003e557\u003c/strong\u003e(7706): p. 
503\u0026ndash;509.\u003c/li\u003e\n\u003cli\u003eDalkiran, A., et al., \u003cem\u003eECPred: a tool for the prediction of the enzymatic functions of protein sequences based on the EC nomenclature.\u003c/em\u003e BMC bioinformatics, 2018. \u003cstrong\u003e19\u003c/strong\u003e(1): p. 1\u0026ndash;13.\u003c/li\u003e\n\u003cli\u003eMeiler, J., et al., \u003cem\u003eGeneration and evaluation of dimension-reduced amino acid parameter representations by artificial neural networks.\u003c/em\u003e Molecular modeling annual, 2001. \u003cstrong\u003e7\u003c/strong\u003e(9): p. 360\u0026ndash;369.\u003c/li\u003e\n\u003cli\u003eChen, J., et al., \u003cem\u003eStructure-aware protein solubility prediction from sequence through graph convolutional network and predicted contact map.\u003c/em\u003e Journal of cheminformatics, 2021. \u003cstrong\u003e13\u003c/strong\u003e: p. 1\u0026ndash;10.\u003c/li\u003e\n\u003cli\u003eSchomburg, I., et al., \u003cem\u003eThe BRENDA enzyme information system\u0026ndash;From a database to an expert system.\u003c/em\u003e Journal of biotechnology, 2017. \u003cstrong\u003e261\u003c/strong\u003e: p. 194\u0026ndash;206.\u003c/li\u003e\n\u003cli\u003eGado, J.E., et al., \u003cem\u003eDeep learning prediction of enzyme optimum pH.\u003c/em\u003e bioRxiv, 2023: p. 2023.06.22.544776.\u003c/li\u003e\n\u003cli\u003eVan Kempen, M., et al., \u003cem\u003eFast and accurate protein structure search with Foldseek.\u003c/em\u003e Nature Biotechnology, 2024. \u003cstrong\u003e42\u003c/strong\u003e(2): p. 243\u0026ndash;246.\u003c/li\u003e\n\u003cli\u003eGutteridge, A., G.J. Bartlett, and J.M. Thornton, \u003cem\u003eUsing a neural network and spatial clustering to predict the location of active sites in enzymes.\u003c/em\u003e Journal of molecular biology, 2003. \u003cstrong\u003e330\u003c/strong\u003e(4): p. 719\u0026ndash;734.\u003c/li\u003e\n\u003cli\u003ePetrova, N.V. and C.H. 
Wu, \u003cem\u003ePrediction of catalytic residues using Support Vector Machine with selected protein sequence and structural properties.\u003c/em\u003e BMC bioinformatics, 2006. \u003cstrong\u003e7\u003c/strong\u003e: p. 1\u0026ndash;12.\u003c/li\u003e\n\u003cli\u003eYoun, E., et al., \u003cem\u003eEvaluation of features for catalytic residue prediction in novel folds.\u003c/em\u003e Protein Science, 2007. \u003cstrong\u003e16\u003c/strong\u003e(2): p. 216\u0026ndash;226.\u003c/li\u003e\n\u003cli\u003eSteinegger, M. and J. S\u0026ouml;ding, \u003cem\u003eMMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets.\u003c/em\u003e Nature biotechnology, 2017. \u003cstrong\u003e35\u003c/strong\u003e(11): p. 1026\u0026ndash;1028.\u003c/li\u003e\n\u003cli\u003e\u003cem\u003eUniProt: the universal protein knowledgebase in 2021.\u003c/em\u003e Nucleic acids research, 2021. \u003cstrong\u003e49\u003c/strong\u003e(D1): p. D480-D489.\u003c/li\u003e\n\u003cli\u003eHuynh, D.Q., \u003cem\u003eMetrics for 3D rotations: Comparison and analysis.\u003c/em\u003e Journal of Mathematical Imaging and Vision, 2009. \u003cstrong\u003e35\u003c/strong\u003e: p. 155\u0026ndash;164.\u003c/li\u003e\n\u003cli\u003eIngraham, J., et al., \u003cem\u003eGenerative models for graph-based protein design.\u003c/em\u003e Advances in neural information processing systems, 2019. \u003cstrong\u003e32\u003c/strong\u003e.\u003c/li\u003e\n\u003cli\u003eSong, Y., et al., \u003cem\u003eFast and accurate protein intrinsic disorder prediction by using a pretrained language model.\u003c/em\u003e Briefings in Bioinformatics, 2023: p. bbad173.\u003c/li\u003e\n\u003cli\u003eCornish-Bowden, A., \u003cem\u003eCurrent IUBMB recommendations on enzyme nomenclature and kinetics.\u003c/em\u003e Perspectives in Science, 2014. \u003cstrong\u003e1\u003c/strong\u003e(1\u0026ndash;6): p. 
74\u0026ndash;87.\u003c/li\u003e\n\u003cli\u003eRaffel, C., et al., \u003cem\u003eExploring the limits of transfer learning with a unified text-to-text transformer.\u003c/em\u003e The Journal of Machine Learning Research, 2020. \u003cstrong\u003e21\u003c/strong\u003e(1): p. 5485\u0026ndash;5551.\u003c/li\u003e\n\u003cli\u003eSuzek, B.E., et al., \u003cem\u003eUniRef: comprehensive and non-redundant UniProt reference clusters.\u003c/em\u003e Bioinformatics, 2007. \u003cstrong\u003e23\u003c/strong\u003e(10): p. 1282\u0026ndash;1288.\u003c/li\u003e\n\u003cli\u003eYuan, Q., et al., \u003cem\u003eFast and accurate protein function prediction from sequence through pretrained language model and homology-based label diffusion.\u003c/em\u003e Briefings in bioinformatics, 2023. \u003cstrong\u003e24\u003c/strong\u003e(3): p. bbad117.\u003c/li\u003e\n\u003cli\u003eSong, Y., et al., \u003cem\u003eAccurately predicting enzyme functions through geometric graph learning\u0026nbsp;\u003c/em\u003e\u003cem\u003eon ESMFold-predicted structures.\u003c/em\u003e Zenodo (2024) https://doi.org/10.5281/zenodo.13375275.\u003c/li\u003e\n\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":true,"hideJournal":false,"highlight":"","institution":"","isAcceptedByJournal":true,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"
[email protected]","identity":"nature-portfolio","isNatureJournal":true,"hasQc":false,"allowDirectSubmit":false,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"","title":"Nature Portfolio","twitterHandle":"","acdcEnabled":false,"dfaEnabled":false,"editorialSystem":"ejp","reportingPortfolio":"","inReviewEnabled":true,"inReviewRevisionsEnabled":false},"keywords":"Enzyme Commission number, enzyme active sites, enzyme optimum pH, geometric graph learning, pre-trained language model","lastPublishedDoi":"10.21203/rs.3.rs-4344209/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-4344209/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003eEnzymes are crucial in numerous biological processes, with the Enzyme Commission (EC) number being a commonly used method for defining enzyme function. However, current EC number prediction technologies have not fully recognized the importance of enzyme active sites and structural characteristics. Here, we propose GraphEC, a geometric graph learning-based EC number predictor using the ESMFold-predicted structures and a pre-trained protein language model. Specifically, we first construct a model to predict the enzyme active sites, which is utilized to predict the EC number. The prediction is further improved through a label diffusion algorithm by incorporating homology information. In parallel, the optimum pH of enzymes is predicted to reflect the enzyme-catalyzed reactions. Experiments demonstrate the superior performance of our model in predicting active sites, EC numbers, and optimum pH compared to other state-of-the-art methods. Additional analysis reveals that GraphEC is capable of extracting functional information from protein structures, emphasizing the effectiveness of geometric graph learning. 
This technology can be used to identify unannotated enzyme functions, as well as to predict their active sites and optimum pH, with the potential to advance research in synthetic biology, genomics, and other fields.\u003c/p\u003e","manuscriptTitle":"Accurately predicting enzyme functions through geometric graph learning on ESMFold-predicted structures","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2024-09-13 10:13:42","doi":"10.21203/rs.3.rs-4344209/v1","editorialEvents":[],"status":"published","journal":{"display":true,"email":"
[email protected]","identity":"nature-communications","isNatureJournal":true,"hasQc":false,"allowDirectSubmit":false,"externalIdentity":"NCOMMS","sideBox":"Learn more about [Nature Communications](http://www.nature.com/ncomms/)","snPcode":"","submissionUrl":"https://mts-ncomms.nature.com/","title":"Nature Communications","twitterHandle":"","acdcEnabled":true,"dfaEnabled":true,"editorialSystem":"ejp","reportingPortfolio":"Nature Communications","inReviewEnabled":true,"inReviewRevisionsEnabled":false}}],"origin":"","ownerIdentity":"1e98db35-576e-4079-ad55-a2f39392c8b1","owner":[],"postedDate":"September 13th, 2024","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"published-in-journal","subjectAreas":[{"id":37462253,"name":"Biological sciences/Computational biology and bioinformatics/Protein function predictions"},{"id":37462254,"name":"Biological sciences/Biological techniques/Bioinformatics"},{"id":37462255,"name":"Biological sciences/Computational biology and bioinformatics/Machine learning"},{"id":37462256,"name":"Biological sciences/Computational biology and bioinformatics/Computational models"}],"tags":[],"updatedAt":"2024-09-27T10:41:12+00:00","versionOfRecord":{"articleIdentity":"rs-4344209","link":"https://doi.org/10.1038/s41467-024-52533-w","journal":{"identity":"nature-communications","isVorOnly":false,"title":"Nature Communications"},"publishedOn":"2024-09-18 04:00:00","publishedOnDateReadable":"September 18th, 2024"},"versionCreatedAt":"2024-09-13 
10:13:42","video":"","vorDoi":"10.1038/s41467-024-52533-w","vorDoiUrl":"https://doi.org/10.1038/s41467-024-52533-w","workflowStages":[]},"version":"v1","identity":"rs-4344209","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-4344209","identity":"rs-4344209","version":["v1"]},"buildId":"qtupq5eGEP_6zYnWcrvyt","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}