A physics-informed cluster graph neural network enables generalizable and interpretable prediction for material discovery | Research Square

Ming Yang, Hao Cheng, Tong Yang, Minggang Zeng, Hao Wu, Zhengzhong Wang, and 7 more

This is a preprint; it has not been peer reviewed by a journal. https://doi.org/10.21203/rs.3.rs-4429598/v1 This work is licensed under a CC BY 4.0 License. Status: Posted. Version 1.

Abstract

Machine learning (ML) plays a pivotal role in the development of functional materials, in which graph neural networks (GNNs) have shown improved performance by utilizing the graph representation of atoms and bonds to effectively characterize materials. However, it remains challenging to achieve efficient, robust and interpretable predictions due to the limited integration of domain knowledge.
In this study, we propose leveraging the local structure and short-range atomic interactions of materials using a cluster graph representation to improve the performance. This physics-informed cluster graph neural network (CG-NET) significantly enhances computational efficiency through a cluster sampling strategy. Importantly, by incorporating pseudo nodes as neighbors to the nodes at the cluster boundaries, we maintain the bonding coordination environment, enhancing the prediction accuracy. We further demonstrate CG-NET’s remarkable prediction accuracy and efficiency across diverse material systems and properties and reveal its superior interpretability and generalizability with extensive experiments. Our work highlights the importance of integrating domain-specific scientific knowledge into the design of a generalizable and interpretable ML framework. The cluster graph representation in the CG-NET could be extended to other graph-based neural networks to accelerate the development of functional materials while significantly reducing computational cost.

Subject areas: Physical sciences/Materials science/Theory and computation/Computational methods; Physical sciences/Mathematics and computing/Computational science

Keywords: Cluster graph convolutional neural network; Short-range interaction; Interpretable machine learning; Disordered materials; High-entropy alloys

Introduction

Compared with conventional material property prediction approaches 1 , 2 , the application of machine learning (ML) has significantly expedited the process of material discovery, offering favorable effectiveness and cost-efficiency 3 – 5 . Among the vast array of ML-based approaches, graph neural networks (GNNs) have achieved superior performance over prior ML methods on predicting the physical and chemical properties of the materials 6 – 12 .
The primary advantage of GNNs lies in their capacity to handle systems of arbitrary complexity without suffering from combinatorial explosion typically associated with the increasing number of distinct elements in the large material system 9 , 13 . Graph structures provide natural representations for crystals and molecules by seamlessly embedding atomic information and bonding interactions into graph nodes and edges, respectively. Recent studies have integrated the concepts of equivariance and many-body interactions into graph representations, showing remarkable effectiveness in deciphering nonlinear atomic interactions and structure-property relationships in functional materials 8 – 12 , 14 . However, despite their impressive performance, these advancements have also led to a surge in model complexity and a growing demand for large training datasets 15 . Additionally, the vast search space for materials with desirable properties presents further challenges 16 – 19 . As a result, existing methods generally suffer significant computational demands and lengthy training times, highlighting the urgent need for more efficient approaches to accurately predict the properties of functional materials. In light of this, we propose to exploit the localized atomic information through a cluster graph representation aimed at alleviating the computational burden typically associated with conventional GNNs, while maintaining high prediction accuracy. The physical and chemical properties of solid materials are inherently determined by their atomic structures, i.e. the periodic arrangement of atoms in the crystal lattice. 20 Existing GNNs for material development generally incorporate the entire crystal structure into their graph representation 6 – 12 . 
Consequently, these approaches account for all atoms and bonds within the crystal lattice as graph nodes and edges, leading to high memory consumption and intricate learning processes, particularly when handling large material systems. However, it is worth noting that many material properties, such as catalytic properties 21 , 22 , defect properties 23 , 24 and polaron states 25 , 26 , are predominantly influenced by short-range interactions resulting from the local atomic structure, rather than long-range interactions across the crystal lattice. For instance, the adsorption energy, a critical factor in catalytic performance, is primarily determined by the interactions between the molecules and their nearby catalyst atoms at the surface, while the atoms far from the molecule only exert minimal influence. 27 Several recent studies have attempted to incorporate the local structure and short-range interactions into the design of GNNs. For example, Ghanekar et al. proposed an adsorbate chemical environment-based GNN that simplifies surface atomistic configurations by focusing on the local chemical environment of adsorbates through subgraph generation 28 . Similarly, Pablo-García et al. developed a Graph-based Adsorption on Metal Energy-neural Network, which rapidly evaluates adsorption energy by emphasizing local chemical structures 29 . These methods have shown promising results in terms of both accuracy and efficiency, affirming the value of exploiting local structure and short-range interactions for improving material property predictions. However, these approaches only account for the first nearest neighbors (1NN) between adsorbates and surfaces, neglecting the influence of other neighboring atoms. Moreover, coordination environment of atoms at the sampling boundaries is changed within the graph representation, potentially leading to information loss and suboptimal performance. 
In addition, to the best of our knowledge, these efforts have predominantly concentrated on catalytic adsorption processes. Developing GNNs that incorporate short-range interactions into graph representations for a broader range of material systems remains largely underexplored. In this work, we introduce a cluster graph convolutional neural network (CG-NET) to predict local structural properties governed by short-range interactions across diverse material systems. CG-NET emphasizes localized atomic information and employs a cluster graph representation to efficiently model the local structure of materials, facilitating fast and reliable predictions for complex, disordered, and varied material systems. Importantly, we develop a cluster graph sampling strategy to incorporate the short-range interactions more comprehensively. Furthermore, we introduce pseudo nodes as neighbors to the nodes near the boundary of the cluster to preserve the coordination environment information of the local structure. Together, these design choices significantly improve the prediction accuracy. Compared to traditional crystal graph methods, this approach offers a significant improvement in computational efficiency while preserving high prediction accuracy. In addition, CG-NET demonstrates markedly enhanced generalizability and interpretability across multiple datasets.

Construction of the CG-NET

We develop a CG-NET that exploits short-range interactions to achieve enhanced material property prediction for material systems whose properties are primarily influenced by the local atomic structure, rather than relying on the entire crystal lattice. The framework and architecture of the CG-NET are illustrated in Fig. 1 and Supplementary Fig. 1, respectively, with the detailed information and formulations provided in Supplementary Section 1. As shown in Fig. 1a, for a local structure, we employ a cluster sampling strategy to select a representative cluster.
In the context of physics, cluster sampling involves selecting atomic clusters within a material as the study subjects, a method known to improve efficiency and simplicity, particularly when investigating complex materials with intricate structures 30 , 31 . Compared to conventional approaches that utilize all atoms within a unit cell for representation, the use of cluster sampling significantly reduces the number of nodes, and the size of the material graph required for prediction, thus alleviating the computational load. With this method, only the atoms within the selected cluster are considered as nodes for generating cluster graphs, while distant atoms inside the unit cell are excluded. This selective inclusion of atoms highly relevant to the local structural features ensures that the model could effectively capture short-range atomic interactions for accurate property prediction. In the CG-NET, the selected cluster is defined by a sphere centered around the local structure, with cluster size determined by the sphere radius \(\:{\text{R}}_{\text{cluster}}\) . The center of the local structure is chosen on a case-by-case basis, depending on the specific local structural properties of interest. For example, in the prediction of surface adsorption, the local structure center can be naturally defined by the positions of adsorbate atoms or adsorption sites on the surface. Unlike prior works that only account for the first nearest neighbors (1NN) between adsorbates and surfaces 28 , 29 , our approach defines the extent of the local structure by the cluster radius \(\:{\text{R}}_{\text{cluster}}\) and the number of nodes inside the cluster \(\:{\text{N}}_{\text{cluster}}\) . This allows for a more accurate and effective representation of the local structure induced short-range interactions (see more discussion in the later section), making it adaptable to various material systems and structural properties. 
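As a minimal sketch of the radius-based selection described above, the following Python snippet keeps only the atoms lying within \(\text{R}_{\text{cluster}}\) of a chosen center. The function name `sample_cluster`, the optional cap `n_cluster`, and the omission of periodic images are illustrative assumptions and are not taken from the authors' implementation.

```python
import math


def sample_cluster(positions, center, r_cluster, n_cluster=None):
    """Return indices of atoms within r_cluster of center, nearest first.

    positions: list of (x, y, z) Cartesian coordinates in angstroms.
    center:    (x, y, z) of the local-structure center, e.g. a binding site.
    n_cluster: optional cap on the number of nodes kept (assumed behaviour).
    Periodic images are ignored here for brevity (a simplification).
    """
    inside = sorted(
        (math.dist(p, center), i)
        for i, p in enumerate(positions)
        if math.dist(p, center) <= r_cluster
    )
    if n_cluster is not None:
        inside = inside[:n_cluster]
    return [i for _, i in inside]


# Toy example: four atoms, center at the first one, 5 angstrom radius.
atoms = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 3.0, 0.0), (6.0, 0.0, 0.0)]
cluster = sample_cluster(atoms, center=(0.0, 0.0, 0.0), r_cluster=5.0)
```

In this toy case the atom at 6 Å falls outside the sphere and is dropped, which mirrors how distant atoms in the unit cell are excluded from the cluster graph.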
The selected cluster is then transformed into a graph representation, as shown in Fig. 1 b, where nodes represent the atoms within the cluster and edges signify the connections between these atoms. However, generating cluster graphs can lead to the loss of vital information about the local coordination environment of nodes within the cluster, potentially impairing model performance. To address this issue, we introduce pseudo nodes to supplement any missing information. These pseudo nodes serve as neighboring coordinates for the nodes near the boundary of the cluster, as indicated by dashed circles in Fig. 1 b. In particular, the nodes residing along the boundaries of cluster graph maintain their local structural integrity through interactions with neighboring pseudo nodes, effectively preserving the local structural information. This approach ensures that nodes close to the cluster center retain their coordinates within the material, while distant nodes are disregarded due to their minimal impact on local structural properties. The resulting cluster graph is then processed by CNNs to extract essential local structural features, which are subsequently used to predict a wide range of material properties, as depicted in Figs. 1 c and 1 d. In this study, we use the CGCNN as a baseline for comparison with our CG-NET. As one of the pioneering successes in applying GNNs to model crystalline materials, CGCNN has demonstrated remarkable capabilities in learning atomic embeddings directly from data, outperforming traditional human-engineered features. Since its development, CGCNN has consistently provided competitive performance and has been widely adopted as a benchmark in many studies 15 . However, it should be noted that the cluster graph representation employed in CG-NET is not limited to CGCNN, which primarily considers pair-wise interactions through an invariant graph model. 
In fact, this approach can be extended to more advanced geometrically equivariant graph models or integrated with many-body interaction frameworks. Such an extension could provide a more comprehensive characterization of the geometric and topological information intrinsic to material systems, while still maintaining computational efficiency.

Improved CG-NET Performance on Diverse Material Systems

To evaluate the performance of the CG-NET on diverse material systems and properties, we employ four different datasets: the high-entropy alloy (HEA) dataset, the two-dimensional impurity (2D-impurity) dataset 32 , the Open Catalyst 2020 (OC20) dataset 33 and the Open DAC 2023 (ODAC23) dataset 34 . We built the HEA dataset ourselves by analyzing the adsorption of intermediate adsorbates involved in the \(\text{CO}_{2}\) reduction reaction (\(\text{CO}_{2}\text{RR}\)) on the surfaces of the AgAuAlCuPt, AgAuCuPdPt and CoCuGaNiZn HEAs. Using density functional theory (DFT) calculations, we calculate the adsorption energies of the adsorbates \(\text{*CO}\), \(\text{*CHO}\), \(\text{*COH}\), and \(\text{*COOH}\), which are involved in potential limiting steps of the \(\text{CO}_{2}\text{RR}\) for \(\text{C}_{1}\) products, as well as the \(\text{*H}\) adsorbate, which is crucial for the competing hydrogen evolution reaction. 35 In addition to the HEA dataset, we utilize the 2D-impurity, OC20, and ODAC23 datasets sourced from established studies 32 – 34 . The OC20 dataset includes a large quantity of adsorption energies of various adsorbates on diverse catalyst surfaces, while the 2D-impurity dataset contains the defect formation energies of 2D materials. The ODAC23 dataset, containing over 8,400 metal-organic framework (MOF) materials with adsorbed CO₂ and H₂O molecules, represents the largest collection of MOF adsorption energies obtained using DFT calculations.
To generate cluster graphs effectively, we apply filters to these three open datasets and constrain the energy range to avoid substantial deviations. Further details are provided in Supplementary Section 2 and Supplementary Table 1. All four datasets focus on material properties primarily determined by local structure-induced short-range interactions. Specifically, for the adsorbates on catalyst surfaces in the HEA, OC20 and ODAC23 datasets, the relevant adsorption energy predominantly relies on the local structure composed of adsorbates and their neighboring catalytic atoms; for impurities in 2D materials within the 2D-impurity dataset, the formation energy of impurities is heavily dependent on the local bonding structure around the defects. Together, these datasets provide a comprehensive assessment of the efficacy of the cluster graph method across a diverse range of functional materials.

Table 1. Statistical comparison of graph datasets generated using crystal graph and cluster graph representations. \(\bar{N}_{\text{atoms}}\) is the average number of atoms per structure; FLOPs per structure are given in millions (M).

Dataset | Structures | \(\bar{N}_{\text{atoms}}\) | Crystal graph nodes | Crystal graph edges | Crystal graph FLOPs/structure (M) | Cluster graph nodes | Cluster graph edges | Cluster graph FLOPs/structure (M)
HEA | 16,033 | 38 | 615,047 | 7,380,564 | 83.35 | 384,792 | 3,684,324 | 41.61
2D-impurity | 9,962 | 42 | 419,451 | 5,033,412 | 91.48 | 209,202 | 2,434,788 | 44.25
OC20 | 130,758 | 72 | 9,412,414 | 112,948,883 | 156.40 | 2,353,644 | 27,489,707 | 38.07
ODAC23 | 20,000 | 201 | 4,013,760 | 48,165,120 | 436.05 | 360,000 | 5,907,252 | 53.48

Local structure-induced short-range interactions are crucial in determining the physical and chemical properties of functional materials such as catalytic materials, semiconductors, and alloys 36 – 38 ; these interactions, however, are not limited to 1NN interactions. The effect of short-range interactions on the materials’ properties can be evaluated using the ligand effect 27 , which arises when the electronic structure of a central metal atom is altered by surrounding atoms in close proximity.
For example, we use DFT calculations to examine the ligand effect of short-range interactions on CO adsorption at the surface of the HEA, as illustrated in the left panel of Fig. 2a. By sequentially occluding one atom at a time in the slab model, we analyze the impact of each atom on the adsorption energy of CO. The degree of influence is visually represented by coloring each atom based on the magnitude of the change in adsorption energy \(\Delta E_{\text{ads}}\). Results in the right panel of Fig. 2a show that atoms closest to the central atom, particularly 1NN atoms on the surface and subsurface, significantly affect the adsorption energy. Atoms farther away, such as those in the fourth layer, contribute negligibly. Interestingly, some metal atoms in the third layer, which are fourth nearest neighbors, have a non-trivial effect on the adsorption energy. This phenomenon, reported in previous studies 39 – 42 , suggests that the effect of short-range interactions is not confined solely to 1NN atoms but can extend to second nearest neighbors (2NN) or even beyond. This underscores the importance of considering short-range interactions when studying local properties and highlights the need to account for neighboring atoms beyond the 1NN in graph-based models representing material structures. In this work, we develop a cluster sampling strategy and use the cluster radius \(\text{R}_{\text{cluster}}\) to define the extent of the graph construction. A performance comparison between graphs considering only the 1NN and cluster graphs including additional neighboring atoms is presented in Supplementary Fig. 5. The results demonstrate a significant improvement in prediction accuracy of 135%, 165%, 145%, and 119% on the HEA, 2D-impurity, OC20, and ODAC23 datasets, respectively, when neighboring atoms beyond the 1NN are included in the cluster graph.
Therefore, our cluster graph representation provides a more accurate depiction of local atomic structure, ensuring that crucial short-range interactions are captured while maintaining a balance between computational efficiency and prediction accuracy. We use both the CG-NET and CGCNN models to predict the adsorption energy on the HEA, OC20 and ODAC23 datasets and the defect formation energy on the 2D-impurity dataset. Further details regarding the training process can be found in Supplementary Section 3. To illustrate the cluster graph formulation in our approach, we use the HEA dataset as an example. As depicted in Fig. 2 b (left), the AgAuAlCuPt HEA surface model consists of a four-layer slab with randomly distributed Ag, Au, Al, Cu, and Pt metal elements in a face-centered cubic lattice, maintaining an approximately equimolar atomic ratio. In this model, the \(\:\text{*CO}\) adsorbate is positioned at the on-top site of HEA surface. Utilizing the cluster sampling strategy, a cluster graph is generated, as shown in Fig. 2 b (right), centered around an Al metal atom, which is the binding site on the HEA surface. The extent of this cluster is determined by the cluster radius \(\:{\text{R}}_{\text{cluster}}\) . The transformation from the surface model to the cluster graph involves \(\:{\text{N}}_{\text{cluster}}\) atoms within the cluster. For instance, with \(\:{\text{R}}_{\text{cluster}}\) = 5 Å, the cluster encompasses all 1NN and 2NN atoms of the binding site, both on the surface and in the subsurface layers. The \(\:{\text{R}}_{\text{cluster}}\) and \(\:{\text{N}}_{\text{cluster}}\) are critical hyperparameters in generating the cluster graph and ensuring convergence of these hyperparameters is key to obtaining reliable prediction results. Implementing the cluster graph approach significantly reduces computational costs due to the dramatically decreased number of nodes and edges within the graph representation. 
We present a statistical comparison of graph datasets generated using the crystal graph and cluster graph representations in Table 1. Across the HEA, 2D-impurity, OC20, and ODAC23 datasets, the average number of atoms per structure \(\bar{N}_{\text{atoms}}\) increases from 38 to 42, 72, and 201, respectively. Consequently, the floating-point operations (FLOPs) required per structure for a single convolution layer in the crystal graph representation increase roughly in proportion to the size of the material structures, from 83.35 to 91.48, 156.40, and 436.05 million FLOPs, as detailed in Supplementary Section 4. In contrast, the cluster graph representation substantially reduces FLOPs for each dataset. CG-NET achieves FLOPs savings of approximately 50%, 52%, 76%, and 88% on the HEA, 2D-impurity, OC20, and ODAC23 datasets, respectively. The comparison in Table 1 thus shows that the savings grow with structure size. Furthermore, the cluster graph representation outperforms the crystal graph in GPU video memory (VRAM) consumption, making it an even more resource-efficient approach. As shown in Fig. 2c, the VRAM required for the CG-NET model is reduced by over 84% compared to CGCNN on the same OC20 dataset. This improvement in efficiency allows for the use of larger batch sizes without requiring additional computational resources, as highlighted in Fig. 2d. For instance, with an RTX 3080 12GB GPU on the complete ODAC23 dataset (162,219 structures, including 32,824,409 atoms), CGCNN is limited to a maximum batch size of 32, whereas CG-NET can handle batch sizes exceeding 1024. Other models like M3GNet, which incorporate equivariance and many-body interactions into graph representations, face more severe VRAM limitations: VRAM overflow occurs with a batch size as small as 16 using M3GNet. These comparisons demonstrate the significantly improved scalability and resource efficiency of CG-NET.
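The quoted FLOPs savings follow directly from the per-structure values in Table 1. The short Python check below reproduces that arithmetic; the dictionary layout and the helper name `flops_saving` are illustrative choices, while the numbers are copied from Table 1.

```python
# FLOPs per structure in millions, (crystal graph, cluster graph), from Table 1.
flops_m = {
    "HEA": (83.35, 41.61),
    "2D-impurity": (91.48, 44.25),
    "OC20": (156.40, 38.07),
    "ODAC23": (436.05, 53.48),
}


def flops_saving(crystal, cluster):
    """Fractional reduction in FLOPs when moving from crystal to cluster graph."""
    return 1.0 - cluster / crystal


# Savings rounded to the nearest percent, as quoted in the text.
savings_percent = {
    name: round(100 * flops_saving(crystal, cluster))
    for name, (crystal, cluster) in flops_m.items()
}
```

The rounded values are 50, 52, 76, and 88 percent, in the same dataset order as the text.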
Importantly, this efficiency boost does not compromise prediction accuracy. Figure 2 c and 3c illustrate that CG-NET achieves a test MAE of approximately 0.465 eV, nearly identical to the 0.459 eV obtained by CGCNN. Further examinations of CG-NET on other datasets reinforce these advantages. On the HEA dataset, Fig. 3a shows that CG-NET reduces VRAM usage by approximately 35% compared to CGCNN, while maintaining an equivalent test MAE of 0.088 eV. On the 2D-impurity dataset, Fig. 3b shows that the VRAM usage of CG-NET is 61% lower than that of CGCNN, and its prediction accuracy (test MAE 0.445 eV) surpasses that of CGCNN (test MAE 0.453 eV). For the ODAC23 dataset, CG-NET achieves a lower test MAE of 0.186 eV, compared to 0.203 eV for CGCNN, while reducing VRAM usage by 90%, as shown in Fig. 3d. Note that, when MAE accuracy is set at 90% of CGCNN, CG-NET offers average VRAM savings of over 77% across all four datasets. From these detailed comparisons, we can see that the cluster graph approach in CG-NET can reduce computational requirements while maintaining or even improving prediction accuracy. The significantly lower computational costs in terms of FLOPs and VRAM usage establish CG-NET as a highly scalable and efficient model for material property prediction across diverse materials systems. Furthermore, we compare the performance of CG-NET models with and without pseudo neighboring nodes, as shown in Figs. 4 a-d. The incorporation of pseudo nodes plays a crucial role in preserving the coordination environment of the local structure, which is vital for sustaining prediction accuracy. A comparison of the cluster graph with and without pseudo neighboring nodes is illustrated in Supplementary Fig. 2, with further details in Supplementary Section 5. Across all datasets, CG-NET with pseudo neighboring nodes consistently outperforms models without the pseudo nodes. 
In particular, the exclusion of pseudo neighboring nodes results in a significant decline in performance on the HEA, 2D-impurity, and OC20 datasets, though little variation is observed on the ODAC23 dataset. This discrepancy is primarily attributed to the nature of the ODAC23 dataset, where the adsorption of H₂O and CO₂ molecules onto the MOF structures mainly involves weak van der Waals (VDW) interactions, rather than the stronger covalent or ionic bonding seen in the HEA, 2D-impurity, and OC20 datasets. As a result, short-range interactions in the ODAC23 dataset are less critical to the accuracy of the model. While pseudo neighboring nodes provide essential coordination information, they are only used as neighboring coordinates for nodes within the cluster graph and are not aggregated into the final readout phase, which somewhat limits their impact on the model. Improvements could be made through several means. One approach involves expanding the cluster radius to include a larger number of nodes within the cluster, thereby reducing the proportion of nodes connected to pseudo neighboring nodes. The effect of cluster radius on prediction performance is detailed in Fig. 3 and Supplementary Fig. 9, where results show that increasing the cluster radius and the number of nodes leads to convergence in the prediction performance. Another improvement involves introducing higher-order pseudo neighboring nodes to refine the embedding feature assigned to them. We tested this by incorporating 2nd-order pseudo neighboring nodes (i.e., neighboring nodes of pseudo nodes), as shown in Fig. 4. The results reveal that using 2nd-order pseudo neighboring nodes further enhances the accuracy of CG-NET on the HEA, 2D-impurity, and OC20 datasets, albeit at the cost of increased VRAM usage. The flexibility of CG-NET in allowing higher-order pseudo nodes based on the dataset provides an adaptable framework for optimizing both accuracy and efficiency.
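The role of first-order pseudo nodes described in this section can be illustrated with a small Python sketch. It assumes a precomputed neighbor list for the full structure; the names `build_cluster_graph` and `masked_mean`, and the use of a mean as the readout, are hypothetical and do not reproduce the formulation in the Supplementary Information.

```python
def build_cluster_graph(neighbors, cluster):
    """Build a cluster graph with first-order pseudo nodes.

    neighbors: dict mapping atom index -> neighbor indices in the full structure.
    cluster:   list of atom indices chosen by cluster sampling.
    Neighbors that fall outside the cluster are appended as pseudo nodes so
    that boundary atoms keep their full coordination environment.
    """
    cluster_set = set(cluster)
    pseudo, edges = [], []
    for i in cluster:
        for j in neighbors[i]:
            if j not in cluster_set and j not in pseudo:
                pseudo.append(j)
            edges.append((i, j))
    nodes = list(cluster) + pseudo
    # Pseudo nodes feed messages to real nodes but are excluded from readout.
    readout_mask = [True] * len(cluster) + [False] * len(pseudo)
    return nodes, edges, readout_mask


def masked_mean(values, mask):
    """Average node values over real (non-pseudo) nodes only."""
    kept = [v for v, m in zip(values, mask) if m]
    return sum(kept) / len(kept)


# Toy chain 0-1-2-3-4 with atoms 1 and 2 forming the cluster.
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
nodes, edges, mask = build_cluster_graph(chain, [1, 2])
```

Here atoms 0 and 3 become pseudo nodes: they restore the coordination of atoms 1 and 2 but carry a `False` entry in the readout mask, matching the statement that pseudo nodes are not aggregated in the final readout.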
Interpretability of CG-NET Conventional ML models frequently exhibit opacity due to their complex architecture, compromising their interpretability. This lack of model interpretability poses a significant challenge to materials discovery and design, where understanding the structure-property-performance relationship is of paramount importance 43 . Next, we show that the employment of cluster graph representation enables much improved interpretability for the prediction of material properties. To elucidate interpretability, the pooling and convolution layers within the CG-NET are visualized in Fig. 5 , respectively. These visualizations provide insights into the intricate relationship between atomic features in the latent space and the target properties, which are intimately connected to the local structure of the materials. We use principal component analysis (PCA) for the interpretative analysis of \(\:\text{*CO}\) adsorption at the top site of AgAuAlCuPt HEA surfaces. Here, the 64-dimensional atomic feature vectors derived from the pooling layer are reduced to a 2D representation via projection onto the plane defined by the first two principal components. Figure 5 a displays the resulting distribution of binding sites for each metallic element within the AgAuAlCuPt HEA. Significantly, this distribution mirrors the element-specific adsorption energy distribution obtained from our DFT calculations, as evidenced by the comparative inset in Fig. 5 a. This concordance underscores the ability of the cluster graph to accurately capture the local structural configurations related to the adsorption energy of molecules, thereby affirming the interpretive power of the CG-NET in correlating local structural features with physicochemical properties. Notably, CG-NET can further distinguish the binding strengths of five metallic elements in HEA with adsorbates. This is achieved by analyzing the atomic features within the convolutional layers of the network. As shown in Figs. 
5 b–f, we categorize the atomic features of HEA into five distinct groups based on the metallic elements at the binding sites. Each group of atomic features is then encoded with a unique color corresponding to each atom type, facilitating a clear visual representation of the data. Upon examination, we observe that the atomic features segregate into two patterns. One pattern (Figs. 5 b and 5 c) includes Ag and Au, while the other pattern (Figs. 5 d–f) comprises Al, Cu, and Pt. This segregation is in good agreement with our DFT calculations (the inset in Fig. 5 a and Supplementary Fig. 3). Specifically, Ag and Au are associated with the first pattern identified by CG-NET, which aligns with the lower reactivity among the elements due to their higher adsorption energies, as illustrated in the inset of Fig. 5 a. In contrast, the second pattern, which includes Al, Cu, and Pt, is indicative of a higher reactivity, corroborated by their lower \(\:\text{*CO}\) adsorption energies according to DFT calculations. The consistency between the visualizations of the latent space of CG-NET and DFT results validates the network’s ability to reliably map atomic features to material reactivity. The CG-NET also offers an insightful interpretation of the short-range interactions between elements at binding sites and their neighboring atoms through an analysis of inter-cluster distances in the latent space. The inter-cluster distances measure the influence that neighboring atoms exert on the adsorption site, with larger distances indicating stronger impacts and thus higher reactivity of the metal elements at the binding site. As shown in Figs. 5 b–f, CG-NET has different inter-cluster distances for the elements within the AgAuAlCuPt HEA, revealing a distinct trend: \(\:\text{Pt>Cu>Al>Au>Ag}\) . This trend corresponds with the reactivity observed for the pure metals, as shown in Supplementary Fig. 3. 
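The text does not state how the inter-cluster distance is measured, so the following Python sketch adopts one plausible definition as an assumption: the Euclidean distance between the centroids of two groups of latent-space points. The helper names `centroid` and `inter_cluster_distance` and the toy coordinates are hypothetical.

```python
import math


def centroid(points):
    """Component-wise mean of a list of equal-length coordinate tuples."""
    n = len(points)
    return tuple(sum(p[k] for p in points) / n for k in range(len(points[0])))


def inter_cluster_distance(group_a, group_b):
    """Euclidean distance between group centroids (assumed definition)."""
    return math.dist(centroid(group_a), centroid(group_b))


# Toy 2D latent features: binding-site atoms versus their neighboring atoms.
binding_site_features = [(0.0, 0.0), (2.0, 0.0)]
neighbor_features = [(1.0, 3.0), (1.0, 5.0)]
distance = inter_cluster_distance(binding_site_features, neighbor_features)
```

Under this definition, ranking the binding-site elements by `distance` would give an ordering comparable to the trend reported above, although the actual metric used for Figs. 5b–f may differ.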
Such interpretations by CG-NET shed light on the strength of short-range interactions among atoms and how the cluster graph encapsulates these interactions to predict adsorbate binding energies. It is essential to recognize the significance of constructing a cluster graph to capture these vital short-range interactions, an aspect on which the conventional crystal graph representations fall short due to their lack of focus on the local structure of materials. As depicted in Supplementary Fig. 6, applying the same analysis to the CGCNN, the crystal graph suggests that oxygen atoms have the largest inter-cluster distance from metallic and carbon atoms. This observation does not align with the experimentally observed \(\:\text{*CO}\) adsorption on the HEA surface. In the context of \(\:\text{*CO}\) adsorption, the binding between adsorbates and the HEA surface occurs between the carbon atoms and metal binding sites. Therefore, the depiction of oxygen atoms as distant from neighboring atoms in Supplementary Fig. 6 is physically misleading. The cluster graph, in contrast, offers a more accurate representation, positioning carbon atoms and metal binding sites in close proximity while grouping oxygen atoms with other neighboring atoms, as illustrated in Figs. 5 b–f. This interpretation aligns well with the known physical interactions, thereby enabling a more self-consistent explanation of material properties. We also present the t-distributed Stochastic Neighbor Embedding (t-SNE) analysis for embedding features in the pooling and convolution layers (Supplementary Figs. 7 and 8), and a consistent mapping could be obtained to confirm the interpretability of CG-NET. Ultimately, by focusing on short-range interactions, the cluster graph yields a physically meaningful depiction of the local structure of materials, facilitating a deeper understanding of material properties. 
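The PCA step used in this section, which projects the pooled atomic feature vectors onto their first two principal components, can be sketched in plain Python as follows. The power-iteration solver and the function names `top_component` and `pca_2d` are illustrative assumptions made to keep the example dependency-free; the source does not say which PCA implementation was used, and in practice a numerical library would normally be preferred.

```python
import math


def mat_vec(m, v):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def top_component(cov, iters=200):
    """Leading eigenvector and eigenvalue of a symmetric PSD matrix."""
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):
        w = mat_vec(cov, v)
        norm = math.sqrt(sum(x * x for x in w))
        if norm < 1e-12:  # no variance left along any direction
            break
        v = [x / norm for x in w]
    eigval = sum(a * b for a, b in zip(v, mat_vec(cov, v)))
    return v, eigval


def pca_2d(features):
    """Project feature vectors onto their first two principal components."""
    n, d = len(features), len(features[0])
    mean = [sum(f[k] for f in features) / n for k in range(d)]
    centered = [[f[k] - mean[k] for k in range(d)] for f in features]
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    pc1, lam1 = top_component(cov)
    # Deflation removes the first component so the next one can be found.
    deflated = [[cov[i][j] - lam1 * pc1[i] * pc1[j] for j in range(d)]
                for i in range(d)]
    pc2, _ = top_component(deflated)
    dot = lambda r, p: sum(a * b for a, b in zip(r, p))
    return [(dot(r, pc1), dot(r, pc2)) for r in centered]


# Toy 3D features whose variance is largest along x, smaller along y.
toy = [(-2.0, -1.0, 0.0), (-2.0, 1.0, 0.0), (2.0, -1.0, 0.0), (2.0, 1.0, 0.0)]
projected = pca_2d(toy)
```

Applied to 64-dimensional pooled features instead of the toy data, the same projection would produce 2D coordinates of the kind plotted in Fig. 5a.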
Generalizability of CG-NET

The influence of coverage-dependent adsorbate-adsorbate interactions on catalyst surfaces has been well established through theoretical and experimental studies 44, 45. These interactions are pivotal, as they not only determine the adsorption and desorption dynamics but also alter the stability of transition states, thereby influencing catalytic reaction rates 21, 44. A thorough understanding of adsorbate-adsorbate interactions across diverse catalyst surfaces is crucial for designing novel catalysts. The cluster graph model offers a straightforward way to account for these interactions explicitly: expanding the cluster radius \(\text{R}_{\text{cluster}}\) beyond the inter-adatom distances brings the neighboring adsorbates into the cluster graph. However, increasing \(\text{R}_{\text{cluster}}\) incurs additional computational cost because the enlarged graph comprises more nodes and edges. Typically, a modest cluster radius is sufficient for CG-NET to capture the local structure of materials. As shown in Supplementary Fig. 8, a radius of \(\text{R}_{\text{cluster}} = 5\,\text{Å}\) achieves convergence of the validation MAE for the prediction of the adsorption energy on the HEA surface. While a conservative \(\text{R}_{\text{cluster}}\) limits our ability to model adsorbate-adsorbate interactions explicitly, the cluster graph model compensates for this by implicitly considering these interactions through multilayer convolutional operations. This implicit inclusion is critical not only for enhancing the model’s generalizability but also for its applicability in transfer learning to new datasets.
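To make the role of the cluster radius concrete, the following minimal sketch selects the atoms within \(\text{R}_{\text{cluster}}\) of a chosen center and adds first-order pseudo nodes for the atoms just outside the boundary. It is an illustration under stated assumptions, not the CG-NET implementation (which is described in Supplementary Section 1): the function name `build_cluster_graph`, the bond cutoff `r_bond`, and the distance-based edge rule are hypothetical, and periodic images are ignored for brevity.

```python
import numpy as np

def build_cluster_graph(positions, center, r_cluster, r_bond):
    """Return cluster atoms, pseudo nodes, and edges for a spherical cluster.

    positions : (N, 3) Cartesian coordinates in Å (no periodic images, for brevity)
    center    : (3,) cluster center, e.g. an adsorption site
    r_cluster : cluster radius in Å
    r_bond    : hypothetical neighbor cutoff in Å that defines an edge
    """
    positions = np.asarray(positions, dtype=float)
    center = np.asarray(center, dtype=float)
    in_cluster = np.linalg.norm(positions - center, axis=1) <= r_cluster

    # Pairwise distances; an edge joins two distinct atoms closer than r_bond.
    diff = positions[:, None, :] - positions[None, :, :]
    bonded = np.linalg.norm(diff, axis=-1) <= r_bond
    bonded &= ~np.eye(len(positions), dtype=bool)

    # Pseudo nodes: atoms outside the cluster bonded to at least one cluster atom.
    pseudo = ~in_cluster & bonded[in_cluster].any(axis=0)

    # Keep each edge once, and only if it touches at least one true cluster atom.
    kept = in_cluster | pseudo
    edges = [(int(i), int(j)) for i, j in zip(*np.nonzero(bonded))
             if i < j and kept[i] and kept[j] and (in_cluster[i] or in_cluster[j])]
    return np.flatnonzero(in_cluster), np.flatnonzero(pseudo), edges

# Toy example: a linear chain of 7 atoms spaced 2.5 Å apart, centered on atom 3.
chain = np.array([[2.5 * k, 0.0, 0.0] for k in range(7)])
cluster, pseudo, edges = build_cluster_graph(chain, chain[3], r_cluster=3.0, r_bond=2.6)
print(cluster.tolist(), pseudo.tolist(), edges)
# prints [2, 3, 4] [1, 5] [(1, 2), (2, 3), (3, 4), (4, 5)]
```

In this toy chain, atoms 2 and 4 sit at the cluster boundary and would each lose one neighbor without the pseudo nodes 1 and 5. Increasing `r_cluster` grows the node and edge lists, which is the source of the additional cost noted above.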
To substantiate this, we train the CG-NET model on a (3×3) surface HEA dataset and evaluate its prediction performance on (2×2) and (4×4) surface HEA datasets with the same metal composition, which correspond to different CO coverages on the HEA surface. We also train and assess the CGCNN model using the same datasets for comparison. The results, presented in Fig. 6, demonstrate that the CG-NET model predicts more accurately across the varying CO coverages, as indicated by consistently lower test MAEs (0.13 eV for (2×2) and 0.10 eV for (4×4) surface supercells) and higher \(R^{2}\) values (0.80 for (2×2) and 0.91 for (4×4) surface supercells) than those of the CGCNN model. This generalizability is attributed to CG-NET’s ability to maintain a consistent graph structure when transitioning from the training dataset to evaluation datasets with different coverages. In contrast, the graph structure of the CGCNN model undergoes significant changes to accommodate the varying adsorption coverages. From a computational perspective, generating extensive surface datasets for theoretical calculations is challenging because the computational demands grow with the number of atoms. Therefore, it is often more practical to generate a training dataset from smaller surface supercells, such as (2×2) or (3×3) surfaces, and use transfer learning to apply the knowledge to larger surface supercells, such as (4×4) surfaces or beyond. This is why the ability of the CG-NET to generalize to unseen datasets becomes particularly important. In addition, the CG-NET model can be fine-tuned with a relatively small dataset, providing a considerable advantage in scenarios with limited computational resources.

Conclusion

In conclusion, we develop a CG-NET that leverages local structural information through cluster graph representations for enhanced material property prediction.
We also introduce pseudo nodes to preserve the bonding coordination of the nodes near the boundaries of the cluster graph, which safeguards the prediction performance. By effectively incorporating short-range interactions within the cluster graph, the CG-NET offers an effective representation of the local structure of materials, yielding computational efficiency while maintaining high prediction accuracy. Our results also highlight the superior generalizability and interpretability of CG-NET when applied to diversified datasets, enabling robust and rapid prediction for complex, disordered, and diverse material systems with reduced computational resources. The capabilities of CG-NET could be further enhanced by incorporating dynamic graph updates that reflect changes in atomic configurations over time, paving the way for simulations of material behavior under various conditions and during chemical reactions. The methodologies and insights derived from our proposed CG-NET could be applied to other graph-based neural network architectures, potentially revolutionizing the field of functional material development.

Declarations

Competing Interests
The authors declare no competing interests.

Author Contributions
M. Y. and H. C. conceived the idea. M. Y., W. L., and K. C. T. designed the study. H. C., T. Y., and M. G. Z. implemented the code. H. C., H. W., Z. W., Z. R. W., H. K. H., C. C., S. P. L., W. L., K. C. T., and M. Y. prepared the dataset and performed the analysis. H. C. drafted the manuscript. All the authors have read and corrected the final manuscript.

Acknowledgments
This work was supported by the Hong Kong Polytechnic University (project numbers: P0042711, P0049524, P0050570, and P0048122), the Hong Kong Research Grants Council (project numbers: P0046939 and P0045061), and the Guangdong Natural Science Foundation (project number: 2024A1515010031).

References

1. Olson GB (2000) Designing a new material world. Science 288:993–998
2. Seh ZW et al (2017) Combining theory and experiment in electrocatalysis: insights into materials design. Science 355
3. Raccuglia P et al (2016) Machine-learning-assisted materials discovery using failed experiments. Nature 533:73–76
4. Butler KT, Davies DW, Cartwright H, Isayev O, Walsh A (2018) Machine learning for molecular and materials science. Nature 559:547–555
5. Rao Z et al (2022) Machine learning–enabled high-entropy alloy discovery. Science 378:78–85
6. Xie T, Grossman JC (2018) Crystal graph convolutional neural networks for an accurate and interpretable prediction of material properties. Phys Rev Lett 120:145301
7. Park CW, Wolverton C (2020) Developing an improved crystal graph convolutional neural network framework for accelerated materials discovery. Phys Rev Mater 4:063801
8. Batzner S et al (2022) E(3)-equivariant graph neural networks for data-efficient and accurate interatomic potentials. Nat Commun 13:2453
9. Chen C, Ong SP (2022) A universal graph deep learning interatomic potential for the periodic table. Nat Comput Sci 2:718–728
10. Batatia I, Kovács DP, Simm GNC, Ortner C, Csányi G (2023) MACE: Higher order equivariant message passing neural networks for fast and accurate force fields. Preprint at https://doi.org/10.48550/arXiv.2206.07697
11. Merchant A et al (2023) Scaling deep learning for materials discovery. Nature 624:80–85
12. Deng B et al (2023) CHGNet as a pretrained universal neural network potential for charge-informed atomistic modelling. Nat Mach Intell 5:1031–1041
13. Ko TW, Ong SP (2023) Recent advances and outstanding challenges for machine learning interatomic potentials. Nat Comput Sci
14. Han J, Rong Y, Xu T, Huang W (2022) Geometrically equivariant graph neural networks: a survey. Preprint at https://doi.org/10.48550/arXiv.2202.07230
15. Riebesell J et al (2024) Matbench Discovery -- A framework to evaluate machine learning crystal stability predictions. Preprint at https://doi.org/10.48550/arXiv.2308.14920
16. Greeley J, Jaramillo TF, Bonde J, Chorkendorff IB, Norskov JK (2006) Computational high-throughput screening of electrocatalytic materials for hydrogen evolution. Nat Mater 5:909–913
17. Curtarolo S et al (2013) The high-throughput highway to computational materials design. Nat Mater 12:191–201
18. Yao Y et al (2022) High-entropy nanoparticles: Synthesis-structure-property relationships and data-driven discovery. Science 376:eabn3103
19. Huang B, von Rudorff GF, von Lilienfeld OA (2023) The central role of density functional theory in the AI age. Science 381:170–175
20. Harrison WA (1989) Electronic structure and the properties of solids: the physics of the chemical bond. Dover, New York
21. Hammer B, Nørskov JK (2000) Theoretical surface science and catalysis—calculations and concepts. Adv Catal 45:71–129
22. Norskov JK, Bligaard T, Rossmeisl J, Christensen CH (2009) Towards the computational design of solid catalysts. Nat Chem 1:37–46
23. Banhart F, Kotakoski J, Krasheninnikov AV (2011) Structural defects in graphene. ACS Nano 5:26–41
24. Freysoldt C et al (2014) First-principles calculations for point defects in solids. Rev Mod Phys 86
25. Alexandrov AS (2009) Advances in polaron physics. Springer, Berlin
26. Grancini G et al (2013) Hot exciton dissociation in polymer solar cells. Nat Mater 12:29–33
27. Nørskov JK, Studt F, Abild-Pedersen F, Bligaard T (2014) Fundamental concepts in heterogeneous catalysis. Wiley
28. Ghanekar PG, Deshpande S, Greeley J (2022) Adsorbate chemical environment-based machine learning framework for heterogeneous catalysis. Nat Commun 13:5788
29. Pablo-García S et al (2023) Fast evaluation of the adsorption energy of organic molecules on metals via graph neural networks. Nat Comput Sci 3:433–442
30. Ankudinov AL, Ravel B, Rehr JJ, Conradson SD (1998) Real-space multiple-scattering calculation and interpretation of x-ray-absorption near-edge structure. Phys Rev B 58:7565–7576
31. Surface analysis: the principal techniques (2009) Wiley, Chichester, UK
32. Davidsson J, Bertoldo F, Thygesen KS, Armiento R (2023) Absorption versus adsorption: high-throughput computation of impurities in 2D materials. npj 2D Mater Appl 7:26
33. Chanussot L et al (2021) Open catalyst 2020 (OC20) dataset and community challenges. ACS Catal 11:6059–6072
34. Sriram A et al (2024) The open DAC 2023 dataset and challenges for sorbent discovery in direct air capture. ACS Cent Sci 10:923–941
35. Wang X et al (2024) Electrochemical CO2 activation and valorization on metallic copper and carbon-embedded N‐coordinated single metal MNC catalysts. Angew Chem 136:e202401821
36. Somorjai GA, Li Y (2010) Introduction to surface chemistry and catalysis. Wiley
37. Yu P, Cardona M (2010) Fundamentals of semiconductors: physics and materials properties. Springer Science & Business Media
38. Porter DA, Easterling KE, Sherif MY (2021) Phase transformations in metals and alloys. CRC, Boca Raton
39. Schlapka A, Lischka M, Groß A, Käsberger U, Jakob P (2003) Surface strain versus substrate interaction in heteroepitaxial metal layers: Pt on Ru(0001). Phys Rev Lett 91:016101
40. Hoster HE, Alves OB, Koper MTM (2010) Tuning adsorption via strain and vertical ligand effects. ChemPhysChem 11:1518–1524
41. Asano M, Kawamura R, Sasakawa R, Todoroki N, Wadayama T (2016) Oxygen reduction reaction activity for strain-controlled Pt-based model alloy catalysts: surface strains and direct electronic effects induced by alloying elements. ACS Catal 6:5285–5289
42. Clausen CM, Batchelor TAA, Pedersen JK, Rossmeisl J (2021) What atomic positions determines reactivity of a surface? Long-range, directional ligand effects in metallic alloys. Adv Sci 8:2003357
43. Schmidt J, Marques MRG, Botti S, Marques MAL (2019) Recent advances and applications of machine learning in solid-state materials science. npj Comput Mater 5:1–36
44. Grabow LC, Hvolbæk B, Nørskov JK (2010) Understanding trends in catalytic activity: the effect of adsorbate–adsorbate interactions for CO oxidation over transition metals. Top Catal 53:298–310
45. Clark EL, Hahn C, Jaramillo TF, Bell AT (2017) Electrochemical CO2 reduction over compressively strained CuAg surface alloys with enhanced multi-carbon oxygenate selectivity. J Am Chem Soc 139:15848–15857

Supplementary Files
SM.docx: Supplementary Information
This allows for a more accurate and effective representation of the local structure induced short-range interactions (see more discussion in the later section), making it adaptable to various material systems and structural properties.\u003c/p\u003e \u003cp\u003eThe selected cluster is then transformed into a graph representation, as shown in Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003eb, where nodes represent the atoms within the cluster and edges signify the connections between these atoms. However, generating cluster graphs can lead to the loss of vital information about the local coordination environment of nodes within the cluster, potentially impairing model performance. To address this issue, we introduce pseudo nodes to supplement any missing information. These pseudo nodes serve as neighboring coordinates for the nodes near the boundary of the cluster, as indicated by dashed circles in Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003eb. In particular, the nodes residing along the boundaries of cluster graph maintain their local structural integrity through interactions with neighboring pseudo nodes, effectively preserving the local structural information. This approach ensures that nodes close to the cluster center retain their coordinates within the material, while distant nodes are disregarded due to their minimal impact on local structural properties. The resulting cluster graph is then processed by CNNs to extract essential local structural features, which are subsequently used to predict a wide range of material properties, as depicted in Figs.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003ec and \u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003ed.\u003c/p\u003e \u003cp\u003eIn this study, we use the CGCNN as a baseline for comparison with our CG-NET. 
As one of the pioneering successes in applying GNNs to model crystalline materials, CGCNN has demonstrated remarkable capabilities in learning atomic embeddings directly from data, outperforming traditional human-engineered features. Since its development, CGCNN has consistently provided competitive performance and has been widely adopted as a benchmark in many studies\u003csup\u003e\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e\u003c/sup\u003e. However, it should be noted that the cluster graph representation employed in CG-NET is not limited to CGCNN, which primarily considers pair-wise interactions through an invariant graph model. In fact, this approach can be extended to more advanced geometrically equivariant graph models or integrated with many-body interaction frameworks. Such an extension could provide a more comprehensive characterization of the geometric and topological information intrinsic to material systems, while still maintaining computational efficiency.\u003c/p\u003e \u003cdiv id=\"Sec3\" class=\"Section2\"\u003e \u003cp\u003e \u003c/p\u003e \u003c/div\u003e"},{"header":"Improved CG-NET Performance on Diverse Material Systems","content":"\u003cp\u003eTo evaluate the performance of the CG-NET on diverse materials systems and properties, we employ four different datasets: the high-entropy alloy (HEA) dataset, two-dimensional impurity (2D-impurity) dataset\u003csup\u003e\u003cspan citationid=\"CR32\" class=\"CitationRef\"\u003e32\u003c/span\u003e\u003c/sup\u003e, the Open Catalyst 2020 (OC20) dataset\u003csup\u003e\u003cspan citationid=\"CR33\" class=\"CitationRef\"\u003e33\u003c/span\u003e\u003c/sup\u003e and the Open DAC 2023 (ODAC23) dataset\u003csup\u003e\u003cspan citationid=\"CR34\" class=\"CitationRef\"\u003e34\u003c/span\u003e\u003c/sup\u003e. 
We built the HEA dataset by analyzing the adsorption of intermediate adsorbates involved in the \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:\\text{C}{\\text{O}}_{\\text{2}}\\)\u003c/span\u003e\u003c/span\u003e reduction reaction (\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:\\text{C}{\\text{O}}_{\\text{2}}\\text{RR}\\)\u003c/span\u003e\u003c/span\u003e) on the surfaces of AgAuAlCuPt, AgAuCuPdPt and CoCuGaNiZn HEAs. Using density functional theory (DFT) calculations, we computed the adsorption energies of adsorbates including \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:\\text{*CO}\\)\u003c/span\u003e\u003c/span\u003e, \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:\\text{*CHO}\\)\u003c/span\u003e\u003c/span\u003e, \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:\\text{*COH}\\)\u003c/span\u003e\u003c/span\u003e, and \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:\\text{*COOH}\\)\u003c/span\u003e\u003c/span\u003e, which are involved in potential limiting steps of the \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:\\text{C}{\\text{O}}_{\\text{2}}\\text{RR}\\)\u003c/span\u003e\u003c/span\u003e for \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{\\text{C}}_{\\text{1}}\\)\u003c/span\u003e\u003c/span\u003e products, as well as the \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:\\text{*H}\\)\u003c/span\u003e\u003c/span\u003e adsorbate, which is crucial for the competing hydrogen evolution reaction.\u003csup\u003e\u003cspan citationid=\"CR35\" class=\"CitationRef\"\u003e35\u003c/span\u003e\u003c/sup\u003e In addition to the HEA dataset, we utilize the 2D-impurity, OC20, and ODAC23 datasets sourced from established 
studies\u003csup\u003e\u003cspan additionalcitationids=\"CR33\" citationid=\"CR32\" class=\"CitationRef\"\u003e32\u003c/span\u003e–\u003cspan citationid=\"CR34\" class=\"CitationRef\"\u003e34\u003c/span\u003e\u003c/sup\u003e. The OC20 dataset includes a large number of adsorption energies of various adsorbates on diverse catalyst surfaces, while the 2D-impurity dataset contains the defect formation energies of 2D materials. The ODAC23 dataset, containing over 8,400 metal-organic framework (MOF) materials with adsorbed CO\u003csub\u003e2\u003c/sub\u003e and H\u003csub\u003e2\u003c/sub\u003eO molecules, represents the largest collection of MOF adsorption energies computed with DFT. To generate cluster graphs effectively, we apply filters to these three open datasets and constrain the energy range to avoid substantial deviations. Further details are provided in Supplementary Section 2 and Supplementary Table\u0026nbsp;1. All four datasets focus on material properties primarily determined by local structure-induced short-range interactions. Specifically, for the adsorbates on catalyst surfaces in the HEA, OC20 and ODAC23 datasets, the adsorption energy depends predominantly on the local structure composed of the adsorbates and their neighboring catalytic atoms; for impurities in 2D materials within the 2D-impurity dataset, the formation energy depends heavily on the local bonding structure around the defects. 
Together, these datasets provide a comprehensive assessment of the efficacy of cluster graph method across a diverse range of functional materials.\u003c/p\u003e\u003cp\u003e \u003c/p\u003e\u003cdiv class=\"gridtable\"\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c6\" colnum=\"6\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c7\" colnum=\"7\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c8\" colnum=\"8\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c9\" colnum=\"9\"\u003e\u003c/div\u003e\u003ctable float=\"Yes\" id=\"Tab1\" border=\"1\"\u003e\u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 1\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eStatistical comparison of graph datasets generated using crystal graph and cluster graph representations.\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e\u003ccolgroup cols=\"9\"\u003e\u003c/colgroup\u003e\u003cthead\u003e\u003ctr\u003e\u003cth align=\"left\" colname=\"c1\" morerows=\"1\" rowspan=\"2\"\u003e \u003cp\u003eDataset\u003c/p\u003e \u003c/th\u003e\u003cth align=\"left\" colname=\"c2\" morerows=\"1\" rowspan=\"2\"\u003e \u003cp\u003eStructure\u003c/p\u003e \u003c/th\u003e\u003cth align=\"left\" colname=\"c3\" morerows=\"1\" rowspan=\"2\"\u003e \u003cp\u003e\u003cspan class=\"InlineEquation\"\u003e\u003cspan 
class=\"mathinline\"\u003e\\(\\:{\\stackrel{-}{\\varvec{N}}}_{\\varvec{a}\\varvec{t}\\varvec{o}\\varvec{m}\\varvec{s}}\\)\u003c/span\u003e\u003c/span\u003e\u003c/p\u003e \u003c/th\u003e\u003cth align=\"left\" colspan=\"3\" nameend=\"c6\" namest=\"c4\"\u003e \u003cp\u003eCrystal Graph\u003c/p\u003e \u003c/th\u003e\u003cth align=\"left\" colspan=\"3\" nameend=\"c9\" namest=\"c7\"\u003e \u003cp\u003eCluster Graph\u003c/p\u003e \u003c/th\u003e\u003c/tr\u003e\u003ctr\u003e\u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eNodes\u003c/p\u003e \u003c/th\u003e\u003cth align=\"left\" colname=\"c5\"\u003e \u003cp\u003eEdges\u003c/p\u003e \u003c/th\u003e\u003cth align=\"left\" colname=\"c6\"\u003e \u003cp\u003eFLOPs\u003c/p\u003e \u003cp\u003e/structure\u003c/p\u003e \u003cp\u003e(M)\u003c/p\u003e \u003c/th\u003e\u003cth align=\"left\" colname=\"c7\"\u003e \u003cp\u003eNodes\u003c/p\u003e \u003c/th\u003e\u003cth align=\"left\" colname=\"c8\"\u003e \u003cp\u003eEdges\u003c/p\u003e \u003c/th\u003e\u003cth align=\"left\" colname=\"c9\"\u003e \u003cp\u003eFLOPs\u003c/p\u003e \u003cp\u003e/structure (M)\u003c/p\u003e \u003c/th\u003e\u003c/tr\u003e\u003c/thead\u003e\u003ctbody\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eHEA\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e16,033\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e38\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e615,047\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e7,380,564\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e\u003cb\u003e83.35\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e384,792\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"char\" char=\".\" 
colname=\"c8\"\u003e \u003cp\u003e3,684,324\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c9\"\u003e \u003cp\u003e\u003cb\u003e41.61\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e2D-\u003c/p\u003e \u003cp\u003eImpurity\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e9,962\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e42\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e419,451\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e5,033,412\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e\u003cb\u003e91.48\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e209,202\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e \u003cp\u003e2,434,788\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c9\"\u003e \u003cp\u003e\u003cb\u003e44.25\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eOC20\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e130,758\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e72\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e9,412,414\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e112,948,883\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e\u003cb\u003e156.40\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e 
\u003cp\u003e2,353,644\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e \u003cp\u003e27,489,707\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c9\"\u003e \u003cp\u003e\u003cb\u003e38.07\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eODAC23\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e20,000\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e201\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e4,013,760\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e48,165,120\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e\u003cb\u003e436.05\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e360,000\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e \u003cp\u003e5,907,252\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c9\"\u003e \u003cp\u003e\u003cb\u003e53.48\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e\u003c/tr\u003e\u003c/tbody\u003e\u003c/table\u003e\u003c/div\u003e\u003cp\u003e\u003c/p\u003e\u003cp\u003eShort-range interactions induced by the local structure are crucial in determining the physical and chemical properties of functional materials such as catalytic materials, semiconductors, and alloys\u003csup\u003e\u003cspan additionalcitationids=\"CR37\" citationid=\"CR36\" class=\"CitationRef\"\u003e36\u003c/span\u003e–\u003cspan citationid=\"CR38\" class=\"CitationRef\"\u003e38\u003c/span\u003e\u003c/sup\u003e; these interactions, however, are not determined by 1NN atoms alone. 
The effect of short-range interactions on the materials’ properties can be evaluated using ligand effect\u003csup\u003e\u003cspan citationid=\"CR27\" class=\"CitationRef\"\u003e27\u003c/span\u003e\u003c/sup\u003e, which arises when the electronic structure of a central metal atom is altered by surrounding atoms in close proximity. For example, we examine the effect of short-range interactions on CO adsorption at the surface of HEA using ligand effect by DFT calculations, as illustrated in the left panel of Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003ea. By sequentially occluding one atom at a time in the slab model, we analyze the impact of each atom on the adsorption energy of CO. The degree of influence is visually represented by coloring each atom based on the magnitude of change in adsorption energy \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:\\varDelta\\:{E}_{ads}\\)\u003c/span\u003e\u003c/span\u003e. Results in the right panel of Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003ea show that atoms closest to the central atom, particularly 1NN on the surface and subsurface, significantly affect the adsorption energy. Atoms farther away, such as those in the fourth layer, contribute negligibly. Interestingly, some metal atoms in the third layer, which are fourth nearest neighbors, have non-trivial effect on adsorption energy. This phenomenon, reported in previous studies\u003csup\u003e\u003cspan additionalcitationids=\"CR40 CR41\" citationid=\"CR39\" class=\"CitationRef\"\u003e39\u003c/span\u003e–\u003cspan citationid=\"CR42\" class=\"CitationRef\"\u003e42\u003c/span\u003e\u003c/sup\u003e, suggests that the effect of short-range interactions is not confined solely to 1NN atoms but can extend to second nearest neighbors (2NN) or even beyond. 
This underscores the importance of considering short-range interactions when studying local properties and highlights the need to account for neighboring atoms beyond the 1NN in graph-based models representing material structures. In this work, we develop a cluster sampling strategy and use the cluster radius \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{\\text{R}}_{\\text{cluster}}\\)\u003c/span\u003e\u003c/span\u003e to define the extent of the graph construction. A performance comparison between graphs considering only the 1NN and cluster graphs including additional neighboring atoms is presented in Supplementary Fig.\u0026nbsp;5. The results demonstrate a significant improvement in prediction accuracy of 135%, 165%, 145%, and 119% on the HEA, 2D-impurity, OC20, and ODAC23 datasets, respectively, when neighboring atoms beyond the 1NN are included in the cluster graph. Therefore, our cluster graph representation provides a more accurate depiction of local atomic structure, ensuring that crucial short-range interactions are captured while maintaining a balance between computational efficiency and prediction accuracy.\u003c/p\u003e\u003cp\u003e \u003c/p\u003e\u003cp\u003eWe use both the CG-NET and CGCNN models to predict the adsorption energy on the HEA, OC20 and ODAC23 datasets and the defect formation energy on the 2D-impurity dataset. Further details regarding the training process can be found in Supplementary Section 3. To illustrate the cluster graph formulation in our approach, we use the HEA dataset as an example. As depicted in Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003eb (left), the AgAuAlCuPt HEA surface model consists of a four-layer slab with randomly distributed Ag, Au, Al, Cu, and Pt metal elements in a face-centered cubic lattice, maintaining an approximately equimolar atomic ratio. 
In this model, the \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:\\text{*CO}\\)\u003c/span\u003e\u003c/span\u003e adsorbate is positioned at the on-top site of HEA surface. Utilizing the cluster sampling strategy, a cluster graph is generated, as shown in Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003eb (right), centered around an Al metal atom, which is the binding site on the HEA surface. The extent of this cluster is determined by the cluster radius \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{\\text{R}}_{\\text{cluster}}\\)\u003c/span\u003e\u003c/span\u003e. The transformation from the surface model to the cluster graph involves \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{\\text{N}}_{\\text{cluster}}\\)\u003c/span\u003e\u003c/span\u003e atoms within the cluster. For instance, with \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{\\text{R}}_{\\text{cluster}}\\)\u003c/span\u003e\u003c/span\u003e= 5 Å, the cluster encompasses all 1NN and 2NN atoms of the binding site, both on the surface and in the subsurface layers. The \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{\\text{R}}_{\\text{cluster}}\\)\u003c/span\u003e\u003c/span\u003e and \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{\\text{N}}_{\\text{cluster}}\\)\u003c/span\u003e\u003c/span\u003e are critical hyperparameters in generating the cluster graph and ensuring convergence of these hyperparameters is key to obtaining reliable prediction results.\u003c/p\u003e\u003cp\u003eImplementing the cluster graph approach significantly reduces computational costs due to the dramatically decreased number of nodes and edges within the graph representation. 
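To make the node reduction concrete, the spherical cluster selection described above can be sketched in a few lines. This is only a minimal illustration under stated assumptions: the function name `sample_cluster`, the toy simple-cubic block, and the 2.5 Å spacing are hypothetical and not taken from the CG-NET code; periodic boundary conditions and pseudo nodes are deliberately left out.

```python
import math

def sample_cluster(positions, center, r_cluster):
    """Return indices of atoms whose distance to `center` is <= r_cluster.

    Hypothetical helper for illustration; it ignores periodic images and
    does not add pseudo nodes at the cluster boundary.
    """
    return [i for i, p in enumerate(positions)
            if math.dist(p, center) <= r_cluster]

# Toy 4 x 4 x 4 simple-cubic block (an assumption, not an HEA slab).
spacing = 2.5  # Angstrom, illustrative value
positions = [(x * spacing, y * spacing, z * spacing)
             for x in range(4) for y in range(4) for z in range(4)]

center = positions[21]           # atom at grid point (1, 1, 1)
cluster = sample_cluster(positions, center, r_cluster=5.0)
n_cluster = len(cluster)         # number of nodes kept in the cluster graph

print(f"{n_cluster} of {len(positions)} atoms kept")
```

Only the atoms returned by the selection become graph nodes, so the node count, and with it the edge count, shrinks with the chosen radius rather than growing with the size of the full structure.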
We present a statistical comparison of graph datasets generated using the crystal graph and cluster graph representations in Table\u0026nbsp;\u003cspan refid=\"Tab1\" class=\"InternalRef\"\u003e1\u003c/span\u003e. For the HEA, 2D-impurity, OC20, and ODAC23 datasets, the average number of atoms per structure \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{\\stackrel{-}{N}}_{atoms}\\)\u003c/span\u003e\u003c/span\u003e is 38, 42, 72 and 201, respectively. Consequently, the floating-point operations (FLOPs) required per structure for a single convolution layer in the crystal graph representation grow roughly in proportion to the size of the material structures, from 83.35 to 91.48, 156.40, and 436.05 million FLOPs, as detailed in Supplementary Section 4. In contrast, the cluster graph representation substantially reduces FLOPs for each dataset. CG-NET achieves FLOPs savings of approximately 50%, 52%, 76%, and 88% on the HEA, 2D-impurity, OC20, and ODAC23 datasets, respectively. The comparison in Table\u0026nbsp;\u003cspan refid=\"Tab1\" class=\"InternalRef\"\u003e1\u003c/span\u003e makes the computational efficiency of our method evident. Furthermore, the cluster graph representation consumes less GPU video memory (VRAM) than the crystal graph, making it an even more resource-efficient approach. As shown in Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003ec, the VRAM required by the CG-NET model is reduced by over 84% compared to CGCNN on the OC20 dataset. This improvement in efficiency allows for the use of larger batch sizes without requiring additional computational resources, as highlighted in Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003ed. 
For instance, with an RTX 3080 12GB GPU on the complete ODAC23 dataset (162,219 structures, including 32,824,409 atoms), CGCNN is limited to a maximum batch size of 32, whereas CG-NET can handle batch sizes exceeding 1024. Other models such as M3GNet, which incorporate equivariance and many-body interactions into graph representations, face more severe VRAM limitations. VRAM overflow occurs with a batch size as small as 16 using M3GNet. These comparisons demonstrate the significantly improved scalability and resource efficiency of CG-NET.\u003c/p\u003e\u003cp\u003eImportantly, this efficiency boost does not compromise prediction accuracy. Figures\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003ec and 3c show that, on the OC20 dataset, CG-NET achieves a test MAE of approximately 0.465 eV, nearly identical to the 0.459 eV obtained by CGCNN. Further examinations of CG-NET on other datasets reinforce these advantages. On the HEA dataset, Fig.\u0026nbsp;3a shows that CG-NET reduces VRAM usage by approximately 35% compared to CGCNN, while maintaining an equivalent test MAE of 0.088 eV. On the 2D-impurity dataset, Fig.\u0026nbsp;3b shows that the VRAM usage of CG-NET is 61% lower than that of CGCNN, and its prediction accuracy (test MAE 0.445 eV) surpasses that of CGCNN (test MAE 0.453 eV). For the ODAC23 dataset, CG-NET achieves a lower test MAE of 0.186 eV, compared to 0.203 eV for CGCNN, while reducing VRAM usage by 90%, as shown in Fig.\u0026nbsp;3d. Note that, when MAE accuracy is set at 90% of CGCNN, CG-NET offers average VRAM savings of over 77% across all four datasets. These comparisons show that the cluster graph approach in CG-NET can reduce computational requirements while maintaining or even improving prediction accuracy. 
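The FLOPs savings quoted above follow directly from the per-structure values in Table 1. The short check below recomputes them as one minus the ratio of cluster-graph to crystal-graph FLOPs; the dictionary layout and the function name `saving_percent` are illustrative choices, while the numbers are copied from the table.

```python
# FLOPs per structure (in millions) for one convolution layer, copied from
# Table 1 as (crystal graph, cluster graph).
flops_m = {
    "HEA": (83.35, 41.61),
    "2D-impurity": (91.48, 44.25),
    "OC20": (156.40, 38.07),
    "ODAC23": (436.05, 53.48),
}

def saving_percent(crystal, cluster):
    """Relative FLOPs reduction of the cluster graph, in percent."""
    return 100.0 * (1.0 - cluster / crystal)

savings = {name: saving_percent(*pair) for name, pair in flops_m.items()}

for name, value in savings.items():
    print(f"{name}: about {value:.0f}% fewer FLOPs per structure")
```

Rounded to whole percentages, the results reproduce the approximately 50%, 52%, 76%, and 88% savings stated in the text, with the largest gains on the datasets with the largest structures.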
The significantly lower computational costs in terms of FLOPs and VRAM usage establish CG-NET as a highly scalable and efficient model for material property prediction across diverse materials systems.\u003c/p\u003e\u003cp\u003e \u003c/p\u003e\u003cp\u003eFurthermore, we compare the performance of CG-NET models with and without pseudo neighboring nodes, as shown in Figs.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e4\u003c/span\u003ea-d. The incorporation of pseudo nodes plays a crucial role in preserving the coordination environment of the local structure, which is vital for sustaining prediction accuracy. A comparison of the cluster graph with and without pseudo neighboring nodes is illustrated in Supplementary Fig.\u0026nbsp;2, with further details in Supplementary Section 5. Across all datasets, CG-NET with pseudo neighboring nodes consistently outperforms models without the pseudo nodes. In particular, the exclusion of pseudo neighboring nodes results in a significant decline in performance on the HEA, 2D-impurity, and OC20 datasets, though not much variation is observed on the ODAC23 dataset. This discrepancy is primarily attributed to the nature of the ODAC23 dataset, where the adsorption of H₂O and CO₂ molecules onto the MOF structures mainly involves weak van der Waals (VDW) interactions, rather than the stronger covalent or ionic bonding seen in the HEA, 2D-impurity, and OC20 datasets. As a result, short-range interactions in the ODAC23 dataset are less critical to the accuracy of the model. While pseudo neighboring nodes provide essential coordination information, they are only used as neighboring coordinates for nodes within the cluster graph and are not aggregated into the final readout phase, somewhat mitigating their impact on the model. Improvements could be made through several means. 
One approach involves expanding the cluster radius so that the cluster encompasses more nodes, thereby reducing the proportion of nodes connected to pseudo neighboring nodes. The effect of cluster radius on prediction performance is detailed in Fig.\u0026nbsp;3 and Supplementary Fig.\u0026nbsp;9, where results show that increasing the cluster radius and the number of nodes leads to convergence in the prediction performance. Another improvement involves introducing higher-order pseudo neighboring nodes to refine the embedding feature assigned to them. We tested this by incorporating 2nd-order pseudo neighboring nodes (i.e., neighboring nodes of pseudo nodes), as shown in Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e4\u003c/span\u003e. The results reveal that using 2nd-order pseudo neighboring nodes further enhances the accuracy of CG-NET on the HEA, 2D-impurity, and OC20 datasets, albeit at the cost of increased VRAM usage. The flexibility of CG-NET in allowing higher-order pseudo nodes based on the dataset provides an adaptable framework for optimizing both accuracy and efficiency.\u003c/p\u003e"},{"header":"Interpretability of CG-NET","content":"\u003cp\u003eConventional ML models are often opaque because of their complex architectures, which compromises their interpretability. This lack of model interpretability poses a significant challenge to materials discovery and design, where understanding the structure-property-performance relationship is of paramount importance\u003csup\u003e\u003cspan citationid=\"CR43\" class=\"CitationRef\"\u003e43\u003c/span\u003e\u003c/sup\u003e. 
Next, we show that the use of the cluster graph representation enables much improved interpretability for the prediction of material properties.\u003c/p\u003e \u003cp\u003eTo elucidate interpretability, the pooling and convolution layers within the CG-NET are visualized in Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e5\u003c/span\u003e. These visualizations provide insights into the intricate relationship between atomic features in the latent space and the target properties, which are intimately connected to the local structure of the materials. We use principal component analysis (PCA) for the interpretative analysis of \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:\\text{*CO}\\)\u003c/span\u003e\u003c/span\u003e adsorption at the on-top site of AgAuAlCuPt HEA surfaces. Here, the 64-dimensional atomic feature vectors derived from the pooling layer are reduced to a 2D representation via projection onto the plane defined by the first two principal components. Figure\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e5\u003c/span\u003ea displays the resulting distribution of binding sites for each metallic element within the AgAuAlCuPt HEA. Significantly, this distribution mirrors the element-specific adsorption energy distribution obtained from our DFT calculations, as evidenced by the comparative inset in Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e5\u003c/span\u003ea. This concordance underscores the ability of the cluster graph to accurately capture the local structural configurations related to the adsorption energy of molecules, thereby affirming the interpretive power of the CG-NET in correlating local structural features with physicochemical properties.\u003c/p\u003e \u003cp\u003eNotably, CG-NET can further distinguish the binding strengths of the five metallic elements in the HEA toward adsorbates. 
This is achieved by analyzing the atomic features within the convolutional layers of the network. As shown in Figs.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e5\u003c/span\u003eb\u0026ndash;f, we categorize the atomic features of HEA into five distinct groups based on the metallic elements at the binding sites. Each group of atomic features is then encoded with a unique color corresponding to each atom type, facilitating a clear visual representation of the data. Upon examination, we observe that the atomic features segregate into two patterns. One pattern (Figs.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e5\u003c/span\u003eb and \u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e5\u003c/span\u003ec) includes Ag and Au, while the other pattern (Figs.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e5\u003c/span\u003ed\u0026ndash;f) comprises Al, Cu, and Pt. This segregation is in good agreement with our DFT calculations (the inset in Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e5\u003c/span\u003ea and Supplementary Fig.\u0026nbsp;3). Specifically, Ag and Au are associated with the first pattern identified by CG-NET, which aligns with the lower reactivity among the elements due to their higher adsorption energies, as illustrated in the inset of Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e5\u003c/span\u003ea. In contrast, the second pattern, which includes Al, Cu, and Pt, is indicative of a higher reactivity, corroborated by their lower \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:\\text{*CO}\\)\u003c/span\u003e\u003c/span\u003e adsorption energies according to DFT calculations. The consistency between the visualizations of the latent space of CG-NET and DFT results validates the network\u0026rsquo;s ability to reliably map atomic features to material reactivity. 
The CG-NET also offers an insightful interpretation of the short-range interactions between elements at binding sites and their neighboring atoms through an analysis of inter-cluster distances in the latent space. The inter-cluster distances measure the influence that neighboring atoms exert on the adsorption site, with larger distances indicating stronger impacts and thus higher reactivity of the metal elements at the binding site. As shown in Figs.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e5\u003c/span\u003eb\u0026ndash;f, CG-NET has different inter-cluster distances for the elements within the AgAuAlCuPt HEA, revealing a distinct trend: \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:\\text{Pt\u0026gt;Cu\u0026gt;Al\u0026gt;Au\u0026gt;Ag}\\)\u003c/span\u003e\u003c/span\u003e. This trend corresponds with the reactivity observed for the pure metals, as shown in Supplementary Fig.\u0026nbsp;3. Such interpretations by CG-NET shed light on the strength of short-range interactions among atoms and how the cluster graph encapsulates these interactions to predict adsorbate binding energies.\u003c/p\u003e \u003cp\u003eIt is essential to recognize the significance of constructing a cluster graph to capture these vital short-range interactions, an aspect on which the conventional crystal graph representations fall short due to their lack of focus on the local structure of materials. As depicted in Supplementary Fig.\u0026nbsp;6, applying the same analysis to the CGCNN, the crystal graph suggests that oxygen atoms have the largest inter-cluster distance from metallic and carbon atoms. This observation does not align with the experimentally observed \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:\\text{*CO}\\)\u003c/span\u003e\u003c/span\u003e adsorption on the HEA surface. 
In the context of \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:\\text{*CO}\\)\u003c/span\u003e\u003c/span\u003e adsorption, the binding between adsorbates and the HEA surface occurs between the carbon atoms and metal binding sites. Therefore, the depiction of oxygen atoms as distant from neighboring atoms in Supplementary Fig.\u0026nbsp;6 is physically misleading. The cluster graph, in contrast, offers a more accurate representation, positioning carbon atoms and metal binding sites in close proximity while grouping oxygen atoms with other neighboring atoms, as illustrated in Figs.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e5\u003c/span\u003eb\u0026ndash;f. This interpretation aligns well with the known physical interactions, thereby enabling a more self-consistent explanation of material properties. We also present the t-distributed Stochastic Neighbor Embedding (t-SNE) analysis for embedding features in the pooling and convolution layers (Supplementary Figs.\u0026nbsp;7 and 8), and a consistent mapping could be obtained to confirm the interpretability of CG-NET. Ultimately, by focusing on short-range interactions, the cluster graph yields a physically meaningful depiction of the local structure of materials, facilitating a deeper understanding of material properties.\u003c/p\u003e"},{"header":"Generalizability of CG-NET","content":"\u003cp\u003eThe influence of coverage-dependent adsorbate-adsorbate interactions on catalyst surfaces has been well established through theoretical and experimental studies\u003csup\u003e\u003cspan citationid=\"CR44\" class=\"CitationRef\"\u003e44\u003c/span\u003e,\u003cspan citationid=\"CR45\" class=\"CitationRef\"\u003e45\u003c/span\u003e\u003c/sup\u003e. 
These interactions are pivotal, as they not only determine the adsorption and desorption dynamics but also alter the stability of transition states, thereby influencing catalytic reaction rates.\u003csup\u003e\u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e,\u003cspan citationid=\"CR44\" class=\"CitationRef\"\u003e44\u003c/span\u003e\u003c/sup\u003e A thorough understanding of adsorbate-adsorbate interactions across diverse catalyst surfaces is crucial for designing novel catalysts. In this endeavor, the cluster graph model offers a straightforward approach to explicitly account for these interactions. This is accomplished by expanding the cluster radius \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{\\text{R}}_{\\text{cluster}}\\)\u003c/span\u003e\u003c/span\u003e beyond the inter-adatom distances, thereby explicitly incorporating these interactions into the cluster graph. However, increasing \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{\\text{R}}_{\\text{cluster}}\\)\u003c/span\u003e\u003c/span\u003e incurs additional computational costs due to the enlarged graph comprising more nodes and edges.\u003c/p\u003e \u003cp\u003eTypically, a modest cluster radius of CG-NET is sufficient for capturing the local structure of materials. As shown in Supplementary Fig.\u0026nbsp;8, a radius of \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{\\text{R}}_{\\text{cluster}}\\text{\\:}\\text{=}\\text{\\:}\\text{5\u0026Aring;}\\)\u003c/span\u003e\u003c/span\u003e achieves convergence of the validation MAE for the prediction of the adsorption energy on the HEA surface.
While a conservative \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{\\text{R}}_{\\text{cluster}}\\)\u003c/span\u003e\u003c/span\u003e limits our ability to model adsorbate-adsorbate interactions explicitly, the cluster graph model compensates for this by implicitly considering these interactions through multilayer convolutional operations. This implicit inclusion is critical not only for enhancing the model\u0026rsquo;s generalizability but also for its applicability in transfer learning to new datasets.\u003c/p\u003e \u003cp\u003eTo substantiate this, we train the CG-NET model on a (3\u0026times;3) surface HEA dataset and evaluate its prediction performance on (2\u0026times;2) and (4\u0026times;4) surface HEA datasets with the same metal composition, which are related to varying CO coverage on the HEA surface. We also train and assess the CGCNN model using the same dataset for comparison. The results, as presented in Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e6\u003c/span\u003e, demonstrate much better prediction performance of the CG-NET model across the varying CO coverage, as indicated by the consistently lower test MAEs (0.13 eV for (2\u0026times;2) and 0.10 eV for (4\u0026times;4) surface supercells) and higher \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{\\text{R}}^{\\text{2}}\\)\u003c/span\u003e\u003c/span\u003e values (0.80 for (2\u0026times;2) and 0.91 for (4\u0026times;4) surface supercells) than those of the CGCNN model. This generalizability is attributed to CG-NET\u0026rsquo;s ability to maintain consistent graph structure integrity when transitioning from training to evaluating datasets with different coverages.
In contrast, the graph structure of the CGCNN model undergoes significant changes to accommodate the varying adsorption coverages.\u003c/p\u003e \u003cp\u003eFrom a computational perspective, generating extensive surface datasets for theoretical calculations is immensely challenging due to the increasing computational demands associated with the increased number of atoms. Therefore, it is often more practical to generate a training dataset from smaller surface supercells, such as (2\u0026times;2) or (3\u0026times;3) surfaces, and use transfer learning to apply the knowledge to larger surface supercells, such as (4\u0026times;4) surfaces or beyond. This is why the ability of the CG-NET to generalize to unseen datasets becomes particularly important. In addition, the CG-NET model can be fine-tuned with a relatively small dataset, providing a considerable advantage in scenarios with limited computational resources.\u003c/p\u003e"},{"header":"Conclusion","content":"\u003cp\u003eIn conclusion, we develop a CG-NET that leverages local structural information through graph representations for enhanced material property prediction. We also report the introduction of pseudo nodes to preserve the bonding coordination environment of the nodes near the boundaries of the cluster graph, which helps maintain the prediction accuracy. Through effectively incorporating short-range interactions within the cluster graph, the CG-NET offers an effective representation of the local structure of materials, yielding computational efficiency while maintaining high prediction accuracy. Our results also highlight the superior generalizability and interpretability of CG-NET when applied to diversified datasets, enabling robust and rapid prediction capabilities for complex, disordered, and diverse material systems with reduced computational resources.
The capabilities of CG-NET could be further enhanced by incorporating dynamic graph updates that can reflect changes in atomic configurations over time, paving the way for simulations of material behavior under various conditions and during chemical reactions. The methodologies and insights derived from our proposed CG-NET could be applied to other graph-based neural network architectures, potentially revolutionizing the field of functional material development.\u003c/p\u003e"},{"header":"Declarations","content":"\u003cp\u003e \u003ch2\u003eCompeting Interests\u003c/h2\u003e \u003cp\u003eThe authors declare no competing interests.\u003c/p\u003e \u003c/p\u003e\u003ch2\u003eAuthor Contributions\u003c/h2\u003e \u003cp\u003eM. Y. and H. C. conceived the idea. M. Y., W. L., and K. C. T. designed the study. H. C., T. Y., and M. G. Z. implemented the code. H. C., H. W., Z. W., Z. R. W., H. K. H., C. C., S. P. L., W. L., K.C.T., and M. Y. prepared the dataset and performed the analysis. H.C. drafted the manuscript. All the authors have read and corrected the final manuscript.\u003c/p\u003e\u003ch2\u003eAcknowledgments\u003c/h2\u003e \u003cp\u003eThis work was supported by the Hong Kong Polytechnic University (project number: P0042711, P0049524, P0050570, and P0048122), Hong Kong Research Grants Council (project number: P0046939 and P0045061), and Guangdong Natural Science Foundation (project number: 2024A1515010031).\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\u003cli\u003e\u003cspan\u003eOlson GB (2000) Designing a new material world. Science 288:993\u0026ndash;998\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSeh ZW et al (2017) Combining theory and experiment in electrocatalysis: insights into materials design. Science 355\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eRaccuglia P et al (2016) Machine-learning-assisted materials discovery using failed experiments. 
Nature 533:73\u0026ndash;76\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eButler KT, Davies DW, Cartwright H, Isayev O, Walsh A (2018) Machine learning for molecular and materials science. Nature 559:547\u0026ndash;555\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eRao Z et al (2022) Machine learning\u0026ndash;enabled high-entropy alloy discovery. Science 378:78\u0026ndash;85\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eXie T, Grossman JC (2018) Crystal graph convolutional neural networks for an accurate and interpretable prediction of material properties. Phys Rev Lett 120:145301\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003ePark CW, Wolverton C (2020) Developing an improved crystal graph convolutional neural network framework for accelerated materials discovery. Phys Rev Mater 4:063801\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBatzner S et al (2022) E(3)-equivariant graph neural networks for data-efficient and accurate interatomic potentials. Nat Commun 13:2453\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eChen C, Ong SP (2022) A universal graph deep learning interatomic potential for the periodic table. Nat Comput Sci 2:718\u0026ndash;728\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBatatia I, Kov\u0026aacute;cs DP, Simm GNC, Ortner C, Cs\u0026aacute;nyi G (2023) MACE: Higher order equivariant message passing neural networks for fast and accurate force fields. Preprint at. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.48550/arXiv.2206.07697\u003c/span\u003e\u003cspan address=\"10.48550/arXiv.2206.07697\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMerchant A et al (2023) Scaling deep learning for materials discovery. 
Nature 624:80\u0026ndash;85\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eDeng B et al (2023) CHGNet as a pretrained universal neural network potential for charge-informed atomistic modelling. Nat Mach Intell 5:1031\u0026ndash;1041\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKo TW, Ong SP (2023) Recent advances and outstanding challenges for machine learning interatomic potentials. Nat Comput Sci\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHan J, Rong Y, Xu T, Huang W (2022) Geometrically equivariant graph neural networks: a survey. Preprint at. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.48550/arXiv.2202.07230\u003c/span\u003e\u003cspan address=\"10.48550/arXiv.2202.07230\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eRiebesell J et al (2024) Matbench Discovery -- A framework to evaluate machine learning crystal stability predictions. Preprint at. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.48550/arXiv.2308.14920\u003c/span\u003e\u003cspan address=\"10.48550/arXiv.2308.14920\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGreeley J, Jaramillo TF, Bonde J, Chorkendorff IB, Norskov JK (2006) Computational high-throughput screening of electrocatalytic materials for hydrogen evolution. Nat Mater 5:909\u0026ndash;913\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eCurtarolo S et al (2013) The high-throughput highway to computational materials design. Nat Mater 12:191\u0026ndash;201\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eYao Y et al (2022) High-entropy nanoparticles: Synthesis-structure-property relationships and data-driven discovery. 
Science 376:eabn3103\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHuang B, Von Rudorff GF (2023) Von Lilienfeld, O. A. The central role of density functional theory in the AI age. Science 381:170\u0026ndash;175\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHarrison WA (1989) Electronic structure and the properties of solids: the physics of the chemical bond. Dover, New York\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHammer B, N\u0026oslash;rskov JK (2000) Theoretical surface science and catalysis\u0026mdash;calculations and concepts. Adv Catal 45:71\u0026ndash;129\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eNorskov JK, Bligaard T, Rossmeisl J, Christensen CH (2009) Towards the computational design of solid catalysts. Nat Chem 1:37\u0026ndash;46\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBanhart F, Kotakoski J, Krasheninnikov AV (2011) Structural defects in graphene. ACS Nano 5:26\u0026ndash;41\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eFreysoldt C et al (2014) First-principles calculations for point defects in solids. Rev Mod Phys 86\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eAlexandrov AS (2009) Advances in polaron physics. Springer-, Berlin;\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGrancini G et al (2013) Hot exciton dissociation in polymer solar cells. Nat Mater 12:29\u0026ndash;33\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eN\u0026oslash;rskov JK, Studt F, Abild-Pedersen F, Bligaard T (2014) Fundamental concepts in heterogeneous catalysis. Wiley\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGhanekar PG, Deshpande S, Greeley J (2022) Adsorbate chemical environment-based machine learning framework for heterogeneous catalysis. 
Nat Commun 13:5788\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003ePablo-Garc\u0026iacute;a S et al (2023) Fast evaluation of the adsorption energy of organic molecules on metals via graph neural networks. Nat Comput Sci 3:433\u0026ndash;442\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eAnkudinov AL, Ravel B, Rehr JJ, Conradson SD (1998) Real-space multiple-scattering calculation and interpretation of x-ray-absorption near-edge structure. Phys Rev B 58:7565\u0026ndash;7576\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003e\u003cem\u003eSurface analysis: the principal techniques\u003c/em\u003e. (Wiley, Chichester, U.K, (2009)\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eDavidsson J, Bertoldo F, Thygesen KS, Armiento R (2023) Absorption versus adsorption: high-throughput computation of impurities in 2D materials. npj 2D Mater Appl 7:26\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eChanussot L et al (2021) Open catalyst 2020 (OC20) dataset and community challenges. ACS Catal 11:6059\u0026ndash;6072\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSriram A et al (2024) The open DAC 2023 dataset and challenges for sorbent discovery in direct air capture. ACS Cent Sci 10:923\u0026ndash;941\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWang X et al (2024) Electrochemical CO\u003csub\u003e2\u003c/sub\u003e activation and valorization on metallic copper and carbon-embedded N‐coordinated single metal MNC catalysts. Angew Chem 136:e202401821\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSomorjai GA, Li Y (2010) Introduction to surface chemistry and catalysis. Wiley\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eYU P, Cardona M (2010) Fundamentals of semiconductors: physics and materials properties. 
Springer Science \u0026amp; Business Media\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003ePorter DA, Easterling KE, Sherif MY (2021) Phase transformations in metals and alloys. CRC, Boca Raton\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSchlapka A, Lischka M, Gro\u0026szlig; A, K\u0026auml;sberger U, Jakob P (2003) Surface strain versus substrate interaction in heteroepitaxial metal layers: Pt on Ru(0001). Phys Rev Lett 91:016101\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHoster HE, Alves OB, Koper M (2010) T. M. Tuning adsorption via strain and vertical ligand effects. ChemPhysChem 11:1518\u0026ndash;1524\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eAsano M, Kawamura R, Sasakawa R, Todoroki N, Wadayama T (2016) Oxygen reduction reaction activity for strain-controlled Pt-based model alloy catalysts: surface strains and direct electronic effects induced by alloying elements. ACS Catal 6:5285\u0026ndash;5289\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eClausen CM, Batchelor TAA, Pedersen JK, Rossmeisl J (2021) What atomic positions determines reactivity of a surface? Long-range, directional ligand effects in metallic alloys. Adv Sci 8:2003357\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSchmidt J, Marques MRG, Botti S, Marques MA (2019) L. Recent advances and applications of machine learning in solid-state materials science. npj Comput Mater 5:1\u0026ndash;36\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGrabow LC, Hvolb\u0026aelig;k B, N\u0026oslash;rskov JK (2010) Understanding trends in catalytic activity: the effect of adsorbate\u0026ndash;adsorbate interactions for CO oxidation over transition metals. 
Top Catal 53:298\u0026ndash;310\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eClark EL, Hahn C, Jaramillo TF, Bell AT (2017) Electrochemical CO2 reduction over compressively strained CuAg surface alloys with enhanced multi-carbon oxygenate selectivity. J Am Chem Soc 139:15848\u0026ndash;15857\u003c/span\u003e\u003c/li\u003e\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":true,"hideJournal":true,"highlight":"","institution":"","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"
[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true},"keywords":"Cluster graph convolutional neural network, Short-range interaction, Interpretable machine learning, Disordered materials, High-entropy alloys","lastPublishedDoi":"10.21203/rs.3.rs-4429598/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-4429598/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003eMachine learning (ML) plays a pivotal role in the development of functional materials, in which graph neural networks (GNNs) have shown improved performance by utilizing the graph representation of atoms and bonds to effectively characterize materials. However, it remains challenging to achieve efficient, robust and interpretable predictions due to the limited integration of domain knowledge. In this study, we propose leveraging the local structure and short-range atomic interactions of materials using a cluster graph representation to improve the performance. This physics-informed cluster graph neural network (CG-NET) significantly enhances computational efficiency through a cluster sampling strategy. Importantly, by incorporating pseudo nodes as neighbors to the nodes at the cluster boundaries, we maintain the bonding coordination environment, enhancing the prediction accuracy. We further demonstrate CG-NET\u0026rsquo;s remarkable prediction accuracy and efficiency across diverse material systems and properties and reveal its superior interpretability and generalizability with extensive experiments. 
Our work highlights the importance of integrating domain-specific scientific knowledge into the design of a generalizable and interpretable ML framework. The cluster graph representation in the CG-NET could be extended to other graph-based neural networks to accelerate the development of functional materials while significantly reducing computational cost.\u003c/p\u003e","manuscriptTitle":"A physics-informed cluster graph neural network enables generalizable and interpretable prediction for material discovery","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2025-02-06 05:13:30","doi":"10.21203/rs.3.rs-4429598/v1","editorialEvents":[{"type":"communityComments","content":0}],"status":"published","journal":{"display":true,"email":"
[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"99ec0af2-9918-4351-b296-4f916816e8ac","owner":[],"postedDate":"February 6th, 2025","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"posted","subjectAreas":[{"id":43774632,"name":"Physical sciences/Materials science/Theory and computation/Computational methods"},{"id":43774633,"name":"Physical sciences/Mathematics and computing/Computational science"}],"tags":[],"updatedAt":"2025-06-24T12:40:48+00:00","versionOfRecord":[],"versionCreatedAt":"2025-02-06 05:13:30","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-4429598","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-4429598","identity":"rs-4429598","version":["v1"]},"buildId":"8U1c8b4HqxoKbykW_rLl7","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}